Anthropic leadership conversation
post by Zach Stein-Perlman · 2024-12-20T22:00:45.229Z · LW · GW · 17 comments
This is a link post for https://www.youtube.com/watch?v=om2lIWXLLN4
Contents
- Tom Brown at 20:00
- Daniela Amodei at 20:26
- Sam McCandlish at 21:30
- Dario Amodei at 22:00
- Daniela Amodei at 23:25
- Chris Olah at 24:20
- Daniela Amodei at 29:04
- Jared Kaplan at 25:11
- Dario Amodei at 41:38
- Daniela Amodei at 42:08
- Dario Amodei at 48:07
- 17 comments
This is interesting and I'm glad Anthropic did it. I quoted interesting-to-me parts and added some unjustified commentary.
Tom Brown at 20:00
the US treats the Constitution as like the holy document—which I think is just a big thing that strengthens the US, like we don't expect the US to go off the rails in part because just like every single person in the US is like The Constitution is a big deal, and if you tread on that, like, I'm mad. I think that the RSP, like, it holds that thing. It's like the holy document for Anthropic. So it's worth doing a lot of iterations getting it right.
Daniela Amodei at 20:26
Some of what I think has been so cool to watch about the RSP development at Anthropic is it feels like it has gone through so many different phases and there's so many different skills that are needed to make it work, right? There's the big ideas; Dario and Paul and Sam and Jared and so many others are like, what are the principles? Like what are we trying to say? How do we know if we're right? But there's also this very operational approach to just iterating where we're like, well, we thought that we were gonna see this at this safety level, and we didn't, so should we change it so that we're making sure that we're holding ourselves accountable? And then there's all kinds of organizational things, right? We just were like, let's change the structure of the RSP organization for clearer accountability. And I think my sense is that for a document that's as important as this, right, I love the Constitution analogy, it's like there's all of these bodies and systems that exist in the US to make sure that we follow the Constitution, right? There's the courts, there's the Supreme Court, there's the presidency, there's, you know, both houses of Congress and they do all kinds of other things, of course, but there's like all of this infrastructure that you need around this one document, and I feel like we're also learning that lesson here.
Does the new RSP promote "clearer accountability"? I guess a little; per the new RSP:
Internal and external accountability: We have made a number of changes to our previous “procedural commitments.” These include expanding the duties of the Responsible Scaling Officer; adding internal critique and external expert input on capability and safeguard assessments; new procedures related to internal governance; and maintaining a public page for overviews of past Capability and Safeguard Reports, RSP-related updates, and future plans.
But mostly the new RSP is just "more flexible and nuanced," I think.
Also, minor:
well, we thought that we were gonna see this at this safety level, and we didn't, so should we change it so that we're making sure that we're holding ourselves accountable?
I don't really understand (like, I can't imagine an example that would be well-described by this), but I'm slightly annoyed because it suggests a vibe of "we have made the RSP stronger at least once."
Sam McCandlish at 21:30
[The RSP] sort of reflects a view a lot of us have about safety, which is that it's a solvable problem. It's just a very, very hard problem that's gonna take tons and tons of work. All of these institutions that we need to build up, like there's all sorts of institutions built up around like automotive safety, built up over many, many years. But we're like, Do we have the time to do that? We've gotta go as fast as we can to figure out what the institutions we need for AI safety are, and build those and try to build them here first, but make it exportable.
Dario Amodei at 22:00
[The RSP] forces unity because if any part of the org is not kind of in line with our safety values, it shows up through the RSP, right? The RSP is gonna block them from doing what they want to do, and so it's a way to remind everyone over and over again, basically, to make safety a product requirement, part of the product planning process. And so it's not just a bunch of bromides that we repeat; it's something that you actually, if you show up here and you're not aligned, you actually run into it. And you either have to learn to get with the program or it doesn't work out.
I would agree if the RSP were stronger.
Daniela Amodei at 23:25
In addition to driving alignment, [the RSP] also drives clarity. Because it's written down what we're trying to do and it's legible to everyone in the company, and it's legible externally what we think we're supposed to be aiming towards from a safety perspective. It's not perfect. We're iterating on it, we're making it better, but I think there's some value in saying like, this is what we're worried about, this thing over here. Like you can't just use this word to sort of derail something in either direction, right? To say, oh, because of safety, we can't do X, or because of safety, we have to do X. We're really trying to make it clearer what we mean.
Dario adds:
Yeah, it prevents you from worrying about every last little thing under the sun. Because it's actually the fire drills that damage the cause of safety in the long run. I've said, like, If there's a building and the fire alarm goes off every week, that's a really unsafe building. Because when there's actually a fire, you're just gonna be like oh, it just goes off all the time. So it's very important to be calibrated.
Chris Olah at 24:20
I think that RSP creates healthy incentives at a lot of levels. So I think internally it aligns the incentives of every team with safety because it means that if we don't make progress on safety, we're gonna block. I also think that externally it creates a lot of healthier incentives than other possibilities, at least that I see, because it means that if we at some point have to take some kind of dramatic action, like if at some point we have to say we've reached some point and we can't yet make a model safe, it aligns that with sort of the point where there's evidence that supports that decision and there's sort of a preexisting framework for thinking about it, and it's legible. And so I think there's a lot of levels at which the RSP, I think in ways that maybe I didn't initially understand when we were talking about the early versions of it, it creates a better framework than any of the other ones that I've thought about.
I would agree if the RSP were stronger.
Daniela Amodei at 29:04
the way that I think about the RSP the most is what it sounds like, just like the tone. I think we just did a big rewrite of the tone of the RSP because it felt overly technocratic and even a little bit adversarial. I spend a lot of time thinking about how do you build a system that people want to be a part of. It's so much better if [] everyone in the company can walk around and tell you [] what are the top goals of the RSP, how do we know if we're meeting them, what AI safety level are we at right now—are we at ASL-2, are we at ASL-3—that people know what to look for because that is how you're going to have good common knowledge of if something's going wrong. If it's overly technocratic and it's something that only particular people in the company feel is accessible to them, [that's bad], and it's been really cool to watch it transition into this document where I think most if not everyone at the company, regardless of their role, could read it and say this feels really reasonable; I want to make sure that we're building AI in the following ways and I see why I would be worried about these things and I also kind of know what to look for if I bump into something. It's almost like, make it simple enough that if you're working at a manufacturing plant and you're like huh it looks like the safety seatbelt on this should connect this way but it doesn't connect — that you can spot it and that there's healthy feedback flow between leadership and the board and the rest of the company and the people that are actually building it because I actually think the way this stuff goes wrong in most cases is just like the wires get crossed. And that would be a really sad way for things to go wrong. It's all about operationalizing it, making it easy for people to understand.
I feel bad about this [edit: for reasons I fail to immediately articulate, sorry].
Jared Kaplan at 25:11
Pushes back a little: all of the above were reasons to be excited about the RSP ex ante but it's been surprisingly hard and complicated to determine evals and thresholds; in AI there's a big range where you don't know whether a model is safe. (Kudos.)
Dario Amodei at 41:38
"Race to the top" works in practice:
a few months after we came out with our RSP, the 3 most prominent AI companies had one; interpretability research, that's another area we've done it; just the focus on safety overall, like collaboration with the AI Safety Institutes, other areas.
Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.
My take: Anthropic gets a little credit for RSP adoption; focus on interp isn't clearly good; Anthropic doesn't get credit for collaboration with AISIs (did it do better than GDM and OpenAI?); on red-teaming I'm not familiar with Anthropic's timeline and am interested in takes, but, like, I think GDM wrote Model evaluation for extreme risks before any Anthropic-racing-to-the-top on evals.
Daniela Amodei at 42:08
Says customers say they prefer Claude because it's safer (in terms of hallucinations and jailbreaks).
Is it true that Claude is safer? Would be news to me.
Dario Amodei at 48:07
He's excited about places where there's (apparent) consensus, what everyone wise thinks, and then it breaks. He thinks that's about to happen in interp, among other places.
I think interpretability is both the key to steering and making safe AI systems . . . , and interpretability contains insights about intelligent optimization problems and about how the human brain works.
I'd bet against this, and I wish Anthropic's alignment bets were less interp-y. (But largely because of vibes about what everyone wise thinks.)
(I claim Anthropic's RSP is not very ambitious and is quite insufficient to prevent catastrophe from Anthropic models, especially because ASL-4 hasn't been defined yet but also I worry that the ASL-3 standard will not be sufficient for upper-ASL-3 models. [I'm also skeptical that the RSP is as relevant to most of the staff as this conversation suggests.])
17 comments
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-12-21T01:17:00.847Z · LW(p) · GW(p)
Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.
Wait what? I didn't hear about this. What other companies have frontier red teams? Where can I learn about them?
↑ comment by Zach Stein-Perlman · 2024-12-21T01:59:44.055Z · LW(p) · GW(p)
I think he’s just referring to DC evals, and I think this is wrong because I think other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).
Edit: maybe I don't know what he's referring to.
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-12-21T13:39:12.774Z · LW(p) · GW(p)
DC evals got started in summer of '22, across all three leading companies AFAICT. And I was on the team that came up with the idea and started making it happen (both internally and externally), or at least, as far as I can tell we came up with the idea -- I remember discussions between Beth Barnes and Jade Leung (who were both on the team in spring '22), and I remember thinking it was mostly their idea (maybe also Cullen's?). It's possible that they got it from Anthropic but it didn't seem that way to me. Update: OK, so apparently @evhub [LW · GW] had joined Anthropic just a few months earlier [EDIT: this is false; evhub joined much later, I misread the dates, thanks Lawrence] -- it's possible the Frontier Red Team was created when he joined then, and information spread to the team I was on (but not to me) about it. I'm curious to find out what happened here, anyone wanna weigh in?
At any rate I don't think there exists any clone or near-clone of the Frontier Red Team at OpenAI or any other company outside Anthropic.
↑ comment by LawrenceC (LawChan) · 2024-12-21T20:40:39.853Z · LW(p) · GW(p)
Evan joined Anthropic in late 2022, no? (E.g. his post announcing it was Jan 2023: https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic [AF · GW])
I think you're correct on the timeline. I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), and Beth started METR in June 2022, and I don't remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic's scaling laws project was already underway before then (and this is what they're referring to, but proliferating QA datasets feels qualitatively different from DC Evals). Also, they were definitely doing other red teaming, just none that seems to be DC Evals.
comment by Lukas Finnveden (Lanrian) · 2024-12-22T07:11:55.583Z · LW(p) · GW(p)
We did the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."
Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)
↑ comment by Zach Stein-Perlman · 2024-12-22T15:30:15.826Z · LW(p) · GW(p)
All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: an Anthropic human.)
This fact wasn't on the internet, or rather, at least wasn't easily findable via Google search. Huh. I only find Holden mentioning that 80% of Daniela's equity is pledged [EA(p) · GW(p)].
comment by MondSemmel · 2024-12-21T00:36:26.935Z · LW(p) · GW(p)
Thanks for posting this. Editing feedback: I think the post would look quite a bit better if you used headings and LW quotes. This would generate a timestamped and linkable table of contents, and also more clearly distinguish the quotes from your commentary. Example:
Tom Brown at 20:00:
the US treats the Constitution as like the holy document—which I think is just a big thing that strengthens the US, like we don't expect the US to go off the rails in part because just like every single person in the US is like The Constitution is a big deal, and if you tread on that, like, I'm mad. I think that the RSP, like, it holds that thing. It's like the holy document for Anthropic. So it's worth doing a lot of iterations getting it right.
<your commentary>
↑ comment by Zach Stein-Perlman · 2024-12-21T01:02:52.640Z · LW(p) · GW(p)
Done, thanks.
↑ comment by MondSemmel · 2024-12-21T01:35:39.720Z · LW(p) · GW(p)
Tiny editing issue: "[] everyone in the company can walk around and tell you []" -> The parentheses are empty. Maybe these should be for italicized formatting?
↑ comment by Zach Stein-Perlman · 2024-12-21T01:58:01.739Z · LW(p) · GW(p)
I use empty brackets similar to ellipses in this context; they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)
↑ comment by ryan_greenblatt · 2025-01-02T22:40:24.728Z · LW(p) · GW(p)
It's more standard to use "[...]" IMO.
comment by [deleted] · 2024-12-21T20:31:52.343Z · LW(p) · GW(p)
It's so much better if everyone in the company can walk around and tell you what are the top goals of the RSP, how do we know if we're meeting them, what AI safety level are we at right now—are we at ASL-2, are we at ASL-3—that people know what to look for because that is how you're going to have good common knowledge of if something's going wrong.
I like this goal a lot: Good RSPs could contribute to building common language/awareness around several topics (e.g., "if" conditions, "then" commitments, how safety decisions will be handled). As many have pointed out, though, I worry that current RSPs haven't been concrete [LW(p) · GW(p)] or clear enough to build this kind of understanding/awareness.
One interesting idea would be to survey company employees and evaluate their understanding of RSPs & the extent to which RSPs are having an impact on internal safety culture. Example questions/topics:
- What is the high-level purpose of the RSP?
- Does the RSP specify "if" triggers (specific thresholds that, if hit, could cause the company to stop scaling or deployment activities)? If so, what are they?
- Does the RSP specify "then" commitments (specific actions that must be taken in order to cause the company to continue scaling or deployment activities)? If so, what are they?
- Does the RSP specify how decisions about risk management will be made? If so, how will they be made & who are the key players involved?
- Are there any ways in which the RSP has affected your work at Anthropic? If so, how?
One of my concerns about RSPs is that they (at least in their current form) don't actually achieve the goal of building common knowledge/awareness or improving company culture. I suspect surveys like this could prove me wrong, and more importantly, provide scaling companies with useful information about the extent to which their scaling policies are understood by employees, help foster common understanding, etc.
(Another version of this could involve giving multiple RSPs to a third party, like an AI Safety Institute, and having them answer similar questions. This could provide another useful datapoint RE the extent to which RSPs are clearly/concretely laying out a set of specific or meaningful commitments.)
comment by Bird Concept (jacobjacob) · 2024-12-20T22:34:23.774Z · LW(p) · GW(p)
(Can you edit out all the "like"s, or give permission for an admin to edit it out? I think in written text it makes speakers sound, for lack of a better word, unflatteringly moronic)
↑ comment by Zach Stein-Perlman · 2024-12-20T22:36:05.299Z · LW(p) · GW(p)
I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
↑ comment by MondSemmel · 2024-12-21T00:31:02.832Z · LW(p) · GW(p)
I did something similar when I made this transcript [LW · GW]: leaving in verbal hedging, particularly in the context of contentious statements etc., where omitting such verbal tics can give a quite misleading impression.
↑ comment by Bird Concept (jacobjacob) · 2024-12-20T22:47:15.214Z · LW(p) · GW(p)
Okay, well, I'm not going to post "Anthropic leadership conversation [fewer likes]" 😂