Anthropic: Reflections on our Responsible Scaling Policy

post by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-20T04:14:44.435Z · LW · GW · 21 comments

This is a link post for https://www.anthropic.com/news/reflections-on-our-responsible-scaling-policy

Contents

  Threat Modeling and Evaluations
  The ASL-3 Standard
  Assurance Structures
21 comments

Last September we published our first Responsible Scaling Policy (RSP) [LW discussion [LW · GW]], which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as possible standards. As we operationalize the policy, we expect to learn a great deal and plan to share our findings. This post shares reflections from implementing the policy so far. We are also working on an updated RSP and will share this soon.

We have found having a clearly-articulated policy on catastrophic risks extremely valuable. It has provided a structured framework to clarify our organizational priorities and frame discussions around project timelines, headcount, threat models, and tradeoffs. The process of implementing the policy has also surfaced a range of important questions, projects, and dependencies that might otherwise have taken longer to identify or gone undiscussed.

Balancing the desire for strong commitments with the reality that we are still seeking the right answers is challenging. In some cases, the original policy is ambiguous and needs clarification. Where there are open research questions or uncertainties, overly specific requirements are unlikely to stand the test of time. That said, as industry actors face increasing commercial pressures, we hope to move from voluntary commitments to established best practices and then to well-crafted regulations.

As we continue to iterate on and improve the original policy, we are actively exploring ways to incorporate practices from existing risk management and operational safety domains. While none of these domains alone will be perfectly analogous, we expect to find valuable insights from nuclear security, biosecurity, systems safety, autonomous vehicles, aerospace, and cybersecurity. We are building an interdisciplinary team to help us integrate the most relevant and valuable practices from each.

Our current framework is summarized below as a set of five high-level commitments.

  1. Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).

  2. Testing for Red Line Capabilities (Frontier Risk Evaluations). We commit to demonstrating that the Red Line Capabilities are not present in models, or - if we cannot do so - taking action as if they are (more below). This involves collaborating with domain experts to design a range of "Frontier Risk Evaluations" – empirical tests which, if failed, would give strong evidence against a model being at or near a Red Line Capability. We also commit to maintaining a clear evaluation process and publishing a summary of our current evaluations.

  3. Responding to Red Line Capabilities. We commit to developing and implementing a new standard for safety and security sufficient to handle models that have the Red Line Capabilities. This set of measures is referred to as the ASL-3 Standard. We commit not only to defining the risk mitigations comprising this standard, but also to detailing and following an assurance process to validate the standard’s effectiveness. Finally, we commit to pausing training or deployment if necessary to ensure that models with Red Line Capabilities are only trained, stored, and deployed when we are able to apply the ASL-3 Standard.

  4. Iteratively extending this policy. Before we proceed with activities which require the ASL-3 Standard, we commit to publishing a clear description of its upper bound of suitability: a new set of Red Line Capabilities for which we must build Frontier Risk Evaluations, and which would require a higher standard of safety and security (ASL-4) before proceeding with training and deployment. This includes maintaining a clear evaluation process and publishing a summary of our evaluations.

  5. Assurance Mechanisms. We commit to ensuring this policy is executed as intended, by implementing Assurance Mechanisms. These should ensure that our evaluation process is stress-tested; our safety and security mitigations are validated publicly or by disinterested experts; our Board of Directors and Long-Term Benefit Trust have sufficient oversight over the policy implementation to identify any areas of non-compliance; and that the policy itself is updated via an appropriate process.

Threat Modeling and Evaluations

Our Frontier Red Team and Alignment Science teams have concentrated on threat modeling and engagement with domain experts. They are primarily focused on (a) improving threat models to determine which capabilities would warrant the ASL-3 standard of security and safety, (b) working with teams developing ASL-3 controls to ensure that those controls are tailored to the correct risks, and (c) mapping capabilities which the ASL-3 standard would be insufficient to handle, and which we would continue to test for even once it is implemented. Some key reflections are detailed in the linked post.

Our Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing teams are focused on building evaluations and improving our overall methodology. Currently, we conduct pre-deployment testing in the domains of cybersecurity, CBRN, and Model Autonomy for frontier models which have reached 4x the compute of our most recently tested model (you can read a more detailed description of our most recent set of evaluations on Claude 3 Opus here). We also test models mid-training if they reach this threshold, and re-test our most capable model every 3 months to account for finetuning improvements. Teams are also building evaluations in a number of new domains to monitor for capabilities for which the ASL-3 standard will still be unsuitable, and identifying ways to make the overall testing process more robust. Some key reflections are detailed in the linked post.
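To make the testing cadence above concrete, the sketch below encodes the two triggers as described in this post. It is purely illustrative: the constant names, the 90-day reading of "every 3 months", and the function signatures are assumptions for exposition rather than Anthropic's actual tooling, and details such as how the two triggers interact mid-training are not specified here.

```python
from datetime import datetime, timedelta

# Hypothetical constants encoding the thresholds described above.
EFFECTIVE_COMPUTE_MULTIPLE = 4                    # test models that reach 4x the compute of the last-tested model
FINETUNING_RETEST_INTERVAL = timedelta(days=90)   # re-test the most capable model roughly every 3 months


def compute_trigger(current_compute: float, last_tested_compute: float) -> bool:
    """Pre-deployment / mid-training trigger: the model has reached 4x the
    compute of the most recently tested model."""
    return current_compute >= EFFECTIVE_COMPUTE_MULTIPLE * last_tested_compute


def elicitation_trigger(last_test_date: datetime, now: datetime) -> bool:
    """Periodic trigger: roughly three months have passed since the most
    capable model was last tested, so finetuning gains may have accumulated."""
    return now - last_test_date >= FINETUNING_RETEST_INTERVAL


# Usage sketch with placeholder numbers: run the frontier risk evaluations
# if either trigger fires.
if compute_trigger(8.0e25, 1.9e25) or elicitation_trigger(datetime(2024, 2, 1), datetime(2024, 5, 20)):
    print("Run frontier risk evaluations")
```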

The ASL-3 Standard

Our Security, Alignment Science, and Trust and Safety teams have been focused on developing the ASL-3 standard. Their goal is to design and implement a set of controls that will sufficiently mitigate the risk of the model weights being stolen by non-state actors or of models being misused via our product surfaces. This standard would be sufficient for many models with capabilities where even a low rate of misuse could be catastrophic. However, it would not be sufficient for capabilities that could enable state actors, or groups with substantial state backing and resources, to cause catastrophic harm. Some key reflections are detailed in the linked post.

Assurance Structures

Lastly, our Responsible Scaling, Alignment Stress Testing, and Compliance teams have been focused on exploring possible governance, coordination, and assurance structures. We intend to introduce more independent checks over time and are looking to hire a Risk Manager to develop these structures, drawing on best practices from other industries and relevant research. Some key reflections are detailed in the linked post.

Ensuring future generations of frontier models are trained and deployed responsibly will require serious investment from both Anthropic and others across industry and governments. Our Responsible Scaling Policy has been a powerful rallying point, with many teams' objectives over the past months connecting directly back to the major workstreams above. The progress we have made on operationalizing safety during this period has necessitated significant engagement from teams across Anthropic, and there is much more work to be done. Our goal in sharing these reflections ahead of the upcoming AI Seoul Summit is to continue the discussion on creating thoughtful, empirically-grounded frameworks for managing risks from frontier models. We are eager to see more companies adopt their own frameworks and share their own experiences, leading to the development of shared best practices and informing future efforts by governments.


Zac's note: if you're interested in further technical detail, we just published an RSP Evals Report for Claude 3 Opus (pdf), adapted from a report shared with the Anthropic Board of Directors and Long-Term Benefit Trust before release. And whether or not you're interested in joining us, the RSP team job description says more about our expectations going forward.

21 comments

Comments sorted by top scores.

comment by Zach Stein-Perlman · 2024-05-20T05:32:01.972Z · LW(p) · GW(p)

No major news here, but some minor good news, and independent of news/commitments/achievements I'm always glad when labs share thoughts like this. Misc reactions below.


Probably the biggest news is the Claude 3 evals report. I haven't read it yet. But at a glance I'm confused: it sounds like "red line" means ASL-3 but they also operationalize "yellow line" evals and those sound like the previously-discussed ASL-3 evals. Maybe red is actual ASL-3 and yellow is supposed to be at least 6x effective compute lower, as a safety buffer.

"Assurance Mechanisms . . . . should ensure that . . . our safety and security mitigations are validated publicly or by disinterested experts." This sounds great. I'm not sure what it looks like in practice. I wish it was clearer what assurance mechanisms Anthropic expects or commits to implement and when, and especially whether they're currently doing anything along the lines of "validated publicly or by disinterested experts." (Also whether "validated" means "determined to be sufficient if implemented well" or "determined to be implemented well.")

Something that was ambiguous in the RSP and is still ambiguous here: during training, if Anthropic reaches "3 months since last eval" before "4x since last eval," do they do evals? Or does the "3 months" condition only apply after training?

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.) [Edit: and maybe facilitating anonymous back-and-forth conversations is much better than just anonymous one-way reporting, and this should be pretty easy to facilitate.]

Some other hopes for the RSP, off the top of my head:

  • ASL-4 definition + operationalization + mitigations, including generally how Anthropic will think about safety cases after the "no dangerous capabilities" safety case doesn't work anymore
  • Clarifying security commitments (when the RAND report on securing model weights comes out)
  • Dangerous capability evals by external auditors, e.g. METR
Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-20T06:05:23.622Z · LW(p) · GW(p)

"red line" vs "yellow line"

Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run (for example, leaving out the "register a typo'd domain" step from an ARA eval, because there are only so many good typos for our domain), while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line.
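As an illustrative sketch of that decision structure (the names and inputs here are hypothetical, not taken from the RSP or our internal tooling):

```python
from enum import Enum, auto


class Action(Enum):
    CONTINUE = auto()
    PAUSE_TRAINING_AND_DEPLOYMENT = auto()


def respond_to_yellow_line_result(crossed_yellow_line: bool,
                                  higher_asl_measures_in_place: bool,
                                  new_evals_show_below_red_line: bool) -> Action:
    """Sketch of the yellow-line logic described above.

    Staying below the yellow line rules out being at the red line. Crossing
    it does not prove the red line has been crossed, but it removes that
    assurance, so training and deployment pause until either the higher
    standard of safety and security measures is in place, or redesigned
    evals demonstrate the model is still below the red line.
    """
    if not crossed_yellow_line:
        return Action.CONTINUE
    if higher_asl_measures_in_place or new_evals_show_below_red_line:
        return Action.CONTINUE
    return Action.PAUSE_TRAINING_AND_DEPLOYMENT


# Example: a model crosses a yellow line before the higher standard is ready.
print(respond_to_yellow_line_result(True, False, False))  # Action.PAUSE_TRAINING_AND_DEPLOYMENT
```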

assurance mechanisms

Our White House commitments mean that we're already reporting safety evals to the US Government, for example. I think the natural reading of "validated" is some combination of those, though obviously it's very hard to validate that whatever you're doing is 'sufficient' security against serious cyberattacks or safety interventions on future AI systems. We do our best.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-05-20T06:18:26.876Z · LW(p) · GW(p)

Thanks.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.

Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than some other policies — Anthropic says it has an infohazard review policy [LW(p) · GW(p)], and I expect it's good, but I'm not confident, and for other companies I wouldn't necessarily expect that their policy is good (even if they say a formal policy exists), and seeing details (with sensitive bits redacted) would help.

I mostly take back my "secret policy is strong evidence of bad policy" insinuation — that's ~true on my home planet, but on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct to not share :(

As an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time.

Replies from: akash-wasil
comment by Akash (akash-wasil) · 2024-05-21T17:53:09.331Z · LW(p) · GW(p)

on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct to not share :(

What's the substantial negative EV that would come from misunderstanding or adversarial interpretations? I feel like in this case, the worst case would be something like "the non-compliance reporting policy is actually pretty good but a few people say mean things about it and say 'see, here's why we need government oversight.'" But this feels pretty minor/trivial IMO.

As an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time

This is clever, +1. 

comment by Akash (akash-wasil) · 2024-05-21T12:49:47.456Z · LW(p) · GW(p)

We also recently implemented a non-compliance reporting policy that allows employees to anonymously report concerns to our Responsible Scaling Officer about our implementation of our RSP.

This seems great, and I think it would be valuable for other labs to adopt a similar system. 

What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo? 

If Anthropic is supportive of this kind of mechanism, it might be good to explicitly include this (e.g., "We also recently implemented a non-compliance reporting policy that allows employees to anonymously report concerns to specific officials at the [Department of Homeland Security, Department of Commerce].") 

Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-21T16:35:32.814Z · LW(p) · GW(p)

What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo?

That really seems more like a question for governments than for Anthropic! For example, the SEC or IRS whistleblower programs operate regardless of what companies purport to "allow", and I think it'd be cool if the AISI had something similar.

If I was currently concerned about RSP implementation per se (I'm not), it's not clear why the government would get involved in a matter of voluntary commitments by a private organization. If there was some concern touching on the White House commitments, Bletchley declaration, Seoul declaration, etc., then I'd look up the appropriate monitoring body; if in doubt, the Commerce whistleblower office or AISI seem like reasonable starting points.

Replies from: akash-wasil
comment by Akash (akash-wasil) · 2024-05-21T16:46:46.866Z · LW(p) · GW(p)

That really seems more like a question for governments than for Anthropic

+1. I do want governments to take this question seriously. It seems plausible to me that Anthropic (and other labs) could play an important role in helping governments improve their ability to detect/process information about AI risks, though.

it's not clear why the government would get involved in a matter of voluntary commitments by a private organization

Makes sense. I'm less interested in a reporting system that's like "tell the government that someone is breaking an RSP" and more interested in a reporting system that's like "tell the government if you are worried about an AI-related national security risk, regardless of whether or not this risk is based on a company breaking its voluntary commitments."

My guess is that existing whistleblowing programs are the best bet right now, but it's unclear to me whether they are staffed by people who understand AI risks well enough to know how to interpret/process/escalate such information (assuming the information ought to be escalated).

comment by elifland · 2024-05-21T05:26:03.364Z · LW(p) · GW(p)

From the RSP Evals report:

As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Scaling Policy.

What's the minimum X% that could replace 30% and would be treated the same as passing the yellow line immediately, if any? If you think that there's an X% chance that with 3 more months of elicitation, a yellow line will be crossed, what's the decision-making process for determining whether you should treat it as already being crossed?

In the RSP it says "It is important that we are evaluating models with close to our best capabilities elicitation techniques, to avoid underestimating the capabilities it would be possible for a malicious actor to elicit if the model were stolen" so it seems like folding in some forecasted elicited capabilities into the current evaluation would be reasonable (though they should definitely be discounted the further out they are).

(I'm not particularly concerned about catastrophic risk from the Claude 3 model family, but I am interested in the general policy here and the reasoning behind it)

Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-21T16:56:48.466Z · LW(p) · GW(p)

The yellow-line evals are already a buffer ('sufficient to rule out red-lines') before the red lines, which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it's reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.

I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round, and enough calibration training to avoid very low probabilities.

Finally, the point of these estimates is that they can guide research and development prioritization - high estimates suggest that it's worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.

Replies from: elifland
comment by elifland · 2024-05-21T18:03:07.918Z · LW(p) · GW(p)

Thanks for the response!

I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.

[...]

I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round

Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.

Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals

I'm having trouble parsing this sentence. What do you mean by "doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals"? Doesn't pausing include focusing on mitigations and evals?

Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-21T21:59:04.670Z · LW(p) · GW(p)

Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.

That's how I was thinking about the predictions that I was making; others might have been thinking about the current evals where those were more stable.

I'm having trouble parsing this sentence. What do you mean by "doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals"? Doesn't pausing include focusing on mitigations and evals?

Of course, but pausing also means we'd have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we'd continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.

comment by Chris_Leong · 2024-05-20T05:36:18.364Z · LW(p) · GW(p)

Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing

 

What's the difference between a frontier red team and alignment stress-testing? Is the red team focused on the current models you're releasing and the alignment stress testing focused on the future?

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-05-21T05:37:34.750Z · LW(p) · GW(p)

I think Frontier Red Team is about eliciting model capabilities and Alignment Stress Testing is about "red-team[ing] Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail."

comment by Chris_Leong · 2024-05-20T05:28:37.809Z · LW(p) · GW(p)

I know that Anthropic doesn't really open-source advanced AI, but it might be useful to discuss this in Anthropic's RSP anyway, because one way I see things going badly is people copying Anthropic's RSP and directly applying it to open-source projects without accounting for the additional risks this entails.

Replies from: zac-hatfield-dodds, Zach Stein-Perlman
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-20T05:41:06.319Z · LW(p) · GW(p)

I believe that meeting our ASL-2 deployment commitments - e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models - with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights... I think that would be pretty cool.

(also note that e.g. Llama is not open source [LW(p) · GW(p)] - I think you're talking about releasing weights; the license doesn't affect safety, but as an open-source maintainer the distinction matters to me)

Replies from: Vladimir_Nesov, Chris_Leong
comment by Vladimir_Nesov · 2024-05-21T20:25:53.532Z · LW(p) · GW(p)

Ideological adherence to open source seems to act like a religion; arguing against universal applicability of its central tenets won't succeed with only reasonable effort. Unless you state something very explicitly, it will be ignored, and probably even then.

Enforcement of mitigations when it's someone else who removes them won't be seen as relevant, since in this religion a contributor is fundamentally not responsible for how the things they release will be used by others. Arguments to the contrary, even in particular very unusual cases, slide right off.

Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-21T22:17:16.721Z · LW(p) · GW(p)

Enforcement of mitigations when it's someone else who removes them won't be seen as relevant, since in this religion a contributor is fundamentally not responsible for how the things they release will be used by others.

This may be true of people who talk a lot about open source, but among actual maintainers the attitude is pretty different. If some user causes harm with an overall positive tool, that's on the user; but if the contributor has built something consistently or overall harmful that is indeed on them. Maintainers tend to avoid working on projects which are mostly useful for surveillance, weapons, etc. for pretty much this reason.

Source: my personal experience as a maintainer and PSF Fellow, and the multiple Python core developers I just checked with at the PyCon sprints.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-05-21T22:49:56.603Z · LW(p) · GW(p)

if the contributor has built something consistently or overall harmful that is indeed on them

I agree, this is in accord with the dogma. But for AI, overall harm is debatable and currently purely hypothetical, so this doesn't really apply. There is a popular idea that existential risk from AI has little basis in reality since it's not already here to be observed. Thus contributing to public AI efforts remains fine (which on first order effects is perfectly fine right now).

My worry is that this attitude reframes commitments from RSP-like documents, so that people don't see the obvious implication of how releasing weights breaks the commitments (absent currently impossible feats of unlearning), and don't see themselves as making a commitment to avoid releasing high-ASL weights even as they commit to such RSPs. If this point isn't written down, some people will only become capable of noticing it if actual catastrophes shift the attitude to open weights foundation models being harmful overall (even after we already get higher up in ASLs). Which doesn't necessarily happen even if there are some catastrophes with a limited blast radius, since they get to be balanced out by positive effects.

comment by Chris_Leong · 2024-05-20T06:04:38.248Z · LW(p) · GW(p)

"Presently beyond the state of the art... I think that would be pretty cool"

Point taken, but that doesn't make it sufficient for avoiding society-level catastrophes.

comment by Zach Stein-Perlman · 2024-05-20T05:34:32.920Z · LW(p) · GW(p)

I think this is implicit — the RSP discusses deployment mitigations, which can't be enforced if the weights are shared.

Replies from: Chris_Leong
comment by Chris_Leong · 2024-05-20T05:38:08.631Z · LW(p) · GW(p)

That's the exact thing I'm worried about: that people will equate deploying a model via API with releasing open weights, when the latter carries significantly more risk due to the potential for future modification and the inability to withdraw it.