Anthropic's updated Responsible Scaling Policy

post by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · LW · GW · 3 comments

This is a link post for https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy

Contents

  The promise and challenge of advanced AI
  A framework for proportional safeguards
  Implementation and oversight
  Learning from experience
  Looking ahead

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures for internal governance and external input. By learning from our implementation experiences and drawing on risk management practices used in other high-consequence industries, we aim to better prepare for the rapid pace of AI advancement.

The promise and challenge of advanced AI

As frontier AI models advance, they have the potential to bring about transformative benefits for our society and economy. AI could accelerate scientific discoveries, revolutionize healthcare, enhance our education system, and create entirely new domains for human creativity and innovation. However, frontier AI systems also present new challenges and risks that warrant careful study and effective safeguards.

In September 2023, we released our Responsible Scaling Policy, a framework for managing risks from increasingly capable AI systems. After a year of implementation and learning, we are now sharing a significantly updated version that reflects practical insights and accounts for advancing technological capabilities.

Although this policy focuses on catastrophic risks like the categories listed below, they are not the only risks that we monitor and prepare for. Our Usage Policy sets forth our standards for the use of our products, including rules that prohibit using our models to spread misinformation, incite violence or hateful behavior, or engage in fraudulent or abusive practices. We continually refine our technical measures for enforcing our trust and safety standards at scale. Further, we conduct research to understand the broader societal impacts of our models. Our Responsible Scaling Policy complements our work in these areas, contributing to our understanding of current and potential risks.

A framework for proportional safeguards

As before, we maintain our core commitment: we will not train or deploy models unless we have implemented safety and security measures that keep risks below acceptable levels. Our RSP is based on the principle of proportional protection: safeguards that scale with potential risks. To do this, we use AI Safety Level Standards (ASL Standards), graduated sets of safety and security measures that become more stringent as model capabilities increase. Inspired by Biosafety Levels, these begin at ASL-1 for models that have very basic capabilities (for example, chess-playing bots) and progress through ASL-2, ASL-3, and so on.

In our updated policy, we have refined our methodology for assessing specific capabilities (and their associated risks) and implementing proportional safety and security measures. Our updated framework has two key components:

At present, all of our models operate under ASL-2 Standards, which reflect current industry best practices. Our updated policy defines two key Capability Thresholds that would require upgraded safeguards:

ASL-3 safeguards involve enhanced security measures and deployment controls. On the security side, this will include internal access controls and more robust protection of model weights. For deployment risks, we plan to implement a multi-layered approach to prevent misuse, including real-time and asynchronous monitoring, rapid response protocols, and thorough pre-deployment red teaming.

Implementation and oversight

To contribute to effective implementation of the policy, we have established:

Learning from experience

We have learned a lot in our first year with the previous RSP in effect, and are using this update as an opportunity to reflect on what has worked well and what makes sense to update in the policy. As part of this, we conducted our first review of how well we adhered to the framework and identified a small number of instances where we fell short of meeting the full letter of its requirements. These included procedural issues, such as completing a set of evaluations three days later than scheduled and a lack of clarity about how and where we should note changes to our placeholder evaluations. We also flagged some evaluations where we may have been able to elicit slightly better model performance through implementing standard techniques (such as chain-of-thought or best-of-N).

In all cases, we found these instances posed minimal risk to the safety of our models. We used the additional three days to refine and improve our evaluations; the different set of evaluations we used provided a more accurate assessment than the placeholder evaluations; and our evaluation methodology still showed we were sufficiently far from the thresholds. From this, we learned two valuable lessons to incorporate into our updated framework: we needed to incorporate more flexibility into our policies, and we needed to improve our process for tracking compliance with the RSP. You can read more here.

Since we first released the RSP a year ago, our goal has been to offer an example of a framework that others might draw inspiration from when crafting their own AI risk governance policies. We hope that proactively sharing our experiences implementing our own policy will help other companies in implementing their own risk management frameworks and contribute to the establishment of best practices across the AI ecosystem.

Looking ahead

The frontier of AI is advancing rapidly, making it challenging to anticipate what safety measures will be appropriate for future systems. All aspects of our safety program will continue to evolve: our policies, evaluation methodology, safeguards, and our research into potential risks and mitigations.

Additionally, Co-Founder and Chief Science Officer Jared Kaplan will serve as Anthropic’s Responsible Scaling Officer, succeeding Co-Founder and Chief Technology Officer Sam McCandlish who held this role over the last year. Sam oversaw the RSP’s initial implementation and will continue to focus on his duties as Chief Technology Officer. As we work to scale up our efforts on implementing the RSP, we’re also opening a position for a Head of Responsible Scaling. This role will be responsible for coordinating the many teams needed to iterate on and successfully comply with the RSP.

Read the updated policy at anthropic.com/rsp, and supplementary information at anthropic.com/rsp-updates.

We extend our sincere gratitude to the many external groups that provided invaluable feedback on the development and refinement of our Responsible Scaling Policy.


  1. We have also shared our assessment methodology with both AI Safety Institutes, as well as a selection of independent experts and organizations, for feedback. This does not represent an endorsement from either AI Safety Institute or the independent experts and organizations. ↩︎

3 comments


comment by aysja · 2024-10-16T13:59:17.596Z · LW(p) · GW(p)

In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if evaluations triggered, then they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP as much more than a declared intent to act responsibly, rather than a commitment. But the updated RSP seems to be far worse, even, than that: the “if” is no longer dependent on the outcomes of pre-specified evaluations, but on the personal judgment of Dario Amodei and Jared Kaplan.

Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.

This seems strictly worse to me. Some room for flexibility is warranted, but this strikes me as almost maximally flexible, in that practically nothing is predefined—not evaluations, nor safeguards, nor responses to evaluations. This update makes the RSP more subjective, qualitative, and ambiguous. And if Anthropic is going to make the RSP weaker, I wish this were noted more as an apology, or along with a promise to rectify this in the future. Especially because after a year, Anthropic presumably has more information about the risk than before. Why, then, is even more flexibility needed now? What would cause Anthropic to make clear commitments? 

I also find it unsettling that the ASL-3 risk threshold has been substantially changed, and the reasoning for this is not explained. In the first RSP, a model was categorized as ASL-3 if it was capable of various precursors for autonomous replication. Now, this has been downgraded to a “checkpoint,” a point at which they promise to evaluate the situation more thoroughly, but don’t commit to taking any particular actions: 

We replaced our previous autonomous replication and adaption (ARA) threshold with a “checkpoint” for autonomous AI capabilities. Rather than triggering higher safety standards automatically, reaching this checkpoint will prompt additional evaluation of the model’s capabilities and accelerate our preparation of stronger safeguards.

This strikes me as a big change. The ability to self-replicate is already concerning, but the ability to perform AI R&D seems potentially catastrophic, risking loss of control or extinction. Why does Anthropic now think this shouldn’t count as ASL-3? Why have they replaced this criterion with a substantially riskier one instead? 


Dario estimates the probability of something going “really quite catastrophically wrong, on the scale of human civilization” as between 10% and 25%. He also thinks this might happen soon—perhaps between 2025 and 2027. It seems obvious to me that a policy this ambiguous, this dependent on figuring things out on the fly, this beset with such egregious conflicts of interest, is a radically insufficient means of managing risk from a technology which poses so grave and imminent a threat to our world.

comment by RobertM (T3t) · 2024-10-17T03:53:15.365Z · LW(p) · GW(p)

Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.

This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).

Re: predefined evaluations, the old policy neither specified any evaluations in full detail, nor did it suggest that Anthropic would have designed the evaluations prior to a training run.  (Though I'm not sure that's what you meant when you contrasted it with "employees design and run them on the fly" as a description of the new policy.)

Re: CEO's decisionmaking, my understanding of the new policy is that the CEO (and RSO) will be responsible only for approving or denying an evaluation report making an affirmative case that a new model does not cross a relevant capability threshold ("3.3 Capability Decision", original formatting removed, all new bolding is mine):

If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows:

  • First, we will compile a Capability Report that documents the findings from the comprehensive assessment, makes an affirmative case for why the Capability Threshold is sufficiently far away, and advances recommendations on deployment decisions.
  • The report will be escalated to the CEO and the Responsible Scaling Officer, who will (1) make the ultimate determination as to whether we have sufficiently established that we are unlikely to reach the Capability Threshold and (2) decide any deployment-related issues.
  • In general, as noted in Sections 7.1.4 and 7.2.2, we will solicit both internal and external expert feedback on the report as well as the CEO and RSO’s conclusions to inform future refinements to our methodology. For high-stakes issues, however, the CEO and RSO will likely solicit internal and external feedback on the report prior to making any decisions.
  • If the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.

The same is true for the "Safeguards Decision" (i.e. making an affirmative case that ASL-3 Required Safeguards have been sufficiently implemented, given that there is a model that has passed the relevant capabilities thresholds).

This is not true for the "Interim Measures" described as an allowable stopgap if Anthropic finds itself in the situation of having a model that requires ASL-3 Safeguards but is unable to implement those safeguards.  My current read is that this is intended to cover the case where the "Capability Decision" report made the case that a model did not cross into requiring ASL-3 Safeguards, was approved by the CEO & RSO, and then later turned out to be wrong.  It does seem like this permits more or less indefinite deployment of a model that requires ASL-3 Safeguards by way of "interim measures" which need to provide "the same level of assurance as the relevant ASL-3 Standard", with no provision for what to do if it turns out that implementing the actually-specified ASL-3 standard is intractable.  This seems slightly worse than the old policy:

If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers. We will provide transparency and support to impacted customers throughout the process. An emergency of this type would merit a detailed post-mortem and a policy shift to avoid re-occurrence of this situation.

which has much the same immediate impact, but with at least a nod to a post-mortem and policy adjustment.

But, overall, the new policy doesn't seem to be opening up a gigantic hole that allows Dario to press the "all clear" button on capability determinations; he only has the additional option to veto, after the responsible team has already decided the model doesn't cross the threshold.

comment by aysja · 2024-10-17T23:05:26.449Z · LW(p) · GW(p)

Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.

That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained so much wiggle room that I didn’t interpret it as imposing real constraints on Anthropic’s actions; but it did at least seem to me to be aiming at well-specified “if’s,” i.e., ones which depended on the results of specific evaluations. Like, the previous RSP describes their response policy as: “If an evaluation threshold triggers, we will follow the following procedure” (emphasis mine), where the trigger for autonomous risk happens if “at least 50% of the tasks are passed.” 

In other words, the “if’s” in the first RSP seemed more objective to me; the current RSP strikes me as a downgrade in that respect. Now, instead of an evaluation threshold, the “if” is determined by some opaque internal process at Anthropic that the document largely doesn’t describe. I think in practice this is what was happening before—i.e., that the policy basically reduced to Anthropic crudely eyeballing the risk—but it’s still disappointing to me to see this level of subjectivity more actively codified into policy. 

My impression is also that this RSP is more Orwellian than the first one, and this is part of what I was trying to gesture at. Not just that their decision process has become more ambiguous and subjective, but that the whole thing seems designed to be glossed over, such that descriptions of risks won’t really load in readers’ minds. This RSP seems much sparser on specifics, and much heavier on doublespeak—e.g., they use the phrase “unable to make the required showing” to mean “might be terribly dangerous.” It also seems to me to describe many things too vaguely to easily argue against. For example, they claim they will “explain why the tests yielded such results,” but my understanding is that this is mostly not possible yet, i.e., that it’s an open scientific question, for most such tests, why their models produce the behavior they do. But without knowing what “tests” they mean, nor the sort of explanations they’re aiming for, it’s hard to argue with; I’m suspicious this is intentional.