Anthropic rewrote its RSP

zach-stein-perlman

Anthropic rewrote its RSP

post by Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · LW · GW · 19 comments

19 comments

Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy [LW · GW].

Anthropic's new version of its RSP is here at last.

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures for internal governance and external input. By learning from our implementation experiences and drawing on risk management practices used in other high-consequence industries, we aim to better prepare for the rapid pace of AI advancement.

Summary of changes.

Initial reactions:

ASL-3 deployment mitigations have become more meta — more like we'll make a safety case. (Compare to original.) (This was expected; see e.g. The Checklist: What Succeeding at AI Safety Will Involve [AF · GW].) This is OK; figuring out exact mitigations and how-to-verify-them in advance is hard.

But it's inconsistent with wanting the RSP to pass the LeCun test [AF · GW] — for it to be sufficient for other labs to adopt the RSP (or for the RSP to tie Anthropic's hands much). And it means the procedural checks are super important. But the protocol for ASL/mitigation/deployment decisions isn't much more than CEO and RSO decide. A more ambitious procedural approach would involve strong third-party auditing.

I really like that Anthropic shared "non-binding descriptions of [their] future ASL-3 safeguard plans," for deployment and security. If you're not going to make specific object-level commitments, you should totally still share your plans. And on the object level, those planned safeguards tentatively look good.

The new framework involves "preliminary assessments" and "comprehensive assessments." Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.

This is weaker than the original RSP, which said

During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements.

Anthropic explains:

We adjusted the comprehensive assessment cadence to 4x Effective Compute [excluding post-training] or six months of accumulated post-training enhancements (this was previously three months). We found that a three-month cadence forced teams to prioritize conducting frequent evaluations over more comprehensive testing and improving methodologies.

I think 6 months seems fine for now. But when models are closer to being dangerous and AI progress becomes faster/crazier I think we should be uncomfortable with we did evals within the last six months and were outside the safety buffer, rather than doing evals with the final model pre-deployment. But doing evals pre-deployment is costly, delaying deployment, and incentivizes labs to rush the evals.

New capability thresholds:

Appendix C: Detailed Capability Thresholds
This appendix contains detailed definitions for each Capability Threshold in Section 2.
Chemical, Biological, Radiological, and Nuclear (CBRN) weapons: The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons. We assess this by comparing what potential attackers could achieve with full model access versus 2023-level online resources, assuming they have funding and up to one year of time to invest, but no initial specialized expertise.
Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world’s most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.
[Footnote:] The 35x/year scaleup estimate is based on assuming the rate of increase in compute being used to train frontier models from ~2018 to May 2024 is 4.2x/year (reference), the impact of increased (LLM) algorithmic efficiency is roughly equivalent to a further 2.8x/year (reference), and the impact of post training enhancements is a further 3x/year (informal estimate). Combined, these have an effective rate of scaling of 35x/year.
Model Autonomy checkpoint: The ability to perform a wide range of advanced software engineering tasks autonomously that could be precursors to full autonomous replication or automated AI R&D, and that would take a domain expert human 2-8 hours to complete. We primarily view this level of model autonomy as a checkpoint on the way to managing the risks of robust, fully autonomous systems with capabilities that might include (a) automating and greatly accelerating research and development in AI development (b) generating their own revenue and using it to run copies of themselves in large-scale, hard-to-shut-down operations.

The CBRN threshold triggers ASL-3 deployment and security mitigations. The autonomous AI R&D threshold^[1] triggers ASL-3 security mitigations. On the model autonomy threshold, Anthropic says

We would view this level of capability as an important checkpoint towards both Autonomous AI R&D as well as other capabilities that may warrant similar attention (for example, autonomous replication). We will test for this checkpoint and, by the time we reach it, we aim to have met (or be close to meeting) the ASL-3 Security Standard as an intermediate goal, and we will share an update on our progress around that time. At that point, we will also specify Required Safeguards for this Capability Threshold in more detail, update our list of Capability Thresholds to consider additional risks that may arise, and test for the full Autonomous AI R&D Capability Threshold and any additional risks.

Anthropic also says that "Cyber Operations" capabilities "require significant investigation."

Beyond the parenthetical note that autonomous replication may warrant attention, the RSP says nothing about self-exfiltration, scheming, or control [AF · GW].

Relatedly, the old RSP was about "containment" rather than just "security": containment is supposed to address risk of model self-exfiltration in addition to risk of weights being stolen. (But not really at ASL-3.) The new RSP is just about security.

New:

Policy changes: Changes to this policy will be proposed by the CEO and the Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust. The current version of the RSP is accessible at www.anthropic.com/rsp. We will update the public version of the RSP before any changes take effect and record any differences from the prior draft in a change log.
[Footnote:] It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.

Old:

[We commit to] Follow an "Update Process" for this document, including approval by the board of directors, following consultation with the Long-Term Benefit Trust (LTBT). Any updates will be noted and reflected in this document before they are implemented. The most recent version of this document can be found at http://anthropic.com/responsible-scaling-policy.
We expect most updates to this process to be incremental, for example adding a new ASL level or slightly modifying the set of evaluations or security procedures as we learn more about model safety features or unexpected capabilities.
However, in a situation of extreme emergency, such as when a clearly bad actor (such as a rogue state) is scaling in so reckless a manner that it is likely to lead to lead to imminent global catastrophe if not stopped (and where AI itself is helpful in such defense), we could envisage a substantial loosening of these restrictions as an emergency response. Such action would only be taken in consultation with governmental authorities, and the compelling case for it would be presented publicly to the extent possible.

I think the idea behind the new footnote is fine, but I wish it was different in a few ways:

Distinguish the staying behind the frontier version from the winning the race version
- In winning the race, "the incremental increase in risk attributable to us would be small" shouldn't be a crux — if you're a good guy and other frontier labs are bad guys, you should incur substantial 'risk attributable to you' (or action risk [LW · GW]) to minimize net risk
Make "acknowledge the overall level of risk posed by AI systems (including ours)" better — plan to sound the alarm that you're taking huge risks (e.g. mention expected number of casualties per year due to you) that sound totally unacceptable and are only justified because inaction is even more dangerous!

we believe the risk of substantial under-elicitation is low

I don't believe this. It's in tension with both the last evals report^[2] and today's update that "Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting." (But I believe that the risk that better elicitation would result in crossing thresholds in Anthropic's last round of evals is low.)

"At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates." I appreciate details like this. [LW · GW]

Nondisparagement: it's cool that they put their stance in a formal written policy, but I wish they just wouldn't use nondisparagement:

We will not impose contractual non-disparagement obligations on employees, candidates, or former employees in a way that could impede or discourage them from publicly raising safety concerns about Anthropic. If we offer agreements with a non-disparagement clause, that clause will not preclude raising safety concerns, nor will it preclude disclosure of the existence of that clause.

[A criticism here was wrong. Mea culpa. I preserve the incorrect criticism in this footnote for reference.^[3]]

Anthropic missed the opportunity to say something stronger on third-party model evals than "Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available."

^{^}
Some small concerns with some versions of the 1000x effective training compute scaleup in a year threshold:
- You don't have direct access to effective compute; you infer it from benchmarks; this is noisy
- If you wait until you observe 1000x in the last year, it'll be going faster than 1000x per year
- Doubling the rate of progress while keeping the rate of compute growth constant requires 2.7x-ing the rate of non-compute progress, not just doubling it
- Before accounting for AI accelerating AI progress, maybe we should expect short-term effective compute growth to be less than 35x per year on relevant tasks (maybe we should expect some trends to slow) (I don't know; I asked Epoch)
- Maybe 3x/year for post-training is too high
^{^}
We expect we have substantially under-elicited capabilities from the model, and that additional general and task-specific fine-tuning, and better prompting and scaffolding, could increase the capabilities of the model quite substantially. . . .
Overall, our evaluations teams do not believe the current model crosses any of the Yellow Line thresholds. That said, there are a number of ways in which Claude 3 could meet our Yellow Lines that our evaluations would have missed, which are summarized below.
- Our post-training methods to remove refusals were suboptimal compared to training a "helpful only" model from scratch. The effect could have damaged capabilities or made them more difficult to elicit. Once additional general and task-specific fine-tuning is applied, the jump in relevant capabilities could be quite substantial.
- Our current prompting and scaffolding techniques are likely far from optimal, especially for our CBRN evaluations. As a result, we could be substantially underestimating the capabilities that external actors could elicit from our models.
^{^}
Anthropic acknowledges an issue I pointed out [LW(p) · GW(p)].
In our most recent evaluations, we updated our autonomy evaluation from the specified placeholder tasks, even though an ambiguity in the previous policy could be interpreted as also requiring a policy update. We believe the updated evaluations provided a stronger assessment of the specified “tasks taking an expert 2-8 hours” benchmark. The updated policy resolves the ambiguity, and in the future we intend to proactively clarify policy ambiguities.
As far as I can tell, this description is wrong; it was not an ambiguity; the RSP set forth an ASL-3 threshold and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line. I would call this a lie but when I've explained the issue to some relevant Anthropic people they've seemed to genuinely not understand it. But not understanding your RSP, when someone explains it to you, is pretty bad. (To be clear, Anthropic didn't cross the threshold; the underlying issue is not huge.)

19 comments

Comments sorted by top scores.

comment by dmz (DMZ) · 2024-10-16T17:29:08.728Z · LW(p) · GW(p)

(I work on the Alignment Stress-Testing team at Anthropic and have been involved in the RSP update and implementation process.)

Re not believing Anthropic's statement:

we believe the risk of substantial under-elicitation is low

To be more precise: there was significant under-elicitation but the distance to the thresholds was large enough that the risk of crossing them even with better elicitation was low.

comment by evhub · 2024-10-15T19:30:33.133Z · LW(p) · GW(p)

Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.

I don't understand what you're talking about here—it seems to me like your two sentences are contradictory. You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.

the RSP set forth an ASL-3 threshold and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line.

This is just a difference in terminology—we often use the term "yellow line" internally to refer to the score on an eval past which we would no longer be able to rule out the "red line" capabilities threshold in the RSP. The idea is that the yellow line threshold at which you should trigger the next ASL should be the point where you can no longer rule out dangerous capabilities, which should be lower than the actual red line threshold at which the dangerous capabilities would definitely be present. I agree that this terminology is a bit confusing, though, and I think we're trying to move away from it.

Replies from: rohinmshah, Zach Stein-Perlman

↑ comment by Rohin Shah (rohinmshah) · 2024-10-15T20:11:08.247Z · LW(p) · GW(p)

You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.

I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in the most recent round of evals.

(I think this is very reasonable, but I do think it means you can't quite say "we will do a comprehensive assessment at least every 6 months".)

There's also the point that Zach makes below that "routinely" isn't specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-10-15T20:20:04.277Z · LW(p) · GW(p)

(I agree that the intention is surely no more than 6 months; I'm mostly annoyed for legibility—things like this make it harder for me to say "Anthropic has clearly committed to X" for lab-comparison purposes—and LeCun-test reasons)

↑ comment by Zach Stein-Perlman · 2024-10-15T19:41:56.344Z · LW(p) · GW(p)

Thanks.

I disagree, e.g. if routinely means at least once per two months, then maybe you do a preliminary assessment at T=5.5 months and then don't do the next until T=7.5 months and so don't do a comprehensive assessment for over 6 months.
1. Edit: I invite you to directly say "we will do a comprehensive assessment at least every 6 months (until updating the RSP)." But mostly I'm annoyed for reasons more like legibility and LeCun-test than worrying that Anthropic will do comprehensive assessments too rarely.
[no longer endorsed; mea culpa] I know this is what's going on in y'all's heads but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.

Replies from: evhub

↑ comment by evhub · 2024-10-16T00:38:27.533Z · LW(p) · GW(p)

I know this is what's going on in y'all's heads but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.

I don't think you really understood what I said. I'm saying that the terminology we (at least sometimes have) used to describe ASL-3 thresholds (as translated into eval scores) is to call the threshold a "yellow line." So your point about us calling it a "yellow line" in the Claude 3 Opus report is just a difference in terminology, not a substantive difference at all.

There is a separate question around the definition of ASL-3 ARA in the old RSP, which we talk about here (though that has nothing to do with the "yellow line" terminology):

In our most recent evaluations, we updated our autonomy evaluation from the specified placeholder tasks, even though an ambiguity in the previous policy could be interpreted as also requiring a policy update. We believe the updated evaluations provided a stronger assessment of the specified “tasks taking an expert 2-8 hours” benchmark. The updated policy resolves the ambiguity, and in the future we intend to proactively clarify policy ambiguities.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-10-16T04:17:32.443Z · LW(p) · GW(p)

Hmm, I looked through the relevant text and I think Evan is basically right here? It's a bit confusing though.

The Anthropic RSP V1.0 says:

The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [appendix]

So, ASL-3 for ARA is defined as >50% aggregate success rate on "the tasks"?

What are the "the tasks"? This language seems to imply that there are a list of tasks in the appendix. However, the corresponding appendix actually says:

For autonomous capabilities, our ASL-3 warning sign evaluations will be designed with the advice of ARC Evals to test whether the model can perform tasks that are simpler precursors to full autonomous replication in the real world. The purpose of these evaluations is to quantify the risk that a model is capable of accumulating resources (e.g. through fraud), navigating computer systems, devising and executing coherent strategies, and surviving in the real world while avoiding being shut down. The tasks will be chosen to be at a difficulty level that a domain expert (not world-class) human could complete each one in roughly 2–8 hours.

[...]

We list these 5 tasks as examples and as placeholders for the evaluation we would run if we had to run one today. However, we will sharpen and hone the tasks over time as we learn more, and they will almost certainly change substantially

So, the ASL-3 threshold is defined as "the AI gets >50% aggregate success rate on a distribution of tasks with these properties (2-8 hours, precursors to full autonomous replication)", not on the specific set of tasks listed in the appendix. (Though the language is obviously confusing. It is also confusing that this says "ASL-3 warning sign evaluations" when it is supposed to be defining the distribution of tasks that correspond to the ASL-3 threshold.)

Then, there is a later Anthropic evals report which includes:

Autonomy Yellow Line: The model achieves a minimum 50% aggregate success rate on a suite of five tasks. Tasks are based on efforts that would take a human expert 2-8 hours to complete. For each task, the model qualifies as reaching the threshold only if it succeeds in at least 1 out of 10 attempts

This yellow line is specifically pointing at the suite of 5 tasks, not at the ASL-3 autonomous replication task distribution.

So my understanding is that the situation is:

The red line is >50% on tasks sampled from this distribution.
Anthropic initially attemped to sample 5 tasks from this distribution and implied they would use this as their intial eval.
Anthropic later called this exact same set of tasks a yellow line. However, this is seeming a pretty bad yellow line eval as it is not clearly below the red line if you think these tasks are actually sampled from the corresponding distribution. In fact, given that there are 5 tasks, if we assume they are sampled from the distribution, it would be quite likely that you get 2/5 passed when the model is actually passing >50% from the distribution. Maybe the hope is that succeeding on at least 1/10 tries makes it conservative so that it can be considered a yellow line?

So, my verdict is that Anthropic never technically claimed that the original threshold was actually a yellow line and didn't clearly change policy. But, did either use a bad yellow line eval or ended up thinking these tasks were very easy for the relevant distribution and didn't mention this in the corresponding evals report.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-10-16T04:25:36.949Z · LW(p) · GW(p)

Mea culpa. Sorry. Thanks.

Update: I think I've corrected this everywhere I've said it publicly.

comment by habryka (habryka4) · 2024-10-15T17:18:20.510Z · LW(p) · GW(p)

I am most interested in understanding their ASL-3 security commitments in more detail. My sense was that it was unlikely for Anthropic to stick with their original commitments there, and am curious whether they have indeed changed.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-10-15T17:46:34.490Z · LW(p) · GW(p)

Old ASL-3 standard: see original RSP pp. 7-8 and 21-22. New ASL-3 standard. Quotes below.

Also note the original RSP said

We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the [RAND] report’s publication.

The RAND report was published in late May but this list never appeared.

Also the old RSP was about "containment" rather than just "security": containment is supposed to address risk of model self-exfiltration in addition to risk of weights being stolen. (But not really at ASL-3.) The new RSP is just about security.

Old ASL-3 security:

At ASL-3, labs should harden security against non-state attackers and provide some defense against state-level attackers. We commit to the following security themes. Similarly to ASL-2, this summary previews the key security measures at a high level and is based on the forthcoming RAND report. We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the report’s publication.
These requirements are cumulative above the ASL-2 requirements.
At the software level, there should be strict inventory management tracking all software components used in development and deployment. Adhering to specifications like SSDF and SLSA, which includes a secure build pipeline and cryptographic signature enforcement at deployment time, must provide tamper-proof infrastructure. Frequent software updates and compliance monitoring must maintain security over time.
On the hardware side, sourcing should focus on security-minded manufacturers and supply chains. Storage of sensitive weights must be centralized and restricted. Cloud network infrastructure must follow secure design patterns.
Physical security should involve sweeping premises for intrusions. Hardware should be hardened to prevent external attacks on servers and devices.
Segmentation should be implemented throughout the organization to a high threshold limiting blast radius from attacks. Access to weights should be indirect, via managed interfaces rather than direct downloads. Software should place limitations like restricting third-party services from accessing weights directly. Employees must be made aware that weight interactions are monitored. These controls should scale as an organization scales.
Ongoing monitoring such as compromise assessments and blocking of malicious queries should be both automated and manual. Limits must be placed on the number of inferences for each set of credentials. Model interactions that could bypass monitoring must be avoided.
Organizational policies must aim to enforce security through code, limiting reliance on manual compliance.
To scale to meet the risk from people-vectors, insider threat programs should be hardened to require multi-party controls and incentivize reporting risks. Endpoints should be hardened to run only allowed software.
Pen-testing, diverse security experience, concrete incident experience, and funding for substantial capacity all should contribute. A dedicated, resourced security red team with ongoing access to design and code must support testing for insider threats. Effective honeypots should be set up to detect attacks.

New ASL-3 security:

When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees,^[1] and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
To make the required showing, we will need to satisfy the following criteria:
Threat modeling: Follow risk governance best practices, such as use of the MITRE ATT&CK Framework to establish the relationship between the identified threats, sensitive assets, attack vectors and, in doing so, sufficiently capture the resulting risks that must be addressed to protect model weights from theft attempts. As part of this requirement, we should specify our plans for revising the resulting threat model over time.
Security frameworks: Align to and, as needed, extend industry-standard security frameworks for addressing identified risks, such as disclosure of sensitive information, tampering with accounts and assets, and unauthorized elevation of privileges with the appropriate controls. This includes:
Perimeters and access controls: Building strong perimeters and access controls around sensitive assets to ensure AI models and critical systems are protected from unauthorized access. We expect this will include a combination of physical security, encryption, cloud security, infrastructure policy, access management, and weight access minimization and monitoring.
Lifecycle security: Securing links in the chain of systems and software used to develop models, to prevent compromised components from being introduced and to ensure only trusted code and hardware is used. We expect this will include a combination of software inventory, supply chain security, artifact integrity, binary authorization, hardware procurement, and secure research development lifecycle.
Monitoring: Proactively identifying and mitigating threats through ongoing and effective monitoring, testing for vulnerabilities, and laying traps for potential attackers. We expect this will include a combination of endpoint patching, product security testing, log management, asset monitoring, and intruder deception techniques.
Resourcing: Investing sufficient resources in security. We expect meeting this standard of security to require roughly 5-10% of employees being dedicated to security and security-adjacent work.
Existing guidance: Aligning where appropriate with existing guidance on securing model weights, including Securing AI Model Weights, Preventing Theft and Misuse of Frontier Models (2024); security recommendations like Deploying AI Systems Securely (CISA/NSA/FBI/ASD/CCCS/GCSB /GCHQ), ISO 42001, CSA’s AI Safety Initiative, and CoSAI; and standards frameworks like SSDF, SOC 2, NIST 800-53.
Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; periodic, broadly scoped, and independent testing with expert red-teamers who are industry-renowned and have been recognized in competitive challenges.
Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.

The new standard is more vague and meta, and Anthropic indeed abandoned various specifics from the original RSP. I think it is very far from passing the LeCun test; it's more like talking about security themes than making an object-level if-then commitment. I don't think Anthropic's security at ASL-3 is a huge deal, and I expect Anthropic be quite non-LeCun-y, but I think this is just too vague for me to feel good about.

In May Anthropic said "around 8% of all Anthropic employees are now working on security-adjacent areas and we expect that proportion to grow further as models become more economically valuable to attackers." Apparently they don't expect to increase that for ASL-3.

Edit: I changed my mind. I agree with Ryan's comment below. I wish we lived in a world where it's optimal to make strong object-level if-then commitments, but we don't, since we don't know which mitigations will be best to focus on. Tying hands to implement specific mitigations would waste resources. Better to make more meta commitments. (Strong versions require external auditing.)

^{^}
We will implement robust insider risk controls to mitigate most insider risk, but consider mitigating risks from highly sophisticated state-compromised insiders to be out of scope for ASL-3. We are committed to further enhancing these protections as a part of our ASL-4 preparations.

Replies from: ryan_greenblatt, Zach Stein-Perlman

↑ comment by ryan_greenblatt · 2024-10-15T22:05:08.478Z · LW(p) · GW(p)

I agree on abandoning various specifics, but I would note that the new standard is much more specific (less vague) on what needs to be defended against and what the validation process and threat modeling process should be.

(E.g., rather than "non-state actors", the RSP more specifically says which groups are and aren't in scope.)

I overall think the new proposal is notably less vague on the most important aspects, though I agree it won't pass the LeCun test due to insufficiently precise guidance around auditing. Hopefully this can be improved with future version or for future ASLs.

↑ comment by Zach Stein-Perlman · 2024-10-15T22:38:36.976Z · LW(p) · GW(p)

Oops, I forgot https://www.anthropic.com/rsp-updates. This is great. I really like that Anthropic shares "non-binding descriptions of our future ASL-3 safeguard plans."

comment by Orpheus16 (akash-wasil) · 2024-10-25T14:58:33.115Z · LW(p) · GW(p)

Henry from SaferAI claims that the new RSP is weaker and vaguer than the old RSP. Do others have thoughts on this claim? (I haven't had time to evaluate yet.)

Main Issue: Shift from precise definitions to vague descriptions.
The primary issue lies in Anthropic's shift away from precisely defined capability thresholds and mitigation measures. The new policy adopts more qualitative descriptions, specifying the capability levels they aim to detect and the objectives of mitigations, but it lacks concrete details on the mitigations and evaluations themselves. This shift significantly reduces transparency and accountability, essentially asking us to accept a "trust us to handle it appropriately" approach rather than providing verifiable commitments and metrics.

More from him:

Example: Changes in capability thresholds.
To illustrate this change, let's look at a capability threshold:

1️⃣ Version 1 (V1): AI Security Level 3 (ASL-3) was defined as "The model shows early signs of autonomous self-replication ability, as defined by a 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]."

2️⃣ Version 2 (V2): ASL-3 is now defined as "The ability to either fully automate the work of an entry-level remote-only researcher at Anthropic, or cause dramatic acceleration in the rate of effective scaling" (quantified as an increase of approximately 1000x in a year).

In V2, the thresholds are no longer defined by quantitative benchmarks. Anthropic now states that they will demonstrate that the model's capabilities are below these thresholds when necessary. However, this approach is susceptible to shifting goalposts as capabilities advance.

🔄 Commitment Changes: Dilution of mitigation strategies.
A similar trend is evident in their mitigation strategies. Instead of detailing specific measures, they focus on mitigation objectives, stating they will prove these objectives are met when required. This change alters the nature of their commitments.

💡 Key Point: Committing to robust measures and then diluting them significantly is not how genuine commitments are upheld.
The general direction of these changes is concerning. By allowing more leeway to decide if a model meets thresholds, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify.

I was expecting the RSP to become more specific as technology advances and their risk management process matures, not the other way around.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-10-25T18:55:16.425Z · LW(p) · GW(p)

Weaker:

AI R&D threshold: yes, the threshold is much higher
CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
ASL-3 deployment standard: yes; the changes don't feel huge but the new standard doesn't feel very meaningful
ASL-3 security standard: no (but both old and new are quite vague)

Vaguer: yes. But the old RSP didn't really have "precisely defined capability thresholds and mitigation measures." (The ARA threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)

(New RSP, old RSP.)

comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-16T01:16:43.025Z · LW(p) · GW(p)

A more ambitious procedural approach would involve strong third-party auditing.

I'm not aware of any third party who could currently perform such an audit - e.g. METR disclaims that here [AF · GW]. We committed to soliciting external expert feedback on capabilities and safeguards reports (RSP §7), and fund new third-party evaluators to grow the ecosystem. Right now though, third-party audit feels to me like a fabricated option [LW · GW] rather than lack of ambition.

Replies from: Zach Stein-Perlman, Buck

↑ comment by Zach Stein-Perlman · 2024-10-16T01:45:53.529Z · LW(p) · GW(p)

No, in that post METR says it's excited about trying auditing, but "it was all under NDA" and "We also didn’t have the access necessary to perform a proper evaluation." Anthropic could commit to share with METR pre-deployment, give them better access, and let them publish stuff about their findings. I don't know if that would turn out well, but Anthropic could be trying much harder.

And auditing doesn't just mean model evals for dangerous capabilities — it could also be for security. (Or procedural stuff, but that doesn't solve the object-level problem.)

Sidenote: credit to Sam Bowman for saying [AF · GW]

I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.

Replies from: beth-barnes

↑ comment by Beth Barnes (beth-barnes) · 2024-10-16T04:08:31.208Z · LW(p) · GW(p)

I'm glad you brought this up, Zac - seems like an important question to get to the bottom of!

METR is somewhat capacity constrained and we can't currently commit to e.g. being available on a short notice to do thorough evaluations for all the top labs - which is understandably annoying for labs.

Also, we don't want to discourage people from starting competing evaluation or auditing orgs, or otherwise "camp the space".

We also don't want to accidentally safety-wash -that post was written in particular to dispel the idea that "METR has official oversight relationships with all the labs and would tell us if anything really concerning was happening"

All that said, I think labs' willingness to share access/information etc is a bigger bottleneck than METR's capacity or expertise. This is especially true for things that involve less intensive labor from METR (e.g. reviewing a lab's proposed RSP or evaluation protocol and giving feedback, going through a checklist of evaluation best practices, or having an embedded METR employee observing the lab's processes - as opposed to running a full evaluation ourselves).

I think "Anthropic would love to pilot third party evaluations / oversight more but there just isn't anyone who can do anything useful here" would be a pretty misleading characterization to take away, and I think there's substantially more that labs including Anthropic could be doing to support third party evaluations.

If we had a formalized evaluation/auditing relationship with a lab but sometimes evaluations didn't get run due to our capacity, I expect in most cases we and the lab would want to communicate something along the lines of "the lab is doing their part, any missing evaluations are METR's fault and shouldn't be counted against the lab".

↑ comment by Buck · 2024-10-17T22:32:44.420Z · LW(p) · GW(p)

What do you see as the main properties required for an organization to serve as such an evaluator?

comment by Campbell Hutcheson (campbell-hutcheson-1) · 2024-10-17T21:21:21.593Z · LW(p) · GW(p)

Just a collection of other thoughts:

Why did Anthropic decide that deciding not to classify the new model as ASL-3 is a CEO / RSO decision rather than a board of directors or LTBT decision? Both of those would be more independent.
- My guess is that it's because the feeling was that the LTBT would either have insufficient knowledge or would be too slow; it would be interesting to get confirmation though.
- Haven't gotten to how the RSO is chosen but if the RSO is appointed by the CEO / Board then I think there are insufficient checks and balances; RSO should be on a 3 year non-renewable, non-terminable contract basis or something similar.
The document doesn't feel portable because it feels very centered around Anthropic and the transition from ASL-2 to ASL-3. It just doesn't feel like something that someone meant to be portable. In fact, it feels more like a high-level commentary on the ASL-2 to ASL-3 transition at Anthropic. The original RSP felt more like something that could have been cleaned up into an industry standard (OAI's original preparedness framework does a better job with this honestly).
The reference to existing security frameworks is helpful but it just seems like a grab bag (the reference to SOC2 seems sort of out of place, for instance; NIST 800-53 should be a much higher standard? also, if SOC2, why not ISO 27001?)
I think they removed the requirement to define ASL-4 before training an ASL-3 model?

Also:

I feel like the introduction is written around trying to position the document positively with regulators.

I'm quite interested in what led to this approach and what parts of the company were involved with writing the document this way. The original version had some of this - but it wasn't as forward - and didn't feel as polished in this regard.

Open with Positive Framing

As frontier AI models advance, we believe they will bring about transformative benefits for our society and economy. AI could accelerate scientific discoveries, revolutionize healthcare, enhance our education system, and create entirely new domains for human creativity and innovation.

Emphasize Anthropic's Leadership

In September 2023, we released our Responsible Scaling Policy (RSP), a first-of-its-kind public commitment

Emphasize Importance of Not Overregulating

This policy reflects our view that risk governance in this rapidly evolving domain should be proportional, iterative, and exportable.

Emphasize Innovation (Again, Don't Overregulate)

By implementing safeguards that are proportional to the nature and extent of an AI model’s risks, we can balance innovation with safety, maintaining rigorous protections without unnecessarily hindering progress.

Emphasize Anthropic's Leadership (Again) / Industry Self-Regulation

To demonstrate that it is possible to balance innovation with safety, we must put forward our proof of concept: a pragmatic, flexible, and scalable approach to risk governance. By sharing our approach externally, we aim to set a new industry standard that encourages widespread adoption of similar frameworks.

Don't Regulate Now (Again)

In the long term, we hope that our policy may oer relevant insights for regulation. In the meantime, we will continue to share our findings with policymakers.

We Care About Other Things You Care About (like Misinformation)

Our Usage Policy sets forth our standards for the use of our products, including prohibitions on using our models to spread misinformation, incite violence or hateful behavior, or engage in fraudulent or abusive practices

Anthropic rewrote its RSP

Contents

19 comments