Anthropic: Three Sketches of ASL-4 Safety Case Components

post by Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · 2 comments

This is a link post for https://alignment.anthropic.com/2024/safety-cases/


The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.

This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it represents Anthropic's thinking and is their working/best sketch of ASL-4 safety cases.

The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.

Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.

2 comments

Comments sorted by top scores.

comment by Zach Stein-Perlman · 2024-11-06T16:45:07.175Z

When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:

  1. Step X will be ~impossible to justify; hopefully the lab will realize that (if nothing else, then when they try to justify step X) and look for a different kind of safety case
  2. Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests they don't understand the issue, and that when they make a safety case they'll likely think they're justifying step X when they're not

Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why.

comment by Zach Stein-Perlman · 2024-11-06T16:00:25.325Z

My quick reactions:

  1. Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
  2. Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"). I think the naive pre-deployment honeypots path doesn't work, because a scheming model wins with the defect-rarely strategy (a rough numerical sketch of this is below; related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong). Very sophisticated or expensive honeypotting setups could work, but the post doesn't seem to be imagining that or to notice the issue; that's important, and the current language in the post is concerning
  3. Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
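
To make the defect-rarely point concrete, here is a minimal back-of-the-envelope sketch. The per-episode defection rate and the episode counts are my own illustrative assumptions, not numbers from the post or the comment: a model that attempts sabotage with a small independent probability per episode is unlikely to be caught during a realistic number of pre-deployment honeypot episodes, yet is nearly certain to defect at some point over a much larger deployment.

```python
# Illustrative sketch only: the rate p and the episode counts are assumed values
# chosen to show the shape of the defect-rarely argument, not figures from the post.

def p_at_least_one_defection(p_per_episode: float, n_episodes: int) -> float:
    """Probability of at least one defection across independent episodes."""
    return 1 - (1 - p_per_episode) ** n_episodes

p = 1e-5                # assumed per-episode defection rate chosen by a scheming model
n_honeypots = 10_000    # assumed number of pre-deployment honeypot episodes
n_deployment = 10**7    # assumed number of real deployment episodes

print(f"P(caught during honeypots)   ~ {p_at_least_one_defection(p, n_honeypots):.3f}")   # ~0.095
print(f"P(defects during deployment) ~ {p_at_least_one_defection(p, n_deployment):.3f}")  # ~1.000
```

With these assumed numbers, the honeypot phase catches the model less than 10% of the time, while defection at some point during deployment is essentially guaranteed; that asymmetry is why a naive pre-deployment honeypot evaluation provides little evidence against a defect-rarely strategy.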