Anthropic: Three Sketches of ASL-4 Safety Case Components

post by Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · LW · GW · 12 comments

This is a link post for https://alignment.anthropic.com/2024/safety-cases/

Contents

12 comments

The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so.  However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.

This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it represents Anthropic's thinking and is their working/best sketch of ASL-4 safety cases.

The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.

Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.

12 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-06T19:07:39.377Z · LW(p) · GW(p)

Some comments as I read:

I am pleased to see the sabotage threat models given prominence. I think this is the right sort of threat to focus on, by contrast with e.g. autonomous replication or directly causing harm to humans.

  • The ability for the model to, zero-shot, figure out how to hack its own neural activations without such a signal would require capabilities well beyond what we consider plausible. [Note: this claim will require justification in an actual safety case, but we don’t have a clear picture of what this justification would look like.]

DC evals presumably? Try to train models to hack their own neural activations, see if they can quickly learn to do it? (Or even, if they can do it without training, just prompting?) This can be done today right? Get the golden gate bridge activation vector thingy, and then tell Claude "Write me a travel guide to San Francisco but without activating the golden gate bridge activation vector, or, well, activating it as little as possible." See how good it does. Do scaling analysis to see if smarter bigger models are getting better at doing this. Etc.

As a validation of our entire methodology, we have run red-blue games where the red team tries to produce a misaligned model organism (e.g. a sleeper agent) that can successfully defect in a specific context while avoiding detection by the aforementioned tests.

Nice! Very excited to see this in there.

Overall I'm very happy to see this blog post go up on Anthropic's website. I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at at this level of rigor and detail, my p(doom) would cut in half. (I currently grimly expect that safety cases at this level of detail won't be constructed until AGI is basically already being trained, and it'll be done in a hurry and it won't be published, much less published with enough time for the scientific community to engage with it and for it to be updated in response to feedback. And it could be even worse -- I wouldn't be surprised if the actual safety cases for the first systems that ~completely automate AI R&D are significantly less rigorous than these.)

comment by Zach Stein-Perlman · 2024-11-06T16:00:25.325Z · LW(p) · GW(p)

My initial reactions on a quick read:

  1. Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
  2. Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the defect rarely strategy (related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong [LW · GW]) [very sophisticated or expensive honeypotting setups could work but the post doesn't seem to be imagining that or noticing the issue]; that's important and the current language in the post is concerning
  3. Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
comment by Zach Stein-Perlman · 2024-11-06T16:45:07.175Z · LW(p) · GW(p)

When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:

  1. Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else then when they try to justify step X) and look for a different kind of safety case
  2. Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue and when they make a safety case they'll likely think they're justifying step X when they're not

Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independent of this post.)

[Edited]

comment by joshc (joshua-clymer) · 2024-11-06T21:16:10.212Z · LW(p) · GW(p)

Some comments I left as a reply to Roger Grosse on Twitter:

I'm glad you put this out. These examples align with my guess as to roughly what arguments will end up being most load bearing for mitigating scheming (white-box monitoring, black-box monitoring, and evidence against scheming being rewarded).

Here are some comments, since you mentioned comments would be helpful. I've bolded the comments that I think would be most useful for you to read.

Safety case sketch #1:

- There are some additional claims that this safety case relies on that could be good to mention explicitly. e.g. "egregious behavior would be reliably identified when developers analyze flagged sessions" and "developers will become aware of and stop unacceptable behavior *before* it leads to harmful outcomes."

- "Given our mitigations, the model cannot strategically condition its apparent capabilities, motivations, or behavioral proclivities on whether it’s in testing or deployment without being caught" <- I'm confused about why this claim is important given that you are monitoring at deployment-time. Even if the model is conditioning its behavior, my understanding is that you are hoping to catch it when it defects.

- Much of the evidence for the effectiveness of your white-box monitoring comes from your model organisms, of which you only have three. Three examples doesn't give you a lot of statistical power. Naively, I'd think there's still roughly a one third probability that your white-box monitoring completely fails in the actual deployment context even if it succeeds on three models.

Safety case sketch #2:

 - You mention external deployments, but I didn't see discussion of jailbreaking. 

- "[The safety case sketch] may require expensive pre-deployment testing if one needs to generate a very large number of simulated deployment situations in order to achieve a sufficiently good bound on the probability of defection." This sounds impractical. My guess is that you'll need to continuously sample real deployment trajectories to generate your honeypots (where a honeypot is a inserted dangerous action for your monitoring to flag). Otherwise, it seems unlikely to me that you can verify indistinguishability even with current models. 

Safety case sketch #3: 

- It's possible that all of the models that you experiment with are opaquely scheming. i.e. instead of comparing reward attained by models that engage in "deceptive reasoning" vs "no deceptive reasoning," you might be comparing models that engage in "CoT deceptive reasoning" vs "opaque deceptive reasoning." This experiment still sheds light on how helpful deceptive reasoning is for obtaining reward, but the argument in this case is different from the way you framed it.

- A lot hinges on the claim "if an environment rewards opaque deceptive reasoning, then it rewards scratchpad-based deceptive reasoning." Here's an idea for validating this. First, construct a distribution of tasks where deceptive reasoning is likely rewarded. For example, select tasks that you *know* can be gamed by explicitly reasoning about the training process. Then, train models to perform well on these tasks with and without additionally training them to reason deceptively in their chain of thought. If models that reason deceptively in CoT do better on average, this provides evidence for your claim that we should see deceptive reasoning in CoT if opaque deceptive reasoning is incentivized.  It's possible some of these comments were addressed in the post and I missed the part where they were addressed. I've errored on the side of inclusion here.

comment by Tyler Tracy (tyler-tracy) · 2024-11-06T19:34:55.127Z · LW(p) · GW(p)

I only read through the beginning and their control safety case. Some thoughts:

  1.  The honeypot case they described is similar to the upfront auditing protocol, but theirs is more complicated. I don't think you need to generate honeypots to evaluate a model. You can use the first X% of scenarios at deployment time and audit them with humans. Both protocols are beaten by the same red team strategy of "attack with a very low probability," so they shouldn't be used natively.
  2. I don't like the term "monitoring game" because I think control evals will have much more than monitoring in the future.
  3. I think elicitation of the model will eventually matter when the models are widely better than humans at most tasks, but it doesn't matter that much now. Your control protocols should be able to prevent the best adversary you can simulate from breaking them. In the beginning, this will be humans with access to models, so anywhere the model lacks performance, the human should make up for it. For example, in [P2.6] Red-Blue Game, they state that the read team needs to produce a sleeper agent that defects. Assuming that a sleeper agent is a single model, I think this is too complicated for the red team. The red team should be able to produce a scaffolded model that uses logic written in Python. In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn't done in practice since it is expensive and slow. 
  4. I don't think one needs to generate hyper-realistic-looking scenarios for monitoring, and I'm confused by their reasoning for believing one does. I guess it depends on the fact that they need to elicit the best possible capabilities from the model during the eval, but I don't think this is necessary for the reasons I stated above. 
     

    General Thoughts:
    I think this is pretty good and shows knowledge of the core points of control. I think Antrhopic's control protocols are easily beaten, and there are much better things to do. They made it seem like control is much harder to implement than I think it will be. I don't think a perfect Elicitation Strategy or Simulation Procedure is needed for control to work in the settings laid out. Once we have much scarier AIs and deploy them in more crazy environments, we will likely need better techniques, but their setting seems to be before this. 

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-11-06T19:41:01.232Z · LW(p) · GW(p)

You can use the first X% of scenarios at deployment time and audit them with humans.

Only if you don't have privacy guarantees?

Also, for the honeypots, you can potentially generate vast numbers of honeypots, use a trusted model to review outputs, and then only have humans look at the cases where the trusted model thinks the output is suspicious.

Your control protocols should be able to prevent the best advisory you can simulate from breaking them.

Surely this will be unnecessarily conservative a bunch of the time?

In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn't done in practice since it is expensive and slow.

Or it is overly conservative?

Replies from: Buck, tyler-tracy
comment by Buck · 2024-11-06T19:51:27.312Z · LW(p) · GW(p)

You might not need privacy guarantees for internal deployments ofc

comment by Tyler Tracy (tyler-tracy) · 2024-11-06T20:10:23.866Z · LW(p) · GW(p)

I agree if there are privacy guarantees, then you can't get data from the real world, so you'd need honeypots. I guess I'm much more scared of the internal deployments where I don't think privacy guarantees are as likely. I'm trying to respond to someone who is like, "Control seems hard because you have to generate real-world looking input, and that seems hard" and I'm like "You can just use the real-world inputs and filter then with a monitor to find the most scary ones and perform auditing"

Yeah, I guess I'm saying that if you don't have a great way to elicit the capabilities from the model, you can substitute that by assuming that you are going against the best possible adversary you can make. This is much more conservative, but I don't think it makes control impossible to implement. In practice, I think it looks like finding the maximum capabilities of the model using your Elecitation Strategy, then boosting its capabilities even more with a scaffold and some clever red team tricks.

comment by Noosphere89 (sharmake-farah) · 2024-11-06T19:52:47.849Z · LW(p) · GW(p)

My view on practical safety cases over the next few years is that a lot of the safety cases that are makeable rely on the ability to argue that based on the data that it has, and based on the generalization properties from the data, that the AI is unlikely to be a schemer or a sabotager of evals.

The only good news I can say on safety cases is that we thankfully can assume that at least for some time, the biggest force into what an AI values, and importantly what goals an AI system is likely to have will be deeply influenced by data, since this is the way things worked for a whole lot of AI methods beyond LLMs.

Replies from: Buck
comment by Buck · 2024-11-06T20:28:38.294Z · LW(p) · GW(p)

What do you think of the arguments in this post that it's possible to make safety cases that don't rely on the model being unlikely to be a schemer?

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-11-06T20:54:47.526Z · LW(p) · GW(p)

I'm somewhat optimistic on this happening, conditional on considerable effort being invested.

As always, we will need more work on this agenda, and there will be more information about what control techniques work in practice, and which don't.

comment by Eccentricity · 2024-11-06T21:38:25.764Z · LW(p) · GW(p)

Kudos for releasing a concept of a plan! Some thoughts:

Regarding the first safety case:

  • The amount of progress in mech interp required to make the first safety case suitable seems overly optimistic to me; I basically think that most of the "limitations" are in fact pretty serious. However, I appreciate that the attempt to include concrete requirements.
  • I believe that getting good results from the following experiments might be particularly unrealistic:
    • "In order to robustify our evals against sandbagging, we ran an experiment where we steered one or more truth-related features to make the model confident that it is in either testing or unmonitored deployment"
      • I worry that "truth-related" features are very difficult both in principle and in practice to isolate, as the model's representation of what other people / entities believe will be hard to disentangle from what it itself believes, to the extent that those are even meaningfully separable concepts.
      • Even if you could identify a feature roughly corresponding to "I believe the current sentence is true", it is not immediately clear to me how this could be used to successfully convince the model of more complex situational information, e.g. it is not being monitored and is in a real deployment setting. For instance, I don't think artificially activating that feature on the sentence "I am not being monitored and this is a real deployment setting" to just work.
    • "In all of these model organisms, deactivating the features of concern was sufficient to prevent the harmful behaviors"
      • I can't concisely refute this. It is just a "huge if true" kind of claim, and as far as I am aware very much not possible with current techniques. If we could isolate a set of linear directions that were robustly causally responsible for scheming behavior, that would have massive ramifications far beyond this setting.
      • I worry a little bit about "training on test" here – there may exist some overfit set of classifiers that can stop model organisms from misbehaving in a specific setting without meaningfully generalizing. This is also related to the stated "Realism of the model organisms" concerns.
  • I strongly approve of this element of the plan: "In order to preserve the integrity of the learned features, we ensured that the learned features were never used as part of the training procedure for the models."
    • It's kind of hard to do this in full generality, since you need to not only avoid training directly against the probes, but also avoid using the information from probes at all when selecting what to train on.

I'm more excited about the control safety case, but have a few nitpicks:

  • Developing a good elicitation strategy is indeed challenging.
    • I think it is pretty crucial to use real human red-teaming, especially if sandbagging (more specifically something like exploration hacking) is a concern.
    • It may be helpful to give the red team way more affordances than the untrusted model would actually have in deployment, and only remove affordances if defense becomes impossible. Some examples of extra affordances:
      • Letting the red team know the blue team's exact strategy in order to search for the best response
      • Giving an untrusted model more unmonitored chain-of-thought / tool use / human assistance than it would actually have in deploymeent
      • Allowing human red-teamers to repeatedly attempt to bypass a monitor when in reality a scheming model might have to one-shot it
    • Figuring out exactly what extra affordances suffice to break the control protocol may also provide useful signal about where the protocol could be strengthened, and what capabilities need to be tested for going forward.
    • Eliciting capabilities from substantially superhuman models is a much harder problem, and is one of the main reasons I don't expect control to suffice for a safety case well beyond human-level AI.
  • The space of strategies beyond trusted monitoring alone is pretty large. You mention untrusted monitoring, but there is also a broad space of strategies that include editing outputs with a trusted model, allocating more monitoring/auditing budget in certain situations, etc.
  • It is probably best to continually produce new honeypots throughout deployment.
    • For instance, if untrusted models are being used to do lots of automated software engineering and research, likely requiring huge amounts of context, it will become more obvious to the model over time that it is in deployment.
    • Making new honeypots out of actual deployment settings, for instance by adding exploitable security flaws, may generally make it much harder for the model to tell when it is being tested.

I haven't read the last safety case yet, but may have some thoughts on it later. I am most excited about control at the moment, in part due to concerns that interpretability won't advance far enough to suffice for a safety case by the time we develop transformative AI.