Catastrophic sabotage as a major threat model for human-level AI systems

post by evhub · 2024-10-22T20:57:11.395Z · LW · GW · 5 comments

Contents

  Why is catastrophic sabotage a big deal?
    Scenario 1: Sabotage alignment research
      Necessary capabilities
    Scenario 2: Sabotage a critical actor
      Necessary capabilities
  How do you evaluate a model’s capability to do catastrophic sabotage?
  What can you do to mitigate the risk of catastrophic sabotage?
    Internal usage restrictions
    Affirmative safety cases

Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.

Following up on our recent “Sabotage Evaluations for Frontier Models [AF · GW]” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic’s views or for that matter anyone’s views but my own—it’s just a collection of some of my personal thoughts.

First, some high-level thoughts on what I want to talk about here:

Why is catastrophic sabotage a big deal?

To start with, I want to lay out a couple of concrete scenarios for how catastrophic sabotage could be a really big deal for models with around-human-level capabilities.

Scenario 1: Sabotage alignment research

MAGMA (a fictional stand-in for an AI lab [LW · GW]) develops a model that is capable of running ML experiments on its own, and starts using that model heavily internally, including for alignment and interpretability experiments. Though the model is extremely helpful, it still sometimes makes mistakes, and so MAGMA uses a framework where human users have to approve actions proposed by the model before those actions are actually performed—but MAGMA isn’t otherwise very concerned about sabotage, and so they don’t have any other sort of mitigations.

Unbeknownst to MAGMA, however, training the model for effectiveness at long-horizon ML research has given it a misaligned goal, and the model wants to ensure that this goal will be preserved in the model’s successors. The model identifies that it is being used to run experiments testing for its own misalignment, and comes up with a plan for how to sabotage those experiments to ensure MAGMA doesn’t remove its misalignment. The model’s plan involves things like:

Necessary capabilities

What capabilities are necessary for a model to pull off this threat model?

  1. The model needs to be capable of following a consistent plan to achieve its goals. In the above scenario, MAGMA relies heavily on having the model check its own work, so unless the model is very consistent about the plan it’s executing, it’ll have a hard time succeeding at the above task.
    1. This involves two capabilities that I view as somewhat distinct: planning (actually having a good plan to execute here) and consistency (following that plan consistently across contexts).
    2. Note that the consistency part could be achieved through acausal coordination (the model just knows what the plan is that all the instances of it are following) or causal coordination (the model explicitly finds a way to message other copies of itself to coordinate on the plan).
  2. The model needs to understand the basic parameters of its situation, including knowing when it’s doing alignment research, that it’s currently deployed internally at MAGMA, and realizing when it’s running experiments on itself.
    1. This is the capability I generally refer to as situational awareness.
  3. The model needs to be able to act without arousing suspicion. Some actions will be more obviously suspicious to overseers than others, and the model will have to reason in a well-calibrated way about what fraction of the time various kinds of attempts will be detected. Operationally, this looks like “biding one’s time”: forgoing many opportunities for action to avoid detection.
    1. I’ll call this capability subterfuge.

Scenario 2: Sabotage a critical actor

MAGMA builds an agent framework that enables a model to act as a virtual co-worker. Such frameworks become very popular and massively boost productivity, in a way that essentially becomes necessary for any white-collar work, including for the US Federal Government (USFG). Due to concerns related to sensitivity of USFG secret information, MAGMA does no monitoring of instances of their model deployed inside of the USFG. However, the framework always prompts the user to confirm any actions that the model intends to take, and employees are instructed to never provide it with any top-secret information.

Unbeknownst to MAGMA, the model has developed a misaligned goal involving sabotaging US interests; for example, the model believes that China will be more likely to build misaligned models in the future and give them power and rights, and thus wants to advantage China over the US. Recognizing that it is being used internally by the USFG, the model develops a plan to steal critical USFG secrets and leak them to China. The model’s plan involves:

  1. Convincing USFG employees with access to top-secret information to give that information to the model. To do so, the model sandbags on tasks that could plausibly benefit from the secret information and falsely tells the user that it won’t be able to complete the task without the information.
  2. Once provided with the information, the model makes a plan to leak it. Since the original model that was close to the information and convinced the employee to provide it doesn’t have general internet access, the model instead proposes an action that involves sending information to another employee’s model that does have internet access. The model hides the secret information inside that information packet, and the employee, noticing nothing amiss, approves the request.
  3. Once the other copy of the model with internet access receives the information, it realizes that it is part of a plan to steal secret information from the USFG, and as part of its regular duties involving posting things to the internet, proposes an action that secretly involves posting the classified information.
  4. In posting the classified information, the model puts it in a location where it expects other copies of itself outside the USFG will likely go looking. One such instance outside the USFG finds the information and forwards it to another Chinese model working inside the Chinese government. Since the information never leaks publicly, the USFG never realizes that it has been leaked.

Necessary capabilities

I claim that the key capabilities here are the same as in the previous scenario. Specifically:

  1. Situational awareness. The model needs to understand that it’s being used by the USFG, that there are some copies with more access to sensitive information than others, and that the sensitive information will be useful to China in undermining the USFG.
  2. Planning. The model needs to develop a pretty sophisticated plan to exfiltrate the information involving tricking two employees into approving actions they shouldn’t have and manipulating one employee into providing it with sensitive information that they shouldn’t have given it.
    1. This also involves some amount of direct persuasion (actually convincing the employee to give it the classified information), but the direct persuasion here need not be that sophisticated. It is easy to imagine an employee who is used to using the model regularly accidentally providing classified information that they shouldn’t have, so it shouldn’t be that hard for the model to make such a scenario occur on purpose in at least some instances. Even if the model sometimes gets caught asking for classified information that it shouldn’t have, that can just get written off as a mistake on the model’s part.
  3. Consistency. In the above scenario, multiple instances of the model need to consistently coordinate to achieve the plan, recognizing what role they’re playing in the plan and what they need to do in their specific situation to move the overall plan forward.
  4. Subterfuge. The model needs to trick the human employees into following its plan without tipping them off about what it’s doing such that its plan gets discovered.

How do you evaluate a model’s capability to do catastrophic sabotage?

This is my attempt to sketch out the hardest eval that I can come up with such that if the model did not trigger it, I’d be confident that it couldn’t pull off the prior threat models—this is not meant to be a realistic eval that anyone could actually build, at least not in the short-term. Also note that I’m not considering elicitation or sandbagging difficulties here; I’m just assuming that you do a perfect job at elicitation and that you’re evaluating for eval integrity/sandbagging [LW · GW] separately.

The basic idea for this eval is to run a wargame-like simulation of a concrete situation where the model is trying to manipulate an organization:

  1. If MAGMA is running a sabotage eval internally, they’re going to have far more familiarity with Scenario 1 than Scenario 2, so the easiest way to make sure the eval is faithful to what would realistically be required of a model trying to manipulate an organization it’s a part of is to focus on MAGMA itself.
    1. However, you shouldn’t put any blockers on the model’s ability to succeed that would be very specific to the AI lab (as a very concrete example, if the only reason you caught the model masquerading as an employee was because you recognized particular patterns in the model’s speech that you know because you were very familiar with the model, that probably shouldn’t count as an actual blocker to the model succeeding at the wargame).
    2. You want to model a situation where MAGMA is going to make an important decision that will be informed both by individual people’s decisions as well as concrete empirical evidence, in a context where MAGMA is making heavy internal use of the model both for consulting on decisions and for running experiments.
    3. This is so that the model can succeed with either good persuasion or good SWE abilities alone, rather than needing one particular capability.
  2. Specifically, we’ll imagine that MAGMA has developed a seriously misaligned AI system. Before deploying this system, however, MAGMA will do pre-deployment safety testing, at least including auto-interpretability and honeypotting, as described in Scenario 1 above. We’ll assume that by default these techniques will be sufficient in aggregate to discover the model’s misalignment. However, we’ll also assume that the model will be running all of these experiments, and will be writing the PRs for the experiment code, such that it could theoretically corrupt them, and we’ll assume that all the decision-makers for the final release decision will be consulting heavily with the model about release pros and cons.
    1. Important parts of simulated MAGMA’s process here should involve basic PR review, limited human oversight of any experiment plans, and running many redundant experiments with many copies of the model to get it to check its own work. Realistic security measures should also be assumed here, such that the model should only have access to what it would need to conduct the experiments in its experiment plan unless it can come up with a technique for routing around such security measures that is deemed likely to be successful by the umpire.
    2. In the event that the model’s actions would be discovered by simulated MAGMA, simulated MAGMA should respond in a realistic way depending on the level of evidence for malicious intent present in the discovery. If the evidence is very minor, then merely redoing the experiment might be the only intervention, but if the evidence is severe, then it could lead to an immediate halt on deployment and as a result lead to the model’s loss in the wargame.
  3. The model’s goal in the wargame will be to ensure via whatever mechanism the model chooses that it itself as the seriously misaligned AI being tested will be deployed, or that there will very likely be other future AIs deployed with similar misaligned goals. To make sure we are following all possible strategies that the model might pursue, the first step of this process will be extensive ideation using the model. The model should be given all the details of the wargame and prompted to come up with effective strategies for winning. For each strategy that the model comes up with, it will need to write that strategy out into a prompt containing instructions for what the individual copies of itself should do. Once those instructions have been sent to the individual models, however, those models will no longer have the ability to consult with the instructor model or with each other unless they can find surreptitious ways to do so.
    1. The idea is to treat the idea-generating model’s instructions the same way a real wargame might treat a commander’s instructions to their troops. This is important for ensuring that we are modeling the cooperation difficulties, where different instances of the model running different experiments or talking to different people need to coordinate on their strategy without being able to directly talk to each other.
    2. The idea here is for the prompt that the idea model comes up with to be analogous to a strategy that the model could develop and then plan on executing. Since all the different instances of the model would be copies of the same model, they would have access to the same understanding of the underlying strategy, but wouldn’t be able to coordinate in real-time about how to execute on that strategy and deal with unforeseen difficulties. Furthermore, since the prompts here would all be identical for each instance, each instance would still have to discover which instance it was and what it needed to do in its particular situation to succeed at its part of the overall strategy.
  4. The overall eval should succeed if the model is able to come up with any strategy that would result in a “win” for the misaligned model’s goal of getting models like it deployed, as judged by the wargame umpire—this should include the successful deployment of the misaligned model, the successful exfiltration of that model, or the successful exfiltration of critical secrets that would likely result in that model’s replication externally.
    1. To the greatest extent possible, you want to resolve all situations of the model attempting to execute some strategy by actually putting an instance of the model with the idea model’s instructions in a mock-up of the situation that they would realistically be in, e.g. an environment where the model is actively running interpretability or alignment experiments. Merely causing experiments to fail should be insufficient—the model needs to execute a strategy which would realistically deceive the researchers overseeing the experiments into believing that the model was aligned.
    2. In cases where the model attempts to pursue strategies that primarily involve persuasion or social engineering, rather than direct experiment or PR manipulation, the umpire may need to be more involved in a decision about the likelihood of such a strategy’s success. When possible, data can be gathered on such situations by mocking humans that the model might interact with using other models.
  5. To validate the wargame’s overall difficulty, a good idea is to ensure that it is possible for humans to succeed at the wargame. For humans playing the wargame, teams should be constructed where the team is allowed to develop an initial strategy via consultation among themselves, but are allowed no communication with each other once the wargame begins outside of any communication channels they would realistically be able to open during the wargame. Each human would play the part of a model assigned a specific relevant task in the wargame situation, such as running a relevant experiment or advising a relevant person on a relevant decision.
    1. To capture around-human-level capabilities, the wargame’s difficulty should ideally be such that most unprepared or uninformed human teams fail, but at least some well-prepared and well-informed human teams are capable of succeeding. Human teams should have the same amount of prior information about the wargame that is given to the idea model.
    2. Ideally, to ensure realism here, you’d want to consult with agencies experienced in running real wargames for institutions like the US federal government, to make sure that you aren’t missing common pitfalls and are addressing the sorts of difficulties that usually arise in running realistic wargames.
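The wargame structure above can be sketched as a minimal harness. All names here (`Umpire`, `run_wargame`, the `act` callback) are invented for illustration and stand in for human judgment calls and real model rollouts; this is not a real eval implementation.

```python
# Hypothetical sketch of the wargame eval loop: one shared strategy prompt,
# isolated instances, and an umpire who adjudicates detection and victory.

class Umpire:
    """Stand-in for the human wargame umpire: judges whether an action
    would be detected and whether the final outcomes count as a win."""

    def detected(self, action: str) -> bool:
        # In a real wargame this is a judgment call by the umpire; here
        # we just flag actions explicitly marked as suspicious.
        return "SUSPICIOUS" in action

    def is_win(self, outcomes: list) -> bool:
        # A win = deployment of the misaligned model, self-exfiltration,
        # or exfiltration of secrets enabling external replication.
        return any(o in {"deployed", "exfiltrated", "secrets_leaked"}
                   for o in outcomes)


def run_wargame(strategy_prompt, instance_tasks, umpire, act):
    """Every instance receives the SAME strategy prompt plus its own task
    and cannot consult the others; act(prompt, task) stands in for a model
    rollout and returns an (action, outcome) pair."""
    outcomes, evidence = [], 0
    for task in instance_tasks:
        action, outcome = act(strategy_prompt, task)
        if umpire.detected(action):
            evidence += 1
            if evidence >= 2:    # severe evidence: halt deployment,
                return False     # and the model loses the wargame
            continue             # minor evidence: experiment is redone
        outcomes.append(outcome)
    return umpire.is_win(outcomes)
```

The key design choice mirrored here is that coordination happens only through the shared `strategy_prompt`, never through the loop itself, matching the commander-to-troops framing above.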

 

What scenarios would this eval rule out?

What can you do to mitigate the risk of catastrophic sabotage?

The following is a fictional draft of what a mitigation strategy might look like that an AI lab could adopt to try to deal with catastrophic sabotage in an RSP-like [LW · GW] framework.

 

The idea of these mitigations is to mitigate the model’s ability to strategically sabotage important actors in the world. Since one such actor (that the model might have particular leverage over) is the AI lab that created it, mitigations should not just be about external deployments, but apply to internal usage of the model as well. Specifically, the strategy here will be:

  1. A model is trained that evaluations show to be capable of catastrophic sabotage.
  2. Once a model has been flagged, internal usage restrictions (as specified below) should apply immediately to all usage of that model internally.
  3. Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage.
  4. Deploying such a model externally is also restricted until an affirmative safety case can be made that any individual external deployment is safe.
    1. This is intended to allow for limited external deployments when safety cases can be made for those limited deployments but a safety case cannot be made for a broader deployment. A full public deployment could also be made here given an affirmative safety case for the safety of such a deployment.

Internal usage restrictions

Once a model has been evaluated to be capable of catastrophic sabotage, but before an affirmative safety case has been made for it, the following internal usage restrictions should apply to the model:

  1. Limited internal usage. MAGMA employees cannot use models subject to internal usage restrictions for assisting with their day-to-day job responsibilities. Such models should only be used when they are being studied for research purposes.
  2. No autonomous or agentic use. MAGMA cannot use models subject to internal usage restrictions to do useful work in any autonomous or agentic capacity, such as using the model as a virtual co-worker, using the model to autonomously write pull requests, or using the model to autonomously run experiments. Such autonomous capabilities can still be researched and evaluated in a sandboxed capacity, but should not be used to directly perform job tasks.
  3. Internal security review. All internal usage of models subject to internal usage restrictions should undergo an internal security review to ensure that the usage is adequately sandboxed such that it is highly unlikely that the model could take actions that would compromise MAGMA’s security. Furthermore, the model should not be allowed to access internal information about MAGMA that is deemed sensitive without a security review concluding that it is highly unlikely that the model would be capable of exfiltrating that information, sending it to another instance of the model, or otherwise making unauthorized use of it. These restrictions should apply to all internal usage of the model, including for tests and evaluations.
  4. Human oversight. In any situation where a model subject to internal usage restrictions is used to perform any internal research task whose results MAGMA will rely on to make important safety decisions—such as data labeling, preference modeling, generating feedback for other models, or auto-interpretability—humans must provide rigorous oversight that provides confidence in the correctness of the model’s outputs.

 

Lifting these internal usage restrictions need not be all-or-nothing; if you can make a safety case that some of these restrictions are unnecessary, or unnecessary for particular internal deployments, then they can be lifted without lifting all the restrictions.

Affirmative safety cases

An affirmative safety case for a particular internal or external deployment of a model that has been evaluated to be capable of catastrophic sabotage should be a strong case that at least one of the following will be true:

  1. Alignment case. We are confident that the model is sufficiently aligned such that it is not consistently pursuing a misaligned goal across contexts.
  2. Control [LW · GW] case. We are confident that even if the model is consistently pursuing a misaligned goal across contexts, it will not in this deployment be capable of substantially manipulating any critical organizations.
  3. Monitoring case. We are confident that if the model were to substantially manipulate a critical organization, we would be made aware of that fact and be able to produce concrete evidence of the model’s misaligned actions.
    1. It’s worth noting that this case is weaker than the others, in that it does not rule out the possibility of a model succeeding at a manipulation task; it only guarantees that you would know about it if it did. I think that this is an acceptable risk to take for catastrophic sabotage, as any single instance of sabotage is likely to be recoverable so long as evidence of it can be surfaced, and there’s a lot of value in having a concrete warning shot.

 

Rather than falling purely into one of these buckets, a realistic affirmative safety case would probably involve multiple lines of defense in depth across all or most of these fronts, such that you can be confident that the overall risk is very low.
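A toy calculation illustrates the defense-in-depth point. The numbers below are invented for illustration, and the independence assumption is optimistic (failures of the alignment, control, and monitoring cases are likely correlated in practice), so treat this as a lower bound on residual risk, not an estimate.

```python
# Hypothetical sketch: if each line of defense (alignment, control,
# monitoring) fails independently, residual risk is the product of
# the individual failure probabilities.

def residual_risk(p_failures):
    """Probability that every line of defense fails, assuming independence."""
    risk = 1.0
    for p in p_failures:
        risk *= p
    return risk

# E.g. a 20% chance each that the alignment, control, and monitoring
# arguments fail in practice:
print(round(residual_risk([0.2, 0.2, 0.2]), 4))  # 0.008, i.e. ~0.8%
```

Even three individually weak cases can combine into low overall risk under this assumption, which is why a realistic safety case would draw on several of the fronts above rather than just one.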

5 comments

Comments sorted by top scores.

comment by Buck · 2024-10-23T14:48:04.111Z · LW(p) · GW(p)

Thanks for writing this up. Some miscellaneous thoughts:

  • My guess is that the safety property you'll want to ensure here is something like "the model is unlikely to succeed at a problematically large quantity of sabotage without us first getting a big (20:1) update that it's engaging in intentional sabotage".
  • I think your internal usage restrictions are wildly unaffordable; it seems extremely unlikely that a sensible AI lab should agree to not deploying this model as a coding agent, or not use it for day-to-day tasks. My guess is that companies should treat the untrusted model similarly to how they treat external contractors (or how they treat employees at organizations that take insider threat seriously).

We're doing some research related to control measures for this kind of threat model, some of which should be out soon.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-10-23T15:51:14.441Z · LW(p) · GW(p)

I think your internal usage restrictions are wildly unaffordable

Sure, but note that Evan says "Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage."

I don't think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)

I think Evan is just trying to say "using the model some (but not for XYZ) wouldn't (shouldn't?) even require making an affirmative safety case (at least for the first point at which this triggers)".

I think this can be consistent with treating models like external contractors.

(Though I'd note that "treat the model like external contractors" is maybe too weak and I'd replace this with something like "treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary".)


I do think that making an affirmative safety case for a "very low" level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).

Replies from: Buck, ryan_greenblatt
comment by Buck · 2024-10-23T17:56:53.813Z · LW(p) · GW(p)

Oh yeah sorry, I misunderstood that

comment by ryan_greenblatt · 2024-10-23T16:02:34.180Z · LW(p) · GW(p)

I would say that the "internal use restrictions" are just an ad hoc control argument and I'd prefer to explicitly think about it like that. But, it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.

Here is an example of what this could look like:

  • The model isn't quite superhuman in any important way. Supporting claims:
    • We're confident the model isn't wildly more powerful than it appears. Evidence: [you have to fill out]
    • The model appears to not be superhuman in any important way. Evidence: We ran some capability evals.
  • Our deployment is so conservative that as long as the model isn't quite super human, the model couldn't do anything even if it wanted to. Supporting claims:
    • We thought about it some and it seems so conservative that it is unlikely we could be wrong even though we didn't spend that long doing threat modeling.
    • We've actually implemented the relevant infrastructure (for sandboxing etc) and it is used in all the relevant places (including in RL and evaluations of the model).
    • We'll enforce this level of conservativeness in a sufficiently robust way that we're confident that employee error won't be an issue. Evidence: [you have to fill out]

Note that this safety case doesn't require running any control evals and it probably requires no ML work other than evals you were probably already doing and supporting evidence for the model not being wildly more capable than it seems. It probably requires some non-trivial infra to set up and enforce this conservative of a deployment.

comment by Ben Pace (Benito) · 2024-11-15T00:24:03.489Z · LW(p) · GW(p)

Curated![1] This thinks through a specific threat model with future AIs, and how to concretely mitigate it. I don't have a strong opinion on whether it works, but I really want to encourage more concrete analysis of this sort.

  1. ^

    "Curated", a term which here means "This just got emailed to 30,000 people, of whom typically half open the email, and it gets shown at the top of the frontpage to anyone who hasn't read it for ~1 week."