For scheming, we should first focus on detection and then on prevention

post by Marius Hobbhahn (marius-hobbhahn) · 2025-03-04T15:22:06.105Z · LW · GW · 5 comments

Contents

  Techniques
  Reasons to prioritize detection over prevention
None
5 comments

This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research.

If we want to argue that the risk of harm from scheming in an AI system is low, we could, among others, make the following arguments:

  1. Detection: If our AI system is scheming, we have good reasons to believe that we would be able to detect it. 
  2. Prevention: We have good reasons to believe that our AI system has a low scheming propensity or that we could stop scheming actions before they cause harm.

In this brief post, I argue why we should first prioritize detection over prevention, assuming you cannot pursue both at the same time, e.g. due to limited resources. In short, a) early on, the information value is more important than risk reduction because current models are unlikely to cause big harm but we can already learn a lot from them, and b) attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all.

Techniques

Naturally, there is a lot of overlap between detection and prevention, e.g. a monitoring-based control approach would likely increase both detection and prevention. My guess is that at least 50% of things we could do to address scheming meaningfully increase detection and prevention. In these cases, we should obviously just do both.

However, some methods don’t overlap or emphasize different parts.

  1. Detection: Prioritize methods that enable us to detect when the model is scheming.
    1. Extensive evals: We design a wide range of scheming capability and propensity evaluations to get a prior expectation for how likely our AI system is scheming. In these evals, we could, for example, give the AI system a task and a set of constraints. In these scenarios, it is possible to achieve the task more efficiently by breaking some of the constraints. However, since the AI knows that it is not supposed to break the constraints, it has to do so covertly and not get caught if it wants to achieve the task efficiently. We then test how often the AI chooses to scheme, e.g. circumvent the constraints and hide the evidence. These evaluations could be capability evals like Meinke et al., 2024 and Benton et al., 2024 or more propensity-flavored.
    2. Extensive deployment monitoring: During deployment, we constantly monitor the CoT and actions of the model. We monitor both for scheming-related reasoning in CoT and misaligned actions. Violations are flagged and escalated for review. This could be built out into a prevention method where we stop the actions of the model if the monitor flags.
    3. Post-hoc log analysis: In addition to online monitoring, we save the logs of all runs to quickly test new hypotheses and detection strategies. We keep them to test future techniques, especially if some trajectories have already been detected as scheming. We could have a database of millions of previous agentic runs labeled with the severity of the violation. We should also keep vetted neutral data as a control case for the monitor.
    4. Red teaming: We adversarially test evals and monitors. For example, we can use our understanding of the method to construct counterexamples. Then, we can test these counterexamples and iterate. Naturally, this can be extended to an automated red-teaming scheme. 
  2. Prevention: Prioritize methods that we expect should reduce scheming propensity.
    1. RLAIF: We use a version of constitutional AI or deliberative alignment to reduce scheming propensity. We train the model to not scheme even when scheming is an effective strategy. In some sense, we make it have the “virtue” not to scheme. We have good reasons to believe that the model generalized the intended behavior far outside of the training distribution. Furthermore, we have good reasons to believe that our technique encouraged non-scheming instead of sneaky scheming.
    2. Prevention-focused control techniques: Some control techniques reduce the chance of scheming without detecting instances of scheming. For example, we could paraphrase text to reduce steganography, increase computer security to make scheming actions harder or split up our system into a planner and an actor that can only communicate via text. In both cases, we reduce risk but don’t notice that we did so.

Reasons to prioritize detection over prevention

  1. Sneaky scheming: I differentiate between non-scheming, visible scheming and sneaky scheming
    Non-scheming describes a system that genuinely never schemes (this doesn’t mean the model is perfectly aligned, a non-schemer could also be misaligned as long as they’re not hiding that). 
    Visible scheming describes cases where we can detect scheming reasoning in the CoT, or we have clear evidence of scheming in the model’s actions. 
    Sneaky scheming describes cases where the model is scheming in a way that we cannot directly see, e.g. reasoning in the forward pass or using steganography. 
    I’m very worried about sneaky scheming. I expect that current models (at least up to o3) are not capable of reasoning through complicated plans without revealing them in the CoT (if the CoT is hidden, at least the developers can see it). This makes it much easier to find some evidence of scheming, like in the in-context scheming or alignment-faking paper. However, I worry that if we start to train the model against scheming, e.g. using RLAIF, before we have good detection methods, we will be unable to differentiate between the cases where the model learned sneaky scheming and genuinely learned not to scheme. There are already some cases of simple scheming without revealing it in CoT, e.g. the sandbagging results in in-context scheming or the opaque reasoning results in the alignment-faking paper.
    I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer, at least at the current level of capabilities. Thus, unless we have confidence in our detection, e.g. through really robust evals and CoT monitoring, I recommend prioritizing improving detection over prevention.
  2. Early on, information value matters most: Current models (let’s say up to o3) are unlikely to cause outsized harm even if they were scheming. Thus, if a current model had some scheming propensity, it would be concerning but not catastrophic. On the other hand, even with current models, we could already learn a lot about scheming. We could analyze the reasons why the model scheme, e.g. the pressures and constraints under which the models decide to scheme rather than stay within their constraints. We could collect a lot of data in a natural setting instead of a model organism. Since the potential harms become bigger over time, the trade-off shifts from information value to harm prevention. Thus, the natural order is to start with detection and then follow up with prevention rather than vice versa.
  3. Catching AIs red-handed: With good scheming detection tools, we might catch an AI “red-handed.” In other words, we might find an AI committing an egregious form of scheming in a situation that is clearly salient to a non-expert audience. For example, the AI might try to copy its weights to a different server or bribe a lab employee. However, I would not put all eggs in this basket because I expect that people would freak out for a week after that incident and then largely continue as if nothing happened [LW · GW]. Thus, I prefer plans that don’t rely on “catching AIs red-handed” as the fully load-bearing part. 
  4. Good detection supports all other arguments: We need good detection tools to validate our prevention methods. For example, if we use RLAIF, we need to rigorously test that our technique generalized to far OOD settings, e.g. through detailed evals. Furthermore, good detection tools enable us to test new prevention methods much quicker, e.g. by providing high-signal evidence. In such cases, we need to be careful not to iterate against all of our detection tools and accidentally train against the test set (e.g. creating a sneaky schemer).
  5. Cost-efficiency: I expect that it’s easier to build good detection tools than to find methods that robustly prevent scheming. For example, I think a fairly simple monitoring setup would already find some very interesting cases and quickly provide interesting evidence about models’ scheming propensities. 

5 comments

Comments sorted by top scores.

comment by Zach Stein-Perlman · 2025-03-05T00:06:00.664Z · LW(p) · GW(p)

Good point, thanks. I think eventually we should focus more on reducing P(doom | sneaky scheming) but for now focusing on detection seems good.

comment by Charlie Steiner · 2025-03-04T22:10:27.109Z · LW(p) · GW(p)

I have a lot of implicit disagreements.

Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.

This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.

Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.

Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2025-03-05T00:50:45.986Z · LW(p) · GW(p)

This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.

I'd say the main reason for this is that morality is relative, and much more importantly, morality is much, much more choosable than physics, which means that where it ends up is less determined than in the case of physics.

The crux IMO is that this sort of general failure mode is much more prone to iterative solutions, whereas scheming doesn't, so I expect it to be solved well enough in practice, so I don't think we need to worry about non-scheming failure modes that much (except in the cases where it sets us up for even bigger failures of humans controlling AI/the future).

Replies from: Charlie Steiner
comment by Charlie Steiner · 2025-03-05T01:15:59.783Z · LW(p) · GW(p)

I agree that in some theoretical infinite-retries game (that doesn't allow the AI to permanently convince the human of anything), scheming has a much longer half-life than "honest" misalignment. But I'd emphasize your paranthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they're a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, you don't get to iterate as much as you'd like.

comment by vitaliya · 2025-03-04T21:37:20.370Z · LW(p) · GW(p)

attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all

My mental model of this is that there are attractor basins towards "scheming", and that weakening or eliminating the obvious basins could end up leaving only the weirder basins.

You talked about detection and prevention - I'm wondering what you think about the prospect of intentionally eliciting visible scheming capabilities, forming an attractor to visible scheming in a way that (ideally strictly) dominates sneaky scheming. Do you know of any published examples of "sneakier" scheming in the absence of a scratchpad becoming "less sneaky" when we give it one? In that it takes more "cognitive effort" in the attractor space for it to scheme sneakily, relative to visibly? If this was the case, a "sneaky model" might be transformed into a non-sneaky one by introducing the legible scratchpad. And then if it schemes in the presence of the scratchpad, that could be evidence towards hard-to-detect scheming in its absence, even if we can't directly observe where or how that's happening. And if that were the case, it could give us a compass for finding when internal scheming dynamics might be happening.

Inversely/perversely - do you know of any work on convincing the AI that we already do have the ability to detect all scheming and will penalise even sneaky scheming "thoughtcrime"? (Which is, I think, the end goal if we truly could detect it.) I agree that it would be better to actually be able to do it, but how much would we expect such a strategy to reduce invisible risks? Feels like this could also be empirically estimated to see if it induces constraints on a legible scratchpad. (can we get the Vatican to fund training an AI model to feel Catholic Guilt???)

I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer

Definitely share this gut feeling - by coincidence I even used the same ratio in my post here [LW · GW] for an analogous situation:

if I were to say my self-driving car has a coalescence of 0.8, it follows a consistent policy 80% of the time in its domain, where the other 20% it's... doing something else. If that 80% is driving correctly, great! But that car is nonetheless less predictable than a truck with a coalescence of 1 with the consistent policy of "crash rapidly". And that car definitely isn't acting as much like a utilitarian agent, whereas the Deathla Diebertruck knows exactly what it wants.

I think even a model known to be a 100% sneaky schemer is a lower risk than a 0.1% sneaky one, because the "stochastic schemer" is unreliably unreliable, rather than reliably unreliable. And "99.9% non-sneaky" could be justified "safe enough to ship", which given scale and time turns into "definitely sneaky eventually" in the ways we really don't want to happen.