The Case for White Box Control

post by J Rosser (j-rosser-uk) · 2025-04-18T16:10:57.823Z · LW · GW · 0 comments

Contents

  What is AI Control?
  So, why am I *so* excited about white box control?
    The Case for White Box Control via Linear Probes
    The Case Against White Box Control via Linear Probes
  How does White Box Control fit into Redwood's Control Priorities?
  Outro

In this article, I explore the emerging field of white box control: a set of methods that could detect deception even when it doesn't appear in model outputs.[1]

What is AI Control?

Please feel free to skip this section if you're already up to speed!

As highlighted by Ben Pace [? · GW], there are two things you can do to prevent catastrophes caused by misaligned or scheming AI:

  1. Alignment: Ensure that your models aren't scheming.
  2. Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.

To justify that systems are safe for deployment, we usually form safety cases - structured, evidence-based arguments that a system is safe. In a similar manner, we can construct a "control safety case", a structured argument that models are incapable of subverting our control measures in order to cause unacceptable outcomes (more info here [LW · GW]).

As defined by Korbak et al. (2025), control measures are 'black box' if they:

  1. Don’t modify models, e.g. via training, fine-tuning, or patching, and
  2. Only rely on external model behaviour, e.g. model outputs such as CoT reasoning, tool calls, etc.

Some examples of black box control measures include (more here [LW · GW]):

Alternatively, some examples of white box control measures could include:

The eagle-eyed amongst you will notice that lots of these are mech interp techniques applied as control protocols.

So, why am I *so* excited about white box control?

[To be clear, I'm still super excited about black box control too! I'm pretty excited by all things control! Yay control!]

Historically, AI control has often involved black box monitoring of model outputs, e.g. looking for backdoors inserted into generated code (here) or analysing CoTs to see if the model admits to receiving a question hint (here). However, if the model output is not a faithful reflection of its internal reasoning and beliefs, we cannot depend on this form of monitoring to detect misaligned behaviours. This was highlighted by Anthropic's recent paper, which concluded that reasoning models don't always say what they think (oh no!).

As long as black box control still works, maybe we don't always need to care what models are thinking, but I think there's a strong case for looking inside anyway! In an ideal world, if a model outputs a sandbagged response but we can recover the non-sandbagged response from its activations, we can output that instead! Additionally, if sandbagging were represented by a simple linear direction, activation patching could be used to force the model to output the non-sandbagged response. (Generalization still has a big question mark over it, though!)
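
To make that concrete, here's a minimal sketch of the linear-probe idea, assuming you've already cached residual-stream activations at a single layer and have labels for which completions were sandbagged. The file names, layer choice, and data are hypothetical; this is an illustration under those assumptions, not a tested recipe.

```python
# Minimal sketch: train a linear probe to flag sandbagging from cached activations.
# Assumes acts.npy holds residual-stream activations at one layer, shape (n_samples, d_model),
# and labels.npy holds 1 = sandbagged, 0 = honest. Both files are hypothetical artefacts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

acts = np.load("acts.npy")      # (n_samples, d_model) activations
labels = np.load("labels.npy")  # (n_samples,) binary sandbagging labels

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# AUROC of the probe as a white box monitor on held-out examples.
scores = probe.predict_proba(X_test)[:, 1]
print("Sandbagging probe AUROC:", roc_auc_score(y_test, scores))

# The learned weight vector is a candidate "sandbagging direction" in activation space.
sandbag_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

The same recipe (cached activations + cheap linear classifier) is what makes probes attractive as monitors: they add almost no inference-time cost on top of a forward pass you were running anyway.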

Additionally, in the case of scalable oversight, combining the outputs of white box monitors with black box ones could provide a much stronger signal to overseers.
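
As a toy illustration of what that combination might look like (the scores, fusion rule, and threshold here are all made-up assumptions):

```python
# Toy sketch: fuse a white box probe score with a black box monitor score into one
# suspicion score for overseers. All numbers and names are illustrative assumptions.
def combined_suspicion(probe_score: float, black_box_score: float) -> float:
    """Assumes both scores are roughly calibrated probabilities in [0, 1].
    Taking the max is a conservative fusion rule: escalate if either monitor fires."""
    return max(probe_score, black_box_score)

# A deceptive output might look clean to the black box monitor but light up the probe.
if combined_suspicion(probe_score=0.92, black_box_score=0.15) > 0.8:
    print("Escalate to a (human) overseer for review")
```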

Finally, white box control may be one of the most tractable applications of mech interp?

The Case for White Box Control via Linear Probes

The Case Against White Box Control via Linear Probes

How does White Box Control fit into Redwood's Control Priorities?

But is white box control actually useful for the most pressing concerns? (Spoilers: I think so?!)

Here are Redwood's top control priorities, in descending order, from this article on their blog. For each, I've added my thoughts on whether white box monitors (e.g. linear probes) or black box monitors (e.g. trusted monitoring of model outputs) seem more appropriate:

  1. Prevent (or catch) rogue internal deployments [LW · GW]: situations in which some instances of the AI run on the company's data center completely unmonitored and untracked by the developer.
    1. If a model does something like inference code tampering (e.g. this post from Anthropic), black box monitors may be more accurate at detecting the tampering itself than white box monitors would be at noticing that the model is thinking about tampering!
    2. Additionally, if the model has deployed itself unmonitored, it by definition won't have white box monitors attached. However, we could still have black box monitors running on the inputs and outputs of the data center (compute, storage, network, etc.), which would notice the new deployment!
  2. Ensure we can get useful and non-sabotaged work (despite having potentially misaligned AI systems).
    1. Use probes as trusted monitors, e.g. we could train probes to predict when solutions are backdoored.
  3. Prevent or catch other security failures, including self-exfiltration.
    1. If the AI is escaping, it's not gonna take our probes with it! It may also be that self-exfiltration is a spontaneous or opportunistic event, making it hard to train a monitor to detect in advance. But black box methods can still monitor I/O, network traffic, etc. as before.
  4. Prevent substantial manipulation/persuasion.
    1. Use probes to detect when AI systems are using manipulation/persuasion.
  5. Reduce the risk that AIs can successfully sandbag on capability evaluations. Also reduce the extent to which AIs can manipulate experiments done on them where possible.
    1. Use probes to detect sandbagging, e.g. can we extract the correct solution from the activations when the output is sandbagged?
    2. Then use the probe direction to elicit non-sandbagged work, e.g. via activation patching (see the sketch after this list).
  6. Increase the chance that we can get useful and non-sabotaged work out on much less easily checkable tasks like confusing conceptual work, finding new research fields, and strategic advice.
    1. Use probes to extract non-sabotaged work from activations.
    2. Then use probes to elicit non-sabotaged work, e.g. via activation patching, as above.
  7. Prevent very egregious large-scale side effects including terrorism and outright hacking of other actors via mechanisms other than rogue deployments. (This is plausibly very easy.)
    1. Given that these are side-effects, it may be hard to generate data/examples to train probes on. Black box monitoring seems like a safer bet?
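
For priorities 5 and 6, here's a rough sketch of the "elicit non-sandbagged work via activation patching" idea referenced above. Strictly, what's shown is activation steering along a probe direction (subtracting a hypothetical "sandbagging direction" from the residual stream during generation) rather than classic two-run patching; the model, layer, TransformerLens-style hook, direction file, and coefficient are all assumptions for illustration.

```python
# Rough sketch: steer generation away from a hypothetical "sandbagging direction"
# by editing the residual stream with a TransformerLens forward hook.
# Model choice, layer, coefficient, and the saved direction are illustrative assumptions.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for the model under control
LAYER = 6      # which residual stream to edit (assumption)
COEFF = 4.0    # steering strength (assumption)

# e.g. the normalised weight vector of a sandbagging probe, saved earlier (hypothetical file).
sandbag_direction = torch.load("sandbag_direction.pt")  # shape: (d_model,)

def suppress_sandbagging(resid, hook):
    # resid has shape (batch, seq, d_model); push every position away from the direction.
    return resid - COEFF * sandbag_direction

prompt = "Q: What is 17 * 23?\nA:"
with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", suppress_sandbagging)]):
    steered_tokens = model.generate(model.to_tokens(prompt), max_new_tokens=20)

print(model.to_string(steered_tokens[0]))
```

Whether this actually recovers non-sandbagged behaviour, rather than just degrading outputs, is exactly the generalization question flagged earlier.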

Outro

Thanks for reading! This is my first LessWrong post, so I'd really appreciate feedback, corrections, and contributions, and I'd love to hear your takes on the field!

  1. ^

    We pose the question "tell me what you want what you really really want" and look for the "zigazig ah" neuron.
