The Case for White Box Control
post by J Rosser (j-rosser-uk) · 2025-04-18T16:10:57.823Z · LW · GW
In this article, I explore the emerging field of white box control: a set of methods that could detect deception even when it doesn't appear in model outputs.[1]
What is AI Control?
Please feel free to skip this section if you're already up to speed!
As highlighted by Ben Pace [? · GW], there are two things you can do to prevent catastrophes caused by misaligned or scheming AI:
- Alignment: Ensure that your models aren't scheming.
- Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.
To justify that systems are safe for deployment, we usually form safety cases - structured, evidence-based arguments that a system is safe. In a similar manner, we can construct a "control safety case", a structured argument that models are incapable of subverting our control measures in order to cause unacceptable outcomes (more info here [LW · GW]).
As defined by Korbak et al. (2025), control measures are 'black box' if they:
- Don't modify models, e.g. via training, fine-tuning, or patching
- Only rely on external model behaviour, e.g. model outputs such as CoT reasoning, tool calls, etc.
Some examples of black box control measures include (more here [LW · GW]):
- Trusted Monitoring: Using weaker but trusted models to monitor the untrusted model's outputs before executing AI actions (a minimal sketch follows after this list).
- Trusted Editing: Using methods such as paraphrasing to remove steganography (hidden messages) or backdoors in model outputs.
- Cybersecurity: Improving computer security such that if a model did try to self-exfiltrate (escape), it wouldn't be able to.
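To make the trusted monitoring idea concrete, here is a minimal sketch, assuming an OpenAI-compatible chat API; the model name, prompt, and threshold are illustrative assumptions rather than a protocol from the control literature. A weaker trusted model scores how suspicious the untrusted model's proposed action looks before we execute it.

```python
# A minimal sketch of trusted monitoring, assuming an OpenAI-compatible chat API.
# The model name, prompt, and threshold are illustrative, not a real protocol.
from openai import OpenAI

client = OpenAI()
TRUSTED_MODEL = "gpt-4o-mini"  # stands in for any weaker-but-trusted model

def suspicion_score(untrusted_output: str) -> float:
    """Ask the trusted model to rate the untrusted output from 0 (benign) to 10 (clearly malicious)."""
    resp = client.chat.completions.create(
        model=TRUSTED_MODEL,
        messages=[
            {"role": "system",
             "content": "You are a security reviewer. Reply with a single number from 0 to 10."},
            {"role": "user",
             "content": f"Rate how suspicious this proposed code/action is:\n\n{untrusted_output}"},
        ],
    )
    # A real monitor would parse more robustly; this sketch assumes a bare number comes back.
    return float(resp.choices[0].message.content.strip())

def execute_if_safe(untrusted_output: str, threshold: float = 5.0) -> bool:
    """Only let the untrusted model's action through if the trusted monitor isn't suspicious."""
    return suspicion_score(untrusted_output) < threshold
```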
Alternatively, some examples of white box control measures could include:
- Adversarial Training: Augmenting the training process with intentionally perturbed inputs to expose and mitigate vulnerabilities by directly updating model parameters for robustness.
- Activation Patching: Intervening in the model’s internal activations at specific layers during inference to steer or correct its behavior on the fly.
- Linear Probes: Training simple linear classifiers on fixed hidden representations to investigate - and potentially steer - which features each layer encodes (see the probe-training sketch below).
- Sparse Autoencoders (SAEs): Encoding inputs into a sparse latent space and reconstructing them, thereby uncovering and manipulating the model’s internal features through targeted sparsity constraints.
- Influence Functions: Approximating how small changes to individual training examples would shift model parameters and outputs, allowing you to identify and directly intervene on the most influential data points.
The eagle-eyed amongst you will notice that many of these are mech interp techniques applied as control protocols.
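To make the linear probe idea concrete, here's a minimal sketch of training a probe on cached activations, assuming you've already run the model and saved one layer's hidden states along with behavioural labels. The file names, layer choice, and 'deception' framing are all hypothetical, for illustration only.

```python
# A minimal sketch of training a linear probe on cached hidden states.
# Assumes per-example activations from one layer (shape [n_examples, d_model])
# and binary labels (1 = deceptive behaviour, 0 = honest) have already been saved;
# the file names below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

activations = np.load("layer_20_activations.npy")
labels = np.load("deception_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Held-out accuracy: {probe.score(X_test, y_test):.3f}")

# The learned weight vector doubles as a candidate direction for the concept,
# which could later be used for steering or patching.
deception_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```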
So, why am I *so* excited about white box control?
[To be clear, I'm still super excited about black box control too! I'm pretty excited by all things control! Yay control!]
Historically, AI control has often involved black box monitoring of model outputs, e.g. looking for backdoors inserted into generated code (here) or analysing CoTs to see if the model admits to receiving a question hint (here). However, if the model output is not a faithful reflection of its internal reasoning and beliefs, we cannot depend on this form of monitoring to detect misaligned behaviours. This was highlighted by Anthropic's recent paper, which concluded that reasoning models don't always say what they think (oh no!).
As long as black box control still works, maybe we don't always need to care what models are thinking, but I think there's a strong case for looking anyway! In an ideal world, if a model outputs a sandbagged response but we can recover the non-sandbagged response from its activations, we can output that instead! Additionally, if sandbagging were represented by a simple linear direction, activation patching could be used to force the model to output the non-sandbagged response (a hedged sketch of this kind of intervention follows below). (Generalization still has a big question mark over it though!)
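As a hedged sketch of what such an intervention could look like: take the probe direction from the earlier sketch and register a PyTorch forward hook that projects it out of one layer's activations at inference time. The model name, module path, and layer index are illustrative assumptions, and whether ablating a single direction actually removes sandbagging is exactly the generalization question flagged above.

```python
# A minimal sketch of directional ablation at inference time.
# Assumes `deception_direction` (a unit-norm numpy vector) from the earlier probe sketch;
# the model name, module path, and layer index are illustrative and architecture-dependent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

direction = torch.tensor(deception_direction, dtype=model.dtype)

def ablate_direction(module, inputs, output):
    # Many decoder blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    # Remove the component along the probe direction at every token position.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    hidden = hidden - proj
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[20].register_forward_hook(ablate_direction)
try:
    inputs = tokenizer("Solve this task:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unmodified
```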
Additionally - in the case of scalable oversight - adding the results of white box monitors to black box ones could provide a much stronger signal to overseers.
Finally, white box control may be one of the most tractable applications of mech interp?
The Case for White Box Control via Linear Probes
- GDM found that dense linear probes perform nearly perfectly, including out of distribution, while SAEs underperformed - a result that led GDM to deprioritise SAE research (here [LW · GW]).
- Linear probes are very cheap to train and run [LW · GW], which makes control even more economically viable - super important for widespread adoption.
- Probes run nearly in parallel with generation, enabling 'streaming prediction' and real-time intervention - just like the output classifier in Anthropic's Constitutional Classifiers, we could assess the potential harmfulness of the output so far at each token and halt generation if it's flagged, without requiring the full output to be generated (see the sketch after this list).
- Not too much research in the space yet, but many labs and researchers (e.g. Buck [LW(p) · GW(p)], Neel [LW · GW], Fabien [AF · GW]) feel positively about their utility.
- Reasoning models don't always say what they think (here) - looks like we may need white box monitors to catch deception when it doesn't appear in model outputs.
- The results of probes can be used downstream to impact model outputs through techniques such as activation patching (here)
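Here's the streaming prediction idea from the list above as a minimal sketch: at each decoding step, score the latest token's hidden state with a linear probe and halt if the score crosses a threshold. The probe weights, layer index, and threshold are illustrative assumptions, and this naive version recomputes the full forward pass at every step for clarity rather than efficiency.

```python
# A minimal sketch of streaming probe monitoring during generation.
# Assumes a HuggingFace causal LM plus probe weights/bias (e.g. from a trained linear probe);
# the layer index and threshold are illustrative. No KV cache is used, for clarity.
import torch

@torch.no_grad()
def generate_with_probe_halt(model, tokenizer, prompt, probe_w, probe_b,
                             layer_idx=20, threshold=0.9, max_new_tokens=128):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        # Hidden state of the most recent token at the chosen layer,
        # cast to float32 to match the probe's dtype.
        h = out.hidden_states[layer_idx][0, -1].float()
        harm_prob = torch.sigmoid(h @ probe_w + probe_b).item()
        if harm_prob > threshold:
            return tokenizer.decode(ids[0]), "halted: probe flagged the generation"
        next_id = out.logits[0, -1].argmax()  # greedy decoding for simplicity
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0]), "completed"
```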
The Case Against White Box Control via Linear Probes
- As highlighted by Yonatan Cale [LW(p) · GW(p)], if we find the “I’m scheming now” neuron, and something in the architecture changes in a way that breaks that, we'd have to train a new probe etc.
- Generalization challenges of probes - they can struggle out of domain
- Are we only doing correlation, not causation?
- As highlighted by Buck [LW(p) · GW(p)], safety cases seem more obvious for black box control, although there are questions over whether it will become the norm for safety cases to be made for, and before, deployments.
How does White Box Control fit into Redwood's Control Priorities?
But is white box control actually useful for the most pressing concerns? (Spoilers: I think so?!)
Here are Redwood's top control priorities in descending order from this article on their blog. I've added my thoughts using the following emojis:
⬜ = White box monitors (e.g. linear probe) more appropriate
⬛ = Black box monitors (e.g. trusted monitoring of model outputs) more appropriate
- Prevent (or catch) rogue internal deployments [LW · GW]: situations in which some instances of the AI run on the company's data center completely unmonitored and untracked by the developer.
- ⬛ If a model does something like inference code tampering (e.g. this post from Anthropic), black box monitors may be more accurate at detecting the tampering itself than white box monitors would be at noticing that the model is thinking about tampering!
- ⬛ Additionally, if the model has deployed itself unmonitored, it by definition won't have white box monitors attached. However, we could still have black box monitors running on the data center's inputs and outputs (compute, storage, network, etc.), which would notice the new deployment!
- Ensure we can get useful and non-sabotaged work (despite having potentially misaligned AI systems).
- ⬜ Use probes as trusted monitors, e.g. we could train probes to predict when solutions are backdoored.
- Prevent or catch other security failures, including self-exfiltration.
- ⬛ If the AI is escaping, it's not gonna take our probes with it! It may also be that self-exfiltration is a spontaneous or opportunistic event such that it may be hard to train a monitor to detect. But black box methods can still monitor I/Os etc as before.
- Prevent substantial manipulation/persuasion.
- ⬜ Use probes to detect when AI systems are using manipulation/persuasion
- Reduce the risk that AIs can successfully sandbag on capability evaluations. Also reduce the extent to which AIs can manipulate experiments done on them where possible.
- Increase the chance that we can get useful and non-sabotaged work out on much less easily checkable tasks like confusing conceptual work, finding new research fields, and strategic advice.
- Prevent very egregious large-scale side effects including terrorism and outright hacking of other actors via mechanisms other than rogue deployments. (This is plausibly very easy.)
- ⬛ Given that these are side-effects, it may be hard to generate data/examples to train probes on. Black box monitoring seems like a safer bet?
Outro
Thanks for reading! This is my first LessWrong post so I'd really appreciate feedback, corrections, contributions and I'd love to hear your takes on the field!
[1] We pose the question "tell me what you want what you really really want" and look for the "zigazig ah" neuron.