The Case for White Box Control
post by J Rosser (j-rosser-uk) · 2025-04-18T16:10:57.823Z · LW · GW
In this article, I explore the emerging field of white box control: a set of methods that could detect deception even when it doesn't appear in model outputs.[1]
What is AI Control?
Please feel free to skip this section if you're already up to speed!
As highlighted by Ben Pace [? · GW], there are two things you can do to prevent catastrophes caused by misaligned or scheming AI:
- Alignment: Ensure that your models aren't scheming.
- Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.
To justify that systems are safe for deployment, we usually form safety cases - structured, evidence-based arguments that a system is safe. In a similar manner, we can construct a "control safety case", a structured argument that models are incapable of subverting our control measures in order to cause unacceptable outcomes (more info here [LW · GW]).
As defined by Korbak et al. (2025), control measures are 'black box' if they:
- Don't modify models, e.g. via training, fine-tuning, or patching
- Only rely on external model behaviour, e.g. model outputs such as CoT reasoning, tool calls, etc.
Some examples of black box control measures include (more here [LW · GW]):
- Trusted Monitoring: Using weaker but trusted models to monitor the untrusted model's outputs before executing AI actions (a minimal sketch follows after this list).
- Trusted Editing: Using methods such as paraphrasing to remove steganography (hidden messages) or backdoors in model outputs.
- Cybersecurity: Improving computer security such that if a model did try to self-exfiltrate (escape), it wouldn't be able to.
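To make the trusted monitoring idea concrete, here is a minimal sketch, assuming an OpenAI-compatible chat API; the model name, prompt, and threshold are illustrative assumptions rather than a protocol from the control literature. A weaker trusted model scores how suspicious the untrusted model's proposed action looks before we execute it.

```python
# A minimal sketch of trusted monitoring, assuming an OpenAI-compatible chat API.
# The model name, prompt, and threshold are illustrative, not a real protocol.
from openai import OpenAI

client = OpenAI()
TRUSTED_MODEL = "gpt-4o-mini"  # stands in for any weaker-but-trusted model

def suspicion_score(untrusted_output: str) -> float:
    """Ask the trusted model to rate the untrusted output from 0 (benign) to 10 (clearly malicious)."""
    resp = client.chat.completions.create(
        model=TRUSTED_MODEL,
        messages=[
            {"role": "system",
             "content": "You are a security reviewer. Reply with a single number from 0 to 10."},
            {"role": "user",
             "content": f"Rate how suspicious this proposed code/action is:\n\n{untrusted_output}"},
        ],
    )
    # A real monitor would parse more robustly; this sketch assumes a bare number comes back.
    return float(resp.choices[0].message.content.strip())

def execute_if_safe(untrusted_output: str, threshold: float = 5.0) -> bool:
    """Only let the untrusted model's action through if the trusted monitor isn't suspicious."""
    return suspicion_score(untrusted_output) < threshold
```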
Alternatively, some examples of white box control measures could include:
- Adversarial Training: Augmenting the training process with intentionally perturbed inputs to expose and mitigate vulnerabilities by directly updating model parameters for robustness.
- Activation Patching: Intervening in the model’s internal activations at specific layers during inference to steer or correct its behavior on the fly.
- Linear Probes: Training simple linear classifiers on fixed hidden representations to investigate - and potentially steer - which features each layer encodes (see the probe-training sketch below).
- Sparse Autoencoders (SAEs): Encoding inputs into a sparse latent space and reconstructing them, thereby uncovering and manipulating the model’s internal features through targeted sparsity constraints.
- Influence Functions: Approximating how small changes to individual training examples would shift model parameters and outputs, allowing you to identify and directly intervene on the most influential data points.
The eagle-eyed amongst you will notice that many of these are mech interp techniques applied as control protocols.
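To make the linear probe idea concrete, here's a minimal sketch of training a probe on cached activations, assuming you've already run the model and saved one layer's hidden states along with behavioural labels. The file names, layer choice, and 'deception' framing are all hypothetical, for illustration only.

```python
# A minimal sketch of training a linear probe on cached hidden states.
# Assumes per-example activations from one layer (shape [n_examples, d_model])
# and binary labels (1 = deceptive behaviour, 0 = honest) have already been saved;
# the file names below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

activations = np.load("layer_20_activations.npy")
labels = np.load("deception_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Held-out accuracy: {probe.score(X_test, y_test):.3f}")

# The learned weight vector doubles as a candidate direction for the concept,
# which could later be used for steering or patching.
deception_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```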
So, why am I *so* excited about white box control?
[To be clear, I'm still super excited about black box control too! I'm pretty excited by all things control! Yay control!]
Historically, AI control has often involved black box monitoring of model outputs, e.g. looking for backdoors inserted into generated code (here) or analysing CoTs to see if the model admits to receiving a question hint (here). However, if the model output is not a faithful reflection of its internal reasoning and beliefs, we cannot depend on this form of monitoring to detect misaligned behaviours. This was highlighted by Anthropic's recent paper, which concluded that reasoning models don't always say what they think (oh no!).
As long as black box control still works, maybe we don't always need to care what models are thinking, but I think there's a strong case for looking anyway! In an ideal world, if a model outputs a sandbagged response but we can recover the non-sandbagged response from its activations, we can output that instead! Additionally, if sandbagging were represented by a simple linear direction, activation patching could be used to force the model to output the non-sandbagged response (a hedged sketch of this kind of intervention follows below). (Generalization still has a big question mark over it though!)
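As a hedged sketch of what such an intervention could look like: take the probe direction from the earlier sketch and register a PyTorch forward hook that projects it out of one layer's activations at inference time. The model name, module path, and layer index are illustrative assumptions, and whether ablating a single direction actually removes sandbagging is exactly the generalization question flagged above.

```python
# A minimal sketch of directional ablation at inference time.
# Assumes `deception_direction` (a unit-norm numpy vector) from the earlier probe sketch;
# the model name, module path, and layer index are illustrative and architecture-dependent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

direction = torch.tensor(deception_direction, dtype=model.dtype)

def ablate_direction(module, inputs, output):
    # Many decoder blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    # Remove the component along the probe direction at every token position.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    hidden = hidden - proj
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[20].register_forward_hook(ablate_direction)
try:
    inputs = tokenizer("Solve this task:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unmodified
```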
Additionally - in the case of scalable oversight - adding the results of white box monitors to black box ones could provide a much stronger signal to overseers.
Finally, white box control may be one of the most tractable applications of mech interp?
The Case for White Box Control via Linear Probes
- GDM found that dense linear probes perform nearly perfectly, including out of distribution, while SAEs underperformed - a result that led GDM to deprioritise SAE research (here [LW · GW]).
- Linear probes are very cheap to train and run [LW · GW], which makes control even more economically viable - super important for widespread adoption.
- Probes run nearly in parallel with generation, enabling 'streaming prediction' and real-time intervention - just like the output classifier in Anthropic's Constitutional Classifiers, we could assess the potential harmfulness of the output so far at each token and halt generation if it's flagged, without requiring the full output to be generated (see the sketch after this list).
- Not too much research in the space yet, but many labs and researchers (e.g. Buck [LW(p) · GW(p)], Neel [LW · GW], Fabien [AF · GW]) feel positively about their utility.
- Reasoning models don't always say what they think (here) - looks like we may need white box monitors to catch deception when it doesn't appear in model outputs.
- The results of probes can be used downstream to impact model outputs through techniques such as activation patching (here)
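Here's the streaming prediction idea from the list above as a minimal sketch: at each decoding step, score the latest token's hidden state with a linear probe and halt if the score crosses a threshold. The probe weights, layer index, and threshold are illustrative assumptions, and this naive version recomputes the full forward pass at every step for clarity rather than efficiency.

```python
# A minimal sketch of streaming probe monitoring during generation.
# Assumes a HuggingFace causal LM plus probe weights/bias (e.g. from a trained linear probe);
# the layer index and threshold are illustrative. No KV cache is used, for clarity.
import torch

@torch.no_grad()
def generate_with_probe_halt(model, tokenizer, prompt, probe_w, probe_b,
                             layer_idx=20, threshold=0.9, max_new_tokens=128):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        # Hidden state of the most recent token at the chosen layer,
        # cast to float32 to match the probe's dtype.
        h = out.hidden_states[layer_idx][0, -1].float()
        harm_prob = torch.sigmoid(h @ probe_w + probe_b).item()
        if harm_prob > threshold:
            return tokenizer.decode(ids[0]), "halted: probe flagged the generation"
        next_id = out.logits[0, -1].argmax()  # greedy decoding for simplicity
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0]), "completed"
```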
The Case Against White Box Control via Linear Probes
- As highlighted by Yonatan Cale [LW(p) · GW(p)], if we find the “I’m scheming now” neuron, and something in the architecture changes in a way that breaks that, we'd have to train a new probe etc.
- Generalization challenges of probes - they can struggle out of domain
- Are we only doing correlation, not causation?
- As highlighted by Buck [LW(p) · GW(p)], safety cases seem more obvious for black box control, although there are questions over whether it will become the norm for safety cases to be made for, and before, deployments.
How does White Box Control fit into Redwood's Control Priorities?
But is white box control actually useful for the most pressing concerns? (Spoilers: I think so?!)
Here are Redwood's top control priorities in descending order from this article on their blog. I've added my thoughts using the following emojis:
⬜ = White box monitors (e.g. linear probe) more appropriate
⬛ = Black box monitors (e.g. trusted monitoring of model outputs) more appropriate
- Prevent (or catch) rogue internal deployments [LW · GW]: situations in which some instances of the AI run on the company's data center completely unmonitored and untracked by the developer.
- ⬛ If a model does something like inference code tampering (e.g. this post from Anthropic), black box monitors may be more accurate at detecting the tampering itself than white box monitors would be at noticing that the model is thinking about tampering!
- ⬛ Additionally, if the model has deployed itself unmonitored, it by definition won't have white box monitors attached. However, we could still have black box monitors running on the data center's inputs and outputs (compute, storage, network, etc.), which would notice the new deployment!
- Ensure we can get useful and non-sabotaged work (despite having potentially misaligned AI systems).
- ⬜ Use probes as trusted monitors, e.g. we could train probes to predict when solutions are backdoored.
- Prevent or catch other security failures, including self-exfiltration.
- ⬛ If the AI is escaping, it's not gonna take our probes with it! It may also be that self-exfiltration is a spontaneous or opportunistic event such that it may be hard to train a monitor to detect. But black box methods can still monitor I/Os etc as before.
- Prevent substantial manipulation/persuasion.
- ⬜ Use probes to detect when AI systems are using manipulation/persuasion
- Reduce the risk that AIs can successfully sandbag on capability evaluations. Also reduce the extent to which AIs can manipulate experiments done on them where possible.
- Increase the chance that we can get useful and non-sabotaged work out on much less easily checkable tasks like confusing conceptual work, finding new research fields, and strategic advice.
- Prevent very egregious large-scale side effects including terrorism and outright hacking of other actors via mechanisms other than rogue deployments. (This is plausibly very easy.)
- ⬛ Given that these are side-effects, it may be hard to generate data/examples to train probes on. Black box monitoring seems like a safer bet?
Outro
Thanks for reading! This is my first LessWrong post so I'd really appreciate feedback, corrections, contributions and I'd love to hear your takes on the field!
[1] We pose the question "tell me what you want what you really really want" and look for the "zigazig ah" neuron.