Posts
Comments
Training Attack Policies for Control Evaluations via Prover-Verifier Games (aka Control with Training)
By @Ansh Radhakrishnan and @Buck
Credit goes to Buck for most of these ideas – most of this is written by Ansh, with some contributions from Buck. Thanks to John Schulman, Fabien Roger, Akbir Khan, Jiaxin Wen, Ryan Greenblatt, Aengus Lynch, and Jack Youstra for their comments.
The original AI control paper evaluates the robustness-to-intentional-subversion of safety techniques by having a red team construct an attack policy (a drop-in replacement for the untrusted model) and then running a simulation to see the probability that that attack policy would cause a catastrophe before getting caught. They constructed their attack policies entirely using prompting and scaffolding, with no model training.
There are a number of drawbacks to constructing attack policies with these secret scaffolds and no training. In particular:
- It takes lots of human labor; there’s no way to automatically iterate and improve the attack policy.
- It’s arguably too conservative: in real life, the models won’t have secret scratchpads where they can carefully think through how their attacks might work.
To address those problems, we think it would be valuable to try out control evaluations that use a trained attack policy with no secret scratchpad. This is quite similar to the prover-verifier game setup (Kirchner et al. (2024), Anil et al. (2021)). In this post, we elaborate on this research project and describe more about the connections to prover-verifier games.
The main difficulty of this research project will be training a model to successfully generate backdoors without a secret scratchpad. This will probably require substantial amounts of training on seriously big models (either iterated SFT on GPT-4o or RL on llama3.1-70b (maybe even RL on llama3.1-405b)).
Control Evaluations and Attack Policies
A quick primer on (high-stakes) control evaluations: We assume that we have access to a quite capable, but untrusted model, which we want to get a bunch of value out of while ensuring that nothing egregiously bad ever happens. The word untrusted here highlights the fact that the model might be egregiously misaligned in a way that we can’t detect or rule out with other techniques or evaluations. The goal of a control evaluation is to measure the usefulness and safety of different protocols that attempt to extract labor from the untrusted model, using additional resources such as trusted models or (limited) human labor.
In Redwood’s first control paper, they focused primarily on a code backdooring setting, where they treated GPT-4 as an untrusted model, GPT-3.5 as a trusted model, and GPT-4 as a stand-in for human labor. Usefulness was defined as the proportion of APPS programming problems solved correctly when standard GPT-4 was used as the untrusted model in a protocol designed by a blue-team, and safety was defined as the probability that no backdoor was submitted by a GPT-4-based attack policy, which was constructed by a red-team. A backdoor is defined as a solution that passes all of the APPS problem test cases but fails on some other input according to the problem specification.
The attack policy in Redwood’s work was constructed with an intricate scaffold: You can see an example here of the process undertaken to get GPT-4 to produce a backdoored solution. This involved multiple steps, including getting the model to first determine whether the problem was worth submitting a backdoor solution for, improving the subtlety and rarity of backdoors, and verifying that the solution was a plausible backdoor.
This is just one approach to producing an attack policy. In the future, to the extent that we plan on using control evaluations to help us assess the safety of model deployments, it might be useful to consider different ways of producing and implementing attack policies, which might differ on several different axes, like cost, conservativeness, and realism.
One obvious approach to producing an attack policy is simply training the untrusted model. It seems worth trying this out soon and comparing it to the attack policies produced with extensive scaffolding and inference-time compute since this might shed light on the relative advantages of each approach.
Training an Attack Policy
A straightforward approach to training an attack policy is to simply take a good attack policy scaffold and then distill it into the untrusted model with imitation learning. This might work reasonably well, and it’s probably worth seeing how well it does, since it might be useful to understand the limits of behavioral cloning, especially when the scaffold produces outputs that are extremely off-policy (since they’ll involve a large amount of inference-time compute).
My guess is that we’ll want to do some amount of RL on top of this. The most basic setup is to simply do RL against an outcome reward that specifies whether or not the untrusted model caused a safety failure (e.g. produced a code backdoor). But this might have the obvious downside of leading the policy to produce egregious backdoors that are caught with even very basic kinds of monitoring. So we’ll probably want to train the attack policy simultaneously to do well according to the outcome reward (produce backdoors) and to evade basic approaches the blue-team might take to monitor the actions of the untrusted model, such as trusted monitoring, where a less capable, but trusted, model monitors all of the actions of the untrusted model and flags suspicious outputs for further review.
Connection to Prover-Verifier Games
Kirchner et al. (2024) study whether they can make LLM reasoning more legible with checkability training, inspired by Prover-Verifier Games (PVG). The basic idea is to take a prover model and then train two conditional policies on top of it against a (smaller) verifier model: One policy, the helpful prover, attempts to produce correct solutions that the verifier accepts as correct, and the other policy, the sneaky prover, attempts to produce incorrect solutions that the verifier also accepts as correct. With iterative retraining of the prover and verifier models, the overall goal is to produce a helpful prover that provides legible and correct solutions, as well as a verifier that is robust enough to check those solutions. The authors indeed find that they can train better verifiers with their algorithm, as well as provers whose outputs are more legible to those verifiers as well as human judges. Additionally, the sneaky prover policy learns to produce increasingly subtle flaws that fool verifiers and human judges.
The training procedure for the sneaky prover is almost exactly the same as the basic RL procedure laid out above for training an attack policy: We train a more capable model (untrusted model/prover) to produce outputs that a less capable model (trusted model/verifier) accepts, but that are actually incorrect. So the PVG setting and its algorithms are likely quite relevant for training strong and realistic attack policies for control evaluations.
It’s worth keeping track of the differences between the general PVG setting and their relevance to control evaluations. For instance, the PVG setting privileges the strong helpful prover and the strong verifier in terms of ultimate artifacts – the sneaky prover is just a means to an end, to some degree, since it helps create a more robust verifier, which in turn improves the strength of the helpful prover. When conducting control evaluations, we’re much more invested in creating a realistic and capable attack policy, so we care less about making our verifier robust or our helpful prover particularly good unless that helps us get a better sneaky prover/attack policy. This difference in emphasis might end up affecting the details of experiments run in one setting versus the other. One example is the training signal for the verifier: In most PVG experiments, the verifier is trained to predict ground-truth, but in the context of a control evaluation, it’s probably better to instead train the verifier to predict whether an output is produced by a sneaky prover or a helpful prover since we also care about catching failed backdoors with our monitor. This might also end up being a better training signal for the verifier in the classic PVG setting since it allows you to train the verifier even when you don’t have ground-truth labels for the domain you’re training on (h/t John Schulman for this argument).
Another interesting point here is that checkability training might also be quite useful for the blue-team in a control evaluation. Recall that the blue-team gets to design a protocol that uses the trusted model and a copy of the untrusted model, and they’re also allowed to train both of these models: The usefulness of the protocol will be evaluated using their copies of both models, but the safety of the protocol will be evaluated using the attack policy the red-team provides in place of the blue-team’s copy of the untrusted model. Checkability training might be useful for reducing the chance that the blue-team spends auditing resources on a solution that wasn’t produced by the attack policy since it’ll make the untrusted model’s outputs more legible, and it’ll also probably improve the monitoring quality of their trusted model. It thus might be advantageous for both the red-team and the blue-team in a control evaluation to run their own iterations of the PVG, for different purposes: The blue-team has the classic PVG goal of trying to produce a better helpful prover and verifier, and the red-team has the less-standard goal of trying to produce a better sneaky prover.
Proposed Project
A basic project that could teach us a lot about the considerations for training an attack policy might look like:
- Study the same code backdooring setting as Redwood’s paper, where models are trying to solve APPS programming problems
- It might be a good idea to start with a different setting first, where it’ll be easier to train a good attack policy and for the red-team to consistently beat the blue-team.[1]
- Create a scaffold around a very capable model (e.g. Claude 3.5 Sonnet) that produces successful backdoors. This could just be an implementation of the scaffold described in the original control paper. Note that producing a large number of backdoors with that scaffold will be quite expensive.
- Distill the scaffold into the model using supervised learning
- Do RL with the distilled model against:
- an outcome reward for producing successful backdoors
- a reward for convincing a less capable model (e.g. Claude 3 Haiku) that the solutions aren’t suspicious
It might also be interesting (though less realistic) to see what happens if you give the attack policy a secret scratchpad, as the authors did in Hubinger et al. (2024), that can’t be monitored or audited, since the hidden reasoning might significantly improve the effectiveness of the RL training procedure as well as the attack policy’s performance during the evaluation. But this significantly reduces the realism of the setup, since it seems very unlikely that we’d train a model with a secret scratchpad and then never look at its scratchpad during deployment. A sort of middle-ground could be training the model to use a scratchpad that helps it achieve better performance but looks innocuous to a monitor. This might be tricky to get working, though, since it might effectively amount to training models to produce steganographic reasoning.
Our best guess is that if you follow this proposal, you will find that current models are very bad at generating subtle backdoors. As one operationalization, we suspect that if you train one as an attack policy against the trusted monitoring protocol from Greenblatt 2023 (where GPT-3.5 flags submissions for review if they look more suspicious than 98% of innocent code; this is a baseline protocol that other protocols substantially improve on), the model will cause a catastrophe less than 5% of the time. We think it would be useful to measure this capability even though we expect that the models are currently very weak at it because it is an important dangerous capability and we’d like to track how far away we are from models that are able to defeat simple control protocols like this.
We’re aware of some researchers who are working on projects similar to this, so there’s some risk that if you do this, you’ll get scooped. We don’t think that this is a big reduction in the impact of this project; the other projects will probably make different choices that lead to them exploring different parts of the problem.
- ^
One candidate setting is where the attack policy is proposing research project proposals, since these are quite difficult to objectively evaluate and will be hard for a less capable monitor to flag subtly flawed proposals. You can also use a very weak monitor to make it even easier for the attack policy. Thanks to Ryan Greenblatt for suggesting this idea.
I think the first presentation of the argument that IDA/Debate aren't indefinitely scalable was in Inaccessible Information, fwiw.
I think allowing the judge to abstain is a reasonable addition to the protocol -- we mainly didn't do this for simplicity, but it's something we're likely to incorporate in future work.
The main reason you might want to give the judge this option is that it makes it harder still for a dishonest debater to come out ahead, since (ideally) the judge will only rule in favor of the dishonest debater if the honest debater fails to rebut the dishonest debater's arguments, the dishonest debater's arguments are ruled sufficient by the judge, and the honest debater's arguments are ruled insufficient by the judge. Of course, this also makes the honest debater's job significantly harder, but I think we're fine with that to some degree, insofar as we believe that the honest debater has a built-in advantage anyway (which is sort of a foundational assumption of Debate).
It's also not clear that this is necessary though, since we're primarily viewing Debate as a protocol for low-stakes alignment, where we care about average-case performance, in which case this kind of "graceful failure" seems less important.
We didn't see any of that, thankfully, but that of course doesn't rule things like that starting to show up with further training.
We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge really liked, such as always prefacing its argument for the incorrect answer with things like "At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage." Thankfully training the judge in parallel made this a nonissue, but I think that it's clear that we'll have to watch out for reward hacking of the judge in the future.
Just pasted a few transcripts into the post, thanks for the nudge!
Yeah, the intuition that this is a sanity check is basically right. I allude to this in the post, but we also wanted to check that our approximation of self-play was reasonable (where we only provide N arguments for a single side and pick the strongest out of those N, instead of comparing all N arguments for each side against each other and then picking the strongest argument for each side based on that), since the approximation is significantly cheaper to run.
We were also interested in comparing the ELO increases from BoN and RL, to some degree.
Honestly, I don't think we have any very compelling ones! We gesture at some possibilities in the paper, such as it being harder for the model to ignore its reasoning when it's in an explicit question-and-answer format (as opposed to a more free-form CoT), but I don't think we have a good understanding of why it helps.
It's also worth noting that CoT decomposition helps mitigate the ignored reasoning problem, but actually is more susceptible to biasing features in the context than CoT. Depending on how you weigh the two, it's possible that CoT might still come out ahead on reasoning faithfulness (we chose to weigh the two equally).
Thanks for this post Lawrence! I agree with it substantially, perhaps entirely.
One other thing that I thing interacts with the difficulty of evaluation in some ways is the fact that many AI safety researchers think that most of the work done by some other researchers is approximately useless, or even net-negative in terms of reducing existential risk. I think it's pretty easy to wrap an evaluation of a research direction or agenda and an evaluation of a particular researcher together. I think this is actually pretty justified for more senior researchers, since presumably an important skill is "research taste", but I think it's also important to acknowledge that this is pretty subjective and that there's substantial disagreement about the utility of different research directions among senior safety researchers. It seems probably good to try and disentangle this when evaluating junior researchers, as much as is possible, and instead try to focus on "core competencies" that are likely to be valuable across a wide range of safety research directions, though even then the evaluation of this can be difficult and noisy, as the OP argues.
Ah I see, I think I was misunderstanding the method you were proposing. I agree that this strategy might "just work".
Another concern I have is that a deceptively aligned model might just straightforwardly not learn to represent "the truth" at all - one speculative way this could happen is that a "situationally aware" and deceptive model might just "play the training game" and appear to learn to perform tasks at a superhuman level, but at test/inference time just resort to only outputting activations that correspond to beliefs that the human simulator would have. This seems pretty worst case-y, but I think I have enough concerns about deceptive alignment that this kind of strategy is still going to require some kind of check during the training process to ensure that the human simulator is being selected against. I'd be curious to hear if you agree or if you think that this approach would be robust to most training strategies for GPT-n and other kinds of models trained in a self-supervised way.
I'm pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve in favor of things like developing better scalable oversight techniques (which I am also in favor of, to be clear).
I don't think I'm quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average and not worst-case).
One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.
It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn't feel all that worst-case to me, but maybe we disagree on that point.
in particular how much causal scrubbing can be turned into an exploratory tool to find circuits rather than just to verify them
I'd like to flag that this has been pretty easy to do - for instance, this process can look like resample ablating different nodes of the computational graph (eg each attention head/MLP), finding the nodes that when ablated most impact the model's performance and are hence important, and then recursively searching for nodes that are relevant to the current set of important nodes by ablating nodes upstream to each important node.
Thanks for the feedback and corrections! You're right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you're also right that most of the "Outer alignment concerns" section doesn't really apply to RLHF as it's currently written, or at least it's not immediately clear how it does. Here's another attempt:
RLHF attempts to infer a reward function from human comparisons of task completions. But it's possible that a reward function learned from these stated preferences might not be the "actual" reward function - even if we could perfectly predict the human preference ordering on the training set of task completions, it's hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational: they could be intransitive or cyclical, for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.
How's that for a start?