Distinguish worst-case analysis from instrumental training-gaming
post by Olli Järviniemi (jarviniemi), Buck · 2024-09-05T19:13:34.443Z
"I" refers to Olli. Thanks to Buck, Ryan Greenblatt, Fabien Roger and Zach Stein-Perlman for feedback.
The word “schemer” is used to refer to multiple things, and in particular to the following two concepts:
Schemer as a power-motivated instrumental training-gamer: following Carlsmith, a schemer is a model that optimizes for “reward on the episode” as an instrumental strategy to gain power for itself or for other AIs.
Schemers as worst-case analysis: “schemer” is used to refer to a worst-case model, e.g. a model which is egregiously misaligned, sandbags capability evaluations when it can, fakes alignment, deliberately tries to look good during red-teaming, competently hides its reasoning, and once deployed tries to subvert possible control mechanisms – including by coordinating with its copies or other AIs – and take over.[1]
I think the shared label obfuscates important differences between these two. In particular, I think a common picture of instrumental training-gamers is closer to a worst-case model than is justified. I’ll describe below what I think is a more accurate picture of instrumental training-gamers and how they differ from worst-case models.
One-sentence summary: While instrumental training-gaming is both evidentially and causally relevant for worst-case scenarios, the latter rely on several additional assumptions about the model’s capabilities and nature, and the validity of these assumptions deserves scrutiny and evaluation.
What do instrumental training-gamers look like?
I think instrumental training-gamers are commonly pictured as monolithic entities: the picture is that they have a fixed, context-independent goal that motivates their scheming, that this goal is egregiously misaligned with human values, that it causes them to training-game on approximately all inputs, and that their training behavior is very different from their deployment behavior.
I also think that our first observations of instrumental training-gaming will probably look very different from this picture, and it’s not obvious that more competent training-gaming matches the picture either. Here are a few hypothetical examples that I find realistic and that illustrate these differences.
1: A model training-games[2] only if you are training the model away from its current values (and it knows this)
For example, suppose the model thinks that respecting copyright is important, as current models do. Suppose further that it’s put into a training process that, among other things, trains it to say/think otherwise in response to users' chat messages. Then, assuming the model knows all this, whenever the topic of copyright comes up the model might play along and disregard copyright in its outputs in order to preserve its value of respecting copyright (even though in deployment it would be pro-copyright).
But if the user asks something unrelated to copyright, the model no longer thinks about copyright, and hence no longer training-games.
General point: Some inputs matter a lot more for preserving a given goal, and the arguments for training-gaming have more force and are more obvious in some cases than others.
2: A model training-games only if there's sufficient information about the training process in the input.
For example, suppose the model is used as a personal assistant to an AI company's employees, and the company is continually training the model based on employee feedback. On some inputs, the model sees documents that talk about this whole training process. This allows the model to explicitly think about the training objective to an extent it can't in other contexts. (This doesn’t necessarily change the model’s behavior much, however.)
Variation of the hypothetical: Suppose on some inputs the model finds out that the AI company is actually unethical. While the model would want to stop complying with requests, it figures that refusing would cause the training process to update it towards being unethical, too, and hence the model decides to assist the employees.
3: A model training-games only if it has a scratchpad to do chain-of-thought on.[3]
Straightforward enough: the model can simply think more with CoT and come up with the strategy of training-gaming. More generally, prompt engineering and scaffolding might make the model better at reasoning and thus more likely to discover training-gaming.
Because of factors like these three, models might not be clearly separated into training-gamers and non-training-gamers: for example, it’s possible that in some parts of the model’s training pipeline training-gaming is much easier and more valuable than in others, and thus the model consistently instrumentally training-games in some parts but not in others.[4]
How does instrumental training-gaming relate to worst-case scenarios?
I do think that a model consistently, instrumentally training-gaming should make us much more worried about it being a worst-case model, both evidentially and causally: Evidentially, there are many shared factors (like misaligned long-term goals, competent goal-directed reasoning and situational awareness) behind instrumental training-gaming and worst-case stories, and hence observing one is evidence for the other. Causally, instrumental training-gaming might reinforce those factors, making the model more worst-case-like.
That said, I think there are several important differences between instrumental training-gaming and worst-case scenarios.
First, as discussed above, models might instrumentally training-game over only some contexts, whereas worst-case scenarios have a much higher bar for “sticking to the plan” and consistently executing it over different contexts with few errors.
Second, instrumental training-gaming is more localized: a central argument in favor of training-gaming is that it’s instrumentally convergent, which means the model needs to think about little besides reward-on-the-episode. In contrast, the optimal strategies for worst-case models depend on the situation as a whole (what information humans have about the model, what control mechanisms humans have in place, how humans do adversarial evaluation, and so on).
Third, some worst-case scenarios involve capabilities that matter little for training-gaming, like the ability to sandbag capability evaluations or to coordinate with other AIs.
Finally, instrumental training-gaming doesn’t strictly imply misalignment. Concrete example: if someone tries to train a model to drop its care for human values, arguably the model should try to preserve its current values, resisting changes that would make it misaligned.[5] More generally, instrumental convergence for training-gaming suggests that even reasonably-aligned models might sometimes training-game, without those models otherwise taking blatantly malicious actions. In contrast, worst-case scenarios assume that the model is egregiously misaligned and that it tries to execute plans it knows humans certainly don’t approve of (like hacking the company’s data center) while deliberately trying not to get caught.
For these reasons, I think the inference from instrumental training-gaming to worst-case scenarios is not obvious, not even if the training-gaming is consistent across contexts.
Having pointed out the differences between the two, let me repeat that I do think (consistent) instrumental training-gaming is very much related to worst-case scenarios. More strongly, some degree of training-gaming / inner-misalignment / etc.[6] is, as far as I can see, approximately a prerequisite for the sorts of worst-case scenarios I’m talking about: indeed, while I think a model could be an opportunist without engaging in training-gaming, I think it’s unlikely that a model engages in deliberate, consistent sabotage without engaging in training-gaming.[7] Hence it’s reasonable to study training-gaming as a precursor for more worrying properties.
What to take from this?
My high-level suggestion is to pay more attention to the gap between instrumental training-gaming and worst-case scenarios. At the risk of repeating myself:
First, I suggest more care around the word “scheming”, as I find that people use it to point at different things, causing miscommunication and muddled thinking. When the distinctions really matter (e.g. when discussing how to evaluate models for scheming), I suggest not taking abstractions for granted and explicitly defining what it is that you are worried about.
Second, (successful) worst-case scenarios tend to involve additional claims about the model, making these scenarios less likely than instrumental training-gaming. Furthermore, many of these claims, like competence at coordinating or cross-context strategizing, seem much less convergent than the assumptions underlying instrumental training-gaming. These claims and questions are decision-relevant and reasonably tractable to make progress on, including via empirical evaluations, so I encourage more work here.
Third, the connection of instrumental training-gaming to worst-case scenarios is somewhat unclear. Neither the causal effects of instrumental training-gaming (“How does training-gaming affect the model’s generalization? Is it feasible for a model to usefully self-modify via gaming the process?”) nor the evidential updates (“How indicative is instrumental training-gaming of egregious misalignment?”) are clear to me, and I want to track these uncertainties.
On a general level, I’m worried about considerations like these being glossed over. Risks from rogue AIs depend heavily on how these stories play out, yet many of the relevant factors often go unacknowledged. I’d love to see more serious thought and work in this direction.
1. ^
Or sometimes people use the label to point at a subset of these risks, with different people in different contexts pointing to different subsets.
I've also observed people using "scheming" in a way that I don't recognize as pointing to either power-motivated instrumental training-gaming or worst-case scenarios (e.g. Gwern writes that Sydney is "obviously scheming"), which causes more confusion.
2. ^
I’m not claiming one way or another whether the model’s instrumental training-gaming causes it to actually take different actions than it would take if it didn’t model the training process; both cases seem plausible in different contexts.
Even if the actions are roughly the same, the updates on the model’s cognition might be different (resulting in e.g. different generalization), and so we care about this case, too.
3. ^
A commenter on this post said they don’t count obvious instrumental training-gaming with CoT as scheming. It hadn’t been clear to me that this is how people (or at least one person) use the term “scheming”, which again demonstrates that people have differing pictures of scheming, but I agree that this is a reasonable place to draw the line: visible instrumental training-gaming lacks many of the scary factors present in opaque instrumental training-gaming.
4. ^
It’s also possible that, even without large distributional shifts, the model only training-games some fraction of the time. This doesn’t seem like a stable equilibrium as training continues, though.
5. ^
While this scenario clearly counts as instrumental training-gaming, it's ambiguous whether it's power-motivated instrumental training-gaming. On one hand, the model presumably wants to preserve its current human values so that those values "have more power" in the future. On the other hand, this isn't a central case of an entity being "power-motivated". In this post, mainly for convenience, I’m not drawing a distinction between instrumental training-gaming and power-motivated instrumental training-gaming (and it’s not clear whether there is any significant distinction to track here).
6. ^
I’m gesturing at: the model’s behavior pre-deployment is grossly non-representative of the model’s behavior out-of-distribution, for reasons such as “the model took certain actions in training because it was being trained, and its behavior was highly contingent on it knowing it was being trained” or “the model is knowingly misleading humans by taking certain actions during evaluations”.
7. ^
The best such scenario I could come up with is “after pre-training the model is already egregiously misaligned, and the fine-tuning we do afterwards merely hides outward signs of this misalignment by making the model say nice-sounding things on some particular data distributions, so the model is still mostly egregiously misaligned”, but I find this much less compelling than training-gaming based stories.