Summing up "Scheming AIs" (Section 5)

post by Joe Carlsmith (joekc) · 2023-12-09T15:48:49.109Z · LW · GW · 1 comments

This is Section 5 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.

Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.

Summing up

I've now reviewed the main arguments I've encountered for expecting SGD to select a schemer. What should we make of these arguments overall?

We've reviewed a wide variety of interrelated considerations, and it can be difficult to hold them all in mind at once. On the whole, though, I think a fairly large portion of the overall case for expecting schemers comes down to some version of the "counting argument." In particular, I think the counting argument is also importantly underneath many of the other, more specific arguments I've considered. Thus:

That is, in all of these cases, schemers are being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming, thereby making it easier to (a) land on one of them naturally, (b) land "nearby" one of them, or (c) find one of them that is "simpler" than non-schemer goals that need to come from a more restricted space. And in this sense, as I noted in section 4.2, the case for schemers mirrors one of the most basic arguments for expecting misalignment more generally – e.g., that alignment is a very narrow target to hit in goal-space. Except, here, we are specifically incorporating the selection we know we are going to do on the goals in question: namely, they need to be such as to cause models pursuing them to get high reward. And the most basic worry is just that: this isn't enough. Still, despite your best efforts in training, and almost regardless of your reward signal, almost all the models you might've selected will be getting high reward for instrumental reasons – and specifically, in order to get power.

I think this basic argument, in its various guises, is a serious source of concern. If we grant that advanced models will be relevantly goal-directed and situationally-aware, that a wide variety of goals would indeed lead to scheming, and that schemers would perform close-to-optimally in training, then on what grounds, exactly, would we assume that training has produced a non-schemer instead? Perhaps, per the "haziness" of my "hazy counting argument," we don't specifically allocate our credence over models in proportion to some attempt to "count" the possible goals in question. But even a hazy sense that "lots of goals" lead to scheming is, in my book, cause for alarm, here. We don't know enough about ML training, at this stage, to be confident that we've avoided the relevant parts of goal-space. Rather, if our knowledge does not improve, we will be faced, centrally, with some goal-directed mind that understands what's going on and the process we are using to shape it, and which is getting high reward because it wants something. "Why, exactly, does the thing it wants lead it to get high reward?" we will have to ask. And the most basic answer will be: "we don't know." That's not an acceptable answer. It's not acceptable with respect to the possibility of misalignment in general. And it's especially unacceptable, in my view, if a very wide variety of especially-scary misaligned goals would give rise to this behavior as part of a strategy for seeking power.

That said, I do think there are a few causes for comfort here. We can break these into roughly two categories.

The first focuses on questions about whether scheming is, in fact, such a convergently rational instrumental strategy for such a wide variety of beyond-episode goals. In particular:

The second source of comfort focuses on forms of selection pressure that a high-level counting argument, based solely on the assumption that the selected model gets "high reward," doesn't cover. In particular:

The possibility that there are additional selection pressures that disfavor schemers, here (and in particular: the possibility that SGD intrinsically disfavors schemers due to their needing to perform extra reasoning), seems to me especially important given the centrality of "counting arguments" to the various arguments in favor of expecting scheming. In particular: I think that a key way that "counting arguments" in general tend to go wrong is by neglecting the power that active selection can have in overcoming the "prior" set by the count in question. Thus, to borrow an epistemic example/analogy from Xu (2021), your "prior" that my name is "Joseph Carlsmith" should be quite low, because there is a very strong "counting argument" against this hypothesis: namely, that most names (even for men in my demographic etc) are not "Joseph Carlsmith." But when I tell you that my name is "Joseph Carlsmith," this is actually very strong evidence – enough to overcome the prior and leave you confident in the hypothesis in question. And something similar holds for various forms of selection in building functional artifacts. The reason we can overcome the prior of "most arrangements of car parts don't form a working car," or "most parameter settings in this neural network don't implement a working chatbot," is that the selection power at stake in human engineering, and in SGD, is that strong. So if SGD's selection power is actively working against schemers, this might quickly overcome a "counting argument" in their favor. For example, as I discussed in section 4.2: if there are 2^100 schemer-like goals for every non-schemer goal, this might make it seem very difficult to hit a non-schemer goal in the relevant space. But actually, 100 bits of selection pressure can be cheap for SGD (consider, for example, 100 extra gradient updates, each worth at least a halving of the remaining possible goals, in the context of a training run that involves many millions).[3]
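To make the arithmetic in the 2^100 example concrete, here is a minimal sketch. It takes the text's idealization at face value (each relevant gradient update exactly halves the remaining candidate goals) and assumes, purely for illustration, a training run of ten million updates as one reading of "many millions." The function and variable names are hypothetical, and nothing here models how SGD actually behaves.

```python
from fractions import Fraction


def remaining_candidates(initial_count: int, halvings: int) -> Fraction:
    """Candidates left if each of `halvings` updates halves the pool (idealized)."""
    return Fraction(initial_count, 2 ** halvings)


# From the text: 2^100 schemer-like goals for every non-schemer goal.
schemer_goals_per_non_schemer = 2 ** 100

# Bits of selection needed to overcome that ratio: log2(2^100) = 100.
bits_needed = schemer_goals_per_non_schemer.bit_length() - 1

# After 100 idealized halvings, the 2^100 : 1 ratio is reduced to 1 : 1.
ratio_after_selection = remaining_candidates(schemer_goals_per_non_schemer, bits_needed)

# Assumption (not from the text): a run with ten million gradient updates.
assumed_total_updates = 10_000_000
fraction_of_run = Fraction(bits_needed, assumed_total_updates)

print(f"bits needed: {bits_needed}")
print(f"ratio after selection: {ratio_after_selection}")
print(f"fraction of assumed run used: {float(fraction_of_run):.6%}")
```

Under these assumptions, the 100 updates are a tiny fraction of the run, which is the sense in which the text calls this much selection pressure "cheap." Whether real gradient updates deliver anything like one bit of selection against schemer-like goals is exactly the open question the paragraph raises.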

For this reason, the question of whether SGD "cares" a lot about the extra reasoning schemers have to do (and/or, whether we can "make it care more") seems to me fairly key here. That is, if just grinding harder on SGD's intrinsic preferences (e.g., for high reward behavior) is itself a force selecting against schemers, then counting arguments or no, I think that anti-schemer efforts have a very powerful tool in their favor. In this respect, I feel intuitively more optimistic about anti-schemer efforts in "low slack" training environments that put a lot of pressure on models to perform optimally – and this is a parameter that we may be able to control. And I feel more worried about schemers in a "path dependent" regime in which SGD might light on a schemer goal early in training, and then get stuck there despite the fact that it would have "preferred" a non-schemer overall.

Another question that seems important to me is whether SGD's incrementalism blocks "training-game-dependent" stories about the origins of schemer-like goals – e.g., the "nearest max-reward goal" argument. If so, I think, this would be a meaningfully additional source of comfort. In particular: to me it seems much easier to avoid naturally giving a model a (suitably ambitious) beyond-episode goal than it is to make sure that by the time the model is situationally aware, its goal is sufficiently close to max-reward that there isn't a schemer-like goal "nearby" along some dimension. In particular: the image of SGD specifically "searching out" the nearest max reward goal, in a space where a wide array of schemer-like goals are accessible to the search, seems to me quite worrying (and also: more likely to result in beyond-episode goals with whatever properties are specifically necessary to incentivize scheming – e.g., highly ambitious goals, goals with unbounded time horizons, and so on).

Finally, I want to re-emphasize some ongoing uncertainty about whether scheming requires an unusually high standard of goal-directedness. I've been trying, here, to separate debates about goal-directedness per se from debates about which sorts of goal-directed models to expect SGD to select. But insofar as even highly capable AIs may not require the sort of coherent, strategic goal-directedness the alignment discourse often assumes, I think this may be especially relevant to the probability that such AIs will be well understood as "schemers," since the form of goal-directedness at stake in scheming seems especially coherent, strategic, and "consequentialist."

Stepping back and trying to look at these considerations as a whole, I feel pulled in two different directions.

That is, scheming feels robust and common at the level of "goal space," and yet specific and fairly brittle at the level of "yes that's what's going on with this real-world model, it's getting reward because (or: substantially because) it wants to get power later and getting reward now helps with that." I think this sense of conjunctiveness has a few different components:

Can we put a rough, subjective number on the overall likelihood of schemers? In my opinion, and despite their drawbacks, numbers like this often function to sharpen the debate a bit, so I'll give it a shot here. Let's say: if you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.

I want to be clear, though, that even on top of the many forms of vagueness the claim above implicates, I'm just pulling a number from my gut. I haven't built a quantitative model of the relevant considerations (though I'd be interested to see efforts in this vein), and the main point of the report is the analysis itself, not this attempt to synthesize the analysis into a quantitative upshot.

I also think it may be possible to reduce the risk here in fairly simple ways: for example, via training on shorter-horizon tasks, or via various forms of adversarial training prior to situational awareness. And I think the probability that the model is both scheming and engaging in actively nefarious actions like sandbagging and early undermining is lower, because these actions come at more significant costs in terms of cognition and reward.

I also think that the risk increases with more powerful models, who may be better positioned to escape/take-over in the future and to coordinate/cooperate with other AIs in doing so, and who are more likely to be able to engage in more sophisticated forms of goal-guarding like "introspective goal-guarding methods" (though these seem to me intuitively quite challenging even for quite advanced AIs).

Finally, I want to re-emphasize that non-schemers can engage in schemer-like behavior, too. Reward-on-the-episode seekers can fake their alignment (or even: engage in escape/take-over) in an effort to get reward-on-the-episode; non-training-gamers can still end up with power-seeking goals that incentivize various forms of deception; and the eventual AIs that matter most might differ in important ways from the paradigm sort of AI I've been focused on here (for example, they might be more like "language model agents" than single models, or they might be created via methods that differ even more substantially from the sort of baseline ML methods I've focused on), while still engaging in power-motivated alignment-faking.[5] Scheming, in my view, is a paradigm instance of this sort of scariness, and one that seems, to me, especially pressing to understand. But it's far from the only source of concern.


  1. Though again: it needs to be a notion of "survival" tolerant of values-change. ↩︎

  2. See section 6.8 for a bit more on this. ↩︎

  3. Thanks to Paul Christiano for discussion here. ↩︎

  4. It also feels a bit difficult to track all of the other, subtler conjuncts that can build up in the backdrop of the schemer hypothesis. ↩︎

  5. Though as noted above, if the relevant language model agents are trained end to end (as opposed to just being built out of individually-trained components), then the report's framework will apply to them as well. ↩︎

1 comment

Comments sorted by top scores.

comment by Mateusz Bagiński (mateusz-baginski) · 2024-05-09T10:23:06.437Z · LW(p) · GW(p)

Let's say: if you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.

Have you tried extending this gut estimate to something like:

If many labs use somewhat different training procedures to train their models, but each falls under the umbrella of "coherently goal-directed, situationally aware [...]", what is the probability that at least one of these models "will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later"?