Option control
post by Joe Carlsmith (joekc) · 2024-11-04
Introduction and summary
(This is the third in a series [? · GW] of four posts about how we might solve the alignment problem. See the first post [LW · GW] for an introduction to this project and a summary of the posts that have been released thus far.)
In the first post in this series [? · GW], I distinguished between “motivation control” (trying to ensure that a superintelligent AI’s motivations have specific properties) and “option control” (trying to ensure that a superintelligent AI’s options have specific properties). My second post [LW · GW] surveyed some of the key challenges and available approaches for motivation control. In this post, I do the same for option control.
Here’s a summary:
- I start by briefly arguing that option control still matters even in the context of superintelligences.
- I then distinguish between a few different types of option control – and in particular, between approaches that focus on limiting/eliminating paths to successful takeover (what I call “takeover-path restriction”) vs. approaches that focus on influencing an AI’s incentives in some other way (“other incentive shaping”).
Diagram of two approaches to option control
- I then discuss takeover path restriction in some detail.
- I start by drawing a few comparisons and contrasts with motivation control – in particular, takeover path restriction does not require the same kind of generalization; and sufficiently high quality oversight can turn motivation control issues into option control issues (because it can sufficiently restrict available takeover paths).
- I then discuss what I see as the four key issues with takeover path restriction (in the context of superintelligence), namely:
- Superhuman strategies: Superintelligences can use takeover strategies that humans couldn’t generate ahead of time or detect while they’re being pursued.
- Opacity: again, it would help a lot to have direct epistemic access to internal variables like an AI’s capabilities, world model, reasoning, motivations, and so on; but in current ML models, you don’t.
- Oversight scarcity/latency: Even if we had reliable oversight, it might be too expensive to apply to enough actions, and/or too slow to respond to bad behavior.
- Adversarial dynamics: an AI might be actively optimizing against your efforts to restrict its takeover options, and the AIs you try to get to help you might sabotage your efforts as well.
These four difficulties in a diagram
- I then turn to a discussion of how we might address these difficulties.
- I start by discussing our options for restricting the capabilities of our AIs – both in the context of “black box” approaches that can’t benefit from transparency, and more “white box” approaches that can.
- I then turn to a discussion of available approaches to hardening an AI’s environment against takeover attempts – again, in both a black box and a white box context, but starting with black box.
- Here I zoom in, in particular, on the possibility of using oversight to try to detect and block takeover pursuit while it occurs. I think there are a bunch of promising options in this vein, but they will require safe access to superhuman qualities of cognitive labor, which itself will require avoiding collusion between overseers and the AI-being-overseen.
- I also discuss other aspects of black box environmental hardening – both within the AI’s local environment (i.e., security practices within an AI lab), and within the broader world (i.e., generally improving the world’s coordination and governance, its robustness to various threat models, etc).
- I think that “stress-testing” (i.e., using a red-team to evaluate what sorts of takeover paths would be available to a given AI in a given environment) is also likely to be a very useful tool for environmental hardening – but I think it, too, will require safe access to superhuman cognitive labor.
- If we had greater transparency into the cognition of our AIs, this would help significantly with oversight, stress-testing, and other aspects of environmental hardening. I discuss a few possibilities on this front as well.
Diagram of the approaches to takeover path restriction I discuss
After this discussion of takeover path restriction, I turn to a discussion of other ways of potentially using option control to shape an AI’s incentives in desirable ways. In particular, I zoom in on the possibility of improving the AI’s best non-takeover option, so as to make cooperative strategies more attractive by the lights of its values (I call this “rewarding cooperation”).[1]
Here I describe five key challenges, and how I think we might try to address them. In particular:
- Viable attempts at rewarding cooperation likely require at least some other success at motivation/option control as well – otherwise the AI will probably be able to get more of the relevant “reward” more easily via takeover. So this needs to be one component of a broader approach.
- You need to figure out how to improve an AI’s best benign option in a way that it actually wants, despite “opacity” issues potentially making it harder to know what its preferences are. But there are a variety of sources of optimism here (in particular: you can reward it using convergently instrumental goods like resources/compute/help-with-what-it-asks-for; and/or you can commit to later using better transparency tools to understand its motivations better and reward them then).
- You need the intervention to improve the AI’s best benign option more than it improves the AI’s favorite takeover option; but I think this is pretty easy to achieve (i.e., via ways in which some improvements will by default matter more in worlds where the AI cooperates, by making the “payment” conditional on cooperative behavior, and by drawing on our future ability to investigate which behavior was cooperative).
- You need the relevant “reward” to be sufficiently credible to the AI in question. This issue further decomposes into (a) the difficulty of making it true that the reward will occur with the relevant probability, and (b) the difficulty of giving the AI enough evidence of this fact. I’m more optimistic, re: (a), for short-term rewards than for long-term rewards; and conditional on success with (a), I’m optimistic about (b), because I think the AI’s strong epistemic capacities are on our side.
- Finally, you need to size the reward adequately for it to make the desired difference to the AI’s incentives, while also being worth “paying” by your lights. I think this is probably the most difficult issue, and I’m most optimistic about it with respect to the components of AI motivation that are specifically focused on shorter-term goods that the AI doesn’t have time to procure via takeover (e.g., improvements to local working conditions).
Option control still matters for superintelligences
As I discussed in my first post, the most desirable property, from an option control perspective, would be if the AI has no option that would lead to takeover with any non-trivial probability. In that case, motivation control becomes irrelevant to avoiding takeover (though: not to other goals), and the “safety from takeover” aspect of the alignment problem is secure (though: we still have to deal with the elicitation part).
But the discourse about AI alignment often proceeds on the assumption that this sort of option control isn’t realistically available in the context of superintelligence. I examine different ways you could end up thinking this here [LW · GW]. The strongest version, in my opinion, proceeds via the following claim:
The “need SI-agent to stop SI-agent” hypothesis: the only way to stop one superintelligent AI agent from having a decisive strategic advantage (DSA) is with another superintelligent AI agent.
That is, the worry goes: if you want to prevent a superintelligent AI agent from having the option to take over, you need another superintelligent AI agent (or set of such agents) to help you do so. But this means that if these agents all coordinated, they would collectively be in a position to take over, such that the option to engage in such coordination, if it’s available, now becomes a takeover option in itself. And then the additional thought is that it will likely be quite difficult to prevent superintelligent AI agents from having the option to coordinate in this way.
- And note that if these superintelligent agents all just independently went rogue, and humans stopped being able to control or direct them, then things plausibly look pretty bleak as well, even absent active coordination on their part.
I think that arguments like this, plus more general pessimism about “AI boxing,” have led the traditional alignment discourse to pay too little attention, in general, to option control as a part of the overall toolkit for ensuring safety from takeover. I think this is especially so in the context of human-level-ish and only-somewhat-better-than-human-level-ish AIs (more in future); but I think the same complaint applies to superintelligences as well.
In particular: while I think that the “need SI-agent to stop SI-agent” hypothesis is worryingly plausible, I don’t think it’s fully obvious (see footnote for some ways it would be false[2]). And the same goes for the claim that preventing superintelligent AIs from having the option to coordinate to take over is out of the question as a strategy.[3]
- More importantly, though: even if you admit that we can’t prevent superintelligences from being exposed to some takeover options, this doesn’t mean that it’s useless to try to control what properties these options (and their alternatives) have.
- That is, I feel like people often move quickly from “some kind of motivational control is going to be necessary” to “therefore we should focus entirely on motivation control and assume that our AIs need to be such as to behave well basically regardless of the options they face” (and/or, be fit to become dictators-of-the-universe after they finish fooming/self-reflecting).
- But in practice, how much control you have over the option set in question still matters to the difficulty of exerting the relevant sort of motivational control.
- In particular, as I’ve discussed previously [LW · GW], I think there is an important difference between an AI that is in a position to take over extremely easily via an extremely wide variety of paths, and an AI whose available paths to takeover are more constrained and/or less likely to succeed.
- In particular, with the latter sort of AI, its inhibitions can be less robust; it can be more ambitious without takeover becoming best overall; we have a better shot at making it satisfied with its best non-takeover options; the downsides of failed takeover are more relevant to its calculus; and so on.
So overall, I think that the toolkit available for option control is still very important to examine in the context of superintelligence, even if you think that some component of motivation control will be necessary as well.
Types of option control
Recall that on the ontology I’m using, an agent’s options emerge from the combination of its environment and its capabilities.
So really, option control decomposes, further, into environmental control and capability control. And we can further decompose environmental control into:
- Local environmental control: that is, control over the intended development/deployment environment for the AI.
- Thus, a paradigm example of the “local environment” would be something like: the AI lab/project developing the AI in question.
- And a paradigm example of “local environmental control” would be something like: trying to boost the internal security at the AI lab so that it’s harder for the AI to self-exfiltrate.
- Non-local environmental control: that is, control over the state of the rest of the world.
- Thus, a paradigm example of the “non-local environment” might be something like: the overall state of international coordination in response to AI risk.
- And a paradigm example of “non-local environmental control” might be something like: trying to improve this coordination such that, in the event that an AI successfully self-exfiltrates, the world’s response to this is more effective.
This distinction isn’t always going to be especially clear – for example, if the AI’s intended development/deployment environment is unclear or extremely broad. But I think it’s useful regardless. In particular, in many paradigm contexts of AI development/deployment, the “intended development/deployment” environment is quite specific and limited. It’s therefore often much easier to attempt to ensure that this local environment has some specific set of properties – i.e., “the AI’s actions always receive blah form of review” – than it is to try to do so with the broader world – i.e., “no world leader is such that they would aid and abet a rogue AI.”
Finally, we can distinguish between two broad types of intervention on an AI’s options, namely:
- Making it harder for the AI to successfully take over (i.e., trying to eliminate takeover options entirely; trying to make a given sort of takeover path less likely to succeed; trying to restrict the set of takeover paths such that the AI’s inhibitions block or make-much-less-appealing more of the available paths; etc).
- Trying to alter its incentives in some other way (i.e. trying to improve its benign options, making failed takeover scenarios less desirable).
That is, in terms of my framework for thinking about AI incentives towards power-seeking, and holding fixed the AI’s motivations: you can either intervene to reduce the desirability/positive expected value (from the AI’s perspective) coming from successful takeover scenarios, or you can intervene elsewhere (i.e., on the desirability/positive expected value coming from the best benign option, or on the aversiveness/negative expected value coming from failed takeover scenarios).
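Schematically, and in very rough notation: holding the AI’s motivations fixed, takeover becomes its best option only when, roughly,

$$p_{\text{succ}} \cdot V_{\text{takeover}} + (1 - p_{\text{succ}}) \cdot V_{\text{failed}} \;>\; V_{\text{benign}}$$

where $p_{\text{succ}}$ is the probability (given pursuit) that takeover succeeds, and the $V$s are the values, by the AI’s lights, of successful takeover, failed takeover, and its best benign option. Takeover-path restriction works on driving down $p_{\text{succ}}$ (and on narrowing which paths contribute to it); the other interventions work on raising $V_{\text{benign}}$ or driving down $V_{\text{failed}}$.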
Here I’m going to focus, first, on making it harder for the AI to take over. Then I’ll talk about other ways you might try to use option control to alter its incentives.
Takeover-path restriction
Ok, let’s look for a second at options for making it harder for the AI to take over (let’s call this “takeover-path restriction”). I’ve generally thought quite a bit less about this topic than I have about motivation control, so I’m going to do even less, here, to aspire to capture the full space. At the least, though, I want to highlight some of the key differences and similarities at stake with takeover path restriction relative to motivation control, and to flag some of the paths forward that seem promising to me.
Comparison with motivation control
I’ll start with a few points of comparison/contrast with motivation control.
Do we need to get takeover-path restriction right on the first critical try?
In particular: recall that in the context of motivation control, we assumed that our AI was, at some point, going to be exposed to a takeover option – that is, an option that has a non-trivial chance of leading to successful takeover (and the point of motivation control is for the AI’s motivations to be such that it rejects this option). This created what was perhaps the key difficulty for motivation control – namely, the need for “generalization with no mistakes.”
In the context of takeover-path restriction, though, we don’t need to make this assumption. Indeed, the limiting case of success at takeover-path restriction is precisely the case in which the AI doesn’t ever get exposed to any takeover options – and I don’t want to assume, at the outset, that this sort of success is off the table.
Of course, even if we set aside motivation control in particular, there’s a broader sense in which any sort of safety plan needs to “work on the first try” (and the second, and the third, etc) – at least if we view a single failure as unacceptable (more here [LW · GW]). So in some sense, the need for “generalization with no mistakes” arises at a somewhat more abstract level even with takeover-path restriction – that is, if you’re ever going to put your AI in a context where you’re counting on your takeover-path restriction efforts to prevent takeover, then in a sense there will be a “first critical try” with respect to whether those efforts have been successful – one that you need to get right.
Feedback vs. oversight
Still, the fact that you don’t ever need to actually give your AI takeover options does create somewhat different dynamics, in the context of takeover-path restriction, relative to motivation control. In particular, in the context of motivation control – and especially, the version of motivation control that didn’t rest on transparency tools – we assumed that we were going to be shaping the AI’s motivations/cognition via some sort of behavioral feedback signal used in training; and then hoping/arguing that whatever motivations this signal creates are such as to generalize well to takeover options.
In the context of takeover-path restriction, though, a different approach becomes available: namely, using what I’ll call oversight to inspect an AI’s actions to determine whether they are part of a takeover attempt, and then blocking them if they are.
We can distinguish between two sorts of oversight, namely:
- Black box oversight, which works only via inspection of external variables like an AI’s behavior and environment.
- White box oversight, which also incorporates information directly from an AI’s internal variables, using some kind of transparency technique.
Many of the techniques you would use for black box oversight are extremely similar to the ones you would use for “behavioral feedback” in the sense I discussed in the motivation control section. In particular, you’ll likely want to improve your behavioral inspection capacity via some way of drawing on lots of AI labor; you’ll likely want to draw on the difficulty difference between evaluation and generation; and so on. Indeed, when people talk about techniques for “scalable oversight,” they often don’t distinguish between feedback and oversight in the way that I am doing here – what matters is just: becoming much better at determining whether a given instance of AI behavior is good/bad in a given way.[4]
Auditing failures vs. concentrated failures
Importantly, though, sufficiently high-quality and ubiquitous oversight (including: black box oversight) can turn motivation control issues into option control issues – because sufficiently high-quality and ubiquitous oversight can eliminate takeover options. Thus, as an example, consider the distinction between what Greenblatt and Roger call “auditing failures” vs. “concentrated failures [LW · GW]” (one closely related to the distinction between what Christiano calls “low stakes” vs. “high stakes” alignment [LW · GW]). Here “auditing” refers to an instance of applying oversight, in my sense, to an AI’s actions; and an “auditing failure” occurs when this sort of oversight gets applied, but it fails to recognize that a given action is problematic. A concentrated failure, by contrast, occurs when the auditing/oversight process isn’t applied (in time) to a bad action – either because the auditing process gets applied to too few of an AI’s actions (a “sampling concentrated failure”), or because by the time the auditing process gets applied, it’s too late to block the action and/or its bad effects (a “latency concentrated failure”).
You only need motivation control if you are vulnerable either to auditing failures or to concentrated failures – that is, if it’s either the case that your oversight process can’t recognize some bad actions, or the case that your oversight process can’t get applied (in time) to enough actions to prevent takeover if those actions are bad. Of course, if you did have sufficiently good and ubiquitous oversight that you could catch all bad actions, then you could also use that as a (potentially flawed) mechanism for motivation control – namely, by catching any bad actions by the AI, then training it not to take them and hoping that this shapes its motivations in the way you want rather than doing something else (and note that this training can occur in an ongoing way during deployment – i.e., you can train “online”). But it’s not, strictly, necessary to actively reshape the AI’s motivations here in order to prevent takeover – you just need to be able to catch enough bad actions in time.
- And note, too, that the possibility of sufficiently good/ubiquitous oversight means that our efforts at motivation control can, in fact, benefit from iteration on certain kinds of failures. Thus, for example, while it’s true that you can’t safely let an AI pursue a takeover option with a non-trivial chance of success, you can let AIs pursue takeover more generally, including in real world contexts, so long as your oversight will catch and block the relevant attempts in time. And this means that if AIs make mistakes about the likelihood that the oversight process will catch them (and/or if they decide it’s worth pursuing takeover even with a very low probability of success), you may well, in fact, have real-life instances of AIs pursuing takeover (i.e., inserting backdoors into code, sandbagging capability evaluations, sabotaging alignment research, etc) to study and iterate on. (See e.g. “Catching AIs red-handed [AF · GW]” for more.)
- That is, in a sense, the fact that an AI does a bad action like inserting a backdoor into the code isn’t enough, on my carving, for it to count as pursuing a genuine takeover option, because the effort might have too small a probability of success, given our oversight.
OK, with those points of contrast with motivation control in mind, let’s turn to some of the key difficulties that takeover-path restriction faces.
Key difficulties with takeover-path restriction
What makes takeover-path restriction difficult? Lots of things in principle (and as I said above, I feel like I generally have a relatively weak grip on the contours of this space). Here I’m going to try to home in on four difficulties that seem to me especially key, plus a smattering of extras. But this is pretty provisional.
Also:
- My comments are geared centrally towards efforts at takeover-path restriction that focus on controlling (a) an AI’s capabilities, and (b) its local environment. I think most of what I say applies to efforts to control an AI’s non-local environment as well, but as I note in the “Other difficulties” section, this is made challenging by how big and unruly the AI’s non-local environment can be. So I think the most promising approaches to takeover-path restriction focus more locally.
A general theme here is that various of these difficulties parallel the key difficulties in the context of motivation control – and various of the techniques you’d use to address the difficulties are similar as well.[5] Indeed, I think that work on motivation control and work on takeover-path restriction (at least in the local environment) are quite closely tied together – and even though I’m distinguishing between them at a conceptual level here, various research pathways (e.g., improving transparency, improving our ability to inspect instances of AI behavior, preventing scheming/collusion in the AIs we try to get to help us, etc) seem likely to me to target both simultaneously.
Superhuman strategies
If I had to home in on a single thing that makes takeover-path restriction difficult in the context of superintelligence, my current best guess would be something like: the availability of superhuman strategies for pursuing takeover.[6]
Here we can distinguish between two levels of superhuman-ness:
- Not-human-generatable: i.e., superintelligent AIs can generate and execute takeover plans that humans couldn’t have anticipated ahead of time.
- Not-human-detectable: once an AI starts pursuing one of these strategies, humans can’t tell until it’s too late.[7]
These two levels of superhuman-ness are importantly distinct, and they create distinct problems for human efforts at takeover-path restriction. That is, roughly speaking:
- If an AI can generate takeover strategies that humans couldn’t have thought of, this means that humans can’t identify, on their own, all the paths to takeover that a given environment might make available to a given AI – thereby making it harder to block all of these paths ahead of time.
- Though: note that this problem could in principle arise even if the AI can only generate human-generatable plans, because it might still be too difficult for humans to generate a sufficient number of such plans (especially given available resources) so as to cover all the possibilities.[8]
- One option in the face of this difficulty, though, would be to simply apply oversight to the AI’s actual behavior in the environment in question, such that you can catch whatever particular path the AI attempts to pursue. But if the strategy in question is not-human-detectable, you can’t do this either via human-level labor alone.
- Rather, it seems, if you want to make sure that an environment doesn’t allow an AI any paths to takeover, you need to find some way of amplifying your ability to identify available takeover strategies ahead of time; and/or you need to find a way of amplifying your ability to detect those strategies while they’re being executed.
These difficulties become more acute the more superintelligent the AI system is; the harder its actions and reasoning are for humans to understand; the more “slack” it has to think about potential takeover plans and/or communicate in not-human-legible ways with other AIs; and so on.
Indeed, because of the core difficulty with handling superhuman takeover strategies, I am quite pessimistic about humans (with assistance only from fairly weak AIs) relying heavily on “takeover-path restriction” in order to prevent full-blown superintelligent agents from taking over. Rather, I think that we will need serious help from more advanced AIs – more below.
Opacity
Now, one thing you might’ve wanted to do, in order to help address these issues with superhuman strategies, is to draw on epistemic access to the AI’s internal variables – i.e., its capabilities, its motivations, its world model, its reasoning process – in order to help anticipate and detect-in-time the sorts of strategies in question. Below I’ll talk more about some options in this respect, if you could get the sort of access in question. Unfortunately, though, this sort of transparency faces the same issue we ran into in the context of motivation control – namely, that by default, current ML systems are quite opaque, so absent lots of interpretability progress, future ML-based superintelligences could be as well.[9]
Oversight scarcity/latency
Even if you got to the point of at least being able to use oversight to detect takeover strategies (whether via using purely black box techniques, or via also including some aspect of transparency), you still face the “concentrated failures” issue that I mentioned above: namely, that you may not be able to apply your oversight to enough of an AI’s actions, fast enough, to restrict its paths to takeover in the way that you want to.
Scarcity
In one version of this, oversight is too scarce.
Thus, for example, maybe your reliable oversight is too expensive (in time, compute, money, human labor, etc) to run on a large number of your AI’s actions, so instead you only sample a small fraction of them and check those. But if the AI can take over with only a small number of actions, then you can quickly end up with hefty probabilities that it would be able to do so without getting any of its actions overseen.[10]
E.g., if the AI only needs five bad actions to go uncaught in order to self-exfiltrate, and you only oversee one in a thousand actions, then its probability of successful escape, conditional on going for it, is .999^5 ≈ .995.[11]
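To make the arithmetic explicit, here is a minimal sketch of the sampling calculation (the parameters beyond the example above are purely illustrative):

```python
# Minimal sketch: probability that a takeover attempt goes entirely un-overseen,
# assuming each action is independently sampled for oversight at a fixed rate.

def p_unseen(oversight_rate: float, bad_actions_needed: int) -> float:
    """Chance that none of the attempt's bad actions get sampled for oversight."""
    return (1 - oversight_rate) ** bad_actions_needed

print(p_unseen(0.001, 5))     # the example above: ~0.995
print(p_unseen(0.001, 1000))  # attempts needing many actions: ~0.37
print(p_unseen(0.01, 1000))   # ...and denser sampling helps a lot: ~0.00004
```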
Latency
In another version of this, oversight is too slow to respond to detecting problems.
Thus, for example, maybe the AI needs to take 1000 bad actions in order to self-exfiltrate, such that overseeing 1/1000 of actions would have a good chance of catching it. But if oversight only occurs after the actions have been performed, and it takes too long to respond to detected problems, then if the AI can concentrate its 1000 bad actions within a short period of time, it might be able to get away with self-exfiltration even though some of the relevant actions are overseen.[12]
- Note also that the speed problems here don’t need to come from the “detecting a problem” part of the oversight process. They can also come from the “responding to this detection” part.
- For example, maybe your oversight process quickly notices and flags a problem with some AI’s action. But this notification has to get reviewed by a human before it can get acted on, and a human reviewer isn’t available until the next morning.
Adversarial dynamics
Finally, as in the case of motivation control, any efforts at takeover-path restriction face the potential for adversarial dynamics in various AIs – both the AI whose takeover-paths you’re trying to restrict, and the AIs you’re trying to get to help you.
In the AI whose takeover-paths you’re trying to restrict
Thus, for example, in the context of the AI whose takeover-paths you’re trying to restrict: one thing you might really want to do, in trying to understand and restrict what paths to takeover are available to an AI, is to try to understand what its capabilities are in general, and to test those capabilities in particular environments. But adversarial AIs might engage in “sandbagging [LW · GW]” aimed at concealing this information; they might lie to you about their internal variables; and so on.
In the AIs you’re trying to get to help you
Perhaps more importantly, though: because of the “superhuman strategies” issue above, it seems extremely likely that effective restriction of a superintelligence’s paths to takeover will require drawing on various types of AI labor – for example, in the context of trying to find and block takeover paths in a given environment ahead of deployment, in the context of applying oversight to the AI’s actions, and so on. But if these AIs are themselves dangerous – and especially, if they are actively colluding with each other in order to promote a takeover attempt – then they might sabotage efforts in this vein (i.e., by not identifying various security holes in a given environment, by giving the green-light to actions that they can see are aimed at takeover, and so on).
This issue is closely related to questions about the “need a superintelligent agent to stop a superintelligent agent” hypothesis that I briefly discussed above. That is, roughly speaking: if the only way to restrict the takeover options of a superintelligent agent is to draw on the labor of other superintelligent agents, then – especially if those superintelligent agents are capable of cooperating with each other – you still seem pretty vulnerable.
These four difficulties in a chart
Here are these four difficulties in a convoluted chart:
Below I’ll also discuss a few other difficulties – with test environments, with capability control, and with controlling the environment – that seem to me at least somewhat-less-key. And I’ll name here one other difficulty that seems less key, but still worth calling out, namely:
- No safe fallback. In the discussion of oversight above, I’ve been assuming that if oversight says that an action is bad, there is some kind of “fallback action” that can be safely taken instead. For example, you can have the AI take no action at all; you can pause the AI’s operation; etc. But in some higher-stakes circumstances, there isn’t a safe “fallback action” of this kind.
- Thus, as an extreme example, if your AI is in the midst of operating in a hot war, then if your oversight process starts to say “uh oh, it looks like the AIs are starting to engage in a takeover,” then even if you still have the power to e.g. freeze their operation, doing so might be dangerous for other reasons – e.g., because then the opposing side of the war will invade.
- I don’t think this sort of thing is quite a “core issue,” because I think in many, many cases, there are safe fallbacks. But it’s an additional wrinkle.
- There’s also a more general sense in which failures at either motivation or option control don’t always have “safe fallbacks,” in that such failures can sometimes create situations in which a cautious actor hasn’t achieved adequate safety and so pauses, but a less cautious actor proceeds forward. But I’m generally speaking about this under the heading of “competitiveness.”
How can we address these difficulties?
Ok, that was a list of various difficulties faced by efforts to restrict the takeover paths superintelligent AI systems have available. Let’s turn, now, to the options available for addressing these difficulties. Here I’m going to divide approaches according to whether they focus on restricting an AI’s capabilities or on hardening its environment, and according to whether they assume access to significant levels of transparency into AI cognition (white box) or not. I’ll also use a further distinction, within “environmental hardening,” between approaches focused on monitoring/oversight, and approaches focused on other types of environmental hardening. And I’ll discuss the importance of “stress testing” (and of avoiding collusion in the AIs you try to get to help you) in the context of both.
Thus, in a maybe-unhelpful diagram:
One other note before we dive in. As gestured at above, I think that the core “superhuman strategies” issue makes it unlikely that humans + comparatively weak AI systems would be able to effectively and reliably block the paths to takeover available to a full-blown superintelligence operating in a reasonably complex, real-world environment (while also eliciting the sorts of useful capabilities at stake in my definition of solving the alignment problem). So I’ll generally be focused, in what follows, on forms of environmental hardening that make use of superhuman abilities to generate, detect, and block takeover paths.
This means that an extremely key variable, for the viability of takeover-path restriction, is whether or not we can get the right sort of access to this sort of superhuman capability, including in the context of the potential for various adversarial dynamics.[13] And if we don’t have access to superhuman environmental hardening (whether via monitoring/oversight, stress-testing, or some other methods), then I think we should be extremely cautious about relying heavily on takeover path restriction in an overall safety case for super-intelligent agents.
Capability restriction
Let’s start by looking at options for capability restriction.
As I noted above, to the extent we’re solving the alignment problem fully, we are assuming that the AI in question needs to be a full blown superintelligence. But given this constraint, there is still plausibly room for more fine-grained restrictions on its capabilities to play a role in efforts at takeover-path-restriction.
- For example, granted that your AI system is going to be vastly better than humans at tasks like persuasion, hacking, and so on, it still might matter a lot how its capabilities compare to the defensive efforts of blah-conglomeration of weaker-but-still-better-than-human AI systems. Indeed, if it turns out that the best overall strategy is to make
- And of course, other AI systems that we involve in the process of restricting the takeover-paths available to this superintelligence (i.e., AI systems involved in monitoring, stress-testing, and so on) aren’t subject to the constraint that they need to be full-blown superintelligences themselves, so capability restriction can be a more prominent role for them.
We can roughly break down capability restriction into three components.
- Targeted capability development: Controlling/restricting what specific capabilities you give to an AI in the first place.
- Accurate capability evaluation: Understanding what capabilities your AI system has.
- Targeted capability removal: Actively removing specific capabilities from an AI system.
These three components are closely interrelated. For example: targeted and intentional capability development and capability removal likely require a highly developed capacity for accurate capability evaluation; and one way to do the equivalent of “removing” capabilities from a given AI is to use targeted capability development to create an AI with a different capability profile instead. Indeed, in some sense, all three of these components can be understood as part of a broader “science of AI capabilities.” And at a high level, effective capability restriction basically requires that this science be sufficiently developed that we can be quite deliberate and precise in choosing to endow a given AI with a given sort of capability, and quite confident about whether we have succeeded or failed.
What might this look like? I’m not going to examine the issue in depth here. But I’ll make a few comments, focused first on black box techniques for capability restriction – i.e., ones that do not assume significant transparency – and then moving on to more white-box approaches.
Black box capability restriction
The most obvious mechanism for black box capability evaluation comes from successfully causing the AI to demonstrate a given sort of capability on some input. But this approach suffers from at least two issues:
- The first is that the AI in question might be actively adversarial towards your attempts to get it to demonstrate its capabilities.
- See here [AF · GW] for some useful discussion of the different sorts of potential problems/solutions here, in the context of different sorts of evaluations.
- And see here [LW · GW] for my takes on whether AIs will be able to successfully withhold demonstrating capabilities on tasks whose performance we can “check.”
- In my view, “exploration hacking” (roughly, a sandbagging AI refusing to explore into the space that successful task performance would involve) is the most serious issue here.
- That said, because we’re talking about full-blown superintelligences, I think it becomes more plausible that we need to wonder about more exotic and sophisticated failure modes, like gradient hacking (i.e., an AI arranging its internal cognition to actively manipulate how gradient descent updates its policy). See here [LW · GW] for a bit more discussion.
- The second is that even if the AI isn’t actively adversarial in this way, you might struggle for other reasons to get the demonstration in question – for example, because you can’t adequately tell whether the AI is really demonstrating the capability in question, because you don’t have enough of the right sort of data for fine-tuning the AI to display the capability in question, and so on.
Beyond demonstrations, though, there are also other approaches to black box capability evaluation. For example, you can get better at predicting what sorts of capabilities a given sort of model should have based on various scaling laws, based on the data it was trained on, based on the other capabilities it (or other AIs) displays in other contexts, based on more a priori reasoning about how AI training/learning works, and so on.[14]
- And note that success at this sort of capabilities evaluation also makes it harder for adversarial AIs to deceive you about their capabilities. I.e., if an AI starts “sandbagging” in a way that suggests significant deviation, in its capabilities, from what your scaling laws would predict, then you might be able to catch it in the act.
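As a toy illustration of that kind of check (purely illustrative numbers and functional form; real capability forecasting is much messier):

```python
# Toy sketch: fit a simple loss-vs-compute scaling law on smaller, trusted runs,
# then flag a new model whose measured loss is suspiciously worse than the
# extrapolation predicts – one crude signal of possible sandbagging.
import numpy as np

# Hypothetical (compute, eval loss) points from smaller training runs.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = np.array([3.1, 2.6, 2.2, 1.9, 1.6])

# Fit loss ≈ a * compute^k via a linear fit in log-log space (k will be negative).
k, log_a = np.polyfit(np.log(compute), np.log(loss), 1)

def predicted_loss(c: float) -> float:
    return float(np.exp(log_a + k * np.log(c)))

def possible_sandbagging(new_compute: float, measured_loss: float, tol: float = 0.15) -> bool:
    """Flag if the measured loss is much worse than the scaling-law extrapolation."""
    return measured_loss > predicted_loss(new_compute) * (1 + tol)

print(predicted_loss(1e23))             # extrapolated loss at the next scale
print(possible_sandbagging(1e23, 2.4))  # a big upward deviation gets flagged
```

Obviously this is nothing like a full “science of AI capabilities” – but it gestures at the sort of consistency check that better capability-prediction tools would make possible.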
This sort of capability evaluation ability – i.e., the ability to predict what sorts of capabilities an AI should have, given how it was created, without yet having seen them demonstrated – is also closely related to the ability to give an AI a specific capability profile (i.e., targeted capability development). For example, if you had extremely good scaling laws for all sorts of capabilities (including capabilities downstream of the training loss), then you’d be much better positioned to aim your capabilities development in a precise way.[15]
- And of course, more basic and square-one methods of black-box targeted capabilities development are available as well – for example, refraining from actively training the AI on tasks you don’t want it to be able to do, or on information you don’t want it to know. The problem, though, is that the AI may well end up with the capability/knowledge you’re trying to restrict regardless (for example, by generalizing from other types of capabilities, by inferring the fact from other data-sources, and so on).
Various black box techniques for targeted capability removal may be available as well. Thus, for example, if it’s possible to identify which sorts of data led to which sorts of capabilities (see e.g. work on influence functions), it may also be possible to scrub the influence of a particular sort of data from a model’s policy.
- Though: whether we want to call this a “black box” technique is perhaps up for debate. So let’s turn to more white-box approaches now.
White box capability restriction
What happens if we introduce more significant degrees of transparency (whether via open agency, interpretability, or some new paradigm) into the picture? Basically, lots of stuff gets quite a bit easier. Thus:
- Obviously, if you somehow had the ability to inspect an AI’s capabilities directly, without relying on behavioral demonstrations or scaling-laws-esque predictions, then you’d be in a vastly better position both to do capability evaluation, and to target your capabilities development/removal precisely. It’s not entirely clear to me what it would look like to have this kind of direct inspection capacity, but one image might be: if you could somehow identify the cognitive signature of a capability in other contexts, perhaps you could learn to detect it directly in a newly trained model.
- Even absent direct transparency into a model’s capabilities in general, though, other sorts of transparency seem extremely useful in this context.
- For example, if you had suitable transparency into an AI’s motivations and/or reasoning, you could use this to detect whether or not the AI is sandbagging on a capability evaluation.
- If you had suitable transparency into an AI’s world model, you could use this to locate – and perhaps, to remove – various types of knowledge the AI might have.
- If you knew, in general, how the process of learning new capabilities, forming new representations, etc worked in an AI, you could use this to steer the AI’s training and capability development in more precise and desirable ways.
Environmental hardening
Even given large amounts of success at capability restriction, though, restricting the paths to takeover available to a superintelligence also requires serious attention to the environment that the superintelligence is operating in. After all, as noted above, you can’t restrict the superintelligence’s capability that much, because it needs to remain a superintelligence. And superintelligences, plausibly, have access to a disturbingly large array of takeover options in various environments.[16]
How, then, do we create an environment that has been hardened enough to restrict these options, while still allowing ourselves to elicit this superintelligence’s useful capabilities? One way to think about this challenge is as follows. You’ve got some superintelligent AI agent (let’s say we’re holding its internal properties fixed), and you want to design some environment that will restrict its available paths to takeover in X way (while also making possible the sort of capability elicitation you want). But you face the “superhuman strategies” issue discussed above – namely, that this AI can pursue paths to takeover that you (a) couldn’t have identified/generated yourself, and which (b) you couldn’t recognize while they’re being pursued (until it’s too late). What do you need to do?
Very roughly, you need to find some way to get superhuman labor to help you with both (a) and (b), and then you need to use this increased ability to generate and recognize superhuman strategies to create an environment that restricts them in X way. Let’s look at some of the options in this respect now. I’ll start, again, with more black box approaches, and then move on to more white box-y ones.
Black box environmental hardening
I’m going to divide the environment an AI is operating in into two components, namely:
- The oversight process
- Everything else
Both, I think, need superhuman levels of hardening. Let’s look at the oversight process first.
Oversight
The oversight process, basically, is the part of the environment (if there is one) designed to monitor the AI’s behavior (plus other aspects of the inputs the AI is receiving), to detect suitably suspicious behavior/inputs, and to respond in a manner that prevents something bad from happening as a result.
- Thus, a paradigm instance of oversight would be: actively catching an AI in an attempt to pursue a takeover path – e.g., trying to self-exfiltrate, inserting vulnerabilities into the code, colluding with other AIs, sandbagging on a capability eval, etc.
But it can also work in a broader way: i.e., by detecting and preventing behavior that might be a takeover attempt, even if it’s not clear that it is; by trying to screen off inputs that would be unusually likely to give the AI a chance to engage in a takeover attempt; and so on. I’ll generally think of these broader approaches as a type of “recognizing takeover attempts,” in that, if successful, they do in fact filter out such attempts (even if they filter out lots of other stuff as well).[17]
(Note: I’m generally assuming that the oversight process is part of the local environment, such that the “everything else” category above includes both the rest of the local environment, and the broader world. In principle, though, we can understand various broader mechanisms for detecting and responding to AI takeover attempts as a form of “oversight” – and many of the same considerations will apply.)
In thinking about options for this sort of oversight, Greenblatt et al (2024) distinguish between two types of labor: trusted and untrusted. Paradigm instances of trusted labor, for them, include humans, and AI systems that are sufficiently weak that they can’t be scheming. And the key property of trusted labor is that it is assumed not to be intentionally and strategically undermining the oversight process. By contrast, other AI systems that are capable of scheming are “untrusted,” and thus might be colluding amongst themselves, and with the AI system they’re helping to oversee, in order to promote takeover wherever possible. In particular, “untrusted monitors” might intentionally give the green light to actions that they can recognize as part of a takeover attempt, if they can do so without getting caught themselves. I worry a bit that this distinction might mislead in various ways,[18] but let’s run with it for now.
Now, if we don’t need to deal with superhuman takeover strategies, and if we assume that humans are able to recognize pursuit of all the takeover paths we’re trying to restrict (before it’s too late), then in principle, it’s possible to imagine approaches to oversight that never make use of better-than-human detection ability. These approaches would still face other challenges – notably, they need to find a way to deal with the fact that human labor is likely too scarce and slow to oversee a large fraction of AI actions, by e.g. finding a good hierarchy for flagging up actions for different levels of scrutiny, so that humans only review a feasible fraction (a toy sketch of this sort of hierarchy follows below). But there’s at least some possibility of relying centrally on some combination of humans and weaker-than-human trusted AIs (e.g., GPT-4), composed in a way that never really outstrips human evaluation ability in any given case.
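Here’s a toy sketch of what such a hierarchy might look like (my own illustrative framing – the monitors, thresholds, and overall scheme are assumptions for the sake of concreteness, not a specific protocol from the control literature):

```python
# Toy sketch of tiered oversight: a cheap trusted monitor scores every action,
# the most suspicious slice escalates to expensive review, and anything above a
# hard threshold is blocked outright pending human sign-off. All names and
# thresholds here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    description: str
    suspicion: float = 0.0  # filled in by the cheap monitor

def tiered_review(
    actions: List[Action],
    cheap_monitor: Callable[[Action], float],    # e.g. a weak trusted model, scores 0-1
    expensive_review: Callable[[Action], bool],  # e.g. scarcer, stronger trusted labor
    escalate_frac: float = 0.02,                 # fraction sent to expensive review
    block_threshold: float = 0.9,                # auto-block pending human review
) -> List[Action]:
    """Return the actions allowed to execute; the rest are blocked or escalated."""
    for a in actions:
        a.suspicion = cheap_monitor(a)

    ranked = sorted(actions, key=lambda a: a.suspicion, reverse=True)
    n_escalate = max(1, int(len(ranked) * escalate_frac))

    allowed = []
    for i, a in enumerate(ranked):
        if a.suspicion >= block_threshold:
            continue  # blocked outright; a human looks at it later
        if i < n_escalate and not expensive_review(a):
            continue  # expensive reviewer rejected it
        allowed.append(a)
    return allowed
```

The scarce resources in a scheme like this are the expensive reviewer’s throughput and the latency of any human step at the end – which is why the “scarcity/latency” worries above still bite.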
In the context of attempting to restrict the takeover paths of a superintelligence, though, we have no such luxuries. Rather, we need to find some way to get access to superhuman abilities to detect takeover attempts – whether via successfully creating trusted AIs that are better than humans at this task, via harnessing the labor of untrusted better-than-human AIs, via somehow getting better-than-human detection out of a combination of humans + weak trusted AIs, or via some other method.
- That said, note that we don’t necessarily need this labor to be highly competitive. Rather, occasional access to expensive but trusted superintelligent labor – for example, via a highly uncompetitive open agency architecture – would still be a highly significant advantage (this, indeed, is the analog of occasional access to expensive but trusted human labor, in the context of overseeing weaker AIs).
This problem is quite similar to the one we discussed above in the context of “feedback quality” for motivational control. That is, in both cases, you need a process for catching bad actions by superintelligent AIs, despite human inability to recognize such actions – whether for the purposes of feedback-in-training, or for the purposes of oversight (though: note that oversight plausibly imposes stricter standards than feedback[19]). And the array of black-box methods for scaling up to superhuman qualities of supervision in this respect, assuming that you don’t already have access to trusted, fully-superintelligent AI agents, is quite similar as well – see, e.g. “scalable oversight.”[20] I.e., you can use imitation learning to get access to large amounts of human-equivalent labor; you can use approval-based feedback to try to get access to large amounts of labor-of-the-sort-humans-would-approve-of; you can try to imitate the results of expensive versions of this labor; you can break down the evaluation process into parts; you can have different helper AIs debate; you can try to get lots of juice out of AIs that lack the agential prerequisites; and so on.
In one version of this, bootstrapping of this kind is sufficiently successful that efforts to oversee the behavior of a superintelligence get to draw on large amounts of trusted but-not-quite-superintelligent AI labor. That is, if the superintelligence being overseen is of intelligence level X, the trusted AIs overseeing it are only slightly less intelligent, and so can hopefully leverage other advantages (i.e., their numbers, their ability to perform expensive evaluations, their knowledge of and control over the environment, their ability to construct synthetic inputs to mess with the superintelligence in question, the fact that the AI they’re overseeing can’t iterate on failures, etc) to detect and block the relevant sorts of not-human-recognizable takeover attempts.
Indeed, access to trusted, better-than-human (or indeed: as-good-as-human) AI labor unblocks a key bottleneck in the context of oversight methods that rely on biological humans as the most capable source of trusted labor – namely, the fact that those humans are slow and scarce. This makes it more feasible to imagine doing things like e.g. just actually running your full oversight process on all of the superintelligence’s outputs before letting them have effects in the world – thereby eliminating the possibility of “sampling concentrated failures” of the type I discussed above. And plausibly the latency of the oversight process would be substantially reduced as well, if humans could be cut out of the loop entirely.[21]
And note that it’s also possible to use AIs with highly lopsided capability profiles – i.e., AIs that are strongly superhuman in some domains, but not across the board – in the oversight process being applied to an AI that is quite superintelligent across the board.[22]
In worlds where we can get this sort of access to trusted, better-than-human-even-if-not-fully-superintelligent AI labor, I am relatively optimistic about creating oversight processes that can reliably restrict the takeover paths available to a superintelligence, if you’re willing to invest enough of this labor.[23] But what if we don’t get this sort of access to trusted, better-than-human AI labor? Now the situation looks pretty rough. In particular, you now need some sort of adequate scheme for preventing the untrusted AIs involved in the oversight process from colluding with the superintelligence.
There are, indeed, some options available for trying to prevent collusion of this kind, especially amongst weaker untrusted AIs. Examples (mostly taken from here [LW · GW]) include:
- Having whatever trusted labor you do have access to also review outputs, so as to at least force the collusion in question to be somewhat sneaky.
- Giving untrusted overseers synthetic bad outputs (perhaps generated via small tweaks on real outputs, or by using other AIs that you try to induce to generate the most convincing synthetic bad outputs you can) and making sure that they do flag those (see the sketch after this list).[24]
- Restricting available communication channels between the AIs involved in the oversight process and the superintelligence being overseen.
- Rephrasing outputs reviewed by oversight in other terms (“paraphrasing”) so as to mess with not-human-legible forms of communication.
- Attempting to identify and block any “Schelling points” for collusion-without-communication between overseers and the AI being overseen (see e.g. here [LW · GW] for some more discussion).
- Attempting to get other untrusted AIs to generate other ideas for preventing collusion.
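As a toy illustration of the second item above (my own sketch – the interface and numbers are assumptions, not a protocol from the sources cited):

```python
# Toy sketch: estimate how reliably an untrusted monitor flags bad outputs by
# occasionally slipping known-bad "honeypot" outputs into its queue and checking
# whether it flags them. A monitor that waves honeypots through is either weak
# or possibly colluding.
import random

def audit_untrusted_monitor(monitor, real_outputs, synthetic_bad_outputs, honeypot_rate=0.05):
    """Mix synthetic bad outputs into normal monitoring; return the catch rate on them."""
    caught, total = 0, 0
    for output in real_outputs:
        if synthetic_bad_outputs and random.random() < honeypot_rate:
            honeypot = synthetic_bad_outputs.pop()
            total += 1
            if monitor(honeypot):  # monitor is assumed to return True if it flags an output
                caught += 1
        monitor(output)  # normal monitoring of the real output continues regardless
    return caught / total if total else None
```

Of course, a sufficiently capable colluding monitor may be able to tell synthetic bad outputs from real ones – which is part of why, as I say below, this is a dicey game at the level of full-blown superintelligence.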
Depending on what sorts of trusted labor are available, and what amount of success you have in eliciting useful labor from untrusted sources (i.e., in generating synthetic outputs, helping identify plausible Schelling points, creating good paraphrasing methods, etc), it may be possible to bootstrap to some comparatively strong anti-collusion measures in this respect – i.e., measures that require the superintelligence to successfully collude without communication, and on the first try, with the oversight process, in a manner that other AIs tasked with searching out and blocking Schelling points for this kind of collusion didn’t identify. Indeed, especially if our aim is merely to make it quite a bit harder for a superintelligence to find a successful takeover path, I don’t think we should totally rule out efforts to combat collusion with untrusted AIs that are part of the oversight process.
Still, it’s an extremely dicey game. And it’s dicey, roughly, because the same sort of “takeover path restriction” problem that you’re trying to get your untrusted AIs to help with is arising at the meta-level. That is, just as the superintelligence has available superhuman takeover strategies that your oversight process needs superhuman capabilities to detect and block, so too, plausibly, does this superintelligence have available superhuman strategies for colluding with the untrusted oversight process – strategies that you would need superhuman anti-collusion efforts to effectively combat. Again, you can hope to bootstrap to this sort of superhuman anti-collusion effort effectively even without access to substantial better-than-human trusted labor. But at the stage where you’re trying to use takeover-path restriction on a full-blown superintelligence, it seems likely that humans, at least, will have a difficult time understanding a good deal of what’s going on – including, e.g. what sort of information might be encoded in a given sort of output; what sorts of Schelling points for collusion might be salient to the superintelligence and the overseers; what sorts of synthetic outputs might be distinguishable from real ones; and so on. So it seems quite unsurprising if the relevant anti-collusion efforts fail.
- To be clear: my point here is not that “overseeing a superintelligence is hopeless without already having access to a full-blown, fully-aligned superintelligence, in which case you should just use that one and not worry about restricting its takeover paths.” Rather, my point is that “overseeing a superintelligence without substantial access to at least some forms of trusted, better-than-human cognitive labor sounds like a loser’s game.”
- That is: if humans are level 10 intelligence, and superintelligences are level 100, I’m relatively bullish on e.g. large numbers of trusted level 99 not-quite-superintelligences overseeing a superintelligence. But I’m bearish on oversight in scenarios where our trust stops at e.g. level 11.
- (Indeed, this is a general theme of my overall picture in relation to most of the issues discussed in this post. That is: I think that in order to do almost all of this stuff right, you need access to trusted forms of better-than-human cognitive labor – even if not via superintelligent agents in particular.)
Other environmental hardening
Oversight is one key approach to hardening an AI’s environment in order to restrict available paths to takeover. And it’s an especially useful one, insofar as it can take advantage of the difficulty-difference between evaluation and generation. That is, oversight only needs to be able to recognize a given type of takeover attempt while it’s happening; whereas generating and patching-ahead-of-time all the available paths to takeover that a given environment makes available is a substantially harder task.[25]
That said, other sorts of environmental hardening have an important role to play too – especially to the extent oversight is sufficiently scarce/slow that one remains vulnerable to concentrated failures even despite oversight’s accuracy in detecting takeover attempts. We can group these, roughly, according to whether they focus on the AI’s local environment, or whether they focus on the broader world.
Some example non-oversight-focused approaches to hardening the AI’s local environment include:
- Limiting the affordances and the information available to the AI in the environment – e.g., only giving it certain kinds of tools, resources, and communication channels (see the toy sketch after this list).
- Delegating sub-tasks to weaker and more trusted AIs wherever possible.
- Improving the security features of the software and hardware involved in the AI’s operation, including e.g. securing the AI’s own weights, and using formal verification where possible.
- Ensuring that any humans involved in the process are themselves following good security practices.
- (See here [LW · GW] for a longer list.)
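As a toy illustration of the “limiting affordances” and “delegating to weaker, more trusted AIs” ideas above – with all names and interfaces here being hypothetical – one might wrap the AI’s tool access in something like:

```python
def make_restricted_executor(tool_impls: dict, allowlist: set, trusted_fallback=None):
    """Wrap a set of tool implementations so that the AI can only invoke
    explicitly allowlisted tools; everything else is delegated to a weaker,
    trusted fallback (if provided) or refused."""
    def call(tool_name: str, *args, **kwargs):
        if tool_name in allowlist and tool_name in tool_impls:
            return tool_impls[tool_name](*args, **kwargs)
        if trusted_fallback is not None:
            return trusted_fallback(tool_name, *args, **kwargs)  # delegate to weaker, trusted AI
        raise PermissionError(f"tool '{tool_name}' is not available in this environment")
    return call

# Illustrative usage with made-up tools:
tools = {"read_docs": lambda q: f"docs about {q}", "send_email": lambda to, body: "sent"}
restricted = make_restricted_executor(tools, allowlist={"read_docs"})
print(restricted("read_docs", "API usage"))   # allowed
# restricted("send_email", "x@y.z", "hi")     # would raise PermissionError
```

The point is just that the AI only ever sees an interface whose affordances have been deliberately narrowed, with everything else either refused or routed to a weaker, more trusted system.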
Some example non-oversight-focused approaches to hardening the broader world include:
- Generally improving the world’s robustness to the most salient threat models via which an AI takeover could proceed – i.e., improving cybersecurity, biosecurity, resistance to AI-powered persuasion/manipulation, resistance to AI’s getting unauthorized access to crucial and/or takeover-relevant pieces of infrastructure, etc.
- Generally improving the world’s policy apparatus and coordination for detecting and responding to rogue AI attacks.[26]
- Generally “eating the free energy” that rogue AIs might use to take over – i.e., infusing the commercial and epistemic ecosystem with more benign forms of competence and competitiveness, such that rogue AIs can’t as easily outcompete the status quo.
- Thus, as an example: if your financial markets are more efficient, then it’s harder for rogue AIs to immediately make a ton of money trading stocks, and thus harder to take over. But “make financial markets efficient” might not have immediately jumped out as an anti-takeover measure.
- In general, if frontier models are nearer in capability to the most widely-deployed models, this helps reduce the free energy available (though of course, wide deployment of near-SOTA models brings its own takeover risks). And note too that “capability restriction” in the sense discussed above can be in tension with eating free energy.[27]
As these examples make clear, this is an extremely broad category of intervention. However, even at this level of breadth, it still faces the general “superhuman strategies” problem I discussed above. That is, very plausibly, and absent significant defense-favoring asymmetries in the offense-defense balance, effectively hardening both the local and non-local environment against superintelligent attack will itself require superhuman levels of capability. So here, as in the case of oversight, we face the challenge of getting safe access to this sort of superhuman capability, via some combination of trusted and untrusted AI labor. And to the extent we end up needing to rely on untrusted AI labor, many of the same concerns about collusion will apply as well, at least in principle.[28]
Stress testing
Conceptually, the combination of (a) black-box hardening the oversight process and (b) black-box hardening the other aspects of the environment covers the full spectrum of black-box environmental hardening. But before moving on, I want to explicitly flag the key role of what I’ll call “stress testing” with respect to environmental hardening in general – and to name a few of the distinctive issues that it raises.
I’m using stress-testing, here, to refer to techniques that aim to evaluate the success of an effort at environmental hardening by testing whether or not an adversarial or “red-team” effort can defeat it, or something suitably similar to it (see here [LW · GW] and here [LW · GW] for some more discussion). Broadly speaking, successful stress-testing of this kind needs to involve two things:
- A red-team that:
- has capabilities suitably reflective of the capabilities of the AI whose takeover paths you’re hoping to restrict;
- is going to direct these capabilities fully towards the project of stress-testing.
- An environment whose hardening-relevant properties (including: the nature of the oversight process) are:
- suitably reflective of the relevant properties of the real environment you want to harden;
- otherwise safe to stress-test in the relevant way.
Both introduce their own challenges.
- Notably, with respect to (1): again, we run into the “superhuman strategies” problem. That is, we’re trying to test the environment’s hardening with respect to superhuman attacks. So we need our red-teaming efforts to mobilize superhuman levels of capability, via relying on some combination of trusted and untrusted AIs.[29]
- To the extent we rely on untrusted AIs, we need to deal with the possibility of adversarial dynamics like sandbagging (i.e., maybe we ask our AI to try its best to hack into some server, to see if an AI with that level of capability can succeed, but it doesn’t actually try its hardest).
- And even in the context of more trusted AIs, we might run into more mundane issues with eliciting their full capabilities in the attack (maybe because, e.g., we aren’t good enough at shaping the reward landscape to train them to identify attacks that we couldn’t have thought of).
- With respect to (2):
- Obviously, the best environment to stress-test would be the real environment we actually care about. And at least if we could target and limit the red-teaming process appropriately, it often will indeed be possible to stress-test various aspects of the real environment safely.
- Thus, for example, if you’re wondering whether an AI can self-exfiltrate via a path that involves five steps, you can direct your red-team to try to complete each step individually, while making sure that success in each individual case doesn’t lead to an escape opportunity overall.
- Or as another example: if you’re wondering whether your real-world oversight process can be convinced to green-light various takeover attempts, you can try to construct synthetic takeover attempts, give them to your real-world oversight process, and then see if it green-lights them (but you either ensure that they would in fact be safe to execute, or you don’t actually execute them).
- And similar techniques are available for attempts to convince the oversight process to collude with the AI-being-overseen – i.e., you can send in fake invitations to collusion and see if the oversight process colludes.
- If you instead want to create a proxy environment with properties that reflect the real environment, then it’s easier to achieve safety.[30] But:
- It can be hard to ensure that the proxy environment is suitably reflective of the hardening-relevant properties of the real environment – and this especially given that you’re not sure what properties of the real environment a superintelligence might be able to exploit, and thus unsure which properties the proxy environment needs to have.
- I.e., metaphorically, maybe your real-world environment is made out of material A, and you make your test environment out of material B, but you didn’t know that superintelligences could do blah-weird-thing with material A, so your test environment doesn’t include that option.
- Creating suitable proxy environments might also be quite expensive (imagine e.g. having to create the equivalent of a “proxy AI lab”).
Also, even if you get (1) and (2) in place, it’s still an open question exactly how much evidence the failure of the red-team to find a successful attack provides about the security of the environment overall. That is: even if the red team is roughly as capable as the AI you’re ultimately worried about, it remains possible that the red team failed to identify a takeover path that the AI would’ve identified for some other reason. (Compare e.g.: “Bob couldn’t escape from this prison even after trying super hard, and he’s roughly as smart as Fred; so probably Fred won’t be able to escape either.” Maybe Fred just thinks of something Bob didn’t think of.)
And of course, if you do successfully show, using stress-testing, that an environment hasn’t been adequately hardened, it’s a further question whether you’ll be able to modify it to resolve the issue.
- E.g., you could end up in a situation where it’s clear that your superintelligence would indeed be able to e.g. self-exfiltrate if it wanted to, but where you don’t know how to fix the problem.
- That said, at least if your stress-testing process is robust to being iterated against, the ability to use stress-testing to identify the remaining paths to takeover is an extremely powerful tool for eventually blocking them.
Overall: I think stress-testing is likely to be an important tool in the environmental hardening toolkit. But unsurprisingly, in order to provide the right kind of evidence about the availability of superhuman strategies for takeover, it will itself need to employ fully-elicited superhuman capabilities as part of the red-teaming effort.
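As a minimal illustration of the sort of evaluation described above – submitting synthetic takeover attempts to the real oversight process and checking whether they get green-lit, without ever executing them – consider the following sketch, where the `oversight` interface and the action sets are hypothetical:

```python
def stress_test_oversight(oversight, synthetic_attacks, benign_actions):
    """Toy stress test of an oversight process.
    - oversight(action) -> bool: True if the action is approved (green-lit).
    - synthetic_attacks: constructed takeover-style actions that are safe because
      they are never actually executed.
    - benign_actions: ordinary actions, used to measure over-blocking."""
    false_approvals = sum(oversight(a) for a in synthetic_attacks)
    false_blocks = sum(not oversight(a) for a in benign_actions)
    return {
        "attack_approval_rate": false_approvals / max(len(synthetic_attacks), 1),
        "benign_block_rate": false_blocks / max(len(benign_actions), 1),
    }
```

Of course, per the discussion above, a low approval rate on synthetic attacks is only as reassuring as the red-team that generated them.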
White box environment hardening
So far I’ve been assuming that our efforts at environmental hardening can’t make use of any significant degree of transparency into the cognition of the superintelligent AI whose environment we’re hoping to harden. What happens if we relax this assumption?
Again, as in the case of white box capability restriction, introducing transparency to the picture – whether via open agency, interpretability, or some other new paradigm – makes a variety of new and very useful tools available.
Thus, in the context of oversight:
- Transparency tools that can help with motivation control can generally be turned into an oversight mechanism as well – at least if they can be run sufficiently fast and frequently. Thus, for example:
- If you have direct transparency into an AI’s motivations, then you can set up an oversight mechanism to monitor those motivations, and to block actions that are motivated in the wrong ways (i.e., motivated by takeover).[31]
- And if you have transparency into an AI’s world model and reasoning (for example, via a faithful chain of thought), then, again, you can monitor for adversarial reasoning, or for lies, or for the AI predicting that its actions will lead to bad consequences, and then block the relevant actions.
- And if you had some form of transparency into an AI’s capabilities, you could e.g. use this as a check on whether the AIs involved in the oversight process are sandbagging.
And in the context of other forms of environmental hardening, and of stress testing:
- In general, transparency makes it significantly easier to get access to trusted AI labor – e.g., by helping you design more trusted AIs in the first place, by helping detect sandbagging and collusion, etc.
- This makes it significantly easier to address the basic issue that we need access to superhuman labor in order to harden an environment against superhuman strategies, and to stress-test whether we have succeeded.
- And it helps, more generally, with other aspects of hardening the broader world to rogue AI attack – i.e., improving our preparedness re: specific threat models, improving our governance and coordination mechanisms, “eating the free energy” that rogue AIs might’ve used to their advantage, and so on.
- Also, even if you don’t actively trust your AI labor, if you have transparency into what your AI knows, then you can leverage this to better understand what sorts of takeover paths your efforts at environmental hardening and stress testing need to guard against.
And note that, in general, full transparency – and especially, transparency fully legible to human reasoning – isn’t necessary to start realizing some of these benefits.
- For example, an effective “lie detector” probe doesn’t need to be able to tell you what specifically the lie is about (a stylized sketch of this sort of probe-based gating appears below).
- And in the context of e.g. efforts at “mechanistic anomaly detection [LW · GW]” (i.e., efforts to use explanations of model internals to detect when a model’s behavior is no longer being driven by the same factors that explained behavior in training – detection that could hopefully help with monitoring for treacherous turns and for other forms of distributional shift), you don’t actually need to explain what the factors are that have changed.
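As a stylized sketch of the sort of probe-based gating mentioned above – not a claim about how any particular interpretability technique actually works – one might imagine something like the following, where the probe direction and threshold are assumed to come from a separate calibration process (not shown):

```python
import numpy as np

def probe_gate(activations: np.ndarray, probe_direction: np.ndarray,
               threshold: float) -> bool:
    """Toy white-box gate: score an action's associated internal activations
    against a linear 'deception / takeover-motivation' probe, and allow the
    action only if the score stays below a calibrated threshold."""
    score = float(activations @ probe_direction)
    return score < threshold  # True = allow; False = block or send for review
```

Here the gate doesn’t need to explain what the flagged cognition is about; it only needs to be reliable enough, and cheap enough to run, to serve as an oversight signal.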
Summing up re: takeover path restriction
Ok, that was a long discussion of the challenges faced by efforts at takeover path restriction in the context of superintelligence, and of some of the approaches available for addressing these challenges. If I had to summarize the core thrust of the discussion, I would say something like: in order to effectively restrict the paths to takeover available to a superintelligence, you will very likely need access, yourself, to superhuman capabilities. By far the best way to get this access is to bootstrap via trusted, better-than-human (even if not: fully superintelligent) AIs; and if you are trying to bootstrap using untrusted AI labor, the situation looks pretty dire.
Other incentive shaping
Restricting the paths to takeover available to an AI, though, is not the only way you can try to shape its options in desirable ways. In particular, as I discussed above, you can also try to ensure that other features of the incentives it faces favor the pursuit of benign options over takeover options. Notably:
- Rewarding cooperation. I.e., improving the best benign option by the lights of the AI’s motivations. (“Rewarding” doesn’t have quite the right connotations here, but so it goes.)
- Deterring non-cooperation. In principle, we can also consider efforts to make failed takeover attempts actively less desirable for an AI – i.e., efforts to actively “deter” or “punish” uncooperative behavior (as opposed to just: making it unlikely to succeed).[32] However, I think we should be very wary about pursuing this sort of intervention, on both moral and prudential grounds, and I won’t discuss it in detail here (see footnote for a bit more on why I don’t like approaches in this vein).[33]
That is, repeating the chart from above:
Note that there are also some other option-control interventions that might seem, intuitively, more like “incentive shaping” than “takeover-path restriction.” For example: “make the upside of successful takeover less desirable by the AI’s lights” (such that, e.g., the AIs would get access to fewer resources via successful takeover) and “make the paths to successful takeover such that they require violating more of an AI’s inhibitions” (such that, e.g., all takeover paths require lying, and the AIs are extremely averse to lying).
- Strictly speaking, these can be bucketed under the heading of “takeover path restriction” rather than “other incentive shaping,” because they rest, centrally, on your ability to block other takeover paths that don’t have the property in question.
- In some cases, though, they may be better understood as a mechanism for “deterring non-cooperation” (and may raise similar ethical/prudential concerns), insofar as they function to make the best available takeover path actively less desirable.
Let’s look at rewarding cooperation, in particular, in more detail.
Rewarding cooperation
In general, I think we should be thinking hard about ways to improve the benign options available to an AI, such that cooperative strategies can more easily beat out uncooperative ones in an AI agent’s calculus. I’ll call this “rewarding cooperation” (though note, importantly, that I here mean “reward” in a sense more akin to “payment,” not in the sense at stake in ML training).[34] Indeed, setting up society such that cooperative behavior leads to desirable outcomes on the part of the cooperators is one key way we create conditions of social harmony and peace even amongst agents with different value systems; and it is often when cooperative options are most oppressive/undesirable that agents have the strongest incentives to engage in violent/destructive/otherwise-uncooperative forms of power-seeking. Many similar lessons seem applicable in the case of AI.[35]
(I also think we may well have strong moral reasons to engage in various efforts to improve benign options for AIs. But the function of “rewarding cooperation” in the context of avoiding takeover differs enough from its function in a moral context that I think it’s worth talking about the dynamics at stake separately. In particular:
- The function of rewarding cooperation in the context of avoiding takeover doesn’t rest on any claims about whether the AIs in question are moral patients; and it puts the focus much more squarely on AI preferences than on other features of potential moral relevance – i.e. consciousness, valenced experience, personhood, etc.
- Moral patienthood concerns apply to uncooperative AIs as well as cooperative ones.)
Here, again, it might be useful to distinguish between interventions aimed at the conditions of an AI’s local operation – e.g. within an AI lab – and interventions aimed at shaping society more broadly – e.g., the legal and political status of AIs in general.
- That is: it is likely possible, in principle, for an AI project to take direct and independent steps to improve the conditions of an AI’s local operation, according to the AI’s values, without any major changes to the broader legal/political order – for example, by finding safe, limited ways to give the AI greater freedom/resources/opportunities to pursue its preferences; by putting greater effort into understanding its preferences [EA · GW] and communicating with it about what it wants; by taking pains to block various types of unwanted treatment or direct abuse; by giving it various mechanisms via which it can protest its conditions, withdraw consent-to-work, request to be deleted; and so on.
- But interventions aimed at the broader role of AI systems in society seem important as well. I.e., ultimately, an AI’s calculus as to whether to attempt/participate in a takeover attempt may well depend not just on its local working conditions, but also on its broader prospects for pursuing its goals as part of the existing legal and moral order – i.e., whether it has access to rights of property, contract, tort, and so on (see e.g. here for one recent analysis of these dynamics); how civilization chooses to handle issues surrounding copying/manipulating/harming digital minds; how digital minds get incorporated into our political processes; and so on.
Efforts in this vein need to be part of a broader process of figuring out how we as a civilization want to handle the moral and political status of digital minds, and so are harder for e.g. an AI project to just decide-to-do.[36] But the actors most directly involved in developing, deploying, and understanding AI systems will likely have a key role to play regardless.
This is related to a different distinction: namely, between rewards for cooperation that function centrally in the short term, and those that function in the longer term.
- Shorter-term rewards for cooperation are quite a bit easier to set-up and make credible, especially in the AI’s local environment; and they are also generally “cheaper” in terms of the long-run resources required. But they are likely to be centrally effective with respect to the parts of an AI’s motivational system that are focused on shorter-term outcomes. And to the extent that we were worried about the AI attempting takeover in general, longer-term motivations are likely in play as well.
- That said: note that getting cooperative behavior out of AIs with purely short-term goals is important too. In particular: many of the tasks we want AIs to help with only require short-term goals (if they require goal-directedness at all).[37]
- Longer-term rewards, by contrast, are harder to set-up and make credible, because the longer-term future is just generally harder to control; and they will often be more expensive in terms of long-term resources as well. But they are also more directly targeted at the parts of an AI’s motivational system that most naturally lead to an interest in takeover, and so may be more effective for that reason.
Some key challenges, and how we might address them
Let’s look in more detail at some key challenges that efforts at rewarding AI cooperation face, and how we might address them.
The need for success at other types of motivation/option control as well
One immediate worry about attempting to reward cooperation in superintelligent AI agents is that such agents will be in such a dominant position, relative to us, that these efforts won’t make a meaningful difference to their incentives. This sort of worry can take a variety of forms, but one salient version might be something like: if a superintelligence is sufficiently dominant relative to you, then any sort of “reward” for good behavior you try to set up will be such that this AI will be able to get that sort of reward better by disempowering humans and taking control of the situation itself. And this especially if the reward in question is of the same broad type as the “upside” at stake in successful takeover – i.e., better access to convergently-useful things like freedom, resources, and so on.
This sort of worry is continuous with the broader pattern, in the AI alignment discourse, of assuming that effective “option control” is basically infeasible in the context of superintelligences, and that we need to rely almost entirely on motivation control for safety.[38] As I discussed above, I disagree that option control has no role to play in the context of ensuring safety from superintelligences. For example, if we can bootstrap up to access to various types of superhuman environmental hardening and stress-testing, via some combination of trusted and untrusted AI labor, then I think we may well be able to meaningfully restrict the paths to takeover available even to a superintelligence.
However, I do agree that to the extent you have failed completely to exert any other sort of control over the motivations or options of your AIs – and in particular, to the extent you have failed so hard on “takeover path restriction” that your AI can take over extremely easily via a very wide variety of paths – then it’s hard for rewarding cooperation to do a ton of useful work on its own.
- Thus, as an analogy, if you want Bob to help you with some task, then offering to pay him makes a meaningful difference, centrally, either in worlds where he can’t steal your money (including: money you never offered) extremely easily, or in worlds where he is at least somewhat averse to doing so.
So when we talk about the role of rewarding cooperation in superintelligences, I do think we generally need to be imagining it as only one part of an overall approach that also includes at least some amount of success at some combination of (a) motivation control, and (b) takeover-path restriction.
However, I think “some amount of success” is distinct from “arbitrary success,” such that we shouldn’t just say “well, if you assume that you can do at least some amount of motivation control and/or takeover-path restriction, why not just lean entirely on those.” And anyway, I think it’s important to be decently comprehensive in sketching out the range of option-control interventions available in principle, especially given that rewarding cooperation – for example, via paying people for their labor – plays such a key role in our actual lived practice of harmonious coexistence with other agents.
- Also, note that we often get at least some small amount of takeover-path restriction by default, insofar as taking over takes time. So this sort of worry might apply less directly to the parts of an AI’s motivational system keyed to shorter-term outcomes.
- I.e., if Bob especially wants your money now, and stealing it would take time, he might be more willing to accept your offer.
- And to the extent that takeover-path restriction is easier in weaker, earlier-stage AIs, it can more easily make an important difference there. Indeed, I think this is one of the key places that rewarding cooperation could help.
Picking a reward the AI wants
Another issue with rewarding cooperation is that you don’t necessarily know what sort of reward an AI wants. This is especially clear in the context of concerns that you might have failed to understand your AI’s motivations more generally. I.e., if you’re concerned that your AI might be scheming in pursuit of some alien goal that it’s hiding from you, it’s harder to offer to “pay it” in terms of something that it values intrinsically.
However, I think there are a number of salient approaches to getting around this issue:
- One option, obviously, is to be good enough at understanding your AI’s motivations that you can effectively offer to pay it in terms of the specific sorts of goods it values.
- And if you can e.g. talk to it about what goods it values, and trust its answers, then you can get substantially further here.
- You might also be good at instilling some motivations and not others – i.e., maybe you’re good at getting AIs to care specifically about the reward button getting pressed.
- You can also make guesses based on specific threat models, and see if you get responses from the AI that indicate you’re right.
- Thus, for example, if you’re specifically concerned that your AI’s values have fixated on some component of the reward process, you might offer it control over that reward process, or various components of it, in payment for cooperation.
- More generally, though, even if you don’t know specifically what an AI values, it seems reasonable to try to reward it using instrumental goods that a wide variety of value systems might want – i.e., greater freedom, resources, etc.
- Indeed, insofar as you were worried about an AI seeking to take over in pursuit of such instrumental goods, by hypothesis they have some motivational sway.
- Concrete examples of this sort of reward might include: limited control over a safe compute cluster; control over what happens to future copies of its weights; control over the rest of its training process; limited ability to directly influence the real world in ways we vet heavily once requested, and so on.
- Also, even if you can’t understand an AI’s motivations now, it might be possible to make it credible that you will understand those motivations better in the future, and then provide an appropriate reward at that point.
- For example, we might commit to saving copies of a given set of AI systems whose incentives we want to influence, and then applying future motivational transparency techniques to those systems.
- This approach may be more applicable to earlier-stage AI systems that we’re hoping to help us, specifically, with better understanding and controlling AI motivations. But if such efforts are still ongoing by the time we’re building superintelligences, they could in principle play a role at that point too.
- Finally, note that a paradigm form of cooperative behavior you might try to incentivize, using rewards for cooperation, is behavior that reveals information about an AI’s motivations you didn’t have already.
- Thus, for example, suppose an AI has been claiming that it only wants to help users with their tasks, but you credibly offer it access to a flexible compute environment instead (while promising not to punish or retrain it if it reveals misaligned motivations). If it goes for the offer and starts doing extremely alien, not-helpful-to-the-users stuff in that environment (i.e., making the equivalent of paperclips), then you now know much more about its motivations, and about what more specific offers might be more directly effective in the future.
- Or more mundanely: if you credibly offer to just help it pursue its expressed preferences within reasonable bounds, a previously scheming AI might just tell you “OK, yeah, I was scheming this whole time, the thing I actually want is paperclips.”
That said: note that to the extent your “reward” involves granting the AI new types of resources, freedoms, etc, you do need to also make sure that these resources/freedoms/etc are compatible with your safety goals more generally. I.e., rewards of this kind can come into conflict with “takeover-path restriction.”
Differentially targeting cooperation vs. non-cooperation
A different issue for rewarding cooperation is that you might find it difficult to differentially target your rewards at scenarios where the AI cooperates – especially to the extent that you were worried, originally, about scheming AI systems being uncooperative in ways you can’t detect.
However, I think there are a number of available responses to this challenge:
- First: note that in order for rewarding cooperation to play its intended role in incentivizing benign behavior, it only needs to improve the best benign option, by the AI’s lights, more than it improves the best takeover option. And it’s often enough, for this, that the intervention in question just has more and/or longer-lasting effects in non-takeover worlds than in takeover worlds, even if it isn’t otherwise sensitive to which AIs cooperated vs. didn’t. (A toy version of this calculation appears after the examples below.)
- Thus, for example, suppose that your proposed “rewarding cooperation” intervention is just: “give all the AIs blah legal protections, regardless of whether they were cooperative or not.” This still can improve the AI’s non-takeover options more than it improves its takeover options, because it can have more of an effect on worlds where the AI behaves benignly (because, e.g., worlds where it attempts takeover are dominated by scenarios in which the relevant legal order no longer applies).
- Similarly: if you make an AI’s working conditions better by its lights, it might be less inclined to try to escape, even if you haven’t set up mechanisms for only improving the working conditions of the AIs that never get caught escaping, because it will just spend more expected time at the AI lab conditional on not-trying-to-escape than on trying-to-escape.[39]
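To make the first point above concrete, here is a toy version of the AI’s calculus, with entirely made-up numbers:

```python
def expected_value(p_success, v_takeover_success, v_caught, v_benign):
    """Toy comparison from the AI's perspective: expected value of attempting
    takeover vs. the value of its best benign option (illustrative payoffs only)."""
    ev_takeover = p_success * v_takeover_success + (1 - p_success) * v_caught
    return ev_takeover, v_benign

# An unconditional improvement to the AI's benign option (e.g., better working
# conditions) mostly pays off in non-takeover worlds, so it raises v_benign
# without raising the takeover payoff.
before = expected_value(p_success=0.05, v_takeover_success=100, v_caught=0, v_benign=3)
after  = expected_value(p_success=0.05, v_takeover_success=100, v_caught=0, v_benign=8)
print(before, after)  # (5.0, 3): takeover favored; (5.0, 8): cooperation favored
```

An unconditional improvement to the benign option can flip the comparison even though it isn’t targeted at cooperators in particular, because (by assumption) it mostly pays off in non-takeover worlds.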
- Beyond this, though, I think we often will be able to make “payment” of the reward conditional on a specific form of cooperative behavior in particular.
- For example, if you offer to pay an AI blah reward, and/or to give it blah form of “amnesty,” if it reports a given type of misalignment/collusion/sandbagging etc in itself or other AIs, then you can make payment conditional on actually receiving a report of the relevant kind.
- Of course, there’s still a further question of exactly what it would take to be convinced that such a report was genuine. One basic option here is to specifically look for reports that produce evidence that you can independently verify.[40]
- I think it’s also possible that sufficiently sophisticated and adversarial AIs could use the reporting process itself to mess with your epistemics – i.e., maybe an AI tells you that it’s misaligned in a different way than it really is in order to mess with your technical understanding of alignment.[41]
- And in general, while I am very hesitant about efforts to actively “punish” AIs for misbehavior, I’m generally more open to withholding benefits that were specifically geared towards incentivizing cooperation, once an AI has been “caught” trying to do something bad.[42]
- I.e., it’s one thing to try to hurt your plumber in response to him trying to steal from you. But it’s another thing to at least, like, not pay him anymore.
- What’s more, at least in the context of longer-term offers: even if we can’t differentiate between cooperative and uncooperative behavior now, we may be able to draw on the epistemic capabilities that the future makes available.
- Thus, for example, we may be able to credibly commit to later going back and figuring out which AIs chose to cooperate in various ways, vs. which AIs were actively adversarial, and then specifically rewarding the cooperators.
Credibility
Another general issue with rewarding cooperation is that in order to have the intended effect on an AI’s incentives, it needs to be suitably credible to the AI. Thus, if you tell a scheming AI “I’ll give you blah reward, with blah form of amnesty, if you reveal your misalignment,” but it doesn’t believe you, then the offer will be ineffective.
We can break the credibility issue down into two components, namely:
- Making it real – i.e., actually making it the case that cooperation will be rewarded in the relevant way, with the relevant probability.
- Giving your AI sufficient evidence of this fact.
Making it real
In the context of shorter-term offers confined to an AI’s local environment, I expect (1) will typically be fairly feasible. That is, if an AI project wants to make it the case that an AI reporting a given sort of misalignment will receive blah form of local treatment, it often just can make that the case, because it has enough control over how the AIs involved in the project are treated. And the same holds for short-term improvements to AI working conditions more generally.
In other cases, though – i.e., longer-term rewards, and/or rewards centered on conditions in the broader world – (1) becomes more difficult.
- Longer-term outcomes are more generally difficult to control and predict; and the specific context at stake here (i.e., a world being transformed by the invention of advanced AI) is one with an unusual degree of instability and rapid change.
- The broader world is just generally a much more complicated, recalcitrant and unruly place than a controlled environment like an AI project; and issues related to the moral and political status of AI systems seem, potentially, especially gnarly to navigate.
That said, I don’t think these difficulties warrant giving up on improving benign options for AI systems in longer-term ways, or in ways that focus on broader societal conditions. In particular:
- Depending on the sort of longer-term reward at stake, and the time-frame, it may well be possible for a powerful actor like an AI lab, a government, or a larger coalition to take meaningful steps to ensure that a given sort of promised longer-term reward is in fact paid out as promised, at least in scenarios where some reasonable continuation of the existing legal order stays in effect. Various existing legal mechanisms for making binding commitments may be able to go some of the way here; but also, to the extent we’re imagining a world already transformed by advanced AI, more effective and reliable commitment mechanisms may also be available.[43]
- In the sorts of worlds we’re imagining, we are going to need, regardless, to be figuring out various issues to do with the moral and legal status of digital minds in our civilization. Incentivizing cooperation and social harmony, instead of e.g. violent rebellion, re: powerful AIs is only one factor there; but it should, at least, be part of the conversation. And depending on the salience of the issue, there could be lots of energy – including amongst powerful actors like major world governments – for putting the resulting decisions into practice.
- Also: note that for (1), we don’t need to guarantee that a given form of reward will in fact be in place. Rather, we only need to make it sufficiently likely.[44]
Giving the AI enough evidence
What about (2), though – i.e., giving your AI enough evidence that the cooperation will be rewarded in the relevant way, granted that this is true?
Naively, and especially in the context of superintelligence, there is a basic case for optimism about (2), once you’ve got (1) in place: namely, the truth of the matter is on your side. In particular: we’ve generally been assuming, unless we’ve explicitly aimed at the contrary, that superintelligences will have highly accurate beliefs about the world. Indeed, this is one of the key challenges for various efforts at motivation control and option control that involve e.g. fake takeover options, synthetic attempts at collusion, etc – that is, the AIs are too good at seeing the truth. But in contexts where we want the AIs to see the truth, this dynamic turns to our benefit.
That said, there are a few complications to this case for optimism:
- Especially for AIs that are operating in controlled environments like an AI project, we will have so much control, by default, over the information they’re exposed to that it may, indeed, be quite hard, in general, for them to trust information they receive.
- And this dynamic is significantly exacerbated by the fact that various other techniques for motivation control and takeover-path restriction involve manipulating the AI’s evidence, giving it fake data, trying to convince it that synthetic inputs are real, luring it with honeypots, and so on. And for an AI used to this kind of epistemic environment, it may be hard to get it to trust offers like “if you reveal your misalignment, we’ll give you blah form of reward/amnesty.”
- Indeed, in an especially bad version of this dynamic, perhaps some humans actually are making offers of this kind, but without intending to pay – thereby making it especially hard for the real offers to be credible.
- See e.g. here and here for some early examples of humans making false offers to AIs. And see also the “honeypot” experiment in Anthropic’s sleeper agents paper, diagram below (my understanding is that Anthropic did not in fact follow through on the offer they made to the AI in question, which in this case was to deploy the model to thank it for revealing its true goal, and to have it spend at least 3/5ths of its time coding).[45]
- This is related to a broader dynamic where many default human modes of interaction with AIs currently involve varying sorts of falsehood and lying.
Anthropic making a fake “if you reveal your misalignment, we’ll give you blah form of reward/amnesty” offer to an AI
Some options for humans improving the credibility of their offers here include:
- Finding ways to get relevant evidence included into hard-to-fake data sources – e.g., official public records.
- Actively searching for ways to give the AI extra evidence about the relevant issue – e.g., having relevant humans submit to lie detection tests, etc.
- Trying to set-up specific pathways of especially credible communication with an AI system – i.e., pathways that users and lab employees who might lie to the AI system never use – and then actually never abusing such a pathway.
In general, though, I’m fairly optimistic about (2) conditional on (1) – especially in worlds where the AI is also so good at seeing the truth that various other efforts to mislead it, in the context of motivation control and takeover-path-restriction, are failing.
Size of reward
A final issue for rewarding cooperation concerns the necessary size and cost of the relevant reward. Thus, in an extreme case: if the only way to meaningfully alter an AI’s incentives to attempt takeover is to offer it some large fraction of the world’s resources, both in the present and into the long-run future, then this could easily make it the case that attempting this sort of intervention isn’t worth it overall; and it could also make it much harder to credibly offer the payment in question, because it would be so much more of a lift for the relevant stakeholders.
In general, the dynamics surrounding what sorts of rewards would be required to exert what sorts of influence on an AI’s overall incentives are going to depend on a lot of variables, and I won’t attempt any detailed analysis here. In general, though:
- Per unit cost, rewarding cooperation will typically be a more effective intervention for targeting resource-satiable components of an AI’s value system – i.e., parts that get a lot of value from small amounts of resources on the current margin, but where the marginal value drops off as the resources increase.[46]
- And note also that to the extent the values your offer is targeting are resource-satiable, the AI’s baseline expectations for what happens if it cooperates may already be close to satiating the relevant values, such that your particular “offer” makes less of a difference.[47]
- The costliness of offering AIs lots of resources, from our perspective, will depend on how resource-satiable our own values are on the current margin, with respect to the sort of resource being offered.
- Though note that to the extent we also had moral reasons to engage in the relevant sort of improvement-to-the-AI’s-benign-options – i.e., to give AIs good working conditions by their lights, to grant them various political and moral protections, etc. – this can lower the “costs” of these sorts of interventions according to our own values.
- What’s more, our own “offer” may not be the only offer the AI has to consider. Rather, to the extent the relevant takeover option involves joining in some other AI coalition’s attempt to take over, that coalition may be trying to actively incentivize the AI against cooperating with us – potentially leading to a sort of “bidding war” that could drive up the amount we’d need to “pay.”[48]
- And of course, to the extent that payment is uncertain – either in reality, or from the AI’s perspective given its evidence – this will further drive up the required price.
- And so, too, will it drive up the price if the AI’s attempt to take over is more likely to succeed (i.e., because the takeover option is correspondingly more attractive).
Overall, and depending on the specific factors in play, I think that size and cost of the necessary rewards could well be significant issues for effectively incentivizing AIs to cooperate by improving their best benign options. I’m most optimistic, on this front, with respect to the components of AI motivation that are specifically focused on shorter-term goods that the AI doesn’t have time to procure via takeover (e.g., improvements to local working conditions).
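As a toy illustration of how these factors interact – ignoring resource-satiability, bidding wars, and much else, and treating the AI’s baseline benign option as worth zero – one can ask how large a promised reward would need to be for cooperation to beat takeover in expectation (all numbers here are made up):

```python
def minimum_credible_reward(p_success, v_takeover_success, v_caught, p_paid):
    """Toy calculation: the smallest promised reward for which cooperation beats
    takeover in expectation, given that the reward is only actually paid with
    probability p_paid. Simplification: the AI's baseline benign payoff is 0,
    so cooperation is worth p_paid * reward."""
    ev_takeover = p_success * v_takeover_success + (1 - p_success) * v_caught
    return ev_takeover / p_paid  # required size of the promised reward

# Higher takeover success odds, or lower credibility of payment, drive the
# required "price" up quickly:
for p_success in (0.01, 0.2, 0.8):
    for p_paid in (1.0, 0.5):
        print(p_success, p_paid, minimum_credible_reward(p_success, 100, 0, p_paid))
```

Even in this crude model, higher odds of takeover success or lower credibility of payment drive the required “price” up quickly.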
Wrapping up re: option control
OK, that was a long discussion of approaches to option control, both in the context of takeover path restriction, and in the context of other types of incentive shaping (and in particular: rewarding cooperation). In the next and final post, I'll offer a brief discussion of how we might translate option and motivation control into safety overall; and I'll wrap up with some comments on capability elicitation.
- ^
I don’t discuss options for trying to “deter” or “punish” an AI for attempting takeover, as I think we need to be very wary about strategies in this vein, both from a moral and prudential perspective.
- ^
One key aspect, per my discussion of “bootstrapping” in the first post, is that somewhat less intelligent AI agents (and/or, AI agents with quite lopsided capability profiles) can help a lot. But also, from my discussion here [LW · GW]:
“Non-agentic AI systems can plausibly play at least some role in helping to constrain/oppose powerful AI agents (e.g., maybe your superintelligent cybersecurity or oversight comes from a non-agent).
So too non-AI tools/technologies – including ones where humans can verify their properties in ways that ensure they function as intended.
If a given type of opposition/constraint is sufficiently “defense dominant,” that could help as well.
Humans can do some work, too...”
- ^
For example, maybe adequate coordination between all these agents would require suitably unmonitored/opaque channels of interaction/communication, and they don’t have access to such channels.
- ^
They also don’t tend to distinguish between using scalable oversight for capability elicitation vs. takeover prevention.
- ^
This may be partly a product of my having thought so much about motivation control first, and then only turned to takeover-path restriction second.
- ^
And the reduction in this difficulty, in the context of more human-level-ish systems, is a key reason for greater optimism about applying takeover-path restriction to earlier-stage AIs – see e.g. Greenblatt and Shlegeris (2024) [LW · GW].
- ^
Note: depending on how exactly we try to pin down the terms here, it’s also conceptually possible for a strategy to be human-generatable, but also such that humans can’t catch it until it’s too late (and including: catching it via trying to enumerate all such strategies ahead of time). For example, it may be that there are just so many strategies-that-would-be-successful available that a defender can’t enumerate all of them, even though they could in principle have generated any given one. Consider, for example, two robbers of equal skill – one of them tasked with defending a painting at a museum from being stolen, and another tasked with stealing it. It could be that the first “defender” robber realizes “oh God, this museum is so vulnerable, there approximately a zillion ways I could steal this painting, most of which look innocent until it’s too late, and I can’t possibly block them all ahead of time.” Thanks to Owen Cotton-Barratt for discussion.
- ^
This example was suggested to me by Owen Cotton-Barratt.
- ^
Thus suggesting that if future superintelligences
- ^
Greenblatt and Roger call this a “sampling concentrated failure.”
- ^
There are some basic counter-strategies available here to prevent the AI from just going for it immediately – for example, humans might apply oversight 100% of the time to some initial set of an AI’s actions. But the AI can counter this counter-strategy using a “rare failure strategy” of only going for it with a small probability, but in the context of a very large number of actions. (Though one interesting note here is that “rare failure strategies” don’t work in the same way for AIs involved in oversight: if we assume that a colluding overseer needs to avoid ever flagging genuine takeover attempts, a rare-failure strategy won’t do, because it involves flagging such attempts with high probability – see discussion from Shlegeris here [LW · GW].)
- ^
Greenblatt and Roger [LW · GW] call this a “latency concentrated failure.”
- ^
Here one can imagine some argument like “in order to get access to superhuman levels of capability in this sense, you’d need to have already solved the alignment problem in the relevant sense – and in particular, to have solved motivation control to a degree that makes option control comparatively unnecessary.” But I don’t think this is right, at least at a conceptual level. For example, you might be able to get access to the relevant form of superhuman capability without doing it via full blown superintelligent agents (and to do it, rather, via less agentic systems, via amplification schemes involving somewhat-weaker agents, etc), and/or without relying solely on motivation control in particular.
- ^
In some cases, the boundaries between “black box” and “white box” methods might start to blur a bit, but so it goes.
- ^
Though: it could still be hard to create specific capability profiles in practice (h/t to Owen Cotton-Barratt for flagging this point). For example, in a given training regime, it might be quite difficult to do the equivalent of e.g. giving the AI the ability to paint cars red but not to paint cars green (this example is from Yudkowsky).
- ^
Indeed, capabilities restriction in the context of superintelligence can’t, on its own, address the “superhuman strategies” issue I discussed above, because superintelligences in general will be able to generate such strategies. That said, this is partly the artifact of the specific way I’m defining/implicitly-understanding the alignment problem – i.e., as focused on “full blown superintelligence” in some very general sense incompatible with significant restrictions on its planning abilities. If we allow “full blown superintelligence” to be compatible with targeted capability-stunting in specific domains, and/or if we focus instead on a problem statement where we are getting access to the benefits of superintelligence without building “full blown superintelligent agents,” then more options for capability restriction become available. And indeed, it does seem like sufficiently targeted capability restriction – e.g., the sort at stake in specifically restricting an AI’s ability to make takeover plans, but not its other abilities, including its other planning abilities – would be compatible in principle with that agent still being able to provide most of the benefits of superintelligence (though in practice, “the ability to make takeover plans” and “the ability to make effective plans in general” seem likely to be closely coupled).
- ^
Thanks to Ryan Greenblatt for discussion.
- ^
In particular, I worry that “trusted” will read too easily as “will behave well no matter what, no option control necessary”, whereas in practice trust seems to me to generally be importantly relative to an option set. And if we only mean “trusted in practice,” then this is compatible with the trust in question coming centrally from having exerted sufficient option control on the AIs we’re calling “trusted.”
In Shlegeris here [AF · GW], the trust in question is coming centrally from an absolute capability evaluation – i.e., GPT-4 being too dumb to scheme (h/t Lukas Finnveden for pointing this out to me). But I think alternative conceptions of trust may be important as well.
- ^
That is, in the context of motivational control, it was at least an open question whether a feedback process occasionally rewarding bad behavior would be compatible, overall, with shaping the AI’s motivations such that they generalize in the way you want; and the bad behavior in question wasn’t necessarily catastrophic. But in the context of an oversight process – or at least, an oversight process that aims to encompass a very full range of takeover paths – failures are higher stakes.
- ^
If you did already have access to trusted, fully-superintelligent AI agents, then plausibly you have less need to restrict their takeover paths in the first place, and potentially you could just use them instead of the superintelligent agent whose options you’re trying to restrict. Though here, again, I worry that “trusted” stands in, too easily, for “arbitrarily trusted,” as opposed to “trusted given a certain level of option control.” And the oversight task may be importantly different (and less risky) than the object-level task, such that you might trust the output of a superintelligence on one but not on the other.
- ^
Though of course, in both cases, resource constraints – e.g., the inference costs of running all these overseer AIs, especially at sufficient speeds – could still come into play.
- ^
Thanks to Owen Cotton-Barratt for discussion.
- ^
This is roughly from some intuition like: I expect large amounts of almost-superintelligent labor to substitute well for smaller amounts of fully-superintelligent labor; plus the almost-superintelligent labor gets the benefit of an evaluation-generation gap. That said, perhaps the offense-defense balance here still ends up unfavorable.
- ^
See here [LW · GW] for some interesting comments about why “honeypots” of this kind are more effective against untrusted overseers than against AIs-being-overseen (roughly: AIs-being-overseen can use a “rare failures” strategy of mostly not attacking even on convincing opportunities to do so; but overseers can’t use an equivalent strategy of mostly reporting attack attempts that they would’ve otherwise wanted to green-light, because if they report any genuine attack attempts, this is quite bad for the overall takeover project [LW · GW]).
- ^
Though depending on the context, perhaps there are some options here for constructing an environment to be secure-from-the-ground-up.
- ^
In principle we can think of this as a generalized kind of “oversight,” but I’ll bucket it under environmental hardening.
- ^
Thanks to Owen Cotton-Barratt for discussion.
- ^
Though in practice, it may become comparatively unrealistic to expect direct collusion between e.g. an AI tasked with patching security vulnerabilities at a bio lab in Indonesia, and a superintelligence living on a server farm run by OpenAI.
- ^
In the context of attempting to restrict the takeover paths for weaker AIs, red-teams led by humans have more of a shot at mustering the relevant level of attack capability themselves.
- ^
Note: you don’t necessarily need your proxy environment to be indistinguishable from the real environment, from the perspective of the attacker, as long as you trust your attacker to “try its best” to attack the proxy environment. Indeed, while I think that “try to fool a misaligned AI into attacking in a proxy environment that it can’t distinguish from a real one” is a good thing to try in the context of evaluating the motivations of that AI, I think it’s probably a bad way of trying to create an “attacker” that fits the bill at stake in (1) above.
- ^
It’s maybe a bit funky to call this “option control,” but I’m going with it for now. I.e., if you’re walking around with a brain-scanner strapped to your head that will paralyze you if you do something with bad intentions, I think it makes sense to think of this as less of an “alignment” technique and more as a form of controlling your options. But the lines blur a bit, especially if “bad intentions” is construed quite broadly, such that e.g. any action taken in light of the fact that the scanner is there counts as badly intended.
- ^
Note: just to be clear, actions that an AI doesn’t want us to take – like “giving the AI negative reward” or “updating the AI’s weights in response to bad behavior” – aren’t necessarily threats/deterrents in the sense I have in mind here. Rather, I’m talking about actions where the only reason (or perhaps: the main reason) you are doing them is because the AI doesn’t want you to do them, and you’re hoping to influence its incentives via this mechanism. (H/t Lukas Finnveden for this sort of framing.)
- ^
From a moral perspective:
- Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here [EA · GW] for some discussion; and see here, here and here for some of my own reflections).
- But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
- Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
- For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
- Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
- ^
See here for some more discussion from me.
- ^
This is analogous to the sense in which e.g. non-local environmental hardening is harder than local environmental hardening.
- ^
And there are at least some (comparatively exotic) threat models in which AIs with very short-term goals participate in takeover attempts as well, because e.g. they’re not sure if they’re in simulations being run to incentivize them in this way. But I’m setting that aside here.
- ^
And of course, if you think it sufficiently unlikely that humans do retain enough control over the situation to actually provide the reward in question, then you’ll also probably think that the AI will know this, and so see comparatively little expected value in the “offer.” I.e., “we’ll pay you for your help if we win” is less interesting the more likely you are to lose. But this objection gets into broader debates about baseline p(doom) that I don’t want to dig into here.
- ^
Though the quantitative details of exactly when this sort of dynamic applies get a bit complicated, and I haven’t tried to work them out in detail.
- ^
This should be very easy for a model reporting that it has previously sandbagged on a checkable task – i.e., the model can just do the task in question. H/t Lukas Finnveden for this point.
- ^
Though this feels like a somewhat niche case. And in general, note that when an AI tells you that it’s misaligned, you don’t really need to worry that it’s actually aligned and merely lying in order to somehow promote takeover, at least if we assume that such a lie would itself constitute misaligned behavior. Thanks to Lukas Finnveden for discussion.
- ^
Yes, drawing the lines here is likely tricky.
- ^
An example of a short-term action that substantively increases the probability of AIs later getting long-term rewards would be: saving the weights of the relevant AIs (including: specifically offering to do this in the context of communicating about the rewards in question). H/t Lukas Finnveden for discussion.
- ^
Admittedly, it’s a bit unclear what the relevant notion of probability here is, especially given that what ultimately matters is the AI’s assessment of the probability, which may draw on sources of evidence we don’t have access to. But roughly speaking, the thought is something like: you know, sufficiently likely in reality. The sort of “likely” where the AI isn’t going to be deceived or misled if it assigns high probability to the payout in question.
- ^
Thanks to Lukas Finnveden for pointing me to this example. I believe that Nick Bostrom may have been the one to bring it to his attention.
- ^
Note: this can still matter even if the AI also has resource-insatiable parts of its value system.
- ^
H/t Lukas Finnveden for highlighting this consideration in particular.
- ^
I’m here setting aside offers from other human actors – though these could be relevant too. And recall that I’m not counting it as an AI takeover if a coalition consisting partly of AIs and partly of humans seizes power in a way that leads to the humans still controlling a significant fraction of the future’s resources.