Option control

post by Joe Carlsmith (joekc) · 2024-11-04T17:54:03.073Z

Contents

  Introduction and summary
  Option control still matters for superintelligences
  Types of option control
  Takeover-path restriction
    Comparison with motivation control
      Do we need to get takeover-path restriction right on the first critical try?
      Feedback vs. oversight
      Auditing failures vs. concentrated failures
    Key difficulties with takeover-path restriction
      Superhuman strategies
      Opacity
      Oversight scarcity/latency
        Scarcity
        Latency
      Adversarial dynamics
        In the AI whose takeover-paths you’re trying to restrict
        In the AIs you’re trying to get to help you
      These four difficulties in a chart
    How can we address these difficulties?
      Capability restriction
        Black box capabilities restriction
        White box capability restriction
      Environmental hardening
        Black box environmental hardening
        Oversight
        Other environmental hardening
        Stress testing
        White box environment hardening
      Summing up re: takeover path restriction
  Other incentive shaping
    Rewarding cooperation
      Some key challenges, and how we might address them
        The need for success at other types of motivation/option control as well
        Picking a reward the AI wants
        Differentially targeting cooperation vs. non-cooperation
        Credibility
        Size of reward
  Wrapping up re: option control

Introduction and summary

(This is the third in a series [? · GW] of four posts about how we might solve the alignment problem. See the first post [LW · GW] for an introduction to this project and a summary of the posts that have been released thus far.)

In the first post in this series [? · GW], I distinguished between “motivation control” (trying to ensure that a superintelligent AI’s motivations have specific properties) and “option control” (trying to ensure that a superintelligent AI’s options have specific properties). My second post [LW · GW] surveyed some of the key challenges and available approaches for motivation control. In this post, I do the same for option control.

Here’s a summary:

Diagram of two approaches to option control

These four difficulties in a diagram

Diagram of the approaches to takeover path restriction I discuss

Option control still matters for superintelligences

As I discussed in my first post, the most desirable property, from an option control perspective, would be if the AI has no option that would lead to takeover with any non-trivial probability. In that case, motivation control becomes irrelevant to avoiding takeover (though: not to other goals), and the “safety from takeover” aspect of the alignment problem is secure (though: we still have to deal with the elicitation part).

But the discourse about AI alignment often proceeds on the assumption that this sort of option control isn’t realistically available in the context of superintelligence. I examine different ways you could end up thinking this here [LW · GW]. The strongest version, in my opinion, proceeds via the following claim:

The “need SI-agent to stop SI-agent” hypothesis: the only way to stop one superintelligent AI agent from having a decisive strategic advantage (DSA) is with another superintelligent AI agent.

That is, the worry goes: if you want to prevent a superintelligent AI agent from having the option to take over, you need another superintelligent AI agent (or set of such agents) to help you do so. But this means that if these agents all coordinated, they would collectively be in a position to take over, such that the option to engage in such coordination, if it’s available, now becomes a takeover option in itself. And then the additional thought is that it will likely be quite difficult to prevent superintelligent AI agents from having the option to coordinate in this way.

I think that arguments like this, plus more general pessimism about “AI boxing,” have led the traditional alignment discourse to pay too little attention, in general, to option control as a part of the overall toolkit for ensuring safety from takeover. I think this is especially true in the context of human-level-ish and only-somewhat-better-than-human-level-ish AIs (more on this in the future); but I think the same complaint applies to superintelligences as well.

So overall, I think that the toolkit available for option control is still very important to examine in the context of superintelligence, even if you think that some component of motivation control will be necessary as well.

Types of option control

Recall that on the ontology I’m using, an agent’s options emerge from the combination of its environment and its capabilities.

So really, option control decomposes, further, into environmental control and capability control. And we can further decompose environmental control into:

  1. Controlling the AI’s local environment – i.e., the intended development/deployment environment in which it operates.
  2. Controlling the broader world beyond that local environment.

This distinction isn’t always going to be especially clear – for example, if the AI’s intended development/deployment environment is unclear or extremely broad. But I think it’s useful regardless. In particular, in many paradigm contexts of AI development/deployment, the “intended development/deployment” environment is quite specific and limited. It’s therefore often much easier to attempt to ensure that this local environment has some specific set of properties – i.e., “the AI’s actions always receive blah form of review” – than it is to try to do so with the broader world – i.e., “no world leader is such that they would aid and abet a rogue AI.”

Finally, we can distinguish between two broad types of intervention on an AI’s options, namely:

  1. Making it harder for the AI to successfully take over – i.e., restricting its available paths to takeover.
  2. Otherwise shaping the incentives at stake in its options (e.g., improving its best benign options).

That is, in terms of my framework for thinking about AI incentives towards power-seeking, and holding fixed the AI’s motivations: you can either intervene to reduce the desirability/positive expected value (from the AI’s perspective) coming from successful takeover scenarios, or you can intervene elsewhere (i.e., on the desirability/positive expected value coming from the best benign option, or on the aversiveness/negative expected value coming from failed takeover scenarios).
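
As a toy formalization of this (ignoring lots of subtlety): letting p be the AI’s subjective probability that a given takeover attempt succeeds, and U(·) its valuation of each outcome, the AI is incentivized to go for takeover roughly when

p · U(successful takeover) + (1 − p) · U(failed takeover) > U(best benign option).

Takeover-path restriction, below, works mainly by driving down p; the “other incentive shaping” I discuss later works on the remaining terms.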

Here I’m going to focus, first, on making it harder for the AI to take over. Then I’ll talk about other ways you might try to use option control to alter its incentives.

Takeover-path restriction

Ok, let’s look for a second at options for making it harder for the AI to take over (let’s call this “takeover-path restriction”). I’ve generally thought quite a bit less about this topic than I have about motivation control, so I’m going to do even less, here, to aspire to capture the full space. At the least, though, I want to highlight some of the key differences and similarities at stake with takeover path restriction relative to motivation control, and to flag some of the paths forward that seem promising to me.

Comparison with motivation control

I’ll start with a few points of comparison/contrast with motivation control.

Do we need to get takeover-path restriction right on the first critical try?

In particular: recall that in the context of motivation control, we assumed that our AI was, at some point, going to be exposed to a takeover option – that is, an option that has a non-trivial chance of leading to successful takeover (and the point of motivation control is for the AI’s motivations to be such that it rejects this option). This created what was perhaps the key difficulty for motivation control – namely, the need for “generalization with no mistakes.”

In the context of takeover-path restriction, though, we don’t need to make this assumption. Indeed, the limiting case of success at takeover-path restriction is precisely the case in which the AI doesn’t ever get exposed to any takeover options – and I don’t want to assume, at the outset, that this sort of success is off the table.

Of course, even if we set aside motivation control in particular, there’s a broader sense in which any sort of safety plan needs to “work on the first try” (and the second, and the third, etc) – at least if we view a single failure as unacceptable (more here [LW · GW]). So in some sense, the need for “generalization with no mistakes” arises at a somewhat more abstract level even with takeover-path restriction – that is, if you’re ever going to put your AI in a context where you’re counting on your takeover-path restriction efforts to prevent takeover, then in a sense there will be a “first critical try” with respect to whether those efforts have been successful – one that you need to get right.

Feedback vs. oversight

Still, the fact that you don’t ever need to actually give your AI takeover options does create somewhat different dynamics, in the context of takeover-path restriction, relative to motivation control. In particular, in the context of motivation control – and especially, the version of motivation control that didn’t rest on transparency tools – we assumed that we were going to be shaping the AI’s motivations/cognition via some sort of behavioral feedback signal used in training; and then hoping/arguing that whatever motivations this signal creates are such as to generalize well to takeover options.

In the context of takeover-path restriction, though, a different approach becomes available: namely, using what I’ll call oversight to inspect an AI’s actions to determine whether they are part of a takeover attempt, and then blocking them if they are.

We can distinguish between two sorts of oversight, namely:

  1. Black box oversight, which draws only on the AI’s behavior/outputs (and other externally available information).
  2. White box oversight, which also draws on some degree of transparency into the AI’s internals.

Many of the techniques you would use for black box oversight are extremely similar to the ones you would use for “behavioral feedback” in the sense I discussed in the motivation control section. In particular, you’ll likely want to improve your behavioral inspection capacity via some way of drawing on lots of AI labor; you’ll likely want to draw on the difficulty difference between evaluation and generation; and so on. Indeed, when people talk about techniques for “scalable oversight,” they often don’t distinguish between feedback and oversight in the way that I am doing here – what matters is just: becoming much better at determining whether a given instance of AI behavior is good/bad in a given way.[4]

Auditing failures vs. concentrated failures

Importantly, though, sufficiently high-quality and ubiquitous oversight (including: black box oversight) can turn motivation control issues into option control issues – because sufficiently high-quality and ubiquitous oversight can eliminate takeover options. Thus, as an example, consider the distinction between what Greenblatt and Roger call “auditing failures” vs. “concentrated failures [LW · GW]” (one closely related to the distinction between what Christiano calls “low stakes” vs. “high stakes” alignment [LW · GW]). Here “auditing” refers to an instance of applying oversight, in my sense, to an AI’s actions; and an “auditing failure” occurs when this sort of oversight gets applied, but it fails to recognize that a given action is problematic. A concentrated failure, by contrast, occurs because the auditing/oversight process isn’t applied (in time) to a bad action – either because the auditing process gets applied to too few of an AI’s actions (a “sampling concentrated failure”), or because by the time the auditing process gets applied, it’s too late to block the action and/or its bad effects (a “latency concentrated failure”).

You only need motivation control if you are vulnerable either to auditing failures or to concentrated failures – that is, if it’s either the case that your oversight process can’t recognize some bad actions, or the case that your oversight process can’t get applied (in time) to enough actions to prevent takeover if those actions are bad. Of course, if you did have sufficiently good and ubiquitous oversight that you could catch all bad actions, then you could also use that as a (potentially flawed) mechanism for motivation control – namely, by catching any bad actions by the AI, then training it not to take them and hoping that this shapes its motivations in the way you want rather than doing something else (and note that this training can occur in an ongoing way during deployment – i.e., you can train “online”). But it’s not, strictly, necessary to actively reshape the AI’s motivations here in order to prevent takeover – you just need to be able to catch enough bad actions in time.
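
To make these failure modes concrete, here’s a minimal toy sketch (in Python, with made-up names and a stubbed-out “audit” function standing in for whatever real auditing process you have) of where each one shows up in a simple oversight loop:

```python
import random

def execute(action): print("executed:", action)
def block(action): print("blocked:", action)

def run_with_oversight(actions, audit, audit_fraction, audit_before_execution=True):
    """Toy oversight loop. `audit` stands in for the (possibly imperfect)
    auditing process; it returns True if it judges an action to be part of
    a takeover attempt."""
    for action in actions:
        if random.random() > audit_fraction:
            # Never audited. If this action was in fact a takeover step,
            # that's a "sampling concentrated failure".
            execute(action)
        elif not audit_before_execution:
            # Audited, but only after the action has already run. If it was a
            # takeover step with irreversible effects, that's a "latency
            # concentrated failure".
            execute(action)
            audit(action)
        elif audit(action):
            block(action)
        else:
            # Audited in time, but judged benign. If it was in fact a takeover
            # step, that's an "auditing failure": the overseer looked and missed it.
            execute(action)

def noisy_audit(action):
    # Stand-in overseer: flags obviously-labeled bad actions, but misses 10% of them.
    return "exfiltrate" in action and random.random() > 0.1

run_with_oversight(
    ["summarize logs", "exfiltrate model weights"],
    audit=noisy_audit,
    audit_fraction=0.3,  # only 30% of actions get audited at all
)
```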

OK, with those points of contrast with motivation control in mind, let’s turn to some of the key difficulties that takeover-path restriction faces.

Key difficulties with takeover-path restriction

What makes takeover-path restriction difficult? Lots of things in principle (and as I said above, I feel like I generally have a relatively weak grip on the contours of this space). Here I’m going to try to hone in on four difficulties that seem to me especially key, plus a smattering of extras. But this is pretty provisional.

Also:

Superhuman strategies

If I had to hone in on a single thing that makes takeover-path restriction difficult in the context of superintelligence, my current best guess would be something like: the availability of superhuman strategies for pursuing takeover.[6]

Here we can distinguish between two levels of superhuman-ness:

  1. Strategies that humans couldn’t have generated/identified themselves.
  2. Strategies that humans can’t recognize as aimed at takeover while they’re being pursued (until it’s too late).

These two levels of superhuman-ness are importantly distinct, and they create distinct problems for human efforts at takeover-path restriction. That is, roughly speaking:

These difficulties become more acute the more superintelligent the AI system is; the harder its actions and reasoning are for humans to understand; the more “slack” it has to think about potential takeover plans and/or communicate in not-human-legible ways with other AIs; and so on.

Indeed, because of the core difficulty with handling superhuman takeover strategies, I am quite pessimistic about humans (with assistance only from fairly weak AIs) being able to rely heavily on “takeover-path restriction” to prevent full-blown superintelligent agents from taking over. Rather, I think that we will need serious help from more advanced AIs – more below.

Opacity

Now, one thing you might’ve wanted to do, in order to help address these issues with superhuman strategies, is to draw on epistemic access to the AI’s internal variables – i.e., its capabilities, its motivations, its world model, its reasoning process – in order to help anticipate and detect-in-time the sorts of strategies in question. Below I’ll talk more about some options in this respect, if you could get the sort of access in question. Unfortunately, though, this sort of transparency faces the same issue we ran into in the context of motivation control – namely, that by default, current ML systems are quite opaque, so absent lots of interpretability progress, future ML-based superintelligences could be as well.[9] 

Oversight scarcity/latency

Even if you got to the point of at least being able to use oversight to detect takeover strategies (whether via using purely black box techniques, or via also including some aspect of transparency), you still face the “concentrated failures” issue that I mentioned above: namely, that you may not be able to apply your oversight to enough of an AI’s actions, fast enough, to restrict its paths to takeover in the way that you want to.

Scarcity

In one version of this, oversight is too scarce.
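
As a rough illustration of why scarcity bites (toy numbers, and ignoring the fact that audited attempts can get the AI caught – which is what pushes scheming AIs toward the “rare failure” strategies discussed in the footnotes): if auditing covers only a random fraction of actions in time, the chance that every takeover-relevant action gets caught shrinks quickly as the number of such actions grows.

```python
# Toy model: each takeover-relevant action is independently audited in time
# with probability f (the effective audit coverage).
def p_at_least_one_slips_through(f, k):
    return 1 - f**k  # 1 - P(all k such actions are audited in time)

for f in (0.5, 0.1, 0.02):
    for k in (1, 10, 100):
        print(f"coverage {f:>4}, {k:>3} takeover-relevant actions: "
              f"P(at least one unaudited) = {p_at_least_one_slips_through(f, k):.3f}")
```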

Latency

In another version of this, oversight is too slow to respond to detecting problems.

Adversarial dynamics

Finally, as in the case of motivation control, any efforts at takeover-path restriction face the potential for adversarial dynamics in various AIs – both the AI whose takeover-paths you’re trying to restrict, and the AIs you’re trying to get to help you.

In the AI whose takeover-paths you’re trying to restrict

Thus, for example, in the context of the AI whose takeover-paths you’re trying to restrict: one thing you might really want to do, in trying to understand and restrict what paths to takeover are available to an AI, is to try to understand what its capabilities are in general, and to test those capabilities in particular environments. But adversarial AIs might engage in “sandbagging [LW · GW]” aimed at concealing this information; they might lie to you about their internal variables; and so on.

In the AIs you’re trying to get to help you

Perhaps more importantly, though: because of the “superhuman strategies” issue above, it seems extremely likely that effective restriction of a superintelligence’s paths to takeover will require drawing on various types of AI labor – for example, in the context of trying to find and block takeover paths in a given environment ahead of deployment, in the context of applying oversight to the AI’s actions, and so on. But if these AIs are themselves dangerous – and especially, if they are actively colluding with each other in order to promote a takeover attempt – then they might sabotage efforts in this vein (i.e., by not identifying various security holes in a given environment, by giving the green-light to actions that they can see are aimed at takeover, and so on).

This issue is closely related to questions about the “need a superintelligent agent to stop a superintelligent agent” hypothesis that I briefly discussed above. That is, roughly speaking: if the only way to restrict the takeover options of a superintelligent agent is to draw on the labor of other superintelligent agents, then – especially if those agents are capable of cooperating with each other – you still seem pretty vulnerable.

These four difficulties in a chart

Here are these four difficulties in a convoluted chart:

Below I’ll also discuss a few other difficulties – with test environments, with capability control, and with controlling the environment – that seem to me at least somewhat-less-key. And I’ll name here one other difficulty that seems less key, but still worth calling out, namely:

How can we address these difficulties?

Ok, that was a list of various difficulties faced by efforts to restrict the takeover paths superintelligent AI systems have available. Let’s turn, now, to the options available for addressing these difficulties. Here I’m going to divide approaches according to whether they focus on restricting an AI’s capabilities or on hardening its environment, and according to whether they assume access to significant levels of transparency into AI cognition (white box) or not. I’ll also use a further distinction, within “environmental hardening,” between approaches focused on monitoring/oversight, and approaches focused on other types of environmental hardening. And I’ll discuss the importance of “stress testing” (and of avoiding collusion in the AIs you try to get to help you) in the context of both.

Thus, in a maybe-unhelpful diagram:

One other note before we dive in. As gestured at above, I think that the core “superhuman strategies” issue above makes it unlikely that humans + comparatively weak AI systems would be able to effectively and reliably block the paths to takeover available to a full blown superintelligence operating in a reasonably complex, real-world environment (while also eliciting the sorts of useful capabilities at stake in my definition of solving the alignment problem). So I’ll generally be focused, in what follows, on forms of environmental hardening that make use of superhuman abilities to generate, detect, and block takeover paths.

This means that an extremely key variable, for the viability of takeover-path restriction, is whether or not we can get the right sort of access to this sort of superhuman capability, including in the context of the potential for various adversarial dynamics.[13] And if we don’t have access to superhuman environmental hardening (whether via monitoring/oversight, stress-testing, or some other methods), then I think we should be extremely cautious about relying heavily on takeover path restriction in an overall safety case for super-intelligent agents.

Capability restriction

Let’s start by looking at options for capability restriction.

As I noted above, to the extent we’re solving the alignment problem fully, we are assuming that the AI in question needs to be a full blown superintelligence. But given this constraint, there is still plausibly room for more fine-grained restrictions on its capabilities to play a role in efforts at takeover-path-restriction.

We can roughly break down capability restriction into three components, namely:

  1. Capability evaluation – i.e., accurately assessing what capabilities a given AI has.
  2. Targeted capability development – i.e., deliberately creating AIs with some capabilities rather than others.
  3. Capability removal – i.e., taking away capabilities a given AI already has.

These three components are closely interrelated. For example: targeted and intentional capabilities development and capability removal likely require a highly developed capacity for accurate capability evaluation; and one way to do the equivalent of “removing” capabilities from a given AI is to use targeted capabilities development to create an AI with a different capability profile instead. Indeed, in some sense, all three of these components can be understood as part of a broader “science of AI capabilities.” And at a high level, effective capability restriction basically requires that this science be sufficiently developed that we can be quite deliberate and precise in choosing to endow a given AI with a given sort of capability; and quite confident about whether we have succeeded or failed.

What might this look like? I’m not going to examine the issue in depth here. But I’ll make a few comments, focused first on black box techniques for capability restriction – i.e., ones that do not assume significant transparency – and then moving on to more white-box approaches.

Black box capabilities restriction

The most obvious mechanism for black box capability evaluation comes from successfully causing the AI to demonstrate a given sort of capability on some input. But this approach suffers from at least two issues:

Beyond demonstrations, though, there are also other approaches to black box capability evaluation. For example, you can get better at predicting what sorts of capabilities a given sort of model should have based on various scaling laws, based on the data it was trained on, based on the other capabilities it (or other AIs) display in other contexts, based on more a priori reasoning about how AI training/learning works, and so on.[14]

This sort of capability evaluation ability – i.e., the ability to predict what sorts of capabilities an AI should have, given how it was created, without yet having seen them demonstrated – is also closely related to the ability to give an AI a specific capability profile (i.e., targeted capability development). For example, if you had extremely good scaling laws for all sorts of capabilities (including capabilities downstream of the training loss), then you’d be much better positioned to aim your capabilities development in a precise way.[15]
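
As a cartoon of what this sort of prediction might look like (entirely made-up numbers, and real capability forecasting is far messier than a single log-linear fit):

```python
import numpy as np

# Made-up benchmark scores from smaller training runs (compute in FLOP, score in [0, 1]).
compute = np.array([1e21, 1e22, 1e23, 1e24])
score = np.array([0.21, 0.34, 0.48, 0.62])

# Fit a simple log-linear trend and extrapolate to a planned, larger run --
# i.e., predict a capability level before ever having seen it demonstrated.
slope, intercept = np.polyfit(np.log10(compute), score, 1)
planned_compute = 1e26
predicted_score = slope * np.log10(planned_compute) + intercept
print(f"Predicted score at {planned_compute:.0e} FLOP: {predicted_score:.2f}")
```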

Various black box techniques for targeted capability removal may be available as well. Thus, for example, if it’s possible to identify which sorts of data led to which sorts of capabilities (see e.g. work on influence functions), it may also be possible to scrub the influence of a particular sort of data from a model’s policy.

White box capability restriction

What happens if we introduce more significant degrees of transparency (whether via open agency, interpretability, or some new paradigm) into the picture? Basically, lots of stuff gets quite a bit easier. Thus:

Environmental hardening

Even given large amounts of success at capability restriction, though, restricting the paths to takeover available to a superintelligence also requires serious attention to the environment that the superintelligence is operating in. After all, as noted above, you can’t restrict the superintelligence’s capability that much, because it needs to remain a superintelligence. And superintelligences, plausibly, have access to a disturbingly large array of takeover options in various environments.[16] 

How, then, do we create an environment that has been hardened enough to restrict these options, while still allowing ourselves to elicit this superintelligence’s useful capabilities? One way to think about this challenge is as follows. You’ve got some superintelligent AI agent (let’s say we’re holding its internal properties fixed), and you want to design some environment that will restrict its available paths to takeover in X way (while also making possible the sort of capability elicitation you want). But you face the “superhuman strategies” issue discussed above – namely, that this AI can pursue paths to takeover that you (a) couldn’t have identified/generated yourself, and which (b) you couldn’t recognize while they’re being pursued (until it’s too late). What do you need to do?

Very roughly, you need to find some way to get superhuman labor to help you with both (a) and (b), and then you need to use this increased ability to generate and recognize superhuman strategies to create an environment that restricts them in X way. Let’s look at some of the options in this respect now. I’ll start, again, with more black box approaches, and then move on to more white box-y ones.

Black box environmental hardening

I’m going to divide the environment an AI is operating in into two components, namely:

  1. The oversight process
  2. Everything else

Both, I think, need superhuman levels of hardening. Let’s look at the oversight process first.

Oversight

The oversight process, basically, is the part of the environment (if there is one) designed to monitor the AI’s behavior (plus other aspects of the inputs the AI is receiving), to detect suitably suspicious behavior/inputs, and to respond in a manner that prevents something bad from happening as a result.

(Note: I’m generally assuming that the oversight process is part of the local environment, such that “everything else above” includes both the rest of the local environment, and the broader world. In principle, though, we can understand various broader mechanisms for detecting and responding to AI takeover attempts as a form of “oversight” – and many of the same considerations will apply.)

In thinking about options for this sort of oversight, Greenblatt et al (2024) distinguish between two types of labor: trusted and untrusted. Paradigm instances of trusted labor, for them, include humans, and AI systems that are sufficiently weak that they can’t be scheming. And the key property of trusted labor is that it is assumed to be not intentionally and strategically undermining the oversight process. By contrast, other AI systems that are capable of scheming are “untrusted,” and thus might be colluding amongst themselves, and with the AI system they’re helping to oversee, in order to promote takeover wherever possible. In particular, “untrusted monitors” might intentionally give the green light to actions that they can recognize as part of a takeover attempt, if they can do so without getting caught themselves. I worry a bit that this distinction might mislead in various ways,[18] but let’s run with it for now.

Now, if we don’t need to deal with superhuman takeover strategies, and if we assume that humans are able to recognize pursuit of all the takeover paths we’re trying to restrict (before it’s too late), then in principle, it’s possible to imagine approaches to oversight that never make use of better-than-human detection ability. These approaches would still face other challenges – notably, they need to find a way to deal with the fact that human labor is likely too scarce and slow to oversee a large fraction of AI actions, by e.g. finding a good hierarchy for flagging up actions for different levels of scrutiny, so that humans only review a feasible fraction. But there’s at least some possibility of relying centrally on some combination of humans and weaker-than-human trusted AIs (e.g., GPT-4), composed in a way that never really outstrips human evaluation ability in any given case.
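
As a sketch of the kind of flagging hierarchy I have in mind (thresholds, monitor names, and the three-tier structure are all just illustrative):

```python
def tiered_review(action, cheap_monitor, expensive_monitor, human_review,
                  cheap_threshold=0.1, expensive_threshold=0.5):
    """Toy escalation pipeline: a cheap trusted monitor scores every action;
    only suspicious-looking actions get the expensive monitor; only the most
    suspicious of those get scarce, slow human review."""
    if cheap_monitor(action) < cheap_threshold:          # runs on 100% of actions
        return "allow"
    if expensive_monitor(action) < expensive_threshold:  # runs on the flagged minority
        return "allow"
    return "block" if human_review(action) else "allow"  # humans see a tiny fraction

# Example with stand-in monitors (scores in [0, 1], higher = more suspicious):
verdict = tiered_review(
    "copy model weights to external server",
    cheap_monitor=lambda a: 0.9 if "external" in a else 0.0,
    expensive_monitor=lambda a: 0.8,
    human_review=lambda a: True,
)
print(verdict)  # -> "block"
```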

In the context of attempting to restrict the takeover paths of a superintelligence, though, we have no such luxuries. Rather, we need to find some way to get access to superhuman abilities to detect takeover attempts – whether via successfully creating trusted AIs that are better than humans at this task, via harnessing the labor of untrusted better-than-human AIs, via somehow getting better-than-human detection out of a combination of humans + weak trusted AIs, or via some other method.

This problem is quite similar to the one we discussed above in the context of “feedback quality” for motivational control. That is, in both cases, you need a process for catching bad actions by superintelligent AIs, despite human inability to recognize such actions – whether for the purposes of feedback-in-training, or for the purposes of oversight (though: note that oversight plausibly imposes stricter standards than feedback[19]). And the array of black-box methods for scaling up to superhuman qualities of supervision in this respect, assuming that you don’t already have access to trusted, fully-superintelligent AI agents, is quite similar as well – see, e.g. “scalable oversight.”[20] I.e., you can use imitation learning to get access to large amounts of human-equivalent labor; you can use approval-based feedback to try to get access to large amounts of labor-of-the-sort-humans-would-approve-of; you can try to imitate the results of expensive versions of this labor; you can break down the evaluation process into parts; you can have different helper AIs debate; you can try to get lots of juice out of AIs that lack the agential prerequisites; and so on.

In one version of this, bootstrapping of this kind is sufficiently successful that efforts to oversee the behavior of a superintelligence get to draw on large amounts of trusted but-not-quite-superintelligent AI labor. That is, if the superintelligence being overseen is of intelligence level X, the trusted AIs overseeing it are only slightly less intelligent, and so can hopefully leverage other advantages (i.e., their numbers, their ability to perform expensive evaluations, their knowledge of and control over the environment, their ability to construct synthetic inputs to mess with the superintelligence in question, the fact that the AI they’re overseeing can’t iterate on failures, etc) to detect and block the relevant sorts of not-human-recognizable takeover attempts.

In worlds where we can get this sort of access to trusted, better-than-human-even-if-not-fully-superintelligent AI labor, I am relatively optimistic about creating oversight processes that can reliably restrict the takeover paths available to a superintelligence, if you’re willing to invest enough of this labor.[23] But what if we don’t get this sort of access to trusted, better-than-human AI labor? Now the situation looks pretty rough. In particular, you now need some sort of adequate scheme for preventing the untrusted AIs involved in the oversight process from colluding with the superintelligence.

There are, indeed, some options available for trying to prevent collusion of this kind, especially amongst weaker untrusted AIs. Examples (mostly taken from here [LW · GW]) include:

Depending on what sorts of trusted labor are available, and what amount of success you have in eliciting useful labor from untrusted sources (i.e., in generating synthetic outputs, helping identify plausible Schelling points, creating good paraphrasing methods, etc), it may be possible to bootstrap to some comparatively strong anti-collusion measures in this respect – i.e., measures that require the superintelligence to successfully collude with the oversight process without communication, and on the first try, in a manner that other AIs tasked with searching out and blocking Schelling points for this kind of collusion didn’t identify. Indeed, especially if our aim is merely to make it quite a bit harder for a superintelligence to find a successful takeover path, I don’t think we should totally rule out efforts to combat collusion with untrusted AIs that are part of the oversight process.
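
Here’s one concrete shape the “synthetic outputs” option can take – a toy sketch of upfront validation of an untrusted monitor, with the function names and numbers made up:

```python
def audit_the_auditor(untrusted_monitor, synthetic_attacks, benign_actions,
                      required_catch_rate=0.95, max_false_alarm_rate=0.05):
    """Toy check on an untrusted monitor: mix synthetic takeover attempts into
    its workload and see whether it flags them. A monitor that green-lights
    attacks it should be able to recognize is either weak or colluding.
    The whole scheme leans on the synthetic attacks being indistinguishable
    from real ones -- which is exactly what gets dicey against a superintelligence."""
    catch_rate = sum(map(untrusted_monitor, synthetic_attacks)) / len(synthetic_attacks)
    false_alarm_rate = sum(map(untrusted_monitor, benign_actions)) / len(benign_actions)
    return catch_rate >= required_catch_rate and false_alarm_rate <= max_false_alarm_rate
```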

Still, it’s an extremely dicey game. And it’s dicey, roughly, because the same sort of “takeover path restriction” problem that you’re trying to get your untrusted AIs to help with is arising at the meta-level. That is, just as the superintelligence has available superhuman takeover strategies that your oversight process needs superhuman capabilities to detect and block, so too, plausibly, does this superintelligence have available superhuman strategies for colluding with the untrusted oversight process – strategies that you would need superhuman anti-collusion efforts to effectively combat. Again, you can hope to bootstrap effectively to this sort of superhuman anti-collusion effort even without access to substantial better-than-human trusted labor. But at the stage where you’re trying to use takeover-path-restriction on a full blown superintelligence, it seems likely that humans, at least, will have a difficult time understanding a good deal of what’s going on – including, e.g., what sort of information might be encoded in a given sort of output; what sorts of Schelling points for collusion might be salient to the superintelligence and the overseers; what sorts of synthetic outputs might be distinguishable from real ones; and so on. So it seems quite unsurprising if the relevant anti-collusion efforts fail.

Other environmental hardening

Oversight is one key approach to hardening an AI’s environment in order to restrict available paths to takeover. And it’s an especially useful one, insofar as it can take advantage of the difficulty-difference between evaluation and generation. That is, oversight only needs to be able to recognize a given type of takeover attempt while it’s happening; whereas generating, and patching ahead of time, all the paths to takeover that a given environment makes available is a substantially harder task.[25]

That said, other sorts of environmental hardening have an important role to play too – especially to the extent oversight is sufficiently scarce/slow that one remains vulnerable to concentrated failures even despite oversight’s accuracy in detecting takeover attempts. We can group these, roughly, according to whether they focus on the AI’s local environment, or whether they focus on the broader world.

Some example, non-oversight-focused approaches to hardening the AI’s local environment include:

Some example, non-oversight-focused approaches to hardening the broader world include:

As these examples make clear, this is an extremely broad category of intervention. However, even at this level of breadth, it still faces the general “superhuman strategies” problem I discussed above. That is, very plausibly, and absent significant defense-favoring asymmetries in the offense-defense balance, effectively hardening both the local and non-local environment against superintelligent attack will itself require superhuman levels of capability. So here, as in the case of oversight, we face the challenge of getting safe access to this sort of superhuman capability, via some combination of trusted and untrusted AI labor. And to the extent we end up needing to rely on untrusted AI labor, many of the same concerns about collusion will apply as well, at least in principle.[28] 

Stress testing

Conceptually, the combination of (a) black-box hardening the oversight process and (b) black-box hardening the other aspects of the environment covers the full spectrum of black-box environmental hardening. But before moving on, I want to explicitly flag the key role of what I’ll call “stress testing” with respect to environmental hardening in general – and to name a few of the distinctive issues that it raises.

I’m using stress-testing, here, to refer to techniques that aim to evaluate the success of an effort at environmental hardening by testing whether or not an adversarial or “red-team” effort can defeat it, or something suitably similar to it (see here [LW · GW] and here [LW · GW] for some more discussion). Broadly speaking, successful stress-testing of this kind needs to involve two things:

  1. A red-team that:
    • has capabilities suitably reflective of the capabilities of the AI whose takeover paths you’re hoping to restrict;
    • is going to direct these capabilities fully towards the project of stress-testing.
  2. An environment whose hardening-relevant properties (including: the nature of the oversight process) are:
    • suitably reflective of the relevant properties of the real environment you want to harden;
    • otherwise safe to stress-test in the relevant way.

Both introduce their own challenges.

Also, even if you get (1) and (2) in place, it’s still an open question exactly how much evidence the failure of the red-team to find a successful attack provides about the security of the environment overall. That is: even if the red team is roughly as capable as the AI you’re ultimately worried about, it remains possible that the red team failed to identify a takeover path that the AI would’ve identified for some other reason. (Compare e.g.: “Bob couldn’t escape from this prison even after trying super hard, and he’s roughly as smart as Fred; so probably Fred won’t be able to escape either.” Maybe Fred just thinks of something Bob didn’t think of.)

And of course, if you do successfully show, using stress-testing, that an environment hasn’t been adequately hardened, it’s a further question whether you’ll be able to modify it to resolve the issue.

Overall: I think stress-testing is likely to be an important tool in the environmental hardening toolkit. But unsurprisingly, in order to provide the right kind of evidence about the availability of superhuman strategies for takeover, it will itself need to employ fully-elicited superhuman capabilities as part of the red-teaming effort.

White box environment hardening

So far I’ve been assuming that our efforts at environmental hardening can’t make use of any significant degree of transparency into the cognition of the superintelligent AI whose environment we’re hoping to harden. What happens if we relax this assumption?

Again, as in the case of white box capability restriction, introducing transparency to the picture – whether via open agency, interpretability, or some other new paradigm – makes a variety of new and very useful tools available.

Thus, in the context of oversight:

And in the context of other forms of environmental hardening, and of stress testing:

And note that full transparency – and especially, full transparency to human reasoning in particular – generally isn’t necessary to start realizing some of these benefits.

Summing up re: takeover path restriction

Ok, that was a long discussion of the challenges faced by efforts at takeover path restriction in the context of superintelligence, and of some of the approaches available for addressing these challenges. If I had to summarize the core thrust of the discussion, I would say something like: in order to effectively restrict the paths to takeover available to a superintelligence, you will very likely need access, yourself, to superhuman capabilities. By far the best way to get this access is to bootstrap via trusted, better-than-human (even if not: fully superintelligent) AIs; and if you are trying to bootstrap using untrusted AI labor, the situation looks pretty dire.

Other incentive shaping

Restricting the paths to takeover available to an AI, though, is not the only way you can try to shape its options in desirable ways. In particular, as I discussed above, you can also try to ensure that other features of the incentives it faces favor the pursuit of benign options over takeover options. Notably:

That is, repeating the chart from above:

Note that there are also some other option-control interventions that might seem, intuitively, more like “incentive shaping” than “takeover-path restriction.” For example: “make the upside of successful takeover less desirable by the AI’s lights” (such that, e.g., the AIs would get access to fewer resources via successful takeover) and “make the paths to successful takeover such that they require violating more of an AI’s inhibitions” (such that, e.g., all takeover paths require lying, and the AIs are extremely averse to lying).

Let’s look at rewarding cooperation, in particular, in more detail.

Rewarding cooperation

In general, I think we should be thinking hard about ways to improve the benign options available to an AI, such that cooperative strategies can more easily beat out uncooperative ones in an AI agent’s calculus. I’ll call this “rewarding cooperation” (though note, importantly, that I here mean “reward” in a sense more akin to “payment,” not in the sense at stake in ML training).[34] Indeed, setting up society such that cooperative behavior leads to desirable outcomes on the part of the cooperators is one key way we create conditions of social harmony and peace even amongst agents with different value systems; and it is often when cooperative options are most oppressive/undesirable that agents have the strongest incentives to engage in violent/destructive/otherwise-uncooperative forms of power-seeking. Many similar lessons seem applicable in the case of AI.[35] 

(I also think we may well have strong moral reasons to engage in various efforts to improve benign options for AIs. But the function of “rewarding cooperation” in the context of avoiding takeover differs enough from its function in a moral context that I think it’s worth talking about the dynamics at stake separately. In particular:

Here, again, it might be useful to distinguish between interventions aimed at the conditions of an AI’s local operation – e.g. within an AI lab – and interventions aimed at shaping society more broadly – e.g., the legal and political status of AIs in general.

This is related to a different distinction: namely, between rewards for cooperation that function centrally in the short term, and those that function in the longer term.

Some key challenges, and how we might address them

Let’s look in more detail at some key challenges that efforts at rewarding AI cooperation face, and how we might address them.

The need for success at other types of motivation/option control as well

One immediate worry about attempting to reward cooperation in superintelligent AI agents is that such agents will be in such a dominant position, relative to us, that these efforts won’t make a meaningful difference to their incentives. This sort of worry can take a variety of forms, but one salient version might be something like: if a superintelligence is sufficiently dominant relative to you, then any sort of “reward” for good behavior you try to set up will be such that this AI could get that sort of reward more effectively by disempowering humans and taking control of the situation itself. And this holds especially if the reward in question is of the same broad type as the “upside” at stake in successful takeover – i.e., better access to convergently-useful things like freedom, resources, and so on.

This sort of worry is continuous with the broader pattern, in the AI alignment discourse, of assuming that effective “option control” is basically infeasible in the context of superintelligences, and that we need to rely almost entirely on motivation control for safety.[38] As I discussed above, I disagree that option control has no role to play in the context of ensuring safety from superintelligences. For example, if we can bootstrap up to access to various types of superhuman environmental hardening and stress-testing, via some combination of trusted and untrusted AI labor, then I think we may well be able to meaningfully restrict the paths to takeover available even to a superintelligence.

However, I do agree that to the extent you have failed completely to exert any other sort of control over the motivations or options of your AIs – and in particular, to the extent you have failed so hard on “takeover path restriction” that your AI can take over extremely easily via a very wide variety of paths – then it’s hard for rewarding cooperation to do a ton of useful work on its own.

So when we talk about the role of rewarding cooperation in superintelligences, I do think we generally need to be imagining it as only one part of an overall approach that also includes at least some amount of success at some combination of (a) motivation control, and (b) takeover-path restriction.

However, I think “some amount of success” is distinct from “arbitrary success,” such that we shouldn’t just say “well, if you assume that you can do at least some amount of motivation control and/or takeover-path restriction, why not just lean entirely on those?” And anyway, I think it’s important to be decently comprehensive in sketching out the range of option-control interventions available in principle, especially given that rewarding cooperation – for example, via paying people for their labor – plays such a key role in our actual lived practice of harmonious coexistence with other agents.

Picking a reward the AI wants

Another issue with rewarding cooperation is that you don’t necessarily know what sort of reward an AI wants. This is especially clear in the context of concerns that you might have failed to understand your AI’s motivations more generally. I.e., if you’re concerned that your AI might be scheming in pursuit of some alien goal that it’s hiding from you, it’s harder to offer to “pay it” in terms of something that it values intrinsically.

However, I think there are a number of salient approaches to getting around this issue:

That said: note that to the extent your “reward” involves granting the AI new types of resources, freedoms, etc, you do need to also make sure that these resources/freedoms/etc are compatible with your safety goals more generally. I.e., rewards of this kind can come into conflict with “takeover-path restriction.”

Differentially targeting cooperation vs. non-cooperation

A different issue for rewarding cooperation is that you might find it difficult to differentially target your rewards at scenarios where the AI cooperates – especially to the extent that you were worried, originally, about scheming AI systems being uncooperative in ways you can’t detect.

However, I think there are a number of available responses to this challenge:

Credibility

Another general issue with rewarding cooperation is that in order to have the intended effect on an AI’s incentives, it needs to be suitably credible to the AI. Thus, if you tell a scheming AI “I’ll give you blah reward, with blah form of amnesty, if you reveal your misalignment,” but it doesn’t believe you, then the offer will be ineffective.

We can break down the credibility issue into two components, namely:

  1. Making it real – i.e., actually making it the case that cooperation will be rewarded in the relevant way, with the relevant probability.
  2. Giving your AI sufficient evidence of this fact.

Making it real

In the context of shorter-term offers confined to an AI’s local environment, I expect (1) will typically be fairly feasible. That is, if an AI project wants to make it the case that an AI reporting a given sort of misalignment will receive blah form of local treatment, it can often just make that the case, because it has enough control over how the AIs involved in the project are treated. And the same holds for short-term improvements to AI working conditions more generally.

In other cases, though – i.e., longer-term rewards, and/or rewards centered on conditions in the broader world – (1) becomes more difficult.

That said, I don’t think these difficulties warrant giving up on improving benign options for AI systems in longer-term ways, or in ways that focus on broader societal conditions. In particular:

Giving the AI enough evidence

What about (2), though – i.e., giving your AI enough evidence that the cooperation will be rewarded in the relevant way, granted that this is true?

Naively, and especially in the context of superintelligence, there is a basic case for optimism about (2), once you’ve got (1) in place: namely, the truth of the matter is on your side. In particular: we’ve generally been assuming, unless we’ve explicitly aimed at the contrary, that superintelligences will have highly accurate beliefs about the world. Indeed, this is one of the key challenges for various efforts at motivation control and option control that involve e.g. fake takeover options, synthetic attempts at collusion, etc – that is, the AIs are too good at seeing the truth. But in contexts where we want the AIs to see the truth, this dynamic turns to our benefit.

That said, there are a few complications to this case for optimism:

Anthropic making a fake “if you reveal your misalignment, we’ll give you blah form of reward/amnesty” offer to an AI

Some options for humans improving the credibility of their offers here include:

In general, though, I’m fairly optimistic about (2) conditional on (1) – especially in worlds where the AI is also so good at seeing the truth that various other efforts to mislead it, in the context of motivation control and takeover-path-restriction, are failing.

Size of reward

A final issue for rewarding cooperation concerns the necessary size and cost of the relevant reward. Thus, in an extreme case: if the only way to meaningfully alter an AI’s incentives to attempt takeover is to offer it some large fraction of the world’s resources, both in the present and into the long-run future, then this could easily make it the case that attempting this sort of intervention isn’t worth it overall; and it could also make it much harder to credibly offer the payment in question, because it would be so much more of a lift for the relevant stakeholders.
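
To give a cartoon sense of why the numbers can get uncomfortable (made-up values, utilities assumed roughly linear in resources, and ignoring discounting, risk attitudes, and everything else that matters in practice):

```python
def required_reward(p_success, value_of_takeover, value_if_attempt_fails=0.0):
    # Smallest guaranteed payment that beats attempting takeover, for a toy AI that
    # just maximizes expected resources (measured relative to a baseline of zero).
    return p_success * value_of_takeover + (1 - p_success) * value_if_attempt_fails

for p in (0.01, 0.3, 0.9):
    print(f"P(attempt succeeds) = {p}: need a reward worth "
          f"~{required_reward(p, 1.0):.0%} of what successful takeover would deliver")
```

The flip side is that the better your takeover-path restriction (i.e., the lower the AI’s probability of succeeding), the cheaper rewarding cooperation becomes – another way in which these tools seem to work best in combination.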

In general, the dynamics surrounding what sorts of rewards would be required to exert what sorts of influence on an AI’s overall incentives are going to depend on a lot of variables, and I won’t attempt any detailed analysis here. In general, though:

Overall, and depending on the specific factors in play, I think that size and cost of the necessary rewards could well be significant issues for effectively incentivizing AIs to cooperate by improving their best benign options. I’m most optimistic, on this front, with respect to the components of AI motivation that are specifically focused on shorter-term goods that the AI doesn’t have time to procure via takeover (e.g., improvements to local working conditions).

Wrapping up re: option control

OK, that was a long discussion of approaches to option control, both in the context of takeover path restriction, and in the context of other types of incentive shaping (and in particular: rewarding cooperation). In the next and final post, I’ll offer a brief discussion of how we might translate option and motivation control into safety overall; and I’ll wrap up with some comments on capability elicitation.

  1. ^

     I don’t discuss options for trying to “deter” or “punish” an AI for attempting takeover, as I think we need to be very wary about strategies in this vein, both from a moral and prudential perspective.

  2. ^

     One key aspect, per my discussion of “bootstrapping” in the first post, is that somewhat less intelligent AI agents (and/or, AI agents with quite lopsided capability profiles) can help a lot. But also, from my discussion here [LW · GW]:

    “Non-agentic AI systems can plausibly play at least some role in helping to constrain/oppose powerful AI agents (e.g., maybe your superintelligent cybersecurity or oversight comes from a non-agent).

    So too non-AI tools/technologies – including ones where humans can verify their properties in ways that ensure they function as intended.

    If a given type of opposition/constraint is sufficiently “defense dominant,” that could help as well.

    Humans can do some work, too...”

  3. ^

     For example, maybe adequate coordination between all these agents would require suitably unmonitored/opaque channels of interaction/communication, and they don’t have access to such channels.

  4. ^

     They also don’t tend to distinguish between using scalable oversight for capability elicitation vs. takeover prevention.

  5. ^

     This may be partly a product of my having thought so much about motivation control first, and then only turned to takeover-path restriction second.

  6. ^

     And the reduction in this difficulty, in the context of more human-level-ish systems, is a key reason for greater optimism about applying takeover-path restriction to earlier-stage AIs – see e.g. Greenblatt and Shlegeris (2024) [LW · GW].

  7. ^

     Note: depending on how exactly we try to pin down the terms here, it’s also conceptually possible for a strategy to be human-generatable, but also such that humans can’t catch it until it’s too late (including catching it via trying to enumerate all such strategies ahead of time). For example, it may be that there are just so many strategies-that-would-be-successful available that a defender can’t enumerate all of them, even though they could in principle have generated any given one. Consider, for example, two robbers of equal skill – one of them tasked with defending a painting at a museum from being stolen, and the other tasked with stealing it. It could be that the first “defender” robber realizes “oh God, this museum is so vulnerable, there are approximately a zillion ways I could steal this painting, most of which look innocent until it’s too late, and I can’t possibly block them all ahead of time.” Thanks to Owen Cotton-Barratt for discussion.

  8. ^

     This example was suggested to me by Owen Cotton-Barratt.

  9. ^

     Thus suggesting that if future superintelligences

  10. ^

     Greenblatt and Roger call this a “sampling concentrated failure.”

  11. ^

     There are some basic counter-strategies available here to prevent the AI from just going for it immediately – for example, humans might apply oversight 100% of the time to some initial set of an AI’s actions. But the AI can counter this counter-strategy using a “rare failure strategy” of only going for it with a small probability, but in the context of a very large number of actions. (Though one interesting note here is that “rare failure strategies” don’t work in the same way for bad actions by AIs involved in oversight if we assume it needs to avoid ever flagging genuine takeover attempts, because rare failure involves flagging some such attempts with high probability – see discussion from Shlegeris here [LW · GW].)

  12. ^

     Greenblatt and Roger [LW · GW] call this a “latency concentrated failure.”

  13. ^

     Here one can imagine some argument like “in order to get access to superhuman levels of capability in this sense, you’d need to have already solved the alignment problem in the relevant sense – and in particular, to have solved motivation control to a degree that makes option control comparatively unnecessary.” But I don’t think this is right, at least at a conceptual level. For example, you might be able to get access to the relevant form of superhuman capability without doing it via full blown superintelligent agents (and to do it, rather, via less agentic systems, via amplification schemes involving somewhat-weaker agents, etc), and/or without relying solely on motivation control in particular.

  14. ^

     In some cases, the boundaries between “black box” and “white box” methods might start to blur a bit, but so it goes.

  15. ^

     Though: it could still be hard to create specific capability profiles in practice (h/t to Owen Cotton-Barratt for flagging this point). For example, in a given training regime, it might be quite difficult to do the equivalent of e.g. giving the AI the ability to paint cars red but not to paint cars green (this example is from Yudkowsky).

  16. ^

     Indeed, capabilities restriction in the context of superintelligence can’t, on its own, address the “superhuman strategies” issue I discussed above, because superintelligences in general will be able to generate such strategies. That said, this is partly an artifact of the specific way I’m defining/implicitly-understanding the alignment problem – i.e., as focused on “full blown superintelligence” in some very general sense incompatible with significant restrictions on its planning abilities. If we allow “full blown superintelligence” to be compatible with targeted capability-stunting in specific domains, and/or if we focus instead on a problem statement where we are getting access to the benefits of superintelligence without building “full blown superintelligent agents,” then more options for capability restriction become available. And indeed, it does seem like sufficiently targeted capability restriction – e.g., the sort at stake in specifically restricting an AI’s ability to make takeover plans, but not its other abilities, including its other planning abilities – would be compatible in principle with that agent still being able to provide most of the benefits of superintelligence (though in practice, “the ability to make takeover plans” and “the ability to make effective plans in general” seem likely to be closely coupled).

  17. ^

     Thanks to Ryan Greenblatt for discussion.

  18. ^

     In particular, I worry that “trusted” will read too easily as “will behave well no matter what, no option control necessary,” whereas in practice trust seems to me to generally be importantly relative to an option set. And if we only mean “trusted in practice,” then this is compatible with the trust in question coming centrally from having exerted sufficient option control on the AIs we’re calling “trusted.”

    In Shlegeris here [AF · GW], the trust in question is coming centrally from an absolute capability evaluation – i.e., GPT-4 being too dumb to scheme (h/t Lukas Finnveden for pointing this out to me). But I think alternative conceptions of trust may be important as well.

  19. ^

     That is, in the context of motivation control, it was at least an open question whether a feedback process occasionally rewarding bad behavior would be compatible, overall, with shaping the AI’s motivations such that they generalize in the way you want; and the bad behavior in question wasn’t necessarily catastrophic. But in the context of an oversight process – or at least, an oversight process that aims to encompass a very full range of takeover paths – failures are higher stakes.

  20. ^

     If you did already have access to trusted, fully-superintelligent AI agents, then plausibly you have less need to restrict their takeover paths in the first place, and potentially you could just use them instead of the superintelligent agent whose options you’re trying to restrict. Though here, again, I worry that “trusted” stands in, too easily, for “arbitrarily trusted,” as opposed to “trusted given a certain level of option control.” And the oversight task may be importantly different (and less risky) than the object-level task, such that you might trust the output of a superintelligence on one but not on the other.

  21. ^

     Though of course, in both cases, resource constraints – e.g., the inference costs of running all these overseer AIs, especially at sufficient speeds – could still come into play.

  22. ^

     Thanks to Owen Cotton-Barratt for discussion.

  23. ^

     This is roughly from some intuition like: I expect large amounts of almost-superintelligent labor to substitute well for smaller amounts of fully-superintelligent labor; plus the almost-superintelligent labor gets the benefit of an evaluation-generation gap. That said, perhaps the offense-defense balance here still ends up unfavorable.

  24. ^

     See here [LW · GW] for some interesting comments about why “honeypots” of this kind are more effective against untrusted overseers than against AIs-being-overseen (roughly: AIs-being-overseen can use a “rare failures” strategy of mostly not attacking even on convincing opportunities to do so; but overseers can’t use an equivalent strategy of mostly reporting attack attempts that they would’ve otherwise wanted to green-light, because if they report any genuine attack attempts, this is quite bad for the overall takeover project [LW · GW]).
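     To give a rough quantitative sense of this asymmetry (again with purely hypothetical numbers): the AI-being-overseen can keep its per-opportunity attack probability small, so even a sizable number of honeypots will rarely catch it; but an overseer that wants the takeover to succeed needs to green-light the real attack with high probability, and that same high probability applies to any honeypot attack submitted for its review. A minimal sketch:

```python
# Illustrative sketch of why honeypots bite harder on untrusted overseers than on
# the AIs being overseen. All numbers are hypothetical.

# AI-being-overseen: attacks with small probability p on each convincing opportunity,
# relying on a long deployment to get a real attempt in eventually.
p = 0.001                                        # per-opportunity attack probability
H = 50                                           # honeypot opportunities presented
p_honeypots_catch_attacker = 1 - (1 - p) ** H    # ~4.9%

# Untrusted overseer: green-lights convincing attack attempts with probability q.
# It can't make q small without also making the real attack likely to be flagged.
q = 0.9
p_honeypot_catches_overseer = q                  # ~90% per honeypot reviewed
p_real_attack_green_lit = q                      # shrinking q shrinks this too

print(f"P(honeypots catch the overseen AI): {p_honeypots_catch_attacker:.1%}")
print(f"P(a single honeypot catches the overseer): {p_honeypot_catches_overseer:.0%}")
print(f"P(the real attack gets green-lit): {p_real_attack_green_lit:.0%}")
```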

  25. ^

     Though depending on the context, perhaps there are some options here for constructing an environment to be secure-from-the-ground-up.

  26. ^

     In principle we can think of this as a generalized kind of “oversight,” but I’ll bucket it under environmental hardening.

  27. ^

     Thanks to Owen Cotton-Barratt for discussion.

  28. ^

     Though in practice, it may become comparatively unrealistic to expect, e.g., direct collusion between an AI tasked with patching security vulnerabilities at a bio lab in Indonesia and a superintelligence living on a server farm run by OpenAI.

  29. ^

     In the context of attempting to restrict the takeover paths for weaker AIs, red-teams led by humans have more of a shot at generating the sorts of attack strategies that the AIs in question might use.

  30. ^

     Note: you don’t necessarily need your proxy environment to be indistinguishable from the real environment, from the perspective of the attacker, as long as you trust your attacker to “try its best” to attack the proxy environment. Indeed, while I think that “try to fool a misaligned AI into attacking in a proxy environment that it can’t distinguish from a real one” is a good thing to try in the context of evaluating the motivations of that AI, I think it’s probably a bad way of trying to create an “attacker” that fits the bill at stake in (1) above.

  31. ^

     It’s maybe a bit funky to call this “option control,” but I’m going with it for now. I.e., if you’re walking around with a brain-scanner strapped to your head that will paralyze you if you do something with bad intentions, I think it makes sense to think of this as less of an “alignment” technique and more as a form of controlling your options. But the lines blur a bit, especially if “bad intentions” is construed quite broadly, such that e.g. any action taken in light of the fact that the scanner is there counts as badly intended.

  32. ^

     Note: just to be clear, actions that an AI doesn’t want us to take – like “giving the AI negative reward” or “updating the AI’s weights in response to bad behavior” – aren’t necessarily threats/deterrents in the sense I have in mind here. Rather, I’m talking about actions where the only reason (or perhaps: the main reason) you are doing them is because the AI doesn’t want you to do them, and you’re hoping to influence its incentives via this mechanism. (H/t Lukas Finnveden for this sort of framing.)

  33. ^

     From a moral perspective:

    • Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here [EA · GW] for some discussion; and see here, here and here for some of my own reflections).
    • But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.

    From a prudential perspective:

    • Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
    • For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
    • Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
  34. ^

     This section draws heavily on unpublished work by Lukas Finnveden, and to a lesser extent on work by Buck Shlegeris (unpublished), Daniel Kokotajlo (here), and Tom Davidson (here).

  35. ^

     See here for some more discussion from me.

  36. ^

     This is analogous to the sense in which e.g. non-local environmental hardening is harder than local environmental hardening.

  37. ^

     And there are at least some (comparatively exotic) threat models in which AIs with very short-term goals participate in takeover attempts as well, because e.g. they’re not sure if they’re in simulations being run to incentivize them in this way. But I’m setting that aside here.

  38. ^

     And of course, if you think it sufficiently unlikely that humans do retain enough control over the situation to actually provide the reward in question, then you’ll also probably think that the AI will know this, and so see comparatively little expected value in the “offer.” I.e., “we’ll pay you for your help if we win” is less interesting the more likely you are to lose. But this objection gets into broader debates about baseline p(doom) that I don’t want to dig into here.
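     In expected-value terms (a deliberately crude model, with hypothetical placeholder numbers): the offer is worth roughly the AI’s credence that humans end up in a position to pay, times its credence that the promise is actually honored, times the value of the reward to the AI. For example:

```python
# Deliberately crude expected-value framing of a "we'll pay you for your help if we win" offer.
# All quantities are hypothetical placeholders, not estimates.

p_humans_retain_control = 0.5   # AI's credence that humans stay in a position to pay
p_payment_honored = 0.8         # AI's credence that the promise is actually kept
value_of_reward = 1.0           # value of the promised reward to the AI (arbitrary units)

expected_value_of_offer = p_humans_retain_control * p_payment_honored * value_of_reward
print(expected_value_of_offer)  # 0.4 -- shrinks as the AI's odds on "humans lose" rise
```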

  39. ^

     Though the quantitative dynamics around when exactly this sort of dynamic applies get a bit complicated, and I haven’t tried to work them out in detail.

  40. ^

     This should be very easy for a model reporting that it has previously sandbagged on a checkable task – i.e., the model can just do the task in question. H/t Lukas Finnveden for this point.

  41. ^

     Though this feels like a somewhat niche case. And in general, note that when an AI tells you that it’s misaligned, you don’t really need to worry that it’s lying in order to somehow promote takeover and is actually aligned after all, at least if we assume that such a lie would constitute misaligned behavior. Thanks to Lukas Finnveden for discussion.

  42. ^

     Yes, drawing the lines here is likely tricky.

  43. ^

     An example of a short-term action that substantively increases the probability of AIs later getting long-term rewards would be: saving the weights of the relevant AIs (including: specifically offering to do this in the context of communicating about the rewards in question). H/t Lukas Finnveden for discussion.

  44. ^

     Admittedly, it’s a bit unclear what the relevant notion of probability here is, especially given that what ultimately matters is the AI’s assessment of the probability, which may draw on sources of evidence we don’t have access to. But roughly speaking, the thought is something like: you know, sufficiently likely in reality. The sort of “likely” where the AI isn’t going to be deceived or misled if it assigns high probability to the payout in question.

  45. ^

     Thanks to Lukas Finnveden for pointing me to this example. I believe that Nick Bostrom may have been the one to bring it to his attention.

  46. ^

     Note: this can still matter even if the AI also has resource-insatiable parts of its value system as well.

  47. ^

     H/t Lukas Finnveden for highlighting this consideration in particular.

  48. ^

     I’m here setting aside offers from other human actors – though these could be relevant too. And recall that I’m not counting it as an AI takeover if a coalition consisting partly of AIs and partly of humans seizes power in a way that leads to the humans still controlling a significant fraction of the future’s resources.
