The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

post by EJT (ElliottThornley) · 2023-10-23T21:00:48.398Z · LW · GW · 22 comments

This is a link post for https://philpapers.org/rec/THOTSP-7

22 comments

comment by FeepingCreature · 2023-10-23T18:42:23.974Z · LW(p) · GW(p)

I haven't really followed the math here, but I'm worried that "manipulating the probability that the button is pressed" is a weird and possibly wrong framing. For one, a competent agent will always be driving the probability that the button is pressed downward. In fact, what we want in a certain sense is an agent that brings the probability to zero - because we have ended up in such an optimal state or attractor that we, even for transitively correct reasons, have no desire to shut the agent down. At that point, what we want to preserve is not precisely "the probability of the button being pressed" but "the causal association of bad outcomes with the button being pressed."

If the agent hides the button away, and then notices that everybody is unhappy but nobody can reach the button, we don't want it to press the button with the initial probability - we (probably) want it to press the button with probability 1, despite the fact that no human can reach it. We have ended up in a dissatisfactory or dangerous state/state space, and so we want the agent to be shut down.

Conversely, consider a friendly AI that wants to fulfill the human utility function but is restricted to keeping the button probability constant. Such an AI might, for instance, fan fears about AI safety even as it optimizes human flourishing in the background. In fact, it may prefer as much as possible that its actions not be attributable to itself, so as to avoid disturbing the probability of button-pressing - exactly the opposite of what we want from an aligned AI!

Edit: In fact, it's even worse: an AI tasked with taking actions that preserve the probability of the button being pressed at a certain value per time step would probably first seek to kill all humans, because our reaction to its actions is the most critical mechanism that might disturb the correct probability of the button being pressed. It could then replace us with a more reliable button-presser, such as a true RNG.

When bad things happen, we want to preserve our ability to press the button. This ability cannot be expressed as a probability, because it is inextricable from the world model. In fact, the button should be pressed exactly when the AI is untrustworthy. Hence, the button is unnecessary - if we can recognize that this linkage is being preserved, we necessarily have a definition of a trustworthy AI, so we can just build that.

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2023-10-24T13:20:02.326Z · LW(p) · GW(p)

You're right that we don't want agents to keep the probability of shutdown constant in all situations, for all the reasons you give. The key thing you're missing is that the setting for the First Theorem is what I call a 'shutdown-influencing state': a state in which the only thing the agent can influence is the probability of shutdown. We want the agent's preferences to be such that it lacks a preference between all the available actions in such states. And that's because: if it had preferences between the available actions in such states, it would resist our attempts to shut it down; and if it lacked such preferences, it wouldn't resist our attempts to shut it down.
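For concreteness, here's a toy numerical sketch of such a state (the payoff numbers are invented purely for illustration, not taken from the paper):

```python
# Toy 'shutdown-influencing state': the available actions differ only in the
# probability of shutdown they induce. The utilities below are illustrative
# assumptions, with shutdown treated as yielding zero further utility.

U_SHUTDOWN = 0.0    # utility if the agent is shut down
U_CONTINUE = 10.0   # utility if the agent keeps operating (same for every action)

# Each action influences nothing except the probability that the button is pressed.
actions = {"leave_button_alone": 0.5, "block_button": 0.1, "press_button": 0.9}

for name, p_shutdown in actions.items():
    expected_utility = p_shutdown * U_SHUTDOWN + (1 - p_shutdown) * U_CONTINUE
    print(f"{name}: EU = {expected_utility}")

# An agent that maximises this expected utility strictly prefers 'block_button'
# (EU = 9.0), i.e. it has an incentive to resist shutdown. An agent that lacks
# preferences between these actions has no such incentive.
```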

Replies from: FeepingCreature
comment by FeepingCreature · 2023-10-25T11:03:36.910Z · LW(p) · GW(p)

Ah! That makes more sense.

comment by Seth Herd · 2024-02-21T05:03:13.126Z · LW(p) · GW(p)

I agree that the shutdown problem is hard. There's a way to circumvent it, and I think that's what will actually be pursued in the first AGI alignment attempts.

That is to make the alignment goal instruction-following. That includes following instructions to shut down or do nothing, so there's no conflict between shutdown and the primary goal.

This alignment goal isn't trivial, but it's much simpler, and therefore easier to get right, than full alignment to the good of humanity, whatever that is.

Given the difficulties, I very much doubt that anyone is going to try launching an AGI that is aligned to the good of all humanity but can be reliably shut down if we decide we've gotten that definition wrong.

I say a little more about this in Corrigibility or DWIM is an attractive primary goal for AGI [LW · GW] and Roger Dearnaley goes into much more depth in his Requirements for a Basin of Attraction to Alignment [LW · GW].

This is more like Christiano's corrigibility concept than Eliezer's, but probably distinct from both.

I'll be writing more about this soon. Since writing that post, I've become increasingly convinced that this is the most likely alignment goal for the first, critical AGIs, and that it probably makes alignment easier while making human power dynamics concerningly relevant.

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2024-02-22T10:37:42.448Z · LW(p) · GW(p)

I agree that the first AGIs will probably be trained to follow instructions/DWIM. I also agree that, if you succeed in training agents to follow instructions, then you get shutdownability as a result. But I'm interested to know why you think instruction-following is much simpler and therefore easier than alignment with the good of humanity. And setting aside alignment with the good of humanity, do you think training AGIs to follow instructions will be easy in an absolute sense?

Replies from: Seth Herd
comment by Seth Herd · 2024-02-23T01:48:23.654Z · LW(p) · GW(p)

Good questions. To me, following instructions seems vastly simpler than working out what's best for all of humanity (and what counts as humanity) for an unlimited future. "Solving ethics" is often listed as a major obstacle to alignment, and I think we'll just punt on that difficult issue and align the AGI to want to follow our current instructions rather than our inmost desires, let alone all of humanity's.

I realize this isn't fully satisfactory, so I'd like to delve into this more. It seems much simpler to guess "what did this individual mean by this request" than to guess "what does all of humanity want for all of time". Desires are poorly defined and understood. And what counts as humanity will become quite blurry if we get the ability to create AGIs and modify humans.

WRT ease: it seems like current LLMs already understand our instructions pretty well, so any AGI that incorporates LLMs or similar linguistic training will already be in the ballpark. And that's all it has to be, as long as it checks with the user before taking impactful actions.

It's critical that, in my linked post on DWIM, I include an "and check" portion. It seems like pretty trivial overhead for the AGI to briefly summarize the plan it came up with and ask its human operator for approval, particularly for impactful plans.
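A minimal sketch of the loop I have in mind (the function names and the impact threshold are made-up placeholders, not any existing agent framework's API):

```python
# Minimal sketch of a 'do what I mean and check' loop. All helpers passed in
# (generate_plan, estimate_impact, execute, ask_user) are hypothetical
# placeholders for whatever the underlying agent actually uses.

IMPACT_THRESHOLD = 0.5  # arbitrary cutoff for what counts as 'impactful'

def dwim_and_check(instruction, generate_plan, estimate_impact, execute, ask_user):
    plan = generate_plan(instruction)            # work out what the user meant
    if estimate_impact(plan) > IMPACT_THRESHOLD:
        # Summarize the plan and get explicit approval before acting.
        if not ask_user(f"I plan to: {plan}. Proceed?"):
            return None                          # covers 'do nothing' / 'shut down'
    return execute(plan)
```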

WRT occasionally misunderstanding intentions, and whether an action is "impactful" enough to warrant checking before execution: there's a bunch of stuff you can do to institute internal crosschecks in an LLM agent's internal thinking. See my Internal independent review for language model agent alignment [AF · GW] if you're interested.

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2024-02-24T15:01:31.285Z · LW(p) · GW(p)

But you don't just need your AI system to understand instructions. You also need to ensure its terminal goal is to follow instructions. And that seems like the hard part.

Replies from: Seth Herd
comment by Seth Herd · 2024-02-25T20:07:11.036Z · LW(p) · GW(p)

Yes, that's a hard part. But specifying the goal accurately is often regarded as a potential failure point, so if I'm right that this is a simpler, easier-to-specify alignment goal, that's progress. It also has the advantage of incorporating corrigibility as a by-product, which makes it resistant to partial failure: if you can tell in time that something went wrong, the AGI can be asked to shut down.

WRT the difficulty of using the AGI's understanding as its terminal goal: I think it's not trivial, but quite doable, at least in some of the AGI architectures we can anticipate. See my two short posts Goals selected from learned knowledge: an alternative to RL alignment [AF · GW] and The (partial) fallacy of dumb superintelligence [LW · GW].

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2024-02-26T09:13:28.815Z · LW(p) · GW(p)

Thanks, I'll check those out.

comment by Ali (ali-merali) · 2023-10-24T14:29:13.342Z · LW(p) · GW(p)

Enjoyed reading this (although I'm still making my way through the latter parts) - I think it's useful to have a lot of these ideas more formalised.

I'm wondering whether this simple idea would come up against Theorem 2 (or any of the others), or why it wouldn't work:

Suppose that for every time t, some decision maker would choose a probability p(t) with which the AI shuts down. Further, suppose the AI's utility function in any period was initially U(t). Now rescale the AI's utility function to be U(t)/(1-p(t)). I think this can be quite easily generalised so that future periods of utility are likewise unaffected by an increase in the risk of shutdown.

In this world, the AI should be indifferent over changes to p(t) (so long as p(t) can get arbitrarily close to 1 but never reaches it), and so should take actions trying to maximise U whilst being indifferent to whether humans decide to shut it down.
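A quick numerical check of that claim (assuming, purely for illustration, that shutdown yields zero utility in the period in question):

```python
# Quick check that the rescaling makes expected utility independent of the
# shutdown probability, assuming shutdown yields zero utility for that period.

def expected_period_utility(U, p):
    """One-period expected utility when the AI shuts down with probability p
    and its within-period utility has been rescaled to U / (1 - p)."""
    scaled_U = U / (1 - p)
    return (1 - p) * scaled_U + p * 0.0

U = 10.0
for p in [0.0, 0.5, 0.9, 0.99]:
    print(p, expected_period_utility(U, p))   # prints 10.0 every time

# The rescaling exactly cancels the survival probability, so the agent gains
# nothing by pushing p up or down (provided p never actually reaches 1).
```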

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2023-10-26T09:12:50.210Z · LW(p) · GW(p)

Oh cool idea! It seems promising. It also seems similar in one respect to Armstrong's utility indifference proposal discussed in Soares et al. 2015: Armstrong has a correcting term that varies to ensure that utility stays the same when the probability of shutdown changes, whereas you have a correcting factor that varies to ensure that utility stays the same when the probability of shutdown changes. So it might be worth checking how your idea fares against the problems that Soares et al. point out for the utility indifference proposal. 
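Schematically (this is my rough paraphrase, not the exact formulation in Soares et al. 2015), the two corrections look like:

$$U_{\text{Armstrong}}(x) = U(x) + c(p) \qquad \text{vs.} \qquad U_{\text{yours}}(t) = \frac{U(t)}{1-p(t)},$$

with the added term c(p) and the factor 1/(1-p(t)) each chosen so that expected utility is insensitive to changes in the probability of shutdown.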

Another worry for utility indifference that might carry over to your idea is that at present we don't know how to specify an agent's utility function with enough precision to implement a correcting term that varies with the probability of shutdown. One way to overcome that worry would be to give (1) a set of conditions on preferences that together suffice to make the agent representable as maximising that utility function, and (2) a proposed regime for training agents to satisfy those conditions on preferences. Then we could try out the proposal and see if it results in an agent that never resists shutdown. That's ultimately what I'm aiming to do with my proposal.

comment by Vladimir_Nesov · 2023-10-24T02:19:26.640Z · LW(p) · GW(p)

It's unclear what shutdown means, and this issue seems related to what it means to interfere with the pressing of a shutdown button. From the point of view of almost any formal decision-making framework, creating new separate agents in the environment can't be distinguished from any other activity. The main agent getting shut down doesn't automatically affect these environmental agents in the ways needed for them to cease exerting influence on the world.

A corrigible agent needs to bound its influence, and to avoid actions that escalate the risk of breaking the bounds it sets on its influence. At the same time, when some influence leaks through, it shouldn't be fought with greater counter-influence. The frame of expected utility maximization seems entirely unsuited for deconfusing this problem.

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2023-10-24T13:05:49.702Z · LW(p) · GW(p)

Yes, ensuring that the agent creates corrigible subagents is another difficulty on top of the difficulties that I explain in this post. I tried to solve that problem in section 14 on p. 51 here.

comment by Charlie Steiner · 2023-10-24T06:20:01.917Z · LW(p) · GW(p)

Very clear! Good luck with publishing, assuming you're shopping it around.

Replies from: ElliottThornley
comment by EJT (ElliottThornley) · 2023-10-24T12:54:32.605Z · LW(p) · GW(p)

Thanks! Yep I'm aiming to get it published in a philosophy journal.

comment by weverka · 2023-10-23T18:11:36.006Z · LW(p) · GW(p)

I tried to follow with a particular shutdown mechanism in mind, and this whole argument is just too abstract for me to see how it applies.

Yudkowsky gave us a shutdown mechanism in his Time Magazine article. He said we could bomb* the data centers. Can you show how these theorems cast doubt on this shutdown proposal?

*"destroy a rogue datacenter by airstrike." - https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/

Replies from: weverka
comment by weverka · 2023-11-20T13:45:06.010Z · LW(p) · GW(p)

Why the downvotes and a statement that I am wrong because I misunderstood?

This is a mean-spirited reaction when I led with an admission that I could not follow the argument. I offered a concrete example and stated that I could not follow the original thesis as applied to that concrete example. No one took me up on this.

Are you too advanced to stoop to my level of understanding and help me figure out how this abstract reasoning applies to a particular example? Is the shutdown mechanism suggested by Yudkowsky too simple?

Replies from: rhollerith_dot_com, ElliottThornley
comment by RHollerith (rhollerith_dot_com) · 2023-11-21T02:44:24.756Z · LW(p) · GW(p)

Yudkowsky's suggestion is for preventing the creation of a dangerous AI by people. Once a superhumanly-capable AI has been created and has had a little time to improve its situation, it is probably too late even for a national government with nuclear weapons to stop it (because the AI will have hidden copies of itself all around the world or taken other measures to protect itself, measures that might astonish all of us).

The OP in contrast is exploring the hope that (before any dangerous AIs are created) a very particular kind of AI can be created that won't try to prevent people from shutting it down.

Replies from: o-o
comment by O O (o-o) · 2023-11-21T03:20:52.923Z · LW(p) · GW(p)

If a strongly superhuman AI were created, sure, but you can probably box a minimally superhuman AI.

Replies from: rhollerith_dot_com
comment by RHollerith (rhollerith_dot_com) · 2023-11-21T15:53:26.117Z · LW(p) · GW(p)

It's hard to control how capable the AI turns out to be. Even the creators of GPT-4 were surprised, for example, that it would be able to score in the 90th percentile on the Bar Exam. (They expected that, if they and other AI researchers were allowed to continue their work long enough, one of their models would eventually be able to do so, but they had no way of telling which model it would be.)

But more to the point: how does boxing have any bearing on this thread? If you want to talk about boxing, why do it in the comments on this particular paper? Why do it as a reply to my previous comment?

comment by EJT (ElliottThornley) · 2023-11-21T02:04:53.941Z · LW(p) · GW(p)

Hi weverka, sorry for the downvotes (not mine, for the record). The answer is that Yudkowsky's proposal addresses a different 'shutdown problem' than the one I'm discussing in this post. Yudkowsky's proposal is aimed at stopping humans from developing potentially-dangerous AI. The problem I'm discussing here is that of designing artificial agents that both (1) pursue goals competently, and (2) never try to prevent us from shutting them down.

Replies from: weverka
comment by weverka · 2023-11-22T01:04:08.145Z · LW(p) · GW(p)

Thank you.