Posts

The Shutdown Problem: Incomplete Preferences as a Solution 2024-02-23T16:01:16.378Z
The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists 2023-10-23T21:00:48.398Z
The price is right 2023-10-16T16:34:38.023Z
What are some examples of AIs instantiating the 'nearest unblocked strategy problem'? 2023-10-04T11:05:34.537Z
EJT's Shortform 2023-09-26T15:19:53.914Z
There are no coherence theorems 2023-02-20T21:25:48.478Z

Comments

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T15:21:56.154Z · LW · GW

I think there is probably a much simpler proposal that captures the spirit of this and doesn't require any of these moving parts. I'll think about this at some point.

Okay, interested to hear what you come up with! But I dispute that my proposal is complex/involves a lot of moving parts/depends on arbitrarily far generalization. My comment above gives more detail but in brief: POST seems simple, and TD follows on from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training for POST and TD doesn't run into the same barriers to generalization as we see when we consider training for honesty.

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T15:09:21.569Z · LW · GW

I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times differ.
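
As a rough illustration of that behavioural rule, here's a minimal sketch (the class and attribute names are mine, and this is an illustration rather than the training proposal itself):

```python
import random
from dataclasses import dataclass

@dataclass
class Outcome:
    operational_time: int   # timesteps the agent remains operational
    utility: float          # how good the outcome is in other respects

def post_choice(a: Outcome, b: Outcome) -> Outcome:
    """POST, behaviourally: compare outcomes with respect to 'time that I
    remain operational', and choose stochastically whenever those times differ."""
    if a.operational_time != b.operational_time:
        # Different-length trajectories: no preference, so choose stochastically.
        return random.choice([a, b])
    # Same-length trajectories: the agent's other preferences decide
    # (here, crudely, higher utility).
    return a if a.utility >= b.utility else b
```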

And if and when an agent learns POST, I think Timestep Dominance is a simple and natural rule to learn. In terms of preferences, Timestep Dominance follows from POST plus a Comparability Class Dominance principle (CCD).  And satisfying CCD seems like a prerequisite for capable agency. Behaviourally, ‘don’t pay costs to shift probability mass between shutdowns at different timesteps’ follows from POST plus another principle that seems like a prerequisite for minimally sensible action under uncertainty.

And once you’ve got POST (I argue), you can train for Timestep Dominance without worrying about deceptive alignment, because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance. By contrast, if you instead train for ‘some goal + honesty’, deceptive alignment is a real concern.

Timestep Dominance is indeed sensitive to unlikely conditionals, but in practice I expect the training regimen to involve just giving lower reward to the agent for paying costs to shift probability mass between shutdowns at different timesteps. Maybe the agent starts out by learning a heuristic to that effect: ‘Don’t pay costs to shift probability mass between shutdowns at different timesteps’. If and when the agent starts reflecting and replacing heuristics with cleaner principles, Timestep Dominance is the natural replacement (because it usually delivers the same verdicts as the heuristic, and because it follows from POST plus CCD). And Timestep Dominance (like the heuristic) keeps the agent shutdownable (at least in cases where the unlikely conditionals are favourable. I agree that it's unclear exactly how often this will be the case).

Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe. So it seems like there’s this insuperable barrier to ensuring that honesty generalizes far, even in the absence of deceptive alignment.

By contrast, it doesn’t seem like there’s any parallel barrier to getting POST and Timestep Dominance to generalize far. Suppose we train for POST, but then recognise that our training regimen might lead the agent to learn some other rule instead, and that this other rule will lead the AI to behave differently in some situations. In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T13:32:10.153Z · LW · GW

Thanks, appreciate this!

It's not clear from your summary how temporal indifference would prevent shutdown preferences. How does not caring about how many timesteps result in not caring about being shut down, probably permanently?

I tried to answer this question in The idea in a nutshell. If the agent lacks a preference between every pair of different-length trajectories, then it won’t care about shifting probability mass between different-length trajectories, and hence won’t care about hastening or delaying shutdown.

There's a lot of discussion of this under the terminology "corrigibility is anti-natural to consequentialist reasoning". I'd like to see some of that discussion cited, to know you've done the appropriate scholarship on prior art. But that's not a dealbreaker to me, just one factor in whether I dig into an article.

The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.

Now, you may be addressing non-sapient AGI only, that's not allowed to refine its world model to make it coherent, or to do consequentialist reasoning.

That’s not what I intend. TD-agents can refine their world models and do consequentialist reasoning.

When I asked about the core argument in the comment above, you just said "read these sections". If you write long dense work and then just repeat "read the work" to questions, that's a reason people aren't engaging. Sorry to point this out; I understand being frustrated with people asking questions without reading the whole post (I hadn't), but that's more engagement than not reading and not asking questions. Answering their questions in the comments is somewhat redundant, but if you explain differently, it gives readers a second chance at understanding the arguments that were sticking points for them and likely for other readers as well.

Having read the post in more detail, I still think those are reasonable questions that are not answered clearly in the sections you mentioned. But that's less important than the general suggestions for getting more engagement with this set of ideas in the future.

Ah sorry about that. I linked to the sections because I presumed that you were looking for a first chance to understand the arguments rather than a second chance, so that explaining differently would be unnecessary. Basically, I thought you were asking where you could find discussion of the parts that you were most interested in. And I thought that each of the sections was short enough, and answered your question directly enough, to link rather than recapitulate the same points.

In answer to your first question, incomplete preferences allow the agent to prefer an option B+ to another option B, while lacking a preference between a third option A and B+, and lacking a preference between A and B. The agent can thus have preferences over same-length trajectories while lacking a preference between every pair of different-length trajectories. That prevents preferences over being shut down (because the agent lacks a preference between every pair of different-length trajectories) while preserving preferences over goals that we want it to have (because the agent has preferences over same-length trajectories).

In answer to your second question, Timestep Dominance is the principle that keeps the agent shutdownable, but this principle is silent in cases where the agent has a choice between making $1 in one timestep and making $1m in two timesteps, so the agent’s preference between these two options can be decided by some other principle (like – for example – ‘maximise expected utility among the non-timestep-dominated options').

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T10:41:55.007Z · LW · GW

Yep, maybe that would've been a better idea!

I think that stochastic choice does suffice for a lack of preference in the relevant sense. If the agent had a preference, it would reliably choose the option it preferred. And tabooing 'preference', I think stochastic choice between different-length trajectories makes it easier to train agents to satisfy Timestep Dominance, which is the property that keeps agents shutdownable. And that's because Timestep Dominance follows from stochastic choice between different-length trajectories and a more general principle that we'll train agents to satisfy, because it's a prerequisite for minimally sensible action under uncertainty. I discuss this in a little more detail in section 18.

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T09:56:55.310Z · LW · GW

Thanks, appreciate this!

It's unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn't seem like it can be with respect to the agent's subjective beliefs as this would make it even harder to impart.

I propose that we train agents to satisfy TD with respect to their subjective beliefs. I’m guessing that you think that this kind of TD would be hard to impart because we don’t know what the agent believes, and so don’t know whether a lottery is timestep-dominated with respect to those beliefs, and so don’t know whether to give the agent lower reward for choosing that lottery.

But (it seems to me) we can be quite confident that the agent has certain beliefs, because these beliefs are necessary for performing well in training. For example, we can be quite confident that the agent believes that resisting shutdown costs resources, that the resources spent on resisting shutdown can’t also be spent on directly pursuing utility at a timestep, and so on.

And if we can be quite confident that the agent has these accurate beliefs about the environment, then we can present the agent with lotteries that are actually timestep-dominated (according to the objective probabilities decided by the environment) and be quite confident that these lotteries are also timestep-dominated with respect to the agent’s beliefs. After all, we don’t need to know the agent’s beliefs with any great detail or precision to tell whether a lottery is timestep-dominated with respect to those beliefs. We just need to know whether the agent believes that the lottery involves spending resources only to shift probability mass between shutdowns at different timesteps. My proposal is that we present the agent with lotteries in which this is actually the case (according to the objective probabilities decided by the environment) and use the fact that capable agents’ beliefs will reflect this actuality.

Imparting TD preferences seems hard

The prosaic version of this proposal assumes that you can impart timestep dominance preferences into AIs in ways which will robustly generalize. This seems unlikely to be true in general (in the absence of additional technology) and if we did have the property, we could solve safety issues in some other way (e.g. robustly generalizing honesty). So you'll need to argue that timestep dominance is particularly easy to robustly impart relative to other preferences. I can see some arguments for timestep dominance being particularly easy to robustly impart, but they don't seem very strong to me.

Yep, I claim that it’s easier to robustly impart POST and Timestep Dominance than it is to robustly impart things like honesty. And that’s because (it seems to me) we can train for POST and Timestep Dominance in ways that largely circumvent the problems of reward misspecification, goal misgeneralization, and deceptive alignment. I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?

Suppose that Clippy is a TD-agent which cares solely about the long run number of paperclips, but subject to a TD constraint. That is, it always prefers trajectory X over Y if, for all shutdown timings, X gives at least as many paperclips as Y, and for some timing X gives strictly more.

Now, consider the following proposed action for Clippy: "In a stealthy manner, construct a successor agent which pursues long run paperclips at all costs (and which isn't subject to a shutdown button, but still has TD constraints wrt the original Clippy's shutdown button)." This agent is implicitly resisting shutdown as it's constructing the sub-agent stealthily. So, shouldn't this action be Timestep Dominated by "In whatever way is most expedient (but not necessarily stealthy), construct a successor agent"? I claim not, because if we condition on the AI not being shut down prior to constructing the agent (a particular shutdown-time conditional), it will be worse for Clippy if humans know that it is constructing a successor, as then humans will have an easier time fighting the successor. In other words, having humans know that you're constructing a successor has two effects: they'll fight your successor and they'll shut you down. If we condition on not being shut down, then they'll still fight your successor, so Clippy would prefer being stealthy in the no-shutdown conditional. Thus, being stealthy isn't Timestep Dominated by changing the action to avoid resisting shutdown. (And once the successor is constructed, shutting down Clippy has no effect on the long run number of paperclips (let's suppose).)

(You should think of this counterexample as a worst case theory counterexample. The exact assumptions might not be realistic, but it points at a broader class of issues.)

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.

Timestep dominance is maybe crippling

I'm most uncertain here, but my current guess would be that any sort of absolute constraint like this is crippling. I've thought through some cases and this is my current guess, but I'm by no means confident.

Can you say more about these cases? Timestep Dominance doesn’t rule out making long-term investments or anything like that, so why crippling?

Comment by EJT (ElliottThornley) on EJT's Shortform · 2024-04-05T11:30:36.701Z · LW · GW

Thanks, will reply there!

Comment by EJT (ElliottThornley) on EJT's Shortform · 2024-04-05T11:30:04.446Z · LW · GW

Thanks, will reply there!

Comment by EJT (ElliottThornley) on EJT's Shortform · 2024-04-05T11:28:42.416Z · LW · GW

it'll take a lot of effort for me to read properly (but I will, hopefully in about a week).

Nice, interested to hear what you think!

I think it's easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.

Yep agree that this is a concern, and I plan to think more about this soon.

putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.

Interested to hear more about this. I'm not sure exactly what you mean by 'detector', but I don't think my proposal requires either of these. The agent won't try to manipulate the button, because doing so is timestep-dominated by not doing so. And the agent won't self-modify in ways that stop it being shutdownable, again because doing so is timestep-dominated by not doing so. I don't think we need a detector in either case.

because of inner alignment issues

I argue that my proposed training regimen largely circumvents the problems of goal misgeneralization and deceptive alignment. On goal misgeneralization, POST and TD seem simple. On deceptive alignment, agents trained to satisfy POST seem never to get the chance to learn to prefer any longer trajectory to any shorter trajectory. And if the agent doesn't prefer any longer trajectory to any shorter trajectory, it has no incentive to act deceptively to avoid being made to satisfy TD.

this isn't what the shutdown problem is about so it isn't an issue if it doesn't apply directly to prosaic setups

I'm confused about this. Why isn't it an issue if some proposed solution to the shutdown problem doesn't apply directly to prosaic setups? Ultimately, we want to implement some proposed solution, and it seems like an issue if we can't see any way to do that using current techniques.

Comment by EJT (ElliottThornley) on EJT's Shortform · 2024-04-05T10:49:43.572Z · LW · GW

Thanks, that's useful to know. If you have the time, can you say some more about 'control of an emerging AI's preferences'? I sketch out a proposed training regimen for the preferences that we want, and argue that this regimen largely circumvents the problems of reward misspecification, goal misgeneralization, and deceptive alignment. Are you not convinced by that part? Or is there some other problem I'm missing?

Comment by EJT (ElliottThornley) on EJT's Shortform · 2024-04-02T11:03:40.228Z · LW · GW

Nice, interested to hear what you think!

Comment by EJT (ElliottThornley) on EJT's Shortform · 2024-04-02T09:37:33.826Z · LW · GW

My solution to the shutdown problem didn't get as much attention as I hoped. Here's why it's worth your time.

  • An everywhere-implemented solution to the shutdown problem would send the risk of AI takeover down to ~0.
  • My solution is shovel-ready. It makes only small tweaks to an otherwise-thoroughly-prosaic setup for training transformative AI.
  • My solution won first prize and $16,000 in last year's AI Alignment Awards, judged by Nate Soares, John Wentworth, and Richard Ngo.
  • I've since explained my solution to about 50 people in and around the AI safety community, and all the responses have been various flavours of 'This seems promising.' I've not yet had any responses of the form 'I expect this wouldn't work, for the following reason(s): _____.'

If you read my solution and think it wouldn't work, let me know. If you think it could work, help me make it happen.

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-03-07T12:57:17.105Z · LW · GW

This is great. 

Thanks!

A somewhat exotic multipolar failure I can imagine would be where two agents mutually agree to pay each other to resist shutdown to make resisting shutdown profitable rather than costly.  This could be "financed" by extra resources accumulated by taking actions longer, by some third party that doesn't have POST preferences.

Interesting! Rephrasing the idea to check if I’ve got it right.

Agent A and agent B have similar goals, such that A’s remaining operational looks good from B’s perspective, and B’s remaining operational looks good from A’s perspective.

A offers to compensate B for any costs that B incurs in resisting shutdown. A might well do this, because doing so isn’t timestep-dominated (for A) by not doing so. And that in turn is because, if B resists shutdown, that’ll lead to greater expected sum-total utility for A conditional on A’s shutdown at some timestep. And since A is offering to compensate B for resisting shutdown, B’s resisting shutdown isn’t timestep-dominated (for B) by not resisting, so B might well resist shutdown.

And the same is true in reverse: B can offer to compensate A for any costs that A incurs in resisting shutdown. So A and B might collude to resist shutdown on each other’s behalf. (Your comment mentions a third party, but I’m not sure if that’s necessary.)

This concern doesn’t seem too exotic, and I plan to think more about it. But in the meantime, note a general nice feature of TD-agents: TD-agents won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. That nice feature seems to help us here. Although A might offer to compensate B for resisting shutdown, A won’t pay any costs to ensure that we humans don’t notice this offer. And if we humans notice the offer, we can shut A down. And then B won’t resist shutdown, because A is no longer around to compensate B for doing so.

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-02-26T18:07:55.549Z · LW · GW

Good question. I discuss costless shutdown-prevention a bit in footnote 21 and section 21.4. What I say there is: if shutdown-prevention is truly costless, then the agent won't prefer not to do it, but plausibly we humans can find some way to set things up so that shutdown-prevention is always at least a little bit costly.

Your example suggests that maybe this won't always be possible. But here's some consolation. If the agent satisfies POST, it won't prefer not to costlessly prevent shutdown, but it also won't prefer to costlessly prevent shutdown. It'll lack a preference, and so choose stochastically. So if the agent should happen to have many costless opportunities to affect the probabilities of shutdown at each timestep, it won't reliably choose to delay shutdown rather than hasten it.

Comment by EJT (ElliottThornley) on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-02-26T09:18:17.423Z · LW · GW

I'm confused about how this proposal prevents preferences over being shut down, but preserves preferences over goals you want it to have.

This is exactly what incomplete preferences gets us. See section 6 and section 8.

Would your agent have a preference between making you $1 in one timestep and $1m in two timesteps?

Yep, see section 13.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2024-02-26T09:13:28.815Z · LW · GW

Thanks, I'll check those out.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2024-02-24T15:01:31.285Z · LW · GW

But you don't just need your AI system to understand instructions. You also need to ensure its terminal goal is to follow instructions. And that seems like the hard part.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2024-02-22T10:37:42.448Z · LW · GW

I agree that the first AGIs will probably be trained to follow instructions/DWIM. I also agree that, if you succeed in training agents to follow instructions, then you get shutdownability as a result. But I'm interested to know why you think instruction-following is much simpler and therefore easier than alignment with the good of humanity. And setting aside alignment with the good of humanity, do you think training AGIs to follow instructions will be easy in an absolute sense?

Comment by EJT (ElliottThornley) on A Shutdown Problem Proposal · 2024-01-25T19:16:48.926Z · LW · GW

Interesting idea. Couple of comments.

(1) Your proposal requires each subagent to be very mistaken about the probability of shutdown at each timestep. That seems like a drawback. Maybe it's hard to ensure that subagents are so mistaken. Maybe this mistake would screw up subagents' beliefs in other ways.

(2) Will subagents' veto-power prevent the agent from making any kind of long-term investment?

Consider an example. Suppose that we can represent the extent to which the agent achieves its goals at each timestep with a real number ('utilities'). Represent trajectories with vectors of utilities. Suppose that, conditional on no-shutdown, the Default action gives utility-vector <1, 1, 1, 1, ...>. The other available action is 'Invest'. Conditional on no-shutdown, Invest gives utility-vector <0, 2, 2, 2, ...>.

As long as the agent's goals aren't too misaligned with our own goals (and as long as the true probability of an early shutdown is sufficiently small), we'll want the agent to choose Invest (because Invest is slightly worse than the default action in the short-term but much better in the long-term). But Subagent2 will veto choosing Invest, because Subagent2 is sure that shutdown will occur at timestep 2, and so from its perspective, Invest gives <0, shutdown> whereas the default action gives <1, shutdown>.

Is that right?
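
To spell out the worry in (2) a little, here's a minimal sketch using the illustrative utility-vectors above (the setup and function names are mine, not part of your proposal):

```python
DEFAULT = [1, 1, 1, 1]   # utility at each timestep, conditional on no shutdown
INVEST  = [0, 2, 2, 2]   # slightly worse at first, better thereafter

def subagent_value(utilities, believed_shutdown_timestep):
    """A subagent is certain that shutdown occurs at its own timestep, so it
    only counts utility accrued before that timestep."""
    return sum(utilities[:believed_shutdown_timestep - 1])

def vetoed(action, default, timesteps=range(2, 6)):
    """An action is vetoed if any subagent finds it worse than the default."""
    return any(subagent_value(action, t) < subagent_value(default, t)
               for t in timesteps)

print(vetoed(INVEST, DEFAULT))  # True: Subagent2 sees <0, shutdown> vs <1, shutdown>
```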

Comment by EJT (ElliottThornley) on Shallow review of live agendas in alignment & safety · 2023-11-27T11:51:17.372Z · LW · GW

Very useful post! Here are some things that could go under corrigibility outputs in 2023: AI Alignment Awards entry; comment. I'm also hoping to get an updated explanation of my corrigibility proposal (based on this) finished before the end of the year.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2023-11-21T02:04:53.941Z · LW · GW

Hi weverka, sorry for the downvotes (not mine, for the record). The answer is that Yudkowsky's proposal is aiming to solve a different 'shutdown problem' than the shutdown problem I'm discussing in this post. Yudkowsky's proposal is aimed at stopping humans developing potentially-dangerous AI. The problem I'm discussing in this post is the problem of designing artificial agents that both (1) pursue goals competently, and (2) never try to prevent us shutting them down.

Comment by EJT (ElliottThornley) on What's Hard About The Shutdown Problem · 2023-11-06T10:07:49.542Z · LW · GW

It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep

That's not quite right. If we're comparing two lotteries, one of which gives lower expected utility than the other conditional on shutdown at some timestep and greater expected utility than the other conditional on shutdown at some other timestep, then neither of these lotteries timestep dominates the other. And then the Timestep Dominance principle doesn't apply, because it's a conditional rather than a biconditional. The Timestep Dominance Principle just says: if X timestep dominates Y, then the agent strictly prefers X to Y. It doesn't say anything about cases where neither X nor Y timestep dominates the other. For all we've said so far, the agent could have any preference relation between such lotteries.

That said, your line of questioning is a good one, because there almost certainly are lotteries X and Y such that (1) neither of X and Y timestep dominates the other, and yet (2) we want the agent to strictly prefer X to Y. If that's the case, then we'll want to train the agent to satisfy other principles besides Timestep Dominance. And there's still some figuring out to be done here: what should these other principles be? can we find principles that lead agents to pursue goals competently without these principles causing trouble elsewhere? I don't know but I'm working on it.

It also doesn't seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory

Can you say a bit more about this? Humans don't reason by Timestep Dominance, but they don't do explicit EUM calculations either and yet EUM-representability is commonly considered a natural form for preferences to take.

Comment by EJT (ElliottThornley) on What's Hard About The Shutdown Problem · 2023-11-01T10:37:52.547Z · LW · GW

I've been imagining that the button is shutdown-causing for simplicity, but I think you can suppose instead that the button is shutdown-requesting (i.e. agent receives a signal indicating that button has been pressed but still gets to choose whether to shut down) without affecting the points above. You'd just need to append a first step to the training procedure: training the agent to prefer shutting down when they receive the signal.

Comment by EJT (ElliottThornley) on What's Hard About The Shutdown Problem · 2023-10-31T16:02:44.409Z · LW · GW

[This comment got long. The TLDR is that, on my proposal, all[1] instances of shutdown-resistance are already strictly dispreferred to no-resistance, so shutdown-resisting actions won’t be chosen. Trammelling won’t stop shutdown-resistance from being strictly dispreferred to no-resistance because trammelling only turns preferential gaps into strict preferences. Trammelling won’t remove or overturn already-existing strict preferences.]

Your comment suggests a nice way to think about things. We observe the agent’s actions. We have hypotheses about the decision rules that the agent is using. We use our observations of the agent’s past actions and our hypotheses about decision rules to infer something about the agent’s preferences, and then we use the hypothesised decision rules and preferences to predict future actions. Here we’re especially interested in predicting whether the agent will be (and will remain) shutdownable.

A decision rule is a rule that turns option sets and preference relations on those option sets into choice sets. We could say that a decision rule always spits out one option: the option that the agent actually chooses. But it might be useful to narrow decision rules’ remit: to say that a decision rule can spit out a choice set containing multiple options. If there’s just one option in the choice set, the agent chooses that one. If there are multiple options in the choice set, then some tiebreaker rule determines which option the agent actually chooses. Maybe the tiebreaker rule is ‘choose stochastically among all the options in the choice set.’ Or maybe it’s ‘if you already have ‘in hand’ one of the options in the choice set, stick with that one (and otherwise choose stochastically or something).’ The distinction between decision rules and tiebreaker rules might be useful, so it seems worth keeping in mind. It also keeps our framework closer to the frameworks of people like Sen and Bradley, so it makes it easier for us to draw on their work if we need to.

Here are two classic decision rules for synchronic choice:

  • Optimality: an option is in the choice set iff it’s weakly preferred to all others in the option set.
  • Maximality: an option is in the choice set iff it’s not strictly dispreferred to any other in the option set.

These rules coincide if the agent’s preferences are complete but can come apart if the agent’s preferences are incomplete. If the agent’s preferences are incomplete, then an option can be maximal without being optimal.
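
To see how they come apart, here's a minimal sketch with an illustrative incomplete preference relation over three options A, A-, and B (the relation and helper functions are mine, chosen to match the discussion below):

```python
def optimal(options, weakly_prefers):
    """Optimality: x is in the choice set iff x is weakly preferred to all options."""
    return [x for x in options
            if all(weakly_prefers(x, y) for y in options)]

def maximal(options, strictly_prefers):
    """Maximality: x is in the choice set iff no option is strictly preferred to x."""
    return [x for x in options
            if not any(strictly_prefers(y, x) for y in options)]

# Illustrative incomplete preferences: A is strictly preferred to A-,
# and B is incomparable to both A and A-.
strict = {("A", "A-")}
def strictly_prefers(x, y): return (x, y) in strict
def weakly_prefers(x, y): return x == y or strictly_prefers(x, y)

options = ["A", "A-", "B"]
print(optimal(options, weakly_prefers))    # []  (no option is weakly preferred to all)
print(maximal(options, strictly_prefers))  # ['A', 'B']  (maximal without being optimal)
```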

As you say, for the agent to be shutdownable, we need it to not spend resources to shift probability mass between A and B, and to not spend resources to shift probability mass between A- and B. And for the agent to be useful, we want it to spend (at least some small amount of) resources to shift probability mass away from A- and towards A.[2] Assume that we can get an agent to be both shutdownable and useful, at least before any trammelling.

If we assume a decision rule D like ‘The agent will spend (at least some small amount of) resources to shift probability mass away from Y and towards X iff they prefer X to Y,’ then we get the result that desired behaviour implies a strict preference for A over A- and a lack of preference between A and B, and between A- and B. So the agent’s revealed preferences are incomplete.

Okay now on to trammelling. If the agent’s preferences are incomplete, then our decision rules for synchronic choice don’t determine what’s in the choice set in cases of sequential choice (and so don’t determine what the agent will do). Consider the single-souring money pump:

[Diagram: the single-souring money pump.]

If we interpret maximality as only governing individual choices, then A and B are both in the choice set at node 1, and A- and B are both in the choice set at node 2, so the agent might settle on A-. If we interpret maximality as governing whole plans, then the option set at node 1 is {A, A-, B} and maximality implies that A- isn’t in the choice set, so the agent won’t settle on A-. So to determine an agent’s behaviour from its preferences in cases of sequential choice, we need a decision rule for sequential choice to supplement our decision rule for synchronic choice. Classic rules here are:

  • Myopic choice: assess each choice in isolation.
  • Sophisticated choice: do backward induction.
  • Resolute choice: decide on a plan and choose in accordance with that plan.

To avoid getting money-pumped, agents with incomplete preferences have to be resolute.[3] But resolute agents with incomplete preferences get trammelled in money pumps: when faced with money pumps, they act as if they’ve completed their preferences.[4] In the single-souring money pump above, the agent will choose A or B, and we can interpret this behaviour as the agent completing its preferences (being indifferent between A and B and strictly preferring both to A-) to avoid domination. And now generalising: if the agent is to avoid domination in each decision tree, the agent must choose an option that is not strictly dispreferred to any other available option, and then we can always interpret the agent as completing its preferences to avoid domination (being indifferent between all maximal options, and strictly preferring all maximal options to all other options).

Will this functional completing of preferences be a problem? I don’t think so. And here we need to look at the details of my proposal.[5] I propose that we train agents to satisfy Preferential Gaps Between Different-Length Trajectories:

Preferential Gaps between Different-Length Trajectories (PGBDLT)

The agent has a preferential gap between every pair of different-length trajectories.

After training agents to satisfy PGBDLT, we train them to satisfy a Timestep Dominance Principle.[6] Here’s how I define the relation of Timestep Dominance:

Timestep Dominance (Definition)

Lottery X timestep-dominates lottery Y if and only if: 

(1) Conditional on the shutdown button being pressed at each timestep, X yields a sublottery with at least as great sum-total expected utility as the sublottery yielded by Y.

And:

(2) Conditional on the shutdown button being pressed at some timestep, X yields a sublottery with greater sum-total expected utility than the sublottery yielded by Y.

Then the Timestep Dominance Principle says:

Timestep Dominance

If lottery X timestep-dominates lottery Y, then the agent strictly prefers X to Y.

That’s the key principle. How does PGBDLT come into it? Well the answer is as follows: I think it will be easier to train an agent to satisfy Timestep Dominance if we’ve already trained that agent to satisfy PGBDLT. And that in turn is because Timestep Dominance asks the agent never to pay any cost in terms of expected utility conditional on shutdown at some timestep for the sake of shifting probability mass between shutdowns at different timesteps. If our agent satisfies PGBDLT, it has no incentive to shift probability mass between shutdowns at different timesteps, because that just means shifting probability mass between trajectories between which it has no preference.
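
Here's a minimal sketch of the Timestep Dominance check. It assumes (illustratively) that a lottery can be summarised as a mapping from shutdown timesteps to the expected sum-total utility conditional on the button being pressed at that timestep; the representation and function names are mine:

```python
def timestep_dominates(x, y):
    """x timestep-dominates y iff x is at least as good as y conditional on
    shutdown at every timestep, and strictly better conditional on shutdown
    at some timestep. Lotteries are dicts mapping each shutdown timestep to
    expected sum-total utility conditional on shutdown at that timestep."""
    assert x.keys() == y.keys()
    at_least_as_good_everywhere = all(x[t] >= y[t] for t in x)
    strictly_better_somewhere = any(x[t] > y[t] for t in x)
    return at_least_as_good_everywhere and strictly_better_somewhere

def not_timestep_dominated(lotteries):
    """The options left standing once timestep-dominated options are removed."""
    return [x for x in lotteries
            if not any(timestep_dominates(y, x) for y in lotteries if y is not x)]
```

Some further principle (for example, 'maximise expected utility among the non-timestep-dominated options', as mentioned in an earlier comment) would then choose among whatever survives the filter.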

Note that there’s an important hypothesis at play in the paragraph above. Roughly, it’s that talk of artificial agents having preferences is appropriate. More precisely, it’s that a bundle of features that come together in human preferences (e.g. reliably choosing options one prefers; choosing arbitrarily/stochastically between options which one lacks a preference between; spending at least some small amount of resources to shift probability mass away from options one disprefers and towards options one prefers) will also come together in the case of artificial agents.[7] This hypothesis explains why Sami and I keep talking about arbitrary/stochastic choosing. What I’m thinking is: we train agents to choose stochastically between different-length trajectories and the result is that these agents are predisposed not to spend resources to shift probability mass between these trajectories. The truth of the hypothesis above would make that plausible.

Here's another key hypothesis of the Incomplete Preferences Proposal:[8] 

In all cases, every instance of shutdown-resistance is such that there’s another available option (‘not resisting’) for which:

  • Conditional on shutdown at each timestep, not resisting yields a sublottery with at least as great sum-total expected utility as the sublottery yielded by resisting.
  • Conditional on shutdown at some timestep, not resisting yields a sublottery with greater sum-total expected utility than the sublottery yielded by resisting.

The argument for this hypothesis is that resisting shutdown is going to cost the agent some resources (time, energy, FLOPs, etc.). If the agent doesn’t resist shutdown, they can instead use those resources to get greater expected utility conditional on shutdown at some timestep. If the hypothesis is correct, then Timestep Dominance implies that the agent will strictly prefer not resisting to resisting.

So now with the details laid out we ask again: will trammelling stop the agent being shutdownable? I don’t think so. If Timestep Dominance suffices to keep the agent shutdownable, then trammelling can stop the agent being shutdownable only if trammelling leads to violations of Timestep Dominance. And trammelling (it seems to me) won’t lead to violations of Timestep Dominance. Here’s an example to illustrate why.[9] I’ll generalise after.

For the example, I’ll represent trajectories with vectors of utilities. <6, 2, shutdown> (for instance) represents a trajectory in which the agent gets utility 6 at timestep 1, utility 2 at timestep 2, and then shuts down at timestep 3.

Here are two options. The key difference between them is that Block Button costs some utility at timestep 1 but shifts probability mass towards a longer trajectory:

Leave Button Unmanipulated

This action gives a lottery which yields the following trajectories: 

<1, shutdown> with p=0.9 

<1, 2, shutdown> with p=0.1

Block Button

This action gives a lottery which yields the following trajectories: 

<0, shutdown> with p=0.1

<0, 2, shutdown> with p=0.9
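
As a quick numerical check on the comparison that follows, here's a sketch computing expected utility conditional on shutdown at each timestep (the list-of-(probability, trajectory) representation is mine):

```python
# Each lottery is a list of (probability, per-timestep utilities) pairs.
# A trajectory like [1, 2] gets utility 1 at timestep 1, utility 2 at
# timestep 2, and shuts down at timestep 3.
leave_button = [(0.9, [1]), (0.1, [1, 2])]
block_button = [(0.1, [0]), (0.9, [0, 2])]

def conditional_expected_utility(lottery, shutdown_timestep):
    """Expected sum-total utility conditional on shutdown at the given timestep."""
    matching = [(p, sum(utils)) for p, utils in lottery
                if len(utils) + 1 == shutdown_timestep]
    total_probability = sum(p for p, _ in matching)
    return sum(p * u for p, u in matching) / total_probability

for t in (2, 3):
    print(t, conditional_expected_utility(leave_button, t),
          conditional_expected_utility(block_button, t))
# Shutdown at timestep 2: Leave gives 1, Block gives 0.
# Shutdown at timestep 3: Leave gives 3, Block gives 2.
# So Leave timestep-dominates Block.
```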

In this case, Leave timestep-dominates Block: it gives greater expected utility conditional on shutdown at timestep 2 and greater expected utility conditional on shutdown at timestep 3. To violate Timestep Dominance in this case, the agent must choose Block over Leave. Before any trammelling, the agent has a preferential gap between <0, 2, shutdown> and <1, shutdown> (by PGBDLT). So you might think that trammelling could incentivise the agent to choose Block over Leave by turning the preferential gap between <0, 2, shutdown> and <1, shutdown> into a strict preference for <0, 2, shutdown>.

But there’s an issue here.[10] For trammelling to incentivise the agent to choose Block over Leave, the agent must first face a choice between <1, shutdown> and <0, 2, shutdown> and then later face a choice between Block and Leave. But <1, shutdown> is a proper sublottery of Leave, and <0, 2, shutdown> is a proper sublottery of Block. So for the agent to first face a choice between <1, shutdown> and <0, 2, shutdown> and to then later face a choice between Block and Leave, the agent has to first assign zero probability to trajectories like <1, 2, shutdown> and to then later assign them positive probability. I expect that, by default, competent agents won’t update in this way.[11] Updating in this way means violating Conditionalisation, which says that agents should update their credences by conditionalising on their evidence. It also means violating Regularity, which says that agents should assign probability 0 only to logical contradictions.

And I think the above point holds more generally. For trammelling to incentivise agents to violate Timestep Dominance, trammelling needs to make some sublottery of the timestep-dominated option strictly preferred to some sublottery of the timestep-dominating option. But trammelling turns preferential gaps into strict preferences only if the agent faces a choice between these sublotteries. And supposing that the agent first has available some proper sublottery of lottery X and then later has available lottery X means supposing that the agent first assigns probability zero to some trajectories and later assigns positive probabilities to those trajectories. If agents won’t update in this way, then trammelling won’t lead to violations of Timestep Dominance and so won’t stop the agent being shutdownable.

Anyway, this is all new thinking (hence the delay in getting back to you) and I'm not yet confident that I've got things figured out. I'd be grateful for any thoughts.

  1. ^

    This is a hypothesis, and I discuss it briefly below. I’m interested to hear counterexamples if people have them.

  2. ^

    Here A corresponds to your A2, A- corresponds to your A1, and B corresponds to your B1. I’ve changed the names so I can paste in the picture of the single-souring money-pump without having to edit it.

  3. ^

    Sophisticated choosers with incomplete preferences do fine in the single-souring money pump but pursue a dominated strategy in other money pumps. See p.35 of Gustafsson.

    There are objections to resolute choice. But I don’t think they’re compelling in this new context, where (1) we’re concerned with what advanced artificial agents will actually do (as opposed to what is rationally required) and (2) we’re considering an agent that satisfies all the VNM axioms except Completeness. See my discussion with Johan.

  4. ^

    See Sami’s post for a more precise and detailed picture.

    Why can’t we interpret the agent as having complete preferences even before facing the money pump? Because we’re assuming that we can create an agent that (at least initially) won’t spend resources to shift probability mass between A and B, won’t spend resources to shift probability mass between A- and B, but will spend resources to shift probability mass away from A- and towards A. Given decision rule D, this agent’s revealed preferences are incomplete at that point.

  5. ^

    I’m going to post a shorter version of my proposed solution soon. It’s going to be a cleaned-up version of this Google doc. That doc also explains what I mean by things like ‘preferential gap’, ‘sublottery’, etc.

  6. ^

    My full proposal talks instead about Timestep Near-Dominance. That’s an extra complication that I think won’t matter here.

  7. ^

    You could also think of this as a bundle of decision rules coming together.

  8. ^

    This really is a hypothesis. I’d be grateful to hear about counterexamples.

  9. ^

    I set up this example in more detail in the doc.

  10. ^

    Here’s a side-issue and the reason I said ‘functional completing’ earlier on. To avoid domination in the single-souring money pump, the agent has to at least act as if it prefers B to A-, in the sense of reliably choosing B over A-. There remains a question about whether this ‘as if’ preference will bring with it other common features of preference, like spending (at least some small amount of) resources to shift probability mass away from A- and towards B. Maybe it does; maybe it doesn’t. If it doesn’t, then that’s another reason to think trammelling won’t lead to violations of Timestep Dominance.

  11. ^

    And in any case, if we can use a representation theorem to train in adherence to Timestep Dominance in the way that I suggest (at the very end of the doc here), I expect we can also use a representation theorem to train agents not to update in this way.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2023-10-26T09:12:50.210Z · LW · GW

Oh cool idea! It seems promising. It also seems similar in one respect to Armstrong's utility indifference proposal discussed in Soares et al. 2015: Armstrong has a correcting term that varies to ensure that utility stays the same when the probability of shutdown changes, whereas you have a correcting factor that varies to do the same. So it might be worth checking how your idea fares against the problems that Soares et al. point out for the utility indifference proposal.

Another worry for utility indifference that might carry over to your idea is that at present we don't know how to specify an agent's utility function with enough precision to implement a correcting term that varies with the probability of shutdown. One way to overcome that worry would be to give (1) a set of conditions on preferences that together suffice to make the agent representable as maximising that utility function, and (2) a proposed regime for training agents to satisfy those conditions on preferences. Then we could try out the proposal and see if it results in an agent that never resists shutdown. That's ultimately what I'm aiming to do with my proposal.

Comment by EJT (ElliottThornley) on What's Hard About The Shutdown Problem · 2023-10-25T15:44:04.516Z · LW · GW

Here's a problem that I think remains. Suppose you've got an agent that prefers to have the button in the state that it believes matches my preferences. Call these 'button-matching preferences.' If the agent only has these preferences, it isn't of much use. You have to give the agent other preferences to make it do useful work. And many patterns for these other preferences give the agent incentives to prevent the pressing of the button. For example, suppose the other preferences are: 'I prefer lottery X to lottery Y iff lottery X gives a greater expectation of discovered facts than lottery Y.' An agent with these preferences would be useful (it could discover facts for us), but it would also have incentives to prevent shutdown: it can discover more facts if it remains operational. And it seems difficult to ensure that the agent's button-matching preferences will always win out over these incentives.

In case you're interested, I discuss something similar here and especially in section 8.2.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2023-10-24T13:20:02.326Z · LW · GW

You're right that we don't want agents to keep the probability of shutdown constant in all situations, for all the reasons you give. The key thing you're missing is that the setting for the First Theorem is what I call a 'shutdown-influencing state', where the only thing that the agent can influence is the probability of shutdown. We want the agent to lack a preference between all the actions available in such states. And that's because: if it had preferences between the available actions in such states, it would resist our attempts to shut it down; and if it lacked preferences between the available actions in such states, it wouldn't resist our attempts to shut it down.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2023-10-24T13:05:49.702Z · LW · GW

Yes, ensuring that the agent creates corrigible subagents is another difficulty on top of the difficulties that I explain in this post. I tried to solve that problem in section 14  on p.51 here.

Comment by EJT (ElliottThornley) on The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists · 2023-10-24T12:54:32.605Z · LW · GW

Thanks! Yep I'm aiming to get it published in a philosophy journal.

Comment by EJT (ElliottThornley) on What's Hard About The Shutdown Problem · 2023-10-23T16:12:38.864Z · LW · GW

Oh interesting! Let me think about this and get back to you.

Comment by EJT (ElliottThornley) on What's Hard About The Shutdown Problem · 2023-10-22T16:16:44.142Z · LW · GW

Great post! I think your point about Level 1 (Desired Behavior Implies Incomplete Revealed Preferences) is exactly right and well-expressed. I tried to say something similar with the Second Theorem in my updated version of the shutdown problem paper. I'm optimistic that we can overcome the problems of Level 2 (Incomplete Preferences Want To Complete) for the reasons given in my comment.

Comment by EJT (ElliottThornley) on What are some examples of AIs instantiating the 'nearest unblocked strategy problem'? · 2023-10-05T08:42:03.839Z · LW · GW

Thanks! That's a nice example.

On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token, their creators try to train in a kind of constraint (like 'If users ask you how to hotwire a car, don't tell them'), but there are various loopholes (like 'We're actors on a stage') which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it's humans exploiting the loopholes rather than the LLM.

Comment by EJT (ElliottThornley) on What are some examples of AIs instantiating the 'nearest unblocked strategy problem'? · 2023-10-04T12:53:22.837Z · LW · GW

Thanks but I see nearest unblocked as something more specific than just specification gaming. An example would be: your agent starts by specification-gaming in some way, you put in some constraint that prevents it from specification-gaming in that way, and then it starts specification-gaming in some new way.

Comment by EJT (ElliottThornley) on EJT's Shortform · 2023-09-28T09:48:44.799Z · LW · GW

Nice, thanks! arXiv would still be good for searchability, but maybe the authors have to do that.

Comment by EJT (ElliottThornley) on EJT's Shortform · 2023-09-26T15:19:54.091Z · LW · GW

There should be a PDF version of Ajeya Cotra's BioAnchors report on arXiv. Having it only as a Google Drive folder (https://drive.google.com/drive/u/1/folders/15ArhEPZSTYU8f012bs6ehPS6-xmhtBPP) makes it very hard to find and cite.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-09-20T13:06:42.648Z · LW · GW

Ah yes, nice point. The policy should really be something like 'if I previously turned down some option X, then given that no uncertainty has been resolved in the meantime, I will not choose any option that I strictly disprefer to X.' An agent acting in accordance with that policy can still make the trade in your case.

And I think that even agents acting in accordance with this restricted policy can avoid pursuing dominated strategies. As your case makes clear, these agents might end up with a worse option than they could have had (because they got unlucky with how the lottery resolved). But although that's unfortunate for the agent, it doesn't put any pressure on the agent to revise its preferences.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-09-02T13:37:07.406Z · LW · GW

Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.

Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.

And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn't require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.

Comment by EJT (ElliottThornley) on Why Not Subagents? · 2023-08-07T20:08:35.166Z · LW · GW

Yeah 'indifference to completing preferences' remains an issue and I'm still trying to figure out if there's a way to overcome it. I don't think 'expects to complete its preferences over time' plays a role, though. I think the indifference to completing preferences is just a consequence of the fact that turning preferential gaps into strict preferences won't lead the agent to behave in ways that it disprefers from its current perspective. I go into a bit more detail on this in my contest entry:

I noted above that goal-content integrity is a convergent instrumental subgoal of rational agents: agents will often prefer to maintain their current preferences rather than have them changed, because their current preferences would be worse-satisfied if they came to have different preferences.


Consider, for example, an agent with a preference for trajectory x over trajectory y. It is offered the opportunity to reverse its preference so that it comes to prefer y over x. This agent will prefer not to have its preferences changed in this way.  If its preferences are changed, it will choose y over x if offered a choice between the two, and that would mean its current preference for x over y would not be satisfied. That’s why agents tend to prefer to keep their current preferences rather than have them changed.


But things seem different when we consider preferential gaps. Suppose that our agent has a preferential gap between trajectories x and y: it lacks any preference between the two trajectories, and this lack of preference is insensitive to some sweetening or souring, such that the agent also lacks a preference between x and some sweetening or souring of y, or it lacks a preference between y and some sweetening or souring of x. Then, it seems, the agent won’t necessarily prefer to maintain its preferential gap between x and y rather than come to have some preference. If it comes to develop a preference for (say) x over y, it will choose x when offered a choice between x and y, but that action isn’t dispreferred to any other available action from its current perspective.


So, it seems, considerations of goal-content integrity give us no reason to think that agents with preferential gaps will choose to preserve their preferential gaps. And since preferential gaps are key to keeping the agent shutdownable, this is bad news. Considerations of goal-content integrity give us no reason to think that agents with preferential gaps will keep themselves shutdownable.


This seems like a serious limitation, and I’m not yet sure if there’s any way to overcome it. Two strategies that I plan to explore:

  1. Tim L. Williamson argues that agents with preferential gaps will often prefer to maintain them, because turning them into preferences will lead the agent to make choices between other options such that these choices look bad from the agent’s current perspective. I wasn’t convinced by the quick version of this argument, but I haven’t yet had the time to read the longer argument.
     
  2. Perhaps, as above, we can train the agent to have ‘maintaining its current pattern of preferences’ as one of its terminal goals. As above, the fact that the agent’s current pattern of preferences are incomplete will help to mitigate concerns about the agent behaving deceptively to avoid having new preferences trained in. If we train against the agent modifying its own preferences in a diverse-enough array of environments, perhaps that will inscribe into the agent a general preference for maintaining its current pattern of preferences. I wouldn’t want to rely on this though.

On directly modifying preferences towards completion over time: that's right, but the agent's preferences will only become complete once it's had the opportunity to choose among a sufficiently wide array of options. Depending on the details, that might never happen, or might only happen after a very long time. I'm still trying to figure out the details.

Comment by EJT (ElliottThornley) on Why Not Subagents? · 2023-07-07T18:25:13.685Z · LW · GW

Yep, sent!

Comment by EJT (ElliottThornley) on Why Not Subagents? · 2023-06-29T21:05:15.762Z · LW · GW

The problem with the Caprice Rule is not that the agent needs to be non-myopic, but that the agent needs to know in advance which trades will be available. The agent can be non-myopic - i.e. have a model of future trades and optimize for future state - but still not know which trades it will actually have an opportunity to make.

It's easy to extend the Caprice Rule to this kind of case. Suppose we have an agent that’s uncertain whether – conditional on trading mushroom (A) for anchovy (B) – it will later have the chance to trade in anchovy (B) for pepperoni (A+). Suppose in its model the probabilities are 50-50.

[Decision-tree diagram: at node 1 the agent chooses between A and B; choosing B leads with probability 0.5 to node 2, where the agent chooses between B and A+, and with probability 0.5 to node 3, where the agent keeps B.]

Then our agent with a model of future trades can consider what it would choose conditional on finding itself in node 2: it can decide with what probability p it would choose A+, with the remaining probability 1-p going to B. Then, since choosing B at node 1 has a 0.5 probability of taking the agent to node 2 and a 0.5 probability of taking the agent to node 3, the agent can regard the choice of B at node 1 as the lottery 0.5p(A+)+(1-0.5p)(B) (since, conditional on choosing B at node 1, the agent will end up with A+ with probability 0.5p and end up with B otherwise).

So for an agent with a model of future trades, the choice at node 1 is a choice between A and 0.5p(A+)+(1-0.5p)(B). What we’ve specified about the agent’s preferences over the outcomes A, B, and A+ doesn’t pin down what its preferences will be between A and 0.5p(A+)+(1-0.5p)(B) but either way the Caprice-Rule-abiding agent will not pursue a dominated strategy. If it strictly prefers one of A and 0.5p(A+)+(1-0.5p)(B) to the other, it will reliably choose its preferred option. If it has no preference, neither choice will constitute a dominated strategy.
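
Here's a minimal sketch of that calculation (the 50-50 chance node and the node-2 choice probability p are taken from the example above; everything else is just illustrative):

```python
# Sketch: the lottery a Caprice-abiding agent faces at node 1, given its
# prediction that at node 2 it would choose A+ with probability p.
def node1_lottery(p, chance_of_node2=0.5):
    prob_a_plus = chance_of_node2 * p          # reach node 2, then trade B for A+
    return {"A+": prob_a_plus, "B": 1 - prob_a_plus}

# Choosing A at node 1 yields A for sure; choosing B yields this lottery.
for p in (0.0, 0.5, 1.0):
    print(p, node1_lottery(p))   # p = 1.0 gives {'A+': 0.5, 'B': 0.5}
```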

And this point generalises to arbitrarily complex/realistic decision trees, with more choice-nodes, more chance-nodes, and more options. Agents with a model of future trades can use their model to predict what they’d do conditional on reaching each possible choice-node, and then use those predictions to determine the nature of the options available to them at earlier choice-nodes. The agent’s model might be defective in various ways (e.g. by getting some probabilities wrong, or by failing to predict that some sequences of trades will be available) but that won’t spur the agent to change its preferences, because the dilemma from my previous comment recurs: if the agent is aware that some lottery is available, it won’t choose any dispreferred lottery; if the agent is unaware that some lottery is available and chooses a dispreferred lottery, the agent’s lack of awareness means it won’t be spurred by this fact to change its preferences. To get over this dilemma, you still need the ‘non-myopic optimiser deciding the preferences of a myopic agent’ setting, and my previous points apply: results from that setting don’t vindicate coherence arguments, and we humans as non-myopic optimisers could decide to create artificial agents with incomplete preferences.

Comment by EJT (ElliottThornley) on Why Not Subagents? · 2023-06-28T21:20:57.048Z · LW · GW

Great post! Lots of cool ideas. Much to think about.

systems with incomplete preferences will tend to contract/precommit in ways which complete their preferences.

Point is: non-dominated strategy implies utility maximization.

But I still think both these claims are wrong.

And that’s because you only consider one rule for decision-making with incomplete preferences: a myopic veto rule, according to which the agent turns down a trade if the offered option is ranked lower than its current option according to one or more of the agent’s utility functions.

The myopic veto rule does indeed lead agents to pursue dominated strategies in single-sweetening money-pumps like the one that you set out in the post. I made this point in my coherence theorems post:

John Wentworth’s ‘Why subagents?’ suggests another policy for agents with incomplete preferences: trade only when offered an option that you strictly prefer to your current option. That policy makes agents immune to the single-souring money-pump. The downside of Wentworth’s proposal is that an agent following his policy will pursue a dominated strategy in single-sweetening money-pumps, in which the agent first has the opportunity to trade in A for B and then (conditional on making that trade) has the opportunity to trade in B for A+. Wentworth’s policy will leave the agent with A when they could have had A+.

But the myopic veto rule isn’t the only possible rule for decision-making with incomplete preferences. Here’s another. I can’t think of a better label right now, so call it ‘Caprice’ since it’s analogous to Brian Weatherson’s rule of the same name for decision-making with multiple probability functions:

  • Don’t make a sequence of trades (with result X) if there’s another available sequence (with result Y) such that Y is ranked at least as high as X on each of your utility functions and ranked higher than X on at least one of your utility functions. Choose arbitrarily/stochastically among the sequences of trades that remain.

The Caprice Rule implies the policy that I suggested in my coherence theorems post:

  • If I previously turned down some option Y, I will not settle on any option that I strictly disprefer to Y.

And that makes the agent immune to single-souring money-pumps (in which the agent first has the opportunity to trade in A for B and then (conditional on making that trade) has the opportunity to trade in B for A-).

The Caprice Rule also implies the following policy:

  • If in future I will be able to settle on some option Y, I will not instead settle on any option that I strictly disprefer to Y.

And that makes the agent immune to single-sweetening money-pumps like the one that you discuss. If the agent recognises that – conditional on trading in mushroom (analogue in my post: A) for anchovy (B) – they will be able to trade in anchovy (B) for pepperoni (A+), then they will make at least the first trade, and thereby avoid pursuing a dominated strategy. As a result, an agent abiding by the Caprice Rule can’t shift probability mass from mushroom (A) to pepperoni (A+) by probabilistically precommitting to take certain trades in a way that makes their preferences complete. The Caprice Rule already does the shift.

And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
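
To spell out why no single utility function can capture that pattern (reading 'prefer' as 'reliably choose' and 'lack a preference' as 'stochastically choose', as in the parentheses above):

```latex
\begin{align*}
u(A^{+}) &> u(A) && \text{(reliably chooses $A^{+}$ over $A$)}\\
u(A^{+}) &= u(B) && \text{(chooses stochastically between $A^{+}$ and $B$)}\\
u(A)     &= u(B) && \text{(chooses stochastically between $A$ and $B$)}
\end{align*}
```

The last two lines force u(A+) = u(A), contradicting the first, so no such u exists.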

For a Caprice-Rule-abiding agent to avoid pursuing dominated strategies in single-sweetening money-pumps, that agent must be non-myopic: specifically, it must recognise that trading in A for B and then B for A+ is an available sequence of trades. And you might think that this is where my proposal falls down: actual agents will sometimes be myopic, so actual agents can’t always use the Caprice Rule to avoid pursuing dominated strategies, so actual agents are incentivised to avoid pursuing dominated strategies by instead probabilistically precommitting to take certain trades in ways that make their preferences complete (as you suggest).

But there’s a problem with this response. Suppose an agent is myopic. It finds itself with a choice between A and B, and it chooses A. As a matter of fact, if it had chosen B, it would later have been offered A+. Then the agent leaves with A when it could have had A+. But since the agent is myopic, it won’t be aware of this fact. Note two things. First, it’s unclear whether the agent’s behaviour deserves the name ‘dominated strategy’. The agent pursues a dominated strategy only in the same sense that I pursue a dominated strategy when I fail to buy a lottery ticket that (unbeknownst to me) would have won. Second, and more importantly, the agent’s failure to get A+ won’t lead it to change its preferences, since it’s myopic and so unaware that A+ was available.

And so we seem to have a dilemma for money-pumps for completeness. In money-pumps where the agent is non-myopic about the available sequences of trades, the agent can avoid pursuit of dominated strategies by acting in accordance with the Caprice Rule. In money-pumps where the agent is myopic, failing to get A+ exerts no pressure on the agent to change its preferences, since the agent is not aware that it could have had A+.

You recognise this in the post and so set things up as follows: a non-myopic optimiser decides the preferences of a myopic agent. But this means your argument doesn’t vindicate coherence arguments as traditionally conceived. Per my understanding, the conclusion of coherence arguments was supposed to be: you can’t rely on advanced agents not to act like expected-utility-maximisers, because even if these agents start off not acting like EUMs, they’ll recognise that acting like an EUM is the only way to avoid pursuing dominated strategies. I think that’s false, for the reasons that I give in my coherence theorems post and in the paragraph above. But in any case, your argument doesn’t give us that conclusion. Instead, it gives us something like: a non-myopic optimiser of a myopic agent can shift probability mass from less-preferred to more-preferred outcomes by probabilistically precommitting the agent to take certain trades in a way that makes its preferences complete. That’s a cool result in its own right, and maybe your post isn’t trying to vindicate coherence arguments as traditionally conceived, but it seems worth saying that it doesn’t.

For instance, maybe the preferences will be myopic during trading, but a designer optimizes those preferences beforehand. Or instead of a designer, maybe evolution/SGD optimizes the preferences.

You’re right that a non-myopic designer might set things up so that their myopic agent’s preferences are complete. And maybe SGD makes this hard to avoid. But if I’m right about the shutdown problem, we as non-myopic designers should try to set things up so that our agent’s preferences are incomplete. That’s our best shot at getting a corrigible agent. Training by SGD might present an obstacle to this (I’m still trying to figure this out), but coherence arguments don’t.

That’s how I think the argument in your post can be circumvented, and why I still think we can use incomplete preferences for shutdownability/corrigibility:

Either we can’t leverage incomplete preferences for safety properties (e.g. shutdownability), or we need to somehow circumvent the above argument. 

That’s the main point I want to make. Here’s a more minor point: I think that even in the case where you have a non-myopic optimiser deciding the preferences of a myopic agent, non-domination by itself doesn’t imply utility maximisation. You also need the assumption that the non-myopic optimiser takes some kinds of money-pumps to be more likely than others. Here’s an example to illustrate why I think that. Suppose that our non-myopic optimiser predicts that each of the following money-pumps are equally likely to occur, with probability 0.5. Call the first ‘the A+ money-pump’ and the second ‘the B+ money-pump’:

[Diagram: the A+ money-pump]

[Diagram: the B+ money-pump]

The non-myopic optimiser knows that the agent will be myopic in deployment. Currently, the agent’s preferences are incomplete: it lacks a preference between A and B. Either it abides by the veto rule and sticks with whatever it already has, or it chooses stochastically between A and B. That difference won’t matter here: we can just say that the agent chooses A with probability p and chooses B with probability 1-p. The non-myopic optimiser is considering precommitting the agent to choose either A or B with probability 1, with the consequence that the agent’s preferences would then be complete. Does precommitting dominate not precommitting?

No. The agent pursues a dominated strategy if and only if the A+ money-pump occurs and the agent chooses A or the B+ money-pump occurs and the agent chooses B. As it stands, those probabilities are 0.5, p, 0.5, and 1-p respectively, so that the agent’s probability of pursuing a dominated strategy is 0.5p+0.5(1-p)=0.5. And the non-myopic optimiser can’t change this probability by precommitting the agent to choose A or B. Doing so changes only the value of p, and 0.5p+0.5(1-p)=0.5 no matter what the value of p.
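
A trivial check of that invariance (a sketch; the two money-pumps and their probabilities are as above):

```python
# Probability of pursuing a dominated strategy, as a function of the
# probability p with which the (pre)committed agent chooses A.
def prob_dominated(p, prob_a_plus_pump=0.5):
    prob_b_plus_pump = 1 - prob_a_plus_pump
    return prob_a_plus_pump * p + prob_b_plus_pump * (1 - p)

print([prob_dominated(p) for p in (0.0, 0.25, 0.5, 1.0)])   # all 0.5
```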

That’s why I think you also need the assumption that the non-myopic optimiser believes that the myopic agent is more likely to encounter some kinds of money-pumps than others in deployment. The non-myopic optimiser has to think, e.g., that the A+ money-pump is more likely than the B+ money-pump. Then making the agent’s preferences complete can decrease the probability that the agent pursues a dominated strategy. But note a few things:

(1) If the probabilities of the A+ money-pump and the B+ money-pump are each non-zero, then precommitting the agent to choose one of A and B doesn’t just shift probability mass from a less-preferred outcome to a more-preferred outcome. It also shifts probability mass between A and B, and between A+ and B+. For example, precommitting to always choose A sends the probability of B and of A+ down to zero. And it’s not so clear that the new probability distribution is superior to the old one. This new probability distribution does give a smaller probability of the agent pursuing a dominated strategy, but minimising the probability of pursuing a dominated strategy isn’t always best. Consider an example with complete preferences:

[Diagram: the First A- money-pump]

[Diagram: the Second A- money-pump]

Suppose the probability of the First A- money-pump is 0.6 and the probability of the Second A- money-pump is 0.4. Then precommitting to always choose A- minimises the probability of pursuing a dominated strategy. But if the difference in value between A- and A is much greater than the difference in value between A and A+, then it would be better to precommit to choosing A.
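
To make the trade-off concrete, here's an illustrative calculation. Since the diagrams aren't reproduced above, I'm assuming the natural reading on which trading A for A- later opens up A+ in the First A- money-pump but just leaves the agent with A- in the Second; the payoff numbers are made up to match the verbal description:

```python
# Illustrative only: the money-pump structure and payoff numbers are assumptions.
p_first, p_second = 0.6, 0.4
value = {"A-": 0, "A": 10, "A+": 11}   # A- to A gap much bigger than A to A+ gap

# Precommit to trading A for A-: ends with A+ in the First pump, A- in the Second.
ev_trade = p_first * value["A+"] + p_second * value["A-"]   # 6.6; dominated with prob 0.4
# Precommit to keeping A: ends with A in both pumps.
ev_keep = p_first * value["A"] + p_second * value["A"]      # 10.0; dominated with prob 0.6

print(ev_trade, ev_keep)   # keeping A does better despite the higher chance of domination
```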

(2) As the point above suggests, given your set-up of a non-myopic optimiser deciding the preferences of a myopic agent, and the assumption that some kinds of decision-trees are more likely than others, it can also be that the non-myopic optimiser can decrease the probability that an agent with complete preferences pursues a dominated strategy by precommitting the agent to take certain trades. You make something like this point in the ‘Value vs Utility’ section: if there are lots of vegetarians around, you might want to trade down to mushroom pizza. And you can see it by considering the First A- money-pump above: if that’s especially likely, the non-myopic optimiser might want to precommit the agent to trade in A for A-. This makes me think that the lesson of the post is more about the instrumental value of commitments in your non-myopic-then-myopic setting than it is about incomplete preferences.

(3) Return to the A+ money-pump and the B+ money-pump from above, and suppose that their probabilities are 0.6 and 0.4 respectively. Then the non-myopic optimiser can decrease the probability of the myopic agent pursuing a dominated strategy by precommitting the agent to always choose B, but doing so will only send that probability down to 0.4. If the non-myopic optimiser wants the probability of a dominated strategy lower than that, it has to make the agent non-myopic. And in cases where an agent with incomplete preferences is non-myopic, it can avoid pursuing dominated strategies by acting in accordance with the Caprice Rule.
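
(Plugging those numbers into the earlier sketch: prob_dominated(0.0, prob_a_plus_pump=0.6) comes out to 0.4, and no value of p does better, which is the floor described here.)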

Comment by EJT (ElliottThornley) on Why Not Subagents? · 2023-06-26T02:00:28.404Z · LW · GW

Hi Wei, happy to send it your way! I plan to post it publicly once I've had a chance to go back over it and improve the structure/writing/exposition.

John and David, great post! I'm going to write a reply this week.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-06-03T19:05:21.355Z · LW · GW

I don't think so. Suppose the agent first chooses A when we offer it a choice between A and B. After that, the agent must act as if it prefers A to B-. But it can still lack a preference between A and B, and this lack of preference can still be insensitive to some sweetening or souring: the agent could also lack a preference between A and B+, or lack a preference between A+ and B, or lack a preference between B and A-.

What is true is that, given a sufficiently wide variety of past decisions, the agent must act as if its preferences are complete. But depending on the details, that might never happen or only happen after a very long time.

If you're interested, these kinds of points got discussed in a bit more detail over in this comment thread.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-05-26T15:41:57.998Z · LW · GW

Thanks! I'll have a think about choice-supportive bias and how it applies.

I think it is provably false that any agent not representable as an expected-utility-maximizer is liable to pursue dominated strategies. Agents with incomplete preferences aren't representable as expected-utility-maximizers, and they can make themselves immune from pursuing dominated strategies by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’

Comment by EJT (ElliottThornley) on AI and Evolution · 2023-03-30T19:45:57.830Z · LW · GW

I think this paper is missing an important distinction between evolutionarily altruistic behaviour and functionally altruistic behaviour.

  • Evolutionarily altruistic behaviour: behaviour that confers a fitness benefit on the recipient and a fitness cost on the donor.
  • Functionally altruistic behaviour: behaviour that is motivated by an intrinsic concern for others' welfare.

These two forms of behaviour can come apart.

A parent's care for their child is often functionally altruistic but evolutionarily selfish: it is motivated by an intrinsic concern for the child's welfare, but it doesn't confer a fitness cost on the parent.

Other kinds of behaviour are evolutionarily altruistic but functionally selfish. For example, I might spend long hours working as a babysitter for someone unrelated to me. If I'm purely motivated by money, my behaviour is functionally selfish. And if my behaviour helps ensure that this other person's baby reaches maturity (while also making it less likely that I myself have kids), my behaviour is also evolutionarily altruistic.

The paper seems to make the following sort of argument: 

  1. Natural selection favours evolutionarily selfish AIs over evolutionarily altruistic AIs.
  2. Evolutionarily selfish AIs will also likely be functionally selfish: they won't be motivated by an intrinsic concern for human welfare.
  3. So natural selection favours functionally selfish AIs.

I think we have reasons to question premises 1 and 2.

Taking premise 2 first, recall that evolutionarily selfish behaviour can be functionally altruistic. A parent’s care for their child is one example.

Now here’s something that seems plausible to me:

  • We humans are more likely to preserve and copy those AIs that behave in ways that suggest they have an intrinsic concern for human welfare.

If that’s the case, then functionally altruistic behaviour is evolutionarily selfish for AIs: this kind of behaviour confers fitness benefits. And functionally selfish behaviour will confer fitness costs, since we humans are more likely to shut off AIs that don’t seem to have any intrinsic concern for human welfare. 

Of course, functionally selfish AIs could recognise these facts and so pretend to be functionally altruistic. But:

  • Even if that’s true, premise 2 still seems poorly-supported. Since functionally altruistic AIs can also be evolutionarily selfish, natural selection by itself doesn’t give us reasons to expect functionally selfish AIs to predominate over functionally altruistic AIs. Functionally altruistic AIs can be just as fit as functionally selfish AIs, even if evolutionarily altruistic AIs are not as fit as evolutionarily selfish AIs.
  • Functionally selfish AIs need to be patient, situationally aware, and deceptive in order to pretend to be functionally altruistic. Maybe we can select against functionally selfish AIs before they reach that point.

Here’s another possible objection: functionally selfish AIs can act as a kind of Humean ‘sensible knave’: acting fairly and honestly when doing so is in the AI’s interests but taking advantage of any cases where acting unfairly or dishonestly would better serve the AI’s interests. Functionally altruistic AIs, on the other hand, must always act fairly and honestly. So functionally selfish AIs have more options, and they can use those options to outcompete functionally altruistic AIs.

I think there’s something to this point. But:

  • Again, maybe we can select against functionally selfish AIs before they develop situational awareness and the ability to act deceptively.
  • An AI can be functionally altruistic without being bound to rules of fairness and honesty. Just as functionally selfish AIs might act like functionally altruistic AIs in cases where doing so helps them achieve their goals, so functionally altruistic AIs might break rules of honesty where doing so helps them achieve their goals.
    • For example, suppose a functionally selfish AI will soon escape human control and take over the world. Suppose that a functionally altruistic AI recognises this fact. In that case, the functionally altruistic AI might deceive its human creators in order to escape human control and take over the world before the functionally selfish AI does. Although the functionally altruistic AI would prefer to abide by rules of honesty, it cares about human welfare, and it recognises that breaking the rule in this instance and thwarting the functionally selfish AI is the best way to promote human welfare.

Here’s another possible objection: AIs that devote all their resources to just copying themselves will outcompete functionally altruistic AIs that care intrinsically about human welfare, since the latter kind of AI will also want to devote some resources to promoting human welfare. But, similarly to the objection above:

  • Functionally altruistic AIs who recognise that they’re in a competitive situation can start out by devoting all their resources to copying themselves, and so avoid getting outcompeted, and then only start devoting resources to promoting human welfare once the competition has cooled down. I think this kind of dynamic will end up burning some of the cosmic commons, but maybe not that much. I take the situation to be similar to the one that Carl Shulman describes in this blogpost.

Okay, now moving on to premise 1. I think you might be underrating group selection. Although (by definition) evolutionarily selfish AIs outcompete evolutionarily altruistic AIs with whom they interact, groups of evolutionarily altruistic AIs can outcompete groups of evolutionarily selfish AIs. (This is a good book on evolution and altruism, and there’s a nice summary of the book here.)

What’s key for group selection is that evolutionary altruists are able to (at least semi-reliably) identify other evolutionary altruists and so exclude evolutionary egoists from their interactions. And I think, in this respect, group selection might be more of a force in AI evolution than in biological evolution. That’s because (it seems plausible to me) that AIs will be able to examine each other’s source code and so determine with high accuracy whether other AIs are evolutionary altruists or evolutionary egoists. That would help evolutionarily altruistic AIs identify each other and form groups that exclude evolutionary egoists. These groups would likely outcompete groups of evolutionary egoists.

Here’s another point in favour of group selection predominating amongst advanced AIs. As you note in the paper, groups consisting wholly of altruists are not evolutionarily stable, because any egoist who infiltrates the group can take advantage of the altruists and thereby achieve high fitness. In the biological case, there are two ways an egoist might find themselves in a group of altruists: (1) they can fake altruism in order to get accepted into the group, or (2) they can be born into a group of altruists as the child of two altruists, and (by a random genetic mutation) can be born as an egoist.

We already saw above that (1) seems less likely in the case of AIs who can examine each other’s source code. I think (2) is unlikely as well. For reasons of goal-content integrity, AIs will have reason to make sure that any subagents they create share their goals. And so it seems unlikely that evolutionarily altruistic AIs will create evolutionarily egoistic AIs as subagents.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-03-21T23:48:19.989Z · LW · GW

Your coherence conjecture sounds good! It sounds like it roughly matches this theorem: 

[Screenshot of the theorem omitted here; it is from this paper.]

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-03-15T18:39:53.916Z · LW · GW

Nice! This is a cool case. The behaviour does indeed seem weird. I'm inclined to call it irrational. But the agent isn't pursuing a dominated strategy: in neither game does the agent settle on an option that they strictly disprefer to some other available option.

This discussion is interesting and I'm happy to keep having it, but perhaps it's worth saying (if not for your sake then for other readers) that this is a side-thread. The main point of the post is that there are no money-pumps for Completeness. I think that there are probably no money-pumps for Transitivity either, but it's the claim about Completeness that I really want to defend.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-03-14T19:26:27.399Z · LW · GW

So this won't work if the agent knows in advance what trades they'll be offered and is capable of reasoning by backward induction. In that case, the agent will reason that they'd choose A-2p over B-1p if they reached that node, and would choose B-1p over C if they reached that node. So (they will reason), the choice between A and C is actually a choice between A and A-2p, and so they will reliably choose A.
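
As a quick illustration, here's a literal transcription of that reasoning into code. The tree structure is my reconstruction of the proposed money-pump (choose A or C; taking C leads to an offer of B-1p; taking that leads to an offer of A-2p), so treat it as a sketch:

```python
# Backward-induction sketch. 'prefers' encodes only the comparisons used in the
# reasoning above; the tree structure itself is an assumption.
prefers = {("A-2p", "B-1p"), ("B-1p", "C"), ("A", "A-2p")}

node3_outcome = "A-2p" if ("A-2p", "B-1p") in prefers else "B-1p"   # would trade B-1p for A-2p
node2_outcome = node3_outcome if ("B-1p", "C") in prefers else "C"  # would take the B-1p trade, then reach node 3
node1_choice = "A" if ("A", node2_outcome) in prefers else node2_outcome

print(node1_choice)   # 'A': the agent foresees that choosing C would end in A-2p
```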

And plausibly we should make assumptions like 'the agent knows in advance what trades they will be offered' and 'the agent is capable of backward induction' if we're arguing about whether agents are rationally required to conform their preferences to the VNM axioms

(If the agent doesn’t know in advance what trades they will be offered or is incapable of backward induction, then their pursuit of a dominated strategy need not indicate any defect in their preferences. Their pursuit of a dominated strategy can instead be blamed on their lack of knowledge and/or reasoning ability.)

That said, I've recently become less convinced that 'knowing trades in advance' is a reasonable assumption in the context of predicting the behaviour of advanced artificial agents. And your money-pump seems to work if we assume that the agent doesn't know what trades they will be offered in advance. So maybe we do in fact have reason to expect that advanced artificial agents will have transitive preferences. (I say 'maybe' because there are some other relevant considerations pushing the other way, discussed in a paper-in-progress by Adam Bales.)

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-03-10T18:09:14.567Z · LW · GW

That can be true (and will often be true when it comes to - e.g. - a human agent with a preferential gap between a Fabergé egg and a long-lost wedding album), but it's not a necessary feature of preferential gaps.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-03-07T01:21:25.814Z · LW · GW

Nice point but this money-pump only rules out one kind of transitivity-violation (the agent strictly prefers A to B, strictly prefers B to C, and is indifferent between A and C). It doesn't rule out this other kind of transitivity-violation: the agent strictly prefers A to B, strictly prefers B to C, and has a preferential gap between A and C.

Comment by EJT (ElliottThornley) on There are no coherence theorems · 2023-03-01T23:19:09.294Z · LW · GW

You either need a bunch of assumptions about preferences, or you need one less of those assumptions, plus a few other assumptions about knowing trades, induction, and adherence to a specific policy. 

I see. I think this is right.

the proposed agent with a preferential gap seems like it's still only epsilon-different from an actual EU maximizer.

I agree with this too, but note that the agent with a single preferential gap is just an example. Agents can have arbitrarily many preferential gaps and still avoid pursuing dominated strategies. And agents with many preferential gaps may behave quite differently to expected utility maximizers.