The "Commitment Races" problem

post by Daniel Kokotajlo (daniel-kokotajlo) · 2019-08-23T01:58:19.669Z · score: 58 (28 votes) · LW · GW · 16 comments



[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future.]

This post attempts to generalize and articulate a problem that people have been thinking about [AF · GW] since at least 2016 [AF · GW]. [Edit: 2009 in fact! [LW · GW]] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.
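The Chicken case can be sketched in a few lines of Python. The payoff numbers are illustrative assumptions (they are not from this post), and `best_response_of_column` and `outcome_when_row_commits` are hypothetical helper names. The sketch only shows that a purely payoff-maximizing second mover swerves once the first mover is visibly committed to going straight.

```python
# Illustrative payoffs (an assumption): (row_action, col_action) -> (row_payoff, col_payoff).
SWERVE, STRAIGHT = "swerve", "straight"
PAYOFFS = {
    (SWERVE, SWERVE): (0, 0),
    (SWERVE, STRAIGHT): (-1, 1),
    (STRAIGHT, SWERVE): (1, -1),
    (STRAIGHT, STRAIGHT): (-10, -10),  # crash: worst outcome for both
}


def best_response_of_column(row_action):
    """Column player's payoff-maximizing reply to a row action it already knows."""
    return max((SWERVE, STRAIGHT), key=lambda a: PAYOFFS[(row_action, a)][1])


def outcome_when_row_commits(row_action):
    """Row commits first; column learns of the commitment and best-responds."""
    col_action = best_response_of_column(row_action)
    return col_action, PAYOFFS[(row_action, col_action)]


# Row throws out the steering wheel, i.e. commits to STRAIGHT.
print(outcome_when_row_commits(STRAIGHT))  # ('swerve', (1, -1))
```

With these numbers the committed player gets 1 and the later mover gets -1, even though the later mover would have preferred the mirror-image outcome had it committed first.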

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things), and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives [AF · GW] for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai [AF · GW], and Abram Demski [AF · GW] have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway.)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)

Here is a somewhat concrete example: Two consequentialist AGIs think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.
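To make the disagreement concrete, here is a minimal sketch of a bargaining problem on which the two solution concepts come apart. Everything specific in it is an assumption chosen for illustration: the linear Pareto frontier `frontier(u1) = 1 - u1/2`, the disagreement point (0, 0), and the coarse grid search standing in for an exact solver.

```python
def frontier(u1):
    """Hypothetical Pareto frontier: the most player 2 can get if player 1 gets u1."""
    return 1.0 - u1 / 2.0


GRID = [i / 1000 for i in range(1001)]  # candidate payoffs for player 1 in [0, 1]
DISAGREEMENT = (0.0, 0.0)               # what each side gets if bargaining fails


def nash_solution():
    """Frontier point maximizing the product of gains over the disagreement point."""
    u1 = max(GRID, key=lambda x: (x - DISAGREEMENT[0]) * (frontier(x) - DISAGREEMENT[1]))
    return u1, frontier(u1)


def kalai_smorodinsky_solution():
    """Frontier point where each side's gain is the same fraction of its ideal gain."""
    ideal1 = max(GRID) - DISAGREEMENT[0]
    ideal2 = max(frontier(x) for x in GRID) - DISAGREEMENT[1]
    u1 = min(
        GRID,
        key=lambda x: abs((x - DISAGREEMENT[0]) / ideal1
                          - (frontier(x) - DISAGREEMENT[1]) / ideal2),
    )
    return u1, frontier(u1)


nbs = nash_solution()               # (1.0, 0.5) on this frontier
ks = kalai_smorodinsky_solution()   # roughly (0.667, 0.667)

# Player 1 holds the Nash standard and demands its Nash share;
# player 2 holds the Kalai-Smorodinsky standard and demands its KS share.
demand1, demand2 = nbs[0], ks[1]
compatible = demand2 <= frontier(demand1)
print(nbs, ks, compatible)
```

On this frontier player 1 demands 1.0 and player 2 demands about 0.667, but only 0.5 is left for player 2 once player 1 takes 1.0. Each demand exceeds what the other side's standard counts as fair, so each agent sees the other as a bully.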

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We'd wish we had built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.
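For readers who haven't seen the centipede game, here is a rough sketch of its backward-induction logic. The payoff schedule is an assumption made up for illustration: at node k the mover can "take" and get k + 2 while the other player gets k, and if everyone passes at all `num_nodes` nodes, both get `num_nodes`. Reading "take" as "commit now" and "pass" as "keep learning" is only a loose analogy.

```python
def solve_centipede(num_nodes):
    """Backward induction on a simple centipede game.

    Returns (node at which play stops, (payoff to player 0, payoff to player 1)).
    Players alternate moves, with player 0 moving at even-numbered nodes.
    """
    # Continuation value if every remaining mover passes.
    value = (num_nodes, num_nodes)
    stop_node = None
    for k in reversed(range(num_nodes)):
        mover = k % 2
        take = [0, 0]
        take[mover] = k + 2      # the taker grabs the larger share
        take[1 - mover] = k      # the other player gets the smaller share
        # The mover takes whenever that beats what it expects from passing.
        if take[mover] > value[mover]:
            value = tuple(take)
            stop_node = k
    return stop_node, value


print(solve_centipede(10))  # (0, (2, 0))
```

Under these assumed payoffs the first mover takes immediately and the players end up with (2, 0), although both would have received 10 had everyone waited. That is the sense in which individually rational early moves leave everyone worse off.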

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

Reply: Several points:

1. Devastating commitments (e.g. "Grim Trigger") are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is much easier also, especially in an acausal context (see above.)

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things (brutal threats, costly fights) do happen to some extent even among humans today--especially in situations of anarchy. We want the AGI we build to be less likely to do that stuff than humans, not merely as likely.

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she was physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.


Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.


Comments sorted by top scores.

comment by johnswentworth · 2019-08-23T02:08:12.664Z · score: 8 (4 votes) · LW(p) · GW(p)

One big factor this whole piece ignores is communication channels: a commitment is completely useless unless you can credibly communicate it to your opponent/partner. In particular, this means that there isn't a reason to self-modify to something UDT-ish unless you expect other agents to observe that self-modification. On the other hand, other agents can simply commit to not observing whether you've committed in the first place - effectively destroying the communication channel from their end.

In a game of chicken, for instance, I can counter the remove-the-steering-wheel strategy by wearing a blindfold. If both of us wear a blindfold, then neither of us has any reason to remove the steering wheel. In principle, I could build an even stronger strategy by wearing a blindfold and using a beeping laser scanner to tell whether my opponent has swerved - if both players do this, then we're back to the original game of chicken, but without any reason for either player to remove their steering wheel.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2019-08-23T02:45:10.860Z · score: 4 (3 votes) · LW(p) · GW(p)

I think in the acausal context at least that wrinkle is smoothed out.

In a causal context, the situation is indeed messy as you say, but I still think commitment races might happen. For example, why is [blindfold+laserscanner] a better strategy than just blindfold? It loses to the blindfold strategy, for example. Whether or not it is better than blindfold depends on what you think the other agent will do, and hence it's totally possible that we could get a disastrous crash. (Just imagine that for whatever reason both agents think the other agent will probably not do pure blindfold. This can totally happen, especially if the agents don't think they are strongly correlated with each other, and sometimes even if they do, e.g. if they use CDT.) The game of chicken doesn't cease being a commitment race when we add the ability to blindfold and the ability to visibly attach laserscanners.

comment by johnswentworth · 2019-08-23T05:47:43.639Z · score: 6 (3 votes) · LW(p) · GW(p)

Blindfold + scanner does not necessarily lose to blindfold. The blindfold does not prevent swerving, it just prevents gaining information - the blindfold-only agent acts solely on its priors. Adding a scanner gives the agent more data to work with, potentially allowing the agent to avoid crashes. Foregoing the scanner doesn't actually help unless the other player knows I've foregone the scanner, which brings us back to communication - though the "communication" at this point may be in logical time, via simulation.

In the acausal context, communication kicks even harder, because either player can unilaterally destroy the communication channel: they can simply choose to not simulate the other player. The game will never happen at all unless both agents expect (based on priors) to gain from the trade.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2019-08-26T04:55:58.130Z · score: 4 (2 votes) · LW(p) · GW(p)

If you choose not to simulate the other player, then you can't see them, but they can still see you. So it's destroying one direction of the communication channel. But the direction that remains (them seeing you) is the direction most relevant for e.g. whether or not there is a difference between making a commitment and credibly communicating it to your partner. Not simulating the other player is like putting on a blindfold, which might be a good strategy in some contexts but seems kinda like making a commitment: you are committing to act on your priors in the hopes that they'll see you make this commitment and then conform their behavior to the incentives implied by your acting on your priors.

comment by Wei_Dai · 2019-08-23T06:02:46.785Z · score: 4 (2 votes) · LW(p) · GW(p)

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016.

I found some related discussions going back to 2009 [LW · GW]. It's mostly highly confused, as you might expect, but I did notice this part which I'd forgotten and may actually be relevant:

But if you are TDT, you can’t always use less computing power, because that might be correlated with your opponents also deciding to use less computing power

This could potentially be a way out of the "racing to think as little as possible before making commitments" dynamic, but if we have to decide how much to let our AIs think initially before making commitments, on the basis of reasoning like this, that's a really hairy thing to have to do. (This seems like another good reason for wanting to go with a metaphilosophical approach to AI safety instead of a decision theoretic one. What's the point of having a superintelligent AI if we can't let it figure these kinds of things out for us?)

If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.”

I'm not sure how the folk theorem shows this. Can you explain?

going updateless is like making a bunch of commitments all at once

Might be a good idea to offer some examples here to help explain updateless and for pumping intuitions.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.

Interested to hear more details about this. What would have happened if you were actually able to become updateless?

comment by Liam Donovan (liam-donovan) · 2019-12-02T18:57:12.477Z · score: 1 (1 votes) · LW(p) · GW(p)

Would trying to become less confused about commitment races before building a superintelligent AI count as a metaphilosophical approach or a decision theoretic one (or neither)? I'm not sure I understand the dividing line between the two.

comment by Wei_Dai · 2019-12-03T02:59:47.914Z · score: 5 (3 votes) · LW(p) · GW(p)

Trying to become less confused about commitment races can be part of either a metaphilosophical approach or a decision theoretic one, depending on what you plan to do afterwards. If you plan to use that understanding to directly give the AI a better decision theory which allows it to correctly handle commitment races, then that's what I'd call a "decision theoretic approach". Alternatively, you could try to observe and understand what humans are doing when we're trying to become less confused about commitment races and program or teach an AI to do the same thing so it can solve the problem of commitment races on its own. This would be an example of what I call a "metaphilosophical approach".

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2019-08-23T18:40:28.462Z · score: 1 (1 votes) · LW(p) · GW(p)

Thanks, edited to fix!

I agree with your push towards metaphilosophy.

I didn't mean to suggest that the folk theorem proves anything. Nevertheless here is the intuition: The way the folk theorem proves any status quo is possible is by assuming that players start off assuming everyone else will grim trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim trigger player 2 for violating player 1's preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be "earlier in logical time" than player 2 and make a credible commitment.
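The grim-trigger intuition can be written down as a small calculation. This is only a sketch under assumptions: per-round payoffs are constant, player 2 discounts the future by a factor `delta` with 0 <= delta < 1, and the function names and example numbers are hypothetical.

```python
def complies(status_quo, deviation, punishment, delta):
    """True if player 2 prefers accepting the status quo forever over deviating once.

    status_quo: player 2's per-round payoff under player 1's preferred status quo
    deviation:  player 2's one-round payoff from violating it
    punishment: player 2's per-round payoff once grim trigger has fired
    delta:      player 2's discount factor, with 0 <= delta < 1
    """
    comply_value = status_quo / (1 - delta)
    deviate_value = deviation + delta * punishment / (1 - delta)
    return comply_value >= deviate_value


def critical_delta(status_quo, deviation, punishment):
    """Smallest discount factor at which complying is at least as good as deviating."""
    return (deviation - status_quo) / (deviation - punishment)


# Example numbers (assumed): the status quo leaves player 2 with 1 per round,
# deviating pays 3 once, and being grim-triggered pays 0 forever after.
print(critical_delta(1, 3, 0))
print(complies(1, 3, 0, delta=0.9), complies(1, 3, 0, delta=0.5))
```

With these numbers a sufficiently patient player 2 (discount factor of at least 2/3) goes along with a status quo that mostly benefits player 1, provided it takes player 1's grim-trigger commitment as given.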

As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any) then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don't do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don't have a super clear example of how this might lead to disaster, but I intend to work one out in the future...

Same goes for my own experience. I don't have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.

comment by capybaralet · 2019-09-12T04:00:49.620Z · score: 3 (3 votes) · LW(p) · GW(p)

I have another "objection", although it's not a very strong one, and more of just a comment.

One reason game theory reasoning doesn't work very well in predicting human behavior is that games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here as "this isn't definitely a problem, but it seems important enough to worry about".

comment by Dagon · 2019-08-23T14:20:34.194Z · score: 3 (2 votes) · LW(p) · GW(p)

I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-affecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even after making an oath - often with some reputational consequences, but still possible if we deem it worthwhile. Even so, we're rightly reluctant to make serious commitments. An agent who can actually enforce its self-limitations is going to be orders of magnitude more hesitant to do so.

All that said, it's worth recognizing that an agent that's significantly better at predicting the consequences of potential commitments will pay a lower cost for the best of them, and has a material advantage over those who need flexibility because they don't have information. This isn't a race in time, it's a race in knowledge and understanding. I don't think there's any way out of that race - more powerful agents are going to beat weaker ones most of the time.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2019-12-30T20:23:43.878Z · score: 4 (3 votes) · LW(p) · GW(p)

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situation arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to plan X which benefits me, else war, and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.

comment by FeepingCreature · 2019-12-30T03:00:25.208Z · score: 2 (2 votes) · LW(p) · GW(p)

I think this undervalues conditional commitments. The problem of "early commitment" depends entirely on you possibly having a wrong image of the state of the world. So if you just condition your commitment on the information you have available, you avoid premature commitments made in ignorance and give other agents an incentive to improve your world model. Likewise, this would protect you from learning about other agents' commitments "too late" - you can always just condition on things like "unless I find an agent with commitment X". You can do this whether or not you even know to think of an agent with commitment X, as long as other agents who care about X can predict your reaction to learning about X.

Commitments aren't inescapable shackles, they're just another term for "predictable behavior." The usefulness of commitments doesn't require you to bind yourself regardless of learning any new information about reality. Oaths are highly binding for humans because we "look for excuses", our behavior is hard to predict, and we can't reliably predict and evaluate complex rule systems. None of those should pose serious problems for trading superintelligences.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-01-03T14:29:57.536Z · score: 6 (2 votes) · LW(p) · GW(p)

I don't think this solves the problem, though it is an important part of the picture.

The problem is, which conditional commitments do you make? (A conditional commitment is just a special case of a commitment.) "I'll retaliate against A by doing B, unless [insert list of exceptions here]." Thinking of appropriate exceptions is important mental work, and you might not think of all the right ones for a very long time. Moreover, while you are thinking about which exceptions you should add, you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to and then if you are a coward you will "give in" by making an exception for that agent. The problem persists, in more or less exactly the same form, in this new world of conditional commitments. (Again, which are just special cases of commitments, I think.)

comment by FeepingCreature · 2020-01-04T02:05:52.698Z · score: 1 (1 votes) · LW(p) · GW(p)

I concur in general, but:

you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to and then if you are a coward you will “give in” by making an exception for that agent.

this seems like a problem for humans and badly-built AIs. Nothing that reliably one-boxes should ever do this.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-01-04T16:38:50.308Z · score: 3 (3 votes) · LW(p) · GW(p)

EDT reliably one-boxes, but EDT would do this.

Or do you mean one-boxing in Transparent Newcomb? Then your claim might be true, but even then it depends on how seriously we take the "regardless of what you commit to" clause.

comment by FeepingCreature · 2020-01-05T15:38:53.135Z · score: 2 (2 votes) · LW(p) · GW(p)

True, sorry, I forgot the whole set of paradoxes that led up to FDT/UDT. I mean something like... "this is equivalent to the problem that FDT/UDT already has to solve anyways." Allowing you to make exceptions doesn't make your job harder.