# Why you must maximize expected utility

post by Benya (Benja) · 2012-12-13T01:11:13.339Z · LW · GW

*This post explains the von Neumann-Morgenstern (VNM) axioms for decision theory, and what follows from them: that if you have a consistent direction in which you are trying to steer the future, you must be an expected utility maximizer. I'm writing this post in preparation for a sequence on updateless anthropics, but I'm hoping that it will also be independently useful.*

The theorems of decision theory say that if you follow certain axioms, then your behavior is described by a utility function. (If you don't know what that means, I'll explain below.) So you should have a utility function! Except, why should you want to follow these axioms in the first place?

A couple of years ago, Eliezer explained how violating one of them can turn you into a money pump — how, at time 11:59, you will *want* to pay a penny to get option B instead of option A, and then at 12:01, you will *want* to pay a penny to switch back. Either that, or the game will have ended and the option won't have made a difference.

When I read that post, I was suitably impressed, but not completely convinced: I would certainly not want to behave one way if behaving differently *always* gave better results. But couldn't you avoid the problem by violating the axiom only in situations where it doesn't give anyone an opportunity to money-pump you? I'm not saying that would be *elegant*, but is there a reason it would be *irrational*?

It took me a while, but I have since come around to the view that you really must have a utility function, and really must behave in a way that maximizes the expectation of this function, on pain of stupidity (or at least that there are strong arguments in this direction). But I don't know any source that comes close to explaining the reason, the way I see it; hence, this post.

I'll use the von Neumann-Morgenstern axioms, which assume probability theory as a foundation (unlike the Savage axioms, which actually *imply* that anyone following them has not only a utility function but also a probability distribution). I will assume that you already accept Bayesianism.

*

*Epistemic* rationality is about figuring out what's true; *instrumental* rationality is about steering the future where you want it to go. The way I see it, the axioms of decision theory tell you how to have a consistent *direction* in which you are trying to steer the future. If my choice at 12:01 depends on whether at 11:59 I had a chance to decide differently, then perhaps I won't ever be money-pumped; but if I want to save as many human lives as possible, and I must decide between different plans that have different probabilities of saving different numbers of people, then it starts to at least seem *doubtful* that which plan is better at 12:01 could *genuinely* depend on my opportunity to choose at 11:59.

So how do we formalize the notion of a coherent direction in which you can steer the future?

*

## Setting the stage

Decision theory asks what you would do if faced with choices between different sets of options, and then places restrictions on how you can act in one situation, depending on how you would act in others. This is another thing that has always bothered me: If we are talking about choices between different lotteries with small prizes, it makes some sense that we could invite you to the lab and run ten sessions with different choices, and you should probably act consistently across them. But if we're interested in the big questions, like how to save the world, then you're not going to face a series of independent, analogous scenarios. So what is the *content* of asking what you would do if you faced a set of choices different from the one you actually face?

The real point is that you have bounded computational resources, and you can't *actually* visualize the exact set of choices you might face in the future. A perfect Bayesian rationalist could just figure out what they *would* do in any conceivable situation and write it down in a giant lookup table, which means that they only face a single one-time choice between different possible tables. But *you* can't do that, and so you need to figure out general principles to follow. A perfect Bayesian is like a Carnot engine — it's what a theoretically perfect engine *would* look like, so even though you can at best approximate it, it still has something to teach you about how to build a real engine.

But decision theory is *about* what a perfect Bayesian would do, and it's annoying to have our practical concerns intrude into our ideal picture like that. So let's give our story some local color and say that *you* aren't a perfect Bayesian, but you have a genie — that is, a powerful optimization process — that is, an AI, which *is*. (That, too, is physically impossible: AIs, like humans, can only approximate perfect Bayesianism. But we *are* still idealizing.) Your *genie* is able to comprehend the set of possible giant lookup tables it must choose between; *you* must write down a formula, to be evaluated by the genie, that chooses the best table from this set, given the available information. (An unmodified human won't *actually* be able to write down an exact formula describing their preferences, but we might be able to write down one for a paperclip maximizer.)

The first constraint decision theory places on your formula is that it must order all options your genie *might* have to choose between from best to worst (though you might be indifferent between some of them), and then given any particular set of feasible options, it must choose the one that is least bad. In particular, if you prefer option A when options A and B are available, then you can't prefer option B when options A, B and C are available.
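This consistency requirement is easy to render as a toy program. A minimal sketch (the options and their ranking are, of course, made up):

```python
# A fixed total order over all options guarantees the consistency property
# in the text: if A is chosen from {A, B}, then B is never chosen from any
# larger feasible set that still contains A. The ranking is hypothetical.

RANK = {"A": 0, "B": 1, "C": 2}  # lower number = better

def choose(feasible):
    """Pick the best available option under the fixed ranking."""
    return min(feasible, key=lambda option: RANK[option])

assert choose({"A", "B"}) == "A"
assert choose({"A", "B", "C"}) == "A"  # adding C cannot make B the choice
```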

**Meditation:** *Alice is trying to decide how large a bonus each member of her team should get this year. She has just decided on giving Bob the same, already large, bonus as last year when she receives an e-mail from the head of a different division, asking her if she can recommend anyone for a new project he is setting up. Alice immediately realizes that Bob would love to be on that project, and would fit the bill exactly. But she *needs* Bob on the contract he's currently working on; losing him would be a pretty bad blow for her team.*

*Alice decides there is no way that she can recommend Bob for the new project. But she still feels bad about it, and she decides to make up for it by giving Bob a larger bonus. On reflection, she finds that she genuinely feels that this is the *right* thing to do, simply because she *could* have recommended him but didn't. Does that mean that Alice's preferences are irrational? Or that something is wrong with decision theory?*

**Meditation:** *One kind of answer to the above and to many other criticisms of decision theory goes like this: Alice's decision isn't between giving Bob a larger bonus or not, it's between (give Bob a larger bonus unconditionally), (give Bob the same bonus unconditionally), (only give Bob a larger bonus if I could have recommended him), and so on. But if *that* sort of thing is allowed, is there *any* way left in which decision theory constrains Alice's behavior? If not, what good is it to Alice in figuring out what she should do?*

...

...

...

*

## Outcomes

My short answer is that Alice can care about anything she damn well likes. But there are a lot of things that she *doesn't* care about, and decision theory has something to say about *those*.

In fact, deciding that some kinds of preferences should be outlawed as irrational can be dangerous: you might think that nobody in their right mind should ever care about the detailed planning algorithms their AI uses, as long as they work. But how certain are you that it's wrong to care about whether the AI has planned out your whole life in advance, in detail? (Worse: Depending on how strictly you interpret it, this injunction might even rule out not wanting the AI to run conscious simulations of people.)

But nevertheless, I believe the "anything she damn well likes" needs to be qualified. Imagine that Alice and Carol both have an AI, and fortuitously, both AIs have been programmed with the same preferences and the same Bayesian prior (and they talk, so they also have the same posterior, because Bayesians cannot agree to disagree). But Alice's AI has taken over the stock markets, while Carol's AI has seized the world's nuclear arsenals (and is protecting them well). So Alice's AI not only doesn't want to blow up Earth, it couldn't do so *even if it wanted to*; it couldn't even bribe Carol's AI, because Carol's AI really doesn't want the Earth blown up either. And so, if it makes a difference to the AIs' preference function whether they *could* blow up Earth if they wanted to, they have a conflict of interest.

The moral of this story is not simply that it would be *sad* if two AIs came into conflict even though they have the same preferences. The point is that we're asking what it means to have a consistent direction in which you are trying to steer the future, and it doesn't look like our AIs are on the same bearing. Surely, a direction for steering the world should only depend on features of the *world*, not on additional information about which agent is at the rudder.

You *can* want to not have your life planned out by an AI. But I think you should have to state your wish as a property of the world: you want *all* AIs to refrain from doing so, not just "whatever AI happens to be executing this". And Alice can want Bob to get a larger bonus if the company could have assigned him to the new project and decided not to, but she must figure out whether *this* is the correct way to translate her moral intuitions into preferences over properties of the world.

*

You may care about any feature of the world, but you don't in fact care about most of them. For example, there are many ways the atoms in the sun could be arranged that all add up to the same thing as far as you are concerned, and you don't have *terminal* preferences about which of these will be the actual one tomorrow. And though you might care about *some* properties of the algorithms your AI is running, mostly they *really* do not matter.

Let's define a function that takes a complete description of the world — past, present and future — and returns a data structure containing all information about the world that matters to your terminal values, and *only* that information. (Our imaginary perfect Bayesian doesn't know exactly which way the world will turn out, but it can work with "possible worlds", complete descriptions of ways the world *may* turn out.) We'll call this data structure an "outcome", and we require you to be indifferent between any two courses of action that will always produce the same outcome. Of course, any course of action is something that your AI would be executing in the actual world, and you are certainly allowed to care about the difference — but then the two courses of action do not lead to the same "outcome"!^{1}
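To make the definition concrete, here is a toy sketch in Python; which features of the world matter is, of course, entirely made up:

```python
# A toy rendering of the "outcome function": it projects a complete
# world-description onto only those features the terminal values care about.
# The choice of features here is hypothetical.

def outcome(world):
    """Keep only the cared-about features of a fully-described world."""
    return (world["lives_saved"], world["ai_planned_my_life"])

world_1 = {"lives_saved": 3, "ai_planned_my_life": False, "sun_atom_layout": "A"}
world_2 = {"lives_saved": 3, "ai_planned_my_life": False, "sun_atom_layout": "B"}

# Two worlds that differ only in things you don't terminally care about
# collapse to the same outcome, so you must be indifferent between actions
# that are certain to produce one or the other:
assert outcome(world_1) == outcome(world_2)
```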

With this definition, I think it is pretty reasonable to say that in order to have a consistent direction in which you want to steer the world, you must be able to order these outcomes from best to worst, and always want to pick the least bad you can get.

*

## Preference relations

That won't be *sufficient*, though. Our genie doesn't *know* what outcome each action will produce, it only has probabilistic information about that, and that's a complication we very much do *not* want to idealize away (because we're trying to figure out the right way to *deal* with it). And so our decision theory amends the earlier requirement: You must not only be indifferent between actions that always produce the same outcome, but also between all actions that only yield *the same probability distribution* over outcomes.

This is not at all a mild assumption, though it's usually built so deeply into the definitions that it's not even called an "axiom". But we've assumed that all features of the world you care about are already encoded in the outcomes, so it does seem to me that the only reason left why you might prefer one action over another is that it gives you a better trade-off in terms of what outcomes it makes more or less likely; and I've assumed that you're already a Bayesian, so you agree that *how* likely it makes an outcome is correctly represented by the probability of that outcome, given the action. So it certainly *seems* that the probability distribution over outcomes should give you all the information about an action that you could *possibly* care about. And that you should be able to order these probability distributions from best to worst, and all that.

Formally, we represent a direction for steering the world as a set O of possible outcomes and a binary relation ≽ on the probability distributions over O (with p ≽ q interpreted as "p is at least as good as q") which is a total preorder; that is, for all p, q and r:

- If p ≽ q and q ≽ r, then p ≽ r (that is, ≽ is *transitive*); and
- We have either p ≽ q or q ≽ p or both (that is, ≽ is *total*).

In this post, I'll assume that O is finite. We write p ∼ q (for "I'm indifferent between p and q") when both p ≽ q and q ≽ p, and we write p ≻ q ("p is strictly better than q") when p ≽ q but *not* q ≽ p. Our genie will compute the set of all actions it could possibly take, and the probability distribution over possible outcomes that (according to the genie's Bayesian posterior) each of these actions leads to, and then it will choose an action whose distribution is maximal under ≽. I'll also assume that the set of possible actions will always be finite, so there is always at least one optimal action.
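Here is what the genie's decision rule looks like as a toy program; the distributions and the preorder `prefer` are made-up stand-ins for illustration:

```python
# Each action maps to the probability distribution over outcomes that the
# genie's posterior assigns to it (hypothetical numbers).
DIST = {
    "plan_1": {"saved": 0.7, "not_saved": 0.3},
    "plan_2": {"saved": 0.4, "not_saved": 0.6},
}

def prefer(p, q):
    """Illustrative total preorder: p is at least as good as q iff p makes
    'saved' at least as likely. (Transitive and total, as required.)"""
    return p["saved"] >= q["saved"]

def best_action(actions):
    """Pick an action whose distribution is maximal under `prefer`; one
    exists because the action set is finite and the relation is a total
    preorder."""
    best = actions[0]
    for a in actions[1:]:
        # switch to a if its distribution is strictly better
        if prefer(DIST[a], DIST[best]) and not prefer(DIST[best], DIST[a]):
            best = a
    return best

assert best_action(["plan_2", "plan_1"]) == "plan_1"
```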

**Meditation:** *Omega is in the neighbourhood and invites you to participate in one of its little games. Next Saturday, it plans to flip a fair coin; would you please indicate on the attached form whether you would like to bet that this coin will fall heads, or tails? If you correctly bet heads, you will win $10,000; if you correctly bet tails, you'll win $100. If you bet wrongly, you will still receive $1 for your participation.*


*We'll assume that you prefer a 50% chance of $10,000 and a 50% chance of $1 to a 50% chance of $100 and a 50% chance of $1. Thus, our theory would say that you should bet heads. But there is a twist: Given recent galactopolitical events, you estimate a 3% chance that after posting its letter, Omega has been called away on urgent business. In this case, the game will be cancelled and you won't get any money, though as a consolation, Omega will probably send you some book from its rare SF collection when it returns (market value: approximately $55–$70). Our theory so far tells you nothing about how you should bet in this case, but does Rationality have anything to say about it?*

...

...

...

*

## The Axiom of Independence

So here's how I think about that problem: If you already *knew* that Omega is still in the neighbourhood (but not which way the coin is going to fall), you would prefer to bet heads, and if you *knew* it has been called away, you wouldn't care. (And what you bet has no influence on whether Omega has been called away.) So heads is either better or exactly the same; clearly, you should bet heads.

This type of reasoning is the content of the von Neumann-Morgenstern *Axiom of Independence*. Apparently, that's the most controversial of the theory's axioms.

You're already a Bayesian, so you already accept that if you perform an experiment to determine whether someone is a witch, and the experiment can come out two ways, then if one of these outcomes is evidence that the person is a witch, the other outcome must be evidence that they are *not*. New information is allowed to make a hypothesis more likely, but not *predictably* so; if *all* ways the experiment could come out make the hypothesis more likely, then you should *already* be finding it more likely than you do. The same thing is true even if only one result would make the hypothesis more likely, but the other would leave your probability estimate exactly unchanged.

The Axiom of Independence is equivalent to saying that if you're evaluating a possible course of action, and one experimental result would make it seem more attractive than it currently seems to you, while the other experimental result would at least make it seem no *less* attractive, then you should *already* be finding it more attractive than you do. This does *seem* rather solid to me.

*

So what does this axiom say formally? *(Feel free to skip this section if you don't care.)*

Suppose that your genie is considering two possible actions a and b (bet heads or tails), and an event E (Omega is called away). Each action gives rise to a probability distribution over possible outcomes: E.g., P(o | a) is the probability of outcome o if your genie chooses a. But your genie can also compute a probability distribution *conditional on E*, P(o | a, E). Suppose that conditional on E, it doesn't matter which action you pick: P(o | a, E) = P(o | b, E) for all o. And finally, suppose that the probability of E doesn't depend on which action you pick: P(E | a) = P(E | b) = p, with 0 < p < 1. The Axiom of Independence says that in this situation, you should prefer the distribution P(· | a) to the distribution P(· | b), and therefore prefer a to b, if and only if you prefer the distribution P(· | a, ¬E) to the distribution P(· | b, ¬E).

Let's write q₁ for the distribution P(· | a, ¬E), q₂ for the distribution P(· | b, ¬E), and r for the distribution P(· | E), which is the same under either action. (Formally, we think of these as vectors in ℝⁿ, where n is the number of outcomes: e.g., r = (P(o₁ | E), …, P(oₙ | E)).) For all o, we have

P(o | a) = p · P(o | a, E) + (1 − p) · P(o | a, ¬E),

so P(· | a) = p·r + (1 − p)·q₁, and similarly P(· | b) = p·r + (1 − p)·q₂. Thus, we can state the Axiom of Independence as follows:

- q₁ ≽ q₂ if and only if p·r + (1 − p)·q₁ ≽ p·r + (1 − p)·q₂.

We'll assume that you can't ever rule out the possibility that your AI might face this type of situation for any given q₁, q₂, r and p, so we require that this condition hold for all probability distributions q₁, q₂ and r, and for all p with 0 < p < 1.
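With made-up numbers for the meditation above, we can check the decomposition numerically; the mixture arithmetic is all there is to it:

```python
# Numerical check of the decomposition in the text:
# P(o | action) = p * P(o | action, E) + (1 - p) * P(o | action, not-E),
# where p = P(E). All the probabilities below are made up.

p = 0.03  # probability Omega was called away (from the meditation)

r  = {"book": 0.9, "nothing": 0.1}     # P(. | E): the same for both bets
q1 = {"win_10000": 0.5, "win_1": 0.5}  # P(. | bet heads, not-E)
q2 = {"win_100": 0.5, "win_1": 0.5}    # P(. | bet tails, not-E)

def mix(weight, dist_a, dist_b):
    """The mixture weight*dist_a + (1 - weight)*dist_b as one distribution."""
    outcomes = set(dist_a) | set(dist_b)
    return {o: weight * dist_a.get(o, 0.0) + (1 - weight) * dist_b.get(o, 0.0)
            for o in outcomes}

dist_heads = mix(p, r, q1)  # unconditional distribution of betting heads
dist_tails = mix(p, r, q2)

# Both mixtures put the same weight p on the "called away" part r, so
# Independence says comparing them must reduce to comparing q1 vs q2.
assert abs(sum(dist_heads.values()) - 1.0) < 1e-12
assert abs(dist_heads["book"] - p * 0.9) < 1e-12
```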

*

Here's a common criticism of Independence. Suppose a parent has two children, and one old car that they can give to one of these children. Can't they be indifferent between giving the car to their older child or their younger child, but strictly prefer throwing a coin? But let q₁ be the distribution where the younger child gets the gift, q₂ the one where the older child gets it, and take r = q₂ and p = 1/2; then by Independence, if q₁ ∼ q₂, then (1/2)·q₂ + (1/2)·q₁ ∼ (1/2)·q₂ + (1/2)·q₂ = q₂, so it would seem that the parent can *not* strictly prefer the coin throw.

In fairness, the people who find this criticism persuasive may not be Bayesians. But if *you* think this is a good criticism: Do you think that the parent must be indifferent between throwing a coin and asking the children's crazy old kindergarten teacher which of them was better-behaved, as long as they assign 50% probability to either answer? Because if not, shouldn't you already have protested when we decided that decisions must only depend on the probabilities of different outcomes?

My own resolution is that this is another case of terminal values intruding where they don't belong. *All* that is relevant to the parent's terminal values must *already* be described in the outcome; the parent is allowed to prefer "I threw a coin and my younger child got the car" to "I decided that my younger child would get the car" or "I asked the kindergarten teacher and they thought my younger child was better-behaved", but if so, then these must already be different *outcomes*. The thing to remember is that it isn't a property of the *world* that either child had a 50% probability of getting the car, and you can't steer the future in the direction of having this mythical property. It *is* a property of the world that *the parent assigned a 50% probability* to each child getting the car, and that *is* a direction you can steer in — though the example with the kindergarten teacher shows that this is probably not quite the direction you actually wanted.

The preference relation is *only* supposed to be about *trade-offs* between probability distributions; if you're tempted to say that you want to steer the world towards one probability distribution or another, rather than one outcome or other, something has gone terribly wrong.

*

## The Axiom of Continuity

And… that's it. These are all the axioms that I'll ask you to accept in this post.

There is, however, one more axiom in the von Neumann-Morgenstern theory, the Axiom of Continuity. I do *not* think this axiom is a necessary requirement on any coherent plan for steering the world; I think the best argument for it is that it doesn't make a practical difference whether you adopt it, so you might as well. But there is also a good argument to be made that if we're talking about anything *short* of steering the entire future of humanity, your preferences *do* in fact obey this axiom, and it makes things easier technically if we adopt it, so I'll do that at least for now.

Let's look at an example: If you prefer $50 in your pocket to $40, the axiom says that there must be *some* small ε > 0 such that you prefer a probability of 1 − ε of $50 and a probability of ε of dying today to a certainty of $40. Some critics seem to see this as the ultimate *reductio ad absurdum* for the VNM theory; they seem to think that no sane human would accept that deal.

Eliezer was surely not the first to observe that this preference is exhibited each time someone drives an extra mile to save $10.

Continuity says that if you strictly prefer p to q, then there is *no* r so terrible that you wouldn't be willing to incur a small probability of it in order to (probably) get p rather than q, and *no* p so wonderful that you'd be willing to (probably) get r instead of q if this gives you some arbitrarily small probability of getting p. Formally, for all p, q and r,

- If p ≻ q ≻ r, then there is an ε > 0 such that (1 − ε)·p + ε·r ≻ q and q ≻ ε·p + (1 − ε)·r.

I think if we're talking about everyday life, we can pretty much rule out that there are things so terrible that for *arbitrarily* small ε, you'd be willing to incur a fixed extra risk of death to avoid a probability of ε of the terrible thing. And if you feel that it's not worth the expense to call a doctor every time you sneeze, you're willing to incur a *slightly* higher probability of death in order to save some mere money. And it seems unlikely that there is *no* ε > 0 at which you'd prefer a certainty of $1 to a probability of ε of $100. And if you have some preference that is so slight that you wouldn't be willing to accept *any* chance of losing $1 in order to indulge it, it can't be a very strong preference. So I think for most practical purposes, we might as well accept Continuity.
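For a sense of the magnitudes involved, here is the $50/$40 example worked out numerically, assuming (purely for illustration) the utility scale u($40) = 40, u($50) = 50 and u(dying today) = −10⁷:

```python
# A back-of-the-envelope check of the $50/$40 example. The utility scale
# is entirely hypothetical; the point is only how small the acceptable
# risk epsilon turns out to be.

u40, u50, u_death = 40.0, 50.0, -1e7

def prefers_gamble(eps):
    """True iff a (1-eps) chance of $50 plus an eps chance of dying today
    has higher expected utility than a certain $40."""
    return (1 - eps) * u50 + eps * u_death > u40

# The threshold where the two sides are exactly equal:
eps_star = (u50 - u40) / (u50 - u_death)

assert prefers_gamble(eps_star / 2)      # small enough risk: take the gamble
assert not prefers_gamble(eps_star * 2)  # too risky: keep the certain $40
assert eps_star < 2e-6                   # roughly one-in-a-million
```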

*

## The VNM theorem

If your preferences are described by a transitive and complete relation ≽ on the probability distributions over some finite set O of "outcomes", and this relation satisfies Independence and Continuity, then you have a utility function, and your genie will be maximizing expected utility.

Here's what that means. A utility function is a function u which assigns a numerical "utility" u(o) to every outcome o. Given a probability distribution p over O, we can compute the expected value of u under p, E_p(u) = Σ_o p(o)·u(o); this is called the *expected utility*. We can prove that there is some utility function u such that for all p and q, we have p ≻ q if and only if the expected utility under p is greater than the expected utility under q.

In other words: ≽ is *completely* described by u; if you know u, you know ≽. Instead of programming your genie with a function that takes two outcomes and says which one is better, you might as well program it with a function that takes one outcome and returns its utility. Any coherent direction for steering the world which happens to satisfy Continuity can be reduced to a function that takes outcomes and assigns them numerical ratings.
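In code, the genie's entire comparison procedure collapses to computing one number per action; the outcome utilities below are made up:

```python
# Expected utility as defined above: E_p(u) = sum over outcomes of p(o)*u(o).
# The utilities and distributions are hypothetical.

def expected_utility(dist, u):
    return sum(prob * u[outcome] for outcome, prob in dist.items())

u = {"save_5": 5.0, "save_1": 1.0, "save_0": 0.0}  # made-up utilities

plan_a = {"save_5": 0.5, "save_0": 0.5}  # 50% chance of saving 5 people
plan_b = {"save_1": 1.0}                 # certainty of saving 1 person

# The genie ranks actions purely by this one number:
assert expected_utility(plan_a, u) == 2.5
assert expected_utility(plan_b, u) == 1.0
assert expected_utility(plan_a, u) > expected_utility(plan_b, u)
```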

In fact, it turns out that the u for a given ≽ is "almost" unique: Given two utility functions u and u′ that describe the same ≽, there are numbers a > 0 and c such that u′(o) = a·u(o) + c for all o; this is called an "affine transformation". On the other hand, it's not hard to see that for any such a and c,

E_p(a·u + c) = a·E_p(u) + c for every distribution p,

so two utility functions represent the same preference relation if and only if they are related in this way.

*

You shouldn't read *too* much into this conception of utility. For example, it doesn't make sense to see a fundamental distinction between outcomes with "positive" and with "negative" von Neumann-Morgenstern utility — because adding the right constant c can make any negative utility positive and any positive utility negative, without changing the underlying preference relation. The numbers that have real meaning are ratios between differences between utilities, (u(o₁) − u(o₂)) / (u(o₃) − u(o₄)), because these don't change under affine transformations (the c's cancel when you take the difference, and the a's cancel when you take the ratio). Academian's post has more about misunderstandings of VNM utility.
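Both invariance claims are easy to verify numerically; the utility function and the transformation below are arbitrary:

```python
# Checking the two invariance claims above with a made-up utility function
# and an arbitrary positive affine transformation u2 = a*u + c.

u = {"o1": 3.0, "o2": -1.0, "o3": 7.0, "o4": 0.5}
a, c = 2.5, -100.0                        # any a > 0 and any c will do
u2 = {o: a * v + c for o, v in u.items()}

def expected_utility(dist, util):
    return sum(prob * util[o] for o, prob in dist.items())

p = {"o1": 0.2, "o2": 0.5, "o3": 0.3}
q = {"o2": 0.6, "o4": 0.4}

# 1. The ordering of expected utilities is unchanged, since
#    E[a*u + c] = a*E[u] + c and a > 0:
assert (expected_utility(p, u) > expected_utility(q, u)) == \
       (expected_utility(p, u2) > expected_utility(q, u2))

# 2. Ratios of differences are unchanged: the c's cancel in each
#    difference, and the a's cancel in the ratio.
ratio_u  = (u["o1"] - u["o2"]) / (u["o3"] - u["o4"])
ratio_u2 = (u2["o1"] - u2["o2"]) / (u2["o3"] - u2["o4"])
assert abs(ratio_u - ratio_u2) < 1e-9
```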

In my view, what VNM utilities represent is not *necessarily* how *good* each outcome is; what they represent is what trade-offs between probability distributions you are willing to accept. Now, if you strongly felt that the difference between o₁ and o₂ was about the same as the difference between o₂ and o₃, then you should have *a very good reason* before you make your ratio (u(o₁) − u(o₂)) / (u(o₂) − u(o₃)) a huge number. But on the other hand, I think it's ultimately your responsibility to decide what trade-offs you are willing to make; I don't think you can get away with "stating how much you value different outcomes" and outsourcing the rest of the job to decision theory, without ever *considering* what these valuations should mean in terms of probabilistic trade-offs.

*

## Doing without Continuity

What happens if your preferences do *not* satisfy Continuity? Say, you want to save human lives, but you're not willing to incur *any* probability, no matter how small, of *infinitely* many people getting tortured infinitely long for this?

I do not see a good argument that this couldn't add up to a coherent direction for steering the world. I do, however, see an argument that in this case you care so little about finite numbers of human lives that in practice, you can probably neglect this concern entirely. (As a result, I doubt that your reflective equilibrium would want to adopt such preferences. But I don't think they're *in*coherent.)

I'll assume that your morality can still distinguish only a finite number of outcomes, and you can choose only between a finite number of decisions. It's not obvious that these assumptions are justified if we want to take into account the *possibility* that the true laws of physics might turn out to allow for infinite computations, but even in this case *you* and any AI *you* build will probably still be finite (though *it* might build a successor that isn't), so I do in fact think there is a good chance that results derived under this assumption have relevance in the real world.

In this case, it turns out that you *still* have a utility function, in a certain sense. (Proofs for non-standard results can be found in the math appendix to this post. I did the work myself, but I don't expect these results to be new.) This utility function describes only the concern most important to you: in our example, only the probability of infinite torture makes a difference to expected utility; any change in the probability of saving a finite number of lives leaves expected utility unchanged.

Let's define a relation ≫, read "p is *much better* than q", which says that there is nothing you wouldn't give up a little probability of in order to get p instead of q — in our example: p doesn't merely save lives compared to q, it makes infinite torture less likely. Formally, we define p ≫ q to mean that p′ ≻ q′ for all p′ and q′ "close enough" to p and q respectively; more precisely: p ≫ q if there is an ε > 0 such that p′ ≻ q′ for all p′ and q′ with ‖p′ − p‖ < ε and ‖q′ − q‖ < ε.

(Or equivalently: p ≫ q if there are open sets U and V around p and q, respectively, such that p′ ≻ q′ for all p′ ∈ U and q′ ∈ V.)

It turns out that if ≽ is a preference relation satisfying Independence, then the relation ≽′, defined by p ≽′ q unless q ≫ p, is a preference relation satisfying Independence and Continuity, and there is a utility function u such that p ≫ q iff the expected utility under p is larger than the expected utility under q. Obviously, p ≫ q implies p ≻ q, so whenever two options have different expected utilities, you prefer the one with the larger expected utility. Your genie is *still* an expected utility maximizer.

Furthermore, unless p ∼ q for *all* p and q, u isn't constant — that is, there are *some* p and q with p ≫ q. (If this weren't the case, the result above obviously wouldn't tell us very much about ≽!) Being indifferent between all possible actions doesn't make for a particularly interesting direction for steering the world, if it can be called one at all, so from now on let's assume that you are not.
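A sketch of the lives-versus-torture example may make this concrete; representing each option as a pair (probability of eternal torture, expected lives saved) is a simplification I'm introducing for illustration:

```python
# Toy lexicographic preferences: first minimize the probability of the
# infinitely-bad outcome, and break exact ties by lives saved. The pair
# representation (torture_prob, lives_saved) is hypothetical.

def strictly_better(x, y):
    """The full (discontinuous) strict preference."""
    return (x[0], -x[1]) < (y[0], -y[1])

def much_better(x, y):
    """The 'much better' relation: x stays better under any small enough
    perturbation, which here means x makes torture strictly less likely."""
    return x[0] < y[0]

def u(x):
    """The utility function the text describes: it captures exactly the
    'much better' relation and ignores lives saved entirely."""
    return -x[0]

x, y, z = (0.01, 5.0), (0.02, 1000.0), (0.01, 4.0)
assert much_better(x, y) and u(x) > u(y)
assert strictly_better(x, z) and not much_better(x, z)  # equal expected utility
assert u(x) == u(z)
```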

*

It *can* happen that there are two distributions p and q with the same expected utility, but p ≻ q. (p saves more lives, but the probability of eternal torture is the same.) Thus, if your genie happens to face a choice between two actions that lead to the *same* expected utility, it must do more work to figure out which of the actions it should take. But there is some reason to expect that such situations should be *rare*.

If there are n possible outcomes, then the set of probability distributions over them is (n − 1)-dimensional (because the probabilities must add up to 1, so if you know n − 1 of them, you can figure out the last one). For example, if there are three outcomes, this set is a triangle, and if there are four outcomes, it's a tetrahedron. On the other hand, it turns out that for any constant c, the set of all p for which the expected utility equals c has dimension n − 2 or smaller: if n = 3, it's a line (or a point or the empty set); if n = 4, it's a plane (or a line or a point or the empty set).

Thus, in order to have the same expected utility, p and q must lie on the same hyperplane — not just on a plane *very close by*, but on *exactly* the same plane. That's not just a small target to hit, that's an infinitely small target. If you use, say, a Solomonoff prior, then it seems *very* unlikely that two of your finitely many options just *happen* to lead to probability distributions which yield the same expected utility.
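We can simulate how small this target is; with randomly drawn distributions over four outcomes, exact ties in expected utility simply do not occur:

```python
# Illustrating the "infinitely small target": with generic (here: random)
# distributions over four outcomes, exact expected-utility ties essentially
# never happen. The utilities are arbitrary non-constant numbers.

import random

random.seed(42)
u = [0.0, 1.0, 3.14159, -2.0]

def random_dist(n=4):
    """A random probability distribution over n outcomes."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_utility(dist):
    return sum(p * ui for p, ui in zip(dist, u))

# Draw 10,000 pairs of distributions and count exact ties:
ties = sum(
    1
    for _ in range(10_000)
    if expected_utility(random_dist()) == expected_utility(random_dist())
)
assert ties == 0
```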

But we are bounded rationalists, not perfect Bayesians with uncomputable Solomonoff priors. We assign heads and tails exactly the same probability, not because there is no information that would make one or the other more likely (we could try to arrive at a best guess about which side is a little heavier than the other?), but because the problem is so complicated that we simply give up on it. What if it turns out that because of this, all the *difficult* decisions we need to make turn out to be between actions that happen to have the same expected utility?

If you do your imperfect calculation and find that two of your options seem to yield exactly the same probability of eternal hell for infinitely many people, you *could* then try to figure out which of them is more likely to save a finite number of lives. But it seems to me that this is *not* the best approximation of an ideal Bayesian with your stated preferences. Shouldn't you spend those computational resources on doing a *better* calculation of which option is more likely to lead to eternal hell?

For you *might* arrive at a new estimate under which the probabilities of hell are at least slightly different. Even if you *suspect* that the new calculation will again come out with the probabilities exactly equal, you don't *know* that. And therefore, can you truly in good conscience argue that doing the new calculation does not improve the odds of avoiding hell —

— *at least a teeny tiny incredibly super-small for all ordinary intents and purposes completely irrelevant bit?*

Even if it *should* be the case that to a *perfect* Bayesian, the expected utilities under a Solomonoff prior were exactly the same, *you* don't know that, so how can you possibly justify stopping the calculation and saving a mere finite number of lives?

*

So there you have it. In order to have a coherent direction in which you want to steer the world, you must have a set of outcomes and a preference relation over the probability distributions over these outcomes, and this relation must satisfy Independence — or so it seems to me, anyway. And if you do, then you have a utility function, and a perfect Bayesian maximizing your preferences will always maximize expected utility.

It *could* happen that two options have exactly the same expected utility, and in this case the utility function doesn't tell you which of these is better, under your preferences; but as a bounded rationalist, you can never *know* this, so if you have any computational resources left that you could spend on figuring out what your true preferences have to say, you should spend them on a better calculation of the expected utilities instead.

Given this, we might as well just talk about the preference relation induced by your utility function, which satisfies Continuity as well as Independence, instead of your true preference relation; and you might as well program your genie with your utility function, which only reflects the former, instead of with your true preferences.

(Note: I am not literally saying that you should not try to understand the whole topic better than this if you are *actually* going to program a Friendly AI. This is still meant as a metaphor. I *am*, however, saying that expected utility theory, even with boring old real numbers as utilities, is not to be discarded *lightly*.)

*

## Next post: Dealing with time

So far, we've always pretended that you only face *one* choice, at *one* point in time. But not only is there a way to apply our theory to repeated interactions with the environment — there are two!

One way is to say that at each point in time, you should apply decision theory to the set of actions you can perform *at that point*. Now, the actual outcome depends of course not only on what you do now, but also on what you do later; but you know that you'll still use decision theory later, so you can *foresee* what you will do in any possible future situation, and take it into account when computing which action you should choose now.

The second way is to make a choice only once, not between the actions you can take at that point in time, but between complete *plans* — giant lookup tables — which specify how you *will* behave in any situation you might possibly face. Thus, you simply do your expected utility calculation *once*, and then stick with the plan you have decided on.

**Meditation:** *Which of these is the right thing to do, if you have a perfect Bayesian genie and you want to steer the future in some particular direction?*

*(Does it even make a difference which one you use?)*

**» To the mathematical appendix**

**Notes**

^{1 }The accounts of decision theory I've read use the term "outcome", or "consequence", but leave it mostly undefined; in a lottery, it's the prize you get at the end, but clearly nobody is saying decision theory should *only* apply to lotteries. I'm not changing its role in the mathematics, and I think my explanation of it is what the term always *wanted* to mean; I expect that other people have explained it in similar ways, though I'm not sure how similar precisely.

## 76 comments


This type of argument strikes me as analogous to using Arrow's theorem to argue that we must implement a dictatorship.

You know, given Arrow's result, and given the observation commonly made around here that there are lots of little agents running around in our head, it is not so surprising that human beings exhibit "incoherent behavior." It's a consequence of our mind architecture.

I am not sure I am prepared to start culling my internal Congress just so I can have a coherent utility function that makes the survivors happy.

But the post is an argument for using cardinal utility (VNM utility)! And Arrow's "impossibility" theorem only applies when trying to aggregate ordinal utilities across voters. It is well-known that voting systems which aggregate cardinal utility, such as Range Voting, can escape the impossibility theorem.

So Arrow is actually *another* reason for having a VNM utility function: it allows collectively rational decisions, as well as individually rational decisions.

Analogous in what way?

As the Theorem treats them, voters are already utility-maximizing agents who have a clear preference set which they act on in rational ways. The question: how to aggregate these?

It turns out that if you want certain superficially reasonable things out of a voting process from such agents - nothing gets chosen at random, it doesn't matter how you cut up choices or whatever, &c. - you're in for disappointment. There isn't actually a way to have a group that is itself rationally agentic in the precise way the Theorem postulates.

One bullet you could bite is having a dictator. Then none of the inconsistencies arise from having all these extra preference sets lying around because there's only one and it's perfectly coherent. This is very easily comparable to reducing all of your own preferences into a single coherent utility function.

Both involve taking a mathematical result about the only way to do something in a way that satisfies certain intuitively appealing properties, and using it to argue that we therefore should do it that way.

A dictatorship isn't the only resolution to Arrow's theorem. Anyway, this sounds like a rather weak argument against the position.

Not really, because the argument isn't that you should do anything differently at all. It says that there's some utility function that *represents* your preferences, some expected-utility-maximizing genie that makes the same choices as you, but it doesn't tell you to have different preferences, or make different decisions under any circumstances.

In fact, I don't really know why this post is called "Why you must maximize expected utility" instead of "Why you already maximize expected utility." It seems that even if I have some algorithm that is on the surface not maximizing expected utility, such as being risk-averse in some way when dealing with money, then I'm really just maximizing the expected value of a non-obvious utility function.

No. Most humans do not maximize expected utility with respect to any utility function whatsoever because they have preferences which violate the hypotheses of the VNM theorem. For example, framing effects show that humans do not even consistently have the same preferences regarding fixed probability distributions over outcomes (but that their preferences change depending on whether the outcomes are described in terms of gains or losses).

**Edit:** in other words, the VNM theorem shows that "you must maximize expected utility" is equivalent to "your preferences should satisfy the hypotheses of the VNM theorem" (and not all of these hypotheses are encapsulated in the VNM axioms), and this is a statement with nontrivial content.

No. Most humans do not maximize the expected utility of any utility function whatsoever because they have preferences which violate the hypotheses of the VNM theorem.

Axioms? ("Hypotheses" doesn't seem to quite fit. One could have a hypothesis that humans have preferences in accord with the VNM axioms, and falsify that hypothesis, but the VNM theorem doesn't make the hypothesis itself.)

In the nomenclature that I think is relatively standard among mathematicians, if a theorem states "if P1, P2, ... then Q" then P1, P2, ... are the hypotheses of the theorem and Q is the conclusion. One of the hypotheses of the VNM theorem, which isn't strictly speaking one of the von Neumann-Morgenstern axioms, is that you assign consistent preferences at all (that is, that the decision of whether you prefer A to B depends only on what A and B are). I'm not using "consistent" here in the same sense as the Wikipedia article does when talking about transitivity; I mean consistent over time. (Edit: Eliezer uses "incoherent"; maybe that's a better word.)

Again, among mathematicians, I think "hypotheses" is more common. Exhibit A; Exhibit B. I would guess that "premises" is more common among philosophers...?

**[deleted]**· 2012-12-22T01:49:34.057Z · LW(p) · GW(p)

I usually say “assumptions”, but I'm neither a mathematician nor a philosopher. I do say “hypotheses” if for some reason I'm wearing mathematician attire.

It seems that even if I have some algorithm that is on the surface not maximizing expected utility, such as being risk-averse in some way when dealing with money, then I'm really just maximizing the expected value of a non-obvious utility function.

Not all decision algorithms are utility-maximising algorithms. If this were not so, the axioms of the VNM theorem would not be necessary. But they are necessary: the conclusion requires the axioms, and when axioms are dropped, decision algorithms violating the conclusion exist.

For example, suppose that given a choice between A and B it chooses A; between B and C it chooses B; between C and A it chooses C. No utility function describes this decision algorithm. Suppose that given a choice between A and B it never makes a choice. No utility function describes this decision algorithm.
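The first claim can be checked by brute force. The following sketch (my illustration, not from the comment) confirms that no assignment of real utilities to A, B, and C can reproduce the cyclic choices "A over B", "B over C", "C over A": any utility function induces a strict ranking of the three options, and no ranking satisfies all three strict inequalities.

```python
# Brute-force check: no utility function represents cyclic pairwise choices.
from itertools import permutations

choices = [("A", "B"), ("B", "C"), ("C", "A")]  # (chosen, rejected) pairs

def represented_by(utility):
    """True if this utility assignment reproduces every observed choice."""
    return all(utility[chosen] > utility[rejected] for chosen, rejected in choices)

# A utility function with strict preferences induces one of the 3! = 6 strict
# rankings of {A, B, C}, so it suffices to check each ranking.
rankings = [dict(zip(perm, (3, 2, 1))) for perm in permutations("ABC")]
print(any(represented_by(u) for u in rankings))  # False: the cycle is unrepresentable
```

(Assignments with ties fail too, since ties cannot satisfy any of the strict inequalities they are involved in.)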

Another way that a decision algorithm can fail to have an associated utility function is by lying outside the ontology of the VNM theorem. The VNM theorem treats only of decisions over probability distributions of outcomes. Decisions can be made over many other things. And what is an "outcome"? Can it be anything less than the complete state of the agent's entire positive light-cone? If not, it is practically impossible to calculate with; but if it can be smaller, what counts as an outcome and what does not?

Here is another decision algorithm. It is the one implemented by a room thermostat. It has two possible actions: turn the heating on, or turn the heating off. It has two sensors: one for the actual temperature and one for the set-point temperature. Its decisions are given by this algorithm: if the temperature falls 0.5 degrees below the set point, turn the heating on; if it rises 0.5 degrees above the set point, turn the heating off. Exercise: what relationship holds between this system, the VNM theorem, and utility functions?
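For concreteness, the thermostat's decision rule as described can be written down directly (a sketch; the function name and the example readings are mine, the ±0.5-degree band is from the comment):

```python
# The thermostat's decision algorithm: hysteresis around the set point.

def thermostat_step(temperature, set_point, heating_on):
    """Return the new heating state given the two sensor readings."""
    if temperature <= set_point - 0.5:
        return True   # too cold: turn (or leave) the heating on
    if temperature >= set_point + 0.5:
        return False  # too warm: turn (or leave) the heating off
    return heating_on  # inside the band: keep the current state

print(thermostat_step(19.4, 20.0, False))  # True
print(thermostat_step(20.6, 20.0, True))   # False
print(thermostat_step(20.0, 20.0, True))   # True (no change inside the band)
```

Note that inside the band the decision depends on the current state, not just the sensor readings, which is part of what makes the exercise interesting.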

**Meditation:** So far, we've always pretended that you only face *one* choice, at *one* point in time. But not only is there a way to apply our theory to repeated interactions with the environment — there are two!

One way is to say that at each point in time, you should apply decision theory to the set of actions you can perform *at that point*. Now, the actual outcome depends of course not only on what you do now, but also on what you do later; but you know that you'll still use decision theory later, so you can *foresee* what you will do in any possible future situation, and take it into account when computing which action you should choose now.

The second way is to make a choice only once, not between the actions you can take at that point in time, but between complete *plans* — giant lookup tables — which specify how you *will* behave in any situation you might possibly face. Thus, you simply do your expected utility calculation *once*, and then stick with the plan you have decided on.

*Which of these is the right thing to do, if you have a perfect Bayesian genie and you want to steer the future in some particular direction? (Does it even make a difference which one you use?)*

"Apply decision theory to the set of actions you can perform at that point" is underspecified — are you computing counterfactuals the way CDT does, or EDT, TDT, etc?

This question sounds like a fuzzier way of asking which decision theory to use, but maybe I've missed the point.

I really like this trend of adding meditations to posts, asking people to figure something out not just on their own but here and out loud.

Does it matter if your utility function is constant with respect to time, provided that the most preferred outcome changes rarely?

There is no distinction between these. How do you construct this hypothetical lookup table? By applying decision theory to every possible future history. In other words, by applying option 1 to calculate out everything in advance. But why bother? Applying option 1 as events unfold will produce results identical to applying it to all possible futures now, and avoids the small problem of requiring vastly more computational resources than the universe is capable of holding, running extraordinarily faster than anything is capable of happening, and operating for gigantically longer than the universe will exist, before you can do anything.

Calculating the locally optimal action without any reference to plans can sometimes get you different results - see the absentminded driver problem.

I'm not convinced that the absentminded driver problem has such implications. Its straightforward (to me) resolution is that the optimal p is 2/3 by the obvious analysis, and that the driver cannot use alpha as a probability, for reasons set out here.

But I'd rather not get into a discussion of self-referential decision theory, since it doesn't currently exist.

It is essential to both of these paradoxes that they deal with social situations. Rephrase them so that the agent is interacting with nature, and the paradoxes disappear.

For example, suppose that the parent is instead collecting shells on the beach. He has room in his bag for one more shell, and finds two on the ground that he has no preference between. Clearly, there's no reason he would rather flip a coin to decide between them than just pick one of them up, say, the one on the left.

What this tells me is that you have to be careful using decision theory in social situations, because you have subtle, unspoken values that you can easily forget to take into account. It's fairly obvious in the parent and kids example: she has no preference between them, but she also wants to *prove* that she has no preference between them, so she flips the coin.

I'm not exactly sure what the social drives are in the first example, though.

Of course this is not different from your own solution, only more specific. As you said,

the parent is allowed to prefer "I threw a coin and my younger child got the car" to "I decided that my younger child would get the car" ... but if so, then these must already be different outcomes.

The presence of a coin flip constitutes a separate outcome, because it matters to her terminal values that her children know that she's not playing favorites.

Is the real-world imperative "you must maximize expected utility", given by the VNM theorem, stronger or weaker than the imperative "everyone must have the same beliefs" given by Aumann's agreement theorem? If only there was some way of comparing these things! One possible metric is how much money I'm losing by not following this or that imperative. Can anyone give an estimate?

People can't order outcomes from best to worst. People exhibit circular preferences. I, myself, exhibit circular preferences. This is a problem for a utility-function based theory of what I want.

**[deleted]**· 2012-12-13T04:53:29.957Z · LW(p) · GW(p)

Interesting. Example of circular preferences?

There's a whole literature on preference intransitivity, but really, it's not that hard to catch yourself doing it. Just pay attention to your pairwise comparisons when you're choosing among three or more options, and don't let your mind cover up its dirty little secret.

Yup. Possible cause: motivations are caused by at least 3 totally different kinds of processes which often conflict.

Can you give an example of circular preferences that aren't contextual and therefore only superficially circular (like Benja's Alice and coin-flipping examples are contextual and only superficially irrational), and that you endorse, rather than regarding as bugs that should be resolved somehow? I'm pretty sure that any time I feel like I have intransitive preferences, it's because of things like framing effects or loss aversion that I would rather not be subject to.

**[deleted]**· 2012-12-13T13:22:32.151Z · LW(p) · GW(p)

That does happen to me from time to time, but when it does (and I notice that) I just think “hey, I've found a bug in my mindware” and try to fix that. (Usually it's a result of some ugh field.)

This would mean, of course, that humans can be money-pumped. In other words, if this is really true, there is a lot of money out there "on the table" for anyone to grab by simply money-pumping arbitrary humans. But in real life, if you went and tried to money-pump people, you would not get very far. I do accept a weaker form of what you are saying: that in the normal course of events, when people are not consciously thinking about it, we can exhibit circular preferences. But in a situation where we actually are sitting down and thinking and calculating about it, we are capable of "resolving" those apparently circular preferences.

No, not "of course". It only implies that if they're rational actors, which of course they are not. They are deal-averse and if they see you trying to pump them around in a circle they will take their ball and go home.

You can still profit by doing one step of the money pump, and people do. Lots of research goes into exploiting people's circular preferences on things like supermarket displays.

I think you are taking my point as something stronger than what I said. As you pointed out, with humans you can often money-pump them once, but not more than that. So it cannot truly be said that the preference is fully circular. It is something weaker; perhaps you could call it a semi-circular preference. My point was that the thing that humans exhibit is not a "circular preference" in the fullest technical sense of the term.

It seems essential to the idea of "a coherent direction for steering the world" or "preferences" that the ordering between choices does not depend on what choices are actually available. But in standard cooperative multi-agent decision procedures, the ordering *does* depend on the set of choices available. How to make sense of this? Does it mean that a group of more than one agent can't be said to have a coherent direction for steering the world? What is it that they do have then? And if a human should be viewed as a group of sub-agents representing different values and/or moral theories, does it mean a human also doesn't have such a coherent direction?

Does it mean that a group of more than one agent can't be said to have a coherent direction for steering the world?

That's indeed my current intuition. Suppose that there is a paperclip maximizer and a staples maximizer, and the paperclip maximizer has sole control over all that happens in the universe, and the two have a common prior which assigns near-certainty to this being the case. Then I expect the universe to be filled with paperclips. But if Staples has control, I expect the universe to be tiled with staples.

On the other hand (stealing your example, but let's make it about a physical coinflip, to hopefully make it noncontroversial): If both priors assign 50% probability to "Clippy has control and the universe can support 10^10 paperclips or 10^20 staples" and 50% probability to "Staples has control and the universe can support 10^10 staples or 10^20 paperclips", and it turns out that in fact the first of these is true, then I expect Clippy to tile the universe with staples.

I disagree with Stuart's post arguing that this means that Nash's bargaining solution (NBS) can't be correct, because it is dynamically inconsistent, as it gives a different solution after Clippy updates on the information that it has sole control. I think this is simply a counterfactual mugging: Clippy's payoff in the possible world where Staples has control depends on Clippy's cooperation in the world where Clippy has control. The usual solution to counterfactual muggings is to simply optimize expected utility relative to your prior, so the obvious thing to do would be to apply NBS to your *prior* distribution, giving you dynamic consistency.

That said, I'm not saying that I'm sure NBS is in fact the *right* solution. My current intuition is that there should be some way to formalize the "bargaining power" of each agent, and *when holding the bargaining powers fixed*, a group of agents should be steering the world in a coherent direction. This *suggests* that the right formalization of "bargaining power" would give a nonnegative scaling factor to each member of the group, and the group will act to maximize the sum of the agents' expected utilities weighed by their respective scaling factors. (As in Stuart's post, the scaling factors will of course not be invariant under affine transformations applied to the agents' utility functions -- if you multiply an agent's utility function by x, you will need to divide their scaling factor by x in order to compensate.)
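The weighted-sum intuition can be sketched in a few lines (the agents, weights, and payoffs below are invented for illustration): fix a nonnegative scaling factor for each agent, and have the group choose the option maximizing the weighted sum of the agents' expected utilities.

```python
# Sketch of group choice under fixed "bargaining power" weights.

def group_expected_utility(option, agents):
    """agents: list of (weight, expected_utility_function) pairs."""
    return sum(weight * eu(option) for weight, eu in agents)

# Two hypothetical agents with opposed preferences over two options.
agents = [
    (1.0, lambda option: {"paperclips": 10, "staples": 0}[option]),
    (2.0, lambda option: {"paperclips": 0, "staples": 4}[option]),
]

options = ["paperclips", "staples"]
best = max(options, key=lambda o: group_expected_utility(o, agents))
print(best)  # paperclips: 1.0*10 + 2.0*0 = 10 > 1.0*0 + 2.0*4 = 8
```

This also makes the non-invariance visible: rescaling the second agent's utility function by 2 while halving its weight leaves the group's choice unchanged, which is why the weights only mean something relative to a particular normalization of the utility functions.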

Of course, at this point this is merely an intuition, and I do not have a worked-out proposal nor a careful justification.

And if a human should be viewed as a group of sub-agents representing different values and/or moral theories, does it mean a human also doesn't have such a coherent direction?

I have to say that this approach does not make much sense to me in the first place, and I'm tempted to take your question as a modus tollens argument against that approach. Maybe it would be useful to have a more detailed discussion about this, but in short, I think aspiring rationalist humans should see it as their responsibility to actually choose one direction in which they want to steer the world, rather than specifying conflicting goals and then asking for some formula that will decide *for them* how to trade these goals against each other. If you *choose* to trade off different goals by weighing them with different factors, fine; but if you try to find some 'laws of rationality' that will tell you the one correct way to trade off these goals, without ever needing to make a decision about this yourself, I think you're trying to pass off a responsibility that is properly yours.

I think aspiring rationalist humans should see it as their responsibility to actually choose one direction in which they want to steer the world, rather than specifying conflicting goals and then asking for some formula that will decide for them how to trade these goals against each other. If you choose to trade off different goals by weighing them with different factors, fine; but if you try to find some 'laws of rationality' that will tell you the one correct way to trade off these goals, without ever needing to make a decision about this yourself, I think you're trying to pass off a responsibility that is properly yours.

Why so much emphasis on "responsibility"? In my mind, I have a responsibility to fulfill any promises I make to others and ... and that's about it. As for figuring out what my preferences are, or should be, I'm going to try any promising approaches I can find, and see if one of them works out. Thinking of myself as a bunch of sub-agents and using ideas from bargaining theory is one such an approach. Trying to solve normative ethics using the methods of moral philosophers may be another. When you say "see it as their responsibility to actually choose one direction in which they want to steer the world", what does that mean, in terms of an approach I can explore?

**ETA:** I wrote a post that may help explain what I meant here.

This suggests that the right formalization of "bargaining power" would give a nonnegative scaling factor to each member of the group, and the group will act to maximize the sum of the agents' expected utilities weighed by their respective scaling factors. ... Of course, at this point this is merely an intuition, and I do not have a worked-out proposal nor a careful justification.

There is a justification for that intuition. Some have objected to the axiom that the aggregation must also be VNM-rational, but Nisan has proved a similar theorem that does not rely on the VNM-rationality of the collective as an axiom.

I do not understand the first part of the post. As far as I can tell, you are responding to concerns that have been raised elsewhere (possibly in your head while discussing the issue with yourself) but it is unclear to me what exactly these concerns are, so I'm lost. Specifically, I do not understand the following:

Meditation: Alice is trying to decide how large a bonus each member of her team should get this year. She has just decided on giving Bob the same, already large, bonus as last year when she receives an e-mail from the head of a different division, asking her if she can recommend anyone for a new project he is setting up. Alice immediately realizes that Bob would love to be on that project, and would fit the bill exactly. But she needs Bob on the contract he's currently working on; losing him would be a pretty bad blow for her team.

Alice decides there is no way that she can recommend Bob for the new project. But she still feels bad about it, and she decides to make up for it by giving Bob a larger bonus. On reflection, she finds that she genuinely feels that this is the right thing to do, simply because she could have recommended him but didn't. Does that mean that Alice's preferences are irrational? Or that something is wrong with decision theory?

This example has distracting details which I suspect are hiding the point you're actually trying to make (which I can't figure out), at least to me. In practice, it seems to me that what Alice is concerned with are the social (signaling) implications of Bob gaining knowledge of both the bonus and of the possibility of recommendation.

My short answer is that Alice can care about anything she damn well likes. But there are a lot of things that she doesn't care about, and decision theory has something to say about those.

In fact, deciding that some kinds of preferences should be outlawed as irrational can be dangerous: you might think that nobody in their right mind should ever care about the detailed planning algorithms their AI uses, as long as they work. But how certain are you that it's wrong to care about whether the AI has planned out your whole life in advance, in detail? (Worse: Depending on how strictly you interpret it, this injunction might even rule out not wanting the AI to run conscious simulations of people.)

Maybe I'm just being slow right now, but I can't figure out what this has to do with the discussion preceding it.

The point is that we're asking what it means to have a consistent direction in which you are trying to steer the future, and it doesn't look like our AIs are on the same bearing.

I either don't understand or disagree with this. In the situation you describe it sounds to me like the two AIs will make different decisions in practice for game-theoretic reasons, but I don't see why one would suspect that they are trying to steer the future in different directions.

In practice, it seems to me that what Alice is concerned with are the social (signaling) implications of Bob gaining knowledge of both the bonus and of the possibility of recommendation.

I am assuming that Alice, on reflection, decides that she wants to give Bob the higher bonus even if nobody else ever learned that she had the opportunity to recommend him for the project, the way I would not want to steal food from a starving person even if nobody ever found out about it.

The concern I'm replying to is that decision theory assumes your preferences can be described by a binary "is preferred to" relation, but humans might choose option A if the available options are A and B, and option B if the available options are A, B and C; so how do you model that as a binary relation? I actually don't recall seeing this raised in the context of VNM utility theory, but I believe I've seen it in discussions of Arrow's impossibility theorem, where the Independence of Irrelevant Alternatives axiom (which, confusingly, is not the analog of VNM's Independence axiom, despite the similar name) says that adding option C must not change the decision from A to B.

I'm not particularly bothered for decision theory if you can do an experiment and have humans exhibit such behavior, because some human behavior is patently self-defeating and I don't think we should require decision theory to explain all our biases as "rational", but I want a decision theory that won't exclude the preferences that we *would* actually want to adopt on reflection, so I either want it to support Alice's preferences or I want to understand why Alice's preferences *are* in fact irrational.

In fact, deciding that some kinds of preferences should be outlawed as irrational can be dangerous: [...]

Maybe I'm just being slow right now, but I can't figure out what this has to do with the discussion preceding it.

It's like this: Caring about the set of options you were able to choose between *seems* like a bad idea to me; I'm *skeptical* that preferences like Alice's are what I would want to adopt, on reflection. I might be tempted to simply say, they're obviously irrational, no problem if decision theory doesn't cater to them. But caring about the algorithm your AI runs *also* seems like a bad idea, and by similar intuitions I might have been willing to accept a decision theory that would outlaw such preferences -- which, as it turns out, would *not* be good.

The point is that we're asking what it means to have a consistent direction in which you are trying to steer the future, and it doesn't look like our AIs are on the same bearing.

I either don't understand or disagree with this. In the situation you describe it sounds to me like the two AIs will make different decisions in practice for game-theoretic reasons, but I don't see why one would suspect that they are trying to steer the future in different directions.

Let's suppose that both AIs have the following preferences: Most importantly, they don't want Earth blown up. However, if they are *able* to blow up Earth no later than one month from now, they would like to maximize the expected number of paperclips in the universe; if they aren't able to, they want to maximize the expected number of staples. Now, if in two months a freak accident wipes out Alice's AI, then the world ends up tiled with paperclips; if it wipes out Carol's AI, the world ends up tiled with staples. (Unless they made a deal that if either was wiped out, the other would carry on its work, as any paperclip maximizer might do with any staples maximizer -- though they might not have a reason to, since they're not risk-averse.) This does not sound like steering the future in the same direction, to me.

Could you expand on the game-theoretic reasons? My intuition is that from a game theoretic perspective, "steering the future in the same direction" should mean we're talking about a partnership game, i.e., that both agents will get the same payoff for any strategy profile, and I do not see why this would lead to reasons to "make different decisions in practice".

The concern I'm replying to is that decision theory assumes your preferences can be described by a binary "is preferred to" relation, but humans might choose option A if the available options are A and B, and option B if the available options are A, B and C, so how do you model that as a binary relation?

Oh. I still do not think the example you gave illustrates this concern. One interpretation of the situation is that Alice gains new knowledge in the scenario. The existence of a new project suited to Bob's talents increases Alice's assessment of Bob's value. More generally, it's reasonable for an agent's preferences to change as its knowledge changes.

In response to this objection, I think you only need to assume that deciding between A and B and C is equivalent to deciding between A and (B and C) and also equivalent to deciding between (A and B) and C, together with the assumption that your agent is capable of consistently assigning preferences to "composite choices" like (A and B).

Caring about the set of options you were able to choose between seems like a bad idea to me; I'm skeptical that preferences like Alice's are what I would want to adopt, on reflection. I might be tempted to simply say, they're obviously irrational, no problem if decision theory doesn't cater to them. But caring about the algorithm your AI runs also seems like a bad idea, and by similar intuitions I might have been willing to accept a decision theory that would outlaw such preferences -- which, as it turns out, would not be good.

Are you claiming that these two situations are analogous or only claiming that they are two examples of caring about whether decision theory should allow certain kinds of preferences? That's one of the things I was confused about (because I can't see the analogy but your writing suggests that one exists). Also, where does your intuition that it is a bad idea to care about the algorithm your AI runs come from? It seems like an obviously good idea to care about the algorithm your AI runs to me.

> Could you expand on the game-theoretic reasons? My intuition is that from a game theoretic perspective, "steering the future in the same direction" should mean we're talking about a partnership game, i.e., that both agents will get the same payoff for any strategy profile, and I do not see why this would lead to reasons to "make different decisions in practice".

I guess that depends on what "same" means. If you instantiate two AIs that are running identical algorithms but both AIs are explicitly trying to monopolize all of the resources on the planet, then they're playing a zero-sum game but there's a reasonable sense in which they are trying to steer the future in the "same" direction (namely that they are running identical algorithms).

If this isn't a reasonable notion of sameness because the algorithm involves reference to thisAgent and the referent of this pointer changes depending on who's instantiating the algorithm, then the preferences you've described are also not the same preferences because they also refer to thisAgent. If the preferences are modified to say "if an agent running thisAlgorithm has access to foo," then as far as I can tell the two AIs you describe should behave as if they are the same agent.

Thanks for the feedback!

It's possible that I'm just misreading your words to match my picture of the world, but it *sounds* to me as if we're not disagreeing too much, but I failed to get my point across in the post. Specifically:

> If this isn't a reasonable notion of sameness because the algorithm involves reference to thisAgent and the referent of this pointer changes depending on who's instantiating the algorithm, then the preferences you've described are also not the same preferences because they also refer to thisAgent. If the preferences are modified to say "if an agent running thisAlgorithm has access to foo," then as far as I can tell the two AIs you describe should behave as if they are the same agent.

I *am* saying that I think that a "direction for steering the future" should not depend on a global thisAgent variable. To make the earlier example even more blatant, I don't think it's useful to call "If thisAgent = Alice's AI, maximize paperclips; if thisAgent = Carol's AI, maximize staples" a coherent direction, I'd call it a function that *returns* a coherent direction. Whether or not the concept I'm trying to define is the *best* meaning for "same direction" is of course only a definitional debate and not that interesting, but I think it's a useful concept.

I agree that the most obvious formalization of Alice's preferences would depend on thisAgent. So I'm saying that there actually *is* a nontrivial restriction on her preferences: If she wants to keep something like her informal formulation, she will need to decide what they are supposed to mean in terms that do not refer to thisAgent. They *may* simply refer to "Alice", but then the AI is influenced only by what *Alice* was able to do, not by what the *AI* was able to do, and Alice will have to decide whether that is what she *wants*.

> Oh. I still do not think the example you gave illustrates this concern. One interpretation of the situation is that Alice gains new knowledge in the scenario. The existence of a new project suited to Bob's talents increases Alice's assessment of Bob's value. More generally, it's reasonable for an agent's preferences to change as its knowledge changes.

But how could you come up with a pair of situations such that in situation (i), the agent can choose options A and B, while in situation (ii), the agent can choose between A, B and C, and yet the agent has *exactly the same information* in situations (i) and (ii)? So under your rules, how could *any* example illustrate the concern?

I *do* agree that it's reasonable for Alice to choose a different option because the knowledge she has is different -- that's my *resolution* to the problem.

> In response to this objection, I think you only need to assume that deciding between A and B and C is equivalent to deciding between A and (B and C) and also equivalent to deciding between (A and B) and C, together with the assumption that your agent is capable of consistently assigning preferences to "composite choices" like (A and B).

Sorry, I do not understand -- what do you mean by your composite choices? What does it mean to choose (A and B) when A and B are mutually exclusive options?

> Are you claiming that these two situations are analogous or only claiming that they are two examples of caring about whether decision theory should allow certain kinds of preferences? That's one of the things I was confused about (because I can't see the analogy but your writing suggests that one exists).

I'm claiming they are both examples of preferences you might *think* you could outlaw as irrational, so you might *think* it's ok to use a decision theory that doesn't allow for such preferences. In one of the two cases, it's clearly not ok, which suggests we shouldn't be too quick to decide it's ok in the other case.

> Also, where does your intuition that it is a bad idea to care about the algorithm your AI runs come from? It seems like an obviously good idea to care about the algorithm your AI runs to me.

Could it be that it's not clear enough that I'm talking about terminal values, not instrumental values?

Maybe it's not right to say that it seems like a *bad* idea, more like it would seem at first that people just *don't* have terminal preferences about the algorithm run (or at least not strong ones -- you might derive enjoyment from an elegant algorithm, but that wouldn't outweigh your desire to save lives, so your instrumental preference for a well-working algorithm would always dominate your terminal preference for enjoying an elegant algorithm, if the two came into conflict). So at first it might seem reasonable to design a decision theory where you are not *allowed* to care about the algorithm your AI is running -- I find it at least conceivable that when trying to prove theorems about self-modifying AI, making such an assumption might simplify things, so this does seem like a conceivable failure mode to me.

> I agree that the most obvious formalization of Alice's preferences would depend on thisAgent. So I'm saying that there actually is a nontrivial restriction on her preferences: If she wants to keep something like her informal formulation, she will need to decide what they are supposed to mean in terms that do not refer to thisAgent.

Got it. I think.

> But how could you come up with a pair of situations such that in situation (i), the agent can choose options A and B, while in situation (ii), the agent can choose between A, B and C, and yet the agent has exactly the same information in situations (i) and (ii)?

In situation (i), Alice can choose between chocolate and vanilla ice cream. In situation (ii), Alice can choose between chocolate, vanilla, and strawberry ice cream. Having access to these options doesn't change Alice's knowledge about her preferences for ice cream flavors (under the assumption that access to flavors on a given day doesn't reflect some kind of global shortage of a flavor). In general it might help to have Alice's choices randomly determined, so that Alice's knowledge of her choices doesn't give her information about anything else.

> Sorry, I do not understand -- what do you mean by your composite choices? What does it mean to choose (A and B) when A and B are mutually exclusive options?

Sorry, I should probably have used "or" instead of "and." If A and B are the primitive choices "chocolate ice cream" and "vanilla ice cream," then the composite choice (A or B) is "the opportunity to choose between chocolate and vanilla ice cream." The point is that once you allow a decision theory to assign preferences among composite choices, then composition of choices is associative, so preferences among an arbitrary number of primitive choices are determined by preferences among pairs of primitive choices.

> Maybe it's not right to say that it seems like a bad idea, more like it would seem at first that people just don't have terminal preferences about the algorithm run (or at least not strong ones -- you might derive enjoyment from an elegant algorithm, but that wouldn't outweigh your desire to save lives, so your instrumental preference for a well-working algorithm would always dominate your terminal preference for enjoying an elegant algorithm, if the two came into conflict). So at first it might seem reasonable to design a decision theory where you are not allowed to care about the algorithm your AI is running -- I find it at least conceivable that when trying to prove theorems about self-modifying AI, making such an assumption might simplify things, so this does seem like a conceivable failure mode to me.

Okay, but it still seems reasonable to have instrumental preferences about algorithms that AIs run, and I don't see why decision theory is not allowed to talk about instrumental preferences. (Admittedly I don't know very much about decision theory.)

**[deleted]**· 2012-12-14T16:36:37.831Z · LW(p) · GW(p)

The "not wanting the AI to run conscious simulations of people" link under the "Outcomes" heading does not work.

What happens if your preferences do not satisfy Continuity? Say, you want to save human lives, but you're not willing to incur any probability, no matter how small, of infinitely many people getting tortured infinitely long for this?

Then you basically have a two-step optimization: "find me the set of actions that have a minimal probability of infinitely many people getting tortured infinitely long, and then of that set, find me the set of actions that save a maximal number of human lives." The trouble with that is that people like to *express* their preferences with rules like that, but those preferences are not ones that they reflectively endorse. For example, would you rather it be certain that all intelligent life in the universe is destroyed forever, or there be a one out of R chance that infinitely many people get tortured for an infinitely long period, and R-1 out of R chance that humanity continues along happily? If R is sufficiently large (say, the power tower x^x^...^x written with x ^s, with x equal to the number of atoms in the universe), then it seems that the first option is obviously worse.

A way to think about this is that infinities destroy averages, and VNM relies on scoring actions by their *average* utility. If utilities are bounded, then Continuity holds, and average utility always gives the results you expect if you measured utility correctly.
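The point that infinities destroy averages can be made concrete with a St. Petersburg-style sketch (the payoff schedule here is my illustration, not part of the comment above): a payoff of 2^k utils with probability 2^-k contributes exactly 1 to the expectation for every k, so with unbounded utility the expectation diverges, while capping utility at any bound makes it converge.

```python
# Unbounded case: each term (2**-k) * (2**k) contributes exactly 1 to the
# expectation, so the partial sums grow without limit as terms are added.
unbounded = sum((0.5 ** k) * (2 ** k) for k in range(1, 31))  # = 30 after 30 terms

# Bounded case: capping utility at B turns the tail into a convergent
# geometric series, so the expectation is finite no matter how many terms.
B = 1000.0
bounded = sum((0.5 ** k) * min(2 ** k, B) for k in range(1, 31))  # ~10.95

print(unbounded, bounded)
```

With bounded utility the sum settles near 10.95 however far you extend it, which is the sense in which average utility "gives the results you expect" once utilities are bounded.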

New information is allowed to make a hypothesis more likely, but not predictably so; if all ways the experiment could come out make the hypothesis more likely, then you should already be finding it more likely than you do. The same thing is true even if only one result would make the hypothesis more likely, but the other would leave your probability estimate exactly unchanged.

One result might change my probability estimate by less than my current imprecision/uncertainty/rounding error in stating said estimate. If the coin comes up H,H,H,H,H,H,**T**,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H... then I can assign a low probability to it being fair, but a lower probability to it having two heads.

Thank you for this excellent post. I read this primarily because I would like to use formal theories to aid my own decision-making about big, abstract decisions like what to do with my life and what charities to donate to, where the numbers are much more available than the emotional responses. In a way this didn't help at all: it only says anything about a situation where you start with a preference ordering. But in another way it helped immensely, of course, since I need to understand these fundamental concepts. It was really valuable to me that you were so careful about what these "utilities" really mean.

**[deleted]**· 2012-12-18T23:16:08.391Z · LW(p) · GW(p)

Maximizing expected utility can be paradoxically shown to minimize actual utility, however. Consider a game in which you place an initial bet of $1 on a 6-sided die coming up anything but 1 (2-6), which pays even money if you win and costs you your bet if you lose. The twist, however, is that upon winning (i.e. you now have $2 in front of you) you must either bet the entire sum formed by your bet and its wins or leave the game permanently. Theoretically, since the odds are in your favor, you should always keep going. Always. But wait, this means you will eventually lose it all. Even if you say "just one more and I'll stop", it'll be mathematically optimal to keep repeating this behavior. This "optimal" strategy does worse than any arbitrary random strategy possible.

You aren't analyzing this game correctly. At the beginning of the game, you're deciding between possible strategies for playing the game, and you should be evaluating the expected value of each of these strategies.

The strategy where you keep going until you lose has expected value -1. There is also a sequence of strategies depending on a positive integer n where you quit at the latest after the nth bet, and their expected values grow geometrically in n. In other words, there isn't an optimal strategy for this game: there are infinitely many strategies and their expected values get arbitrarily high.

In addition, the sequence of strategies I described tends to the first strategy in the limit as n tends to infinity, in some sense, but their expected values don't respect this limit, which is what leads to the apparent paradox that you noted. In more mathematical language, what you're seeing here is a failure of the ability to exchange limits and integrals (where the integrals are expected values). Less mathematically, you can't evaluate the expected value of a sequence of infinitely many decisions by adding up the expected value of each individual decision. In practice, you will never be able to make infinitely many decisions, so this doesn't really matter.

This issue is closely related to the puzzle where the Devil gives you money and takes it away infinitely many times. I don't remember what it's called.
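The expected values in question are easy to compute directly; here is a sketch under the rules as stated (a $1 initial stake, win probability 5/6 per roll, the pot doubling on each win and forfeited entirely on a loss):

```python
# Net expected value of the strategy "quit after the n-th consecutive win":
# you end with 2**n with probability (5/6)**n and with nothing otherwise,
# having paid $1 up front, so EV(n) = (5/3)**n - 1 -- growing without bound
# even as the probability of ever collecting shrinks toward zero.
def expected_value(n):
    p_survive = (5 / 6) ** n
    return p_survive * 2 ** n - 1

for n in (1, 5, 10, 20):
    print(n, round(expected_value(n), 3), round((5 / 6) ** n, 3))
```

The never-quit strategy loses the stake with probability 1 (expected value -1), while the quit-after-n strategies have expected values tending to infinity — exactly the failure of exchanging the limit with the expectation described above.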

**[deleted]**· 2012-12-19T03:08:11.360Z · LW(p) · GW(p)

Indeed they don't, but the point is that while stopping at N+1 always dominates stopping at N, this thinking leads one to keep continuing and lose. As such, the only winning move is to do exactly NOT this and decide some arbitrary prior point to stop at (or decide indeterministically such as by coin flip). Attempting to maximize expected utility is the only strategy that won't work. This game, prisoners' dilemma, and newcomblike problems are all cases where choosing in such a way that does better (than the alternative) in all cases can still do worse overall.

The point isn't that the strategy that is supposed to maximize expected utility is a bad idea. The point is that you're computing its expected utility *incorrectly* because you're switching a limit and an integral that you can't switch. This is a completely different issue from the prisoner's dilemma; it is entirely an issue of infinities and has nothing to do with the practical issue of being a decision-maker with bounded resources making finitely many decisions.

**[deleted]**· 2012-12-19T03:33:13.136Z · LW(p) · GW(p)

It isn't a matter of switching a limit and an integral, or anything involving infinity, really. You could just consider the single decision you're currently facing: your options are to continue or stop. To come out of the game with any money, one must at some point say "forget maximizing expected utility, I'm not risking losing what I've acquired". By stopping, you lose expected utility compared to continuing exactly 1 more time. My point being that it is not always the case that "you must maximize expected utility", for in some cases it may be wrong or impossible to do so.

All you've shown is that maximizing expected utility infinitely many times does not maximize the expected utility you get at the end of the infinitely many decisions you've made. This is entirely a matter of switching a limit and an integral, and it is irrelevant to practical decision-making.

**1** This argument only works if the bet is denominated in utils rather than in dollars. Otherwise, someone who gets diminishing marginal utility from dollars for very large sums -- that would include most people -- will eventually decide to stop. (If I have utility = log(dollars) and initial assets of $1M then I will stop after 25 wins, if I did the calculations right.)

**1a** It is not at all clear that a bet denominated in utils is even actually possible. Especially not one which, with high probability, ends up involving an astronomically large quantity of utility.

**2** Even someone who doesn't generally get diminishing marginal utility from dollars -- say, an altruist who will use all those dollars for saving other people's lives, and who cares equally about all -- will find marginal utility decreasing for large enough sums, because (a) eventually the cheap problems are solved and saving the next life starts costing more, and (b) if you give me 10^15 dollars and I try to spend it all (on myself or others) then the resulting inflation will make them worth less.

**3** Given that "you will eventually lose it all", a *strategy* of continuing to bet does not in fact maximize expected utility.

**4** The expected utility from a given choice at a given stage in the game depends on what you'd then do with the remainder of the game. For instance, if I know that my future strategy after winning this roll is going to be "keep betting for ever" then I know that my expected utility if I keep playing is zero, so I'll choose not to do that.

**5** So at most what we have (even if we assume we've dealt somehow with issues of diminishing marginal utility etc.) is a game where there's an infinite "increasing" sequence of strategies but no limiting strategy that's better than all of them. But that's no surprise. Here's another game with the same property: You name a positive integer N and Omega gives you $N. For any fixed N, it is best not to choose N because larger numbers are better. "Therefore" you can't name any particular number, so you refuse to play and get nothing. If you don't find this paradoxical -- and I confess that I don't -- then I don't think you need find the die-rolling game any worse. (Choosing N in this game <--> deciding to play for N turns in the die-rolling game.)

[EDITED to stop the LW software turning my numbered points into differently numbered and weirdly formatted points.]

[EDITED again to acknowledge that after writing all that I read on and found that others had already said more or less the same things as me. D'oh. Anyway, since apparently Qiaochu_Yuan wasn't successful in convincing srn247, perhaps my slightly different presentation will be of some help.]
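The calculation in point 1 checks out; here is a sketch (assuming, as stated, $1M of outside wealth, a $1 initial stake so that the pot after n wins is 2^n, and utility = log of total dollars):

```python
import math

# Continue iff the expected log-wealth of one more all-or-nothing bet
# beats the log-wealth of walking away with the current pot of 2**n.
def should_continue(n, wealth=1_000_000):
    stop = math.log(wealth + 2 ** n)
    go = (5 / 6) * math.log(wealth + 2 ** (n + 1)) + (1 / 6) * math.log(wealth)
    return go > stop

# Find the first n at which stopping beats continuing.
n = 0
while should_continue(n):
    n += 1
print(n)  # → 25
```

This prints 25, agreeing with the parenthetical "I will stop after 25 wins": once the pot dwarfs outside wealth, the 1/6 risk of losing everything outweighs the 5/6 chance of doubling under log utility.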

This is the St. Petersburg paradox, discussed here from time to time.

It isn't really very much like the St. Petersburg paradox. The St. Petersburg game runs for a random length of time, you don't choose whether to continue; the only choice you make is at the beginning of the game where you decide how much to pay.

Or is it equivalent in some subtle way?

Is it just me or is this essentially the same as the Lifespan Dilemma?

At the very least, in both cases, you find that you get high expected utilities by choosing very low probabilities of getting anything at all.

If your preferences can always be modelled with a utility function, does that mean that no matter how you make decisions, there's some adaptation of this paradox that will lead you to accept a near certainty of death?

**[deleted]**· 2012-12-21T04:24:17.500Z · LW(p) · GW(p)

It is essentially that, and it does show that trying to maximize expected utility can lead to such negative outcomes. Unfortunately, there doesn't seem to be a simple alternative to maximizing expected utility that doesn't lead to being a money pump. The Kelly criterion is an excellent example of a decision-making strategy that doesn't maximize expected utility but still wins compared to it, so at least it's known that it can be done.
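For the repeated dice bet from earlier in the thread (win probability 5/6 at even money), the Kelly criterion would stake only 2/3 of the bankroll each round — the game's forced all-in re-bet is exactly what Kelly refuses to do. A minimal sketch of the standard formula:

```python
# Kelly stake for a bet paying b-to-1 that is won with probability p:
# f* = p - (1 - p) / b   (fraction of bankroll to wager each round)
def kelly_fraction(p, b=1.0):
    return p - (1 - p) / b

print(kelly_fraction(5 / 6))  # 2/3: never the whole bankroll
```

Kelly maximizes the expected growth rate of wealth (expected log-wealth) rather than expected wealth itself, which is why it avoids the ruin that the "always continue" strategy guarantees.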

I appreciate the hard work here, but all the math sidesteps the real problems, which are in the axioms, particularly the axiom of independence. See this sequence of comments on my post arguing that saying expectation maximization is correct is equivalent to saying that average utilitarianism is correct.

People object to average utilitarianism because of certain "repugnant" scenarios, such as the utility monster (a single individual who enjoys torturing everyone else so much that it's right to let him or her do so). Some of these scenarios can be transformed into a repugnant scenario for expectation maximization over your own utility function, where instead of "one person" you have "one possible future you". Suppose the world has one billion people. Do you think it's better to give one billion and one utilons to one person than to give one utilon to everyone? If so, why would you believe it's better to take an action that results in you having one billion and one utilons one-one-billionth of the time, and nothing all other times, than an action that reliably gives you one utilon?

The way people think about the lottery suggests that most people prefer to distribute utilons equally among different people, but to lump them together and give them to a few winners in distributions among their possible future selves. This is a case where we reliably violate the Golden Rule, and call ourselves virtuous for doing so.
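For concreteness, the two actions in the question above differ in expected utility only in the ninth decimal place (the arithmetic is mine, added for illustration):

```python
# A one-in-a-billion chance of (1e9 + 1) utilons, versus 1 utilon for sure.
ev_gamble = (1 / 1e9) * (1e9 + 1)  # = 1.000000001
ev_sure = 1.0
print(ev_gamble > ev_sure)  # the gamble is higher in expectation, barely
```

An expected-utility maximizer takes the gamble for the sake of that billionth of a utilon — which is precisely the intuition the comment is probing.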

**[deleted]**· 2012-12-14T04:29:47.844Z · LW(p) · GW(p)

> Suppose the world has one billion people. Do you think it's better to give one billion and one utilons to one person than to give one utilon to everyone?

Yes. If you think this conclusion is repugnant, you have not comprehended the meaning of 1000000001 times as much utility. The only thing that utility value even *means* is that you'd accept such a deal.

You don't "give" people utilons though. That implies scarcity, which implies some real resource to be distributed, which we correctly recognize as having diminishing returns on one person, and less diminishing returns on lots of people. The better way to think of it is that you *extract* utility from people.

Would you rather get 1e9 utils from one person, or 1 util from each of 1e9 people? Who cares? 1e9 utils is 1e9 utils.

> If so, why would you believe it's better to take an action that results in you having one billion and one utilons one-one-billionth of the time, and nothing all other times, than an action that reliably gives you one utilon?

Again, by construction, we take this deal.

VNM should not have called it "utility"; it drags in too many connotations. VNM utility is a very personal thing that describes what decisions *you* would make.

It is permissible to prefer the outcome that has a constant probability distribution to the outcome that has the higher definite integral across the probability distribution.

**[deleted]**· 2012-12-20T14:55:59.235Z · LW(p) · GW(p)

What do you mean? Specifically, what is a "constant probability distribution"?

If you mean I can prefer $1M to a 1/1000 chance of $2B, then sure. Money is not utility.

On the other hand, I *can't* prefer 1M *utils* to 1/1000 chance of 2B *utils*.

A constant probability distribution is a flat distribution; i.e. a flat line.

And the outcomes can be ordered however one chooses. It is not necessary to provide additive numeric values.

Are you saying that utils are defined such that if one outcome is preferred over another, it has more expected utils?

**[deleted]**· 2012-12-20T16:07:40.109Z · LW(p) · GW(p)

> Are you saying that utils are defined such that if one outcome is preferred over another, it has more expected utils?

Yes. That's exactly what I mean.

And I'm afraid I still don't know what you are getting at with this constant probability distribution thing.

I mean an outcome where there is 1-epsilon chance of A.

It is permissible to assign utils arbitrarily, such that flipping a coin to decide between A and B has more utils than selecting A and more utils than selecting B. In that case, the outcome is "Flip a coin and allow the coin to decide", which has different utility from the sum of half of A and half of B.

**[deleted]**· 2012-12-21T03:39:36.802Z · LW(p) · GW(p)

> It is permissible to assign utils arbitrarily, such that flipping a coin to decide between A and B has more utils than selecting A and more utils than selecting B. In that case, the outcome is "Flip a coin and allow the coin to decide", which has different utility from the sum of half of A and half of B.

Perhaps if you count "I flipped a coin and got A" > A.

You can always define some utility function such that it is rational to shoot yourself in the foot, but at that point, you are just doing a bunch of work to describe stupid behavior that you could just do anyways. You don't have to follow the VNM axioms either.

The point of VNM and such is to constrain your behavior. And if you input sensible things, it does. You don't have to let it constrain your behavior, but if you don't, it is doing no work for you.

Right. If you think "I flipped a coin to decide" is more valuable than half of the difference between results of the coin flip (perhaps because those results are very close to equal, but you fear that systemic bias is a large negative, or perhaps because you demand that you are provably fair), then you flip a coin to decide.

The utility function, however, is not something to be defined. It is something to be determined and discovered- I already want things, and while what I want is time-variant, it isn't arbitrarily alterable.

**[deleted]**· 2012-12-21T06:20:12.409Z · LW(p) · GW(p)

Unless your utility assigns a positive utility to your utility function being altered, in which case you'd have to seek to optimize your meta-utility. Desire to change one's desires reflects an inconsistency, however, so one who desires to be consistent should desire not to desire to change one's desires. (my apologies if this sounds confusing)

One level deeper: One who is not consistent but desires to be consistent desires to change their desires to desires that they will not then desire to change.

If you don't like not liking where you are, and you don't like where you are, move to somewhere where you will like where you are.

**[deleted]**· 2012-12-21T07:53:38.747Z · LW(p) · GW(p)

Ah, so true. Ultimately, I think that's exactly the point this article tries to make: if you don't want to do A, but you don't want to be the kind of person who doesn't want to do A (or you don't want to be the kind of person who doesn't do A), do A. If that doesn't work, change who you are.

> If so, why would you believe it's better to take an action that results in you having one billion and one utilons one-one-billionth of the time, and nothing all other times, than an action that reliably gives you one utilon?

One possible response is that the former action *is* preferable, but the intuition pump yields a different result because our intuitions are informed by *actual* small and large rewards (e.g., money), and in the real world getting $1 every day for eight years with certainty does not have the same utility as getting $2922 with probability 1/2922 each day for the next eight years. If real-world examples like money -- which is almost always more valuable now than later, inflation aside; and which bears hidden and nonlinearly changing utilities like 'security' and 'versatility' and 'social status' and 'peace of mind' that we learn to reason with intuitively as though they could not be quantified in a single utility metric analogous to the currency measure itself -- are the only intuitive grasp we have on 'utilons,' then we may make systematic errors in trying to work out how our values would, if we better understood our biases, be reflectively cashed out.

> See this sequence of comments on my post arguing that saying expectation maximization is correct is equivalent to saying that average utilitarianism is correct.

That thesis seems obviously wrong: the term "utilitarianism" refers not to maximising in general, but to maximising something pretty specific -- namely, the happiness of all people.