Deriving Conditional Expected Utility from Pareto-Efficient Decisions

post by Thomas Kwa (thomas-kwa) · 2022-05-05T03:21:38.547Z · LW · GW · 1 comments

Contents

  Introduction
  Summary
  Pareto efficiency over possible worlds implies EUM
  EUM implies conditional expected value
  Multiple decisions might imply conditional EV is meaningful

This is a distillation of this post [LW · GW] by John Wentworth.

Introduction

Suppose you're playing a poker game. You're an excellent poker player (though you've never studied probability), and your goal is to maximize your winnings.

Your opponent is about to raise, call, or fold, and you start thinking ahead.

Let's break down your thinking in the case where your opponent raises. Your thought process is something like this:

  1. If he raises, you want to take the action that maximizes your expected winnings.
  2. You want to make the decision that's best in the worlds where he would raise. You don't care about the worlds where he wouldn't raise, because you're currently assuming that he raises.
  3. Your poker intuition tells you that the worlds where he would raise are mostly the ones where he is bluffing. In these worlds your winnings are maximized by calling. So you decide the optimal policy if he raises is to call.

Step 2 is the important one here. Let's unpack it further.

  1. You don't know your opponent's actual hand or what he will do. But you're currently thinking about what to do if he raises.
  2. The optimal decision here depends only on worlds where he would raise.
  3. You decide how much you care about winning in different worlds precisely by thinking "how likely is this world, given that he raises?".

This sounds suspiciously like you're maximizing the Bayesian conditional expectation of your winnings: the expected value given some partial information about the world. This can be precisely defined as $\mathbb{E}[u \mid \text{he raises}] = \sum_X P(X \mid \text{he raises})\, u(X, A)$, where $u$ is your winnings, $A$ is your action, and $P(X)$ is the probability of world $X$. But you've never studied probability theory, so you don't know how to assign probabilities to worlds, much less what conditioning and expectation are! How could you possibly be maximizing a "conditional expectation"?
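
To make this concrete, here is a minimal sketch of that conditional expectation in code. The worlds, probabilities, and payoffs are made up for illustration and are not from the post:

```python
worlds = [
    # (prior P(world), does he raise in this world?, winnings if you call, winnings if you fold)
    (0.3, True,  -50, -10),   # he has a strong hand and raises
    (0.2, True,   80, -10),   # he is bluffing and raises
    (0.5, False,  20,   0),   # he has a weak hand and doesn't raise
]

def cond_expected_winnings(action):
    """E[winnings | he raises], for action in {'call', 'fold'}."""
    raised = [(p, call, fold) for p, raises, call, fold in worlds if raises]
    p_raise = sum(p for p, _, _ in raised)                  # P(he raises) = 0.5 here
    return sum((p / p_raise) * (call if action == "call" else fold)
               for p, call, fold in raised)

print(cond_expected_winnings("call"), cond_expected_winnings("fold"))  # ≈ 2.0, ≈ -10.0
print(max(["call", "fold"], key=cond_expected_winnings))               # call
```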

Luckily, your opponent folds and you win the hand. You resolve to (a) study coherence theorems and probability so you know the Law behind optimal poker strategy, and (b) figure out why you have a voice in your head telling you about "conditional expectations" and reading equations at you.

It turns out your behavior at the poker table can be derived from one particular property of your poker strategy: you never make a decision that is worse than another possible decision in all possible worlds. (An economist would say you're being Pareto-efficient about maximizing your winnings in different possible worlds).

Summary

An agent which has some goal, has uncertainty over which world it's in, and is Pareto-efficient in the amount of goal achieved in different possible worlds, can be modeled as using conditional probability. We show this result in two steps:

  1. Pareto efficiency over possible worlds implies expected utility maximization (EUM) with respect to some weighting over worlds.
  2. EUM implies that each of the agent's decisions can be modeled as maximizing a conditional expected value.

There's also a third, more speculative step:

  3. When the agent makes many decisions from different observations, conditional expected value over worlds may be a meaningful description of the agent, rather than just one that can always be constructed.

This result is essentially a very weak selection theorem [AF · GW].

Pareto efficiency over possible worlds implies EUM

Suppose that an agent is in some world $X$ and has uncertainty over which world it's in. The agent has a goal $u$ and is Pareto-efficient with respect to maximizing the amount of goal achieved in each world. A well-known result in economics says that Pareto efficiency implies the existence of some function $P(X)$ such that the agent chooses its actions $A$ to maximize the weighted sum $\sum_X P(X)\, u(X, A)$. (Without loss of generality, we can let $P$ sum to 1.) If we interpret $P(X)$ as the probability of world $X$, the agent maximizes $\mathbb{E}[u(X, A)]$, i.e. expected utility.
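
As a small sanity check of this direction of the result, here is a sketch with made-up payoff vectors: for each Pareto-efficient action, a brute-force search finds some weights $P$ (nonnegative, summing to 1) under which that action maximizes the weighted sum. The action names, payoffs, and grid of weights are purely illustrative:

```python
payoffs = {          # action -> (goal achieved in world 1, world 2, world 3)
    "A1": (3, 1, 0),
    "A2": (2, 2, 2),
    "A3": (1, 1, 1),  # Pareto-dominated by A2, so no weights will select it
}

def is_pareto_efficient(a):
    """No other action is at least as good in every world and strictly better in some."""
    return not any(
        all(y >= x for x, y in zip(payoffs[a], payoffs[b])) and payoffs[b] != payoffs[a]
        for b in payoffs
    )

def weighted_sum(a, P):
    return sum(p * u for p, u in zip(P, payoffs[a]))

# Coarse grid of candidate weight vectors P that are nonnegative and sum to 1.
grid = [(i / 10, j / 10, (10 - i - j) / 10)
        for i in range(11) for j in range(11 - i)]

for a in payoffs:
    if is_pareto_efficient(a):
        P = next(P for P in grid
                 if all(weighted_sum(a, P) >= weighted_sum(b, P) for b in payoffs))
        print(f"{a} maximizes the weighted sum under P = {P}")
```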

Note that we have not determined anything about P other than that it sums to 1. Some properties we don't know or derive in this setup:

The following example assumes that we have an expected utility maximizer in the sense of being Pareto efficient over multiple worlds, and shows that it behaves as if it uses conditional probabilities.

EUM implies conditional expected value

Another example, but we actually walk through the math this time.

You live in Berkeley, CA, like Korean food, and have utility function u = "subjective quality of food you eat". Suppose you are deciding where to eat based only on names and Yelp reviews of restaurants. You are uncertain about X, a random variable representing the quality of all restaurants under your preferences, and Yelp reviews give you partial information about this. Your decision-making is some function A(f(X)) of the information f(X) in the Yelp reviews, and you choose A to maximize your expected utility across worlds: maybe the optimal A is to compare the average star ratings, give Korean restaurants a 0.2 star bonus, and pick the restaurant with the best adjusted average rating.
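
Here is a minimal sketch of one such policy A. The 0.2-star bonus and the star ratings match the lookup table below, but the tuple format of the observations and which restaurants count as Korean are assumptions made for illustration:

```python
def A(reviews, korean_bonus=0.2):
    """One possible policy A(f(X)): pick the restaurant with the highest adjusted rating.

    `reviews` is a list of (name, average_stars, is_korean) tuples.
    """
    def adjusted(entry):
        name, stars, is_korean = entry
        return stars + (korean_bonus if is_korean else 0.0)
    return max(reviews, key=adjusted)[0]

print(A([("Mad Seoul", 4.5, True), ("Sushinista", 4.8, False)]))        # Sushinista
print(A([("Kimchi Garden", 4.3, True), ("Great China", 4.4, False)]))   # Kimchi Garden
```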

Here, we assume you behave like an "expected utility maximizer" in the weak sense above. I claim we can model you as maximizing conditional expected value.

Suppose you're constructing a lookup table for the best action A given each possible observation of reviews. Your lookup table looks something like

| f(X) | A(f(X)) |
|------|---------|
| {("Mad Seoul", 4.5), ("Sushinista", 4.8)} | eat at Sushinista |
| {("Kimchi Garden", 4.3), ("Great China", 4.4)} | eat at Kimchi Garden |

You always calculate the action $A$ that maximizes $\sum_X P(X)\, u(X, A(f(X)))$.

Suppose that in a given row we have $f(X) = f_0$, where $f_0$ is some observation. Then we are finding $\operatorname{argmax}_a \sum_{X : f(X) = f_0} P(X)\, u(X, a)$; worlds inconsistent with $f_0$ don't depend on this row's entry, so they drop out of the argmax. We can make a series of simplifications: dividing by the constant $P(f(X) = f_0)$ doesn't change the argmax, and it turns each weight $P(X)$ into the conditional probability $P(X \mid f(X) = f_0)$, so we are finding $\operatorname{argmax}_a \mathbb{E}[u(X, a) \mid f(X) = f_0]$.

Thus, we can model you as using conditional expected value.
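
A quick numerical check of the simplification above, with made-up worlds, observations, and utilities: restricting the sum to worlds consistent with the observation and then normalizing by $P(f(X) = f_0)$ doesn't change which action wins the argmax.

```python
worlds = [
    # (P(X), f(X), {action: u(X, action)})
    (0.25, "f0", {"a1": 5, "a2": 1}),
    (0.25, "f0", {"a1": 0, "a2": 3}),
    (0.50, "f1", {"a1": 9, "a2": 0}),   # inconsistent with the observation f0
]

def unnormalized_score(a, obs):
    """Sum over worlds consistent with obs of P(X) * u(X, a)."""
    return sum(p * u[a] for p, f, u in worlds if f == obs)

def conditional_ev(a, obs):
    """E[u(X, a) | f(X) = obs]: the same sum divided by P(f(X) = obs)."""
    p_obs = sum(p for p, f, _ in worlds if f == obs)
    return unnormalized_score(a, obs) / p_obs

actions = ["a1", "a2"]
assert max(actions, key=lambda a: unnormalized_score(a, "f0")) == \
       max(actions, key=lambda a: conditional_ev(a, "f0"))
print({a: conditional_ev(a, "f0") for a in actions})   # {'a1': 2.5, 'a2': 2.0}
```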

Multiple decisions might imply conditional EV is meaningful

This section is a distillation of, and expansion upon, this comment thread [LW(p) · GW(p)].

Suppose now that you're making multiple decisions $A_1, \dots, A_n$ in a distributed fashion to maximize the same utility function, where there is no information flow between the decisions. For example, 10 copies of you (with the same preferences and same choice of restaurants) are dropped into Berkeley, but they all have slightly different observation processes $f_1, \dots, f_{10}$: Google Maps reviews, Grubhub reviews, personal anecdotes, etc.

Now, when constructing a lookup table for $A_i$, each copy of you will still condition each row's output on its input. When making decision $A_i$ from input $f_i(X)$, you don't have the other information $f_j(X)$ for $j \neq i$, so you consider each decision separately, still maximizing $\mathbb{E}[u \mid f_i(X)]$. Here, the information $f_i$ does not depend on other decisions, but this is not necessary for the core point.[2]
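
A minimal sketch of this distributed setup, with hypothetical worlds, observation processes, and utilities: each copy conditions only on its own observation $f_i(X)$ (index 0 or 1 below) and ignores the others.

```python
worlds = [
    # (P(X), f_1(X), f_2(X), {action: u(X, action)})
    (0.4, "high", "low",  {"korean": 8, "sushi": 2}),
    (0.3, "high", "high", {"korean": 4, "sushi": 6}),
    (0.3, "low",  "high", {"korean": 1, "sushi": 9}),
]

def best_action(i, obs):
    """Decision A_i: argmax_a E[u(X, a) | f_i(X) = obs], ignoring the other observation."""
    consistent = [(p, u) for p, *fs, u in worlds if fs[i] == obs]
    total = sum(p for p, _ in consistent)                      # P(f_i(X) = obs)
    actions = ["korean", "sushi"]
    return max(actions, key=lambda a: sum((p / total) * u[a] for p, u in consistent))

print(best_action(0, "high"), best_action(1, "high"))   # korean sushi
```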

In the setup with one decision, we showed that a Pareto-efficient agent can be modeled as maximizing conditional EU over possible worlds $X$. But because one can construct [LW · GW] a utility function of type $f(X) \to \mathbb{R}$ consistent with any agent's behavior, the agent can also be modeled as maximizing conditional EU over possible observations $f(X)$. In the single-decision case, there is no compelling reason to model the agent as caring about worlds rather than observations, especially because storing and processing observations should be simpler than storing and processing distributions over worlds.

When the agent makes multiple decisions based on different observations $f_1(X), \dots, f_n(X)$, there are two possible "trivial" ways to model it: either as maximizing a single utility function $u(f_1(X), \dots, f_n(X))$ over the combined observations, or as maximizing $n$ separate utility functions $u_i(f_i(X))$, one per decision. However, with sufficiently many decisions, neither of these trivial representations is as "nice" as conditional EU over possible worlds: once the combined observations contain more bits than the world state itself,[3] a single distribution over worlds $X$ together with the utility function $u(X, A)$ is a more compact description of the agent than either trivial representation.

  1. ^

    John made the following comment: 

    We are showing that the agent performs Bayesian updates, in some sense. That's basically what conditioning is. It's just not necessarily performing a series of updates over time, with each retaining the information from the previous, the way we usually imagine.

  2. ^

    When $f_i$ depends on past decisions, the agent just maximizes $\mathbb{E}[u \mid f_i]$, where $f_i$ now also reflects the earlier decisions. To see the math for the multi-decision case, read the original post [LW · GW] by John Wentworth.

  3. ^

    If the world has $n$ bits of state, and the observations reveal $k$ bits of information each, the pigeonhole principle says this surely happens when there are more than $n/k$ observations. Our universe has roughly $10^{122}$ bits of state, so this won't happen unless our agent can operate coherently across ~$10^{122}/k$ different decisions; this number can maybe be reduced if we suppose that our agent can only actually observe a much smaller fraction of the universe's state.

1 comment

Comments sorted by top scores.

comment by Thomas Kwa (thomas-kwa) · 2022-06-09T05:44:30.457Z · LW(p) · GW(p)

Thoughts on the process of writing this post:

  • It took a lot of effort to write, something like 3 days of my time. Distillation is hard.
  • Most of this effort was not in understanding the original post (took me 2-3 hours to understand the math)
  • I sent drafts to johnwentworth several times and had several conversations with him to refine this piece. This probably spent ~2 hours of his time.
  • I'm not satisfied with the final result. It seems like the point the original post made was fairly obvious and I used way too many words to explain it properly. Maybe John thought the interpretation of the math was fairly deep and I thought it wasn't very deep?
  • I think that since John is a good and prolific writer already compared to most alignment researchers, there is higher value in distilling ideas of other researchers. It's hard to produce a lot of value from content already on LW.
    • Paul Christiano blogposts are somewhat famously opaque; distillations of these have worked [LW · GW] in the past and still seem pretty valuable. The highest-relevance academic papers might be better. But many of the highest-value distillations probably involve talking to researchers to get things they're too busy to write down at all.