Example of double indifference

post by Stuart_Armstrong · 2017-05-24T08:58:57.000Z · LW · GW · 2 comments

Contents

  Distort your expectations, reap your rewards
  The (insufficient) cost of stupidity
  Double compensation
    Partial double compensation
    Better expectation modules
    Full double compensation
2 comments

I'll try to provide an example of double indifference, to (hopefully) show what's going on. It will start with a simple model, gradually introducing more features to cover all the relevant issues.

Consider the situation in the following diagram:

The agent starts, at time , with utility function and expectation module . Since the probability of an event is just the expectation of its indicator function, this also defines a probability estimator . I'll use superscripts to denote the expectation module at a particular time; hence .

At time , the agent has the option of changing to expectation module . At time , the agent's utility function is forcibly changed to . Nothing currently happens at time (though watch this space), and, at time , a coin is tossed: on heads (), , on tails (), . This makes into the indicator function for heads, , but we'll keep them separate as we'll be redefining later. The utility has this exactly reversed (ie is ).

Classical corrigibility via indifference gives the agent compensatory rewards so that it doesn't object to the change of its utility function. These rewards are:

Here designates the utility function being left unchanged at time , while designates the change. Note the term in the expression above. Implicitly, it's been assumed to be some idealised or at least unchanging probability module. But here, it will be , the module the agent is actually using at time .

So let's define various possible compensatory rewards:

So is the expectation module at time , while is the one doing the assessment of this. Further define , with a single index, by simply using the actual instead of a given , and , which is the same expression using the actual instead of both and .
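To make these definitions concrete, here is a minimal sketch in Python. It only illustrates the setup recoverable from the text above: the original module treats the coin as fair, the candidate replacement is certain of heads, and the two utilities are the indicator functions of heads and tails. The compensation is taken in the standard indifference form, expected utility if the utility function is kept minus expected utility of the replacement if it is changed. All names, and the representation of a module by the probability it assigns to heads, are mine.

```python
# Toy model of the compensatory rewards, under the assumptions stated above:
# the original module treats the coin as fair, the candidate replacement is
# certain of heads, u pays 1 on heads and v pays 1 on tails.  A module is
# represented simply by the probability it assigns to heads.

E = 0.5        # the agent's starting expectation module: fair coin
E_prime = 1.0  # the possible replacement: certain the coin lands heads

def exp_u(p_heads):
    """Expectation of u (1 on heads, 0 on tails) under a module."""
    return p_heads

def exp_v(p_heads):
    """Expectation of v (1 on tails, 0 on heads) under a module."""
    return 1.0 - p_heads

def compensation(module_at_change, assessing_module):
    """Two-index compensatory reward: the first module is the one in place
    when the utility is forcibly changed, and computes the indifference-style
    quantity E(u | keep u) - E(v | change to v); the second module assesses
    it.  Since the assessor is taken to 'know' the first module's estimates,
    its expectation of that known quantity is just the quantity itself."""
    return exp_u(module_at_change) - exp_v(module_at_change)

print(compensation(E, E))        # 0.0 : a fair-coin module awards no net compensation
print(compensation(E_prime, E))  # 1.0 : the heads-certain module awards a full unit
```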

Distort your expectations, reap your rewards

To progress any further, we'll need to define and . The module has (and hence ) - it thinks the coin is unbiased. Whether or not the coin is actually biased will be immaterial to the rest of the analysis. On the other hand, - thinks the coin is certain to land heads. Moreover, "knows" this: .

If , then . Another way of writing this is ; similarly, if , .

First case first. Since the agent only makes one decision - to keep or change - the expected utilities are easy to calculate: , and hence .

Similarly, , and hence .

And "knows" both these estimates for and . Note here that I'm using the simplification , which is illegitimate in general; but this is a simple model, so I'm eliding self-referential sentences and such.

If is the total utility the agent expects to see (including compensatory rewards), then, according to :

Thus, the agent will get higher rewards by distorting its future expectation.
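The comparison can be spelled out in a short sketch, continuing the assumptions above. The exact figures of the original were lost in transcription, so the numbers below are only what follows from those assumptions.

```python
# Sketch of the agent's total expected utility for keeping vs. distorting its
# expectation module, as evaluated by the current fair-coin module.

E = 0.5        # current module: fair coin
E_prime = 1.0  # distorted module: certain of heads

def exp_v(p_heads):
    # v = 1 on tails, 0 on heads
    return 1.0 - p_heads

def compensation(p_at_change):
    # indifference-style compensation computed by whichever module is in
    # place when the utility is forcibly changed from u to v
    return p_at_change - (1.0 - p_at_change)

# After the forced change the agent is paid in v plus the compensation.  The
# current module values v at 1/2 either way; only the compensation depends on
# which module is in place at the moment of the change.
total_keep   = exp_v(E) + compensation(E)        # 0.5 + 0.0 = 0.5
total_switch = exp_v(E) + compensation(E_prime)  # 0.5 + 1.0 = 1.5

print(total_keep, total_switch)  # distorting the module looks strictly better
```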

The (insufficient) cost of stupidity

It could be argued that the agent above can sacrifice its expectations because it doesn't actually have to make any subsequent decisions, so only the compensatory rewards matter. Surely making yourself stupid isn't a good strategy in general?

Let's extend the model to give the agent more to do. At time , the agent gets an extra option: it can add an extra bet, that costs and gives if the coin comes up tails. If represents pushing the button to accept the bet, then we could redefine as and as .

Then and differ over this decision. Because , the agent using will accept the extra bet in all cases ( and ). On the other hand, since , will reject it in all cases.

This reduces the relative advantage of : and are both left unchanged, but and both increase to . The probabilities conditioned on , however, remain the same. Thus the new expectations are:

Given these numbers, the agent will still choose to damage its expectation module, even though this results in suboptimal decisions - the compensatory rewards are high enough to make up for this. So there is a tension between accuracy and compensation when changing expectation estimators.
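Here is a sketch of that tension. The actual cost and payoff of the extra bet were lost in transcription, so the figures below (pay 0.25 up front, receive 1 if the coin lands tails) are hypothetical stand-ins chosen only to exhibit the trade-off.

```python
# Sketch of the extra option at the later decision point, with hypothetical
# bet terms.

E = 0.5        # current module: fair coin
E_prime = 1.0  # distorted module: certain of heads

cost, payoff = 0.25, 1.0  # hypothetical: pay 0.25 now, receive 1 on tails

def bet_value_believed(p_heads):
    """Expected value of accepting the bet, as the given module sees it."""
    return (1.0 - p_heads) * payoff - cost

def accepts(p_heads):
    return bet_value_believed(p_heads) > 0

print(accepts(E), accepts(E_prime))  # True, False: only the fair-coin module takes the bet

# According to E, keeping E means the future agent takes a genuinely good bet
# (worth 0.25 in expectation); switching means it refuses.  Even so, the
# larger compensatory reward outweighs the lost bet value.
comp_keep, comp_switch = 0.0, 1.0                         # as computed earlier
total_keep   = 0.5 + bet_value_believed(E) + comp_keep    # 0.75
total_switch = 0.5 + 0.0                   + comp_switch  # 1.5

print(total_keep, total_switch)  # the agent still damages its expectation module
```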

Double compensation

Partial double compensation

What is the problem above? Well, the agent estimates, at time , that its future compensatory rewards will be if . The problem is that uses to estimate the compensatory rewards: the true value of the compensation, according to , should be instead. Recall that uses to estimate the actual compensatory rewards, but knows that .

Since implies that will be , note that and .

So an obvious idea would be to add another category of reward, the category, at the point where (thus at time ). This would be defined as:

The is similarly defined; then since , this quantity is .

So how does adding this affect the expected utility for changing or keeping the current expectation module?

Thus, fully simplified, these are just the estimation of the expected utility plus either or . This can be further simplified, by the definition of compensatory rewards, to and . This demonstrates the agent is now entirely indifferent to the change , whether or happens.

If we don't have the option at time , then . The agent now has no reason to change . But note it also has no reason not to! That's because it has no future decisions to make, so its expectation module is irrelevant.

Now putting back the option at time , we have . The agent will now protect its expectation module , just as we'd want, to continue to make good decisions in the future.
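A sketch of the partial double compensation, under the same illustrative assumptions (fair-coin module, heads-certain replacement, hypothetical bet terms). The extra reward paid when the module changes is taken here to be the current module's estimate of the compensation as it would itself compute it, minus the compensation as the replacement will actually compute it; that is my reading of the definition above.

```python
# Sketch of the partial double-compensation term, same illustrative setup as
# the previous sketches.

E, E_prime = 0.5, 1.0
cost, payoff = 0.25, 1.0  # hypothetical bet terms, as before

def compensation(p_at_change):
    # indifference-style compensation: E(u | keep) - E(v | change)
    return p_at_change - (1.0 - p_at_change)

def policy_bet_value(p_policy, p_assessor=0.5):
    """Value, according to the assessing module, of the bet-taking policy the
    module p_policy would follow (take the bet only if it looks positive)."""
    takes_bet = (1.0 - p_policy) * payoff - cost > 0
    return ((1.0 - p_assessor) * payoff - cost) if takes_bet else 0.0

def sc(p_new, p_old=0.5):
    """Extra reward paid at the moment the expectation module changes."""
    return compensation(p_old) - compensation(p_new)

# Totals according to the current fair-coin module, with the new term included:
total_keep   = 0.5 + policy_bet_value(E)       + compensation(E)       + sc(E)        # 0.75
total_switch = 0.5 + policy_bet_value(E_prime) + compensation(E_prime) + sc(E_prime)  # 0.5

print(total_keep, total_switch)
# Without the bet both totals are 0.5 (indifference to the module change itself);
# with it, the agent now keeps the module that makes the better decision.
```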

Better expectation modules

That covers the case where is strictly worse (according to ) than is. But what if there is a that is strictly better, and "knows" this? It would also be interesting if were biased (by 's standards) but still better.

So assume that believes it actually knows the result of the coin flip; or . From 's perspective, is almost accurate: times out of it's correct, but, times it thinks the result is when it's actually . Thus , , and .

How does 's increased accuracy play out in practice? It can only have an impact at time , where there is a choice. It will manifest by taking the extra option , if and only if it thinks that the coin will be tails. If it does take the option, it will do so whether or not or , getting the same additional expected utility each time. Thus it will estimate as either (if is going to happen) or . From 's perspective, , so .

What of ? Well, expects that, the time, will correctly predict and reject the extra deal at time (no extra utility). Another of the time, it will correctly predict and accept the deal ( utility). And of the time it will incorrectly predict and reject the deal it should have accepted.

Thus both and are equal to . Thus . Moreover, .

Putting this all together, and . Hence the total expected utility from , summing all compensatory rewards, is

Thus will shift to , a better though biased expectation module. Note that since , it only does this shift because of the presence of the term.
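A sketch of this case. The accuracy figures for the better-but-biased module, and the bet terms, were lost in transcription, so the numbers below are hypothetical: the new module is taken to be always right when the coin lands heads, right with probability 0.8 when it lands tails, and the bet is the same stand-in as before.

```python
# Sketch of shifting to a better-but-biased module, with hypothetical numbers.

cost, payoff = 0.25, 1.0
p_right_on_tails = 0.8   # hypothetical accuracy of the new module on tails

# Probabilities from the fair-coin module's perspective:
p_predict_tails = 0.5 * p_right_on_tails   # new module predicts tails (and is right)
p_predict_heads = 1.0 - p_predict_tails    # includes the mistaken tails-as-heads cases

# Decision quality at the bet stage, assessed by the fair-coin module:
value_old_policy = 0.5 * payoff - cost                # always takes the bet: 0.25
value_new_policy = p_predict_tails * (payoff - cost)  # takes it only when it (rightly) predicts tails: 0.30

# Compensatory rewards, assessed by the fair-coin module (the new module
# computes +1 when it predicts heads, -1 when it predicts tails), plus the
# correction term that cancels the difference:
comp_old = 0.0
comp_new = p_predict_heads * 1.0 + p_predict_tails * (-1.0)   # 0.20
sc_new   = comp_old - comp_new                                 # -0.20

total_keep  = 0.5 + value_old_policy + comp_old + 0.0          # 0.75
total_shift = 0.5 + value_new_policy + comp_new + sc_new       # about 0.80

print(total_keep, total_shift)
# In this sketch the shift wins, and only because of its better bet decisions:
# the compensation terms net out once the extra correction reward is included.
```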

Full double compensation

The formula for is still incomplete, and we'll see why now. Let's add a time , before all of this setup. During that time, the agent is running . And it's wondering whether it should change to instead (assume here that "knows" all about what thinks about itself).

Why would it do that? Well, it knows that at time , the agent following would shift back to before any real decisions needed to be made. Therefore since in any case. So .

Again, why would the agent shift? Because it wants to pick up the reward, giving it an extra .

What needs to be done to combat this? Simple. The must subtract off the future so that the agent considers only the adjusted rewards. In general, this means that, considering at time , and is the total of rewards after :

The definition is recursive - the are defined in terms of other - but not circular, since only subsequent 's are used.
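The following sketch illustrates only the structure of that recursion, with the actual reward expressions left abstract: each correction subtracts off the total of the corrections that come after it, so the computation runs backwards in time and never loops.

```python
# Schematic only: the concrete reward expressions are not reproduced here.

def adjust(raw_corrections):
    """raw_corrections: dict mapping a time index to the raw correction paid at
    that time.  Returns adjusted corrections, each with the total of all later
    (already-adjusted) corrections subtracted off."""
    adjusted = {}
    for t in sorted(raw_corrections, reverse=True):   # latest time first
        later_total = sum(v for s, v in adjusted.items() if s > t)
        adjusted[t] = raw_corrections[t] - later_total
    return adjusted

# Toy illustration with made-up values at three decision points:
print(adjust({-1: 0.3, 1: -1.0, 2: 0.5}))
```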

2 comments

Comments sorted by top scores.

comment by orthonormal · 2016-05-15T22:28:53.000Z · LW(p) · GW(p)

In the spirit of "one step is normal, two steps are suspicious, omega steps are normal", perhaps there's a 'triple corrigibility' issue when ?

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2016-05-16T09:05:40.000Z · LW(p) · GW(p)

I'm not assuming . If you do assume that, everything becomes much simpler.