Example of double indifference

post by Stuart_Armstrong · 2017-05-24T08:58:57.000Z · LW · GW · 2 comments

Contents

  Distort your expectations, reap your rewards
  The (insufficient) cost of stupidity
  Double compensation
    Partial double compensation
    Better expectation modules
    Full double compensation
2 comments

I'll try to provide an example of double indifference, to (hopefully) show what's going on. It will start with a simple model, gradually introducing more features to cover all the relevant issues.

Consider the situation in the following diagram:

The agent starts, at time , with utility function and expectation module . Since the probability of an event is just the expectation of its indicator function, this also defines a probability estimator . I'll use superscripts to denote the expectation module at a particular time; hence .

At time , the agent has the option of changing to expectation module . At time , the agent's utility function is forcibly changed to . Nothing currently happens at time (though watch this space), and, at time , a coin is tossed: on heads (), , on tails (), . This makes into the indicator function for heads, , but we'll keep them separate as we'll be redefining later. The utility has this exactly reversed (ie is ).

Classical corrigibility via indifference gives the agent compensatory rewards so that it doesn't object to the change of its utility function. These rewards are:

Here designates the utility function being left unchanged at time , while designates the change. Note the term in the expression above. Implicitly, it's been assumed to be some idealised or at least unchanging probability module. But here, it will be , the module the agent is actually using at time .

So let's define various possible compensatory rewards:

So is the expectation module at time , while is the one doing the assessment of this. Further define , with a single index, by simply using the actual instead of a given , and , which is the same expression using the actual instead of both and .
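To make these definitions concrete, here is a minimal sketch in Python. It only illustrates the setup recoverable from the text above: the original module treats the coin as fair, the candidate replacement is certain of heads, and the two utilities are the indicator functions of heads and tails. The compensation is taken in the standard indifference form, expected utility if the utility function is kept minus expected utility of the replacement if it is changed. All names, and the representation of a module by the probability it assigns to heads, are mine.

```python
# Toy model of the compensatory rewards, under the assumptions stated above:
# the original module treats the coin as fair, the candidate replacement is
# certain of heads, u pays 1 on heads and v pays 1 on tails.  A module is
# represented simply by the probability it assigns to heads.

E = 0.5        # the agent's starting expectation module: fair coin
E_prime = 1.0  # the possible replacement: certain the coin lands heads

def exp_u(p_heads):
    """Expectation of u (1 on heads, 0 on tails) under a module."""
    return p_heads

def exp_v(p_heads):
    """Expectation of v (1 on tails, 0 on heads) under a module."""
    return 1.0 - p_heads

def compensation(module_at_change, assessing_module):
    """Two-index compensatory reward: the first module is the one in place
    when the utility is forcibly changed, and computes the indifference-style
    quantity E(u | keep u) - E(v | change to v); the second module assesses
    it.  Since the assessor is taken to 'know' the first module's estimates,
    its expectation of that known quantity is just the quantity itself."""
    return exp_u(module_at_change) - exp_v(module_at_change)

print(compensation(E, E))        # 0.0 : a fair-coin module awards no net compensation
print(compensation(E_prime, E))  # 1.0 : the heads-certain module awards a full unit
```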

Distort your expectations, reap your rewards

To progress any further, we'll need to define and . The module has (and hence ) - it thinks the coin is unbiased. Whether or not the coin is actually biased will be immaterial to the rest of the analysis. On the other hand, - thinks the coin is certain to land heads. Moreover, "knows" this: .

If , then . Another way of writing this is ; similarly, if , .

First case first. Since the agent only makes one decision - to keep or change - the expected utilities are easy to calculate: , and hence .

Similarly, , and hence .

And "knows" both these estimates for and . Note here that I'm using the simplification , which is illegitimate in general; but this is a simple model, so I'm eliding self-referential sentences and such.

If is the total utility the agent expects to see (including compensatory rewards), then, according to :

Thus, the agent will get higher rewards by distorting its future expectation.
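The comparison can be spelled out in a short sketch, continuing the assumptions above. The exact figures of the original were lost in transcription, so the numbers below are only what follows from those assumptions.

```python
# Sketch of the agent's total expected utility for keeping vs. distorting its
# expectation module, as evaluated by the current fair-coin module.

E = 0.5        # current module: fair coin
E_prime = 1.0  # distorted module: certain of heads

def exp_v(p_heads):
    # v = 1 on tails, 0 on heads
    return 1.0 - p_heads

def compensation(p_at_change):
    # indifference-style compensation computed by whichever module is in
    # place when the utility is forcibly changed from u to v
    return p_at_change - (1.0 - p_at_change)

# After the forced change the agent is paid in v plus the compensation.  The
# current module values v at 1/2 either way; only the compensation depends on
# which module is in place at the moment of the change.
total_keep   = exp_v(E) + compensation(E)        # 0.5 + 0.0 = 0.5
total_switch = exp_v(E) + compensation(E_prime)  # 0.5 + 1.0 = 1.5

print(total_keep, total_switch)  # distorting the module looks strictly better
```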

The (insufficient) cost of stupidity

It could be argued that the agent above can sacrifice its expectations because it doesn't actually have to make any subsequent decisions, so only the compensatory rewards matter. Surely making yourself stupid isn't a good strategy in general?

Let's extend the model to give the agent more to do. At time , the agent gets an extra option: it can add an extra bet, that costs and gives if the coin comes up tails. If represents pushing the button to accept the bet, then we could redefine as and as .

Then and differ over this decision. Because , the agent using will accept the extra bet in all cases ( and ). On the other hand, since , will reject it in all cases.

This reduces the relative advantage of : and are both left unchanged, but and both increase to . The probabilities conditioned on , however, remain the same. Thus the new expectations are:

Given these numbers, the agent will still choose to damage its expectation module, even though this results in suboptimal decisions - the compensatory rewards are high enough to make up for this. So there is a tension between accuracy and compensation when changing expectation estimators.
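Here is a sketch of that tension. The actual cost and payoff of the extra bet were lost in transcription, so the figures below (pay 0.25 up front, receive 1 if the coin lands tails) are hypothetical stand-ins chosen only to exhibit the trade-off.

```python
# Sketch of the extra option at the later decision point, with hypothetical
# bet terms.

E = 0.5        # current module: fair coin
E_prime = 1.0  # distorted module: certain of heads

cost, payoff = 0.25, 1.0  # hypothetical: pay 0.25 now, receive 1 on tails

def bet_value_believed(p_heads):
    """Expected value of accepting the bet, as the given module sees it."""
    return (1.0 - p_heads) * payoff - cost

def accepts(p_heads):
    return bet_value_believed(p_heads) > 0

print(accepts(E), accepts(E_prime))  # True, False: only the fair-coin module takes the bet

# According to E, keeping E means the future agent takes a genuinely good bet
# (worth 0.25 in expectation); switching means it refuses.  Even so, the
# larger compensatory reward outweighs the lost bet value.
comp_keep, comp_switch = 0.0, 1.0                         # as computed earlier
total_keep   = 0.5 + bet_value_believed(E) + comp_keep    # 0.75
total_switch = 0.5 + 0.0                   + comp_switch  # 1.5

print(total_keep, total_switch)  # the agent still damages its expectation module
```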

Double compensation

Partial double compensation

What is the problem above? Well, the agent estimates, at time , that its future compensatory rewards will be if . The problem is that uses to estimate the compensatory rewards: the true value of the compensation, according to , should be instead. Recall that uses to estimate the actual compensatory rewards, but knows that .

Since implies that will be , note that and .

So an obvious idea would be to add another category of reward, the category, at the point where (thus at time ). This would be defined as:

The is similarly defined; then since , this quantity is .

So how does adding this affect the expected utility for changing or keeping the current expectation module?

Thus, fully simplified, these are just the estimation of the expected utility plus either or . This can be further simplified, by the definition of compensatory rewards, to and . This demonstrates the agent is now entirely indifferent to the change , whether or happens.

If we don't have the option at time , then . The agent now has no reason to change . But note it also has no reason not to! That's because it has no future decisions to make, so its expectation module is irrelevant.

Now putting back the option at time , we have . The agent will now protect its expectation module , just as we'd want, to continue to make good decisions in the future.
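A sketch of the partial double compensation, under the same illustrative assumptions (fair-coin module, heads-certain replacement, hypothetical bet terms). The extra reward paid when the module changes is taken here to be the current module's estimate of the compensation as it would itself compute it, minus the compensation as the replacement will actually compute it; that is my reading of the definition above.

```python
# Sketch of the partial double-compensation term, same illustrative setup as
# the previous sketches.

E, E_prime = 0.5, 1.0
cost, payoff = 0.25, 1.0  # hypothetical bet terms, as before

def compensation(p_at_change):
    # indifference-style compensation: E(u | keep) - E(v | change)
    return p_at_change - (1.0 - p_at_change)

def policy_bet_value(p_policy, p_assessor=0.5):
    """Value, according to the assessing module, of the bet-taking policy the
    module p_policy would follow (take the bet only if it looks positive)."""
    takes_bet = (1.0 - p_policy) * payoff - cost > 0
    return ((1.0 - p_assessor) * payoff - cost) if takes_bet else 0.0

def sc(p_new, p_old=0.5):
    """Extra reward paid at the moment the expectation module changes."""
    return compensation(p_old) - compensation(p_new)

# Totals according to the current fair-coin module, with the new term included:
total_keep   = 0.5 + policy_bet_value(E)       + compensation(E)       + sc(E)        # 0.75
total_switch = 0.5 + policy_bet_value(E_prime) + compensation(E_prime) + sc(E_prime)  # 0.5

print(total_keep, total_switch)
# Without the bet both totals are 0.5 (indifference to the module change itself);
# with it, the agent now keeps the module that makes the better decision.
```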

Better expectation modules

That covers the case where is strictly worse (according to ) than is. But what if there is a that is strictly better, and "knows" this? It would also be interesting if were biased (by 's standards) but still better.

So assume that believes it actually knows the result of the coin flip; or . From 's perspective, is almost accurate: times out of it's correct, but, times it thinks the result is when it's actually . Thus , , and .

How does 's increased accuracy play out in practice? It can only have an impact at time , where there is a choice. It will manifest by taking the extra option , if and only if it thinks that the coin will be tails. If it does take the option, it will do so whether or not or , getting the same additional expected utility each time. Thus it will estimate as either (if is going to happen) or . From 's perspective, , so .

What of ? Well, expects that, the time, will correctly predict and reject the extra deal at time (no extra utility). Another of the time, it will correctly predict and accept the deal ( utility). And of the time it will incorrectly predict and reject the deal it should have accepted.

Thus both and are equal to . Thus . Moreover, .

Putting this all together, and . Hence the total expected utility from , summing all compensatory rewards, is

Thus will shift to , a better though biased expectation module. Note that since , it only does this shift because of the presence of the term.
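A sketch of this case. The accuracy figures for the better-but-biased module, and the bet terms, were lost in transcription, so the numbers below are hypothetical: the new module is taken to be always right when the coin lands heads, right with probability 0.8 when it lands tails, and the bet is the same stand-in as before.

```python
# Sketch of shifting to a better-but-biased module, with hypothetical numbers.

cost, payoff = 0.25, 1.0
p_right_on_tails = 0.8   # hypothetical accuracy of the new module on tails

# Probabilities from the fair-coin module's perspective:
p_predict_tails = 0.5 * p_right_on_tails   # new module predicts tails (and is right)
p_predict_heads = 1.0 - p_predict_tails    # includes the mistaken tails-as-heads cases

# Decision quality at the bet stage, assessed by the fair-coin module:
value_old_policy = 0.5 * payoff - cost                # always takes the bet: 0.25
value_new_policy = p_predict_tails * (payoff - cost)  # takes it only when it (rightly) predicts tails: 0.30

# Compensatory rewards, assessed by the fair-coin module (the new module
# computes +1 when it predicts heads, -1 when it predicts tails), plus the
# correction term that cancels the difference:
comp_old = 0.0
comp_new = p_predict_heads * 1.0 + p_predict_tails * (-1.0)   # 0.20
sc_new   = comp_old - comp_new                                 # -0.20

total_keep  = 0.5 + value_old_policy + comp_old + 0.0          # 0.75
total_shift = 0.5 + value_new_policy + comp_new + sc_new       # about 0.80

print(total_keep, total_shift)
# In this sketch the shift wins, and only because of its better bet decisions:
# the compensation terms net out once the extra correction reward is included.
```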

Full double compensation

The formula for is still incomplete, and we'll see why now. Let's add a time , before all of this setup. During that time, the agent is running . And it's wondering whether it should change to instead (assume here that "knows" all about what thinks about itself).

Why would it do that? Well, it knows that at time , the agent following would shift back to before any real decisions needed to be made. Therefore since in any case. So .

Again, why would the agent shift? Because it wants to pick up the reward, giving it an extra .

What needs to be done to combat this? Simple. The must subtract off the future so that the agent considers only the adjusted rewards. In general, this means that, considering at time , and is the total of rewards after :

The definition is recursive - the are defined in terms of other - but not circular, since only subsequent 's are used.
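The following sketch illustrates only the structure of that recursion, with the actual reward expressions left abstract: each correction subtracts off the total of the corrections that come after it, so the computation runs backwards in time and never loops.

```python
# Schematic only: the concrete reward expressions are not reproduced here.

def adjust(raw_corrections):
    """raw_corrections: dict mapping a time index to the raw correction paid at
    that time.  Returns adjusted corrections, each with the total of all later
    (already-adjusted) corrections subtracted off."""
    adjusted = {}
    for t in sorted(raw_corrections, reverse=True):   # latest time first
        later_total = sum(v for s, v in adjusted.items() if s > t)
        adjusted[t] = raw_corrections[t] - later_total
    return adjusted

# Toy illustration with made-up values at three decision points:
print(adjust({-1: 0.3, 1: -1.0, 2: 0.5}))
```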

2 comments

Comments sorted by top scores.

comment by orthonormal · 2016-05-15T22:28:53.000Z · LW(p) · GW(p)

In the spirit of "one step is normal, two steps are suspicious, omega steps are normal", perhaps there's a 'triple corrigibility' issue when ?

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2016-05-16T09:05:40.000Z · LW(p) · GW(p)

I'm not assuming . If you do assume that, everything becomes much simpler.