Very different, very adequate outcomes

post by Stuart_Armstrong · 2019-08-02T20:31:00.751Z · score: 13 (4 votes) · LW · GW · 10 comments

Let $U_p$ be the utility function that - somehow - expresses your preferences[1]. Let $U_h$ be the utility function that expresses your hedonistic pleasure.

Now imagine an AI is programmed to maximise $qU_p + (1-q)U_h$. If we vary $q$ in the range of 5% to 95%, then we will get very different outcomes. At $q = 5\%$, we will generally be hedonically satisfied, and our preferences will be followed if they don't cause us to be unhappy. At $q = 95\%$, we will accomplish any preference that doesn't cause us huge amounts of misery.
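As a minimal sketch of how the weight moves the maximiser between outcomes, write U_p for the preference utility, U_h for the hedonic utility, and q for the weight on U_p. The three outcomes and their scores below are invented purely for illustration, and both utilities are assumed to be already normalised to [0, 1]; none of these numbers come from the post.

```python
# Toy illustration: the outcomes and their scores are hypothetical.
# Each outcome maps to (U_p, U_h), both assumed normalised to [0, 1].
OUTCOMES = {
    "hedonic_paradise": (0.2, 1.0),
    "balanced": (0.7, 0.7),
    "striving": (1.0, 0.2),
}


def mixed_utility(q, u_p, u_h):
    """Return q*U_p + (1-q)*U_h, the weighted objective described in the text."""
    return q * u_p + (1 - q) * u_h


def best_outcome(q):
    """Return the name of the outcome that maximises the mixed utility for this q."""
    return max(OUTCOMES, key=lambda name: mixed_utility(q, *OUTCOMES[name]))


if __name__ == "__main__":
    for q in (0.05, 0.5, 0.95):
        print(q, best_outcome(q))
```

In this toy world the maximiser picks a different outcome at each of the three weights, yet none of the three outcomes scores zero on either utility, which is the sense of "very different, very adequate" used here.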

It's clear that, extrapolated over the whole future of the universe, these could lead to very different outcomes[2]. But - and this is the crucial point - none of these outcomes are really that bad. None of them are the disasters that could happen if we picked a random utility $U$. So, for all their differences, they reside in the same nebulous category of "yeah, that's an ok outcome." Of course, we would have preferences as to where $q$ lies exactly, but few of us would risk the survival of the universe to yank $q$ around within that range.

What happens when we push $q$ towards the edges? Pushing $q$ towards 0 seems a clear disaster: we're happy, but none of our preferences are respected; we basically don't matter as agents interacting with the universe any more. Pushing $q$ towards 1 might be a disaster: we could end up always miserable, even as our preferences are fully followed. The only thing protecting us from that fate is the fact that our preferences include hedonistic pleasure; but this might not be the case in all circumstances. So moving to the edges is risky in the way that moving around in the middle is not.

In my research agenda, I talk about adequate outcomes, given a choice of parameters, or acceptable approximations. I mean these terms in the sense of the example above: the outcomes may vary tremendously from one another, given the parameters or the approximation. Nevertheless, all the outcomes avoid disasters and are clearly better than maximising a random utility function.


  1. This being a somewhat naive form of preference utilitarianism, along the lines of "if the human chooses it, then it's ok". In particular, you can end up in equilibriums where you are miserable, but unwilling to choose not to be (see, for example, some forms of depression). ↩︎

  2. This fails to be true if preference and hedonism can be maximised independently; eg if we could take an effective happy pill and still follow all our preferences. I'll focus on the situation where there are true tradeoffs between preference and hedonism. ↩︎

10 comments


comment by johnswentworth · 2019-08-02T23:31:12.008Z · score: 9 (4 votes) · LW · GW

One potential problem: if the two utilities have different asymptotic behavior, then one of them can dominate decision-making. For instance, suppose we're using 0-1 normalization, but one of the two utilities has a big spike or tail somewhere. Then it's going to have near-zero slope everywhere else.

More concrete example: on the hedonism axis, humans have more capacity for severe pain than extreme pleasure. So that end of the axis has a big downward spike, and the hedonism-utility would be near-flat at the not-severe-pain end (at least for any of the normalizations you suggest, other than max-mean, which has the same problem with the other end of the axis). But if the preferences-utility lacks a big spike like that, then we're liable to end up with constant low-grade hedonic unhappiness.

That's still a lot better than plenty of other possible outcomes - preference-utility still looks good, and we're not in constant severe pain. But it still seems not very good.
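The effect described in this comment can be sketched numerically. The raw scores below are hypothetical, chosen only so that the hedonic utility has one severe-pain spike while the preference utility does not; the 0-1 normalisation is the one named in the comment, and the weight of 0.5 is an arbitrary assumption.

```python
def normalise_0_1(values):
    """Rescale utilities so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw scores for four outcomes (index 0 is the severe-pain outcome).
RAW_HEDONIC = [-1000.0, -2.0, 0.0, 2.0]
RAW_PREFERENCE = [0.0, 3.0, 2.0, 1.0]

norm_h = normalise_0_1(RAW_HEDONIC)
norm_p = normalise_0_1(RAW_PREFERENCE)

# Spread among the outcomes that are not severe pain (indices 1 to 3).
hedonic_spread = max(norm_h[1:]) - min(norm_h[1:])
preference_spread = max(norm_p[1:]) - min(norm_p[1:])

# Equal-weight mix of the two normalised utilities (q = 0.5 is an assumption).
q = 0.5
scores = [q * p + (1 - q) * h for p, h in zip(norm_p, norm_h)]
best_index = max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    print("hedonic spread:", hedonic_spread)
    print("preference spread:", preference_spread)
    print("chosen outcome index:", best_index)
```

With these numbers the normalised hedonic utility barely distinguishes the three ordinary outcomes, so the mix picks the outcome with the highest preference score even though its raw hedonic score is mildly negative.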

comment by Veedrac · 2019-08-03T06:25:19.692Z · score: 4 (2 votes) · LW · GW
Pushing q towards 1 might be a disaster

If I consider satisfaction of my preferences to be a disaster, in what sense can I realistically call them my preferences? It feels like you're more caught up on the difficulty of extrapolating these preferences outside of their standard operation, but that seems like a rather different issue.

comment by Stuart_Armstrong · 2019-08-04T19:25:15.968Z · score: 5 (3 votes) · LW · GW

I'm thinking of a rather naive form of preference utilitarianism, of the sort "if the human agrees to it or chooses it, then it's ok". In particular, you can end up with some forms of depression where the human is miserable, but isn't willing to change.

I'll clarify that in the post.

comment by Wei_Dai · 2019-08-03T15:12:52.242Z · score: 3 (1 votes) · LW · GW

This seems way too handwavy. If q being close enough to 0 will cause a disaster, why isn't 5% close enough to 0? How much do you expect switching from q=1 to q=5% to reduce $U_p$? Why?

If moving from q=1 to q=5% reduces $U_p$ by a factor of 2, for example, and it turns out that $U_p$ is the correct utility function, that would be equivalent to incurring a 50% x-risk. Do you think that should be considered "ok" or "adequate", or have some reason to think that $U_p$ wouldn't be reduced nearly this much?
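The equivalence claimed here can be spelled out as a one-line expected-value calculation. This is a sketch under two assumptions that the comment does not state explicitly: the preference utility (written $U_p$) is scaled so that an existential catastrophe has utility 0, and $U_p^{\max}$ denotes the value achieved at q=1.

```latex
\underbrace{\tfrac{1}{2}\,U_p^{\max}}_{\text{achieved at } q = 5\%}
\;=\;
\underbrace{(1 - 0.5)\,U_p^{\max} + 0.5 \cdot 0}_{\text{expected } U_p \text{ at } q = 1 \text{ with a 50\% x-risk}}
```

Under those assumptions, an agent that cares only about $U_p$ would be indifferent between the two sides.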

comment by Stuart_Armstrong · 2019-08-04T19:23:21.198Z · score: 2 (1 votes) · LW · GW

I'm finding these "$U_p$ is the correct utility function" statements hard to parse. Humans have a bit of $U_p$ and a bit of $U_h$. But we are underdefined systems; there is no specific value of $q$ that is "true". We can only assess the quality of $q$ using other aspects of human underdefined preferences.

This seems way too handwavy.

It is. Here's an attempt at a more formal definition: humans have collections of underdefined and somewhat contradictory preferences (using preferences in a more general sense than preference utilitarianism). These preferences seem to be stronger in the negative sense than in the positive: humans seem to find the loss of a preference much worse than the gain. And the negative is much more salient, and often much more clearly defined, than the positive.

Given that maximising one preference tends to put the values of others at extreme values, human overall preferences seem better captured by a weighted mix of preferences (or a smooth min of preferences) than by any single preference, or small set of preferences. So it is not a good idea to be too close to the extremes (extremes being places where some preferences have zero weight put on them).
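To make the two aggregation options concrete, here is a minimal sketch comparing a weighted mix with one common form of smooth min (a log-sum-exp soft minimum). The choice of that particular formula, the sharpness value, and the two example preference profiles are all assumptions made for illustration; the comment does not specify which smooth min is meant.

```python
import math


def weighted_mix(values, weights):
    """Weighted sum of preference satisfactions."""
    return sum(w * v for w, v in zip(weights, values))


def smooth_min(values, sharpness=10.0):
    """Soft minimum: -(1/k) * log(sum(exp(-k * v))).

    It is a lower bound on min(values) and approaches it as k grows.
    """
    total = sum(math.exp(-sharpness * v) for v in values)
    return -math.log(total) / sharpness


# Hypothetical satisfaction levels for three preferences, each in [0, 1].
EXTREME = [1.0, 1.0, 0.0]   # two preferences maxed out, one abandoned
BALANCED = [0.6, 0.6, 0.6]  # every preference gets some traction
EQUAL_WEIGHTS = [1 / 3, 1 / 3, 1 / 3]

if __name__ == "__main__":
    for name, profile in (("extreme", EXTREME), ("balanced", BALANCED)):
        print(name, weighted_mix(profile, EQUAL_WEIGHTS), smooth_min(profile))
```

With these numbers the equal-weight mix rates the extreme profile higher, while the smooth min rates the balanced profile higher, since it is dominated by whichever preference is least satisfied.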

Now there may be some sense in which these extreme preferences are "correct", according to some formal system. But this formal system must reject the actual preferences of humans today; so I don't see why these preferences should be followed at all, even if they are correct.

Ok, so the extremes are out; how about being very close to the extremes? Here is where it gets wishy-washy. We don't have a full theory of human preferences. But, according to the picture I've sketched above, the important thing is that each preference gets some positive traction in our future. So, yes, $q = 5\%$ to $q = 95\%$ might not mean much (and smooth min might be better anyway). But I believe I could say:

  • There are many weighted combinations of human preferences that are compatible with the picture I've sketched here. Very different outcomes, from the numerical perspective of the different preferences, but all falling within an "acceptability" range.

Still a bit too handwavy. I'll try and improve it again.

comment by johnswentworth · 2019-08-02T21:50:23.487Z · score: 3 (2 votes) · LW · GW

How do you imagine standardizing the utility functions? E.g., if we multiply $U_h$ by 2, then it does just as good a job representing our happiness, but gets twice as much weight.
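The rescaling point can be made explicit with a short derivation, writing $U_p$ for the preference utility, $U_h$ for the hedonic utility, and $q$ for the weight on $U_p$. The effective weight $q'$ is a symbol introduced here for illustration.

```latex
q\,U_p + (1-q)\,(2U_h)
  \;=\; (2-q)\left[\frac{q}{2-q}\,U_p + \frac{2(1-q)}{2-q}\,U_h\right],
\qquad\text{so}\qquad
q' = \frac{q}{2-q}.
```

Since multiplying an objective by the positive constant $2-q$ does not change which outcome maximises it, doubling $U_h$ is the same as replacing $q$ with $q'$; for instance, $q = 1/2$ becomes $q' = 1/3$.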

comment by Stuart_Armstrong · 2019-08-02T22:24:51.781Z · score: 4 (2 votes) · LW · GW

https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison

Or we could come up with a normalisation method by having people rank the intensity of their preferences versus the intensity of their enjoyments. It doesn't have to be particularly good, just give non-crazy results.

comment by johnswentworth · 2019-08-02T23:15:06.214Z · score: 4 (2 votes) · LW · GW
It doesn't have to be particularly good, just give non-crazy results.

The intertheoretic utility post makes a lot more sense in that light; I had mostly dismissed it as a hack job when I first saw it. But if this is the sort of thing you're trying to do, it seems more useful. Thanks for clarifying.

comment by Charlie Steiner · 2019-08-03T04:57:14.364Z · score: 2 (1 votes) · LW · GW

And of course you can go further and have different $U_p$ that all have similarly valid claims to be $U_p$, because they're all similarly good generalizations of our behavior into a consistent function on a much larger domain.

comment by Donald Hobson (donald-hobson) · 2019-08-02T22:51:29.744Z · score: 1 (1 votes) · LW · GW

As far as I am concerned, hedonism is an approximate description of some of my preferences. Hedonism is a utility function close to, but not equal to, mine. I see no reason why an FAI should contain a special term for hedonism. Just maximize preferences; anything else is strictly worse, but not necessarily that bad.

I do agree that there are many futures we would consider valuable. Our utility function is not a single sharp spike.