# Inferring Values from Imperfect Optimizers

post by nigerweiss · 2012-12-29T22:22:14.209Z · LW · GW · Legacy · 20 comments

One approach to constructing a Friendly artificial intelligence is to create a piece of software that looks at large amounts of evidence about humans, and attempts to infer their values.  I've been doing some thinking about this problem, and I'm going to talk about some approaches and problems that have occurred to me.

In a naive approach, we might define the problem like this: take some unknown utility function U and plug it into a mathematically clean optimization process O (like AIXI).  Then look at your data set, take the information about the inputs and outputs of humans, and find the simplest U that best explains human behavior.
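
To make the setup concrete, here is a toy sketch of the naive scheme. The situations, actions, and candidate utility functions are all invented for illustration, and a real O would search an enormous hypothesis space (with a complexity penalty, to get the *simplest* U) rather than three hand-written candidates:

```python
# Toy sketch of the naive setup: a "mathematically clean" optimizer O
# simply argmaxes a candidate utility function U, and we score each
# candidate U by how much of the observed behavior O-with-U reproduces.

ACTIONS = ["save", "spend", "donate"]

# Observed (situation, chosen_action) pairs standing in for the evidence
# about the inputs and outputs of humans.
observations = [
    ("payday", "spend"),
    ("payday", "spend"),
    ("windfall", "donate"),
    ("windfall", "save"),
]

# A few hand-written candidate utility functions over (situation, action).
candidates = {
    "hedonist": lambda s, a: {"spend": 2, "save": 1, "donate": 0}[a],
    "altruist": lambda s, a: {"donate": 2, "save": 1, "spend": 0}[a],
    "saver":    lambda s, a: {"save": 2, "donate": 1, "spend": 0}[a],
}

def optimal_action(U, situation):
    """The clean optimizer O: pick the argmax action under U."""
    return max(ACTIONS, key=lambda a: U(situation, a))

def score(U):
    """Fraction of observed choices that O-with-U reproduces."""
    hits = sum(optimal_action(U, s) == a for s, a in observations)
    return hits / len(observations)

best = max(candidates, key=lambda name: score(candidates[name]))
print(best, score(candidates[best]))
```

Note that even the best candidate here only explains half the data, which previews the problem below: real human choices are not the output of any clean argmax.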

Unfortunately, this won't work.  The best possible match for U is one that models not just those elements of human utility we're interested in, but also all the details of our broken, contradictory optimization process.  The U we derive through this process will optimize for confirmation bias, scope insensitivity, hindsight bias, the halo effect, our own limited intelligence and inefficient use of evidence, and just about everything else that's wrong with us.  Not what we're looking for.

Okay, so let's try putting a bandaid on it - let's go back to our original problem setup.  However, we'll take our original O, and use all of the science on cognitive biases at our disposal to handicap it.  We'll limit its search space, saddle it with a laundry list of cognitive biases, cripple its ability to use evidence, and in general make it as human-like as we possibly can.  We could even give it akrasia by implementing hyperbolic discounting of reward.  Then we'll repeat the original process to produce U'.
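
The handicap list for producing U' would come from the bias literature; as a minimal illustration of the flavor of the idea (the particular handicaps and parameters here are invented, not drawn from any calibrated model of human cognition):

```python
import random

# Sketch of "handicapping" the optimizer: we cripple a brute-force
# maximizer in two human-like ways. It only considers a random subset of
# its options (limited search space), and it reads each option's value
# through noise (inefficient use of evidence).

random.seed(0)  # deterministic for the example

def clean_O(options, U):
    return max(options, key=U)

def handicapped_O(options, U, search_width=3, noise=0.5):
    considered = random.sample(options, min(search_width, len(options)))
    return max(considered, key=lambda o: U(o) + random.gauss(0, noise))

options = list(range(10))
U = lambda o: o  # true utility: bigger is better

print(clean_O(options, U))        # always finds the optimum, 9
print(handicapped_O(options, U))  # frequently misses the true optimum
```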

If we plug U' into our AI, the result will be that it optimizes like a human suddenly stripped of all the kinds of stupidity we programmed into our modified O.  This is good!  Plugged into a solid CEV infrastructure, this might even be good enough to produce a future that's a nice place to live.  However, it's not quite ideal.  If we miss a cognitive bias, it'll be incorporated into the learned utility function, and we may never be rid of it.  What would be nice is if we could get the AI to learn about cognitive biases exhaustively, and update in the future if it ever discovered a new one.

If we had enough time and money, we could do this the hard way: acquire a representative sample of the human population, and pay them to perform tasks with simple goals under tremendous surveillance, and have the AI derive the human optimization process from the actions taken towards a known goal.  However, if we assume that the human optimization process can be defined as a function over the state of the human brain, we should not trust the completeness of any such process learned from less data than the entropy of the human brain, which is on the order of tens of petabytes of extremely high quality evidence.  If we want to be confident in the completeness of our model, we may need more experimental evidence than it is really practical to accumulate.  Which isn't to say that this approach is useless - if we can hit close enough to the mark, then the AI may be able to run more exhaustive experimentation later and refine its own understanding of human brains to be closer to the ideal.
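
The data requirement is easy to sanity-check with a back-of-envelope calculation. Synapse counts and bits-per-synapse are rough, contested figures (the low and high guesses below are illustrative, not established values), but they bracket the order of magnitude:

```python
# Rough bounds: ~1e14 to 1e15 synapses, and somewhere between 1 and
# ~100 bits of behaviorally relevant state per synapse. These figures
# are illustrative guesses, chosen only to bracket the estimate.
for synapses, bits in [(1e14, 1), (1e15, 100)]:
    terabytes = synapses * bits / 8 / 1e12
    print(f"{synapses:.0e} synapses x {bits} bits -> {terabytes:,.1f} TB")
```

The high end lands around 12,500 TB, i.e. roughly ten petabytes, which is the regime the paragraph above gestures at; the low end is a thousand times smaller, so the honest conclusion is "somewhere between terabytes and tens of petabytes."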

But it'd really be nice if our AI could do unsupervised learning to figure out the details of human optimization.  Then we could simply dump the internet into it, let it grind away at the data, and have it spit out a detailed, complete model of human decision-making, from which our utility function could be derived.  Unfortunately, this does not seem to be a tractable problem.  It's possible that some insight could be gleaned by examining outliers with normal intelligence but deviant utility functions (I am thinking specifically of sociopaths), though it's unclear how much insight these methods can produce.  If anyone has suggestions for a more efficient way of going about it, I'd love to hear them.  As it stands, it might be possible to get enough information from this to supplement a supervised learning approach - the closer we get to a perfectly accurate model, the higher the probability of Things Going Well.

Anyways, that's where I am right now.  I just thought I'd put up my thoughts and see if some fresh eyes see anything I've been missing.

Cheers,

Niger


comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2012-12-30T13:18:44.186Z · LW(p) · GW(p)

AIXI does not take general utility functions.  AIXI can only optimize direct functions of sense data.  It cannot have utility functions over the state of worlds in which it is embedded.  This cannot be fixed without using something entirely different in place of AIXI's Solomonoff Induction.

Replies from: nigerweiss, timtyler
comment by nigerweiss · 2012-12-30T18:42:48.848Z · LW(p) · GW(p)

I believe I saw a post a while back in which Anja discussed creating a variant on AIXI with a true utility function, though I may have misunderstood it. Some of the math this stuff involves I'm still not completely comfortable with, which is something I'm trying to fix.

In any case, what you'd actually want to do is model your agents using whatever general AI architecture you're using in the first place - plus whatever set of handicaps you've calibrated into it - which presumably has a formal utility function and is an efficient optimizer.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2012-12-31T00:44:12.027Z · LW(p) · GW(p)

I could be mistaken, but I think this is a case of (unfortunately) several people using the term "utility function" for functions over sensory information instead of a direct reward channel. Dewey has a paper on why such functions don't add up to utility functions over outcomes, IIRC.

Replies from: nigerweiss
comment by nigerweiss · 2012-12-31T01:38:24.509Z · LW(p) · GW(p)

That would make sense. I assume the problem is lotus-eating - the system, given the choice between a large cost to optimize whatever you care about and a small cost to just optimize its own sense experiences, will prefer the latter.

I find this stuff extremely interesting. When we talk about value modelling, what we're really talking about is isolating some subset of the causal mechanics driving human behavior (our values) from those elements we don't consider valuable. And since we don't know whether that subset is a natural category (or how to define it if it is), we've got a choice of how much we want to remove. Asking people to make a list of their values would be the extreme sparse end of the spectrum, where we almost certainly don't model as much as we want to, and we know the features we're missing are important. On the other extreme end, we're just naively modelling the behaviors of humans and letting the models vote - which definitely captures all of our values, but also captures a bunch of extraneous stuff that we don't really want our system optimizing for. The target you're trying to hit is somewhere in the middle. It seems to me that it's probably best to err on the side of including too much rather than too little, since, if we get close enough, the optimizer will likely remove a certain amount of cruft on its own.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2012-12-31T11:54:29.966Z · LW(p) · GW(p)

> given the choice between a large cost to optimize whatever you care about, or small cost to just optimize its own sense experiences, will prefer the latter.

You built the machine to optimize its sense experiences. It is not constructed to optimize anything else. That is just what it does. Not when it's cheaper, not when it's inconvenient to do otherwise, but at all times universally.

comment by timtyler · 2012-12-30T14:52:20.496Z · LW(p) · GW(p)

Solomonoff induction is a general-purpose approach to inductive inference. It can predict anything that can be serialised. Want to predict the future state of some real world object, given some historical information relating to it? Solomonoff induction can do that for you.

comment by Emile · 2012-12-29T23:02:27.934Z · LW(p) · GW(p)

I suspect that getting a good estimate of the "human utility function" (even stripped of biases etc.) is not the hardest part of the problem. A "perfect" human, given great power and the ability to self-modify, may still produce a disaster. Human morality is mostly calibrated for dealing with others of around the same power.

Replies from: nigerweiss
comment by nigerweiss · 2012-12-29T23:08:16.759Z · LW(p) · GW(p)

Well, human values probably vary to some degree between humans, so a Friendly AI wouldn't so much be 'maximize the generic human utility function' as 'take all the human utility functions you can find as of now, find those portions which are reflexively consistent, weight them by frequency, and take those actions that are best supported by the convergent portions of those utility functions.' At least, that was the gist of CEV circa 2004. I'm not sure what Eliezer and co. are working on these days, but that sounds like a reasonable way to build a nice future to me. A fair one, at least.

comment by TrE · 2012-12-29T22:42:13.443Z · LW(p) · GW(p)

I read U² as "U-squared". This doesn't appear to be what you meant. I suggest swapping the ² for a ', giving you U'.

Replies from: nigerweiss
comment by nigerweiss · 2012-12-29T22:55:44.656Z · LW(p) · GW(p)

You're right, that is confusing. Fixed.

comment by timtyler · 2012-12-29T23:41:26.932Z · LW(p) · GW(p)

Um, surely this field is well established - and called "revealed preference theory".

Replies from: nigerweiss
comment by nigerweiss · 2012-12-29T23:51:52.223Z · LW(p) · GW(p)

Well... no, actually. That sort of economic preference induction assumes that the agents it's modelling are Homo economicus. If an agent consistently chooses x over y, it's assumed to value x more than y -- even if x is one marshmallow now, and y is two marshmallows in fifteen minutes. It's a useful abstraction for modelling how people actually behave, but that is notably not what we're interested in.

That sort of modelling was my starting point, but it's obviously totally inadequate for this sort of application. Hence why everything past the third paragraph of this post talks about dealing with imperfect optimizers.
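
The marshmallow case can be made concrete. Under hyperbolic discounting (a standard model of human impulsivity; the discount constant k below is purely illustrative), the same agent reverses its preference depending on when it is asked - exactly the kind of inconsistency that a single static revealed-preference utility function can't represent:

```python
def hyperbolic(value, delay_minutes, k=1.0):
    """Hyperbolic discounting: value / (1 + k * delay)."""
    return value / (1.0 + k * delay_minutes)

# Asked right now: one marshmallow immediately vs. two in 15 minutes.
print(hyperbolic(1, 0) > hyperbolic(2, 15))        # True - grabs the one

# Asked a day in advance about the same choice, the preference flips.
print(hyperbolic(1, 1440) > hyperbolic(2, 1455))   # False - waits for two
```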

Replies from: lukeprog, timtyler
comment by lukeprog · 2012-12-30T04:53:03.525Z · LW(p) · GW(p)

> That sort of economic preference induction assumes that the agents it's modelling are Homo economicus.

For a quick summary of that paper, see the 3rd-to-last paragraph of The Singularity and Machine Ethics, which is really something you should read if you're interested in FAI and inferring preferences. Likewise Learning What to Value.

Replies from: nigerweiss
comment by nigerweiss · 2012-12-30T06:27:41.257Z · LW(p) · GW(p)

Thank you, that was interesting reading. If I'm not mistaken, though, the Nielsen-Jensen paper is talking about how to make value inference more robust in the presence of contradictory behavior. It doesn't seem to me that this sort of procedure will reliably isolate the values we're interested in from the limitations of human rationality.

The idea (page sixteen of your second citation) of extracting a human utility function by eliminating contradictory or inconsistent features of your model of human behavior-in-general is interesting, but I have some reservations about it. There are numerous studies floating around suggesting that human moral intuition can be contradictory or incoherent, and I'd prefer not to throw the baby out with the bathwater if that's the case.

comment by timtyler · 2012-12-30T03:31:17.585Z · LW(p) · GW(p)

Um, "revealed preference theory" applies to imperfect optimizers just fine.

Replies from: nigerweiss, MugaSofer
comment by nigerweiss · 2012-12-30T05:36:06.562Z · LW(p) · GW(p)

Does it have a canonical method of isolating the utility function from the details of the optimization process?

Replies from: timtyler
comment by timtyler · 2012-12-30T13:11:22.834Z · LW(p) · GW(p)

I am not sure what you mean. Essentially, it tries to find a utility maximiser that behaves in the same way. There are multiple ways of doing this - not one canonical method.
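
One of those ways can be shown in a few lines. Under a logit (Bradley-Terry) choice model - one assumption among the "multiple ways", not anything specified in the comment above, and with made-up counts - the probability of choosing x over y is a sigmoid of u_x - u_y, so observed choice frequencies pin down utility differences directly:

```python
import math

# Choice counts: x was chosen over y 80 times out of 100 pairings.
choices = {("x", "y"): (80, 20)}

for (x, y), (n_x, n_y) in choices.items():
    # Inverting p = 1 / (1 + exp(-(u_x - u_y))) gives the log-odds:
    diff = math.log(n_x / n_y)
    print(f"u_{x} - u_{y} = {diff:.2f}")
```

Note that this is exactly the sense in which the technique "applies" to imperfect optimizers: it always produces *some* utility-based model, whether or not the inconsistent choices reflect genuine values.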

comment by MugaSofer · 2012-12-31T20:14:46.069Z · LW(p) · GW(p)

No, it doesn't. It treats actual preferences and mistakes made while trying to implement them the same way.

Replies from: timtyler
comment by timtyler · 2013-01-01T13:58:31.803Z · LW(p) · GW(p)

You're just using a different standard from mine to assess the method. By saying it "applies", I mean that you can feed these techniques imperfect optimizers and they will construct utility-based models of their actions. Revealed preference theory can be successfully applied to imperfect optimizers. A good job too, since all known optimisers are imperfect.

Maybe those models don't have the properties you are looking for - but that doesn't represent a point of disagreement between us.

Replies from: MugaSofer
comment by MugaSofer · 2013-01-01T14:23:03.143Z · LW(p) · GW(p)

Oh, I know it will construct a utility-based optimizer perfectly well. But considering that it won't actually determine their preferences, that's rather useless for most practical purposes - such as the comment you replied to.

We don't seem to disagree on facts, though, so I don't think this conversation is going to go anywhere.