Value Loading

post by ryjm · 2012-10-23T04:47:07.919Z


This article was originally on the FHI wiki and is being reposted to LW Discussion with permission. All content in this article is credited to Daniel Dewey.

In value loading, the agent will pick the action:

argmax_{a ∈ A} Σ_{w ∈ W} p(w|e,a) Σ_{u ∈ U} u(w) p(C(u)|w)

Here A is the set of actions the agent can take, e is the evidence the agent has already seen, W is the set of possible worlds, and U is the set of utility functions the agent is considering.

The parameter C(u) is some measure of the 'correctness' of the utility function u, so the term p(C(u)|w) is the probability that u is correct, given that the agent is in world w. A simple example is an AI that completely trusts its programmers: if u is a utility function claiming that giving cake is better than giving death, w1 is a world where the programmers have said "cake is better than death", and w2 is a world where they have said the opposite, then p(C(u)|w1) = 1 and p(C(u)|w2) = 0.
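A minimal sketch of how the expression above can be evaluated on the cake-or-death example. Everything here (the world representation, the 0.9/0.1 evidence probabilities, the function names) is an illustrative assumption, not part of the original article:

```python
# Toy value-loading calculation: pick the action maximising
#   sum over w of p(w|e,a) * sum over u of u(w) * p(C(u)|w)
# All worlds, actions and numbers below are illustrative assumptions.

actions = ["serve_cake", "serve_death"]

# A "world" here bundles what the programmers said with what the agent served.
worlds = [
    (statement, served)
    for statement in ("said_cake_better", "said_death_better")
    for served in ("serve_cake", "serve_death")
]

# Two candidate utility functions the agent is uncertain between.
def u_cake(world):
    return 1.0 if world[1] == "serve_cake" else 0.0

def u_death(world):
    return 1.0 if world[1] == "serve_death" else 0.0

utilities = [u_cake, u_death]

def p_correct(u, world):
    """p(C(u)|w): the agent completely trusts the programmers' statement."""
    statement = world[0]
    if u is u_cake:
        return 1.0 if statement == "said_cake_better" else 0.0
    return 1.0 if statement == "said_death_better" else 0.0

def p_world(world, action):
    """p(w|e,a): evidence says the programmers almost certainly endorsed cake."""
    statement, served = world
    if served != action:          # the chosen action fixes what gets served
        return 0.0
    return 0.9 if statement == "said_cake_better" else 0.1

def value(action):
    return sum(
        p_world(w, action) * sum(u(w) * p_correct(u, w) for u in utilities)
        for w in worlds
    )

print({a: round(value(a), 2) for a in actions})   # {'serve_cake': 0.9, 'serve_death': 0.1}
print("chosen action:", max(actions, key=value))  # serve_cake
```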

There are several challenging things in this formula:

W : How to define/represent the class of all worlds under consideration

U : How to represent the class of all utility functions over such worlds

C : What do we assert about the utility function: that it is true? That it is believed by humans?

p(C(u)|w) : How to define this probability

Σ_{u ∈ U} u(w) p(C(u)|w) : How to sum up utility functions (a moral uncertainty problem)

In contrast, the term

p(w|e,a)

is mostly the classic AI problem. It is hard to predict what the world is like from evidence, but this is a well-known and well-studied problem, not unique to the present research. There is one subtlety here: the nature of w includes the future actions of the agent, which will depend on how good future states look to it. This recursive definition eventually bottoms out, as in a game of chess (where what happens when I make a move depends on what moves I make after that). It may cause an additional exponential explosion in calculating out the formula, though, so the agent may need to make probabilistic guesses about its own future behaviour in order to actually calculate an action.
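A small sketch of that recursion, under assumed interfaces (`actions`, `transition` and `utility` are hypothetical callables supplied by the caller): the value of acting now depends on the agent's own best future choices, and the recursion bottoms out at a fixed horizon, like looking a bounded number of moves ahead in chess.

```python
# Illustrative expectimax-style recursion: the value of a state depends on the
# agent's own best future actions, bottoming out at a finite horizon.

def expected_value(state, horizon, actions, transition, utility):
    """actions(state) -> iterable of actions;
    transition(state, action) -> iterable of (probability, next_state) pairs;
    utility(state) -> float."""
    if horizon == 0:
        return utility(state)
    return max(
        sum(p * expected_value(s2, horizon - 1, actions, transition, utility)
            for p, s2 in transition(state, a))
        for a in actions(state)
    )
```

In practice the exponential blow-up mentioned above means the agent would have to approximate this, for instance by sampling its own likely future actions rather than enumerating them.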

This value loading equation is not subject to the classical Cake or Death problem, but is vulnerable to the more advanced version of the problem, if the agent is able to change the expected future value of p(C(u)) through its actions.

Daniel Dewey's Paper

The above idea was partially inspired by a draft of Learning What to Value, a paper by Daniel Dewey. He restricted attention to streams of interactions, and his equation, in a simplified form, is:

argmax_{a ∈ A} Σ_{s ∈ S} p(s|e,a) Σ_{u ∈ U} u(s) p(C(u)|s)

where S is the set of all possible streams of all past and future observations and actions.
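The shape is the same as the earlier formula; only the objects summed over change. A minimal sketch of that difference (the function names and arguments here are assumptions for illustration):

```python
# Stream-based variant (illustrative): same shape as the world-based formula,
# but the outer sum runs over whole interaction histories s rather than worlds w,
# and the candidate utility functions are defined on those histories.

def value_of_action(action, streams, p_stream, utilities, p_correct):
    """sum over s of p(s|e,a) * sum over u of u(s) * p(C(u)|s)."""
    return sum(
        p_stream(s, action) * sum(u(s) * p_correct(u, s) for u in utilities)
        for s in streams
    )
```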

11 comments


comment by wgd · 2012-10-23T04:57:14.880Z

Maybe "value loading" is a term most people here can be expected to know, but I feel like this post would really be improved by ~1 paragraph of introduction explaining what's being accomplished and what the motivation is.

As it is, even the text parts make me feel like I'm trying to decipher an extremely information-dense equation.

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2012-10-23T11:22:33.854Z

Maybe "value loading" is a term most people here can be expected to know

It's the first time I've seen the term, and only the second time it has appeared at all on LessWrong.

It may be more current among "people who are on every mailing list, read every LW post, or are in the Bay Area and have regular conversations with [the SI]" (from its original mention on LW).

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2012-10-23T14:16:42.729Z

It's more an FHI term than an SI/LessWrong term.

It's often called "indirect normativity": a strategy in which, instead of directly encoding the goal for an AI (or moral agent), we specify a certain way of "learning what to value/inferring human values", so that the AI can then deduce human values (and then implement them).

Replies from: Manfred
comment by Manfred · 2012-10-23T20:28:31.100Z

Ah, so it means the same thing as "value learning?" For some reason when I read "value loading" I thought of, like, overloading a function :D "I want cake, and that desire is also a carnal lust for BEES!"

Replies from: DaFranker
comment by DaFranker · 2012-10-23T20:57:32.704Z

What helped me was thinking of it in terms of: "Oh, like 'reading' human preferences as if they were an XML config file that the program loads at runtime."

comment by Steve_Rayhawk · 2012-10-23T12:32:22.471Z

See also "Acting Rationally with Incomplete Utility Information" by Urszula Chajewska, 2002.

comment by Manfred · 2012-10-23T07:13:24.558Z

Could you define the "Cake or Death problem" and give an example of a decision-making system that falls prey to it?

First nitpick: Since the sum on i (i just being some number I'm using to number utility functions) of u_i(w)·p(C(u_i)|w) is a function only dependent on w, it's really just a complicatedly-written utility function. I think you want u_i(w)·p(C(u_i)|w, e) - that would allow the agent to gain some sort of evidence about its utility function. Also, since C(u_i) is presumably supposed to represent a fixed logical thingamabob, to be super-precise we could talk about some logical uncertainty measure over whether the utility function is correct, M(u_i, w, e), rather than a probability - but I think we don't have to care about that.

Second nitpick: To see what happens, let's assume our agent has figured out its utility function - it now picks the action with the largest sum on w of p(w|e, a)·u(w), where "w" is a world describing present, past and future, and u(w) is its one true utility function. This happens to look a lot like an evidential decision theory (EDT) agent, which runs into known problems. For example, if there were a disease that had low utility but made you unable to punch yourself in the face, this fact would make an EDT agent want to punch itself in the face in order to increase the probability that it didn't have the disease.
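(A tiny numerical sketch of that last point; all the probabilities and utilities below are made up for illustration:)

```python
# Made-up numbers showing the EDT failure mode: conditioning on "I punched
# myself" is evidence against having the disease, so the EDT expected utility
# of punching comes out higher even though punching causes nothing good.

u_disease = -100.0   # disutility of having the disease
u_punch = -1.0       # small disutility of punching yourself

p_disease_given_punch = 0.0      # the disease makes punching impossible
p_disease_given_no_punch = 0.6   # not punching is weak evidence of disease

eu_punch = u_punch + p_disease_given_punch * u_disease      # -1.0
eu_no_punch = 0.0 + p_disease_given_no_punch * u_disease    # -60.0

print("EDT prefers punching:", eu_punch > eu_no_punch)      # True
```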

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2012-10-23T11:22:57.434Z

I'll post the "cake or death" problem in a post soon.

Replies from: David_Gerard
comment by David_Gerard · 2012-10-23T12:47:05.503Z

This one?

(Remember: always give your esoteric philosophical conundra good names.)

Replies from: Manfred, Stuart_Armstrong
comment by Manfred · 2012-10-23T20:23:46.371Z

Oh, okay, thanks. So, shallowly speaking, you just needed to multiply the utilities of the strategies "don't ask and pick cake" and "don't ask and pick death" by 0.5.

comment by Stuart_Armstrong · 2012-10-23T13:52:35.695Z

Yep! :-)