An algorithm with preferences: from zero to one variable

post by Stuart_Armstrong · 2016-11-15T15:14:01.313Z · 2 comments

A simple way of thinking that I feel clarifies a lot of issues (related to Blue-Minimizing Robot):

Suppose you have an entity H that follows the algorithm alH. Then define:

- H is, at least in part, the algorithm alH.
- H does whatever alH outputs.
- H wants u if alH can be usefully interpreted as maximising the utility function u.

The interpretation part of wants is crucial, but it is often obscured in practice in value learning. That's because we often start with assumptions like "H is a boundedly rational agent that maximises u...", or we set the agent up in such a way that this is clearly the case.

What we're doing there is writing the entity as alH(u): an algorithm with a special variable u that tracks what the entity wants. In the case of cooperative inverse reinforcement learning, this is explicit, as the human's values are given by a parameter θ known to the human. Thus the human's true algorithm is the one-variable alH(.); the human observes θ, which is an objective fact about the universe, and then follows alH(θ).

Note here that knowing what the human is in the one-variable sense (i.e. knowing alH(.)) allows a correct deduction about what they want, while simply knowing the composite zero-variable algorithm alH(θ) does not.
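
To make this concrete, here is a toy sketch in Python; the function names, the two-state/two-action setup and the payoff tables are all invented for illustration rather than taken from CIRL. The one-variable algorithm takes θ as an explicit argument, fixing θ gives the zero-variable algorithm the human actually runs, and an observer who knows the one-variable form can recover θ from behaviour. An observer handed only the input-output behaviour of alH(θ), with no one-variable template, has no θ to solve for.

```python
# A toy sketch (hypothetical names and payoffs, not CIRL itself):
# al_H_one_var is the one-variable algorithm al_H(.), taking the value
# parameter theta as an explicit argument; fixing theta gives the
# zero-variable algorithm al_H(theta) that the human actually runs.

ACTIONS = ("left", "right")

def al_H_one_var(theta, state):
    # act to maximise the payoff theta assigns to each action in this state
    return max(ACTIONS, key=lambda a: theta[a][state])

# The human observes theta (here just written down as a payoff table)...
true_theta = {"left": {0: 1.0, 1: 0.0}, "right": {0: 0.0, 1: 1.0}}

def al_H(state):
    # ...and then follows the zero-variable algorithm al_H(theta)
    return al_H_one_var(true_theta, state)

# An observer who knows the one-variable form al_H(.) can recover theta
# from behaviour, by keeping only the candidates that reproduce every action.
observed = {s: al_H(s) for s in (0, 1)}
candidates = [true_theta,
              {"left": {0: 0.0, 1: 1.0}, "right": {0: 1.0, 1: 0.0}}]
recovered = [t for t in candidates
             if all(al_H_one_var(t, s) == observed[s] for s in observed)]
print(recovered == [true_theta])  # True: only the theta the human saw fits
```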

In contrast, an interpretation starts with a zero-variable algorithm and attempts to construct a one-variable one. Therefore, given alH, it constructs (one or more) alHi(.) and ui such that

alHi(ui) = alH

(i.e. the one-variable algorithm, with ui plugged in, reproduces alH).

This illustrates the crucial role of interpretation, especially if alH is highly complex.
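
As a toy illustration of this construction (again with invented names, utilities and a deliberately tiny state space), we can treat alH as a black box and search over candidate one-variable templates and candidate utilities for pairs that reproduce its behaviour on every state:

```python
# A toy sketch of interpretation (all names and utilities invented for
# illustration): al_H is the zero-variable algorithm, treated as a black box.
# We search over candidate one-variable templates al_H^i(.) and candidate
# utilities u_i for pairs that reproduce al_H's behaviour on every state,
# i.e. pairs with al_H^i(u_i) = al_H.

STATES = (0, 1)
ACTIONS = ("left", "right")

def al_H(state):
    # the zero-variable algorithm we are trying to interpret
    return "left" if state == 0 else "right"

def maximiser(u):
    # candidate template: a one-variable algorithm that picks the u-best action
    return lambda state: max(ACTIONS, key=lambda a: u(a, state))

def minimiser(u):
    # candidate template: a one-variable algorithm that picks the u-worst action
    return lambda state: min(ACTIONS, key=lambda a: u(a, state))

candidate_utilities = {
    "matching the state": lambda a, s: 1.0 if (a == "left") == (s == 0) else 0.0,
    "mismatching the state": lambda a, s: 0.0 if (a == "left") == (s == 0) else 1.0,
    "always going right": lambda a, s: 1.0 if a == "right" else 0.0,
}

for template_name, template in (("maximiser", maximiser), ("minimiser", minimiser)):
    for u_name, u in candidate_utilities.items():
        if all(template(u)(s) == al_H(s) for s in STATES):
            print(f"al_H can be read as a {template_name} of '{u_name}'")
```

Even in this tiny example, two different (alHi, ui) pairs reproduce alH exactly, which is the sense in which the construction can yield "one or more" interpretations; choosing between them is where the interpretive work lies.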

 

2 comments


comment by entirelyuseless · 2016-11-15T15:56:04.307Z

If H is an entity, it is not an algorithm. The algorithm is just one aspect of the thing. You can also see that in terms of what it does; any entity will affect its environment in many ways that have nothing to do with any algorithm which it has.

comment by Stuart_Armstrong · 2016-11-15T16:56:32.816Z

Yeah. I'm not too fussed about the definitions of doing and being (you can restrict down to the entity itself if you want). It's the "wanting" that I'm focusing on here.