An algorithm with preferences: from zero to one variable

post by Stuart_Armstrong · 2017-06-02T16:35:55.000Z

A putative new idea for AI control; index here.

A simple way of thinking that I feel clarifies a lot of issues (related to the Blue-Minimising Robot):

Suppose you have an entity $H$ that follows algorithm $A_H$. Then define:

- "What $H$ wants" = "what $H$ can be interpreted as wanting, given that it follows $A_H$".

The interpretation part of wants is crucial, but it is often obscured in practice in value learning. That's because we often start with things like "$H$ is a boundedly rational agent that maximises $R$...", or we lay out the agent in such a way that that's clearly the case.

What we're doing there is writing the entity as $A(R)$ --- an algorithm $A(\cdot)$ with a special variable $R$ that tracks what the entity wants. In the case of cooperative inverse reinforcement learning, this is explicit, as the human's values are given by a reward function $R$, known to the human. Thus the human's true algorithm is $A(\cdot)$, the human observes $R$, meaning that $R$ is an objective fact about the universe. And then the human follows $A(R)$.
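To make the one-variable picture concrete, here is a minimal sketch of my own (not taken from the CIRL paper): a boundedly rational softmax agent $A(\cdot)$ whose single variable is a reward function $R$. The action set, the reward values and the rationality parameter are all invented for illustration.

```python
import math
import random

ACTIONS = ["left", "right", "stay"]

def A(reward, beta=2.0):
    """The one-variable algorithm A(.): a boundedly rational agent that
    picks each action with probability proportional to exp(beta * R(action))."""
    weights = [math.exp(beta * reward(a)) for a in ACTIONS]
    probs = [w / sum(weights) for w in weights]
    return random.choices(ACTIONS, weights=probs, k=1)[0]

def R(action):
    """The special variable R: the human's reward, known to the human."""
    return {"left": 0.0, "right": 1.0, "stay": 0.2}[action]

def A_H():
    """The zero-variable algorithm: the human simply follows A(R)."""
    return A(R)

print(A_H())
```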

Note here that knowing what the human is in the one-variable sense (i.e. knowing $A(\cdot)$) helps with the correct deduction about what they want - while simply knowing the joint algorithm $A(R)$ does not.

In contrast, an interpretation starts with a zero-variable algorithm, and attempts to construct a one-variable one. Therefore, given $A_H$, it constructs (one or more) $A(\cdot)$ and $R$ such that

$$A_H = A(R).$$

This illustrates the crucial role of interpretation, especially if the entity's algorithm $A_H$ is highly complex.
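For instance, here is a toy sketch, my own example echoing the blue-minimising robot rather than anything from the post itself, of how one zero-variable algorithm $A_H$ admits more than one $(A(\cdot), R)$ decomposition, so the interpretation genuinely adds something:

```python
# The zero-variable algorithm A_H: just observed behaviour, no explicit "wants".
def A_H(observation):
    return "stay" if observation == "blue" else "move"

# Interpretation 1: a maximiser A1(.) with R1 = "blue is rewarding".
def A1(reward):
    def policy(observation):
        return "stay" if reward(observation) > 0 else "move"
    return policy

def R1(obs):
    return 1.0 if obs == "blue" else 0.0

# Interpretation 2: a different maximiser A2(.) with R2 = "non-blue is costly".
def A2(reward):
    def policy(observation):
        return "move" if reward(observation) < 0 else "stay"
    return policy

def R2(obs):
    return -1.0 if obs != "blue" else 0.0

# Both (A1, R1) and (A2, R2) reproduce A_H exactly, yet attribute different wants.
for obs in ["blue", "red", "green"]:
    assert A_H(obs) == A1(R1)(obs) == A2(R2)(obs)
```

Nothing in $A_H$ itself picks out which pair is "really" what the entity wants; that choice is made by the interpretation.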
