Reward function learning: the value function
post by Stuart_Armstrong · 2018-04-24T16:29:32.971Z
0 The main points
1 Framework and Examples
  1.1 Washing and Cooking and POMDPs
  1.2 Rewards and reward functions
  1.3 Learning your reward
2 The value function
  2.1 The correct value function
  2.2 Future-regarding inconsistency
  2.3 Change-averse inconsistency
  2.4 The volatile learning agent
I've written quite a few posts about the problems with agents learning values/rewards, and manipulating the learning process. I won't be linking to those past posts, because the aim of this post and the next one is to present a complete, clear, and easy(er) to understand overview of the whole situation. They should stand alone, and serve as an introduction to the subject. This first post will present the framework, and the value function for learnt reward functions; the next one will look at the properties of the learning function. I'll be using variants on a single running example throughout, to try and increase understanding.
Feel free to comment on ways the ideas could be made clearer.
0 The main points
The central insight of these posts may seem trivial by now: a learning process that the agent can influence does not have the same properties as one that it can't. Moving from "if you don't know what is right, look it up on this read-only list" to "if you don't know what is right, look it up on this read-write list (or ask the programmer)" is a huge change, especially since read-only lists can easily become read-write in practice when the agent becomes more powerful.
Why write long posts and papers about this idea, though? First of all, because it is easy, in practice, not to notice that shift. And secondly, because it is easy to define a "learning" process that doesn't behave like we'd expect.
So defining the value function and learning function of an ideal learning process allows us to know when we are facing an ideal uninfluenceable, unmanipulable learning process, and when we are not.
Furthermore, many learning processes cannot easily be made unmanipulable, especially those that involve human feedback, which is conditional on the agent's actions. Specifying the ideal case, and seeing examples of what goes wrong in the non-ideal case, can help develop learning processes where a small amount of manipulation is allowed, traded off against a large amount of desirable learning.
As a minor example of this, in the next post, we'll see that though "uninfluenceable" is the ideal property, the weaker property of "unriggable" (which I previously called "unbiasable") is enough to get many of the desirable properties.
1 Framework and Examples
1.1 Washing and Cooking and POMDPs
The analysis will be illustrated by an ongoing example: that of a robot purchased to do domestic tasks.
The robot can specialise in cooking or washing (assume specialised robots are ultimately more effective than generalists). The robot has been turned on, but it has not yet been informed as to what its task is - it therefore has uncertainty about its reward function.
This robot will be modelled in the MDP (Markov Decision Process) and POMDP (Partially Observable Markov Decision Process) formalisms. In these formalisms, the agent occupies a state $s$ in the state space $\mathcal{S}$. In the POMDP formalism, the agent doesn't observe the state directly; instead it sees an observation $o$ drawn from the observation set $\mathcal{O}$. The agent can then take an action $a$ from the action set $\mathcal{A}$. This transfers the agent to a new state, where it makes a new observation.
The Markov property is about how the transitions and observations are handled; basically these can depend on the previous state and/or action, but not on those further in the past. If we define $\Delta(X)$ as the set of probability distributions over a set $X$, then we have three transition rules:
$$T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S}), \qquad O : \mathcal{S} \to \Delta(\mathcal{O}), \qquad T_0 \in \Delta(\mathcal{S}).$$
Here, $T$ takes the current state and the action and returns a distribution over the next state, $O$ takes the current state and returns a distribution over observations, and $T_0$ is the distribution over the initial state where the agent begins. Both $T$ and $O$ ignore any previous information.
The whole POMDP will be called the environment, and designated by $\mu$.
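As a rough illustration of the three rules above, here is a minimal Python sketch of a finite POMDP. The names `POMDP`, `transition`, `observe`, `initial`, `sample` and `rollout` are illustrative choices rather than part of the formalism, and distributions are assumed to be finite dictionaries mapping outcomes to probabilities.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

Dist = Dict[Hashable, float]  # finite distribution: outcome -> probability


def sample(dist: Dist, rng: random.Random) -> Hashable:
    """Draw one outcome from a finite distribution."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


@dataclass
class POMDP:
    transition: Callable[[Hashable, Hashable], Dist]  # (state, action) -> next-state distribution
    observe: Callable[[Hashable], Dist]               # state -> observation distribution
    initial: Dist                                     # distribution over the starting state


def rollout(env: POMDP, policy, steps: int, rng: random.Random) -> List[Tuple]:
    """Run the agent: pick an action, transition, then observe the new state."""
    state = sample(env.initial, rng)
    history: List[Tuple] = []
    for _ in range(steps):
        action = policy(tuple(history))  # the agent sees only past actions and observations
        state = sample(env.transition(state, action), rng)
        obs = sample(env.observe(state), rng)
        history.append((action, obs))
    return history


# Toy (assumed) environment: two states that flip on every action, fully observed.
toy = POMDP(
    transition=lambda s, a: {1 - s: 1.0},
    observe=lambda s: {s: 1.0},
    initial={0: 1.0},
)
print(rollout(toy, lambda history: "flip", 3, random.Random(0)))
```

Note that the transition and observation functions only ever receive the current state (and action), which is the Markov property in code form; the policy, by contrast, is handed the whole history.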
For our example, the robot will be evolving in the following environment:
In this grid world, there are six squares the robot could occupy (the reason for this shape will be made clear later). There are two pizzas to cook in the top square, and one mud splatter to wash in the bottom one. The state space is therefore of size $6 \times 3 \times 2 = 36$, since there are $6$ squares the robot could occupy, $3$ levels of uncooked pizza ($0$, $1$, or $2$ uncooked pizzas), and $2$ levels of mud splatter ($0$ or $1$ splatters). There is also a $37$-th state, the 'episode ends' state.
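The count can be checked by brute enumeration. This is only a small sketch; the tuple encoding of states and the `END` marker are assumptions made for illustration.

```python
from itertools import product

END = "END"  # hypothetical marker for the 'episode ends' state

# Each live state is (square index, uncooked pizzas, mud splatters).
live_states = list(product(range(6), range(3), range(2)))
all_states = live_states + [END]

print(len(live_states), len(all_states))  # 36 37
```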
In this case, there is no hidden information, so the set of states is the same as the set of observations, and $O$ is trivial. The robot always starts in the central position, and there are always $2$ uncooked pizzas and $1$ mud splatter; this defines the starting state, with $T_0$ being trivial and simply returning that starting state with certainty.
The robot has four actions, $\mathcal{A} = \{N, E, S, W\}$, which involve moving in the four directions (staying put is not an action we'll need). If the robot can move into a square, it will. If it tries to move into a wall, it turns off and the episode ends (this avoids us having to give the robot an extra action to end the episode). If it is in a room with a mud splatter or an uncooked pizza, then all pizzas in that room get cooked, or all mud splatters get washed. If the episode has ended, then it stays ended. This defines the transition function $T$, which is deterministic in this instance.
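The rules above can be sketched as a deterministic step function. The layout itself is given by the figure, so the set of squares used here (`HYPOTHETICAL_SQUARES`) and the coordinates of the pizza, mud and start squares are placeholders rather than the actual grid. The state encoding, the direction names, and the choice to cook or wash on arrival in a square are also assumptions.

```python
END = "END"  # hypothetical marker for the 'episode ends' state

# Movement offsets for the four directions (x grows east, y grows north).
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

# Placeholder layout with six squares; NOT the layout from the post's figure.
HYPOTHETICAL_SQUARES = frozenset({(0, 1), (0, 0), (0, -1), (-1, 0), (1, 0), (1, 1)})
PIZZA_SQUARE = (0, 1)   # assumed "top" square holding the pizzas
MUD_SQUARE = (0, -1)    # assumed "bottom" square holding the mud splatter
START = ((0, 0), 2, 1)  # (position, uncooked pizzas, mud splatters)


def step(state, action, squares=HYPOTHETICAL_SQUARES):
    """Deterministic transition: move, end the episode on a wall, cook or wash on arrival."""
    if state == END:
        return END  # an ended episode stays ended
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    new_pos = (x + dx, y + dy)
    if new_pos not in squares:
        return END  # moving into a wall turns the robot off
    if new_pos == PIZZA_SQUARE:
        pizzas = 0  # all pizzas in that square get cooked
    if new_pos == MUD_SQUARE:
        mud = 0     # all mud splatters in that square get washed
    return (new_pos, pizzas, mud)


after_cooking = step(START, "N")
print(after_cooking)             # ((0, 1), 0, 1)
print(step(after_cooking, "N"))  # END
```

Because every (state, action) pair maps to exactly one next state, this corresponds to a transition distribution that puts probability 1 on a single outcome.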
1.2 Rewards and reward functions
Now we need to define the possible rewards of the robot. There are reward functions that map the robot's history to a numerical reward.
So what is a history? That is just a sequence of actions and observations. The agent starts in initial state $s_0$, picks action $a_0$, then transitions (via $T$) to state $s_1$, and makes observation $o_1$ (via $O$). It then picks action $a_1$, and so on.
A history $h_m$ of length $m$ is a sequence of $m$ actions and observations:
$$h_m = a_0 o_1 a_1 o_2 \ldots a_{m-1} o_m.$$
Let $\mathcal{H}_m$ be the set of all histories of length $m$. For our purposes, we'll assume that the agent only operates for a finite number of steps: let $n$ be the maximal number of steps the agent operates for, let $\mathcal{H}_n$ be the set of complete (full-length) histories, and let $\mathcal{H} = \bigcup_{m \le n} \mathcal{H}_m$ be the set of all histories. We might want to focus on the initial $m$ steps of any given history $h_n$; designate this by $(h_n)_m$ for any $m \le n$.
Then a reward function $R$ is something that maps each history to a numerical value, the reward. Let $\mathcal{R}$ be the set of all relevant reward functions. If the agent had a single clear reward $R$, and followed history $h_n$, then it would get total reward:
$$\sum_{m=1}^{n} R\big((h_n)_m\big).$$
This is the total reward accumulated by reward function $R$ over the course of history $h_n$; first applying it to the first action and observation, then to the first two actions and observations, then to the first three... all the way to the full final history $h_n$.
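The total reward is a sum over prefixes of the history, which is short to write down in code. In this sketch a history is assumed to be a tuple of (action, observation) pairs; `total_reward`, `cooked_now` and the observation strings are illustrative names, not part of the example environment.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str]  # (action, observation)


def total_reward(reward: Callable[[Sequence[Step]], float],
                 history: Sequence[Step]) -> float:
    """Sum the reward function over every non-empty prefix of the history."""
    return sum(reward(history[:m]) for m in range(1, len(history) + 1))


# Toy reward: 1 whenever the latest observation reports a cooked pizza.
def cooked_now(prefix: Sequence[Step]) -> float:
    return 1.0 if prefix[-1][1] == "cooked" else 0.0


history = (("N", "cooked"), ("S", "nothing"), ("S", "washed"))
print(total_reward(cooked_now, history))  # 1.0
print(total_reward(len, history))         # 6, i.e. 1 + 2 + 3
```

The second call uses `len` as a reward function purely to make the prefix structure visible: the prefixes have lengths 1, 2 and 3, so the total is 6.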
In our ongoing example, we will consider two reward functions, . The reward function