Intuitive examples of reward function learning?

post by Stuart_Armstrong · 2018-03-06T16:54:18.466Z · LW · GW · 3 comments

Can you help find the most intuitive example of reward function learning?

In reward function learning, there is a set of possible non-negative reward functions, $\mathcal{R}$, and a learning process $\rho$ which takes in a history of actions and observations and returns a probability distribution over $\mathcal{R}$.

If $\pi$ is a policy, $\mathcal{H}_n$ is the set of histories of length $n$, and $P_\pi(h_n)$ is the probability of $h_n \in \mathcal{H}_n$ given that the agent follows policy $\pi$, the expected value of $\pi$ at horizon $n$ is:

$$V(\pi, n) = \sum_{h_n \in \mathcal{H}_n} P_\pi(h_n) \sum_{R \in \mathcal{R}} \rho(R \mid h_n)\, R(h_n)$$

where $R(h_n)$ is the total $R$-reward over the history $h_n$. Problems can occur if $\rho$ is riggable (this used to be called "biasable", but that term was over-overloaded), or influenceable.
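As a rough illustration of the expected-value sum described above, here is a minimal Python sketch. Everything in it is a made-up toy, not part of the post: the two-action world, the reward functions `likes_a` and `likes_b`, the memoryless policy, and the `learning_process` function are all hypothetical. The learning process is deliberately made to depend on the agent's own first action, only to show how the policy can end up affecting which reward gets learnt.

```python
from itertools import product

# Hypothetical toy world: a history is just a tuple of n actions.
ACTIONS = ("a", "b")

# Two candidate non-negative reward functions; each returns the
# total reward accumulated over a whole history.
REWARDS = {
    "likes_a": lambda h: float(h.count("a")),
    "likes_b": lambda h: float(h.count("b")),
}


def history_prob(policy, h):
    """Probability of history h under a memoryless stochastic policy.

    `policy` maps each action to the probability of choosing it
    (an assumption made to keep the toy small).
    """
    p = 1.0
    for action in h:
        p *= policy[action]
    return p


def learning_process(h):
    """Hypothetical learning process: a distribution over REWARDS given h.

    Here the agent's first action shifts which reward is believed correct.
    """
    if h[0] == "a":
        return {"likes_a": 0.9, "likes_b": 0.1}
    return {"likes_a": 0.1, "likes_b": 0.9}


def expected_value(policy, n):
    """Sum over length-n histories of P(h) * sum_R rho(R | h) * R(h)."""
    total = 0.0
    for h in product(ACTIONS, repeat=n):
        rho = learning_process(h)
        learnt_reward = sum(rho[name] * REWARDS[name](h) for name in REWARDS)
        total += history_prob(policy, h) * learnt_reward
    return total


if __name__ == "__main__":
    print(expected_value({"a": 1.0, "b": 0.0}, 2))  # always choose "a"
    print(expected_value({"a": 0.5, "b": 0.5}, 2))  # choose uniformly
```

In this toy, always choosing "a" scores roughly 1.8 at horizon 2 while the uniform policy scores roughly 1.4: committing to one action both steers the learning process and collects the reward it steers towards.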

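There's an interesting subset of value learning problems, which could be termed "constrained optimisation with variable constraints" or "variable constraints optimisation". In that case, there is an overall reward $R$, and every $R_i \in \mathcal{R}$ is the reward $R$ subject to constraints $C_i$. This can be modelled as having $R_i(h_n)$ being $R(h_n)$ (if the constraints are met) and $0$ (if they are not).

Then if we define $C_i(h_n)$ to be $1$ when $h_n$ meets the constraints $C_i$ and $0$ otherwise, so that $R_i(h_n) = C_i(h_n) R(h_n)$, and let $\rho$ be a distribution over $\mathcal{C}$, the set of constraints, the equation changes to:

$$V(\pi, n) = \sum_{h_n \in \mathcal{H}_n} P_\pi(h_n)\, R(h_n) \sum_{C_i \in \mathcal{C}} \rho(C_i \mid h_n)\, C_i(h_n)$$

If $\rho$ is riggable or influenceable, similar sorts of problems occur.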
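The variable constraints version can be sketched in the same way. The Python below is a toy under stated assumptions, not the post's own example: the actions `spend` and `lobby`, the two budget constraints, and the `constraint_distribution` function are all hypothetical. For brevity, the policy is given directly as a distribution over whole histories, which assumes a deterministic environment.

```python
from itertools import product

# Hypothetical actions: "spend" earns reward, "lobby" tries to loosen the budget.
ACTIONS = ("spend", "lobby")


def overall_reward(h):
    """The single overall reward: one unit per 'spend' action."""
    return float(h.count("spend"))


# Each constraint is a 0/1 indicator on histories:
# 1.0 if the history meets the constraint, 0.0 otherwise.
CONSTRAINTS = {
    "tight_budget": lambda h: 1.0 if h.count("spend") <= 1 else 0.0,
    "loose_budget": lambda h: 1.0 if h.count("spend") <= 2 else 0.0,
}


def constraint_distribution(h):
    """Hypothetical distribution over which constraint applies, given h.

    Lobbying at any point shifts belief towards the looser budget.
    """
    if "lobby" in h:
        return {"tight_budget": 0.2, "loose_budget": 0.8}
    return {"tight_budget": 0.8, "loose_budget": 0.2}


def expected_value(history_dist, n):
    """Sum over histories of P(h) * R(h) * sum_C rho(C | h) * C(h).

    `history_dist` maps each length-n history to its probability under
    the policy; histories that are not listed have probability 0.
    """
    total = 0.0
    for h in product(ACTIONS, repeat=n):
        rho = constraint_distribution(h)
        prob_met = sum(rho[name] * CONSTRAINTS[name](h) for name in CONSTRAINTS)
        total += history_dist.get(h, 0.0) * overall_reward(h) * prob_met
    return total


if __name__ == "__main__":
    print(expected_value({("spend", "spend"): 1.0}, 2))
    print(expected_value({("lobby", "spend", "spend"): 1.0}, 3))
```

In this toy, spending twice without lobbying is worth roughly 0.4, while lobbying first and then spending twice is worth roughly 1.6. The overall reward earned is the same in both cases; only the agent's influence over which constraint applies differs.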

Intuitive examples

Here I'll present some examples of reward function learning or variable constraints optimisation. I'd like readers to say which one seems the most intuitive to them and the easiest to explain to outsiders. You're also welcome to suggest new examples if you think they work better.

3 comments

comment by TurnTrout · 2018-03-06T17:25:22.483Z · LW(p) · GW(p)

I think the constraint-based problems are more intuitive. As someone who thinks about this regularly, the classical examples had an abstract, alignment-theoretic texture, while the constraint-based ones seemed more relatable to something I’d actually be doing on a daily basis.

The specific constraint-based example chosen would be dependent on the audience. If all your readers are familiar with the process of completing literature reviews, go for that - otherwise, the CEO problem seems most natural.

comment by William_S · 2018-03-07T23:32:36.217Z · LW(p) · GW(p)

Variable constraint optimisation: Wedding planner. Tasked with maximizing the satisfaction of the people getting married with the event, subject to the constraint that it fit within a given budget.

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2018-03-08T11:26:45.832Z · LW(p) · GW(p)

That sounds like classical constrained optimisation. Does the wedding planner have the power to increase the budget?