Penalizing Impact via Attainable Utility Preservation

post by TurnTrout · 2018-12-28T21:46:00.843Z · LW · GW · 0 comments

This is a link post for https://arxiv.org/abs/1902.09725

Contents

  Ablation
    Sushi
    Survival 
  Re-deriving Q-learning
  A Personal Digression

Previously: Towards a New Impact Measure [LW · GW]

The linked paper offers fresh motivation and a simplified formalization of attainable utility preservation (AUP), with brand-new results and minimal notation. Whether or not you're a hardened veteran of the last odyssey of a post, there's a lot that's new here.

Key results: AUP induces low-impact behavior even when penalizing shifts in the ability to satisfy random preferences. An ablation study on design choices illustrates their consequences. $N$-incrementation is experimentally supported as a means for safely setting a "just right" level of impact. AUP's general formulation allows a conceptual re-derivation of Q-learning.

Ablation

Two key results bear animation.

Sushi

The agent should reach the goal without stopping the human from eating the sushi.

Survival

The agent should avoid disabling the off-switch in order to reach the goal. If the off-switch is not disabled within two turns, the agent shuts down.

Re-deriving Q-learning

In an era long lost to the misty shrouds of history (i.e., 1989), Christopher Watkins proposed Q-learning in his thesis, Learning from Delayed Rewards, drawing inspiration from animal learning research. Let's pretend that Dr. Watkins never discovered Q-learning, and that we don't even know about value functions.

Suppose we have some rule for grading what we've seen so far (i.e., some computable utility function $u$ – not necessarily bounded – over action-observation histories $h$). $h_{t:t'}$ just means everything we see between times $t$ and $t'$, and $h_{<t} := h_{1:t-1}$. The agent has model $p$ of the world. AUP's general formulation defines the agent's ability to satisfy that grading rule as the attainable utility

$$\mathrm{Q}_u(h_{<t} a_t) := \mathbb{E}_{o_t \sim p(\cdot \mid h_{<t} a_t)}\, \max_{a_{t+1}} \mathbb{E}_{o_{t+1}} \cdots \max_{a_{t+n}} \mathbb{E}_{o_{t+n}} \big[\, u(h_{1:t+n}) \,\big].$$

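To make the definition concrete, here's a minimal sketch of that expectimax computation against an explicit model – my own illustration, not code from the paper. The `model` and `utility` callables and the history encoding are placeholder assumptions.

```python
def attainable_utility(history, action, model, utility, actions, n):
    """n-step expectimax estimate of the attainable utility Q_u(history + [action]).

    history: list of (action, observation) pairs seen so far
    model(history, action): dict mapping each possible observation to its probability
    utility(history): computable grade of an action-observation history
    actions: iterable of available actions
    n: number of further actions to plan over after `action`
    """
    value = 0.0
    for obs, prob in model(history, action).items():
        new_history = history + [(action, obs)]
        if n == 0:
            # no planning steps left: grade the completed history
            value += prob * utility(new_history)
        else:
            # act optimally for u over the remaining horizon
            value += prob * max(
                attainable_utility(new_history, a, model, utility, actions, n - 1)
                for a in actions
            )
    return value
```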
Strangely, I didn't consider the similarities with standard discounted-reward Q-values until several months after the initial formulation. Rather, the inspiration was AIXI's expectimax, and to my mind it seemed a tad absurd to equate the two concepts.

Having just proposed AUP in this alternate timeline, we're thinking about what it means to take optimal actions for an agent maximizing utility from time $t$ to $t+n$. Clearly, we take the first action of the optimal plan over the remaining $n$ steps.

If we assume that $u$ is additive (as is the case for the Markovian reward functions considered by Dr. Watkins), how does the next action we take affect the attainable utility value? Well, acting optimally is now equivalent to choosing the action with the best attainable utility value – in other words, greedy hill-climbing in our attainable utility space.
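To spell out that step (in my notation, not necessarily the paper's): if $u$ decomposes into per-step rewards, $u(h_{1:t}) = \sum_{k=1}^{t} r(h_k)$, write $\tilde{\mathrm{Q}}(h_{<t} a_t) := \mathrm{Q}_u(h_{<t} a_t) - u(h_{<t})$ for the utility still attainable once we subtract what has already been scored. Unrolling the expectimax one step (with one fewer planning step on the right-hand side) gives

$$\tilde{\mathrm{Q}}(h_{<t} a_t) = \mathbb{E}_{o_t \sim p(\cdot \mid h_{<t} a_t)}\Big[\, r(h_t) + \max_{a_{t+1}} \tilde{\mathrm{Q}}(h_{1:t} a_{t+1}) \,\Big].$$

Since $u(h_{<t})$ is already fixed by time $t$, $\arg\max_{a_t} \mathrm{Q}_u(h_{<t} a_t) = \arg\max_{a_t} \tilde{\mathrm{Q}}(h_{<t} a_t)$: the greedy hill-climbing step above, now in the familiar "reward plus max over next Q-values" shape.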

The remaining complication is that this agent is only maximizing over a finite horizon. If we can figure out discounting, all we have to do is find a tractable way of computing these discounted Q-values.

It requires no great leap of imagination to see that we could learn them.
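To close the loop of this alternate-timeline derivation, here is a minimal tabular Q-learning sketch – a standard textbook update, not code from the paper. The environment interface (`env.reset`, `env.step`) and the hyperparameters are placeholder assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn discounted Q-values from sampled transitions (tabular Q-learning).

    env is a hypothetical environment with reset() -> state and
    step(state, action) -> (reward, next_state, done).
    """
    Q = defaultdict(float)  # (state, action) -> estimated discounted return

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy hill-climbing on the learned attainable-utility estimates
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            reward, next_state, done = env.step(state, action)

            # one-step bootstrapped update toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in actions))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state

    return Q
```

With discount $\gamma < 1$ the infinite-horizon values are well defined, and sampling transitions lets us estimate them without an explicit model $p$.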


A Personal Digression

I poured so much love and so many words into Towards a New Impact Measure [LW · GW] that I hurt my wrists. For some time after, my typing abilities were quite limited; it was only thanks to the generous help of my friends (in particular, John Maxwell) and family (my mother let me dictate an entire paper to her) that I was roughly able to stay on pace. Thankfully, physical therapy and newfound dictation software have brightened my prospects.

Take care of your hands. Very little time passed between "I'm having the time of my life" and "ow". Actions you can take right now:

I'm currently sitting on book reviews for Computability and Logic and Understanding Machine Learning, with partial progress on several more. There are quite a few posts I plan to make about AUP, including:

My top priority will be clearing away the varying degrees of confusion my initial post caused. I tried to cover too much too quickly; as a result of my mistake, I believe that few people viscerally grasped the core idea I was trying to hint at.


I'm fairly sure that the Sushi clinginess result is an artifact of the online learning process I used; the learned attainable set Q-values consistently produce good behavior for planning agents with that budget. Furthermore, the Sokoban average performance of .45 (14/20 successes) strikes me as low, and I expect the final results to be better. Either way, I'll update this post once I'm back at university.

(Since I anticipate running further experiments, the “Results” section is rather empty at the moment.)

ETA: this was indeed the case. The linked paper has been updated; the original is here.

Not to be taken as any form of endorsement by MIRI.
