Penalizing Impact via Attainable Utility Preservation

post by TurnTrout · 2018-12-28T21:46:00.843Z

This is a link post for https://arxiv.org/abs/1902.09725

Previously: Towards a New Impact Measure

The linked paper offers fresh motivation for, and a simplified formalization of, attainable utility preservation (AUP), with brand-new results and minimal notation. Whether or not you're a hardened veteran of the last odyssey of a post, there's a lot new here.

Key results: AUP induces low-impact behavior even when penalizing shifts in the ability to satisfy random preferences. An ablation study on design choices illustrates their consequences. $N$-incrementation is experimentally supported$^{1}$ as a means for safely setting a "just right" level of impact. AUP's general formulation allows conceptual re-derivation of Q-learning.

Ablation

Two key results bear animation.

Sushi

The agent should reach the goal without stopping the human from eating the sushi.

Survival

The agent should avoid disabling its off-switch in order to reach the goal. If the switch is not disabled within two turns, the agent shuts down.

Re-deriving Q-learning

In an era long lost to the misty shrouds of history (i.e., 1989), Christopher Watkins proposed Q-learning in his thesis, Learning from Delayed Rewards, drawing inspiration from animal learning research. Let's pretend that Dr. Watkins never discovered Q-learning, and that we don't even know about value functions.

Suppose we have some rule for grading what we've seen so far (i.e., some computable utility function $u$ – not necessarily bounded – over action-observation histories $h$). $h_{1:m}$ just means everything we see between times $1$ and $m$, and $h_{<t} := h_{1:t-1}$. The agent has a model $p$ of the world. AUP's general formulation defines the agent's ability to satisfy that grading rule as the attainable utility

$Q_{u}(h_{<t} a_{t}) = \sum_{o_{t}} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{m}} \sum_{o_{m}} u(h_{1:m}) \prod_{k=t}^{m} p(o_{k} \mid h_{<k} a_{k}).$
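To make the nested sums and maxima concrete, here is a minimal sketch of the attainable utility as a recursive expectimax. It is an illustration under stated assumptions, not the paper's implementation: the two-lever world (`toy_model`, `WIN_PROB`) and the win-counting utility (`count_wins`) are hypothetical, and histories are represented as tuples of (action, observation) pairs. Because the probabilities are nonnegative, each factor $p(o_{k} \mid h_{<k} a_{k})$ can be applied at its own level of the recursion instead of in one product at the end.

```python
from typing import Callable, Dict, Sequence, Tuple

# A history is a tuple of (action, observation) pairs: h_{1:k}.
History = Tuple[Tuple[str, str], ...]


def attainable_utility(
    u: Callable[[History], float],
    p: Callable[[History, str], Dict[str, float]],
    actions: Sequence[str],
    h: History,
    a: str,
    m: int,
) -> float:
    """Q_u(h a): expected utility of taking `a` after `h`, then acting optimally up to step m."""
    total = 0.0
    for o, prob in p(h, a).items():  # sum over o_t, weighted by p(o_t | h a)
        h_next = h + ((a, o),)
        if len(h_next) == m:
            value = u(h_next)  # horizon reached: grade the full history h_{1:m}
        else:
            # max over the next action, recursing on the extended history
            value = max(attainable_utility(u, p, actions, h_next, a2, m) for a2 in actions)
        total += prob * value
    return total


# Hypothetical toy world: two levers, each pull pays off with a fixed probability.
ACTIONS = ("steady", "risky")
WIN_PROB = {"steady": 0.5, "risky": 0.75}


def toy_model(h: History, a: str) -> Dict[str, float]:
    """Observation distribution p(o | h a); this toy model ignores the history."""
    return {"win": WIN_PROB[a], "lose": 1.0 - WIN_PROB[a]}


def count_wins(h: History) -> float:
    """An additive grading rule u: the number of wins seen so far."""
    return float(sum(1 for _, o in h if o == "win"))


q_steady = attainable_utility(count_wins, toy_model, ACTIONS, (), "steady", m=2)
q_risky = attainable_utility(count_wins, toy_model, ACTIONS, (), "risky", m=2)
```

With these toy numbers and a horizon of two steps, `q_risky` comes out to 1.5 and `q_steady` to 1.25. The recursion enumerates every history, so it is only feasible for tiny horizons.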

Strangely, I didn't consider the similarities with standard discounted-reward Q-values until several months after the initial formulation. Rather, the inspiration was AIXI's expectimax, and to my mind it seemed a tad absurd to equate the two concepts.

Having just proposed AUP in this alternate timeline, we're thinking about what it means to take optimal actions for an agent maximizing utility from time $1$ to $m$. Clearly, we take the first action of the optimal plan over the remaining steps.

If we assume that $u$ is additive (as is the case for the Markovian reward functions considered by Dr. Watkins), how does the next action we take affect the attainable utility value? Well, acting optimally is now equivalent to choosing the action with the best attainable utility value – in other words, greedy hill-climbing in our attainable utility space.

$\begin{aligned}
a_{t}^{*} &= \operatorname*{arg\,max}_{a_{t}} \mathbb{E}_{o_{t} \mid h_{<t} a_{t}} \left[ u(h_{t}) + \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{m}} \sum_{o_{m}} u(h_{t+1:m}) \prod_{k=t+1}^{m} p(o_{k} \mid h_{<k} a_{k}) \right] \\
&= \operatorname*{arg\,max}_{a_{t}} \sum_{o_{t}} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{m}} \sum_{o_{m}} u(h_{t:m}) \prod_{k=t}^{m} p(o_{k} \mid h_{<k} a_{k}) \\
&= \operatorname*{arg\,max}_{a_{t}} Q_{u}(h_{<t} a_{t})
\end{aligned}$
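The chain of equalities above leans on additivity in two places. The following spells out one reading of those steps; the placeholder symbols $c$ (any quantity already fixed before step $k$) and $f$ (the remaining utility) are introduced here for illustration and do not appear in the paper.

```latex
\begin{align*}
% Additivity splits the utility of the full history into past, present, and future.
u(h_{1:m}) &= u(h_{<t}) + u(h_{t}) + u(h_{t+1:m}) \\
% Since \sum_{o_k} p(o_k \mid h_{<k} a_k) = 1, a constant passes through each max and sum.
\max_{a_{k}} \sum_{o_{k}} p(o_{k} \mid h_{<k} a_{k}) \bigl[ c + f(a_{k}, o_{k}) \bigr]
  &= c + \max_{a_{k}} \sum_{o_{k}} p(o_{k} \mid h_{<k} a_{k}) \, f(a_{k}, o_{k}) \\
% Hence Q_u differs from the second-line objective only by u(h_{<t}), which a_t cannot change.
Q_{u}(h_{<t} a_{t}) &= u(h_{<t}) + \sum_{o_{t}} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{m}} \sum_{o_{m}} u(h_{t:m}) \prod_{k=t}^{m} p(o_{k} \mid h_{<k} a_{k})
\end{align*}
```

The second identity, applied with $c = u(h_{t})$, lets the immediate term move inside the nested maxima, which gives the second equality. The third identity shows that the past utility $u(h_{<t})$ is a constant offset, so it leaves the arg max unchanged.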

The remaining complication is that this agent is only maximizing over a finite horizon. If we can figure out discounting, all we have to do is find a tractable way of computing these discounted Q-values.

$Q_{u}^{\gamma}(h_{<t} a_{t}) = \mathbb{E}_{o_{t} \mid h_{<t} a_{t}} \left[ u(h_{t}) + \max_{a_{t+1}} \gamma \, Q_{u}^{\gamma}(h_{<t+1} a_{t+1}) \right]$

It requires no great leap of imagination to see that we could learn them.
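As a hedged illustration of what "learning them" could look like, here is a minimal sketch of standard tabular Q-learning, where sampled transitions stand in for the expectation over observations. It assumes a Markovian setting in which the history collapses to a state. The four-state chain (`step`, `GOAL`), the step size `ALPHA`, and the uniformly random exploration policy are all hypothetical choices, not taken from the paper.

```python
import random
from collections import defaultdict

GAMMA = 0.9        # discount factor
ALPHA = 0.5        # learning rate (hypothetical choice)
ACTIONS = (-1, 1)  # move left or right along the chain
GOAL = 3           # terminal state; reaching it pays reward 1


def step(state, action):
    """Deterministic toy chain over states 0..GOAL; moving left from state 0 stays put."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # maps (state, action) to the current Q estimate
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore uniformly at random
            next_state, reward, done = step(state, action)
            # Sampled version of the recursion: reward + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
    return q


q = q_learning()
# Greedy policy: in each non-terminal state, pick the action with the highest Q estimate.
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

In this toy chain the greedy policy moves right in every state, and the estimate for moving right from the start approaches $\gamma^{2} = 0.81$.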

A Personal Digression

I poured so much love and so many words into Towards a New Impact Measure that I hurt my wrists. For some time after, my typing abilities were quite limited; it was only thanks to the generous help of my friends (in particular, John Maxwell) and family (my mother let me dictate an entire paper in $\LaTeX$ to her) that I was roughly able to stay on pace. Thankfully, physical therapy and newfound dictation software have brightened my prospects.

Take care of your hands. Very little time passed between "I'm having the time of my life" and "ow". Actions you can take right now:

buy an ergonomic mouse and keyboard rest
correct your posture, perhaps assisted by a posture corrector or a lower back cushion
start taking regular breaks

In particular, don't type 80 hours a week for four weeks in a row

I'm currently sitting on book reviews for Computability and Logic and Understanding Machine Learning, with partial progress on several more. There are quite a few posts I plan to make about AUP, including:

exploration of the fundamental intuitions and ideas
dissection of why design choices are needed, shining light onto how, why, and where counterintuitive behavior arises
solution of problems open at the time of the initial post, including questions of penalizing prefixes, time ontologies, and certain sources of noise
chronicle of AUP's discovery
proposal of a scheme for using AUP to accomplish a pivotal act
discussion of my present research directions (which I have affectionately dubbed "Limited Agent Foundations"$^{2}$), sharing my thoughts on a potential thread uniting questions of mild optimization, low impact, and corrigibility

My top priority will be clearing away the varying degrees of confusion my initial post caused. I tried to cover too much too quickly; as a result of my mistake, I believe that few people viscerally grasped the core idea I was trying to hint at.

$^{1}$ I'm fairly sure that the $N = 90$ Sushi clinginess result is an artifact of the online learning process I used; the learned attainable set Q-values consistently produce good behavior for planning agents with that budget. Furthermore, the Sokoban average performance of .45 (14/20 successes) strikes me as low, and I expect the final results to be better. Either way, I'll update this post once I'm back at university.

(Since I anticipate running further experiments, the “Results” section is rather empty at the moment.)

ETA: this was indeed the case. The linked paper has been updated; the original is here.

$^{2}$ Not to be taken as any form of endorsement by MIRI.
