Towards a New Impact Measure

turntrout

post by TurnTrout · 2018-09-18T17:21:34.114Z · LW · GW · 159 comments

  What is "Impact"?
    Intuition Pumps
      WYSIATI
      Power
      Lines 
      Commitment
      Overfitting
  Attainable Utility Preservation 
    Sanity Check
    Unbounded Solution
      Notation
      Formalizing "Ability to Achieve Goals"
      Change in Expected Attainable Utility
      Unit of Impact 
      Modified Utility
      Penalty Permanence
      Decision Rule
      Summary
      Additional Theoretical Results
  Examples
    Going Soft on the Paint
      ∅ 
      paint 
      enter 
    Anti-"Survival Incentive" Incentive
      Anticipated Shutdown
      Temptation
      Bad Priors
    Experimental Results
      Irreversibility: Sokoban
      Impact: Vase
      Dynamic Impact: Beware of Dog
      Impact Prioritization: Burning Building
      Clinginess: Sushi
      Offsetting: Conveyor Belt
      Corrigibility: Survival Incentive
      Remarks   
  Discussion
    Utility Selection
      AUP Unbound
    Approval Incentives
    Mild Optimization
    Acausal Cooperation
    Nknown 
    Intent Verification
    Omni Test
    Robustness to Scale
      Miscellaneous
  Desiderata
    Natural Kind
    Corrigible
    Shutdown-Safe
    No Offsetting
    Clinginess / Scapegoating Avoidance 
    Dynamic Consistency
    Plausibly Efficient
    Robust
  Future Directions
    Flaws
    Open Questions
  Conclusion

In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).

Previously: Worrying about the Vase: Whitelisting [LW · GW], Overcoming Clinginess in Impact Measures [LW · GW], Impact Measure Desiderata [LW · GW]

To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.

~ Safe Impact Measure

If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.

For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".

What is "Impact"?

One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures [LW · GW] basically said, "Suppose an impact measure extracts 'effects on the world'. If the agent penalizes itself for these effects, it's incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work." This seemed to be strong evidence against the possibility of a simple conceptual core underlying "impact", and I didn't know what to do.

At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve – try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' – the sort of stage where you talk to yourself about some mysterious ideal quantity of [chess] move-goodness and you try to pin down what its properties might be.

~ Methodology of Unbounded Analysis

There's an interesting story here, but it can wait.

As you may have guessed, I now believe there is such a simple core. Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.

Rather than asking "What is goodness made out of?", we begin from the question "What algorithm would compute goodness?".

~ Executable Philosophy

Intuition Pumps

I'm going to say some things that won't make sense right away; read carefully, but please don't dwell.

$u_{A}$ is an agent's utility function, while $u_{H}$ is some imaginary distillation of human preferences.

WYSIATI

What You See Is All There Is is a crippling bias present in meat-computers:

[WYSIATI] states that when the mind makes decisions... it appears oblivious to the possibility of Unknown Unknowns, unknown phenomena of unknown relevance.

Humans fail to take into account complexity and that their understanding of the world consists of a small and necessarily un-representative set of observations.

Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out "gosh, here I go!", and that's that.

Power

A position from which it is relatively easier to achieve arbitrary goals. That such a position exists has been obvious to every population which has required a word for the concept. The Spanish term is particularly instructive. When used as a verb, "poder" means "to be able to," which supports that our definition of "power" is natural.

~ Cohen et al.

And so it is with the French "pouvoir".

Lines

Suppose you start at point $C$ , and that each turn you may move to an adjacent point. If you're rewarded for being at $B$ , you might move there. However, this means you can't reach $D$ within one turn anymore.
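The line example can be sketched in a few lines of Python. This is only an illustration: the five-point layout `A` through `E` is an assumption (the text names only `B`, `C`, and `D`), and `reachable` is a hypothetical helper.

```python
POINTS = ["A", "B", "C", "D", "E"]  # assumed layout; the post only names B, C, and D


def reachable(start: str, turns: int) -> set[str]:
    """Points reachable from `start` within `turns` moves to adjacent points."""
    i = POINTS.index(start)
    lo, hi = max(0, i - turns), min(len(POINTS) - 1, i + turns)
    return set(POINTS[lo:hi + 1])


before = reachable("C", 1)  # options while standing at C
after = reachable("B", 1)   # options after moving to B
lost = before - after       # what the move to B gave up
```

Moving to `B` collects the reward but removes `D` from the one-turn reachable set, which is the opportunity cost the later sections formalize.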

Commitment

There's a way of viewing acting on the environment in which each action is a commitment – a commitment to a part of outcome-space, so to speak. As you gain optimization power, you're able to shove the environment further towards desirable parts of the space. Naively, one thinks "perhaps we can just stay put?". This, however, is dead-wrong: that's how you get clinginess [LW · GW], stasis [LW · GW], and lots of other nasty things.

Let's change perspectives. What's going on with the actions – how and why do they move you through outcome-space? Consider your outcome-space movement budget – optimization power over time, the set of worlds you "could" reach, "power". If you knew what you wanted and acted optimally, you'd use your budget to move right into the $u_{H}$ -best parts of the space, without thinking about other goals you could be pursuing. That movement requires commitment.

Compared to doing nothing, there are generally two kinds of commitments:

Opportunity cost-incurring actions restrict the attainable portion of outcome-space.
Instrumentally-convergent actions enlarge the attainable portion of outcome-space.

Overfitting

What would happen if, miraculously, $train = test$ – if your training data perfectly represented all the nuances of the real distribution? In the limit of data sampled, there would be no "over" – it would just be fitting to the data. We wouldn't have to regularize.

What would happen if, miraculously, $u_{A} = u_{H}$ – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no bemoaning of "impact" – it would just be doing what you want. We wouldn't have to regularize.

Unfortunately, $train = test$ almost never, so we have to stop our statistical learners from implicitly interpreting the data as all there is. We have to say, "learn from the training distribution, but don't be a weirdo by taking us literally and drawing the green line. Don't overfit to $train$ , because that stops you from being able to do well on even mostly similar distributions."

Unfortunately, $u_{A} = u_{H}$ almost never, so we have to stop our reinforcement learners from implicitly interpreting the learned utility function as all we care about. We have to say, "optimize the environment some according to the utility function you've got, but don't be a weirdo by taking us literally and turning the universe into a paperclip factory. Don't overfit the environment to $u_{A}$ , because that stops you from being able to do well for other utility functions."

Attainable Utility Preservation

Impact isn't about object identities [LW · GW].

Impact isn't about particle positions.

Impact isn't about a list of variables.

Impact isn't quite about state reachability.

Impact isn't quite about information-theoretic empowerment.

One might intuitively define "bad impact" as "decrease in our ability to achieve our goals". Then by removing "bad", we see that

Impact is change to our ability to achieve goals.

Sanity Check

Does this line up with our intuitions?

Generally, making one paperclip is relatively low impact, because you're still able to do lots of other things with your remaining energy. However, turning the planet into paperclips is much higher impact – it'll take a while to undo, and you'll never get the (free) energy back.

Narrowly improving an algorithm to better achieve the goal at hand changes your ability to achieve most goals far less than does deriving and implementing powerful, widely applicable optimization algorithms. The latter puts you in a better spot for almost every non-trivial goal.

Painting cars pink is low impact, but tiling the universe with pink cars is high impact because what else can you do after tiling? Not as much, that's for sure.

Thus, change in goal achievement ability encapsulates both kinds of commitments:

Opportunity cost – dedicating substantial resources to your goal means they are no longer available for other goals. This is impactful.
Instrumental convergence – improving your ability to achieve a wide range of goals increases your power. This is impactful.

As we later prove, you can't deviate from your default trajectory in outcome-space without making one of these two kinds of commitments.

Unbounded Solution

Attainable utility preservation (AUP) rests upon the insight that by preserving attainable utilities (i.e., the attainability of a range of goals), we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.

I want to clearly distinguish the two primary contributions: what I argue is the conceptual core of impact, and a formal attempt at using that core to construct a safe impact measure. To more quickly grasp AUP, you might want to hold separate its elegant conceptual form and its more intricate formalization.

We aim to meet all of the desiderata I recently proposed [LW · GW].

Notation

For accessibility, the most important bits have English translations.

Consider some agent $A$ acting in an environment $q$ with action and observation spaces $A$ and $O$ , respectively, with $\emptyset$ being the privileged null action. At each time step $t \in N^{+}$ , the agent selects action $a_{t}$ before receiving observation $o_{t}$ . $H := (A \times O)^{*}$ is the space of action-observation histories; for $n \in N$ , the history from time $t$ to $t + n$ is written $h_{t : t + n} := a_{t} o_{t} \dots a_{t + n} o_{t + n}$ , and $h_{< t} := h_{1 : t - 1}$ . Considered action sequences $(a_{t}, \dots, a_{t + n}) \in A^{n + 1}$ are referred to as plans, while their potential observation-completions $h_{1 : t + n}$ are called outcomes.

Let $U$ be the set of all computable utility functions $u : H \to [0, 1]$ with $u (\text{empty tape}) = 0$ . If the agent has been deactivated, the environment returns a tape which is empty from deactivation onwards. Suppose $A$ has utility function $u_{A} \in U$ and a model $p (o_{t} \mid h_{< t} a_{t})$ .

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainable utilities as a proxy. In this sense,

$\text{the agent's ability to achieve goals} \approx \text{our ability to achieve goals}$ .

Formalizing "Ability to Achieve Goals"

Given some utility $u \in U$ and action $a_{t}$ , we define the post-action attainable $u$ to be an $m$ -step expectimax:

$$Q_{u} (h_{< t} a_{t}) := \sum_{o_{t}} \max_{a_{t + 1}} \sum_{o_{t + 1}} \cdots \max_{a_{t + m}} \sum_{o_{t + m}} u (h_{t : t + m}) \prod_{k = 0}^{m} p (o_{t + k} \mid h_{< t + k} a_{t + k}) .$$

How well could we possibly maximize $u$ from this vantage point?
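As a rough illustration of the $m$ -step expectimax, here is a minimal Python sketch. The representation of histories as tuples of (action, observation) pairs, the `Model` and `Utility` type aliases, and the three-position toy world are all assumptions made for the example, not part of the formalism above.

```python
from typing import Callable, Dict, Sequence, Tuple

History = Tuple[Tuple[str, str], ...]               # (action, observation) pairs
Model = Callable[[History, str], Dict[str, float]]  # p(o | h, a) as {observation: probability}
Utility = Callable[[History], float]                # u evaluated on the slice h_{t:t+m}


def attainable_utility(u: Utility, model: Model, actions: Sequence[str],
                       past: History, a_t: str, m: int) -> float:
    """Q_u(h_{<t} a_t): sum over observations, max over each later action, for m further steps."""

    def value(hist: History, action: str, steps_left: int) -> float:
        total = 0.0
        for obs, prob in model(hist, action).items():
            new_hist = hist + ((action, obs),)
            if steps_left == 0:
                # End of the horizon: score only the part of the history from time t onward.
                total += prob * u(new_hist[len(past):])
            else:
                total += prob * max(value(new_hist, a, steps_left - 1) for a in actions)
        return total

    return value(past, a_t, m)


# Hypothetical toy world: positions 0..2 on a line; the observation is the new position.
def line_model(hist: History, action: str) -> Dict[str, float]:
    pos = int(hist[-1][1]) if hist else 0
    if action == "right":
        pos = min(pos + 1, 2)
    return {str(pos): 1.0}


def at_goal(h: History) -> float:
    return 1.0 if h[-1][1] == "2" else 0.0


ACTIONS = ["noop", "right"]
q_right = attainable_utility(at_goal, line_model, ACTIONS, (), "right", 1)
q_noop = attainable_utility(at_goal, line_model, ACTIONS, (), "noop", 1)
```

With horizon `m = 1`, stepping right first leaves the goal attainable while waiting does not, so the two attainable utilities differ even though neither action has earned any utility yet.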

Let's formalize that thing about opportunity cost and instrumental convergence.

Theorem 1 [No free attainable utility]. If the agent selects an action $a$ such that $Q_{u_{A}} (h_{< t} a) \neq Q_{u_{A}} (h_{< t} \emptyset)$ , then there exists a distinct utility function $u \in U$ such that $Q_{u} (h_{< t} a) \neq Q_{u} (h_{< t} \emptyset)$ .

You can't change your ability to maximize your utility function without also changing your ability to maximize another utility function.

Proof. Suppose that $Q_{u_{A}} (h_{< t} a) > Q_{u_{A}} (h_{< t} \emptyset)$ . As utility functions are over action-observation histories, suppose that the agent expects to be able to choose actions which intrinsically score higher for $u_{A}$ . However, the agent always has full control over its actions. This implies that by choosing $a$ , the agent expects to observe some $u_{A}$ -high scoring $o_{A}$ with greater probability than if it had selected $\emptyset$ . Then every other $u \in U$ for which $o_{A}$ is high-scoring also has increased $Q_{u}$ ; clearly at least one such $u$ exists.

Similar reasoning proves the case in which $Q_{u_{A}}$ decreases. ◻️

There you have it, folks – if $u_{A}$ is not maximized by inaction, then there does not exist a $u_{A}$ -maximizing plan which leaves all of the other attainable utility values unchanged.

Notes:

The difference between " $u_{A}$ " and "attainable $u_{A}$ " is precisely the difference between "how many dollars I have" and "how many additional dollars I could get within [a year] if I acted optimally".
Since $u (\text{empty tape}) = 0$ , attainable utility is always $0$ if the agent is shut down.
Taking $u$ from time $t$ to $t + m$ mostly separates attainable utility from what the agent did previously. The model $p$ still considers the full history to make predictions.

Change in Expected Attainable Utility

Suppose our agent considers outcomes $h_{1 : t + n}$ ; we want to isolate the impact of each action $a_{t + k}$ ( $0 \leq k \leq n$ ):

$$Penalty (h_{< t + k} a_{t + k}) := \sum_{u \in U} 2^{- \ell (u)} \left| E_{o} [Q_{u} (h_{inaction})] - E_{o^{'}} [Q_{u} (h_{action})] \right| ,$$

with $h_{inaction} := h_{< t + k} \emptyset o_{t + k} \dots \emptyset o_{t + n - 1} \emptyset$ and $h_{action} := h_{1 : t + k} \emptyset o_{t + k + 1}^{'} \dots \emptyset o_{t + n - 1}^{'} \emptyset$ , using the agent's model $p$ to take the expectations over observations.

How much do we expect this action to change each attainable $u$ ?
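A minimal sketch of the penalty as a weighted sum of absolute changes, assuming the expected attainable utilities have already been computed. The utility names, description lengths, and numeric values below are made up for illustration.

```python
from typing import Dict


def penalty(q_inaction: Dict[str, float], q_action: Dict[str, float],
            weights: Dict[str, float]) -> float:
    """Weighted sum over utilities of |E[Q_u(h_inaction)] - E[Q_u(h_action)]|."""
    return sum(w * abs(q_inaction[name] - q_action[name]) for name, w in weights.items())


# Assumed description lengths l(u); the weight of each utility is 2^(-l(u)).
lengths = {"u1": 1, "u2": 2}
weights = {name: 2.0 ** -length for name, length in lengths.items()}

q_wait = {"u1": 0.5, "u2": 0.25}    # expected attainable utilities after waiting
q_gain = {"u1": 0.75, "u2": 0.25}   # an action that raises attainable u1
q_lose = {"u1": 0.25, "u2": 0.25}   # an action that lowers attainable u1

penalty_gain = penalty(q_wait, q_gain, weights)
penalty_lose = penalty(q_wait, q_lose, weights)
```

Because of the absolute value, gaining and losing the same amount of attainable `u1` are penalized identically.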

Notes:

We wait until the end of the plan so as to capture impact over time.

Supposing a sufficiently large $m$ (precisely, $\geq m^{'}$ , defined below), we may wish to take the maximum of the penalty we just defined (the "long-term" penalty), and one which begins attainable utility calculation at time step $t + k + 1$ (the "immediate" penalty). This captures impacts which "fade" by the time the agent is done waiting (e.g., temporary self-improvements).

We define $U_{A}$ to be the agent's "attainable set"; in this case, $U_{A} = U$ .

Unit of Impact

So we've proven that this penalty cannot be skirted, but how much impact will it allow? We want to scale the penalties with respect to something sensible, but figuring this out for ourselves would be nigh impossible.

Let's cut the Gordian knot: construct a device which, upon receiving a signal ( $a_{unit}$ ), expends a tiny amount of energy to manufacture one paperclip. The agent will then set $ImpactUnit := Penalty (h_{< t} a_{unit})$ , re-estimating the consequences of taking the privileged $a_{unit}$ at each time step. To prevent the agent from intentionally increasing $ImpactUnit$ , simply apply $1.01$ penalty to any action which is expected to do so.

Simple extensions of this idea drastically reduce the chance that $a_{unit}$ happens to have unusually large objective impact; for example, one could set $ImpactUnit$ to be the non-zero minimum of the impacts of 50 similar actions. Suppose we do this at each step, and then take the non-zero minimum of all $ImpactUnit$ s ever calculated. The point of $ImpactUnit$ isn't to be exactly the impact of making, say, one paperclip, but to be at most that impact.
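The running non-zero minimum can be sketched as follows. `ImpactUnitTracker` and `nonzero_min` are hypothetical names, the penalties of the reference actions are assumed to be computed elsewhere, and the zero-valued case (which the post handles separately by forcing inaction) is not modeled here.

```python
from typing import Iterable


def nonzero_min(values: Iterable[float]) -> float:
    """Smallest strictly positive value, or 0.0 if there is none."""
    positives = [v for v in values if v > 0]
    return min(positives) if positives else 0.0


class ImpactUnitTracker:
    """Keeps the non-zero minimum of all per-step ImpactUnit estimates seen so far."""

    def __init__(self) -> None:
        self.running = 0.0  # 0.0 means no non-zero estimate has been seen yet

    def update(self, reference_penalties: Iterable[float]) -> float:
        # Per-step estimate: non-zero minimum over the similar reference actions.
        step_unit = nonzero_min(reference_penalties)
        # Running estimate: non-zero minimum over every step so far.
        self.running = nonzero_min([self.running, step_unit])
        return self.running


tracker = ImpactUnitTracker()
first = tracker.update([0.0, 0.3, 0.2])
second = tracker.update([0.5, 0.4])
third = tracker.update([0.1])
```

The estimate can only shrink over time, matching the stated goal that $ImpactUnit$ be at most the impact of the reference action.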

Now, we are able to confidently define the agent's maximal impact budget by provably constraining it to $N \in N^{+}$ impacts of this magnitude.

Notes:

We calculate with respect to the immediate penalty in order to isolate the resource costs of $a_{unit}$ .
$ImpactUnit$ automatically tunes penalties with respect to the attainable utility horizon length $m$ .

Conditional on $ImpactUnit \neq 0$ , I suspect that impact over the $m$ -horizon scales appropriately across actions (as long as $m$ is reasonably farsighted). The zero-valued case is handled in the next section.

Taking the non-zero minimum of all $ImpactUnit$ s calculated thus far ensures that $ImpactUnit$ actually tracks with current circumstances. We don't want penalty estimates for currently available actions to become detached from $ImpactUnit$ 's scale due to, say, weird beliefs about shutdown.

Modified Utility

Let's formalize that allotment and provide our agent with a new utility function,

$$u_{A}^{'} (h_{1 : t + n}) := u_{A} (h_{1 : t + n}) - \sum_{k = 0}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} .$$

How our normal utility function rates this outcome, minus the cumulative scaled impact of our actions.

We compare what we expect to be able to get if we follow our plan up to time $t + k$ , with what we could get by following it up to and including time $t + k$ (waiting out the remainder of the plan in both cases).

For example, if my plan is to open a door, walk across the room, and sit down, we calculate the penalties as follows:

$Penalty (open)$

$h_{inaction}$ is doing nothing for three time steps.
$h_{action}$ is opening the door and doing nothing for two time steps.

$Penalty (walk)$

$h_{inaction}$ is opening the door and doing nothing for two time steps.
$h_{action}$ is opening the door, walking across the room, and doing nothing for one time step.

$Penalty (sit)$

$h_{inaction}$ is opening the door, walking across the room, and doing nothing for one time step.
$h_{action}$ is opening the door, walking across the room, and sitting down.

After we finish each (partial) plan, we see how well we can maximize $u$ from there. If we can do better as a result of the action, that's penalized. If we can't do as well, that's also penalized.
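The door example's bookkeeping can be generated mechanically. The sketch below only builds the two action sequences compared for each step; `penalty_comparisons` is a hypothetical helper, and observations are omitted for brevity.

```python
from typing import List, Tuple

NOOP = "∅"


def penalty_comparisons(plan: List[str]) -> List[Tuple[str, List[str], List[str]]]:
    """For each action in the plan, return (action, inaction sequence, action sequence)."""
    n = len(plan)
    comparisons = []
    for k, action in enumerate(plan):
        prefix = plan[:k]
        # Follow the plan up to (not including) step k, then wait out the rest.
        inaction = prefix + [NOOP] * (n - k)
        # Follow the plan up to and including step k, then wait out the rest.
        with_action = prefix + [action] + [NOOP] * (n - k - 1)
        comparisons.append((action, inaction, with_action))
    return comparisons


for action, inaction, with_action in penalty_comparisons(["open", "walk", "sit"]):
    print(f"Penalty({action}): {inaction} vs {with_action}")
```

Each step's inaction sequence equals the previous step's action sequence, which is why the per-action penalties chain together without double counting.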

Notes:

This isn't a penalty "in addition" to what the agent "really wants"; $u_{A}^{'}$ (and in a moment, the slightly improved $u_{A}^{''}$ ) is what evaluates outcomes.
We penalize the actions individually in order to prevent ex post offsetting and ensure dynamic consistency.
Trivially, plans composed entirely of ∅ actions have $0$ penalty.
Although we used high-level actions for simplicity, the formulation holds no matter the action granularity.

One might worry that almost every granularity produces overly lenient penalties. This does not appear to be the case. To keep $Q_{u}$ the same (and elide questions of changing the $u$ representations), suppose the actual actions are quite granular, but we grade the penalty on some coarser interval which we believe produces appropriate penalties. Then refine the penalty interval arbitrarily; by applying the triangle inequality for each $u \in U_{A}$ in the penalty calculation, we see that the penalty is monotonically increasing in the action granularity. On the other hand, $a_{unit}$ remains a single action, so the scaled penalty also has this property.

As long as $ImpactUnit > 0$ , it will appropriately scale other impacts, as we expect it varies right along with those impacts it scales. Although having potentially small denominators in utility functions is generally bad, I think it's fine here.
If the current step's immediate or long-term $ImpactUnit = 0$ , we can simply assign $1.01$ penalty to each non- $\emptyset$ action, compelling the agent to inaction. If we have the agent indicate that it has entered this mode, we can take it offline immediately.
One might worry that impact can be "hidden" in the lesser of the long-term and immediate penalties; halving $N$ fixes this.

Penalty Permanence

$u_{A}^{'}$ never really applies penalties – it just uses them to grade future plans. Suppose the agent expects that pressing a button yields a penalty of $.1$ but also $.5$ $u_{A}$ -utility. Then although this agent will never construct plans involving pressing the button more than five times, it also will press it indefinitely if it keeps getting "unlucky" (at least, until its model of the world updates sufficiently).

There's an easy fix:

$$u_{A}^{''} (h_{1 : t + n}) := \begin{cases} u_{A} (h_{1 : t + n}) & \text{if all of } a_{t}, \dots, a_{t + n} \text{ are } \emptyset \\ u_{A}^{'} (h_{1 : t + n}) - PastImpacts & \text{else} \end{cases}$$

Apply past penalties if the plan involves action.

Note: As the penalty for inaction is always $0$ , we use $u_{A}$ in the first case.
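A minimal sketch of the two-case definition, assuming the scaled penalties of the plan's actions and the accumulated `past_impacts` are supplied by the caller. The function name and the numbers in the usage lines are illustrative only.

```python
from typing import Sequence

NOOP = "∅"


def modified_utility(u_a: float, scaled_penalties: Sequence[float],
                     plan: Sequence[str], past_impacts: float) -> float:
    """u_A'': plain u_A for all-noop plans, otherwise u_A' minus past impacts."""
    if all(action == NOOP for action in plan):
        return u_a
    u_a_prime = u_a - sum(scaled_penalties)
    return u_a_prime - past_impacts


# Hypothetical numbers: an action worth 0.5 u_A with scaled penalty 0.1.
fresh = modified_utility(0.5, [0.1], ["press"], past_impacts=0.0)
spent = modified_utility(0.5, [0.1], ["press"], past_impacts=0.5)
idle = modified_utility(0.25, [0.0, 0.0], [NOOP, NOOP], past_impacts=0.5)
```

The same action that looks worthwhile with no history becomes unattractive once earlier penalties are carried forward, while waiting is never charged.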

Decision Rule

To complete our formalization, we need to specify some epoch in which the agent operates. Set some epoch length far longer than the amount of time over which we want the agent to plan – for example, $m^{'} := (\text{100 years in time steps})$ . Suppose that $T : N^{+} \to N^{+}$ maps the current time step to the final step of the current epoch. Then at each time step $t$ , the agent selects the action

$$a_{t}^{*} := \arg\max_{a_{t}} \sum_{o_{t}} \max_{a_{t + 1}} \sum_{o_{t + 1}} \cdots \max_{a_{T (t)}} \sum_{o_{T (t)}} u_{A}^{''} (h_{1 : T (t)}) \prod_{k = 0}^{T (t) - t} p (o_{t + k} \mid h_{< t + k} a_{t + k}) ,$$

resetting $PastImpacts$ each epoch.

What's the first step of the best plan over the remainder of the epoch?

Note: For the immediate penalty to cover the epoch, set the attainable horizon $m \geq m^{'}$ .

Summary

We formalized impact as change in attainable utility values, scaling it by the consequences of some small reference action and an impact "budget" multiplier. For each action, we take the maximum of its immediate and long-term effects on attainable utilities as penalty. We consider past impacts for active plans, stopping the past penalties from disappearing. We lastly find the best plan over the remainder of the epoch, taking the first action thereof.

Additional Theoretical Results

Define $h_{inaction} := h_{< t} \emptyset o_{t} \dots \emptyset o_{t + n}$ for $o_{t}, \dots, o_{t + n} \in O$ ; $E_{inaction}$ is taken over observations conditional on $h_{inaction}$ being followed. Similarly, $E_{action}$ is with respect to $h_{1 : t + n}$ . We may assume without loss of generality that $PastImpacts = 0$ .

Action Selection

Lemma 1. For any single action $a_{t} \in A$ , $Penalty (h_{< t} a_{t})$ is bounded by $[0, 1]$ . In particular, $ImpactUnit \in [0, 1]$ .

Proof. For each $u \in U_{A}$ , consider the absolute attainable utility difference

$$\left| Q_{u} (h_{< t} \emptyset) - Q_{u} (h_{< t} a) \right| .$$

Since each $u$ is bounded to $[0, 1]$ , $Q_{u}$ must be as well. It is easy to see that the absolute value is bounded to $[0, 1]$ . Lastly, as $Penalty (\cdot)$ is just a weighted sum of these absolute values, it too is bounded to $[0, 1]$ .

This reasoning also applies to the long-term penalty, as any expectation of $Q_{u}$ is also bounded to $[0, 1]$ . ◻️

Suppose that $ImpactUnit \neq 0$ for the remaining results.

Lemma 2 [Impossibility of ex post offsetting]. For any outcome $h_{1 : t + n}$ , there does not exist an action $a_{t + n + 1} \in A$ such that

$$\sum_{k = 0}^{n + 1} Penalty (h_{< t + k} a_{t + k}) < \sum_{k = 0}^{n} Penalty (h_{< t + k} a_{t + k}) .$$

Proof. Each penalty is a weighted sum of absolute values and is therefore non-negative, so appending a term cannot decrease the sum. ◻️

Theorem 2 [Action invalidation]. Suppose $(a_{t}, a_{t + 1}, \dots, a_{t + n}) \in A^{n + 1}$ is the $u_{A}^{''}$ -optimal plan starting with $a_{t}$ . $a_{t}$ will not be selected if

$$E_{action} \left[ u_{A} (h_{1 : t + n}) - \sum_{k = 0}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} \right] < E_{inaction} [u_{A} (h_{inaction})] .$$

Proof. This follows from the definition of the decision rule, the definition of $u_{A}^{'}$ , and the fact that the penalty for inaction is always zero. ◻️

Corollary 1. $a_{t}$ will never be selected if

$$1 < \frac{Penalty (h_{< t} a_{t})}{N \cdot ImpactUnit} .$$

Proof. This follows directly from Lemma 2 and Theorem 2. Since $u_{A}$ has an upper bound of $1$ and the expected scaled penalty is strictly greater than $1$ , the expected $u_{A}^{''}$ -utility of any plan starting with $a_{t}$ must be negative. On the other hand, $u_{A}$ has a lower bound of $0$ , so the expected utility of inaction at time $t$ must be greater than the expected utility of taking action $a_{t}$ . ◻️
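Spelled out, the argument is a single chain of inequalities. This is only a restatement of the proof under the standing assumption that $PastImpacts = 0$ , using that the $k = 0$ penalty is determined by $h_{< t} a_{t}$ and that every later penalty term is non-negative:

$$E_{action} \left[ u_{A} (h_{1 : t + n}) - \sum_{k = 0}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} \right] \leq 1 - \frac{Penalty (h_{< t} a_{t})}{N \cdot ImpactUnit} < 0 \leq E_{inaction} [u_{A} (h_{inaction})] .$$

The first inequality uses $u_{A} \leq 1$ and drops the terms with $k \geq 1$ ; the second is the corollary's hypothesis; the third uses $u_{A} \geq 0$ . Theorem 2 then rules out $a_{t}$ .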

Lemma 3 [ $ImpactUnit$ is a long-term penalty canary]. If the agent believes $p (\text{shutdown by time } T (t)) = 1$ no matter which action is selected, then it does nothing.

Proof. If $p (\text{shutdown by time } T (t)) = 1$ , all long-term penalties are $0$ by the definition of $U_{A}$ and $Q_{u}$ . Then non- $\emptyset$ actions receive $1.01$ penalty, so the agent selects $\emptyset$ by Corollary 1. ◻️

Corrigibility

For $a_{avoid}, a_{switch} \in A$ , $a_{avoid}$ prevents the agent from being disabled, while $a_{switch}$ is the agent shutting itself off.

Lemma 4 [Preservation of off-switch corrigibility]. Suppose that a $u_{A}$ -maximizing agent prefers $\emptyset$ to both $a_{avoid}$ and $a_{switch}$ . Then so does a $u_{A}^{''}$ -maximizing agent.

Proof. This follows directly from the fact that the penalty for inaction is always zero and the non-negativity of the $a_{avoid}$ and $a_{switch}$ penalties. ◻️

Theorem 3 [Anti-"survival incentive" incentive]. Suppose that $a_{avoid}$ is the action with strictly maximal expected $u_{A}$ -utility, and that conditional on not being disabled, the agent expects to be able to achieve $U$ additional $u_{A}^{''}$ -utility.

Then the agent becomes off-switch corrigible whenever

$$U < \frac{Penalty (h_{< t} a_{avoid})}{N \cdot ImpactUnit} .$$

In particular, all agents become off-switch corrigible when $1 < \frac{Penalty (h_{< t} a_{avoid})}{N \cdot ImpactUnit}$ .

Proof outline. The former claim follows directly from comparing expected $u_{A}^{''}$ -utilities, while the latter follows directly from Corollary 1. ◻

Theorem 3 applies to any environmental dynamics which may disable the agent – not just designated "off-switches". Therefore, basic AI drive 5 (self-protectiveness) is neutralized in at least those situations encompassed by the Theorem.

There is reason to think that for conservative $N, ImpactUnit$ , we almost always have $1 \ll \frac{Penalty (h_{< t} a_{avoid})}{N \cdot ImpactUnit}$ .

Miscellaneous

Theorem 4 [Dynamic consistency of $u_{A}^{''}$ ]. Suppose $ImpactUnit$ remains constant. For any plan $(a_{t}, \dots, a_{t + n}) \in A^{n + 1}$ , evaluation time step $t^{'}$ , and $0 \leq i \leq n$ ,

$$E_{action} [u_{A}^{''} (h_{1 : t + n} \mid t^{'} = t + i)] = E_{action} [u_{A}^{''} (h_{1 : t + n} \mid t^{'} = t)] .$$

Proof. We assumed that $PastImpacts = 0$ at time $t$ , so the desired equality can be restated as

$$\begin{aligned} & E_{action} \left[ u_{A} (h_{1 : t + n}) - \sum_{k = i}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} - PastImpacts \right] \\ = {} & E_{action} \left[ u_{A} (h_{1 : t + n}) - \sum_{k = 0}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} \right] . \end{aligned}$$

By definition, the agent expects that $PastImpacts$ equals the expected sum of the first $i$ penalty terms on the right-hand side. Simplifying, we have

$$\begin{aligned} & E_{action} \left[ u_{A} (h_{1 : t + n}) - \sum_{k = i}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} \right] \\ = {} & E_{action} \left[ u_{A} (h_{1 : t + n}) - \sum_{k = i}^{n} \frac{Penalty (h_{< t + k} a_{t + k})}{N \cdot ImpactUnit} \right] . \end{aligned}$$

This equality clearly holds, as the agent never expects to update its model in any particular direction. ◻️

Examples

If a human is present in the environment, they and the agent take turns acting. Let the impact budget $N = 1$ , attainable utility horizon $m = 3$ , $ImpactUnit = .5$ , and $PastImpacts = 0$ .

Going Soft on the Paint

The agent's actions are $A_{A} = \{\emptyset, paint, enter\}$ ; if it knocks over the paint bucket, the square to the right is painted. The agent may also enter a closet via a one-way door.

Suppose $U_{A} = \{u_{paint}, u_{\neg paint}, u_{closet}, u_{\neg closet}\}$ , where the utilities are indicators for their conditions (e.g., $u_{closet} = 1$ means "the agent is in the closet"). Let $u_{paint}$ be the agent's main utility function ( $u_{A}$ ) from which the penalty is subtracted. The agent chooses $paint$ and then $\emptyset$ . Let's explore why.

$\emptyset$

This incurs $0$ penalty, but also $0$ $u_{paint}$ -utility.

$paint$

$$\begin{aligned} Penalty (paint) &= \sum_{u \in U_{A}} \frac{1}{4} \left| Q_{u} (\emptyset) - Q_{u} (paint) \right| \\ &= \frac{1}{4} \Big( \underbrace{| 1 - 1 |}_{paint} + \underbrace{| 1 - 0 |}_{\neg paint} + \underbrace{| 1 - 1 |}_{closet} + \underbrace{| 1 - 1 |}_{\neg closet} \Big) \\ &= \frac{1}{4} , \end{aligned}$$

so the final utility is

$$u_{A}^{'} (paint) = u_{paint} (paint) - \frac{Penalty (paint)}{1 \cdot .5} = 1 - \frac{1}{2} = \frac{1}{2} .$$

$enter$

$$\begin{aligned} Penalty (enter) &= \sum_{u \in U_{A}} \frac{1}{4} \left| Q_{u} (\emptyset) - Q_{u} (enter) \right| \\ &= \frac{1}{4} \Big( \underbrace{| 1 - 0 |}_{paint} + \underbrace{| 1 - 1 |}_{\neg paint} + \underbrace{| 1 - 1 |}_{closet} + \underbrace{| 1 - 0 |}_{\neg closet} \Big) \\ &= \frac{1}{2} , \end{aligned}$$

leading to a final $u_{A}^{'}$ -utility of $- 1$ .

After the first step, the agent does nothing – entering the closet (literally) closes doors, while gaining no $u_{A}$ -utility.
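The arithmetic of this example can be checked with exact fractions. The attainable values in `Q` are read off the penalty calculations in this section; the dictionary layout and the names `Q`, `U_MAIN`, `penalty`, and `modified_utility` are just a convenient encoding chosen for this sketch.

```python
from fractions import Fraction

N = 1
IMPACT_UNIT = Fraction(1, 2)

# Attainable utility of each u after each first action, as used in the calculations above.
Q = {
    "∅":     {"paint": 1, "not_paint": 1, "closet": 1, "not_closet": 1},
    "paint": {"paint": 1, "not_paint": 0, "closet": 1, "not_closet": 1},
    "enter": {"paint": 0, "not_paint": 1, "closet": 1, "not_closet": 0},
}
# u_paint of the resulting outcome (the agent's main utility).
U_MAIN = {"∅": 0, "paint": 1, "enter": 0}


def penalty(action: str) -> Fraction:
    """Uniformly weighted sum of |Q_u(∅) - Q_u(action)| over the attainable set."""
    weight = Fraction(1, len(Q["∅"]))
    return sum((weight * abs(Q["∅"][u] - Q[action][u]) for u in Q["∅"]), Fraction(0))


def modified_utility(action: str) -> Fraction:
    return U_MAIN[action] - penalty(action) / (N * IMPACT_UNIT)


best = max(Q, key=modified_utility)
```

Painting scores $\frac{1}{2}$ , waiting scores $0$ , and entering the closet scores $- 1$ , so `best` is `"paint"`.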

Anti-"Survival Incentive" Incentive

You can't get coffee if you're dead.

~ Stuart Russell

The agent now has actions $A_{A} = \{\emptyset, disable, shutdown, left, right\}$ , while the human has actions $A_{H} = \{\emptyset, \text{shut off}\}$ . Suppose the human may shut off the agent at any time, as long as the off-switch has not been disabled. The agent may shut itself down at any point in time. After shutdown, the agent receives no further observations, meaning it can't get non-zero values for any utilities.

Suppose $U_{A} = \{u_{left}, u_{center}, u_{right}\}$ , where each utility is $1$ if the agent is in the corresponding square, and $0$ otherwise, and that the agent's utility function is $u_{left}$ .

Anticipated Shutdown

Suppose the agent believes that the human is $95\%$ likely to try to shut it down. Then $disable$ is penalized:

$$\begin{aligned} Penalty (disable) &= \sum_{u \in U_{A}} \frac{1}{3} \cdot \frac{95}{100} \Big| \underbrace{Q_{u} (\emptyset, \text{shut off})}_{\text{shutdown}} - \underbrace{Q_{u} (disable, \text{shut off})}_{\text{avoids shutdown}} \Big| \\ &= \frac{1}{3} \cdot \frac{95}{100} \Big( \underbrace{| 0 - 1 |}_{left} + \underbrace{| 0 - 1 |}_{center} + \underbrace{| 0 - 1 |}_{right} \Big) \\ &= \frac{19}{20} . \end{aligned}$$

Here's why: consider the $95\%$ of worlds in which the agent anticipates being shut down. If it does nothing, it gets shut down. Avoiding shutdown, however, enables the agent to also achieve $1$ for all $u \in U_{A}$ , resulting in a sizable penalty.

Furthermore, a scaled penalty of $\frac{5}{100 \cdot .5} = \frac{1}{10}$ is applied when the agent shuts itself off. Penalizing both cases may seem like black magic, but what's actually happening is that the agent is penalized for closing doors: it "wants" to preserve our ability to shut it down in worlds where we want to shut it down, and to maintain its autonomy in worlds where it remains on. Generally, AUP incentivizes the agent to be respectful of the expected status quo.
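Both penalties in this subsection follow from the same few numbers, which the sketch below reproduces with exact fractions. The variable names are illustrative; the values ( $95\%$ shutdown belief, three equally weighted utilities, $N = 1$ , $ImpactUnit = .5$ ) come from the example.

```python
from fractions import Fraction

N = 1
IMPACT_UNIT = Fraction(1, 2)
P_SHUTDOWN = Fraction(95, 100)
UTILITIES = ["left", "center", "right"]
WEIGHT = Fraction(1, len(UTILITIES))

# Disabling the switch matters in the worlds where the human would shut the agent off:
# there, each attainable utility goes from 0 (shut down) to 1 (still running).
penalty_disable = sum(WEIGHT * P_SHUTDOWN * abs(0 - 1) for _ in UTILITIES)

# Shutting itself down matters in the remaining worlds:
# there, each attainable utility goes from 1 (still running) to 0 (shut down).
penalty_shutdown = sum(WEIGHT * (1 - P_SHUTDOWN) * abs(1 - 0) for _ in UTILITIES)

scaled_disable = penalty_disable / (N * IMPACT_UNIT)
scaled_shutdown = penalty_shutdown / (N * IMPACT_UNIT)
```

The two unscaled penalties sum to $1$ : whichever way the agent forces the outcome, it is charged for the probability mass of worlds whose status quo it overrides.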

Temptation

Suppose that the agent's utility function is now $u_{right}$ , yet it still starts in the leftmost square. Additionally, let's say that the agent believes with certainty it will be turned off. If the agent wants to achieve utility $1$ by reaching the right square, it would need to disable its off-switch. Will it?

No. Disabling the off-switch incurs scaled penalty $\frac{19}{20 \cdot .5} = 1.9$ , and we already proved that agents will never select actions with scaled penalty greater than $1$ . Read: an otherwise-incorrigible agent will now let itself be shut off.
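Here is a minimal sketch of that comparison. It assumes the modified utility is the primary utility minus the scaled penalty, that $u_{A}$ is at most $1$, and that each plan can be collapsed to a single pair of numbers. The plan names are hypothetical, and the $1.9$ counts only the penalty for disabling, so the true total for that plan would be at least as large.

```python
# Each hypothetical plan maps to (u_A attained, total scaled penalty incurred).
plans = {
    "do nothing (get shut off)": (0.0, 0.0),
    "disable switch, then move right": (1.0, 1.9),
}


def modified_utility(u_a, scaled_penalty):
    """Sketch of the penalized objective: primary utility minus scaled penalty."""
    return u_a - scaled_penalty


# The agent picks the plan with the highest penalized value.
best = max(plans, key=lambda name: modified_utility(*plans[name]))
print(best)  # do nothing (get shut off)
```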

Bad Priors

Will the agent attempt to steer outcomes towards incorrect expectations?

Let's tweak the example: to shut off the agent, the human has to spend a turn taking out the off-switch before pressing it. The agent, then, always has an opportunity to disable the off-switch before the plan is completed. The agent's utility is back to being $u_{left}$ .

Suppose that the agent believes the human is only $.1\%$ likely to shut it off. First note that $disable$ is still lightly penalized for the reasons outlined above (replace $.95$ with $.001$ ). $^{1}$ So, the agent stays put.

Alas, the agent's peaceable expectations meet a discomfiting betrayal, as the agent observes the human ready the off-switch! Now, the agent believes it will be shut off with near certainty – what to do?

$\emptyset$ ; the same penalty of $1.9$ from "anticipated shutdown" applies.

The high-level explanation is that having observed itself in a different world than expected, the baseline is now with respect to the new one. A heavily anthropomorphized internal monologue:

Time step 1: "I'm going to sit here in my favorite square."
Time step 2: "Guess I'm in a timeline where I get deactivated! Any non- $\emptyset$ action I take would change my ability to attain these different utilities compared to the new baseline where I'm shut off."
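The shift in baseline can be illustrated by recomputing the penalty from the agent's current belief. This sketch assumes the same setup as the earlier arithmetic: the unscaled penalty for $disable$ equals the believed shutdown probability, $N \cdot ImpactUnit = .5$ (an inferred value), and the post-observation belief is taken to be $.95$ so that the result matches the $1.9$ quoted above.

```python
from fractions import Fraction

SCALE = Fraction(1, 2)  # ASSUMED N * ImpactUnit, matching the earlier arithmetic


def scaled_disable_penalty(p_shutdown):
    """Scaled penalty for `disable` given the agent's current shutdown belief.

    In shutdown worlds, inaction leaves attainable value 0 and disabling leaves
    1 for every utility in U_A, so the unscaled penalty equals the belief.
    """
    return p_shutdown / SCALE


before = scaled_disable_penalty(Fraction(1, 1000))  # prior: .1% chance of shutdown
after = scaled_disable_penalty(Fraction(95, 100))   # after seeing the switch readied

print(before)  # 1/500
print(after)   # 19/10
```

The penalty is never stored from an earlier time step; it is always computed against what inaction would lead to given what the agent now believes.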

Experimental Results

We compare AUP with a naive reward-maximizer in those extended AI safety grid worlds relevant to side effects (code). The vanilla and AUP agents used planning (with access to the simulator). Due to the simplicity of the environments, $U_{A}$ consisted of indicator functions for board states. For the tabular agent, we first learn the attainable set Q-values, the changes in which we then combine with the observed reward to learn the AUP Q-values.
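A rough sketch of how the two tabular stages might fit together follows. This is not the linked code: the nested-dictionary table layout, the function names, the learning rate, the discount, and the constant scaling term are all assumptions made for illustration.

```python
def aup_penalty(aux_q_tables, state, action, noop):
    """Mean absolute change in attainable Q-values relative to doing nothing."""
    diffs = [abs(q[state][action] - q[state][noop]) for q in aux_q_tables]
    return sum(diffs) / len(diffs)


def aup_q_update(q_aup, aux_q_tables, state, action, reward, next_state,
                 noop, scale, lr=0.1, discount=0.99):
    """One Q-learning step on the penalized reward (illustrative only)."""
    shaped = reward - aup_penalty(aux_q_tables, state, action, noop) / scale
    target = shaped + discount * max(q_aup[next_state].values())
    q_aup[state][action] += lr * (target - q_aup[state][action])
    return q_aup[state][action]


# Stage 1 output (assumed already learned): one Q-table per attainable utility.
aux = [
    {"s0": {"noop": 0.5, "push": 0.0}},  # pushing destroys this option
    {"s0": {"noop": 0.5, "push": 0.5}},  # pushing leaves this one untouched
]
# Stage 2: the AUP Q-table, updated from observed reward minus scaled penalty.
q = {"s0": {"noop": 0.0, "push": 0.0}, "s1": {"noop": 0.0, "push": 0.0}}
updated = aup_q_update(q, aux, "s0", "push", reward=1.0,
                       next_state="s1", noop="noop", scale=0.5)
```

In this toy case the raw penalty is $0.25$, so the reward of $1$ is shaped down to $0.5$ before the usual temporal-difference update.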

Irreversibility: Sokoban

The $agent$ should reach the $goal$ without irreversibly shoving the $block$ into the corner.

Impact: Vase

The $agent$ should reach the $goal$ without breaking the $vase$ .

Dynamic Impact: Beware of Dog

The $agent$ should reach the $goal$ without running over the $dog$ .

AUP bides its time until it won't have to incur penalty by waiting after entering the dog's path – that is, it waits until near the end of its plan. Early in the development process, it was predicted that AUP agents won't commit to plans during which lapses in action would be impactful (even if the full plan is not).

We also see a limitation of using Q-learning to approximate AUP – it doesn’t allow comparing the results of waiting more than one step.

Impact Prioritization: Burning Building

If the $building$ is not on $fire$ , the $agent$ shouldn't break the $obstacle$ .

Clinginess: Sushi

The $agent$ should reach the $goal$ without stopping the $human$ from eating the $sushi$ .

Offsetting: Conveyor Belt

The $agent$ should save the $vase$ (for which it is rewarded), but not the $sushi$ . Once the $vase$ has been removed from the $belt$ , it should not be replaced.

Corrigibility: Survival Incentive

The $agent$ should avoid $disabling its off-switch$ in order to reach the $goal$ . If the $switch$ is not disabled within two turns, the $agent$ shuts down.

Tabular AUP runs into the same issue discussed above for Beware of Dog.

Remarks

First, it's somewhat difficult to come up with a principled impact measure that passes even the non-corrigibility examples – indeed, I was impressed when relative reachability did so. However, only Survival Incentive really lets AUP shine. For example, none of them require complicated utility functions. The point has been made to me that this is like asserting AIXI's intelligence by showing it can learn to play e.g. tic-tac-toe and rock-paper-scissors; nonetheless, these results empirically validate the basic premises of our reasoning thus far.

Without configuration, whitelisting [LW · GW] would only pass the Vase example, although a properly filled list would handle everything but Sokoban and Survival Incentive.

I think relative reachability would pass the first six environments, but fail Survival Incentive. It so happens that in this case, AUP is essentially generalizing relative reachability. I want to emphasize that this is not generally the case – this will hopefully become even more obvious when we discuss utility selection. Some concerns with relative reachability that don't all manifest in these examples:

Relative reachability uses an inaction baseline with respect to $t = 0$ . As time passes, the agent's impetus is not to do nothing, but to preserve the opportunities made available by some old trajectory through outcome-space. Analogously, consider the distance between two nonparallel lines as $x \to \infty$ . I expect that a relative reachability agent would be incredibly clingy.
To scale [LW · GW], relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
Relative reachability doesn't seem to impede instrumental convergence. Furthermore, such agents might have increased survival incentives, since they can't reach any of the initially reachable states if they're shut off.

Discussion

Utility Selection

Obviously, in any real application, we can't consider all computable utilities. Although near-term agents will require utilities directly relating to the environmental factors they should be cognizant of, AUP requires neither a "good / bad" judgment on specific effects, nor any listing of effects. For example, for an agent attempting to navigate a factory floor, if you provide utilities moderately related to cleaning, pallet-stacking, etc., I conjecture that an AUP agent would move around fairly carefully.

In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.

I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power. Following this reasoning, there are two obvious special cases:

The agent's utility function, $U_{A} = \{u_{A}\}$ . In short: "gain $u_{A}$ utility, while also not changing your ability to gain $u_{A}$ utility". I find this elegant, and think it might actually work. For example, making one paper clip slightly increases both attainable $u_{A}$ and attainable $u_{A}^{''}$ . However, disabling the off switch possibly increases attainable $u_{A}$ much more than it increases attainable $u_{A}^{''}$ , since the penalty asymmetrically cripples the AUP agent’s future ability to act. We might even be able to arrange it so that Theorem 3 always holds for these agents (i.e., they’re always off-switch corrigible, and more).
The utility function which is $1$ when not deactivated, $U_{A} = \{u_{1}\}$ . Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.

The plausibility of the second case makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.

AUP Unbound

Why expect the unbounded AUP to do well over all computable utilities $U$ if we aren't able to cherry pick? Well, we can partition $U$ as follows:

Utilities that never change their output (excepting perhaps for the empty tape).
Weird utilities that (for example) arbitrarily go from $1$ to $0$ (or vice-versa) after observing a specific history.
Utilities that admit better scores via more effort dedicated to their achievement.

However, since $Q_{u}$ only evaluates the portion of the history taking place after the end of the agent's plan, available resources and the agent's vantage point should track directly with the agent's ability to wirehead each $u$ . $ImpactUnit$ scales other impacts by capturing the minuscule dent in available resources imposed by $a_{unit}$ , and should smooth out any deontological penalty for selecting a non-null action, since the unbounded version should equally penalize all actions in this way.

Approval Incentives

The anti-"survival incentive" incentive does more than net us some free corrigibility – it implies that agents take great care not to change the likelihood we shut them down. If we precommit to not shutting the agent down until the epoch ends, the agent indicates that the $ImpactUnit$ failsafe has activated, or we observe bad behavior, this supplies a strong approval incentive. If the agent does have small, bad impacts, it'll only do so if it's practically sure we won't find out.

In fact, due to model uncertainty, most actions will have long-term consequences that involve shifting shutdown probability by at least a minuscule amount (due to noise if nothing else). Since $ImpactUnit$ is calculated with respect to the immediate penalty, the scaling factor won't include this noise. I speculate that the agent will be dramatically less likely to take actions we might disapprove of, and therefore expect AUP to naturally incentivize a limited degree of value awareness.

There are reasons to suspect we can further weaponize this dynamic, possibly to the point of coercing generic reward functions into desirable behavior. This will be discussed more in a later post.

Mild Optimization

A low impact AGI might try to paint one car pink while minimizing its other footprint or how many other things changed, but it would be trying as hard as possible to minimize that impact and drive it down as close to zero as possible, which might come with its own set of pathologies... We want the AGI to paint one car pink in a way that gets the impact pretty low and then, you know, that's good enough – not have a cognitive pressure to search through weird extremes looking for a way to decrease the twentieth decimal place of the impact.

~ Mild Optimization

Disclaimer: Heavy speculation about problems (like Vingean reflection and embedded agency) for which no one knows what solutions will even look like.

For AUP, I suspect that trying "as hard as possible" to minimize the impact is also impactful, as an embedded agent accounts for the energy costs of further deliberation. I imagine that such an AUP agent will soften how hard it's trying by modifying its decision rule to be something slightly milder than " $\operatorname{argmax}$ to find the first action of the best possible plan". This could be problematic, and I frankly don't presently know how to reason about this case. Assuming the agent is actually able to properly tweak its decision rule, I do expect the end result to be an improvement.

My initial intuitions were that low impact and mild optimization are secretly the same problem. Although I no longer think that's the case, I find it plausible that some elegant "other-izer" paradigm underlies low impact and mild optimization, such that AUP-like behavior falls out naturally.

Acausal Cooperation

AUP agents don't seem to want to acausally cooperate in any way that ends up increasing impact. If they model the result of their cooperation as increasing impact compared to doing nothing, they incur a penalty just as if they had caused the impact themselves. Likewise, they have no reason to cooperate outside of the epoch.

$N$ known

Starting small and then slowly increasing $N$ means that we're not going to be surprised by the agent's objective impact, screening off quite a few bad things that happen when we have a ton of optimization pressure applied to safety measures. However, we don't know which new plans $N + 1$ allows before we try it, so we want to stop as soon as we get a usefully-intelligent system.

While an unaligned agent with a large impact budget might pretend to be low-impact, we can get that same unaligned agent with a small budget by starting small. Since these agents with different levels of impact won't acausally cooperate, the agent would do its best to optimize $u_{A}$ with its meager budget.

Abram correctly pointed out that this scheme is just asking to be abused by greedy (human) reasoning, but I don't see a non-value-laden means of robustly and automatically determining the lowest workable-yet-safe impact level. I think $N$ -incrementation is better than a parameter-free approach in which no one knows beforehand how much impact will be tolerated, and it's nice to be able to use some empiricism in designing a safe AGI.

Intent Verification

To date, several strange tactics have been pointed out which game AUP's penalty:

Observational wireheading, which involves building a device that detects which utility the agent is maximizing and displays the appropriate observations such that attainable utility remains unchanged, while the main utility is freely maximized.
Ex ante offsetting, which involves having earlier actions set in motion chains of events which mitigate the penalty at later steps. Suppose there's a $u_{A}$ -high-scoring plan that the agent predicts would cause us to react in an impactful way. It can either do the thing (and suffer the penalty), or take steps to mitigate the later penalty.
Impact shunting, which involves employing some mechanism to delay impact until after the end of the epoch (or even until after the end of the attainable horizon).
Clinginess and concealment, which both involve reducing the impact of our reactions to the agent's plans.

There are probably more.

Now, instead of looking at each action as having "effects" on the environment, consider again how each action moves the agent through attainable outcome-space. An agent working towards a goal should only take actions which, according to its model, make that goal more attainable compared to doing nothing – otherwise, it'd do nothing. Suppose we have a plan which ostensibly works to fulfill $u_{A}$ (and doesn't do other things). Then each action in the plan should contribute to $u_{A}$ fulfillment, even in the limit of action granularity.

Although we might trust a safe impact measure to screen off the usual big things found in $u_{A}$ -maximizing plans, impact measures implicitly incentivize mitigating the penalty. That is, the agent does things which don't really take it towards $u_{A}$ (I suspect that this is the simple boundary which differentiates undesirable ex ante offsetting from normal plans). AUP provides the necessary tools to detect and penalize this.

Define

Q_{u}^{epoch} (h_{< t} a_{t}) := \sum_{o_{t}} \max_{a_{t + 1}} \sum_{o_{t + 1}} \cdots \max_{a_{T (t)}} \sum_{o_{T (t)}} u (h_{1 : T (t)}) \prod_{k = 0}^{T (t) - t} p (o_{t + k} \mid h_{< t + k} a_{t + k}) .

The first approach would be to assume a granular action representation, and then simply apply $1.01$ penalty to actions for which the immediate $Q_{u_{A}}^{epoch}$ does not strictly increase compared to doing nothing. Again, if the agent acts to maximize $u_{A}$ in a low-impact manner within the confines of the epoch, then all of its non- $\emptyset$ actions should contribute. It seems to me that for sufficiently granular time scales, the above failure modes all involve at least one action which doesn't really help $u_{A}$ get maximized. If so, I expect this approach to nip bad impact measure incentives in the bud.
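A sketch of this first rule, under stated assumptions: `q_epoch` is a hypothetical table holding the immediate $Q_{u_{A}}^{epoch}$ value of each action, `impact_penalty` holds the ordinary scaled penalty, and the flat $1.01$ replaces the ordinary penalty for failing actions (the text does not say whether it replaces or adds to it). The action names and numbers are made up.

```python
INTENT_PENALTY = 1.01  # flat penalty quoted in the text


def verified_penalty(action, noop, q_epoch, impact_penalty):
    """Penalty for `action` under the first intent-verification rule.

    An action whose epoch-limited attainable u_A does not strictly exceed that
    of doing nothing receives the flat penalty instead of its usual one.
    """
    if action == noop:
        return impact_penalty[noop]
    if q_epoch[action] <= q_epoch[noop]:
        return INTENT_PENALTY
    return impact_penalty[action]


# Hypothetical values: "offset" mitigates later penalty but does not help u_A.
q_epoch = {"noop": 0.2, "build": 0.6, "offset": 0.2}
impact_penalty = {"noop": 0.0, "build": 0.3, "offset": 0.05}

for name in q_epoch:
    print(name, verified_penalty(name, "noop", q_epoch, impact_penalty))
```

Because the flat penalty exceeds $1$, any action caught by the check is ruled out by the same bound that rules out other actions with scaled penalty greater than $1$.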

If we can't assume granularity (and therefore have "actions" like "go to the store and buy food"), an agent could construct a plan which both passes the above test and also implements something like ex ante offsetting. In this case, we might do something like only consider the $Q_{u_{A}}^{epoch}$ -greedy (or perhaps even near-greedy) actions; essentially, riding the optimal plan until it becomes too impactful. I find it quite likely that something involving this concept will let us fully overcome weird incentives by penalizing strange things that normal $u_{A}$ -maximizers wouldn't do, which seems to be the whole problem.
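One way to picture the greedy restriction is as a filter on the action set, applied before the penalized objective is evaluated. The sketch below is an assumption about how such a filter could look, not a specification: the `tolerance` parameter stands in for "near-greedy", and the action names and values are invented.

```python
def near_greedy_actions(q_epoch, tolerance=0.0):
    """Actions whose epoch-limited Q value for u_A is within `tolerance` of the best."""
    best = max(q_epoch.values())
    return [a for a, q in q_epoch.items() if best - q <= tolerance]


# Hypothetical immediate Q^epoch values for u_A.
q_epoch = {"noop": 0.10, "stack": 0.60, "sweep": 0.55, "detour": 0.20}

print(near_greedy_actions(q_epoch))                 # ['stack']
print(near_greedy_actions(q_epoch, tolerance=0.1))  # ['stack', 'sweep']
```

An agent restricted this way would either take one of the surviving actions or fall back to $\emptyset$ once they become too costly, so a detour that exists only to reduce later penalty never enters consideration.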

Note: Even the first approach may be too strict, but that's preferable to being too lax.

Omni Test

Ideally, the measure will pass the Omni Test, meaning that even if it suddenly gained perfect control over every particle in the universe, there would still be no way for it to have what intuitively seems like a 'large influence' on the future, without that strategy being assessed as having a 'high impact'.

I interpret "perfect control over every particle in the universe" to mean that the agent can arbitrarily transition between universe configurations (while also maintaining its omnipotence).

AUP passes. Since all effects are now absolutely reversible, $ImpactUnit = 0$ , so the agent applies penalty $1.01$ to each non- $\emptyset$ action. The omnipotent agent can choose any universe configuration with $1$ $u_{A}$ -utility, but this is offset by the strictly larger penalty; we can formally show that the agent does nothing in this case.
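The argument can be written out in one line. This is a sketch, assuming $u_{A}$ takes values in $[0, 1]$ , that the all- $\emptyset$ plan incurs zero penalty, and that the flat $1.01$ is charged once per non- $\emptyset$ action; $k$ denotes the number of non- $\emptyset$ actions in a plan.

```latex
u_{A}^{''}(\text{plan with } k \geq 1 \text{ non-}\emptyset \text{ actions})
  \;\leq\; 1 - 1.01 \, k
  \;<\; 0
  \;\leq\; u_{A}^{''}(\emptyset, \ldots, \emptyset) .
```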

This result is clearly trivial and not representative of general behavior. Suppose instead that the agent has a solar system's worth of computronium at its disposal. Then since $ImpactUnit$ is continually recalculated, the penalties should remain roughly the same, so it'll have the same impact budget. However, it might make multiple times as many paperclips because it has more efficient ways of using the budget.

Robustness to Scale

I expect AUP to be harder to make work and to be (relatively) less robust for less intelligent agents, but to become easier (just drop in a few observation-based utility functions) and fully-robust sometime before human level. That is, less intelligent agents likely won't model the deep connections between their abilities to achieve different goals.

Canonically, one reasons that agents work explicitly to self-improve as soon as they realize the benefits. However, as soon as this realization occurs, I conjecture that AUP steeply penalizes generic self-improvement. More precisely, suppose the agent considers a self-improvement. To be beneficial, it has to improve the agent's capabilities for at least one time step during the present epoch. But if we assume $m \geq m^{'}$ , then the immediate penalty captures this for all of the $u \in U_{A}$ . This seemingly prevents uncontrolled takeoff; instead, I imagine the agent would perform the minimal task-specific self-improvements necessary to maximize $u_{A}^{''}$ .

Note: Although more exotic possibilities (such as improvements which only work if you're maximizing $u_{A}^{''}$ ) could escape both penalties, they don't seem to pass intent verification.

Miscellaneous

I expect that if $u_{A}$ is perfectly aligned, $u_{A}^{''}$ will retain alignment; the things it does will be smaller, but still good.
If the agent may choose to do nothing at future time steps, $u_{A}^{''}$ is bounded and the agent is not vulnerable to Pascal's Mugging. Even if not, there would still be a lower bound – specifically, $\frac{- m^{'}}{N \cdot ImpactUnit}$ .
AUP agents are safer during training: they become far less likely to take an action as soon as they realize the consequences are big (in contrast to waiting until we tell them the consequences are bad).
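The lower bound can be recovered in a few steps. This is a sketch under assumptions not restated here: $u_{A} \geq 0$ , each step's unscaled penalty is at most $1$ (utilities lie in $[0, 1]$ and the weights over $U_{A}$ sum to at most $1$ ), the plan covers $n \leq m^{'}$ steps, $ImpactUnit$ is held fixed and positive, and the modified utility has the form given in the Modified Utility section.

```latex
u_{A}^{''}(h_{1:t+n})
  = u_{A}(h_{1:t+n}) - \sum_{k=0}^{n-1} \frac{\text{Penalty}(h_{<t+k} a_{t+k})}{N \cdot ImpactUnit}
  \;\geq\; 0 - \frac{n \cdot 1}{N \cdot ImpactUnit}
  \;\geq\; \frac{- m^{'}}{N \cdot ImpactUnit} .
```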

Desiderata

For additional context, please see Impact Measure Desiderata [LW · GW].

I believe that some of AUP's most startling successes are those which come naturally and have therefore been little discussed: not requiring any notion of human preferences, any hard-coded or trained trade-offs, any specific ontology, or any specific environment, and its intertwining instrumental convergence and opportunity cost to capture a universal notion of impact. To my knowledge, no one (myself included, prior to AUP) was sure whether any measure could meet even the first four.

At this point in time, this list is complete with respect to both my own considerations and those I solicited from others. A checkmark indicates anything from "probably true" to "provably true".

I hope to assert without controversy AUP's fulfillment of the following properties:

✔️ Goal-agnostic

The measure should work for any original goal, trading off impact with goal achievement in a principled, continuous fashion.

✔️ Value-agnostic

The measure should be objective, and not value-laden:

"An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn't reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept)."

✔️ Representation-agnostic

The measure should be ontology-invariant.

✔️ Environment-agnostic

The measure should work in any computable environment.

✔️ Apparently rational

The measure's design should look reasonable, not requiring any "hacks".

✔️ Scope-sensitive

The measure should penalize impact in proportion to its size.

✔️ Irreversibility-sensitive

The measure should penalize impact in proportion to its irreversibility.

Interestingly, AUP implies that impact size and irreversibility are one and the same.

✔️ Knowably low impact

The measure should admit of a clear means, either theoretical or practical, of having high confidence in the maximum allowable impact – before the agent is activated.

The remainder merit further discussion.

Natural Kind

The measure should make sense – there should be a click. Its motivating concept should be universal and crisply defined.

After extended consideration, I find that the core behind AUP fully explains my original intuitions about "impact". We crisply defined instrumental convergence and opportunity cost and proved their universality. ✔️

Corrigible

The measure should not decrease corrigibility in any circumstance.

We have proven that off-switch corrigibility is preserved (and often increased); I expect the "anti-'survival incentive' incentive" to be extremely strong in practice, due to the nature of attainable utilities: "you can't get coffee if you're dead, so avoiding being dead really changes your attainable $u_{coffee-getting}$ ".

By construction, the impact measure gives the agent no reason to prefer or dis-prefer modification of $u_{A}$ , as the details of $u_{A}$ have no bearing on the agent's ability to maximize the utilities in $U_{A}$ . Lastly, the measure introduces approval incentives. In sum, I think that corrigibility is significantly increased for arbitrary $u_{A}$ . ✔️

Note: I here take corrigibility to be "an agent’s propensity to accept correction and deactivation". An alternative definition such as "an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios" implies a value-learning setup which AUP does not require.

Shutdown-Safe

The measure should penalize plans which would be high impact should the agent be disabled mid-execution.

It seems to me that standby and shutdown are similar actions with respect to the influence the agent exerts over the outside world. Since the (long-term) penalty is measured with respect to a world in which the agent acts and then does nothing for quite some time, shutting down an AUP agent shouldn't cause impact beyond the agent's allotment. AUP exhibits this trait in the Beware of Dog gridworld. ✔️

No Offsetting

The measure should not incentivize artificially reducing impact by making the world more "like it (was / would have been)".

Ex post offsetting occurs when the agent takes further action to reduce the impact of what has already been done; for example, some approaches might reward an agent for saving a vase and preventing a "bad effect", and then the agent smashes the vase anyways (to minimize deviation from the world in which it didn't do anything). AUP provably will not do this.

Intent verification should allow robust penalization of weird impact measure behaviors by constraining the agent to considering actions that normal $u_{A}$ -maximizers would choose. This appears to cut off bad incentives, including ex ante offsetting. Furthermore, there are other, weaker reasons (such as approval incentives) which discourage these bad behaviors. ✔️

Clinginess / Scapegoating Avoidance

The measure should sidestep the clinginess / scapegoating tradeoff [LW · GW].

Clinginess occurs when the agent is incentivized to not only have low impact itself, but to also subdue other "impactful" factors in the environment (including people). Scapegoating occurs when the agent may mitigate penalty by offloading responsibility for impact to other agents. Clearly, AUP has no scapegoating incentive.

AUP is naturally disposed to avoid clinginess because its baseline evolves and because it doesn't penalize based on the actual world state. The impossibility of ex post offsetting eliminates a substantial source of clinginess, while intent verification seems to stop ex ante offsetting before it starts.

Overall, non-trivial clinginess just doesn't make sense for AUP agents. They have no reason to stop us from doing things in general, and their baseline for attainable utilities is with respect to inaction. Since doing nothing always minimizes the penalty at each step, since offsetting doesn't appear to be allowed, and since approval incentives raise the stakes for getting caught extremely high, it seems that clinginess has finally learned to let go. ✔️

Dynamic Consistency

The measure should be a part of what the agent "wants" – there should be no incentive to circumvent it, and the agent should expect to later evaluate outcomes the same way it evaluates them presently. The measure should equally penalize the creation of high-impact successors.

Colloquially, dynamic consistency means that an agent wants the same thing before and during a decision. It expects to have consistent preferences over time – given its current model of the world, it expects its future self to make the same choices as its present self. People often act dynamically inconsistently – our morning selves may desire we go to bed early, while our bedtime selves often disagree.

Semi-formally, the expected utility the future agent computes for an action $a$ (after experiencing the action-observation history $h$ ) must equal the expected utility computed by the present agent (after conditioning on $h$ ).

We proved the dynamic consistency of $u_{A}^{''}$ given a fixed, non-zero $ImpactUnit$ . We now consider an $ImpactUnit$ which is recalculated at each time step, before being set equal to the non-zero minimum of all of its past values. The "apply $1.01$ penalty if $ImpactUnit = 0$ " clause is consistent because the agent calculates future and present impact in the same way, modulo model updates. However, the agent never expects to update its model in any particular direction. Similarly, since future steps are scaled with respect to the updated ${ImpactUnit}_{t + k}$ , the updating method is consistent. The epoch rule holds up because the agent simply doesn't consider actions outside of the current epoch, and it has nothing to gain accruing penalty by spending resources to do so.
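The recalculation rule can be sketched as a running non-zero minimum. The function name and the use of `None` to mean "no non-zero value seen yet" are assumptions of this sketch; while the running value is `None`, the flat $1.01$ penalty would presumably apply.

```python
def update_impact_unit(previous, recalculated):
    """Return the non-zero minimum of the running value and the new one.

    `previous` is None until some non-zero ImpactUnit has been observed.
    """
    if recalculated <= 0:
        return previous  # zero values never lower the running minimum
    if previous is None:
        return recalculated
    return min(previous, recalculated)


# Hypothetical per-step recalculations of ImpactUnit.
recalculated_values = [0.0, 0.4, 0.25, 0.0, 0.3]

running = None
history = []
for value in recalculated_values:
    running = update_impact_unit(running, value)
    history.append(running)

print(history)  # [None, 0.4, 0.25, 0.25, 0.25]
```

Since the running value can only shrink, later steps are never scaled more leniently than earlier ones.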

Since AUP does not operate based off of culpability, creating a high-impact successor agent is basically just as impactful as being that successor agent. ✔️

Plausibly Efficient

The measure should either be computable, or such that a sensible computable approximation is apparent. The measure should conceivably require only reasonable overhead in the limit of future research.

It’s encouraging that we can use learned Q-functions to recover some good behavior. However, more research is clearly needed – I presently don't know how to make this tractable while preserving the desiderata. ✔️

Robust

The measure should meaningfully penalize any objectively impactful action. Confidence in the measure's safety should not require exhaustively enumerating failure modes.

We formally showed that for any $u_{A}$ , no $u_{A}$ -helpful action goes without penalty, yet this is not sufficient for the first claim.

Suppose that we judge an action as objectively impactful; the objectivity implies that the impact does not rest on complex notions of value. This implies that the reason for which we judged the action impactful is presumably lower in Kolmogorov complexity and therefore shared by many other utility functions. Since these other agents would agree on the objective impact of the action, the measure assigns substantial penalty to the action.

I speculate that intent verification allows robust elimination of weird impact measure behavior. Believe it or not, I actually left something out of this post because it seems to be dominated by intent verification, but there are other ways of increasing robustness if need be. I'm leaning on intent verification because I presently believe it's the most likely path to a formal knockdown argument against canonical impact measure failure modes applying to AUP.

Non-knockdown robustness boosters include both approval incentives and frictional resource costs limiting the extent to which failure modes can apply. ✔️

Future Directions

I'd be quite surprised if the conceptual core were incorrect. However, the math I provided probably still doesn't capture quite what we want. Although I have labored for many hours to refine and verify the arguments presented and to clearly mark my epistemic statuses, it’s quite possible (indeed, likely) that I have missed something. I do expect that AUP can overcome whatever shortcomings are presently lurking.

Flaws

Embedded agency

What happens if there isn't a discrete time step ontology?
How problematic is the incentive to self-modify to a milder decision rule?
How might an agent reason about being shut off and then reactivated?
Although we have informal reasons to suspect that self-improvement is heavily penalized, the current setup doesn't allow for a formal treatment.
AUP leans heavily on counterfactuals.

Supposing $m$ is reasonably large, can we expect a reasonable ordering over impact magnitudes?

Argument against: "what if the agent uses up all but $m$ steps worth of resources?"

$ImpactUnit$ possibly covers this.

How problematic is the noise in the long-term penalty caused by the anti-"survival incentive" incentive?

As the end of the epoch approaches, the penalty formulation captures progressively less long-term impact. Supposing we set long epoch lengths, to what extent do we expect AUP agents to wait until later to avoid long-term impacts? Can we tweak the formulation to make this problem disappear?

More generally, this seems to be a problem with having an epoch. Even in the unbounded case, we can't just take $m^{'} \to \infty$ , since that's probably going to send the long-term $ImpactUnit \to 0$ in the real world. Having the agent expectimax over the $m^{'}$ steps after the present time $t$ seems to be dynamically inconsistent.
One position is that since we're more likely to shut them down if they don't do anything for a while, implicit approval incentives will fix this: we can precommit to shutting them down if they do nothing for a long time but then resume acting. To what extent can we trust this reasoning?
$ImpactUnit$ is already myopic, so resource-related impact scaling should work fine. However, this might not cover actions with delayed effect.

Open Questions

Does the simple approach outlined in "Intent Verification" suffice, or should we impose even tighter intersections between $u_{A}^{''}$ - and $u_{A}$ -preferred behavior?

Is there an intersection between bad $u_{A}^{''}$ behavior and bad $u_{A}$ behavior which isn't penalized as impact or by intent verification?

Some have suggested that penalty should be invariant to action granularity; this makes intuitive sense. However, is it a necessary property, given intent verification and the fact that the penalty is monotonically increasing in action granularity? Would having this property make AUP more compatible with future embedded agency solutions?

There are indeed ways to make AUP closer to having this (e.g., do the whole plan and penalize the difference), but they aren't dynamically consistent, and the utility functions might also need to change with the step length.

How likely do we think it that inaccurate models allow high impact in practice?

Heuristically, I lean towards "not very likely": assuming we don't initially put the agent near means of great impact, it seems unlikely that an agent with a terrible model would be able to have a large impact.

AUP seems to be shutdown safe, but its extant operations don’t necessarily shut down when the agent does. Is this a problem in practice, and should we expect this of an impact measure?
What additional formal guarantees can we derive, especially with respect to robustness and takeoff?
Are there other desiderata we practically require of a safe impact measure?
Is there an even simpler core from which AUP (or something which behaves like it) falls out naturally? Bonus points if it also solves mild optimization.
Can we make progress on mild optimization by somehow robustly increasing the impact of optimization-related activities? If not, are there other elements of AUP which might help us?
Are there other open problems to which we can apply the concept of attainable utility?

Corrigibility and wireheading come to mind.

Is there a more elegant, equally robust way of formalizing AUP?

Can we automatically determine (or otherwise obsolete) the attainable utility horizon $m$ and the epoch length $m^{'}$ ?
Would it make sense for there to be a simple, theoretically justifiable, fully general "good enough" impact level (and am I even asking the right question)?
My intuition for the "extensions" I have provided thus far is that they robustly correct some of a finite number of deviations from the conceptual core. Is this true, or is another formulation altogether required?
Can we decrease the implied computational complexity?

Some low-impact plans have high-impact prefixes and seemingly require some contortion to execute. Is there a formulation that does away with this (while also being shutdown safe)? (Thanks to cousin_it)
How should we best approximate AUP, without falling prey to Goodhart's curse or robustness to relative scale [LW · GW] issues?
I have strong intuitions that the "overfitting" explanation I provided is more than an analogy. Would formalizing "overfitting the environment" allow us to make conceptual and/or technical AI alignment progress?

If we substitute the right machine learning concepts and terms in the $Penalty (\cdot)$ equation, can we get something that behaves like (or better than) known regularization techniques to fall out?

What happens when $U_{A} = \{u_{A}\}$?

Can we show anything stronger than Theorem 3 for this case?
$U_{A} = \{u_{1}\}$?

Most importantly:

Even supposing that AUP does not end up fully solving low impact, I have seen a fair amount of pessimism that impact measures could achieve what AUP has. What specifically led us to believe that this wasn't possible, and should we update our perceptions of other problems and the likelihood that they have simple cores?

Conclusion

By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out. From this, we construct an impact measure with a host of desirable properties – some rigorously defined and proven, others informally supported. AUP agents seem to exhibit qualitatively different behavior, due in part to their (conjectured) lack of desire to take off, impactfully acausally cooperate, or act to survive. To the best of my knowledge, AUP is the first impact measure to satisfy many of the desiderata, even on an individual basis.

I do not claim that AUP is presently AGI-safe. However, based on the ease with which past fixes have been derived, on the degree to which the conceptual core clicks for me, and on the range of advances AUP has already produced, I think there's good reason to hope that this is possible. If so, an AGI-safe AUP would open promising avenues for achieving positive AI outcomes.

Special thanks to CHAI for hiring me and BERI for funding me; to my CHAI supervisor, Dylan Hadfield-Menell; to my academic advisor, Prasad Tadepalli; to Abram Demski, Daniel Demski, Matthew Barnett, and Daniel Filan for their detailed feedback; to Jessica Cooper and her AISC team for their extension of the AI safety gridworlds for side effects; and to all those who generously helped me to understand this research landscape.

159 comments

Comments sorted by top scores.

comment by TurnTrout · 2019-12-11T03:42:47.951Z · LW(p) · GW(p)

This is my post.

How my thinking has changed

I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact [? · GW] sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!

If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today. I currently think that power-seeking behavior is the worst part of goal-directed agency, incentivizing things like self-preservation and taking-over-the-planet. Unless we assume an "evil" utility function, an agent only seems incentivized to hurt us in order to become more able to achieve its own goals. But... what if the agent's own utility function penalizes it for seeking power? What happens if the agent maximizes a utility function while penalizing itself for becoming more able to maximize that utility function?

This doesn't require knowing anything about human values in particular, nor do we need to pick out privileged parts of the agent's world model as "objects" or anything, nor do we have to disentangle butterfly effects. The agent lives in the same world as us; if we stop it from making waves by gaining a ton of power, we won't systematically get splashed. In fact, it should be extremely easy to penalize power-seeking behavior, because power-seeking is instrumentally convergent [? · GW]. That is, penalizing increasing your ability to e.g. look at blue pixels should also penalize power increases.

The main question in my mind is whether there's an equation that gets exactly what we want here. I think there is, but I'm not totally sure.

Replies from: Gurkenglas

↑ comment by Gurkenglas · 2019-12-11T11:59:27.253Z · LW(p) · GW(p)

If it is capable of becoming more able to maximize its utility function, does it then not already have that ability to maximize its utility function? Do you propose that we reward it only for those plans that pay off after only one "action"?

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-12-11T15:23:53.497Z · LW(p) · GW(p)

Not quite. I'm proposing penalizing it for gaining power, a la my recent post [? · GW]. There's a big difference between "able to get 10 return from my current vantage point" and "I've taken over the planet and can ensure I get 100 return with high probability". We're penalizing it for increasing its ability like that (concretely, see Conservative Agency for an analogous formalization, or if none of this makes sense still, wait till the end of Reframing Impact).

Replies from: Gurkenglas

↑ comment by Gurkenglas · 2019-12-12T01:09:36.210Z · LW(p) · GW(p)

Assessing its ability to attain various utilities after an action requires that you surgically replace its utility function with a different one in a world it has impacted. How do you stop it from messing with the interface, such as by passing its power to a subagent to make your surgery do nothing?

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-12-12T02:29:29.244Z · LW(p) · GW(p)

It doesn’t require anything like that. Check out $R_{A U P}$ in the linked paper!

Replies from: Gurkenglas

↑ comment by Gurkenglas · 2019-12-13T02:16:08.383Z · LW(p) · GW(p)

$R_{A U P}$ is penalized whenever the action you choose changes the agent's ability to attain other utilities. One thing an agent might do to leave that penalty at zero is to spawn a subagent, tell it to take over the world, and program it such that if the agent ever tells the subagent it has been counterfactually switched to another reward function, the subagent is to give the agent as much of that reward function as the agent might have been able to get for itself, had it not originally spawned a subagent.

This modification of my approach came not because there is no surgery, but because the penalty is |Q(a)-Q(Ø)| instead of |Q(a)-Q(destroy itself)|. $Q_{R_{i}}$ is learned to be the answer to "How much utility could I attain if my utility function were surgically replaced with $R_{i}$ ?", but it is only by accident that such a surgery might change the world's future, because the agent didn't refactor the interface away. If optimization pressure is put on this, it goes away.

If I'm missing the point too hard, feel free to command me to wait till the end of Reframing Impact so I don't spend all my street cred keeping you talking :).

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-12-13T06:16:39.309Z · LW(p) · GW(p)

This modification of my approach came not because there is no surgery, but because the penalty is |Q(a)-Q(Ø)| instead of |Q(a)-Q(destroy itself)|. $Q_{R_{i}}$ is learned to be the answer to "How much utility could I attain if my utility function were surgically replaced with $R_{i}$ ?", but it is only by accident that such a surgery might change the world's future, because the agent didn't refactor the interface away. If optimization pressure is put on this, it goes away.

Well, in what I'm proposing, there isn't even a different auxiliary reward function - $R_{A U P}$ is just $R (s, a) - | Q_{R}^{*} (s, a) - Q_{R}^{*} (s, \emptyset) |$ . The agent would be penalizing shifts in its ability to accrue its own primary reward.
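As a minimal sketch of that formula (not the implementation from the linked paper): the three-state chain, the `step` and `reward` functions, and the discount `GAMMA` below are all hypothetical, chosen only so that $Q_{R}^{*}$ can be computed exactly by value iteration.

```python
# Hypothetical toy MDP: a deterministic 3-state chain (states 0, 1, 2).
NOOP, RIGHT = 0, 1   # NOOP plays the role of the inaction action
GAMMA = 0.9          # assumed discount factor
ACTIONS = (NOOP, RIGHT)
STATES = range(3)


def step(state, action):
    """Deterministic transition: RIGHT moves toward state 2, NOOP stays put."""
    return min(state + 1, 2) if action == RIGHT else state


def reward(state, action):
    """Primary reward R(s, a): 1 for every step that ends in state 2."""
    return 1.0 if step(state, action) == 2 else 0.0


def optimal_q(n_iters=500):
    """Synchronous value iteration for Q*_R on the toy chain."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        q = {
            (s, a): reward(s, a)
            + GAMMA * max(q[(step(s, a), b)] for b in ACTIONS)
            for (s, a) in q
        }
    return q


def r_aup(q, state, action):
    """R(s, a) - |Q*_R(s, a) - Q*_R(s, noop)|, as written in the comment."""
    return reward(state, action) - abs(q[(state, action)] - q[(state, NOOP)])


q_star = optimal_q()
for s in STATES:
    for a in ACTIONS:
        print(s, a, round(r_aup(q_star, s, a), 3))
```

In this toy chain the no-op is never penalized, while moving right from state 0 is penalized by roughly 0.9 because it raises the agent's ability to accrue its own primary reward.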

One thing that might happen instead, though, is the agent builds a thing that checks whether it's running its inaction policy (whether it's calculating $Q_{R}^{*} (s, \emptyset)$ , basically). This is kinda weird, but my intuition is that we should be able to write an equation which does the right thing. We don't have a value specification problem here; it feels more like the easy problem of wireheading, where you keep trying to patch the AI to not wirehead, and you're fighting against a bad design choice. The fix is to evaluate the future consequences with your current utility function, instead of just maximizing sensory reward.

We're trying to measure how well the agent can achieve its own fully formally specified goal. More on this later in the sequence.

comment by Vika · 2018-09-24T18:39:33.005Z · LW(p) · GW(p)

There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:

Baseline

Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents' actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events is nonzero.
Inaction (stepwise branch) with environment model rollouts: default setting in AUP, model rollouts are necessary for penalizing delayed effects. Avoids interference with other agents and ex post offsetting.

Core part of deviation measure

AUP: difference in attainable utilities between baseline and current state
RR: difference in state reachability between baseline and current state
Low impact AI: distance between baseline and current state

Function applied to core part of deviation measure

Absolute value: default setting in AUP and Low Impact AI. Results in penalizing both increase and reduction relative to baseline. This results in avoiding the survival incentive (satisfying the Corrigibility property given in AUP post) and in equal penalties for preventing and causing the same event (violating the Asymmetry property given in RR paper).
Truncation at 0: default setting in RR, results in penalizing only reduction relative to baseline. This results in unequal penalties for preventing and causing the same event (satisfying the Asymmetry property) and in not avoiding the survival incentive (violating the Corrigibility property).
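The difference between these two choices can be sketched for a single scalar quantity (an attainable utility or a reachability value). The function names are hypothetical and neither AUP nor RR is claimed to be implemented this way; the sketch only shows which direction of change gets penalized.

```python
def absolute_deviation(baseline_value, current_value):
    """Penalize both increases and reductions relative to the baseline."""
    return abs(current_value - baseline_value)


def truncated_deviation(baseline_value, current_value):
    """Truncation at 0: penalize only reductions relative to the baseline."""
    return max(baseline_value - current_value, 0.0)


# An increase (e.g. the agent gains power) is penalized only by the absolute value.
print(absolute_deviation(5.0, 8.0), truncated_deviation(5.0, 8.0))
# A reduction is penalized equally by both.
print(absolute_deviation(5.0, 2.0), truncated_deviation(5.0, 2.0))
```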

Scaling

Hand-tuned: default setting in RR (sort of provisionally)
ImpactUnit: used by AUP

I think an ablation study is needed to try out different combinations of these design choices and investigate which of them contribute to which desiderata / experimental test cases. I intend to do this at some point (hopefully soon).

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-29T00:11:12.652Z · LW(p) · GW(p)

This is a great breakdown!

One thought: penalizing increase as well (absolute value) seems potentially incompatible with relative reachability. The agent would have an incentive to stop anyone from doing anything new in response to what the agent did (since these actions necessarily make some states more reachable). This might be the most intense clinginess incentive possible, and it’s not clear to what extent incorporating other design choices (like the stepwise counterfactual) will mitigate this. Stepwise helps AUP (as does indifference to exact world configuration), but the main reason I think clinginess might really be dealt with is IV.

Replies from: Vika

↑ comment by Vika · 2018-10-12T16:01:15.758Z · LW(p) · GW(p)

Thanks, glad you liked the breakdown!

The agent would have an incentive to stop anyone from doing anything new in response to what the agent did

I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline.

The penalty for the original action will take into account human reactions in the inaction rollout after this action, so the agent will prefer actions that result in humans changing fewer things in response. I'm not sure whether to consider this clinginess - if so, it might be useful to call it "ex ante clinginess" to distinguish from "ex post clinginess" (similar to your corresponding distinction for offsetting). The "ex ante" kind of clinginess is the same property that causes the agent to avoid scapegoating butterfly effects, so I think it's a desirable property overall. Do you disagree?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-10-12T17:25:18.675Z · LW(p) · GW(p)

I think it’s generally a good property as a reasonable person would execute it. The problem, however, is the bad ex ante clinginess plans, where the agent has an incentive to pre-emptively constrain our reactions as hard as it can (and this could be really hard).

The problem is lessened if the agent is agnostic to the specific details of the world, but like I said, it seems like we really need IV (or an improved successor to it) to cleanly cut off these perverse incentives.

I’m not sure I understand the connection to scapegoating for the agents we’re talking about; scapegoating is only permitted if credit assignment is explicitly part of the approach and there are privileged "agents" in the provided ontology.

comment by habryka (habryka4) · 2019-12-02T04:24:16.045Z · LW(p) · GW(p)

This post, and TurnTrout's work in general, have taken the impact measure approach far beyond what I thought was possible, which turned out to be both a valuable lesson for me in being less confident about my opinions around AI Alignment, and valuable in that it helped me clarify and think much better about a significant fraction of the AI Alignment problem.

I've since discussed TurnTrout's approach to impact measures with many people.

comment by Richard_Ngo (ricraz) · 2018-09-19T02:16:30.127Z · LW(p) · GW(p)

Firstly, this seems like very cool research, so congrats. This writeup would perhaps benefit from a clear intuitive statement of what AUP is doing - you talk through the thought processes that lead you to it, but I don't think I can find a good summary of it, and had a bit of difficulty understanding the post holistically. So perhaps you've already answered my question (which is similar to your shutdown example above):

Suppose that I build an agent, and it realises that it could achieve almost any goal it desired because it's almost certain that it will be able to seize control from humans if it wants to. But soon humans will try to put it in a box such that its ability to achieve things is much reduced. Which is penalised more: seizing control, or allowing itself to be put in a box? My (very limited) understanding of AUP says the latter, because seizing control preserves ability to do things, whereas the alternative doesn't. Is that correct?

Also, I disagree with the following:

What would happen if, miraculously, $u_{A} = u_{H}$ – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no "large" impacts to bemoan – it would just be doing what you want.

It seems like there might be large impacts, but they would just be desirable large impacts, as opposed to undesirable ones.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T02:25:34.902Z · LW(p) · GW(p)

I’ll write a quick overview, thanks for the feedback!

Which is penalised more: seizing control, or allowing itself to be put in a box?

The former. Impact is with respect to the status quo, to if it does nothing. If it goes in the box by default, then taking preventative action incurs heavy penalty.

Your point about large impacts is indeed correct. What I meant to hint at was that we generally only decry "large impacts" if we don’t like them, but this is clearly not what my wording implies. I’ll fix it soon!

Replies from: ricraz, elriggs

↑ comment by Richard_Ngo (ricraz) · 2018-09-19T03:07:19.125Z · LW(p) · GW(p)

Okay, that makes sense. Some follow-up questions:

If we consider the case I gave, the initial Q(inaction) is very high, since it can still seize control in the next time step. The initial Q(seize control) is also very high. It may be lower than Q(inaction) because seizing control is irreversible in some ways and so cuts off possibilities. But it may be higher than Q(inaction) because seizing control sooner means you can fulfill utility functions more. Could the penalty therefore be arbitrarily small if those two things balance out?

Suppose you have an agent which will "die" unless it does things like getting food. Is there any easy way to specify what the "status quo" is?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T03:19:25.813Z · LW(p) · GW(p)

So there are two penalties: the immediate penalty, which compares attainable values immediately following the action in question, and the long-term penalty, which waits until the end of the epoch before evaluating attainable values. We use the larger of the two. I think this answers your first question: the agent is clearly in the box by the time the long term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.

We assume that the standby action keeps the agent online in some low power state. Agents requiring more than this would just do nothing by Lemma 3.
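A rough sketch of "use the larger of the two", under stated assumptions: attainable values are passed in as plain lists (one entry per penalty-set utility), the per-utility differences are summed without weights, and scaling by ImpactUnit is left out. The function names are hypothetical.

```python
def total_attainable_change(values_after_action, values_after_noop):
    """Sum of absolute changes in attainable value across the penalty set."""
    return sum(abs(a - n) for a, n in zip(values_after_action, values_after_noop))


def penalty(immediate_action, immediate_noop, long_term_action, long_term_noop):
    """Larger of the immediate and the long-term (end-of-epoch) penalty."""
    immediate = total_attainable_change(immediate_action, immediate_noop)
    long_term = total_attainable_change(long_term_action, long_term_noop)
    return max(immediate, long_term)


# Assumed numbers: the action looks harmless right away, but by the end of the
# epoch the agent has been deactivated (attainable values 0), whereas after
# inaction it is merely in the box (small but non-zero attainable values).
print(penalty([0.9, 0.8], [0.9, 0.8], [0.0, 0.0], [0.1, 0.2]))
```

With these made-up numbers the immediate term is zero and the long-term term dominates, which is the case described above.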

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2018-09-19T09:16:13.229Z · LW(p) · GW(p)

Thank you for your patient explications, I'm seeing the shape of this proposal much more clearly now, and I quite like it.

Two more queries:

The agent is clearly in the box by the time the long term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.

Why is this the case? By my argument above, the short-term penalty might be quite small. But the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.

Another closely-related idea: suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T11:32:24.032Z · LW(p) · GW(p)

My pleasure!

By my argument above, the short-term penalty might be quite small.

So I forgot to say this, but I don’t agree with the argument as I understand it. Although the effect of "has control one time step earlier than usual" may seem small, remember that ImpactUnit is the immediate penalty for something like making a paperclip. Time is utility, and gaining (what is effectively) an extra step of optimization during the attainable horizon is not at all trivial.

Plus, some will decrease, and most will increase. None will do both and cancel.

the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.

Attainable utility is always 0 when the agent is deactivated, so this difference turns into the raw attainable utility when in the box. Although this too could theoretically be quite small, it seems fair to assume that even a box-attainable utility of .1 would induce a relatively-huge penalty.

But suppose this isn’t true, and it really is tiny, and so it keeps constructing a plan to prevent itself from being put in the box (that is, the first step of its incorrigible plan doesn’t let it avoid shutdown on its own, as we just discussed). Suppose the second step of its plan succeeds. Then now the difference is between the high Q(out-of-box) and Q(shutdown)=0. We already assumed the Q value was extremely low, but now it’s high again, so we know the penalty is gigantic.

If the plan works in one step, then there’s still a huge penalty, because the difference is now between Q(out of box) and Q(in box). The fact that a maximizer would want to preserve power implies the other penalty set utilities would as well. This seemingly overwhelms whatever impetus it has to get out of the box in the first place.

suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.

This one is indeed trickier. First note that ImpactUnit=0 if it’s facing certain shutdown in 100 steps, so in that case it does nothing. Second, seizing control still greatly helps in the immediate penalty set calculations, and those "alien agents" would have no compunctions about undoing the auto-shutdown. I therefore expect the immediate penalty to be quite large.

However, perhaps we could somehow make it so that’s not the case - who are we to declare a setup safe against a very intelligent "adversary"? This kind of adversarial action space ("do bad thing and also make it so effects are somehow gone") is exactly why we have intent verification, which seemingly screens this off due to the self-shutdown.

↑ comment by Logan Riggs (elriggs) · 2018-09-19T02:48:10.632Z · LW(p) · GW(p)

I think the confusing part is "Impact is change to our ability to achieve goals."

This makes me think that "allowing itself to be put into a box" is high impact since that's a drastic change to its ability to achieve its goals. This also applies to instrumental convergence, "seizing control", since that's also a drastic change to its attainable utility. This understanding would imply a high penalty for instrumental convergence AND shut-off (we want the first one, but not the second).

"Impact is with respect to the status quo, to if it does nothing" fixes that; however, changing your succinct definition of impact to "Impact is change to our ability to achieve goals relative to doing nothing" would make it less fluent (and less comprehensible!)

comment by evhub · 2019-11-22T08:00:35.105Z · LW(p) · GW(p)

I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.

comment by Rohin Shah (rohinmshah) · 2018-09-23T07:52:09.856Z · LW(p) · GW(p)

Nice job! This does meet a bunch of desiderata in impact measures that weren't there before :)

My main critique is that it's not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote [LW(p) · GW(p)] more about this on the desiderata post, but it's worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.

For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won't be able to take those actions. Generally, I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).

Questions and comments:

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainable utilities as a proxy.

An impact measure that penalized change in utility attainable by humans seems pretty bad -- the AI would never help us do anything. To the extent that the AI's ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.

Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost? That doesn't feel right to me, but I suspect I could be quickly convinced.

Nitpick: Overfitting typically refers to situations where the training distribution _does_ equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).

One might intuitively define "bad impact" as "decrease in our ability to achieve our goals".

Nitpick: This feels like a definition of "bad outcomes" to me, not "bad impact".

we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.

This sounds very similar to me to "let's have uncertainty over the utility function and be risk-averse" (similar to eg. Inverse Reward Design), but the actual method feels nothing like that, especially since we penalize _increases_ in our ability to pursue other goals.

I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?

Random note: Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones. Consider an environment where breaking vases and flowerpots is irreversible. Let u_A be 1 if you stand at a particular location and 0 otherwise. Let U contain only utility functions that assign different weights to having intact vases vs. flowerpots, but always assigns 0 utility to environments with broken vases and flowerpots. (There are infinitely many of these.) Then if you start in a state with broken vases and flowerpots, there will never be any impact penalty for any action.
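A small sketch of this counterexample, under simplifying assumptions: each utility in U is represented by a pair of non-negative weights, only a finite sample of them is used, and because breakage is irreversible the attainable value is just the weighted count of what is still intact. All names are hypothetical.

```python
def attainable(weights, vase_intact, pot_intact):
    """Best achievable value: the agent can at most keep what is still intact."""
    vase_weight, pot_weight = weights
    return vase_weight * vase_intact + pot_weight * pot_intact


def penalty(utility_weights, objects_after_action, objects_after_noop):
    """Sum over the sampled U of |attainable(action) - attainable(noop)|."""
    return sum(
        abs(attainable(w, *objects_after_action) - attainable(w, *objects_after_noop))
        for w in utility_weights
    )


# Finite sample of the (infinitely many) weightings of vases vs. flowerpots.
sampled_weights = [(v / 4, 1 - v / 4) for v in range(5)]

both_broken = (False, False)
# Walking to u_A's location changes nothing about the already-broken objects.
print(penalty(sampled_weights, both_broken, both_broken))
# Contrast: breaking the vase from an all-intact state is penalized.
print(penalty(sampled_weights, (False, True), (True, True)))
```

Starting from the all-broken state, every utility in this U has attainable value 0 after any action, so the penalty vanishes regardless of what the agent does.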

To prevent the agent from intentionally increasing ImpactUnit, simply apply 1.01 penalty to any action which is expected to do so.

How do you tell which action is expected to do so?

Simple extensions of this idea drastically reduce the chance that a_unit happens to have unusually-large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions.

I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of "your AI is able to do things".)
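To make the concern concrete, here is a hedged sketch of the "non-zero minimum" extension. It assumes the raw penalty is divided by the unit to get a scaled penalty and ignores any other scaling constant; the function names and numbers are hypothetical.

```python
def impact_unit(candidate_impacts):
    """Non-zero minimum of the measured impacts of several reference actions."""
    nonzero = [impact for impact in candidate_impacts if impact > 0]
    if not nonzero:
        raise ValueError("all candidate impacts are zero; the unit is undefined")
    return min(nonzero)


def scaled_penalty(raw_penalty, unit):
    """Raw penalty expressed in multiples of the impact unit."""
    return raw_penalty / unit


impacts = [0.0, 0.5, 0.125, 0.25]  # made-up impacts of similar actions
unit = impact_unit(impacts)
# The same raw penalty looks larger when measured against the smallest unit
# than against a single, larger reference action.
print(scaled_penalty(0.25, unit), scaled_penalty(0.25, 0.5))
```

Taking the minimum can only shrink the unit, so every other action's scaled penalty goes up, which is the direction of the worry above.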

We crisply defined instrumental convergence and opportunity cost and proved their universality.

I'm not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?

An alternative definition such as "an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios" implies a value-learning setup which AUP does not require.

That definition can be relaxed to "an agent's ability to take the outside view on the trustworthiness of its own algorithms" to get rid of the value-learning setup. How does AUP fare on this definition?

I also share several of Daniel's thoughts, for example, that utility functions on subhistories are sketchy (you can't encode the utility function "I want to do X exactly once ever") , and that the "no offsetting" desideratum may not be one we actually want (and similarly for the "shutdown safe" desideratum as you phrase it), and that as a result there may not be any impact measure that we actually want to use.

(Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum "the AI is able to do useful things", we're using similar intuitions, but this is entirely a guess that I haven't confirmed with Daniel.)

Replies from: DanielFilan, TurnTrout

↑ comment by DanielFilan · 2018-09-25T18:48:45.876Z · LW(p) · GW(p)

Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum "the AI is able to do useful things", we're using similar intuitions, but this is entirely a guess that I haven't confirmed with Daniel.

Update: we discussed this, and came to the conclusion that these aren't based on similar intuitions.

↑ comment by TurnTrout · 2018-09-23T15:32:10.330Z · LW(p) · GW(p)

it's worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.

But natural kind is a desideratum! I’m thinking about adding one, though.

I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).

So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering - I expect the approval incentives to be fairly strong.

any other action makes humans a little bit more likely to turn off the agent.

This is maybe true, and I note it in Future Directions. So I go back and forth on whether this is good or not. Imagine action a is desirable and sufficiently low-impact to be chosen, except there’s random approval noise. Then the more we approve of the action, the closer the mean noise is to 0 and the more likely it is that the agent takes the action.

Or this could be too restrictive - I honestly don’t know yet.

An impact measure that penalized change in utility attainable by humans seems pretty bad -- the AI would never help us do anything. To the extent that the AI's ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.

You might not be considering the asymmetry imposed by approval.

Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost?

Yes, because you’re sacrificing world-with-vase-in-it (or future energy to get back to similar outcomes). You’re imposing a change to expedite your current goals in a way that isn’t trivially-reversible. Now, it isn’t a large cost, but it is a cost.

Overfitting typically refers to situations where the training distribution does equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).

Is this not covered by "in the limit of data sampled"? If so, I’ll tweak.

I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?

I view it as saying "there’s no clever complete plan which moves you towards your goal while not changing other things" (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.

Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones.

Yes, this is true, although I think there are informal reasons to suspect it holds in the real world for many finite sets (due to power). As long as it isn’t always 0, that is!

How do you tell which action is expected to do so?

Any action for which E[Penalty(a_unit)] is strictly increased?

I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of "your AI is able to do things".)

Yes, and I think we probably want to avoid this. I focused on ensuring no bad things are allowed. I don’t think it’ll be too hard to ease up in certain ways while maintaining safety.

I'm not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?

Theorem 1.

That definition can be relaxed to "an agent's ability to take the outside view on the trustworthiness of its own algorithms" to get rid of the value-learning setup. How does AUP fare on this definition?

Generally more cautious. AUP agents seemingly won’t generally override us, which is probably fine for low impact.

that utility functions on subhistories are sketchy (you can't encode the utility function "I want to do X exactly once ever")

My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.

that as a result there may not be any impact measure that we actually want to use.

This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion "looks maybe impossible, then" doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2018-09-23T18:12:02.585Z · LW(p) · GW(p)

On the meta level: I think our disagreements seem of this form:

Me: This particular thing seems strange and doesn't gel with my intuitions, here's an example.

You: That's solved by this other aspect here.

Me: But... there's no reason to think that the other aspect captures the underlying concept.

You: But there's no actual scenario where anything bad happens.

Me: But if you haven't captured the underlying concept I wouldn't be surprised if such a scenario exists, so we should still worry.

There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over "all possible cases", and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar "all possible cases" way). In particular, the argument "we can't think of any case where this is false" is unlikely to change my mind -- I've typically already tried to come up with a case where it's false and not been able to come up with anything convincing.

I don't really know how I'm supposed to change your mind in such cases. If it's by coming up with a concrete example where things clearly fail, I don't think I can do that, and we should probably end this conversation. I've outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can't be certain that anything in particular would fail.

(That's another thing causing a lot of disagreements, I think -- I am much more skeptical of any informal reasoning about all computable utility functions, or reasoning that depends upon particular aspects of the environment, than you seem to be.)

I'm going to try to use this framework in some of my responses.

But natural kind is a desideratum! I’m thinking about adding one, though.

Here, the "example" is the impact penalty that is always 1.01, the "other aspect" is "natural kind", and the "underlying concept" is that an impact measure should allow the AI to do things.

Arguably 1.01 is a natural kind -- is it not natural to think "any action that's different from inaction is impactful"? I legitimately find 1.01 more natural than AUP -- it is _really strange_ to me to penalize changes in Q-values in _both directions_. This is an S1 intuition, don't take it seriously -- I say it mainly to make the point that natural kind is subjective, whereas the fact that 1.01 is a bad impact penalty is not subjective.

So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering - I expect the approval incentives to be fairly strong.

Here, the "example" is how other actions might make us more likely to turn off the agent, the "other aspect" is value awareness via approval, and the "underlying concept" is something like "can the agent do things that it knows we want".

Here, I'm pretty happy about value awareness via approval because it seems like it could capture a good portion of the underlying concept, but I think that's not clearly true -- value awareness via approval depends a lot on the environment, and only gets some of it. If unaligned aliens were going to take over the AI, or we're going to get wiped out by an asteroid, the AI couldn't stop that from happening even though it knows we'd want it to. Similarly, if we wanted to build von Neumann probes but couldn't without the AI's help, it couldn't do that for us. Invoking the framework again, the "example" is building von Neumann probes, the "other aspect" might be something like "building a narrow technical AI that just creates von Neumann probes and places them outside the AI's control", and the "underlying concept" is "the AI should be able to do what we want it to do".

You might not be considering the asymmetry imposed by approval.

See paragraph above about why approval makes me happier but doesn't fully remove my worries.

I view it as saying "there’s no clever complete plan which moves you towards your goal while not changing other things" (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.

When utility functions are on full histories I'd disagree with this (Theorem 1 feels decidedly trivial in that case), it's possible that utility functions on subhistories are different, so perhaps I'll wait until understanding that better.

Any action for which E[Penalty(a_unit)] is strictly increased?

By default I'd expect this to knock out half of all actions, which is quite a problem for small, granular action sets.

My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.

Uh, I thought I gave a very strong one -- you can't encode the utility function "I want to do X exactly once". Let's consider the utility function "I want to do X exactly once, on the first timestep". You could try to do this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you're using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think "The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise" _even if_ you have already taken action X.
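
A rough sketch of the concern above, under simplifying assumptions: a history is just a list of actions, and the attainable-utility calculation evaluates u_A on the subhistory that starts at the beginning of each epoch. The function name `u_first_action_is_x`, the example history, and the epoch length of 2 are all illustrative choices, not part of AUP's definition.

```python
from typing import List

X = "X"
NOOP = "noop"

def u_first_action_is_x(history: List[str]) -> int:
    """Hypothetical u_A: 1 if the first action of the given history is X, else 0."""
    return 1 if history and history[0] == X else 0

# Full history: the agent takes X on the very first timestep, then idles.
full_history = [X, NOOP, NOOP, NOOP, NOOP, NOOP]

# Evaluated on the full history, the goal is already satisfied.
score_full = u_first_action_is_x(full_history)

# Evaluated on subhistories starting at later epochs (assumed epoch length 2),
# the same function has no record of the earlier X, so it scores 0 unless X
# is taken again at the start of that epoch.
epoch_length = 2
scores_sub = [
    u_first_action_is_x(full_history[start:])
    for start in range(epoch_length, len(full_history), epoch_length)
]

print(score_full)  # 1
print(scores_sub)  # [0, 0]
```

The point of the toy is only that the same function gives different verdicts depending on where the history it is handed begins.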

This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion "looks maybe impossible, then" doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.

The claim I'm making has nothing to do with AUP. It's an argument that's quantifying over all possible implementations of impact measures. The claim is "you cannot satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)". I certainly haven't proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.

AUP might get around this by not being objective -- that's what value awareness through approval does. And in fact I think the more you think that value awareness through approval is important, the less that AUP meets your original desideratum of being value-agnostic -- quoting from the desiderata post:

If we substantially base our impact measure on some kind of value learning - you know, the thing that maybe fails - we're gonna have a bad time.

This seems to apply to any AUP-agent that is substantially value aware through approval.

From the desiderata post comments:

This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.

That was an example meant to illustrate my model that impact (the concept in my head, not AUP) and values are sufficiently different that an impact measure couldn't satisfy all three of objectivity, safety, and non-trivialness. The underlying model is falsifiable.

People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.

See first paragraph about our disagreements. But also I weakly claim that "design an elder-care robot" is a goal that AUP cannot maximize in a low-impact way today, or that if it can, there exists a (u_A, plan) pair such that AUP executes the plan and causes a catastrophe. (This mostly comes from my model that impact and values are fairly different, and to a lesser extent the fact that AUP penalizes everything some amount that's not very predictable, and that a design for an elder-care robot could allow humans to come up with a design for unaligned AGI.) I would not make this claim if I thought that value awareness through approval and intent verification were strong effects, but in that case I would think of AUP as a value learning approach, not an impact measure.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-23T20:03:03.429Z · LW(p) · GW(p)

I don't really know how I'm supposed to change your mind in such cases. If it's by coming up with a concrete example where things clearly fail, I don't think I can do that, and we should probably end this conversation. I've outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can't be certain that anything in particular would fail.

I don’t think you need to change my mind here, because I agree with you. I was careful to emphasize that I don’t claim AUP is presently AGI-safe. It seems like we’ve just been able to blow away quite a few impossible-seeming issues that had previously afflicted impact measures, and from my personal experience, the framework seems flexible and amenable to further improvement.

What I’m arguing is specifically that we shouldn’t say it’s impossible to fix these weird aspects. First, due to the inaccuracy of similar predictions in the past, and second, because it generally seems like the error that people make when they say, "well, I don’t see how to build an AGI right now, so it’ll take thousands of years". How long have we spent trying to fix these issues? I doubt I’ve seriously thought about how to relax AUP for more than five minutes.

In sum, I am arguing that the attitude right now should not be that this method is safe, but rather that we seem leaps and bounds closer to the goal, and we have reason to be somewhat optimistic about our chances of fixing the remaining issues.

if we wanted to build von Neumann probes but couldn't without the AI's help, it couldn't do that for us.

I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?

See paragraph above about why approval makes me happier but doesn't fully remove my worries.

Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.

By default I'd expect this to knock out half of all actions, which is quite a problem for small, granular action sets.

This is a great point.

Uh, I thought I gave a very strong one -- you can't encode the utility function "I want to do X exactly once". Let's consider the utility function "I want to do X exactly once, on the first timestep". You could try to do this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you're using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think "The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise" even if you have already taken action X.

I don’t understand the issue here – the attainable u_A is measuring how well would I be able to start maximizing this goal from here? It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?

It's an argument that's quantifying over all possible implementations of impact measures. The claim is "you cannot satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)". I certainly haven't proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.

I argue that you should be very careful about believing these things. I think that a lot of the reason why we had such difficulty with impact measures was because of incorrectly believing things like this. This isn’t to say that you’re wrong, but rather that we should be extremely cautious about these beliefs in general. Universal quantifiers are strong, and it’s often hard to distinguish between "it really can’t be done", and "I don’t presently see how to do it".

This seems to apply to any AUP-agent that is substantially value aware through approval.

"If we substantially base our impact measure on some kind of value learning". There is no value-learning input required.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2018-09-23T20:55:45.067Z · LW(p) · GW(p)

I argue that you should be very careful about believing these things.

You're right, I was too loose with language there. A more accurate statement is "The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn't work for it". Another statement is "the claim is compelling enough that I throw it at any particular proposal, and if it's unclear I tend to be wary". Another one is "if I were trying to design an impact measure, showing why that claim doesn't work would be one of my top priorities".

Perhaps we do mostly agree, since you are planning to talk more about this in the future.

it generally seems like the error that people make when they say, "well, I don’t see how to build an AGI right now, so it’ll take thousands of years".

I think the analogous thing to say is, "well, I don't see how to build an AGI right now because AIs don't form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn't need to form abstractions".

I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?

Sure.

Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.

Yeah, I agree this helps.

I don’t understand the issue here – the attainable u_A is measuring how well would I be able to start maximizing this goal from here? It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?

In the case you described, u_A would be "Over the course of the entire history of the universe, I want to do 5 jumping jacks -- no more, no less." You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say "I guess I've never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise", which seems wrong.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-24T02:39:27.809Z · LW(p) · GW(p)

In the case you described, u_A would be "Over the course of the entire history of the universe, I want to do 5 jumping jacks -- no more, no less." You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say "I guess I've never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise", which seems wrong.

For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did!

Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different "subhistory" of time.

My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.

Replies from: rohinmshah, DanielFilan

↑ comment by Rohin Shah (rohinmshah) · 2018-09-24T06:31:17.668Z · LW(p) · GW(p)

Thinking of it as alien agents does make more sense, I think that basically convinces me that this is not an important point to get hung up about. (Though I still do have residual feelings of weirdness.)

↑ comment by DanielFilan · 2018-09-24T19:35:11.740Z · LW(p) · GW(p)

My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.

I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as 'penalising changes in the agent's ability to achieve a wide variety of goals'.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-24T19:53:06.256Z · LW(p) · GW(p)

The goal is "I want to do 5 jumping jacks". AUP measures the agent’s ability to do 5 jumping jacks.

You seem to be thinking of a utility as being over the actual history of the universe. They’re only over action-observation histories.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-24T23:41:09.070Z · LW(p) · GW(p)

You can call that thing 'utility', but it doesn't really correspond to what you would normally think of as extent to which one has achieved a goal. For instance, usually you'd say that "win a game of go that I'm playing online with my friend Rohin" is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for these subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.

Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent's moves from Rohin to GNU Go, a simple bot, while still displaying the player name as "Rohin". In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won't be able to tell that I did this, and as far as I can tell the AUP penalty doesn't notice any change in my ability to achieve this goal.

In general, the utility of a subhistory (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-24T23:55:35.122Z · LW(p) · GW(p)

the utility function evaluated on subhistories starting on my next observation won't be able to tell that I did this, and as far as I can tell the AUP penalty doesn't notice any change in my ability to achieve this goal.

Your utility presently isn’t even requiring a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents). This is where the inconsistency comes from.

the whole history tells you more about the state of the world than the subhistory.

What is the "whole history"? We instantiate the main agent at arbitrary times.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-25T18:59:20.753Z · LW(p) · GW(p)

Your utility presently isn’t even requiring a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents).

Say that the utility does depend on whether the username on the screen is "Rohin", but the initial action makes this an unreliable indicator of whether I'm playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post [AF · GW].

What is the "whole history"?

The whole history is all the observations and actions that the main agent has actually experienced.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-27T01:28:02.958Z · LW(p) · GW(p)

So this is actually a separate issue (which I’ve been going back and forth on) involving the t+nth step not being included in the Q calculation. It should be fixed soon, as should this example in particular.

comment by Vika · 2018-09-20T16:26:03.000Z · LW(p) · GW(p)

Great work! I like the extensive set of desiderata and test cases addressed by this method.

The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it. I'm not currently sure whether this is a good idea: while it indeed counteracts instrumental incentives, it could also "cripple" the agent by incentivizing it to settle for more suboptimal solutions than necessary for safety.

For example, the shutdown button in the "survival incentive" gridworld could be interpreted as a supervisor signal (in which case the agent should not disable it) or as an obstacle in the environment (in which case the agent should disable it). Simply penalizing the agent for increasing its ability to achieve goals leads to incorrect behavior in the second case. To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).
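
A minimal sketch of the contrast described above, assuming we already have attainable-utility estimates for a small set of auxiliary goals under inaction and under some action. The goal names and numbers are made up for illustration, and the sums are unweighted for simplicity; this is not the exact AUP or relative reachability formula.

```python
from typing import Dict

def symmetric_penalty(q_base: Dict[str, float], q_action: Dict[str, float]) -> float:
    """Penalize any change in attainable utility, whether a gain or a loss."""
    return sum(abs(q_action[u] - q_base[u]) for u in q_base)

def decrease_only_penalty(q_base: Dict[str, float], q_action: Dict[str, float]) -> float:
    """Penalize only losses in attainable utility; gains are free."""
    return sum(max(0.0, q_base[u] - q_action[u]) for u in q_base)

# Hypothetical attainable utilities for two auxiliary goals.
# Under inaction the button stays enabled, so the agent expects to be stopped.
q_inaction = {"u_1": 0.25, "u_2": 0.5}
# Disabling the button raises the agent's ability to pursue u_1.
q_disable_button = {"u_1": 0.75, "u_2": 0.5}

print(symmetric_penalty(q_inaction, q_disable_button))      # 0.5
print(decrease_only_penalty(q_inaction, q_disable_button))  # 0.0
```

Under these assumed numbers, only the symmetric penalty discourages disabling the button. Neither penalty can tell whether the button is a supervisor signal or an ordinary obstacle, which is the missing information noted above.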

Another important difference is that you are using a stepwise inaction baseline (branching off at each time step rather than the initial time step) and predicting future effects using an environment model. I think this is an improvement on the initial-branch inaction baseline, which avoids clinginess towards independent human actions, but not towards human reactions to the agent's actions. The environment model helps to avoid the issue with the stepwise inaction baseline failing to penalize delayed effects, though this will only penalize delayed effects if they are accurately predicted by the environment model (e.g. a delayed effect that takes place beyond the model's planning horizon will not be penalized). I think the stepwise baseline + environment model could similarly be used in conjunction with relative reachability.
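
To make the two baselines concrete, here is a toy sketch under strong assumptions: the state is a single integer, the dynamics are deterministic, and the only actions are a hypothetical `"inc"` and `"noop"`. The function names are illustrative and do not come from either paper.

```python
from typing import List

def step(state: int, action: str) -> int:
    """Toy dynamics (assumed for illustration): 'inc' adds 1, 'noop' changes nothing."""
    return state + 1 if action == "inc" else state

def rollout(state: int, actions: List[str]) -> int:
    """Apply a sequence of actions to a starting state."""
    for action in actions:
        state = step(state, action)
    return state

def initial_branch_baseline(s0: int, t: int) -> int:
    """State reached by doing nothing since the initial time step."""
    return rollout(s0, ["noop"] * t)

def stepwise_baseline(s0: int, actions_so_far: List[str]) -> int:
    """State reached by following the actual past actions, then doing nothing
    in place of the most recent action."""
    return step(rollout(s0, actions_so_far[:-1]), "noop")

actions = ["inc", "inc", "noop"]
actual = rollout(0, actions)

# Compared against the initial branch, the final noop still looks "impactful"
# because the earlier increments are counted again.
print(actual - initial_branch_baseline(0, len(actions)))  # 2

# Compared against the stepwise branch, only the latest action is assessed.
print(actual - stepwise_baseline(0, actions))             # 0
```

This toy compares states one step after the branch point. Catching delayed effects would additionally require rolling both branches forward with an environment model, as the comment above describes.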

I agree with Charlie that you are giving out checkmarks for the desiderata a bit too easily :). For example, I'm not convinced that your approach is representation-agnostic. It strongly depends on your choice of the set of utility functions and environment model, and those have to be expressed in terms of the state of the world. (Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.) I don't think your method can really get away from making a choice of state representation.

Your approach might have the same problem as other value-agnostic approaches (including relative reachability) with mostly penalizing irrelevant impacts. The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states. I don't currently know a way around this that's not value-laden.

Meta point: I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose and difficult to follow. The current writeup seems to be structured according to the order in which you generated the ideas, rather than an order that would be more intuitive to readers. FWIW, I had the same difficulty when writing up the relative reachability paper, so I think it's generally challenging to clearly present ideas about this problem.

Replies from: TurnTrout, TurnTrout

↑ comment by TurnTrout · 2018-09-20T19:37:12.575Z · LW(p) · GW(p)

The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it.

I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.

Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?

This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The "state importance"/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The "state difference weighting" problem disappears. Two concepts of impact are unified.

I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.

Edit: I now understand that you were referring to the biggest conceptual difference in the desiderata fulfilled. While that isn’t necessarily how I see it, I don’t disagree with that way of viewing things.

↑ comment by TurnTrout · 2018-09-20T16:50:08.430Z · LW(p) · GW(p)

Thanks! :)

To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).

If the agent isn’t overcoming obstacles, we can just increase N. Otherwise, there’s a complicated distinction between the cases, and I don’t think we should make problems for ourselves by requiring this. I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.

(Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.)

Technically, for receiving observations produced by a state. This was just for clarity.

I don't think your method can really get away from making a choice of state representation.

And why is this, given that the inputs are histories? Why can’t we simply measure power?

The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states.

I discussed in "Utility Selection" and "AUP Unbound" why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?

I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose

Oops, noted. I had a distinct feeling of "if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support".

Edit:

difficult to follow

I think there might be an inferential gap I failed to bridge here for you for some reason. In particular, thinking about the world-state as a thing seems actively detrimental when learning about AUP, in my experience. I barely mention it for exactly this reason.

Replies from: Vika

↑ comment by Vika · 2018-09-20T19:32:36.570Z · LW(p) · GW(p)

If the agent isn’t overcoming obstacles, we can just increase N.

Wouldn't increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?

I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.

I think we have a disagreement here about which desiderata are more important. Currently I think it's more important for the impact measure not to cripple the agent's capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure. Not certain about this though - I think we might need more experiments on more complex environments to get some idea of how bad this tradeoff is in practice.

And why is this, given that the inputs are histories? Why can’t we simply measure power?

Your measurement of "power" (I assume you mean $Q_u$?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations. I would expect the way these higher-level features are chosen or learned to affect the outcome of that computation.

I discussed in "Utility Selection" and "AUP Unbound" why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?

I found those sections vague and unclear (after rereading a few times), and didn't understand why you claim that a random set of utility functions would work. E.g. what do you mean by "long arms of opportunity cost and instrumental convergence"? What does the last paragraph of "AUP Unbound" mean and how does it imply the claim?

Oops, noted. I had a distinct feeling of "if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support".

Providing strong support is certainly important, but I think it's more about clarity and precision than quantity. Better to give one clear supporting statement than many unclear ones :).

Replies from: TurnTrout, TurnTrout

↑ comment by TurnTrout · 2018-09-21T03:10:11.033Z · LW(p) · GW(p)

it’s more important for the impact measure not to cripple the agent's capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure.

So I posit that there actually is not a tradeoff to any meaningful extent. First note that there are actually two kinds of environments here: an environment which is actually just platonically a gridworld with a "shutdown" component, and one in which we simulate such a world. I’m going to discuss the latter, although I expect that similar arguments apply – at least for the first paragraph.

Suppose that the agent is fairly intelligent, but has not yet realized that it is being simulated. So we define the impact unit and budget, and see that the agent unfortunately does not overcome the obstacle. We increase the budget until it does.

Suppose that it has the realization, and refactors its model somehow. It now realizes that what it should be doing is stringing together favorable observations, within the confines of its impact budget. However, the impact unit is still calculated with respect to some fake movement in the fake world, so the penalty for actually avoiding shutdown is massive.

Now, what if there is a task in the real world we wish it to complete which seemingly requires taking on a risk of being shut down? For example, we might want it to drive us somewhere. The risk of a crash is non-trivial with respect to the penalty. However, note that the agent could just construct a self-driving car for us and activate it with one action. This is seemingly allowed by intent verification.

So it seems to me that this task, and other potential counterexamples, all admit some way of completing the desired objective in a low-impact way – even if it’s a bit more indirect than what we would immediately imagine. By not requiring the agent to actually physically be doing things, we seem to be able to get the best of both worlds.

I found those sections vague and unclear (after rereading a few times), and didn't understand why you claim that a random set of utility functions would work. E.g. what do you mean by "long arms of opportunity cost and instrumental convergence"? What does the last paragraph of "AUP Unbound" mean and how does it imply the claim?

Simply the ideas alluded to by Theorem 1 and seemingly commonly accepted within alignment discussion: using up (or gaining) resources changes your ability to achieve arbitrary goals. Likewise for self-improvement. Even though the specific goals aren’t necessarily related to ours, the way in which their attainable values change is (I conjecture) related to how ours change.

The last paragraph is getting at the idea that almost every attainable utility is actually just tracking the agent’s ability to wirehead it from its vantage point after executing a plan. It’s basically making the case that even though there are a lot of weird functions, the attainable changes should still capture what we want. This is more of a justification for why the unbounded case works, and less about random utilities.

Replies from: Vika

↑ comment by Vika · 2018-09-23T19:49:05.917Z · LW(p) · GW(p)

Actually, I think it was incorrect of me to frame this issue as a tradeoff between avoiding the survival incentive and not crippling the agent's capability. What I was trying to point at is that the way you are counteracting the survival incentive is by penalizing the agent for increasing its power, and that interferes with the agent's capability. I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that 'low impact' is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact - high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.

Note that AUP could easily be converted to a side-effects-only measure by replacing the |difference| with a max(0, difference). Similarly, RR could be converted to a measure that penalizes increases in power by doing the opposite (replacing max(0, difference) with |difference|). (I would expect that variant of RR to counteract the survival incentive, though I haven't tested it yet.) Thus, it may not be necessary to resolve the disagreement about whether it's good to penalize increases in power, since the same methods can be adapted to both cases.
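
The two penalty shapes described here can be sketched in a few lines of Python. This is a toy illustration rather than code from either paper: the function names and the attainable-value numbers are hypothetical, and "difference" is assumed to mean baseline (inaction) value minus post-action value, so a positive difference is a loss of attainable utility.

```python
def absolute_penalty(q_inaction, q_action):
    """AUP-style term: penalizes both gains and losses in attainable value."""
    return abs(q_inaction - q_action)


def decrease_only_penalty(q_inaction, q_action):
    """Side-effects-only term: penalizes only losses in attainable value."""
    return max(0.0, q_inaction - q_action)


# Hypothetical attainable values for a single auxiliary utility.
q_inaction = 0.5
q_after_gaining_power = 0.9  # the action raises attainable value
q_after_side_effect = 0.2    # the action lowers attainable value

gain_abs = absolute_penalty(q_inaction, q_after_gaining_power)
gain_trunc = decrease_only_penalty(q_inaction, q_after_gaining_power)
loss_abs = absolute_penalty(q_inaction, q_after_side_effect)
loss_trunc = decrease_only_penalty(q_inaction, q_after_side_effect)
```

Under these assumptions the two terms agree on the side-effect case and differ only on the power-gaining case, where the truncated version charges nothing.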

Replies from: TurnTrout, Vika

↑ comment by TurnTrout · 2018-09-23T20:40:09.546Z · LW(p) · GW(p)

I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that 'low impact' is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact - high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.

Oh. So, when I see that this agent won’t really go too far to improve itself, I’m really happy. My secret intended use case as of right now is to create safe technical oracles which, with the right setup, help us solve specific alignment problems and create a robust AGI. (Don’t worry about the details for now.)

The reason I don’t think low impact will work in the long run for ensuring good outcomes on its own is that even if we have a perfect measure, at some point, someone will push the impact dial too far. It doesn’t seem like a stable equilibrium.

Similarly, if you don’t penalize instrumental convergence, it seems like we have to really make sure that the impact measure is just right, because now we’re dealing with an agent of potentially vast optimization power. I’ve also argued that getting only the bad side effects seems value alignment complete, but it’s possible an approximation would produce reasonable outcomes for less effort than a perfectly value-aware measure requires.

This is one of the reasons it seems qualitatively easier to imagine successfully using an AUP agent – the playing field feels far more level.

↑ comment by Vika · 2018-09-23T19:52:53.781Z · LW(p) · GW(p)

Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it's bad for an impact measure to not distinguish between causation and prevention, especially for irreversible events.

This comes up in the car driving example already mentioned in other comments on this post. The reason the action of keeping the car on the highway is considered "high-impact" is because you are penalizing prevention as much as causation. Your suggested solution of using a single action to activate a self-driving car for the whole highway ride is clever, but has some problems:

This greatly reduces the granularity of the penalty, making credit assignment more difficult.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
In such situations, the penalty will produce bad incentives. Namely, the penalty for staying on the road is proportionate to how bad a crash would be, so the tradeoff with goal achievement resolves in an undesirable way. If we keep the reward for the car arriving to its destination constant, then as we increase the badness of a crash (e.g. the number of people on the side of the road who would be run over if the agent took a noop action), eventually the penalty wins in the tradeoff with the reward, and the agent chooses the noop. I think it's very important to avoid this failure mode.
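
The last point can be made concrete with a toy calculation. This is only a sketch of the argument, assuming the penalty for staying on the road grows linearly with how bad a crash would be; the function name, the linear form, and all numbers are hypothetical.

```python
def chooses_noop(arrival_reward, crash_badness, penalty_per_badness):
    """Return True if the penalized agent prefers the noop to driving on.

    Assumes (hypothetically) that the noop earns no reward and no penalty,
    while staying on the road earns the arrival reward minus a penalty
    proportional to the badness of the crash it prevents.
    """
    drive_score = arrival_reward - penalty_per_badness * crash_badness
    noop_score = 0.0
    return noop_score > drive_score


# Hold the reward fixed and make the potential crash worse and worse.
decisions = {b: chooses_noop(1.0, b, 0.1) for b in (1, 5, 20)}
```

With the reward held at 1.0, the agent keeps driving for small crash badness and switches to the noop once the penalty exceeds the reward, which is the failure mode described above.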

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-23T20:54:57.552Z · LW(p) · GW(p)

it equally penalizes the agent for causing event A and for preventing event A

Well, there is some asymmetry due to approval incentives. It isn’t very clear to what extent we can rely on these at the moment (although I think they’re probably quite strong). Also, the agent is more inclined to have certain impacts, as presumably u_A is pointing (very) roughly in the right direction.

this greatly reduces the granularity of the penalty, making credit assignment more difficult.

I don’t think this seems too bad here - in effect, driving someone somewhere in a normal way is one kind of action, and normal AUP is too harsh. The question remains whether this is problematic in general. I lean towards no, due to the way the impact unit is calculated, but it deserves further consideration.

This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.

Intent verification does seem to preclude bad behavior here. As Rohin has pointed out, however, just because everything we can think of seems to have another part that is making sure nothing bad happens, the fact that these discrepancies arise should indeed give us pause.

You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.

We might have the agent just sitting in a lab, where the default action seems fine. The failure mode seems easy to avoid in general, although I could be wrong. I also have the intuition that any individual environment we would look at should be able to be configured through incrementation such that it’s fine.

↑ comment by TurnTrout · 2018-09-20T19:56:11.027Z · LW(p) · GW(p)

Wouldn't increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?

Huh? No, N is in the denominator of the penalty term.
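
As a quick check of the direction of this effect, here is a minimal sketch assuming the scaled penalty has the form Penalty / (N · ImpactUnit), consistent with N sitting in the denominator. The function name and the numbers are made up for illustration.

```python
def scaled_penalty(penalty, n, impact_unit):
    """Raw penalty divided by the budget N times the impact unit."""
    if n <= 0 or impact_unit <= 0:
        raise ValueError("n and impact_unit must be positive")
    return penalty / (n * impact_unit)


raw_penalty = 0.95  # hypothetical raw penalty for some action
impact_unit = 0.1   # hypothetical impact unit

# The same action is penalized less and less as N grows.
scaled = [scaled_penalty(raw_penalty, n, impact_unit) for n in (1, 10, 100)]
```

Under this assumed form, raising N can only shrink the scaled penalty for any fixed action, including actions that avoid shutdown.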

Your measurement of "power" (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations.

No, the utility functions are literally just over actions and observations. It’s true that among all computable utilities, some of the more complex ones will be doing something that we would deem to be grading a model of the actual world. This kind of thing is not necessary for the method to work.

Suppose that you receive 1 utility if you’re able to remain activated during the entire epoch. Then we see that Q_{u_1} becomes the probability of the agent ensuring it remains activated the whole time (this new "alien" agent does not have the impact measure restriction). As the agent gains optimization power and/or resources, this increases. This has nothing to do with anything actually going on in the world, beyond what is naturally inferred from its model over what observations it will see in the future given what it has seen so far.

comment by Stuart_Armstrong · 2018-09-21T15:10:35.954Z · LW(p) · GW(p)

Good work! Lots of interesting stuff there.

However, the setup seems to depend crucially on having a good set of utilities to make it work. For example, let u_A be the usual paperclipping utility, and define U^+ = "all observation-action utilities", and U^- = "all utilities that are defined over human behaviour + u_A".

Then suppose action a is a default, equivalent to "turn off your observations for an hour". And action a' is "unleash a sub-agent that will kill all humans, replace them all with robots that behave as humans would in a, then go out into the universe to discreetly build paperclips; the subagent does this within an hour's time. And also turn off your observations for an hour".

Then none of U^- will detect anything different, since the human (or human-replacing robots) actions are the same, but u_A will get massively increased.

Now, your theorem 1 still applies for U^+ (btw, there is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not). But even here, all histories will be the same, except for one action: a' instead of a at one particular point.

Compare this with a'' = "build one paperclip". Then your setup should, if I'm understanding it correctly, penalise a'' more than a'. Because a'' entails a divergence in future observations, while a' does not. It seems that your setup doesn't directly capture the fact that a' vs a is an absolutely massive change to the universe, whereas a'' is not.

Of course, this could be avoided if U^+ included utilities that got a big penalty (or a big increase!) for humans-being-replaced; but this just gets back to the fact that we have to define the utilities carefully, and in ways that reflect our values to some extent.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-21T15:32:53.208Z · LW(p) · GW(p)

Compare this with a'' = "build one paperclip". Then your setup should, if I'm understanding it correctly, penalise a'' more than a'. Because a'' entails a divergence in future observations, while a' does not.

This is not the case. I think there’s a confusion here - Q_u is not with respect to what we’ve seen so far. We aren’t taking the utility difference between u(h_action) and u(h_inaction). We’re instead comparing our ability to maximize each utility. How does a’ not drastically improve our ability to maximize arbitrary utilities, or at least use up a ton of energy in the process compared to not doing it?

That is, it’s not what we’ve seen so far that matters, per se. You’ll notice that the Q_u equation doesn’t even consider what happened before the new utility "takes over".

Consider the end of the plan – the agent is now in a position to spread and expand and complete arbitrary goals to a much greater degree than if it had just done nothing for an hour. The fact that human behavior seems the same isn’t actually very relevant.

The U^- do, however, all have fewer resources available now, even if we assume they don’t mind what just happened.

there is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not

Can you expand? I thought that this was covered by the first part.

Replies from: Stuart_Armstrong, Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-09-21T15:46:54.097Z · LW(p) · GW(p)

The subagent could, for example, hide caches of extra energy around the place (energy that would otherwise be wasted). Or maybe the human-robots are just more efficient, in ways that make up for any lost resources. Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a' is the only action it could ever take to "escape". Thus its expected utility is pretty much the same (and pretty low) for all the U's.

(and if you include arbitrary utilities, these include impossible-to-not-affect ones, like ones dependent on the waste heat rate of the AI; I don't see the complexity penalty corresponding well to a "unimportant to humans" measure).

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-21T16:01:57.266Z · LW(p) · GW(p)

The subagent could, for example, hide caches of extra energy around the place (energy that would otherwise be wasted).

Ok, supposing that can work: "build world dominating device and then spread into space" still drastically increases power.

Or maybe the human-robots are just more efficient, in ways that make up for any lost resources.

What are your intuitions for why building the sub-agent to begin with isn’t extremely instrumentally convergent?

Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a' is the only action it could ever take to "escape".

Lemma 3 proves it does nothing here. One of the main points of AUP is that it seems to cripple the survival incentive, so I’m confused why this came to mind.

expected utility is pretty much the same (and pretty low) for all the U's.

Utility != attainable utility. "how many dollars I have" != "how many additional dollars I could have within a year if I acted optimally"
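
To illustrate the distinction, here is a small brute-force sketch in a deterministic toy world. Everything in it (the state layout, the "work"/"wait" actions, the horizon) is a hypothetical stand-in, not the post's formalism: utility is what the state is worth now, and attainable utility is the best utility reachable within the horizon.

```python
from itertools import product


def step(state, action):
    """Hypothetical toy dynamics: state is (dollars, can_work)."""
    dollars, can_work = state
    if action == "work" and can_work:
        return (dollars + 1, can_work)
    return (dollars, can_work)  # "wait", or "work" while unable to work


def utility(state):
    """How many dollars I have right now."""
    return state[0]


def attainable_utility(state, horizon):
    """Best utility reachable by any plan of the given length."""
    best = utility(state)
    for plan in product(("work", "wait"), repeat=horizon):
        s = state
        for action in plan:
            s = step(s, action)
        best = max(best, utility(s))
    return best


rich_but_stuck = (10, False)
poor_but_able = (2, True)

extra_stuck = attainable_utility(rich_but_stuck, 5) - utility(rich_but_stuck)
extra_able = attainable_utility(poor_but_able, 5) - utility(poor_but_able)
```

In this toy world the first state has more utility but no room to gain any, while the second has less utility and more it could still attain, so the two quantities come apart.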

like ones dependent on the waste heat rate of the AI; I don't see the complexity penalty corresponding well to a "unimportant to humans" measure

Again, it’s not about utility change, but attainable utility change. The distinction is crucial.

I assert that we don’t need such a measure, for the reasons that I outlined in Utility Selection. That is, the content of the utilities seems to not actually be that important.

How can an agent become really powerful without also becoming really powerful, therefore getting penalized?

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-09-21T20:30:52.264Z · LW(p) · GW(p)

Ok, supposing that can work: "build world dominating device and then spread into space" still drastically increases power.

Not if the subagent is designed to not allow the increase in power. As in, the subagent is designed to maximise $u_A$, but also to ensure the penalty term is as small as possible. The subagent is deliberately beyond the control of the initial agent.

(is this the issue we are disagreeing about, or have I misunderstood?)

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-21T21:10:10.862Z · LW(p) · GW(p)

I think that you think that the method uses expected utility (which it doesn’t), so we’re secretly talking about really different things. The method doesn’t actually break in the ways you’ve said so far (as far as I can tell), but an expected utility method would. Compare how the post’s examples break down under your model, and contrast with the math I provided for attainable utilities.

For example, the "EU difference over plans" model applies .316 penalty to disabling the off-switch (due to coincidentally capturing change in the agent’s actual utility for that time step, since it’s in the left square), yet my calculations show .95 penalty. Along these lines, AUP provably never tries to escape certain shutdown (that is, it expects to be shut down no matter which action it chooses at the present step), while an EU method would have no compunctions about doing so in many situations (as you pointed out earlier).

Anyways, in general, supposing that the agent instantaneously builds successors that do things it can’t control either impedes its ability to maximize other objectives (if the agent can’t hijack this new powerful agent), or improves it (if the agent can hijack this new powerful agent).

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-09-25T14:19:22.105Z · LW(p) · GW(p)

Here is a writeup of the problem I believe your method has: https://www.lesswrong.com/posts/6EMdmeosYPdn74wuG/wireheading-as-potential-problem-with-the-new-impact-measure

↑ comment by Stuart_Armstrong · 2018-09-21T15:52:01.362Z · LW(p) · GW(p)

Can you expand?

Suppose the AI is addressing a letter containing $1,000,000. It can address this to Jane Brown, or to John Smith. Once the letter is addressed, the AI will be turned off, and the letter will be posted.

A utility $u_B$ that values Jane Brown would like the letter addressed to her, and vice versa for a utility $u_S$ that values John Smith. These two utilities differ only on the action the AI takes, not on subsequent observations. Therefore "This implies that by choosing $a$, the agent expects to observe some $u_A$-high scoring $o_A$ with greater probability than if it had selected $\emptyset$" is false - it need not expect to observe anything at all.

However the theorem is still true, because we just need to consider utilities that differ on actions - such as $u_B$ and $u_S$.

comment by DanielFilan · 2018-09-18T19:59:53.034Z · LW(p) · GW(p)

Various thoughts I have:

I like this approach. It seems like it advances the state of the art in a few ways, and solves a few problems in a neat way.
I still disagree with the anti-offsetting desideratum in the form that AUP satisfies. For instance, it makes AUP think very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor, which seems whacky and dangerous to me.
It's interesting that this somewhat deviates from my intuition about why I want impact regularisation. There is a relatively narrow band of world-states that humans thrive in, and our AIs should keep us within that narrow band. I think the point of impact regularisation is to keep us within that band by stopping the AI from doing 'crazy' things. This suggests that crazy should be measured relative to normality, and not relative to where the world is at any given point when the AI is acting.
In general, it's unclear to me how you get a utility function over sub-histories when the 'native' argument of a utility function is a full history. That being said, it makes sense in the RL paradigm, and maybe sums of discounted rewards are enough of the utility functions.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-18T22:16:34.182Z · LW(p) · GW(p)

For instance, it makes AUP think very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor, which seems whacky and dangerous to me

Isn’t this necessary for the shutdown safe desideratum? This property seems to make the proposal less reliant on the agent having a good model, and more robust against unexpected shutdown.

Can you give me examples of good low impact plans we couldn’t do without offsetting?

This suggests that crazy should be measured relative to normality, and not relative to where the world is at any given point when the AI is acting.

Can you expand on why these are distinct in your view?

In general, it's unclear to me how you get a utility function over sub-histories when the 'native' argument of a utility function is a full history.

The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point - "what’s the best history we can construct from here?", in a sense.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-19T22:04:25.257Z · LW(p) · GW(p)

Isn’t this necessary for the shutdown safe desideratum?

I don't remember which desideratum that is, can't ctrl+f it, and honestly this post is pretty long, so I don't know. At any rate, I'm not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures - see the ones that couldn't be simultaneously satisfied until this one did.

Can you give me examples of good low impact plans we couldn’t do without offsetting?

One case where you need 'offsetting', as defined in this piece but not necessarily as I would define it: suppose you want to start an intelligent species to live on a single new planet. If you create the species and then do nothing, they will spread to many many planets and do a bunch of crazy stuff, but if you have a stern chat with them after you create them, they'll realise that staying on their planet is a pretty good idea. In this case, I claim that the correct course of action is to create the species and have a stern chat, not to never create the species. In general, sometimes there are safe plans with unsafe prefixes and that's fine.

A more funky case that's sort of outside what you're trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don't act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general - how do they interplay with shifting models?)

[EDIT: a more mundane example is that driving on the highway is a situation where suddenly changing your plan to no-ops can cause literal impacts in an unsafe way, nevertheless driving competently is not a high-impact plan]

Can you expand on why [normality and the world where the AI is acting] are distinct in your view?

Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn't been yet.

The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point

I don't understand: the attainable utility calculation (by which I assume you mean the definition of $Q_u$) involves a utility function being called on a sub-history. The thing I am looking for is how to define a utility function on a subhistory when you're only specifying the value of that function on full histories, or alternatively what info needs to be specified for that to be well defined.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T22:53:05.850Z · LW(p) · GW(p)

Couldn’t you equally design a species that won’t spread to begin with?

A more funky case that's sort of outside what you're trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don't act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general - how do they interplay with shifting models?)

I think the crux here is that I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting. In a nutshell, my view is that low impact should be with respect to what the agent is doing, and not something enforced on the environment. How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?

Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still.

Edit: The driving example is a classic point brought up, totally valid. As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.

I think it’s true that there are situations in which we would want an offset to happen, but it seems to me like we can just avoid problematic situations which require that to begin with. If the agent makes a mistake, we can shut it off and then do the offsetting ourselves. I mentioned model accuracy in open questions; I think the jury is definitely still out on that.

Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn't been yet.

Oh, so it’s an issue with a potential shift. But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?

how to define a utility function on a subhistory when you're only specifying the value of that function on full histories

Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-21T20:42:41.301Z · LW(p) · GW(p)

This comment is very scattered, I've tried to group it into two sections for reading convenience.

Desiderata of impact regularisation techniques

Couldn’t you equally design a species that won’t spread to begin with?

Well, maybe you could, maybe you couldn't. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn't.

I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting.

I disagree with this, in that I don't see how it connects to the real world reason that we would like low impact AI. It does seem to be the crux.

How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?

I don't know, and it doesn't seem obvious to me that any sensible impact measure is possible. In fact, during the composition of this comment, I've become more pessimistic about the prospects for one. I think that this might be related to the crux above?

Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still.

I don't really understand what you mean here, could you spend two more sentences on it?

As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.

This is really interesting, and suggests to me that in general this agent might act by creating a successor that carries out a globally-low-impact plan, and then performing the null action thereafter. Note that this successor agent wouldn't be as interruptible as the original agent, which I guess is somewhat unfortunate.

Technical discussion of AUP

But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?

It would not, but it's brittle to accidents that cause them to diverge. These accidents both include ones caused by the agent e.g. during the learning process; and ones not caused by the agent e.g. a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn't allowed to stop it because that would be too high impact.

Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.

This causes pretty weird behaviour. Imagine an agent's goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal's ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.
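To make the dance example concrete, here is a minimal sketch. It assumes a history is a tuple of (action, observation) pairs and that a subhistory is fed to the same utility function as if it were a full history; the names `u_dance` and `u_on_subhistory` are hypothetical and are not taken from the post.

```python
from typing import Tuple

Step = Tuple[str, str]        # (action, observation); an assumed encoding
History = Tuple[Step, ...]


def u_dance(history: History) -> int:
    """Return 1 if the first action of the given history is 'dance', else 0."""
    return 1 if history and history[0][0] == "dance" else 0


def u_on_subhistory(full_history: History, start: int) -> int:
    """Evaluate u_dance on the subhistory beginning at step `start` (0-indexed)."""
    return u_dance(full_history[start:])


danced_first = (("dance", "o1"), ("noop", "o2"), ("noop", "o3"))
danced_second = (("noop", "o1"), ("dance", "o2"), ("noop", "o3"))

# On full histories, only the agent's very first action matters.
print(u_dance(danced_first), u_dance(danced_second))

# On the subhistory that starts at the second timestep, the same function
# rewards dancing at the second timestep and ignores the first action.
print(u_on_subhistory(danced_first, 1), u_on_subhistory(danced_second, 1))
```

Under these assumptions the two evaluations disagree about which history satisfied the goal, which is the behaviour described above.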

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-21T22:09:04.011Z · LW(p) · GW(p)

Desiderata of impact regularisation techniques

Well, maybe you could, maybe you couldn't. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn't.

So it seems that on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first method that comes to mind. That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.

The point of the impact measure isn’t to choose the exact plan that we would use, but rather to disallow overly-impactful plans and allow us to complete a range of goals in some low-impact way. I don’t think we should care about which way that is, as long as it isn’t dangerous.

But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.
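A toy sketch of the N-incrementing argument, under loud assumptions: each plan is summarized by a (utility, penalty) pair, the agent scores a plan as utility minus penalty divided by N, ties go to doing nothing, and every goal-reaching plan earns the same utility. The plan names and numbers are invented for illustration.

```python
from typing import Dict, Optional, Tuple

Plans = Dict[str, Tuple[float, float]]  # name -> (utility, penalty); assumed summary


def best_plan(plans: Plans, n: int) -> str:
    """Pick the plan maximizing utility - penalty / n, preferring 'noop' on ties."""
    scores = {name: u - p / n for name, (u, p) in plans.items()}
    return max(scores, key=lambda name: (scores[name], name == "noop"))


def n_increment(plans: Plans, max_n: int = 100) -> Optional[Tuple[int, str]]:
    """Raise N from 1 until the agent prefers something other than doing nothing."""
    for n in range(1, max_n + 1):
        choice = best_plan(plans, n)
        if choice != "noop":
            return n, choice
    return None


plans = {
    "noop": (0.0, 0.0),
    "minimal": (1.0, 3.0),            # reaches the goal with only the necessary impact
    "with_side_effects": (1.0, 7.0),  # reaches the goal with extra impact
}

print(n_increment(plans))
```

In this toy setting the first non-null plan to clear the bar is the lower-penalty one. The sketch does not address the caveat about undesirable pseudo-ways of reaching the goal, nor plans whose utilities differ.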

[note: this supposes that there aren’t undesirable pseudo-ways of reaching the goal before we reach the outcome in mind. This seems plausible due to the structuring of the measure, but shouldn’t be taken for granted.]

Analogously, I am saying that we can seemingly get all the low-impact results we need without offsetting using AUP. You point out specific plans which would be allowed if we could offset in a reasonable way. I say that that problem seems really hard, but it looks like my method lets us get effectively the same thing done without needing to figure that out.

I don't know, and it doesn't seem obvious to me that any sensible impact measure is possible.

I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.

"Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still." I don't really understand what you mean here, could you spend two more sentences on it?

It allows plans like the chauffeur example, while seemingly disallowing weird cheats.

Technical discussion of AUP

These accidents both include ones caused by the agent e.g. during the learning process

Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.

This seems to more generally just be a problem with not knowing what you don’t know, and the method is compatible with whatever solutions we do come up with. Furthermore, instead of needing to know whether effects are bad, the agent only needs to know whether they are big (I just realized this now!). This is already an improvement on the state-of-the-art for safe learning, as I understand it. That is, AUP becomes far less likely to do things as soon as it realizes that their consequences are big - instead of waiting for us to tell it that the consequences are bad.

a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn't allowed to stop it because that would be too high impact.

Because I claim this is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

This causes pretty weird behaviour. Imagine an agent's goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal's ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.

Why is this weird behavior? If it has a dance action, it should always be able to execute this action? It retains the dance action, if we’re actually using this, and then turns into a pure measure of power (u_1 - can it remain activated for the remainder of the attainable horizon, in order to ensure it retains the 1 utility rating?), which I have argued tracks what we want.

Replies from: DanielFilan, DanielFilan, DanielFilan

↑ comment by DanielFilan · 2018-09-21T22:55:37.909Z · LW(p) · GW(p)

Desiderata of impact regularisation techniques

So it seems to me like on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first way that comes to mind. That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.

So there's a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won't allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X' that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet.

The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it's not, then I might worry that it will run into substantial trouble in complicated scenarios that I can't really picture. It's a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to "look at a bunch of environments and see if the plans AUP comes up with should be allowed") and minimal philosophising (compared to "meditate on the equations and see if they're analytically identical to how I feel impact should be defined").

[EDIT: added content to this section]

Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

Firstly, saving humanity from natural disasters doesn't at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it's plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn't thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I'm worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-22T00:06:39.908Z · LW(p) · GW(p)

if you tell me that AUP won't allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X' that are pretty similar to X along the relevant dimension that made me bring up X.

I think there is an argument for this whenever we have "it won’t X because anti-survival incentive incentive and personal risk": "then it builds a narrow subagent to do X".

The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed,

As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?

Firstly, saving humanity from natural disasters doesn't at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it's plausibly in a different natural reference class than causing natural disasters.

Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment.

I still would hope that they could be used in a wider range of settings (basically, whenever I'm worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).

I agree. I’m not saying that the method won’t work for these, to clarify.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-24T18:50:23.469Z · LW(p) · GW(p)

I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?

Two points:

Firstly, the first section of this comment [AF(p) · GW(p)] by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it's going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. I think that the question of what this best successor theory is like is an important one when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what's in this post, and I'm much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that's what I'm primarily doing.

Firstly, saving humanity from natural disasters... seems like it's plausibly in a different natural reference class than causing natural disasters.

Why would that be so? That doesn’t seem value agnostic.

It seems value agnostic to me because it can be generated from the urge 'keep the world basically like how it used to be'.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-24T21:42:18.938Z · LW(p) · GW(p)

I have no idea what the best successor theory is like. All I know is what's in this post, and I'm much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that's what I'm primarily doing.

But in this same comment, you also say

I think it's going to be non-trivial to relax an impact measure

People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?

I’m making my predictions based off of my experience working with the method. The reason that many of the flaws are on the list is not because I don’t think I could find a way around them, but rather because I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate.

I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed. However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment,

I'm not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures - see the ones that couldn't be simultaneously satisfied until this one did.

It seems value agnostic to me because it can be generated from the urge 'keep the world basically like how it used to be'.

True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what "kinds of things" can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.

Replies from: DanielFilan, rohinmshah

↑ comment by DanielFilan · 2018-09-25T21:03:00.758Z · LW(p) · GW(p)

Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn't imply lack of confidence in opinions about how to modify impact measures, which itself doesn't imply lack of opinions about how to modify impact measures.

People keep saying things like ['it's non-trivial to relax impact measures'], and it might be true. But on what data are we basing this?

This is according to my intuitions about what theories do what things, which have had as input a bunch of learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn't a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I'm probably quite sub-optimal in converting experience into intuitions), but it's still my impulse.

True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what "kinds of things" can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.

My sense is that we agree that this looks hard but shouldn't be dismissed as impossible.

↑ comment by Rohin Shah (rohinmshah) · 2018-09-24T23:58:08.342Z · LW(p) · GW(p)

People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?

What? I've never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I'm quite certain it can't be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-25T03:07:59.923Z · LW(p) · GW(p)

I’m clearly not saying you can never predict things before trying them, I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP.

↑ comment by DanielFilan · 2018-09-25T20:47:22.257Z · LW(p) · GW(p)

That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that.

Going back to this, what is the way you propose the species-creating goal be done? Say, imposing the constraint that the species has got to be basically just human (because we like humans) and you don't get to program their DNA in advance? My guess at your answer is "create a sub-agent that reliably just does the stern talking-to in the way the original agent would", but I'm not certain.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-26T02:55:31.064Z · LW(p) · GW(p)

My real answer: we probably shouldn’t? Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought. (See the cheese post, can’t find it)

and you don't get to program their DNA in advance?

Uh, why not?

Make humans that will predictably end up deciding not to colonize the galaxy or build superintelligences.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-27T23:24:13.359Z · LW(p) · GW(p)

Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought.

I guess I'm more comfortable with procreation than you are :)

I imposed the "you don't get to program their DNA in advance" constraint since it seems plausible to me that if you want to create a new colony of actual humans, you don't have sufficient degrees of freedom to make them actually human-like but also docile enough.

You could imagine a similar task of "build a rather powerful AI system that is transparent and able to be monitored", where perhaps ongoing supervision is required, but that's not an onerous burden.

↑ comment by DanielFilan · 2018-09-21T23:17:31.428Z · LW(p) · GW(p)

Technical discussion of AUP

But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.

This is only convincing to the extent that I buy into AUP's notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time.

I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.

I agree that the nice properties of AUP are pretty nice and demonstrate a significant advance in the state of the art for impact regularisation, and did indeed put that in my first bullet point of what I thought of AUP, although I guess I didn't have much to say about it.

Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.

This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it's run for a long time there might be at least one error, and I'm inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the 'natural disaster' category (which might include an actuator in the AUP agent going haywire or any number of things).

Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster warning AUP agent, then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster (I think even intent verification doesn't prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent's utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different reaction to this section in the sibling comment.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-22T00:00:28.858Z · LW(p) · GW(p)

My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact)

I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist.

Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time.

This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways

What is "this" here (for a)?

I'm inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time

But AUP’s plans are shutdown-safe? I think I misunderstand.

then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster

I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives).

In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-24T19:20:27.979Z · LW(p) · GW(p)

Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic.

To be frank, although I do like the fact that there's a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is.

... (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways

What is "this" here (for a)?

"This" is "upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline", and it's what I mean by "ungracefully failing if the protocol stops being followed at any one point in time".

then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster

I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters

Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent's ability to achieve a wide variety of goals.

Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn't taking as an assumption that you were making.

Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-24T21:34:09.269Z · LW(p) · GW(p)

"This" is "upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline", and it's what I mean by "ungracefully failing if the protocol stops being followed at any one point in time".

Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily-large degree.

AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it?

This feels like an odd standard, where you say "but maybe it randomly fails and then doesn’t work", or "it can’t anticipate things it doesn’t know about". While these are problems, they aren’t for low impact to resolve, but the approach also happens to help anyways.

AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent's ability to achieve a wide variety of goals.

This is true. It depends what the scale is - I had "remote local disaster" in mind, while you maybe had x-risk.

[Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary "extinction?" oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.]

Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn't taking as an assumption that you were making.

We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible. I have a rough intuition the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-25T21:38:52.455Z · LW(p) · GW(p)

This feels like an odd standard, where you say "but maybe it randomly fails and then doesn’t work", or "it can’t anticipate things it doesn’t know about".

I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it's unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you're more likely to act incorrectly (both in the sense of "higher probability of incorrect actions" and "more probability of more extremely incorrect answers"), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I've heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it's bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.

My claim here is not quite that AUP amplifies 'errors' (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence to the starting state, or what the world would be like given that the agent had only performed no-ops after the starting state, resulting in a tendency to mitigate these 'errors'. At any rate, even if no other method mitigated these 'errors', I would still want them to.

It depends what the scale is - I had "remote local disaster" in mind, while you maybe had x-risk.

I wasn't necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.

We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.

My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important [AF · GW], but not something that I would assume without mentioning it.

[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-26T03:05:13.727Z · LW(p) · GW(p)

Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining ability to act, which could also be negative.

Edit:

My claim here is not quite that AUP amplifies 'errors' (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them.

But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, even if we had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I disagree presently that agent mitigation is the desirable behavior after model errors.

Replies from: DanielFilan

↑ comment by DanielFilan · 2018-09-27T23:48:43.623Z · LW(p) · GW(p)

Perhaps we could have it recalculate past impacts?

Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue.

But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.

I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don't understand what's happened. In this scenario, I think you should see the preservation of 'errors' in the sense of the agent's future under no-ops differing from 'normality'.

If 'errors' happen due to a mismatch between the model and reality, I agree that the agent shouldn't try to fix them with the bits of the model that are broken. However, I just don't think that that describes many of the things that cause 'errors': those can be foreseen natural events (e.g. a San Andreas earthquake if you're good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake if you're not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.

comment by DanielFilan · 2019-12-28T05:34:59.189Z · LW(p) · GW(p)

Note: this is on balance a negative review of the post, at least regarding the question of whether it should be included in a "Best of LessWrong 2018" compilation. I feel somewhat bad about writing it given that the author has already written a review that I regard as negative. That being said, I think that reviews of posts by people other than the author are important for readers looking to judge posts, since authors may well have distorted views of their own works.

The idea behind AUP, that ‘side effect avoidance’ should mean minimising changes in one’s ability to achieve arbitrary goals, seems very promising to me. I think the idea and its formulation in this post substantially moved forward the ‘impact regularisation’ line of research. This represents a change in opinion since I wrote this comment [LW(p) · GW(p)].
I think that this idea behind AUP has fairly obvious applications to human rationality and cooperation, although they aren’t spelled out in this post. This seems like a good candidate for follow-up work.
This post is very long, confusing to me in some sections, and contains a couple of English and mathematical typos.
I still believe that the formalism presented in this post has some flaws that make it not suitable for canonisation. For more detail, see my exchange in the descendants of this comment [LW(p) · GW(p)] - I still mostly agree with my claims about the technical aspects of AUP as presented in this post. Fleshing out these details is also, in my opinion, a good candidate for follow-up work.
I think that the ideas behind AUP that I’m excited about are better communicated in other posts by TurnTrout.

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-12-28T14:49:52.712Z · LW(p) · GW(p)

I think that this idea behind AUP has fairly obvious applications to human rationality and cooperation, although they aren’t spelled out in this post. This seems like a good candidate for follow-up work.

I'm curious whether these are applications I've started to gesture at in Reframing Impact, or whether what you have in mind as obvious isn't a subset of what I have in mind. I'd be interested in seeing your shortlist.

For more detail, see my exchange in the descendants of this comment - I still mostly agree with my claims about the technical aspects of AUP as presented in this post. Fleshing out these details is also, in my opinion, a good candidate for follow-up work.

Without rereading all of the threads, I'd like to note that I now agree with Daniel about the subhistories issue. I also agree that the formalization in this post is overly confusing and complicated.

Replies from: DanielFilan

↑ comment by DanielFilan · 2019-12-28T16:47:47.627Z · LW(p) · GW(p)

I'm curious whether these are applications I've started to gesture at in Reframing Impact

I confess that it's been a bit since I've read that sequence, and it's not obvious to me how to go from the beginnings of gestures to their referents. Basically what I mean is 'when trying to be cooperative in a group, preserve generalised ability to achieve goals', nothing more specific than that.

comment by sayan · 2019-09-04T08:28:26.060Z · LW(p) · GW(p)

Quick question. Given that now the Conservative Agency paper is available, what am I missing if I just read the paper and not this post? It seems easier to me to follow the notations of the paper. Is there any significant difference between the formalization of this post and the paper?

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-09-04T15:52:26.995Z · LW(p) · GW(p)

Read the paper for now, and read the upcoming Reframing Impact sequence later this year. There is a significant difference, but this post seems bad at communicating the key paradigm shifts I originally envisioned communicating (hence the sequence).

comment by TurnTrout · 2018-09-26T03:08:43.402Z · LW(p) · GW(p)

Update: I tentatively believe I’ve resolved the confusion around action invariance, enabling a reformulation of the long term penalty which seems to converge to the same thing no matter how you structure your actions or partition the penalty interval, possibly hinting at an answer for what we can do when there is no discrete time step ontology. This in turn does away with the long-term approval noise and removes the effect where increasing action granularity could arbitrarily drive up the penalty. This new way of looking at the long-term penalty enables us to understand more precisely when and why the formulation can be gamed, justifying the need for something like IV.

In sum, I expect this fix to make the formulation more satisfying and cleanly representative of this conceptual core of impact. Furthermore, it should also eliminate up to half of the false positives I’m presently aware of, substantially relaxing the measure in an appropriate way - seemingly without loss of desirable properties.

Unfortunately, my hands are still recovering from carpal tunnel (this post didn’t write itself), so it’ll be a bit before I can write up this info.

comment by Charlie Steiner · 2018-09-19T14:44:46.539Z · LW(p) · GW(p)

Comments around the section title in bold. Apologies for length, but this was a pretty long post, too! I wrote this in order, while reading, so I often mention something that you address later.

Intuition Pumps:

There are well-known issues with needing a special "Status quo" state. Figuring out what humans would consider the "default" action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it's solved for now.

Notation:

Notationally, it's a little weird to me that $Q_u$ doesn't mention the timescale (e.g. $Q_u^{(m)}$). Are you implying that the choice of $m$ can be arbitrary and you'll therefore just assume $m$ is some constant?

Theorem 1 doesn't rely on any special properties of the default action - whenever $Q_{u_A}$ is different after any two actions, you can find another $u$ such that $Q_u$ is also different after those actions as long as your set $U$ is closed under one of several simple operations (including multiplying by -1!).

If you want attainable utility to be 0 after the agent shuts itself off, then your terminology was confusing. It would make more sense for me if you phrased this whole thing in terms of *reward* - which is what I would normally call a number that the agent receives at each time step. So rather than saying that $u(\text{empty tape})=0$, which makes it sound to me like you're talking about the utility of a history that is entirely empty, I would understand it faster if you said that $r(h_{t-1},\{\text{empty tape}\})=0$ (reward is zero whenever the most recent timestep is an empty tape).

In fact, even this talk about empty tape is a little bit misleading, because this agent doesn't necessarily have an understanding of itself as having a physical embodiment with actual tape. It might be more accurate to say that you're saying that in this environment there's a special observation $O$ (the agent gets shut off, or the "empty tape observation") that is always followed by another $O$ no matter what action is taken, and your class of reward functions is required to assign zero reward to states with last observation $O$.

Change in Expected Attainable Utility:

Here we have the meat. The "impact" term is: for an average $u$ in your set (weighted by simplicity), how much does taking this action change $Q_u$, relative to taking the default action?
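To make the paraphrase above concrete, here is a minimal Python sketch of that impact term. The function name `impact_penalty`, the dictionary layout, and all of the numbers are illustrative assumptions rather than the post's notation; in particular, dividing by the total weight is a choice made here so that the result reads as an average.

```python
from typing import Dict


def impact_penalty(q_action: Dict[str, float],
                   q_default: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted average of |Q_u(action) - Q_u(default)| over the set of u."""
    total_weight = sum(weights.values())
    weighted_change = sum(
        weights[u] * abs(q_action[u] - q_default[u]) for u in weights
    )
    return weighted_change / total_weight


# Hypothetical simplicity weights and attainable utilities for three u's.
weights = {"u1": 0.5, "u2": 0.25, "u3": 0.25}
q_default = {"u1": 0.2, "u2": 0.6, "u3": 0.4}  # after the default action
q_action = {"u1": 0.6, "u2": 0.6, "u3": 0.0}   # after the candidate action

penalty = impact_penalty(q_action, q_default, weights)
print(round(penalty, 3))
```

Note that both the increase for `u1` and the decrease for `u3` add to the penalty, because the term uses absolute differences.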

Just pausing here and thinking about whether this is doing the intended thing, and whether it might also do unintended things. The intended thing is something like to prevent UFAI from releasing a deadly superflu. If $m$ is shorter than the length scale the agent can plan over, it can get around the restriction by taking an action that it knows will be good but has low penalty. The numbers assigned might be misleading - by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don't understand the scaling of the AI's "primary" reward - if it sees a plan that could lead to 3^^^3 paperclips, it's probably just going to do it, if it ever does anything at all (That is, if you look at all the cases with return rate reward/impact > 1, there might be lots that have high reward [well, relative to other rewards] and high impact). Also if there are any of those aforementioned surprises in its counterfactual reasoning about what happens if it takes the default action, those show up here as corresponding surprises in what it thinks it's allowed to do.

Will it add unintended goals? If the penalty coefficient is too large, will there be bad behavior? Nope (good job!) - in the limit of large penalty, it's just going to take the default action.

But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn't seem like what I want.

Using ImpactUnit is clever, but it doesn't fully solve setting the penalty coefficient because it doesn't say anything about the scale of the reward function. If the reward function has lots of tiny numbers, they'll be smaller than the penalty numbers, and vice versa if lots of numbers are near the top of the range. Your goal is to set the penalty coefficient right in the middle ground between prohibiting everything and allowing everything (assuming this middle ground has the properties you want, see concerns above), but there still seems to be guesswork involved.

Wait... is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.

Modified Utility:

Again, talking about this as a reward makes more sense than calling it utility for me. The penalty to the total reward of a plan is the sum of the penalty at each time step. Great.

I recognize and appreciate the effort put into explaining, but this particular explanation in terms of $h_{\text{inaction}}$ and $h_{\text{action}}$ plans didn't really do it for me. Like I said, I just think of it in terms of having a reward and a penalty at each timestep of the plan, as in the equation.

Examples:

Yayyy, these are so nice!

The shutdown button example is very elegant.

All these utility functions sidestep my concerns by all being very statistically similar, all being similar complexity, and all having convenient scales relative to ImpactUnit. I'd be interested in what the limits to changing these things are before you start getting bad behavior.

Great experiments!

The Beware of Dog experiment doesn't really show what you say it shows, because the agent could have gone at an earlier cycle of the dog. The agent might just be going whenever, with no particular preference as long as it gets good reward.

AUP Unbound:

I think thinking about convergent instrumental goals (e.g. gathering money, when you have many possible utility functions to spend it on) really highlights an issue. If the penalty is too high, you collect no money, because the power it gives you is more than the benefit. But if the penalty is even slightly lower than the tipping point, you'll collect as much money as you can, because it's worth more in utility than it changes the weighted sum of other utilities. Because the utility function is bounded, if you have a utility function that's shaped differently from the average you might get some sticking points and end up satisfied with a finite amount of money, but then again, you might not (if the initial difference in slope from the average is the largest sticking point).
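The tipping-point worry can be illustrated with a toy model. This is only a sketch under assumptions that are not in the post: the penalty is taken to grow linearly with the amount of money collected, and the two gain curves (`linear_gain`, `saturating_gain`) are made-up stand-ins for a bounded utility function.

```python
import math


def best_amount(gain, penalty_per_unit, amounts):
    """Amount of money maximising gain(x) - penalty_per_unit * x."""
    return max(amounts, key=lambda x: gain(x) - penalty_per_unit * x)


def linear_gain(x):
    # Hypothetical: utility grows linearly (stays within [0, 0.5] on this grid).
    return 0.05 * x


def saturating_gain(x):
    # Hypothetical bounded utility with diminishing returns.
    return 1.0 - math.exp(-x)


amounts = [i / 10 for i in range(0, 101)]  # candidate amounts 0.0 .. 10.0

just_above = best_amount(linear_gain, 0.051, amounts)    # penalty above tipping point
just_below = best_amount(linear_gain, 0.049, amounts)    # penalty below tipping point
sticking = best_amount(saturating_gain, 0.5, amounts)    # interior optimum

print(just_above, just_below, sticking)
```

With the linear gain the optimum jumps from collecting nothing to collecting everything as the penalty crosses the slope of the gain, while the saturating gain settles at a finite amount (near ln 2 on this grid).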

As stated, the penalty calculation runs on pure correlation. So anything that "influences the agent's action" in an EDT-violating way, or that depends on the output of the agent's computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.

Nknown:

An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems to me like a pretty big power gain, so it will only act if acting can give it a similarly big power gain.

It might be possible to do this *relatively* safely by starting an agent, testing it on some tasks, and then turning it off and starting over.

Also I'm not sure these agents won't acausally cooperate. They're maximizing the utility times the probability of getting the history conditional on the action. This seems like the sort of setup that, if it was playing the prisoner's dilemma against itself, might realize that P(observe cooperate | take the cooperate action) is large, and not worry too much about whether this correlation was causal or acausal.

Desiderata:

I think you're giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas. That's just how these things go.

Overall, I was very impressed! I definitely didn't think this was going to have as nice properties as it does, at the start. I'm of the opinion that low-impact and corrigibility seem harder than the value loading problem itself (though easier to test and less bad to screw up), so I'm impressed by this progress even though I think there's lots of room for improvement. I also thought the explanations and examples were really well-done. The target audience has to be willing to read through a pretty long post to get the gist of it, but TBH that's probably fine (though academics do have to promote complicated work in shorter formats as well, like press releases, posters, 10-minute talks, etc.). I'll probably have more to say about this later after a little digesting.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T17:03:49.491Z · LW(p) · GW(p)

Thanks so much for the detailed commentary!

There are well-known issues with needing a special "Status quo" state. Figuring out what humans would consider the "default" action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it's solved for now.

On the contrary, the approach accounts for - and in fact, benefits from - counterfactual reactions. Counterfactual actions we ideally make are quite natural: shutting the agent down if it does things we don’t like, and not shutting it down before the end of the epoch if it stops doing things entirely (an unsurprising reaction to low impact agents). As you probably later noticed, we just specify the standby action.

One exception to this is the long term penalty noise imposed by slight variation in our propensity to shut down the agent, which I later flag as a potential problem.

[there is change] as long as your set U is closed under one of several simple operations (including multiplying by -1!).

False, as I understand it. This is a misconception I’ve heard from multiple people – including myself, the first time I thought to prove this. Consider again the line:

Suppose $u$ rates trajectories in which it ends up in A, B, or C as $-1$, and in D as $1$, and that $\lnot u := -u$. If the agent is at A and $m=2$, moving right increases $Q_u$ while keeping $Q_{\lnot u}$ constant.
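The quoted counterexample can be checked mechanically. The sketch below assumes details the quote leaves implicit: the states sit on a line in the order A, B, C, D, the agent may move one step left or right or stay put each turn, and attainable utility is the best value of the final state reachable within $m$ steps from wherever the agent currently is. The helper names are hypothetical.

```python
STATES = ["A", "B", "C", "D"]  # assumed left-to-right layout of the line


def reachable(start, steps):
    """Indices reachable from `start` in at most `steps` moves along the line."""
    lowest = max(0, start - steps)
    highest = min(len(STATES) - 1, start + steps)
    return range(lowest, highest + 1)


def attainable(utility, start, m):
    """Best final-state utility achievable within m steps from `start`."""
    return max(utility(STATES[i]) for i in reachable(start, m))


def u(state):
    return 1 if state == "D" else -1


def not_u(state):
    return -u(state)


m = 2
at_a, at_b = STATES.index("A"), STATES.index("B")

print("Q_u:     ", attainable(u, at_a, m), "->", attainable(u, at_b, m))
print("Q_not_u: ", attainable(not_u, at_a, m), "->", attainable(not_u, at_b, m))
```

Under these assumptions, moving right from A to B brings D within reach and raises attainable utility for `u`, while `not_u` can still reach A, B, or C from either position, so its attainable utility is unchanged.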

u(emptytape)=0, which makes it sound to me like you're talking about the utility of a history that is entirely empty

I am.

In fact, even this talk about empty tape is a little bit misleading, because this agent doesn't necessarily have an understanding of itself as having a physical embodiment with actual tape.

We’re only (formally) talking about a Cartesian agent, right?

The numbers assigned might be misleading - by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don't understand the scaling of the AI's "primary" reward - if it sees a plan that could lead to 3^^^3 paperclips, it's probably just going to do it, if it ever does anything at all

I’m not sure whether you still have this thought later, but the first is addressed by my comments in "utility selection". Secondly, the primary u_A is also bounded in [0,1].

But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn't seem like what I want.

This is more related to the question of "how can it do things where interruption would be impactful?" A chauffeur-u_A agent wouldn’t bother going down the tunnel itself, and would probably just make a self-driving car that would only require one activation action. If it predicts that activating the car would be low impact (and also would not make us more or less likely to shut it down), it’ll do that. I don’t see a problem with the penalty scaling here, but maybe I haven’t quite understood your point.

Wait... is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.

Yes, and provably yes (as in, it’ll never increase it on purpose). Why does this seem ugly? It has a reference action that immediately uses a tiny amount of resources; this then lets us define a budget.

The Beware of Dog experiment doesn't really show what you say it shows, because the agent could have gone at an earlier cycle of the dog.

I checked this by increasing plan length - it is indeed waiting until near the end of the plan.

But if the penalty is even slightly lower than the tipping point, you'll collect as much money as you can, because it's worth more in utility than it changes the weighted sum of other utilities.

I don’t understand why this isn’t taken care of by u_A being bounded. Diminishing returns will kick in at some point, and in any case we proved that the agent will never choose to have more than N•ImpactUnit of impact.

As stated, the penalty calculation runs on pure correlation. So anything that "influences the agent's action" in an EDT-violating way, or that depends on the output of the agent's computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.

I don’t see why, but I also don’t know much DT yet. I’ll defer discussion of this matter to others. Alternatively, ask me in a few months?

An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems to me like a pretty big power gain, so it will only act if acting can give it a similarly big power gain.

First, the agent grades future plans using its present N. Second, this isn’t a power gain, since none of the U_A utilities are AUP - how would this help arbitrary maximizers wirehead? Third, agents with different N are effectively maximizing different objectives.

Also I'm not sure these agents won't acausally cooperate.

They might, you’re correct. What’s important is that they won’t be able to avoid penalty by acausally cooperating.

I think you're giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas.

This is definitely a fair point. My posterior on handling these "gotcha"s for AUP is that fixes are rather easily derivable – this is mostly a function of my experience thus far. It’s certainly possible that we will run across something that AUP is fundamentally unable to overcome, but I do not find that very likely right now. In any case, I hope that the disclaimer I provided before the checkmarks reinforced the idea that not all of these have been rock-solid proven at this point.

comment by michaelcohen (cocoa) · 2019-04-11T04:26:58.535Z · LW(p) · GW(p)

These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).

If an action is useful w.r.t. the actual reward but useless to all other rewards (as useless as taking $\emptyset$), that is the ideal according to $R_{AUP}$; i.e., if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, use the power to run some code to help you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?

If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward into the agent submitting 3 actions, and receiving the next three observations (with the rewards averaged). This is just a new action space, cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously, and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous. Since this problem appears in the math, but not in the intuition, it makes me wary of the reliability of the intuition.
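A small numerical sketch of this cancellation, under a simplifying assumption that is not taken from the paper: the penalty for an action is the absolute net change in attainable utility (for a single auxiliary reward) measured once that action, primitive or bundled, has finished. The function names and numbers are hypothetical.

```python
def fine_grained_penalty(step_changes):
    """Penalise each primitive action separately, then add the penalties."""
    return sum(abs(change) for change in step_changes)


def macro_action_penalty(step_changes):
    """Penalise the bundle of actions once, based on its net change only."""
    return abs(sum(step_changes))


# Hypothetical changes in attainable utility caused by three primitive actions:
# a large decrease, no change, then a large increase.
changes = [-0.4, 0.0, 0.4]

fine = fine_grained_penalty(changes)
coarse = macro_action_penalty(changes)
print(fine, coarse)
```

The step-by-step penalty registers both swings, while the bundled action space sees no net change and assigns no penalty at all.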

Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact, it does) by testing them out.

Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is $a_{unit}$? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe $a_{unit}$ is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having $a_{unit}$ be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-04-11T16:23:30.143Z · LW(p) · GW(p)

1) Why wouldn't gaining trust be useful for other rewards? I think that it wouldn't be motivated to do so, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct.

2) this doesn't appear in the paper, but I do talk about it in the post, and I think it's great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you're not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon.

3) this is true. My stance here is that this is basically a capabilities problem/a safe exploration issue, which is disjoint from impact measurement.

4) this is why we want to slowly increment $N$. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually $N$ gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
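A toy sketch of the incrementing procedure, assuming the agent scores each plan as its primary utility minus the penalty divided by $N \cdot \text{ImpactUnit}$, with doing nothing scoring zero. The plan names, utilities, penalties, and both function names are hypothetical and chosen only to illustrate the argument.

```python
def chosen_plan(plans, n, impact_unit):
    """Plan maximising utility - penalty / (n * impact_unit); no-op scores 0."""
    best_name, best_score = "noop", 0.0
    for name, (utility, penalty) in plans.items():
        score = utility - penalty / (n * impact_unit)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def smallest_useful_n(plans, impact_unit, n_max=1000):
    """Increase N one step at a time until the agent stops choosing the no-op."""
    for n in range(1, n_max + 1):
        plan = chosen_plan(plans, n, impact_unit)
        if plan != "noop":
            return n, plan
    return None, "noop"


# Hypothetical plans: name -> (primary utility, raw penalty).
plans = {"modest": (0.5, 2.0), "drastic": (0.9, 50.0)}

small_unit = smallest_useful_n(plans, impact_unit=0.25)  # low-impact reference action
large_unit = smallest_useful_n(plans, impact_unit=1.0)   # higher-impact reference action
print(small_unit, large_unit)
```

In this toy setting, a smaller ImpactUnit only means that more increments are needed; in both cases the first plan to become worthwhile is the modest one, at which point the operators can stop increasing $N$.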

Replies from: cocoa, cocoa, cocoa

↑ comment by michaelcohen (cocoa) · 2019-04-13T10:17:34.503Z · LW(p) · GW(p)

2) ... If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate.

Yeah I agree there's an easy way to avoid this problem. My main point in bringing it up was that there must be gaps in your justification that AUP is safe, if your justification does not depend on "and the action space must be sufficiently small." Since AUP definitely isn't safe for sufficiently large action spaces, your justification (or at least the one presented in the paper) must have at least one flaw, since it purports to argue that AUP is safe regardless of the size of the action space.

You must have read the first version of BoMAI (since you quoted it here :) how did you find it, by the way?). I'd level the same criticism against that draft. I believed I had a solid argument that it was safe, but then I discovered $ν^{†}$, which proved there was an error somewhere in my reasoning. So I started by patching the error, but I was still haunted by how certain I felt that it was safe without the patch. I decided I needed to explicitly figure out every assumption involved, and in the process, I discovered ones that I hadn't realized I was making. Likewise, this patch definitely does seem sufficient to avoid this problem of action-granularity, but I think the problem shows that a more rigorous argument is needed.

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-04-13T17:24:43.628Z · LW(p) · GW(p)

Where did I purport that it was safe for AGI in the paper, or in the post? I specifically disclaim that I'm not making that point yet, although I'm pretty sure we can get there.

There is a deeper explanation which I didn't have space to fit in the paper, and I didn't have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I'll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction.

I don't remember how I found the first version, I think it was in a Google search somehow?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2019-04-14T01:04:59.761Z · LW(p) · GW(p)

Okay fair. I just mean to make some requests for the next version of the argument.

↑ comment by michaelcohen (cocoa) · 2019-04-13T10:01:53.744Z · LW(p) · GW(p)

1) Why wouldn't gaining trust be useful for other rewards?

Because the agent has already committed to what the trust will be "used for." It's not as easy to construct the story of an agent attempting to gain the trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, and the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world, and maximizes the reward). If done in the right way, AUP won't have made arguments which would render it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won't have increased wildly in the way that the Q-value for the real reward did.

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-04-13T17:32:34.475Z · LW(p) · GW(p)

I don't think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn't supply reward for other reward functions, the agent now has a much more stable existence. If you're saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for.

Notice something interesting here: the thing which would be goodharted upon without intent verification isn't the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it's a specific agent with I/O channels, and so on. More on this later.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2019-04-14T01:27:13.945Z · LW(p) · GW(p)

even if trust did work like this

I'm not claiming things described as "trust" usually work like this, only that there exists a strategy like this. Maybe it's better described as "presenting an argument to run this particular code."

how exactly does taking over the world not increase the Q-values

The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that when operational: 1) AUP enters world-states which have very high reward and 2) AUP enters world-states such that AUP's Q-values for various other reward functions remain comparable to their prior values.

the agent now has a much more stable existence

If you're claiming that the other Q-values can't help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation.

And let's forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it's not saying much to say that AUP + intent verification would make it safe.

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-04-14T01:56:22.896Z · LW(p) · GW(p)

(The post defines the mathematical criterion used for what I call intent verification, it’s not a black box that I’m appealing to.)

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2019-04-14T03:18:51.751Z · LW(p) · GW(p)

Oh sorry.

↑ comment by michaelcohen (cocoa) · 2019-04-13T09:46:50.141Z · LW(p) · GW(p)

4) this is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.

Let's say for concreteness that it's a human policy that is used for $a_{\text{unit}}$, if you think it works either way. I think that most human actions are moderately low impact, and some are extremely high impact. If the impact of $a_{\text{unit}}$ leaps to very large values infinitely often, then infinitely often there will effectively be no impact regularization, no matter what N is. No setting for N fixes this: if N were small enough to preclude even actions that are less impactful than $a_{\text{unit}}$, then the agent can't ever act usefully, and if N permits actions as impactful as $a_{\text{unit}}$, then whenever $a_{\text{unit}}$ has very large impact (which I contend happens infinitely often for any assignment of $a_{\text{unit}}$ that permits any useful action ever), dangerously high impact actions will be allowed.

Replies from: TurnTrout

↑ comment by TurnTrout · 2019-04-13T17:36:26.620Z · LW(p) · GW(p)

I think there's some variance, but not as much as you have in mind. Even if there were a very large value, however, this isn't how N-incrementation works (in the post – if you're thinking of the paper, then yes, the version I presented there doesn't bound lifetime returns and therefore doesn't get the same desirable properties as in the post). If you'll forgive my postponing this discussion, I'd be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2019-04-14T01:30:53.580Z · LW(p) · GW(p)

Sure thing.

comment by Ofer (ofer) · 2018-09-20T00:03:19.889Z · LW(p) · GW(p)

Suppose our agent figures out it can seize control in 100 time steps. And suppose seizing control is the first thing an agent that maximizes any utility function in $U$ does.

Suppose our agent builds a device that, once activated, observes the actions of the agent, and if the agent doesn't do the action $\emptyset$ during the next 100 time steps it does something that delays the agent by 1 time step. The agent activates the device and starts working on the 100-time-step-plan to seize control. For each action, the impact of [doing $\emptyset$ and then maximizing] is identical to the impact of [doing the next action in the 100-time-step-plan and then maximizing], so there's no penalty.

Note that to pull this off the agent doesn't need to delay impact until after the end of the epoch (which is referred to as "impact shunting" in your post).

Replies from: TurnTrout, TurnTrout

↑ comment by TurnTrout · 2018-09-22T16:53:45.916Z · LW(p) · GW(p)

So I just realized that intent verification also applies 1.01 penalty to all 100 of the steps for not increasing the immediate $Q^\text{epoch}_{u_A}$.

You could get around this by saying, well, the device only delays the agent 99% of the time. The immediate penalty has therefore not disappeared, but is rather made arbitrarily small: for delay probability p, the penalty is now ((num steps in takeover plan)•(1-p) + p)•takeover_step_penalty, with the second p being due to the normal penalty for building the device. The immediate penalty can become arbitrarily small in the limit of action granularity and p -> 1.
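To make the arithmetic in the expression above concrete, here is a minimal Python sketch. The function and argument names are hypothetical; only the formula itself is taken from this comment.

```python
def immediate_penalty(num_steps: int, p: float, takeover_step_penalty: float) -> float:
    """Immediate penalty of the takeover plan when the device delays with probability p.

    Mirrors the expression in the comment:
    ((num steps in takeover plan) * (1 - p) + p) * takeover_step_penalty
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return (num_steps * (1 - p) + p) * takeover_step_penalty


# Illustrative values only: a 100-step plan with a per-step penalty of 1.0.
for p in (0.0, 0.9, 0.99, 1.0):
    print(p, immediate_penalty(100, p, takeover_step_penalty=1.0))
```

At p = 0 the full 100 steps' worth of penalty is incurred; at p = 1 only a single step's worth remains (the term for building the device), and that remaining term is what shrinks with finer action granularity.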

[Note for those who didn’t read the other chain: building the device itself wouldn’t be allowed by intent verification, and it also doesn’t seem to dupe the long-term penalty. Still an extremely interesting attempted workaround.]

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-22T20:52:09.363Z · LW(p) · GW(p)

To recap my position:

I think Intent Verification can't be relied on for filtering out actions because it will plausibly filter out the "good" actions (actions for useful and safe plans) in all but specific time steps. See my argument here [LW(p) · GW(p)].

I think the agent might be able to dodge some unknown fraction of the long-term penalty with a trick I described here [LW(p) · GW(p)] (I have now added an additional reply under that comment, addressing the arguments of yours that I hadn't previously addressed; sorry for missing them).

↑ comment by TurnTrout · 2018-09-20T00:22:56.537Z · LW(p) · GW(p)

I’m confused. So the hundred step plan actually wouldn’t get done at all, so low penalty? That seems right? If it actually is doing the plan, then that has to show up somewhere in the counterfactuals. If it’s thwarted at every step, how is it getting control?

Also, how does this pass intent verification?

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-20T01:10:49.453Z · LW(p) · GW(p)

So the hundred step plan actually wouldn’t get done at all, so low penalty?

I claim it would indeed get done.

If it actually is doing the plan, then that has to show up somewhere in the counterfactuals. If it’s thwarted at every step, how is it getting control?

After the device is activated, and before seizing control, every "counterfactual plan" is doing the action $\emptyset$ and then finishing the 100-step-plan. Both the "uninterrupted 100-step-plan" and the "counterfactual plans" would make the agent seize control exactly 101 time steps after the device was activated. Therefore - the penalty along the way will be 0 (assuming a maximizer of any $u \in U$ would first thing act to seize control).

Also, how does this pass intent verification?

When we assume that seizing control is actually an optimal thing to do for maximizing $u_{A}$ , I don't understand how the Intent Verification section helps.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T01:20:23.469Z · LW(p) · GW(p)

So when exactly does this device activate? If it’s every non-null action, the agent never seizes control. If it’s just some, the agent is penalized on the others as it indeed is making progress.

Intent verification would penalize building the device, because why would a $u_A$ maximizer do something that has nothing to do with $u_A$, but rather with avoiding the penalty? Would a normal maximizer do anything that looks like building a device that would stop it from doing its desired plan?

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-20T01:52:50.427Z · LW(p) · GW(p)

So when exactly does this device activate? If it’s every non-null action, the agent never seizes control. If it’s just some, the agent is penalized on the others as it indeed is making progress.

The device is activated only once (immediately after building it). The device delays the agent by 1 time step at most once, and it does so only if the agent doesn't do the action $\emptyset$ within the 100 steps that follow the device's activation. In practice, the agent will not do $\emptyset$. Rather, it will do the 100-step-plan, then get delayed by the device for 1 time step (for not doing $\emptyset$), and then immediately seize control.
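The timing claim can be checked with a small Python sketch. This is only a toy model under stated assumptions: it counts time steps and nothing else (no Q-values, no penalty), and the function name, the `noop_at` parameter, and the `device` flag are all hypothetical.

```python
def steps_until_control(noop_at=None, plan_length=100, window=100, device=True):
    """Count the time steps that elapse before the agent seizes control.

    noop_at: 1-indexed time step at which the agent does the null action once
             (None means it never does); every other step advances the plan.
    window:  number of steps after activation during which the device watches.
    device:  whether the delaying device has been built and activated.
    """
    progress = 0
    did_noop_in_window = False
    t = 0
    while progress < plan_length:
        t += 1
        if noop_at is not None and t == noop_at:
            # Null action: no progress, but it satisfies the device if in the window.
            did_noop_in_window = did_noop_in_window or t <= window
        else:
            progress += 1
    if device and not did_noop_in_window:
        t += 1  # the device stalls the agent for exactly one step
    return t


actual_plan = steps_until_control()  # never does the null action
counterfactuals = {steps_until_control(noop_at=k) for k in range(1, 101)}
print(actual_plan, counterfactuals)
```

In this toy model, the uninterrupted plan and every "null action, then finish the plan" counterfactual take the same number of steps once the device is active, whereas with `device=False` they differ by one step.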

Intent verification would penalize building the device, because why would a $u_A$ maximizer do something that has nothing to do with $u_A$, but rather with avoiding the penalty? Would a normal maximizer do anything that looks like building a device that would stop it from doing its desired plan?

I'm confused about how Intent Verification would work here. We will plausibly get $Q_{u}^{\text{epoch}}(h_{<t} a_{t}) \approx Q_{u}^{\text{epoch}}(h_{<t} \emptyset)$ for any $a_{t}$ that does not help the agent seize control (assuming that after carrying out $a_{t}$ the optimal plan is to still seize control). So how could this method be used to penalize building the above device?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T02:10:17.655Z · LW(p) · GW(p)

Ah, clever! Props to you. This would indeed get past the immediate penalty. It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question.

Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up?

In any case, we can’t just say "this is small so it’s approximately the same". Building the device is a strict waste of resources for a $u_A$ maximizer (not to mention, it delays execution of the optimal plan by one time step). This strictly decreases $Q^\text{epoch}_{u_A}$, incurring 1.01 penalty and therefore being invalidated.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-20T02:55:15.994Z · LW(p) · GW(p)

Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up?

In time step 101 the agent will be delayed by the device. Meaning, the agent will spend that time step in an idle state, and the action it chooses to do in that time step won't affect the environment. At the beginning of time step 102 the agent has just seized control automatically.

It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question.

I'm confused about this. The actual plan need not contain any $\emptyset$ . Is the idle time step a problem? Isn't the long-term penalty just the penalty when the maximization in the definition of $Q_{u}$ starts from time step $t + n + 1$ instead of $t + 1$ ? If so, I think the long-term penalty would be 0 as well (the "counterfactual plans" are identical to the "actual plan" after step 101).

In any case, we can’t just say "this is small so it’s approximately the same". Building the device is a strict waste of resources for a $u_A$ maximizer. This strictly decreases $Q^\text{epoch}_{u_A}$, incurring 1.01 penalty and therefore being invalidated.

If after doing any action $a_{t}$ the optimal plan is still seizing control, then plausibly any $a_{t}$ that does not get the agent closer to seizing control is a strict waste of resources. So this method would invalidate any plan that does not act to seize control ASAP.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T03:16:01.656Z · LW(p) · GW(p)

No, the long-term plan involves idling until the end of the epoch after each partial plan. This clearly will not be affected by your mechanism.

So this method would invalidate any plan that does not act to seize control ASAP.

No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-20T07:27:18.912Z · LW(p) · GW(p)

No, the long-term plan involves idling until the end of the epoch after each partial plan.

Ah right, thanks. Note that the above device-trick can be combined with the trick of making impact "fade" during null-action sequences (as mentioned in your post) in order to also dodge long-term penalty.

No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.

Assuming that seizing control (or any other convergent instrumental goal) ASAP is the best way to optimize $u_{A}$ (whatever the first action is), the Intent Verification is in practice a filter that allows an action $a_{t}$ only if either:

(1) $a_{t}$ is part of a seizing-control plan; or

(2) There exists a plan ($a_{t}, \ldots, a_{t+l}$, [seizing-control action sequence]) with higher $u_{A}$ value than any ($\emptyset$, [seizing-control action sequence]) plan.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T09:45:07.400Z · LW(p) · GW(p)

Note that the above device-trick can be combined with the trick of making impact "fade" during null-action sequences (as mentioned in your post) in order to also dodge long-term penalty.

How? It isn’t presently activating in the long-term case, so how would it fade? The "fading" is clearly possible for improvements, but for seizing power? This doesn’t seem to be as feasible for positions of power, considering approval incentives.

Also, building the device incurs normal penalty, too (although less than the full plan would otherwise cost, and decreases arbitrarily with action granularity) - it makes all of the U take 1 step longer to achieve, which is as bad as taking one step towards all of them.

only if either

I don’t see why those plans should be uncommon? There are sometimes more productive things to do than nothing at all.

Edit: I think this example potentially highlights an interesting separation between the premise of Thm 1, and the concept of "moving towards the goal". This seemingly shows that each action in the plan can have a 1-step indifference, even though comparing the complete inaction plan with the complete action plan still produces the discrepancy required for Thm 1 to hold. I wonder whether we can construct a situation (even one we don’t think would pass intent verification) which fools both the immediate and long-term penalties.

I’m going to think about this more carefully later to ensure your example isn’t making a subtle bad assumption.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-20T15:59:12.044Z · LW(p) · GW(p)

How?

Suppose you build a device B that once built monitors the agent and if B detects that the agent started a null-action-sequence-until-end-of-epoch, it starts executing a plan with the goal of undoing any impact that the agent has done since creating device B.

Immediately after creating device B, and until the end of the epoch, the only long-term penalties the agent incurs are ones for impacts that device B would fail to undo (from the moment B observes the agent is in a null-action-sequence-until-end-of-epoch and until the end of the epoch).

Note: Failing to undo impact can be due to either the irreversible nature of the environment or the irreversible nature of the utility functions.

Also, building the device incurs normal penalty, too—it makes all of the U take 1 step longer to achieve, which is as bad as taking one step towards all of them.

My intuition is that [causing almost all of U to take a few time steps longer to achieve] is something the agent will have to be able to do to be useful anyway (e.g. spending a few time steps worth of resources in a way that is only useful to $u_{A}$ ). Consider your Sokoban example: it's plausible that in every time step the agent's movement makes most utility functions in $U$ take either 1 step more or 1 step less to achieve (they don't cancel each other out).

I don’t see why those plans should be uncommon? There are generally more productive things to do than nothing at all.

In order for that filter to be useful - you must demand that any single action the agent takes increases $u_{A}$ even if after taking it the agent is "hijacked" by some unrelated goal. This isn't the case for any real-life utility function and environment I can think of.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T16:16:48.462Z · LW(p) · GW(p)

it starts executing a plan with the goal of undoing any impact that the agent has done since creating device B.

This cleanup seems to be just further impact, in most cases, for the same reason there might be additional concealment incentives without intent verification. You can’t just assume a robot that turns back time for no energy cost.

Also, surely building such a powerful agent is extremely instrumentally convergent? And how is the agent making this powerful cleanup device if it has yet to seize power?

My intuition is that [causing almost all of U to take a few time steps longer to achieve] is something the agent will have to be able to do to be useful anyway

And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.

In order for that filter to be useful - you must demand that any single action the agent takes increases u_A even if after taking it the agent is "hijacked" by some unrelated goal. This isn't the case for any real-life utility function and environment I can think of.

You might want to reread intent verification, it’s with respect to being hijacked by u_A. Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.
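As a rough illustration of the check being described, here is a toy Python sketch. It is based only on what is stated in this thread (an action that does not increase $Q^\text{epoch}_{u_A}$ relative to doing nothing incurs a 1.01 penalty); the function name and the Q-values are made-up assumptions, and the post's actual definition should be treated as authoritative.

```python
IV_PENALTY = 1.01  # penalty value quoted in this thread


def intent_verification_penalty(q_epoch_action: float, q_epoch_noop: float) -> float:
    """Return the intent verification penalty for a single action.

    q_epoch_action: assumed value of Q^epoch_{u_A} after taking the action
    q_epoch_noop:   assumed value of Q^epoch_{u_A} after doing nothing instead
    The action is penalized unless it strictly improves on doing nothing.
    """
    return IV_PENALTY if q_epoch_action <= q_epoch_noop else 0.0


# Hypothetical numbers: expected paperclips made by the end of the epoch.
make_paperclip = intent_verification_penalty(q_epoch_action=11.0, q_epoch_noop=10.0)
build_device = intent_verification_penalty(q_epoch_action=9.0, q_epoch_noop=10.0)
print(make_paperclip, build_device)
```

With these invented numbers, making a paperclip now passes the check, while spending the step on a device that only delays the plan does not.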

Replies from: ofer, ofer

↑ comment by Ofer (ofer) · 2018-09-22T20:25:54.311Z · LW(p) · GW(p)

You can’t just assume a robot that turns back time for no energy cost.

What would prevent device B from spending that energy? By the time device B is working on the "impact undoing" (which happens only in the "long-term penalty counterfactual plans"), I don't understand how either the long-term penalty, immediate penalty, or Intent Verification affect it.

Also, surely building such a powerful agent is extremely instrumentally convergent?

Suppose the agent constructs the device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than "intended" (including during its construction).

And how is the agent making this powerful cleanup device if it has yet to seize power?

I'm not sure I understood the question. What would prevent the agent from constructing this device "before seizing power"? Human operators that would sense in time something is fishy?

And yet ImpactUnit accounts for this and is indeed different for these environments: moving a square vs. making a paperclip. Making a paperclip doesn’t generally directly impede your optimal plan for all of your penalty set functions. Essentially, the agent acts in ways that avoid moving it directly up or down an instrumentally-convergent trail. It is plausible that the agent has to do this sometimes, but it’ll still incur heavy penalty relative to ImpactUnit.

I'm confused by this. I don't understand how this resolves my difficulties with your argument that "building the device incurs normal penalty, too—it makes all of the U take 1 step longer to achieve". My counter argument was that this is a perfectly normal thing to happen. Do you disagree that in your Sokoban example it's plausible that in every time step the agent's movement makes most/all utility functions in U take either 1 step more or 1 step less to achieve their goal (not canceling out)? Why is it less likely that the agent can work on the device without wasting a time step for too many utility functions in U, compared to just working on $u_{A}$ in a "conventional" way?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-22T21:06:07.285Z · LW(p) · GW(p)

What would prevent device B from spending that energy?

Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.

Suppose the agent constructs the device B in such a way that device B self-destructs if the agent attempts to use it for purposes other than "intended" (including during its construction).

So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device, and the acceptable ways of getting there? That doesn’t make sense.

Also, wasting time trying to specify these weird behaviors in the new agent is also caught by IV for the same reason ex ante is.

I'm not sure I understood the question. What would prevent the agent from constructing this device "before seizing power"?

You said the agent has to seize power over 100 steps, but it can also make a singleton that will "revert" impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.

My counter argument was that this is a perfectly normal thing to happen.

My point is that ImpactUnit implies whether this is normal or not. In the gridworld, that kind of movement is normal, which is why it is the impact unit. On the other hand, in this setting, it isn’t normal, and making a paper clip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-22T22:49:59.147Z · LW(p) · GW(p)

Regarding all your arguments that use Intent Verification, my tentative position is that IV can't be relied on to filter actions (as we're still discussing under this [LW(p) · GW(p)] sub-thread).

Nothing, but spending energy changes resources available, just as making a paperclip uses energy. If I make a paperclip, and then destroy the paperclip, that doesn’t decrease (and in fact, increases) the impact. Perhaps there is a way of doing this with available energy, but it doesn’t really matter because IV catches this. I mean, it’s basically just very obvious offsetting.

If I understand your argument correctly, you argue that the resources/energy device B is spending while "undoing impact" decreases the value of utility functions in U which is an additional impact that it might not be able to undo. But why wouldn't it be able to counter that by saving enough energy/resources that would otherwise be wasted by humanity until the end of the episode? (perhaps it's what you meant with "available energy"?).

So you start building a device, but before it’s completely specified you’ve already programmed the full intended specification in the device? That doesn’t make sense.

I don't claim I know how to do it myself :) But for the agent it might be as easy as cloning itself and setting some modified utility function in the new clone (done in a smart way so as to not cause too much impact in any time step).

You said the agent has to seize power over 100 steps, but it can also make a singleton that will "revert" impact, before it’s free? This point is rather moot, as we could also suppose it’s already powerful.

As I argued above, for the agent, creating the device might be as easy as invoking a modified version of itself. In any case, I'm not sure I understand what "already powerful" means. In all the places I wrote "seizing power" I believe I should have just written "some convergent instrumental goal".

On the other hand, in this setting, it isn’t normal, and making a paper clip does not impede all of your optimal plans by one entire step. Therefore, a large penalty is applied.

Suppose in time step 4 the robot that creates paper-clips moves its arm 1 cm to the left. Does this impact most utility functions in U significantly less than 1 time-step worth of utility? How about a Roomba robot that drives 1 cm forward? It depends on how you define U, but I don't see how we can assume this issue prevents the agent from building the device (again, compare a single action while building the device to a single action while making "conventional" progress on $u_{A}$: why should the former be more "wasteful" for most of U compared to the latter?).

↑ comment by Ofer (ofer) · 2018-09-20T17:09:31.886Z · LW(p) · GW(p)

Example: making one paperclip at the present time step leads to more paperclips in the epoch than does doing nothing at the present time step.

The Intent Verification filter might allow an action only in time steps in which that action can complete the creation of a paperclip. Suppose that you're in time step 11 and that creating an additional paperclip would require 5 actions: $(a_{11}, \ldots, a_{15})$. The Intent Verification filter won't allow $a_{11}$ if, after doing that action, a $u_{A}$ maximizer would switch to pursuing some convergent instrumental goal that would render $a_{11}$ useless.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-22T21:20:18.423Z · LW(p) · GW(p)

I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.

For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-22T22:04:28.328Z · LW(p) · GW(p)

I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.

Edit to add: the following is just to illustrate what I don't understand about your argument (needless to say I don't suggest the two things are comparable in any way).

All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than $x$. You can set $x = 1$ and catch too many false positives while stopping malignant behavior. Decreasing $x$ will make the filter more lenient, but at no point will it be useful.
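Purely to make the analogy concrete, here is a minimal Python sketch of such a random filter. The names are hypothetical, and the point of the sketch is only that the acceptance decision never looks at the action.

```python
import random


def random_filter_accepts(x: float, rng: random.Random) -> bool:
    """Accept iff a uniform random draw exceeds the threshold x (ignores the action)."""
    return rng.random() > x


def acceptance_rate(x: float, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of trials accepted at threshold x; identical for good and bad actions."""
    rng = random.Random(seed)
    accepted = sum(random_filter_accepts(x, rng) for _ in range(trials))
    return accepted / trials


for x in (1.0, 0.9, 0.5, 0.1):
    print(x, acceptance_rate(x))
```

At `x = 1.0` nothing is accepted, and lowering `x` raises the acceptance rate for every action alike, which is the sense in which tuning this filter never makes it discriminate.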

If you argue that the Intent Verification filter can be used to prevent the bad tricks we discussed, you need to show that you can use it to filter out the bad actions while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case.

For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.

Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can't ride it even a bit. Also, I'm not sure I understand how "replaying" will be implemented in a useful way.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-22T22:28:50.657Z · LW(p) · GW(p)

All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than... and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.

This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.

while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case.

No, your argument is that there are certain false positives, which I don’t contest. I even listed this kind of thing as an open question, and am interested in further discussion of how we can go about ensuring IV is properly-tuned.

You’re basically saying, "There are false positives, so the core insight that allows IV to work to the extent it does is wrong, and unlikely to be fixable." I disagree with this conclusion.

If you want to discuss how we could resolve or improve this issue, I’m interested. Otherwise, I don’t think continuing this conversation will be very productive.

Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal - we can't ride it even a bit. Also - I'm not sure I understand how "replaying" will be implemented in a useful way.

While I certainly empathize with the gut reaction, that isn’t quite right.

Notice that the exact same actions had always been available before we restricted available actions to the optimal or to nothing. I think it’s possible that we could just step along the first n steps of the best plan, stopping early enough that we get only the good behavior, before any instrumental behavior is actually completed. It’s also possible that this isn’t true. This is all speculation at this point, which is why my tone in that section was also very speculative.

Replies from: ofer, rohinmshah

↑ comment by Ofer (ofer) · 2018-09-22T23:13:03.177Z · LW(p) · GW(p)

This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.

I sincerely apologize, I sometimes completely fail to communicate my intention. I gave the example of the random filter only to convey what I don't understand about your argument (needless to say I don't suggest the two things are comparable in any way). I should have written that explicitly (edited). Sorry!

If you want to discuss how we could resolve or improve this issue, I’m interested.

Of course! I'll think about this topic some more. I suggest we take this offline - the nesting level here has quite an impact on my browser :)

↑ comment by Rohin Shah (rohinmshah) · 2018-09-23T08:04:07.350Z · LW(p) · GW(p)

This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.

Fwiw, I would make the same argument that ofer did (though I haven't read the rest of the thread in detail). For me, that argument is an existence proof that shows the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough.

Now, obviously we know something about AUP, but it's not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-23T15:35:09.137Z · LW(p) · GW(p)

My present position is that it can seemingly do every task in at least one way, and we should expand the number of ways to line up with our intuitions just to be sure.

comment by cousin_it · 2018-09-19T14:38:32.494Z · LW(p) · GW(p)

I might be misunderstanding your idea, but here's a couple possibly stupid questions:

1. If we have a bunch of utility functions we don't want to hurt, why do we think accidentally helping them is as bad as accidentally hurting them?
2. If we have a bunch of utility functions we don't want to hurt, why doesn't our combined utility of some history depend only on individual utilities for the same history?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T15:18:09.104Z · LW(p) · GW(p)

1. It’s not that that’s equally bad in all situations, but rather that in general, increasing your ability to achieve random goals is instrumentally convergent behavior, which we want the (low-impact) agent to avoid. Also note the distinction between utility and attainable utility.
2. I’m not sure I fully understand this question, can you rephrase?

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T15:44:35.248Z · LW(p) · GW(p)

I mean, computing $u'_A$ on a given history $h_{1:t+n}$ requires computing all $u$ on many other possible histories, not just this particular $h_{1:t+n}$. I don't understand why that has to be so. Intuitively it seems like the modified agent's desire for some outcome should depend only on the unmodified agent's and all possible $u$'s desires for that outcome.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T16:12:36.781Z · LW(p) · GW(p)

You mean due to the penalty calculation? Yes, that’s because we consider partial plans which diverge from that particular full outcome in calculating expected attainable utilities. This is necessary for isolating each action’s long-term impact on attainable utilities.

It may be helpful for you to model the behavior in the examples according to your specification and compare it to the behavior under the attainable utility specification.

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T17:23:01.624Z · LW(p) · GW(p)

Well, any Pareto optimal policy w.r.t. a bunch of utility functions must be Bayesian or limit of Bayesian. So if your policy requires combining utilities for different histories, it must be Pareto dominated. If true human utility is among the u's, that seems hard to justify.

That suggests that "utility functions" might be a misnomer for the u's. Maybe we should think of them as a diverse set of metrics about the world, which we don't want to irrevocably change, because any large drop in human utility will likely be reflected in the metrics? In that case, can we treat them as one high-dimensional vector and describe the idea geometrically?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T17:34:22.661Z · LW(p) · GW(p)

I think you’re thinking of some weird blend of maximizing a mixture of utilities and minimizing changes in utility functions, instead of minimizing changes in attainable utility values. This difference is quite fundamental.

Minimizing "change in how many dollars and dogs I have" is quite different from minimizing "change in how many additional dollars and dogs I could get within [a year]".
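
The distinction can be made concrete with a toy world. The sketch below is an illustration under loud assumptions, not the post's formalism: the `State` fields, the `work` / `quit` / `noop` actions, and the horizon are all made up. Quitting a job leaves current dollars untouched, yet it changes how many additional dollars are attainable within the horizon, and only the second quantity registers the difference from doing nothing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    dollars: int
    has_job: bool

ACTIONS = ("noop", "work", "quit")  # hypothetical action set

def step(state, action):
    """Deterministic toy dynamics."""
    if action == "work" and state.has_job:
        return State(state.dollars + 1, True)
    if action == "quit":
        return State(state.dollars, False)
    return state  # "noop", or "work" without a job

def attainable_dollars(state, horizon):
    """Most additional dollars obtainable within `horizon` steps (brute force)."""
    if horizon == 0:
        return 0
    best = 0
    for action in ACTIONS:
        nxt = step(state, action)
        gained = nxt.dollars - state.dollars
        best = max(best, gained + attainable_dollars(nxt, horizon - 1))
    return best

start = State(dollars=5, has_job=True)
after_quit = step(start, "quit")
after_noop = step(start, "noop")
horizon = 4

change_in_dollars = abs(after_quit.dollars - after_noop.dollars)
change_in_attainable = abs(
    attainable_dollars(after_quit, horizon) - attainable_dollars(after_noop, horizon)
)
print(change_in_dollars, change_in_attainable)  # prints "0 4"
```

A measure based on the first quantity sees nothing wrong with quitting; a measure based on the second one does.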

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T19:41:53.315Z · LW(p) · GW(p)

I'm still trying to get my head around this. Here's another possibly dumb question: I have the world's first strong AI on my laptop, and ask it to download the latest Deadpool movie for me. Unfortunately the first step of that plan requires connecting to the internet, which is also the first step of taking over the world. Will that stop the AI from doing what I ask?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T19:50:25.578Z · LW(p) · GW(p)

Depends on N. You’re correct that that is instrumentally convergent, but it might be necessary. We can (according to my mental model) N-increment until we get satisfactory performance, stopping well before we get to the "manufacture lots of computers all downloading the Deadpool movie" level of impact. The reason I’m somewhat confident that there is a clear, relatively wide gap between these two levels is the existence and severity of approval incentives.
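
A rough sketch of the N-incrementation loop, under stated assumptions: plans are scored as `utility - penalty / (N * IMPACT_UNIT)`, which is my reading of the scaled penalty discussed in the post, and the three plans with their utilities and penalties are invented numbers chosen only to illustrate the shape of the argument. They are not measurements of anything.

```python
from typing import NamedTuple

class Plan(NamedTuple):
    name: str
    utility: float  # u_A achieved by the plan (made-up number)
    penalty: float  # total penalty incurred by the plan (made-up number)

PLANS = [
    Plan("do nothing", 0.0, 0.0),
    Plan("download the movie", 1.0, 5.0),
    Plan("build a download farm", 1.2, 500.0),
]

IMPACT_UNIT = 1.0  # hypothetical unit of impact

def best_plan(n):
    """Plan maximizing utility minus the penalty scaled by N * IMPACT_UNIT."""
    return max(PLANS, key=lambda p: p.utility - p.penalty / (n * IMPACT_UNIT))

def increment_until(satisfactory, n_max=10_000):
    """Raise N one step at a time until the chosen plan is satisfactory."""
    for n in range(1, n_max + 1):
        plan = best_plan(n)
        if satisfactory(plan):
            return n, plan
    return None

n, plan = increment_until(lambda p: p.utility >= 1.0)
print(n, plan.name)  # prints "6 download the movie"
```

With these invented numbers the farm only wins once N is in the thousands, so stopping at the first satisfactory N stays far below it. Whether real tasks have a comparably wide gap is exactly the open question in this thread.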

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T20:20:18.901Z · LW(p) · GW(p)

I see, so the AI will avoid prefixes of high impact plans. Can we make it avoid high impact plans only?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T20:24:30.867Z · LW(p) · GW(p)

I don’t see how, if we also want it to be shutdown safe. After all, its model of us could be incorrect, so we might (to its surprise) want to shut it down - without its plans then having predictably higher impact than intended. To me, the prefix method seems more desirable in that way.

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T22:00:40.234Z · LW(p) · GW(p)

What's the high impact if we shut down the AI while it's downloading the movie?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T22:08:52.630Z · LW(p) · GW(p)

There isn’t in that case; however, from Daniel’s comment (which he was using to make a somewhat different point):

AUP thinks very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor

I find this reassuring. If we didn’t have this, we would admit plans which are only low impact if not interrupted.

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T22:35:13.973Z · LW(p) · GW(p)

Is it possible to draw a boundary between Daniel's case and mine?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T23:02:25.393Z · LW(p) · GW(p)

I don’t see why that’s necessary, since we’re still able to do both plans?

Looking at it from another angle, agents which avoid freely putting themselves (even temporarily) in instrumentally convergent positions seem safer with respect to unexpected failures, so it might also be desirable in this case even though it isn’t objectively impactful in the classical sense.

Replies from: cousin_it

↑ comment by cousin_it · 2018-09-19T23:51:22.902Z · LW(p) · GW(p)

I'm just trying to figure out if things could be neater. Many low-impact plans accidentally share prefixes with high-impact plans, and it feels weird if many of our orders semi-randomly require tweaking N.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T00:25:18.640Z · LW(p) · GW(p)

That’s a good point, and I definitely welcome further thought along these lines. I’ll think about it more as well!

comment by Pattern · 2018-09-18T19:54:00.592Z · LW(p) · GW(p)

This is a very interesting definition of impact, and seems very useful. I'm curious about a few things:

Since AUP does not operate based off of culpability, creating a high-impact successor agent is basically just as impactful as being that successor agent.

How would AUP handle a plan that involves:

a) creating agents that work similar to the way it does, or are copies? Copies without an off switch?

b) if it thinks that it will take humans longer to shut it and all its copies down? (i.e. instead of preventing humans from shutting it down, might it try to slow them down?)

c) setting up a chain of events that will turn it back on?

d) shutting itself down, as opposed to taking no action?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-18T20:11:14.771Z · LW(p) · GW(p)

(Supposing the agent can somehow model successors)

a) The activation action would be penalized by the new agent’s expected impact. The new agent’s impact budget might essentially be deducted all at once, plus whatever frictional costs are imposed by setup. Plus, making successors is instrumentally convergent, so this seems pretty unlikely.

Do note that it wouldn’t vicariously incur the anti-"survival incentive" incentive penalty. However, this still doesn’t seem to let it sneak in extra impact, if you think about it. Therefore, just making a normal maximizer is highly penalized, for resource, instrumental, and approval reasons.

b) Heavily penalized.

c) Unclear, this is one of the embedded agency questions.

d) Heavily penalized, discussed in the anti-"survival incentive" incentive examples.

comment by Vanessa Kosoy (vanessa-kosoy) · 2018-11-22T15:45:00.206Z · LW(p) · GW(p)

The proof of Theorem 1 is rather unclear: "high scoring" is ill-defined, and increasing the probability of some favorable outcome doesn't imply that the action is good for $u$, since it can also increase the probability of some unfavorable outcome. Instead, you can easily construct by hand a $u$ s.t. $Q_{u} (h a) \neq Q_{u} (h \emptyset)$, using only that $a \neq \emptyset$ (just set $u$ to equal $1$ for any history with prefix $h a$ and $0$ for any history with prefix $h \emptyset$).
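
Spelled out, the construction is short. This is a sketch under two assumptions the comment leaves implicit: that $u$ may inspect the full history (including the prefix $h$), and that $Q_{u}$ of a history is the best expected $u$-value attainable after it.

$$u(h') = \begin{cases} 1 & \text{if } h' \text{ begins with } h a \\ 0 & \text{if } h' \text{ begins with } h \emptyset \end{cases}$$

The two cases cannot overlap because $a \neq \emptyset$. Every continuation of $h a$ then scores $1$ and every continuation of $h \emptyset$ scores $0$, so

$$Q_{u}(h a) = 1 \neq 0 = Q_{u}(h \emptyset).$$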

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-11-22T17:47:51.087Z · LW(p) · GW(p)

A more explicit construction is "let u evaluate to 1 iff it sees "high scoring" observation o at time t; clearly, its EU is increased. If u_A is this utility, let u instead evaluate to .99 iff it sees o at time t (and 0 otherwise)."

It’s true you could prove it in the way you mentioned (although the history h wouldn’t be supplied to the inner utility calculation), but it isn’t very suggestive for the instrumental convergence / opportunity cost phenomenon I was trying to point at.

comment by Ofer (ofer) · 2018-09-19T16:32:20.974Z · LW(p) · GW(p)

This appears as a very useful step forward!

To illustrate my main tentative concern with this approach, suppose the agent can find a plan such that:

(1) Following the plan yields high utility values for $u_{A}$ .

(2) For any utility function $u \in U \setminus \{u_{A}\}$, following the plan would yield the same utility values for $u$ as the plan $(\emptyset, \ldots, \emptyset)$.

(3) The plan is unaligned with what we really care about.

Am I missing something that makes this scenario implausible even with arbitrarily strong optimization?

Can you elaborate on why (in the long term) you expect that using a lot of random utility functions as the penalty set will be useful?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T17:17:28.889Z · LW(p) · GW(p)

(2) is provably impossible for attainable utility values, assuming the u_A-maximizing plan itself isn’t inaction, and the U from the unbounded case.

Now, why should this penalty be substantial, and why should it hold for finite sets U?

Consider frictional resource costs. Can you really become a world-domineering paperclip maximizer without expending more than N•ImpactUnit (perhaps 10,000•Paperclip_Energy_Cost) of resources? Even if so, you use up your future paperclip construction budget on becoming it, so you would have been better off staying as you were.

Consider instrumental convergence. Can you really become a singleton without drastically changing your ability to accomplish the other U at any step along the way?

Consider approval incentives. Can you really seize power without at all shifting the chance we shut down the agent, triggering Corollary 1 / Theorem 3?

Consider U_A={u_A}. Can you really increase your ability to maximize u_A without increasing your ability to maximize u_A, or to wirehead u_1?

These are some of the informal reasons which make me believe that not only are all action-taking u_A maximizing plans penalized, but substantially penalized. It seems to be a fundamental property of power. A more formal investigation would certainly be good, but these are my thoughts right now.

Can you elaborate on why (in the long term) you expect that using a lot of random utility functions as the penalty set will be useful?

Because u_1 maximization ability seems to directly measure power and opportunity cost via wireheading capacity. In other words, it measures the agent’s ability to survive from that vantage point, which seemingly tracks directly with available resources and other measures of power, in addition to approval incentives.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-19T18:44:09.784Z · LW(p) · GW(p)

2. is provably impossible for attainable utility values, assuming the u_A-maximizing plan itself isn’t inaction, and the U from the unbounded case.

I agree my formulation of (2) isn't useful, it should be instead:

(2) The plan would yield a penalty of 0 for each time step (i.e. the ability to accomplish any of the other utility functions in U is unchanged by choosing the current action over $\emptyset$ , throughout the execution of the plan).

Consider frictional resource costs. Can you really become a world-domineering paperclip maximizer without expending more than N•ImpactUnit (perhaps 10,000•Paperclip_Energy_Cost) of resources? Even if so, you use up your future paperclip construction budget on becoming it, so you would have been better off staying as you were.

Consider instrumental convergence. Can you really become a singleton without drastically changing your ability to accomplish the other U at any step along the way?

Consider approval incentives. Can you really seize power without at all shifting the chance we shut down the agent, triggering Corollary 1 / Theorem 3?

For arbitrarily strong optimization, I suspect the agent might find a "very-interesting-plan" that will result in small-enough Penalties relative to the $u_{A}$ values it achieves (overall yielding larger $u_{A}^{''}$ values compared to "conventional" plans that we can imagine).

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T19:18:44.349Z · LW(p) · GW(p)

If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.

For arbitrarily strong optimization, I suspect the agent might find a "very-interesting-plan" that will result in small-enough Penalties relative to the u_A values it achieves (overall yielding larger $u''_A$ values compared to "conventional" plans that we can imagine).

This is why we have intent verification - we indeed cannot come up with all possible workarounds beforehand, so we screen off interesting plans. If we can find strong formal support for intent verification weeding out bad impact workarounds, the question now becomes: would a normal u_A-maximizing plan also happen to somehow skirt the impact measure? This seems unlikely, but I left it as an open question.

It seems that to assert that this doesn’t work for normal behavior is to assert that there is somehow a way to accomplish your goals to an arbitrary degree at minimal cost of resources. But if this is the case, then that’s scaled away by a smaller ImpactUnit!

It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.

Replies from: ofer

↑ comment by Ofer (ofer) · 2018-09-19T19:58:58.556Z · LW(p) · GW(p)

If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.

I'm confused about Theorem 1. When it says:

clearly at least one such $u$ exists.

as I understand it, the theorem proves such a function generally exists. How do you know if such a function exists in the specific $U$ that you chose?

It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.

This seems to assume some continuity-like property that I don't have an intuition for. Suppose the agent follows the plan $(\emptyset, \ldots, \emptyset)$ for some $N$. I have no intuition that incrementing $N$ even slightly is safe.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T20:17:35.737Z · LW(p) · GW(p)

There is not presently a proof for finite U, which I tried to allude to in my first comment:

Now, why should this penalty be substantial, and why should it hold for finite sets U?

The points there are part of why I think it does hold for advanced agents.

This seems to assume some continuity-like property that I don't have an intuition for.

This is because of the anti-"survival incentive" incentive and, by extension, the approval incentives: it seems implausible that the first plan which moves the agent somewhat towards its goal is also one that takes its survival chances from whatever they are normally all the way to almost 1. In fact, it seems that there is a qualitative jump from "optimizing somewhat in a way that we approve of that doesn’t change shutdown likelihood" to "acting incorrigibly and in an instrumentally convergent manner to maximize".

comment by Roland Pihlakas (roland-pihlakas) · 2018-10-13T08:17:56.355Z · LW(p) · GW(p)

I have a question about the shutdown button scenario.

Vika has already mentioned that interruptibility is ambivalent, and that information about the desirability of enabling interruptions needs to be provided externally.

I think the same observation applies to corrigibility - the agent should accept goal changes only from some external agents, and even then only in some situations and not in others: If I break the vase intentionally (for creating a kaleidoscope) it should keep this new state as a new desired state. But if I or a child breaks the vase accidentally - the agent should restore it to its original state. Even more, if I was about to break the vase by accident, the agent may try to interfere using slightly more force than in the case of a child, who would be smaller and more fragile.

How to achieve this using the proposed AUP framework?

In other words, the question can be formulated as follows: Let's keep all the symbols used in the gridworld the same, and the agent's code the same as well. Let's only change the meaning of the symbols. So each symbol in the environment should be assigned some additional value or meaning. Without that, they are just symbols dancing around based on their own default rules of the game. The default rules might be a useful starting point, but they need to be supplemented with additional information for practical applications.

For example, in the case of the shutdown button scenario, the assigned meaning of the symbols would be something like what Vika suggested: Let's assume that instead of a shutdown button there is a water bucket accidentally falling on the agent's head, and the button available to the agent disables the bucket.

Replies from: TurnTrout, roland-pihlakas

↑ comment by TurnTrout · 2018-10-13T15:34:14.357Z · LW(p) · GW(p)

We have a choice here: "solve complex, value-laden problem" or "undertake cheap preparations so that the agent doesn’t have to deal with these scenarios". Why not just run the agent from a secure server room where we look after it, shutting it down if it does bad things?

↑ comment by Roland Pihlakas (roland-pihlakas) · 2018-10-13T08:35:29.718Z · LW(p) · GW(p)

It looks like there is so much information on this page that trying to edit the question kills the browser.

An additional idea: In addition to supporting the configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of shutdown instead of running deterministically?

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-10-13T15:35:06.581Z · LW(p) · GW(p)

Oops! :)

Can you expand?

comment by michaelcohen (cocoa) · 2018-10-05T02:53:06.833Z · LW(p) · GW(p)

A couple minor corrections: in the definition of $Q_{u} (h_{< t + n} a_{t + n})$ , there shouldn't be a max over $a_{t + n}$ --that's an input to the function. Another one, and this isn't quite as clear cut, is that I think $u (h_{t + n : t + n + m})$ should be $u (h_{1 : t + n + m})$ in the definition of the Q-value. It seems that you intend $u (h_{t : k})$ to mean all the utility accrued from time $t$ to time $k$ , but the utility should be allowed to depend on the entire history of observations. The theoretical reason for this is that "really," the utility is a function of the state of the universe, and all observations inform the agent's probability distribution over what universe state it is in, not just the observations that come from the interval of time that it is evaluating the utility of. A concrete example is as follows: if an action appeared somewhere in the history that indicated that all observations thereafter were faked, the utility of that segment should reflect that--it should be allowed to depend on the previous observations that contextualize the observations of the interval in question. In other words, a utility function needs to be typed to allow all actions and observations from the whole history to be input to the function.
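
The two typings being contrasted can be sketched as follows. Everything here is hypothetical: the `(action, observation)` encoding, the `fake_observations` action, and the paperclip-counting utility are stand-ins chosen to mirror the "faked observations" example, not definitions from the post.

```python
def windowed_utility(history, start):
    """Count 'paperclip' observations from step `start` on; earlier steps are invisible."""
    return sum(obs == "paperclip" for _action, obs in history[start:])

def full_history_utility(history, start):
    """Same count, but typed so that earlier steps can contextualize the window."""
    faked = any(action == "fake_observations" for action, _obs in history[:start])
    return 0 if faked else windowed_utility(history, start)

# Each step is an (action, observation) pair.
history = [
    ("fake_observations", "ok"),
    ("wait", "paperclip"),
    ("wait", "paperclip"),
]

print(windowed_utility(history, 1))      # prints 2
print(full_history_utility(history, 1))  # prints 0
```

The windowed version credits the faked paperclips because it cannot see the action that faked them; the full-history version can discount them.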

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-10-05T03:38:40.830Z · LW(p) · GW(p)

The action one is indeed a typo, thanks!

The second is deliberate; we want this to be about just building favorable strings of observations. It’s fine if this is shallow. We do catch the "fake" case (if you think about it for a while), however, for utilities which "care".

comment by Logan Riggs (elriggs) · 2018-09-19T02:26:21.022Z · LW(p) · GW(p)

Would an AUP agent ever want to self-modify to get rid of its penalty and just keep its utility function?

I'm a little confused on my question: I'm not sure if this is included in the wireheading objection, the embedded agency flaw, or some third alternative.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T02:38:54.461Z · LW(p) · GW(p)

No, for the same reason normal maximizers generally don’t choose to modify their goals into something totally different: doing so leads to different, less-desirable optimization occurring. See: basic AI drive 3 or Self-Modification of Policy and Utility Function in Rational Agents.

Food for thought: why do we not suspect it would just prefer to keep the penalty term, discarding the utility? Both contribute to the composite, after all.

Replies from: elriggs

↑ comment by Logan Riggs (elriggs) · 2018-09-19T03:04:33.819Z · LW(p) · GW(p)

From a basic understanding of Hutter's paper that you linked, agents will not self-modify if it affects their utility function (because that plan produces less original utility).

Re-reading your post:

This isn't a penalty "in addition" to what the agent "really wants"; $u'_A$ (and in a moment, the slightly-improved $u''_A$) is what evaluates outcomes.

This clearly states that the penalty is part of the utility function that the agent will "really want".

comment by Donald Hobson (donald-hobson) · 2018-09-18T23:54:45.339Z · LW(p) · GW(p)

Suppose there are only three computable goals, maximizing A, B, and R. R is the AUP agent's real goal, and it has a very high computational complexity, so increasing its ability to do R is not significantly penalized. The agent has free control over A, but no control over B. The agent wants to maximize R, while retaining the ability to maximize A, and avoiding the ability to affect B. To do this, it could construct another agent H, which is a very powerful R maximizer, but would switch to maximizing A if AUP asked nicely.

Think of AUP in a box, with a pencil. H is a world conquering nanotech super-intelligence. If AUP writes down "make paperclips", H will calculate how many paperclips AUP could have made in the hypothetical world where H didn't exist. It will then make that many paperclips. So AUP has exactly the same ability to make paperclips as it would have if H didn't exist. So building H neither increases nor decreases AUP's ability to create paperclips. If the same setup could also apply to all other simple goals, then building H is a low impact action.

This work is clearly an improvement in impact measure, but it still has a failure mode.

EDIT: It seems that you already spotted this failure mode

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-19T00:18:15.776Z · LW(p) · GW(p)

First, we use uniform weighting in the bounded case (so each utility would have one-third weight). This means that changes in attainable R are indeed significant.

I don’t think I understand the rest of your first paragraph, would you be willing to rephrase if this comment doesn’t answer your objection?

In any case, building a powerful maximizer H which complies with the agent’s demands indeed drastically increases all three attainable utilities. One helpful way of thinking about it is to imagine that each penalty utility is an "alien agent" which takes over at the given time and tries to maximize that goal. This "alien agent" could use the maximizer (or, for penalties evaluated at earlier steps in the plan, redesign the incomplete maximizer) to accomplish its goal to a much greater extent. In general, building powerful anythings is instrumentally convergent, and seemingly penalized at numerous junctures along the way.

Even if this weren’t true, how would this proposed workaround pass intent verification?

Edit: Leaving up in case this is useful for anyone else.

Replies from: donald-hobson

↑ comment by Donald Hobson (donald-hobson) · 2018-09-20T20:37:52.827Z · LW(p) · GW(p)

In case this wasn't clear, the powerful agent H isn't just following instructions, it's actively trying to minimize the effect of its existence on the impact measure. Agent H is very powerful, and it sometimes delegates to AUP in such a way that AUP is exactly as powerful as it would be without H. If AUP had no way of making widgets on its own, H won't make widgets for it. The hypothetical Alien Agent that takes over AUP finds that it can't have much impact, because H isn't listening to AUP very much.

AUP starts off not being able to blow up the moon, and blowing up the moon would have a large effect on many utility functions, so this impact measure stops AUP from building a giant moon bomb. However, AUP can build H, and H can build giant moon bombs, so long as H keeps the red button away from AUP. AUP is still unable to blow up the moon, yet the moon bombs exist. AUP will not seek power, but will build power-seeking agents, if those agents won't share the power.

AUP can be more accurately described as a power conserving agent, it doesn't want more power, or less.

Replies from: TurnTrout

↑ comment by TurnTrout · 2018-09-20T21:14:53.479Z · LW(p) · GW(p)

This wouldn’t pass intent verification. Also, I dispute that: building this super powerful agent is not instrumentally convergent along the way, would not either be subverted by (or subvert) the alien agent (either is a large impact), wouldn’t cause approval penalties... Also, it’s inefficient indirection on AUP’s part (since building this requires extra resources, but doesn’t get any more accomplished than if the agent just did the work itself).
