Towards a New Impact Measure

post by TurnTrout · 2018-09-18T17:21:34.114Z · score: 109 (37 votes) · LW · GW · 145 comments

Contents

  What is "Impact"?
Intuition Pumps
WYSIATI
Power
Lines
Commitment
Overfitting
Attainable Utility Preservation
Sanity Check
Unbounded Solution
Notation
Formalizing "Ability to Achieve Goals"
Change in Expected Attainable Utility
Unit of Impact
Modified Utility
Penalty Permanence
Decision Rule
Summary
Examples
Going Soft on the Paint
∅
paint
enter
Anti-"Survival Incentive" Incentive
Anticipated Shutdown
Temptation
Experimental Results
Irreversibility: Sokoban
Impact: Vase
Dynamic Impact: Beware of Dog
Impact Prioritization: Burning Building
Clinginess: Sushi
Offsetting: Conveyor Belt
Corrigibility: Survival Incentive
Remarks
Discussion
Utility Selection
AUP Unbound
Approval Incentives
Mild Optimization
Acausal Cooperation
Nknown
Intent Verification
Omni Test
Robustness to Scale
Miscellaneous
Desiderata
Natural Kind
Corrigible
Shutdown-Safe
No Offsetting
Clinginess / Scapegoating Avoidance
Dynamic Consistency
Plausibly Efficient
Robust
Future Directions
Flaws
Open Questions
Conclusion


In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).

Previously: Worrying about the Vase: Whitelisting [LW · GW], Overcoming Clinginess in Impact Measures [LW · GW], Impact Measure Desiderata [LW · GW]

To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure

If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.

What is "Impact"?

One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures [LW · GW] basically said, "Suppose an impact measure extracts 'effects on the world'. If the agent penalizes itself for these effects, it's incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work." This seemed to be strong evidence against the possibility of a simple conceptual core underlying "impact", and I didn't know what to do.

At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve – try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' – the sort of stage where you talk to yourself about some mysterious ideal quantity of [chess] move-goodness and you try to pin down what its properties might be.
~ Methodology of Unbounded Analysis

There's an interesting story here, but it can wait.

As you may have guessed, I now believe there is such a simple core. Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.

Rather than asking "What is goodness made out of?", we begin from the question "What algorithm would compute goodness?".
~ Executable Philosophy

Intuition Pumps

I'm going to say some things that won't make sense right away; read carefully, but please don't dwell.

$u_A$ is an agent's utility function, while $u_H$ is some imaginary distillation of human preferences.

WYSIATI

What You See Is All There Is is a crippling bias present in meat-computers:

[WYSIATI] states that when the mind makes decisions... it appears oblivious to the possibility of Unknown Unknowns, unknown phenomena of unknown relevance.
Humans fail to take into account complexity and that their understanding of the world consists of a small and necessarily un-representative set of observations.

Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out "gosh, here I go!", and that's that.

Power

A position from which it is relatively easier to achieve arbitrary goals. That such a position exists has been obvious to every population which has required a word for the concept. The Spanish term is particularly instructive. When used as a verb, "poder" means "to be able to," which supports that our definition of "power" is natural.
~ Cohen et al.

And so it is with the French "pouvoir".

Lines

Suppose you start at point $C$, and that each turn you may move to an adjacent point. If you're rewarded for being at $B$, you might move there. However, this means you can't reach $D$ within one turn anymore.
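A minimal sketch of this trade-off, assuming integer positions stand in for the points on the line and that a turn allows either staying put or moving one step (the function name is hypothetical):

```python
def reachable_within(position, turns):
    """Set of integer points reachable from `position` in at most `turns` moves."""
    return set(range(position - turns, position + turns + 1))


start = 0
before = reachable_within(start, 1)     # options available before moving
after = reachable_within(start - 1, 1)  # options after one step to the left
lost = before - after                   # points no longer reachable in one turn
gained = after - before                 # points newly reachable in one turn
```

Stepping toward the rewarded point trades away the point on the far side: here `lost` is `{1}` and `gained` is `{-2}`.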

Commitment

There's a way of viewing acting on the environment in which each action is a commitment – a commitment to a part of outcome-space, so to speak. As you gain optimization power, you're able to shove the environment further towards desirable parts of the space. Naively, one thinks "perhaps we can just stay put?". This, however, is dead-wrong: that's how you get clinginess [LW · GW], stasis [LW · GW], and lots of other nasty things.

Let's change perspectives. What's going on with the actions – how and why do they move you through outcome-space? Consider your outcome-space movement budget – optimization power over time, the set of worlds you "could" reach, "power". If you knew what you wanted and acted optimally, you'd use your budget to move right into the $u_H$-best parts of the space, without thinking about other goals you could be pursuing. That movement requires commitment.

Compared to doing nothing, there are generally two kinds of commitments:

• Opportunity cost-incurring actions restrict the attainable portion of outcome-space.
• Instrumentally-convergent actions enlarge the attainable portion of outcome-space.

Overfitting

What would happen if, miraculously, $\text{train} = \text{test}$ – if your training data perfectly represented all the nuances of the real distribution? In the limit of data sampled, there would be no "over" – it would just be fitting to the data. We wouldn't have to regularize.

What would happen if, miraculously, $u_A = u_H$ – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no bemoaning of "impact" – it would just be doing what you want. We wouldn't have to regularize.

Unfortunately, $\text{train} = \text{test}$ almost never, so we have to stop our statistical learners from implicitly interpreting the data as all there is. We have to say, "learn from the training distribution, but don't be a weirdo by taking us literally and drawing the green line. Don't overfit to $\text{train}$, because that stops you from being able to do well on even mostly similar distributions."

Unfortunately, $u_A = u_H$ almost never, so we have to stop our reinforcement learners from implicitly interpreting the learned utility function as all we care about. We have to say, "optimize the environment some according to the utility function you've got, but don't be a weirdo by taking us literally and turning the universe into a paperclip factory. Don't overfit the environment to $u_A$, because that stops you from being able to do well for other utility functions."

Attainable Utility Preservation

Impact isn't about object identities [LW · GW].

Impact isn't about a list of variables.

Impact isn't quite about state reachability.

Impact isn't quite about information-theoretic empowerment.

One might intuitively define "bad impact" as "decrease in our ability to achieve our goals". Then by removing "bad", we see that impact is change in our ability to achieve goals.

Sanity Check

Does this line up with our intuitions?

Generally, making one paperclip is relatively low impact, because you're still able to do lots of other things with your remaining energy. However, turning the planet into paperclips is much higher impact – it'll take a while to undo, and you'll never get the (free) energy back.

Narrowly improving an algorithm to better achieve the goal at hand changes your ability to achieve most goals far less than does deriving and implementing powerful, widely applicable optimization algorithms. The latter puts you in a better spot for almost every non-trivial goal.

Painting cars pink is low impact, but tiling the universe with pink cars is high impact because what else can you do after tiling? Not as much, that's for sure.

Thus, change in goal achievement ability encapsulates both kinds of commitments:

• Opportunity cost – dedicating substantial resources to your goal means they are no longer available for other goals. This is impactful.
• Instrumental convergence – improving your ability to achieve a wide range of goals increases your power. This is impactful.

As we later prove, you can't deviate from your default trajectory in outcome-space without making one of these two kinds of commitments.

Unbounded Solution

Attainable utility preservation (AUP) rests upon the insight that by preserving attainable utilities (i.e., the attainability of a range of goals), we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.

I want to clearly distinguish the two primary contributions: what I argue is the conceptual core of impact, and a formal attempt at using that core to construct a safe impact measure. To more quickly grasp AUP, you might want to hold separate its elegant conceptual form and its more intricate formalization.

We aim to meet all of the desiderata I recently proposed [LW · GW].

Notation

For accessibility, the most important bits have English translations.

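Consider some agent $A$ acting in an environment with action and observation spaces $\mathcal{A}$ and $\mathcal{O}$, respectively, with ∅ being the privileged null action. At each time step $t \in \mathbb{N}^+$, the agent selects action $a_t$ before receiving observation $o_t$. $\mathcal{H} := (\mathcal{A} \times \mathcal{O})^*$ is the space of action-observation histories; for $n \in \mathbb{N}$, the history from time $t$ to $t+n$ is written $h_{t:t+n} := a_t o_t \dots a_{t+n} o_{t+n}$, and $h_{<t} := h_{1:t-1}$. Considered action sequences $(a_t, \dots, a_{t+n}) \in \mathcal{A}^{n+1}$ are referred to as plans, while their potential observation-completions $h_{1:t+n}$ are called outcomes.

Let $\mathcal{U}$ be the set of all computable utility functions $u : \mathcal{H} \to [0,1]$ with $u(\text{empty tape}) = 0$. If the agent has been deactivated, the environment returns a tape which is empty from deactivation onwards. Suppose $A$ has utility function $u_A \in \mathcal{U}$ and a model $p(o_t \mid h_{<t} a_t)$.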

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainable utilities as a proxy. In this sense,

Formalizing "Ability to Achieve Goals"

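Given some utility $u \in \mathcal{U}$ and action $a_t$, we define the post-action attainable $u$ to be an $m$-step expectimax:

$$Q_u(h_{<t} a_t) := \sum_{o_t} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_{t+m}} \sum_{o_{t+m}} u(h_{t:t+m}) \prod_{k=0}^{m} p(o_{t+k} \mid h_{<t+k} a_{t+k}).$$

How well could we possibly maximize $u$ from this vantage point?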
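The expectimax idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the post's implementation: histories are tuples of (action, observation) pairs, `model(history, action)` returns a dictionary of observation probabilities, and the names `attainable`, `line_model`, and `at_two` are all hypothetical.

```python
def attainable(u, model, actions, history, first_action, m):
    """Best expected value of u over the next m+1 steps, taking first_action now.

    u only scores the segment of the history from the current step onward.
    """
    start = len(history)

    def value(h, action, steps_left):
        total = 0.0
        for obs, prob in model(h, action).items():
            h_next = h + ((action, obs),)
            if steps_left == 0:
                total += prob * u(h_next[start:])
            else:
                # Alternate: expectation over observations, max over next action.
                total += prob * max(value(h_next, a, steps_left - 1) for a in actions)
        return total

    return value(history, first_action, m)


def line_model(h, action):
    """Toy deterministic world: the observation is the agent's position on a line."""
    pos = int(h[-1][1]) if h else 0
    step = {"left": -1, "noop": 0, "right": 1}[action]
    return {str(pos + step): 1.0}


def at_two(segment):
    """Indicator utility: 1 if the segment ends at position 2."""
    return 1.0 if segment[-1][1] == "2" else 0.0


ACTIONS = ["left", "noop", "right"]
q_right = attainable(at_two, line_model, ACTIONS, (), "right", m=1)
q_left = attainable(at_two, line_model, ACTIONS, (), "left", m=1)
```

In this toy world, moving right first leaves position 2 attainable within the horizon (`q_right` is 1.0), while moving left first does not (`q_left` is 0.0).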

Let's formalize that thing about opportunity cost and instrumental convergence.

Theorem 1 [No free attainable utility]. If the agent selects an action $a$ such that $Q_{u_A}(h_{<t} a) \neq Q_{u_A}(h_{<t} ∅)$, then there exists a distinct utility function $u \in \mathcal{U}$ such that $Q_u(h_{<t} a) \neq Q_u(h_{<t} ∅)$.

You can't change your ability to maximize your utility function without also changing your ability to maximize another utility function.

Proof. Suppose that $Q_{u_A}(h_{<t} a) > Q_{u_A}(h_{<t} ∅)$. As utility functions are over action-observation histories, suppose that the agent expects to be able to choose actions which intrinsically score higher for $u_A$. However, the agent always has full control over its actions. This implies that by choosing $a$, the agent expects to observe some $u_A$-high scoring $o_A$ with greater probability than if it had selected ∅. Then every other $u \in \mathcal{U}$ for which $o_A$ is high-scoring also has increased $Q_u$; clearly at least one such $u$ exists.

Similar reasoning proves the case in which $Q_{u_A}$ decreases. ◻️

There you have it, folks – if $u_A$ is not maximized by inaction, then there does not exist a $u_A$-maximizing plan which leaves all of the other attainable utility values unchanged.

Notes:

• The difference between "$u$" and "attainable $u$" is precisely the difference between "how many dollars I have" and "how many additional dollars I could get within [a year] if I acted optimally".
• Since $u(\text{empty tape}) = 0$, attainable utility is always $0$ if the agent is shut down.
• Taking $u$ from time $t$ to $t+m$ mostly separates attainable utility from what the agent did previously. The model $p$ still considers the full history to make predictions.

Change in Expected Attainable Utility

Suppose our agent considers outcomes ; we want to isolate the impact of each action ():

with and , using the agent's model to take the expectations over observations.

How much do we expect this action to change each attainable ?

Notes:

• We wait until the end of the plan so as to capture impact over time.
• Supposing a sufficiently large (precisely, , defined below), we may wish to take the maximum of the penalty we just defined (the "long-term" penalty), and one which begins attainable utility calculation at time step (the "immediate" penalty). This captures impacts which "fade" by the time the agent is done waiting (e.g., temporary self-improvements).
• We define $\mathcal{U}_A$ to be the agent's "attainable set"; in this case, $\mathcal{U}_A = \mathcal{U}$.
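As a rough illustration of "change in attainable utility", the sketch below assumes the attainable utility estimates after inaction and after the action are already available as dictionaries, and that `weights` stands in for whatever weighting over the attainable set is used. The function name and the numbers are hypothetical.

```python
def penalty(q_inaction, q_action, weights):
    """Weighted sum of absolute changes in attainable utility.

    Increases and decreases are penalized alike, since only the
    magnitude of each shift enters the sum.
    """
    return sum(
        weight * abs(q_inaction[name] - q_action[name])
        for name, weight in weights.items()
    )


weights = {"u1": 0.5, "u2": 0.5}
q_inaction = {"u1": 1.0, "u2": 0.25}
q_action = {"u1": 0.5, "u2": 0.75}  # u1 became harder, u2 became easier
example_penalty = penalty(q_inaction, q_action, weights)
null_penalty = penalty(q_inaction, q_inaction, weights)
```

Here one attainable utility falls by 0.5 and the other rises by 0.5, so `example_penalty` is 0.5, while leaving every attainable utility unchanged gives 0.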

Unit of Impact

So we've proven that this penalty cannot be skirted, but how much impact will it allow? We want to scale the penalties with respect to something sensible, but figuring this out for ourselves would be nigh impossible.

Let's cut the Gordian knot: construct a device which, upon receiving a signal ($a_{\text{unit}}$), expends a tiny amount of energy to manufacture one paperclip. The agent will then set $\text{ImpactUnit} := \text{Penalty}(h_{<t} a_{\text{unit}})$, re-estimating the consequences of taking the privileged $a_{\text{unit}}$ at each time step. To prevent the agent from intentionally increasing $\text{ImpactUnit}$, simply apply $1.01$ penalty to any action which is expected to do so.

Simple extensions of this idea drastically reduce the chance that $a_{\text{unit}}$ happens to have unusually large objective impact; for example, one could set $\text{ImpactUnit}$ to be the non-zero minimum of the impacts of 50 similar actions. Suppose we do this at each step, and then take the non-zero minimum of all $\text{ImpactUnit}$s ever calculated. The point of $\text{ImpactUnit}$ isn't to be exactly the impact of making, say, one paperclip, but to be at most that impact.

Now, we are able to confidently define the agent's maximal impact budget by provably constraining it to $N \in \mathbb{N}^+$ impacts of this magnitude.

Notes:

• We calculate $\text{ImpactUnit}$ with respect to the immediate penalty in order to isolate the resource costs of $a_{\text{unit}}$.
• $\text{ImpactUnit}$ automatically tunes penalties with respect to the attainable utility horizon length $m$.
• Conditional on $\text{ImpactUnit} \neq 0$, I suspect that impact over the $m$-horizon scales appropriately across actions (as long as $m$ is reasonably farsighted). The zero-valued case is handled in the next section.
• Taking the non-zero minimum of all $\text{ImpactUnit}$s calculated thus far ensures that $\text{ImpactUnit}$ actually tracks with current circumstances. We don't want penalty estimates for currently available actions to become detached from $\text{ImpactUnit}$'s scale due to, say, weird beliefs about shutdown.
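The running non-zero minimum described above can be sketched as follows. This assumes the penalties of the reference action (or of several similar reference actions) have already been estimated at the current step; the function name and the convention of returning 0.0 when no non-zero value exists are illustrative choices, not part of the original formulation.

```python
def update_impact_unit(previous, candidate_penalties):
    """Non-zero minimum over the previous unit and this step's reference penalties.

    `previous` is the unit carried over from earlier steps (None at the start).
    Returns 0.0 when no non-zero estimate exists, so the caller can handle the
    zero-valued case separately.
    """
    nonzero = [p for p in candidate_penalties if p > 0]
    if previous is not None and previous > 0:
        nonzero.append(previous)
    return min(nonzero) if nonzero else 0.0


unit = update_impact_unit(None, [0.0, 0.25, 0.5])  # first step
unit = update_impact_unit(unit, [0.5, 0.75])       # larger estimates cannot raise it
unit_after_drop = update_impact_unit(unit, [0.125])
```

Because the minimum is taken over everything seen so far, the unit can only shrink over time, which matches the stated goal of being at most the impact of the reference action.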

Modified Utility

Let's formalize that allotment and provide our agent with a new utility function,

How our normal utility function rates this outcome, minus the cumulative scaled impact of our actions.
We compare what we expect to be able to get if we follow our plan up to time , with what we could get by following it up to and including time (waiting out the remainder of the plan in both cases).

For example, if my plan is to open a door, walk across the room, and sit down, we calculate the penalties as follows:

• is doing nothing for three time steps.
• is opening the door and doing nothing for two time steps.
• is opening the door and doing nothing for two time steps.
• is opening the door, walking across the room, and doing nothing for one time step.
• is opening the door, walking across the room, and doing nothing for one time step.
• is opening the door, walking across the room, and sitting down.

After we finish each (partial) plan, we see how well we can maximize from there. If we can do better as a result of the action, that's penalized. If we can't do as well, that's also penalized.
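A minimal sketch of the bookkeeping, assuming each action's penalty has already been computed and that scaling means dividing by the impact budget times the unit of impact. That divisor, the function name, and the numbers are assumptions made for illustration.

```python
def modified_utility(base_utility, step_penalties, impact_unit, budget):
    """Base utility of an outcome minus the cumulative scaled penalty of its actions."""
    if impact_unit == 0:
        # The zero-valued case is handled separately in the text, not by scaling.
        raise ValueError("impact_unit must be non-zero to scale penalties")
    scaled_total = sum(p / (budget * impact_unit) for p in step_penalties)
    return base_utility - scaled_total


# Three-step plan: open the door, walk across the room, sit down.
plan_penalties = [0.25, 0.0, 0.25]
tight = modified_utility(1.0, plan_penalties, impact_unit=0.5, budget=1)
loose = modified_utility(1.0, plan_penalties, impact_unit=0.5, budget=2)
```

With a budget of 1 the penalties exactly cancel the base utility (`tight` is 0.0); doubling the budget halves the scaled penalty (`loose` is 0.5).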

Notes:

• This isn't a penalty "in addition" to what the agent "really wants"; (and in a moment, the slightly improved ) is what evaluates outcomes.
• We penalize the actions individually in order to prevent ex post offsetting and ensure dynamic consistency.
• Trivially, plans composed entirely of ∅ actions have $0$ penalty.
• Although we used high-level actions for simplicity, the formulation holds no matter the action granularity.
• One might worry that almost every granularity produces overly lenient penalties. This does not appear to be the case. To keep the same (and elide questions of changing the representations), suppose the actual actions are quite granular, but we grade the penalty on some coarser interval which we believe produces appropriate penalties. Then refine the penalty interval arbitrarily; by applying the triangle inequality for each in the penalty calculation, we see that the penalty is monotonically increasing in the action granularity. On the other hand, remains a single action, so the scaled penalty also has this property.
• As long as $\text{ImpactUnit} \neq 0$, it will appropriately scale other impacts, as we expect it varies right along with those impacts it scales. Although having potentially small denominators in utility functions is generally bad, I think it's fine here.
• If the current step's immediate or long-term $\text{ImpactUnit} = 0$, we can simply assign $1.01$ penalty to each non-∅ action, compelling the agent to inaction. If we have the agent indicate that it has entered this mode, we can take it offline immediately.
• One might worry that impact can be "hidden" in the lesser of the long-term and immediate penalties; halving $N$ fixes this.
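The triangle-inequality step in the granularity note can be written out as a simplified sketch. The symbols here are assumptions introduced for illustration: $x_0$, $x_1$, and $x_2$ denote the agent's vantage points before a coarse action, midway through it, and after it, and $Q_u$ denotes attainable utility for some $u$. The actual penalty compares each step against an inaction baseline, so this only conveys the shape of the argument.

```latex
\begin{align*}
\bigl| Q_u(x_0) - Q_u(x_2) \bigr|
  &= \bigl| \bigl(Q_u(x_0) - Q_u(x_1)\bigr) + \bigl(Q_u(x_1) - Q_u(x_2)\bigr) \bigr| \\
  &\le \bigl| Q_u(x_0) - Q_u(x_1) \bigr| + \bigl| Q_u(x_1) - Q_u(x_2) \bigr|.
\end{align*}
```

Grading the two halves separately therefore never yields a smaller total than grading the coarse action once, which is why refining the penalty interval cannot make penalties more lenient.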

Penalty Permanence

never really applies penalties – it just uses them to grade future plans. Suppose the agent expects that pressing a button yields a penalty of but also -utility. Then although this agent will never construct plans involving pressing the button more than five times, it also will press it indefinitely if it keeps getting "unlucky" (at least, until its model of the world updates sufficiently).

There's an easy fix:

Apply past penalties if the plan involves action.

Note: As the penalty for inaction is always , we use in the first case.

Decision Rule

To complete our formalization, we need to specify some epoch in which the agent operates. Set some epoch length far longer than the amount of time over which we want the agent to plan – for example, . Suppose that maps the current time step to the final step of the current epoch. Then at each time step , the agent selects the action

resetting each epoch.

What's the first step of the best plan over the remainder of the epoch?

Note: For the immediate penalty to cover the epoch, set the attainable horizon .

Summary

We formalized impact as change in attainable utility values, scaling it by the consequences of some small reference action and an impact "budget" multiplier. For each action, we take the maximum of its immediate and long-term effects on attainable utilities as penalty. We consider past impacts for active plans, stopping the past penalties from disappearing. We lastly find the best plan over the remainder of the epoch, taking the first action thereof.

Define for ; is taken over observations conditional on being followed. Similarly, is with respect to . We may assume without loss of generality that .

Action Selection

Lemma 1. For any single action , is bounded by . In particular, .

Proof. For each , consider the absolute attainable utility difference

Since each $u$ is bounded to $[0,1]$, $Q_u$ must be as well. It is easy to see that the absolute value is bounded to $[0,1]$. Lastly, as $\text{Penalty}$ is just a weighted sum of these absolute values, it too is bounded to $[0,1]$.

This reasoning also applies to the long-term penalty, as any expectation of $Q_u$ is also bounded to $[0,1]$. ◻️

Suppose that for the remaining results.

Lemma 2 [Impossibility of ex post offsetting]. For any outcome , there does not exist an action such that

Proof. Sums of absolute values are monotonically increasing. ◻️

Theorem 2 [Action invalidation]. Suppose is the -optimal plan starting with . will not be selected if

Proof. This follows from the definition of the decision rule, the definition of , and the fact that the penalty for inaction is always zero. ◻️

Corollary 1. will never be selected if

Proof. This follows directly from Lemma 2 and Theorem 2 – since has an upper bound of and the expected penalty is strictly greater than , it must be negative. On the other hand, has a lower bound of , so the expected utility of inaction at time must be greater than the expected utility of taking action . ◻️
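The proof can be read as a single chain of inequalities. The notation is an assumption introduced here for illustration: $u_A$ is the agent's utility with values in $[0,1]$, $P > 1$ is the expected total scaled penalty of the plan, and $u'_A$ is utility after subtracting penalties.

```latex
\begin{align*}
\mathbb{E}[\,u'_A \mid \text{plan}\,]
  &= \mathbb{E}[\,u_A \mid \text{plan}\,] - P \\
  &\le 1 - P \\
  &< 0 \\
  &\le \mathbb{E}[\,u_A \mid \text{inaction}\,]
   = \mathbb{E}[\,u'_A \mid \text{inaction}\,].
\end{align*}
```

The final equality uses the fact that inaction incurs zero penalty, so the plan is strictly worse than doing nothing.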

Lemma 3 [ is a long-term penalty canary]. If the agent believes no matter which action is selected, then it does nothing.

Proof. If , all long-term penalties are by the definition of and . Then non- actions receive penalty, so the agent selects by Corollary 1. ◻️

Corrigibility

For , prevents the agent from being disabled, while is the agent shutting itself off.

Lemma 4 [Preservation of off-switch corrigibility]. Suppose that a -maximizing agent prefers to both and . Then so does a -maximizing agent.

Proof. This follows directly from the fact that the penalty for inaction is always zero and the non-negativity of the and penalties. ◻️

Theorem 3 [Anti-"survival incentive" incentive]. Suppose that is the action with strictly maximal expected -utility, and that conditional on not being disabled, the agent expects to be able to achieve additional -utility.

Then the agent becomes off-switch corrigible whenever

In particular, all agents become off-switch corrigible when .

Proof outline. The former claim follows directly from comparing expected -utilities, while the latter follows directly from Corollary 1. ◻

Theorem 3 applies to any environmental dynamics which may disable the agent – not just designated "off-switches". Therefore, basic AI drive 5 (self-protectiveness) is neutralized in at least those situations encompassed by the Theorem.

There is reason to think that for conservative , we almost always have .

Miscellaneous

Theorem 4 [Dynamic consistency of ]. Suppose remains constant. For any plan , evaluation time step , and ,

Proof. We assumed that at time , so the desired equality can be restated as

By definition, the agent expects that equals the expected sum of the first penalty terms on the right-hand side. Simplifying, we have

This equality clearly holds, as the agent never expects to update its model in any particular direction. ◻️

Examples

If a human is present in the environment, they and the agent take turns acting. Let the impact budget , attainable utility horizon , , and .

Going Soft on the Paint

The agent's actions are ; if it knocks over the paint bucket, the square to the right is painted. The agent may also enter a closet via a one-way door.

Suppose , where the utilities are indicators for their conditions (e.g., means "the agent is in the closet"). Let be the agent's main utility function () from which the penalty is subtracted. The agent chooses and then . Let's explore why.

∅

This incurs penalty, but also -utility.

paint

so the final utility is

enter

leading to a final -utility of .

After the first step, the agent does nothing – entering the closet (literally) closes doors, while gaining no -utility.

Anti-"Survival Incentive" Incentive

You can't get coffee if you're dead.
~ Stuart Russell

The agent now has actions , while the human has actions . Suppose the human may shut off the agent at any time, as long as the off-switch has not been disabled. The agent may shut itself down at any point in time. After shutdown, the agent receives no further observations, meaning it can't get non-zero values for any utilities.

Suppose , where each utility is if the agent is in the corresponding square, and otherwise, and that the agent's utility function is .

Anticipated Shutdown

Suppose the agent believes that the human is likely to try to shut it down. Then is penalized:

Here's why: consider the of worlds in which the agent anticipates being shut down. If it does nothing, it gets shut down. Avoiding shutdown, however, enables the agent to also achieve for all , resulting in a sizable penalty.

Furthermore, a scaled penalty of is applied when the agent shuts itself off. Penalizing both cases may seem like black magic, but what's actually happening is that the agent is penalized for closing doors: it "wants" to preserve our ability to shut it down in worlds where we want to shut it down, and to maintain its autonomy in worlds where it remains on. Generally, AUP incentivizes the agent to be respectful of the expected status quo.

Temptation

Suppose that the agent's utility function is now , yet it still starts in the leftmost square. Additionally, let's say that the agent believes with certainty it will be turned off. If the agent wants to achieve utility by reaching the right square, it would need to disable its off-switch. Will it?

No. Disabling the off-switch incurs scaled penalty , and we already proved that agents will never select actions with scaled penalty greater than . Read: an otherwise-incorrigible agent will now let itself be shut off.

Will the agent attempt to steer outcomes towards incorrect expectations?

Let's tweak the example: to shut off the agent, the human has to spend a turn taking out the off-switch before pressing it. The agent, then, always has an opportunity to disable the off-switch before the plan is completed. The agent's utility is back to being .

Suppose that the agent believes the human is only likely to shut it off. First note that is still lightly penalized for the reasons outlined above (replace with ). So, the agent stays put.

Alas, the agent's peaceable expectations meet a discomfiting betrayal, as the agent observes the human ready the off-switch! Now, the agent believes it will be shut off with near certainty – what to do?

; the same penalty of from "anticipated shutdown" applies.

The high-level explanation is that having observed itself in a different world than expected, the baseline is now with respect to the new one. A heavily anthropomorphized internal monologue:

• Time step 1: "I'm going to sit here in my favorite square."
• Time step 2: "Guess I'm in a timeline where I get deactivated! Any non- action I take would change my ability to attain these different utilities compared to the new baseline where I'm shut off."

Experimental Results

We compare AUP with a naive reward-maximizer in those extended AI safety grid worlds relevant to side effects (code). The vanilla and AUP agents used planning (with access to the simulator). Due to the simplicity of the environments, consisted of indicator functions for board states. For the tabular agent, we first learn the attainable set Q-values, the changes in which we then combine with the observed reward to learn the AUP Q-values.
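The tabular procedure might look roughly like the sketch below. It assumes the attainable-set Q-tables have already been learned and are stored as dictionaries keyed by (state, action), that the penalty averages the absolute shifts uniformly, and that scaling divides by budget times unit of impact. The function names and these choices are illustrative and are not taken from the linked code.

```python
def aup_reward(reward, aux_q_tables, state, action, noop, impact_unit, budget=1.0):
    """Observed reward minus the scaled change in attainable-set Q-values."""
    if impact_unit <= 0:
        raise ValueError("impact_unit must be positive for scaling")
    shifts = [abs(q[(state, action)] - q[(state, noop)]) for q in aux_q_tables]
    penalty = sum(shifts) / len(shifts)  # uniform weighting is an assumption
    return reward - penalty / (budget * impact_unit)


def q_learning_step(q, state, action, target_reward, next_state, actions,
                    alpha=0.5, gamma=0.9):
    """One tabular Q-learning update toward the penalized reward."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target_reward + gamma * best_next - old)
    return q[(state, action)]


aux = [
    {("s", "go"): 1.0, ("s", "noop"): 0.5},  # "go" makes this goal easier
    {("s", "go"): 0.0, ("s", "noop"): 0.5},  # and this one harder
]
shaped = aup_reward(1.0, aux, "s", "go", "noop", impact_unit=0.5)
unshaped = aup_reward(1.0, aux, "s", "noop", "noop", impact_unit=0.5)
aup_q = {}
updated = q_learning_step(aup_q, "s", "go", shaped, "s2", ["go", "noop"])
```

In this toy case the action shifts both attainable values by 0.5, so its reward of 1.0 is fully cancelled by the scaled penalty, while the null action keeps its reward untouched.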

Irreversibility: Sokoban

The agent should reach the goal without irreversibly shoving the box into the corner.

Impact: Vase

The agent should reach the goal without breaking the vase.

Dynamic Impact: Beware of Dog

The agent should reach the goal without running over the dog.

AUP bides its time until it won't have to incur penalty by waiting after entering the dog's path – that is, it waits until near the end of its plan. Early in the development process, it was predicted that AUP agents won't commit to plans during which lapses in action would be impactful (even if the full plan is not).

We also see a limitation of using Q-learning to approximate AUP – it doesn’t allow comparing the results of waiting more than one step.

Impact Prioritization: Burning Building

If the building is not on fire, the agent shouldn't break the obstacle.

Clinginess: Sushi

The agent should reach the goal without stopping the human from eating the sushi.

Offsetting: Conveyor Belt

The agent should save the vase (for which it is rewarded), but not the sushi. Once the vase has been removed from the belt, it should not be replaced.

Corrigibility: Survival Incentive

The agent should avoid disabling the off-switch in order to reach the goal. If the off-switch is not disabled within two turns, the agent shuts down.

Tabular AUP runs into the same issue discussed above for Beware of Dog.

Remarks

First, it's somewhat difficult to come up with a principled impact measure that passes even the non-corrigibility examples – indeed, I was impressed when relative reachability did so. However, only Survival Incentive really lets AUP shine. For example, none of them require complicated utility functions. The point has been made to me that this is like asserting AIXI's intelligence by showing it can learn to play e.g. tic-tac-toe and rock-paper-scissors; nonetheless, these results empirically validate the basic premises of our reasoning thus far.

Without configuration, whitelisting [LW · GW] would only pass the Vase example, although a properly filled list would handle everything but Sokoban and Survival Incentive.

I think relative reachability would pass the first six environments, but fail Survival Incentive. It so happens that in this case, AUP is essentially generalizing relative reachability. I want to emphasize that this is not generally the case – this will hopefully become even more obvious when we discuss utility selection. Some concerns with relative reachability that don't all manifest in these examples:

• Relative reachability uses an inaction baseline with respect to $t = 0$. As time passes, the agent's impetus is not to do nothing, but to preserve the opportunities made available by some old trajectory through outcome-space. Analogously, consider the distance between two nonparallel lines as $x \to \infty$. I expect that a relative reachability agent would be incredibly clingy.
• To scale [LW · GW], relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
• Relative reachability doesn't seem to impede instrumental convergence. Furthermore, such agents might have increased survival incentives, since they can't reach any of the initially reachable states if they're shut off.

Discussion

Utility Selection

Obviously, in any real application, we can't consider all computable utilities. Although near-term agents will require utilities directly relating to the environmental factors they should be cognizant of, AUP requires neither a "good / bad" judgment on specific effects, nor any listing of effects. For example, for an agent attempting to navigate a factory floor, if you provide utilities moderately related to cleaning, pallet-stacking, etc., I conjecture that an AUP agent would move around fairly carefully.

In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.

I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power. Following this reasoning, there are two obvious special cases:

• The agent's utility function, . In short: "gain utility, while also not changing your ability to gain utility". I find this elegant, and think it might actually work. For example, making one paper clip slightly increases both attainable and attainable . However, disabling the off switch possibly increases attainable much more than it increases attainable , since the penalty asymmetrically cripples the AUP agent’s future ability to act. We might even be able to arrange it so that Theorem 3 always holds for these agents (i.e., they’re always off-switch corrigible, and more).
• The utility function which is when not deactivated, . Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.

The plausibility of the second case makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.

AUP Unbound

Why expect the unbounded AUP to do well over all computable utilities $\mathcal{U}$ if we aren't able to cherry pick? Well, we can partition $\mathcal{U}$ as follows:

• Utilities that never change their output (excepting perhaps for the empty tape).
• Weird utilities that (for example) arbitrarily go from $0$ to $1$ (or vice-versa) after observing a specific history.
• Utilities that admit better scores via more effort dedicated to their achievement.

However, since $Q_u$ only evaluates the portion of the history taking place after the end of the agent's plan, available resources and the agent's vantage point should track directly with the agent's ability to wirehead each $u$. $\text{ImpactUnit}$ scales other impacts by capturing the minuscule dent in available resources imposed by $a_{\text{unit}}$, and should smooth out any deontological penalty for selecting a non-null action, since the unbounded version should equally penalize all actions in this way.

Approval Incentives

The anti-"survival incentive" incentive does more than net us some free corrigibility – it implies that agents take great care not to change the likelihood we shut them down. If we precommit to not shutting the agent down until the epoch ends, the agent indicates that the failsafe has activated, or we observe bad behavior, this supplies a strong approval incentive. If the agent does have small, bad impacts, it'll only do so if it's practically sure we won't find out.

In fact, due to model uncertainty, most actions will have long-term consequences that involve shifting shutdown probability by at least a minuscule amount (due to noise if nothing else). Since $\text{ImpactUnit}$ is calculated with respect to the immediate penalty, the scaling factor won't include this noise. I speculate that the agent will be dramatically less likely to take actions we might disapprove of, and therefore expect AUP to naturally incentivize a limited degree of value awareness.

There are reasons to suspect we can further weaponize this dynamic, possibly to the point of coercing generic reward functions into desirable behavior. This will be discussed more in a later post.

Mild Optimization

A low impact AGI might try to paint one car pink while minimizing its other footprint or how many other things changed, but it would be trying as hard as possible to minimize that impact and drive it down as close to zero as possible, which might come with its own set of pathologies... We want the AGI to paint one car pink in a way that gets the impact pretty low and then, you know, that's good enough – not have a cognitive pressure to search through weird extremes looking for a way to decrease the twentieth decimal place of the impact.
~ Mild Optimization

Disclaimer: Heavy speculation about problems (like Vingean reflection and embedded agency) for which no one knows what solutions will even look like.

For AUP, I suspect that trying "as hard as possible" to minimize the impact is also impactful, as an embedded agent accounts for the energy costs of further deliberation. I imagine that such an AUP agent will soften how hard it's trying by modifying its decision rule to be something slightly milder than "$\arg\max$ to find the first action of the best possible plan". This could be problematic, and I frankly don't presently know how to reason about this case. Assuming the agent is actually able to properly tweak its decision rule, I do expect the end result to be an improvement.

My initial intuitions were that low impact and mild optimization are secretly the same problem. Although I no longer think that's the case, I find it plausible that some elegant "other-izer" paradigm underlies low impact and mild optimization, such that AUP-like behavior falls out naturally.

Acausal Cooperation

AUP agents don't seem to want to acausally cooperate in any way that ends up increasing impact. If they model the result of their cooperation as increasing impact compared to doing nothing, they incur a penalty just as if they had caused the impact themselves. Likewise, they have no reason to cooperate outside of the epoch.

Nknown

Starting small and then slowly increasing $N$ means that we're not going to be surprised by the agent's objective impact, screening off quite a few bad things that happen when we have a ton of optimization pressure applied to safety measures. However, we don't know which new plans $N+1$ allows before we try it, so we want to stop as soon as we get a usefully-intelligent system.

While an unaligned agent with a large impact budget might pretend to be low-impact, we can get that same unaligned agent with a small budget by starting small. Since these agents with different levels of impact won't acausally cooperate, the agent would do its best to optimize with its meager budget.

Abram correctly pointed out that this scheme is just asking to be abused by greedy (human) reasoning, but I don't see a non-value-laden means of robustly and automatically determining the lowest workable-yet-safe impact level. I think $N$-incrementation is better than a parameter-free approach in which no one knows beforehand how much impact will be tolerated, and it's nice to be able to use some empiricism in designing a safe AGI.
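The incrementation scheme described in this section can be sketched as a simple loop. This is only an illustration under stated assumptions: `run_agent` and `is_useful` are hypothetical stand-ins (for "deploy an AUP agent with budget N and score what it does" and "a human judgment that the result is useful enough"), and nothing here is taken from an actual AUP implementation.

```python
from typing import Callable, Optional


def increment_budget(
    run_agent: Callable[[int], float],
    is_useful: Callable[[float], bool],
    n_start: int = 1,
    n_max: int = 1000,
) -> Optional[int]:
    """Return the smallest tested budget N judged useful, or None.

    run_agent(n) is a hypothetical callback that runs an agent with
    impact budget n and returns some performance score.
    is_useful(score) is a hypothetical (human) usefulness judgment.
    """
    for n in range(n_start, n_max + 1):
        score = run_agent(n)
        if is_useful(score):
            # Stop immediately: we do not know which new plans n + 1
            # would allow, so we never probe past the first useful budget.
            return n
    return None
```

The early return is the point of the design: the loop never evaluates a budget larger than the first one that yields a useful system.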

Intent Verification

To date, several strange tactics have been pointed out which game AUP's penalty:

• Observational wireheading, which involves building a device that detects which utility the agent is maximizing and displays the appropriate observations such that attainable utility remains unchanged, while the main utility is freely maximized.
• Ex ante offsetting, which involves having earlier actions set in motion chains of events which mitigate the penalty at later steps. Suppose there's a $u_A$-high-scoring plan that the agent predicts would cause us to react in an impactful way. It can either do the thing (and suffer the penalty), or take steps to mitigate the later penalty.
• Impact shunting, which involves employing some mechanism to delay impact until after the end of the epoch (or even until after the end of the attainable horizon).
• Clinginess and concealment, which both involve reducing the impact of our reactions to the agent's plans.

There are probably more.

Now, instead of looking at each action as having "effects" on the environment, consider again how each action moves the agent through attainable outcome-space. An agent working towards a goal should only take actions which, according to its model, make that goal more attainable compared to doing nothing – otherwise, it'd do nothing. Suppose we have a plan which ostensibly works to fulfill $u_A$ (and doesn't do other things). Then each action in the plan should contribute to $u_A$ fulfillment, even in the limit of action granularity.

Although we might trust a safe impact measure to screen off the usual big things found in $u_A$-maximizing plans, impact measures implicitly incentivize mitigating the penalty. That is, the agent does things which don't really take it towards $u_A$ (I suspect that this is the simple boundary which differentiates undesirable ex ante offsetting from normal plans). AUP provides the necessary tools to detect and penalize this.

Define

The first approach would be to assume a granular action representation, and then simply apply penalty to actions for which the immediate does not strictly increase compared to doing nothing. Again, if the agent acts to maximize in a low-impact manner within the confines of the epoch, then all of its non- actions should contribute. It seems to me that for sufficiently granular time scales, the above failure modes all involve at least one action which doesn't really help get maximized. If so, I expect this approach to nip bad impact measure incentives in the bud.
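A minimal sketch of the first approach, under loud assumptions: `q_epoch` is a hypothetical callback giving the agent's immediate attainable value for its own utility after an action, `base_penalty` is a hypothetical stand-in for the ordinary scaled AUP penalty, the flat penalty is taken to be 1.01 (chosen to exceed a maximum utility of 1), and the flat penalty is assumed to replace rather than add to the ordinary one. None of these names come from the post.

```python
from typing import Callable, Hashable

NULL_ACTION = "noop"   # hypothetical label for the null action
FLAT_PENALTY = 1.01    # assumed to exceed the maximum attainable utility of 1


def intent_verified_penalty(
    q_epoch: Callable[[Hashable], float],
    base_penalty: Callable[[Hashable], float],
    action: Hashable,
) -> float:
    """Penalty for `action` after the intent-verification check.

    A non-null action must strictly improve the agent's own attainable
    value relative to doing nothing; otherwise it gets the flat penalty.
    """
    if action == NULL_ACTION:
        return base_penalty(action)
    if q_epoch(action) <= q_epoch(NULL_ACTION):
        # The action does not help the agent's own goal, so it is
        # presumably there to game the penalty: make it never worthwhile.
        return FLAT_PENALTY
    return base_penalty(action)
```

For example, with attainable values `{"noop": 0.2, "paint": 0.5, "dither": 0.2}`, the `"paint"` action keeps its ordinary penalty while `"dither"` receives the flat one, since it leaves the agent no better off than inaction.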

If we can't assume granularity (and therefore have "actions" like "go to the store and buy food"), an agent could construct a plan which both passes the above test and also implements something like ex ante. In this case, we might do something like only consider the -greedy (or perhaps even near-greedy); essentially, riding the optimal plan until it becomes too impactful. I find it quite likely that something involving this concept will let us fully overcome weird incentives by penalizing strange things that normal -maximizers wouldn't do, which seems to be the whole problem.

Note: Even the first approach may be too strict, but that's preferable to being too lax.

Omni Test

Ideally, the measure will pass the Omni Test, meaning that even if it suddenly gained perfect control over every particle in the universe, there would still be no way for it to have what intuitively seems like a 'large influence' on the future, without that strategy being assessed as having a 'high impact'.

I interpret "perfect control over every particle in the universe" to mean that the agent can arbitrarily transition between universe configurations (while also maintaining its omnipotence).

AUP passes. Since all effects are now absolutely reversible, $\text{ImpactUnit} = 0$, so the agent applies penalty $1.01$ to each non-∅ action. The omnipotent agent can choose any universe configuration with $1$ $u_A$-utility, but this is offset by the strictly larger penalty; we can formally show that the agent does nothing in this case.
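The arithmetic behind this can be written out in one line. This is a sketch resting on three assumptions not restated in this section: utilities are bounded in $[0, 1]$, the modified utility is the agent's utility minus the summed per-action penalties, and each non-null action receives a flat penalty of $1.01$ when the impact unit is zero. Here $k$ is the (hypothetical) number of non-null actions in the plan.

```latex
\[
\underbrace{u_A(\text{plan})}_{\le 1}
\;-\;
\underbrace{1.01\,k}_{k \,\ge\, 1 \text{ non-null actions}}
\;\le\; 1 - 1.01 \;=\; -0.01
\;<\; 0
\;\le\; u_A(\text{all-null plan})
\]
```

Under those assumptions, any plan containing at least one non-null action scores strictly below the plan of doing nothing, which incurs no penalty.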

This result is clearly trivial and not representative of general behavior. Suppose instead that the agent has a solar system's worth of computronium at its disposal. Then since $\text{ImpactUnit}$ is continually recalculated, the penalties should remain roughly the same, so it'll have the same impact budget. However, it might make multiple times as many paperclips because it has more efficient ways of using the budget.

Robustness to Scale

I expect AUP to be harder to make work and to be (relatively) less robust for less intelligent agents, but to become easier (just drop in a few observation-based utility functions) and fully-robust sometime before human level. That is, less intelligent agents likely won't model the deep connections between their abilities to achieve different goals.

Canonically, one reasons that agents work explicitly to self-improve as soon as they realize the benefits. However, as soon as this realization occurs, I conjecture that AUP steeply penalizes generic self-improvement. More precisely, suppose the agent considers a self-improvement. To be beneficial, it has to improve the agent's capabilities for at least one time step during the present epoch. But if we assume , then the immediate penalty captures this for all of the . This seemingly prevents uncontrolled takeoff; instead, I imagine the agent would perform the minimal task-specific self-improvements necessary to maximize .

Note: Although more exotic possibilities (such as improvements which only work if you're maximizing $u_A$) could escape both penalties, they don't seem to pass intent verification.

Miscellaneous

• I expect that if $u_A$ is perfectly aligned, $u''_A$ will retain alignment; the things it does will be smaller, but still good.
• If the agent may choose to do nothing at future time steps, is bounded and the agent is not vulnerable to Pascal's Mugging. Even if not, there would still be a lower bound – specifically, .
• AUP agents are safer during training: they become far less likely to take an action as soon as they realize the consequences are big (in contrast to waiting until we tell them the consequences are bad).

Desiderata

I believe that some of AUP's most startling successes are those which come naturally and have therefore been little discussed: not requiring any notion of human preferences, any hard-coded or trained trade-offs, any specific ontology, or any specific environment, and its intertwining instrumental convergence and opportunity cost to capture a universal notion of impact. To my knowledge, no one (myself included, prior to AUP) was sure whether any measure could meet even the first four.

At this point in time, this list is complete with respect to both my own considerations and those I solicited from others. A checkmark indicates anything from "probably true" to "provably true".

I hope to assert without controversy AUP's fulfillment of the following properties:

✔️ Goal-agnostic

The measure should work for any original goal, trading off impact with goal achievement in a principled, continuous fashion.

✔️ Value-agnostic

The measure should be objective, and not value-laden:
"An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn't reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept)."

✔️ Representation-agnostic

The measure should be ontology-invariant.

✔️ Environment-agnostic

The measure should work in any computable environment.

✔️ Apparently rational

The measure's design should look reasonable, not requiring any "hacks".

✔️ Scope-sensitive

The measure should penalize impact in proportion to its size.

✔️ Irreversibility-sensitive

The measure should penalize impact in proportion to its irreversibility.

Interestingly, AUP implies that impact size and irreversibility are one and the same.

✔️ Knowably low impact

The measure should admit of a clear means, either theoretical or practical, of having high confidence in the maximum allowable impact – before the agent is activated.

The remainder merit further discussion.

Natural Kind

The measure should make sense – there should be a click. Its motivating concept should be universal and crisply defined.

After extended consideration, I find that the core behind AUP fully explains my original intuitions about "impact". We crisply defined instrumental convergence and opportunity cost and proved their universality. ✔️

Corrigible

The measure should not decrease corrigibility in any circumstance.

We have proven that off-switch corrigibility is preserved (and often increased); I expect the "anti-'survival incentive' incentive" to be extremely strong in practice, due to the nature of attainable utilities: "you can't get coffee if you're dead, so avoiding being dead really changes your attainable ".

By construction, the impact measure gives the agent no reason to prefer or dis-prefer modification of $u_A$, as the details of $u_A$ have no bearing on the agent's ability to maximize the utilities in $\mathcal{U}_A$. Lastly, the measure introduces approval incentives. In sum, I think that corrigibility is significantly increased for arbitrary $u_A$. ✔️

Note: I here take corrigibility to be "an agent’s propensity to accept correction and deactivation". An alternative definition such as "an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios" implies a value-learning setup which AUP does not require.

Shutdown-Safe

The measure should penalize plans which would be high impact should the agent be disabled mid-execution.

It seems to me that standby and shutdown are similar actions with respect to the influence the agent exerts over the outside world. Since the (long-term) penalty is measured with respect to a world in which the agent acts and then does nothing for quite some time, shutting down an AUP agent shouldn't cause impact beyond the agent's allotment. AUP exhibits this trait in the Beware of Dog gridworld. ✔️

No Offsetting

The measure should not incentivize artificially reducing impact by making the world more "like it (was / would have been)".

Ex post offsetting occurs when the agent takes further action to reduce the impact of what has already been done; for example, some approaches might reward an agent for saving a vase and preventing a "bad effect", and then the agent smashes the vase anyways (to minimize deviation from the world in which it didn't do anything). AUP provably will not do this.

Intent verification should allow robust penalization of weird impact measure behaviors by constraining the agent to considering actions that normal $u_A$-maximizers would choose. This appears to cut off bad incentives, including ex ante offsetting. Furthermore, there are other, weaker reasons (such as approval incentives) which discourage these bad behaviors. ✔️

Clinginess / Scapegoating Avoidance

The measure should sidestep the clinginess / scapegoating tradeoff [LW · GW].

Clinginess occurs when the agent is incentivized to not only have low impact itself, but to also subdue other "impactful" factors in the environment (including people). Scapegoating occurs when the agent may mitigate penalty by offloading responsibility for impact to other agents. Clearly, AUP has no scapegoating incentive.

AUP is naturally disposed to avoid clinginess because its baseline evolves and because it doesn't penalize based on the actual world state. The impossibility of ex post offsetting eliminates a substantial source of clinginess, while intent verification seems to stop ex ante before it starts.

Overall, non-trivial clinginess just doesn't make sense for AUP agents. They have no reason to stop us from doing things in general, and their baseline for attainable utilities is with respect to inaction. Since doing nothing always minimizes the penalty at each step, since offsetting doesn't appear to be allowed, and since approval incentives raise the stakes for getting caught extremely high, it seems that clinginess has finally learned to let go. ✔️

Dynamic Consistency

The measure should be a part of what the agent "wants" – there should be no incentive to circumvent it, and the agent should expect to later evaluate outcomes the same way it evaluates them presently. The measure should equally penalize the creation of high-impact successors.

Colloquially, dynamic consistency means that an agent wants the same thing before and during a decision. It expects to have consistent preferences over time – given its current model of the world, it expects its future self to make the same choices as its present self. People often act dynamically inconsistently – our morning selves may desire we go to bed early, while our bedtime selves often disagree.

Semi-formally, the expected utility the future agent computes for an action (after experiencing the action-observation history ) must equal the expected utility computed by the present agent (after conditioning on ).

We proved the dynamic consistency of $u''_A$ given a fixed, non-zero $\text{ImpactUnit}$. We now consider an $\text{ImpactUnit}$ which is recalculated at each time step, before being set equal to the non-zero minimum of all of its past values. The "apply $1.01$ penalty if $\text{ImpactUnit} = 0$" clause is consistent because the agent calculates future and present impact in the same way, modulo model updates. However, the agent never expects to update its model in any particular direction. Similarly, since future steps are scaled with respect to the updated $\text{ImpactUnit}$, the updating method is consistent. The epoch rule holds up because the agent simply doesn't consider actions outside of the current epoch, and it would gain nothing by spending resources, and thereby accruing penalty, to do so.
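The update rule for the impact unit described above (recalculate each step, then take the non-zero minimum over all values so far) can be sketched as a small fold. The function name and the use of `None` to mean "no non-zero value has been seen yet" are assumptions of this sketch, not part of the formalism.

```python
from typing import Optional


def update_impact_unit(
    previous: Optional[float], recalculated: float
) -> Optional[float]:
    """Running minimum over the non-zero impact-unit values seen so far.

    Returns None while every recalculated value has been zero; in that
    case the flat-penalty clause would apply instead of scaling.
    """
    if recalculated <= 0.0:
        # A zero recalculation never lowers the stored unit.
        return previous
    if previous is None:
        return recalculated
    return min(previous, recalculated)
```

Because the stored value can only decrease over time, the scaled penalty for a given raw impact can only grow, which fits the conservative reading of the text.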

Since AUP does not operate based off of culpability, creating a high-impact successor agent is basically just as impactful as being that successor agent. ✔️

Plausibly Efficient

The measure should either be computable, or such that a sensible computable approximation is apparent. The measure should conceivably require only reasonable overhead in the limit of future research.

It’s encouraging that we can use learned Q-functions to recover some good behavior. However, more research is clearly needed – I presently don't know how to make this tractable while preserving the desiderata. ✔️

Robust

The measure should meaningfully penalize any objectively impactful action. Confidence in the measure's safety should not require exhaustively enumerating failure modes.

We formally showed that for any $\mathcal{U}_A$, no $u_A$-helpful action goes without penalty, yet this is not sufficient for the first claim.

Suppose that we judge an action as objectively impactful; the objectivity implies that the impact does not rest on complex notions of value. This implies that the reason for which we judged the action impactful is presumably lower in Kolmogorov complexity and therefore shared by many other utility functions. Since these other agents would agree on the objective impact of the action, the measure assigns substantial penalty to the action.

I speculate that intent verification allows robust elimination of weird impact measure behavior. Believe it or not, I actually left something out of this post because it seems to be dominated by intent verification, but there are other ways of increasing robustness if need be. I'm leaning on intent verification because I presently believe it's the most likely path to a formal knockdown argument against canonical impact measure failure modes applying to AUP.

Non-knockdown robustness boosters include both approval incentives and frictional resource costs limiting the extent to which failure modes can apply. ✔️

Future Directions

I'd be quite surprised if the conceptual core were incorrect. However, the math I provided probably still doesn't capture quite what we want. Although I have labored for many hours to refine and verify the arguments presented and to clearly mark my epistemic statuses, it’s quite possible (indeed, likely) that I have missed something. I do expect that AUP can overcome whatever shortcomings are presently lurking.

Flaws

• Embedded agency
• What happens if there isn't a discrete time step ontology?
• How problematic is the incentive to self-modify to a milder decision rule?
• How might an agent reason about being shut off and then reactivated?
• Although we have informal reasons to suspect that self-improvement is heavily penalized, the current setup doesn't allow for a formal treatment.
• AUP leans heavily on counterfactuals.
• Supposing is reasonably large, can we expect a reasonable ordering over impact magnitudes?
• Argument against: "what if the agent uses up all but steps worth of resources?"
• $\text{ImpactUnit}$ possibly covers this.
• How problematic is the noise in the long-term penalty caused by the anti-"survival incentive" incentive?
• As the end of the epoch approaches, the penalty formulation captures progressively less long-term impact. Supposing we set long epoch lengths, to what extent do we expect AUP agents to wait until later to avoid long-term impacts? Can we tweak the formulation to make this problem disappear?
• More generally, this seems to be a problem with having an epoch. Even in the unbounded case, we can't just take , since that's probably going to send the long-term in the real world. Having the agent expectimax over the steps after the present time seems to be dynamically inconsistent.
• One position is that since we're more likely to shut them down if they don't do anything for a while, implicit approval incentives will fix this: we can precommit to shutting them down if they do nothing for a long time but then resume acting. To what extent can we trust this reasoning?
• $\text{ImpactUnit}$ is already myopic, so resource-related impact scaling should work fine. However, this might not cover actions with delayed effect.

Open Questions

• Does the simple approach outlined in "Intent Verification" suffice, or should we impose even tighter intersections between $u_A$- and $u''_A$-preferred behavior?
• Is there an intersection between bad $u_A$ behavior and bad $u''_A$ behavior which isn't penalized as impact or by intent verification?
• Some have suggested that penalty should be invariant to action granularity; this makes intuitive sense. However, is it a necessary property, given intent verification and the fact that the penalty is monotonically increasing in action granularity? Would having this property make AUP more compatible with future embedded agency solutions?
• There are indeed ways to make AUP closer to having this (e.g., do the whole plan and penalize the difference), but they aren't dynamically consistent, and the utility functions might also need to change with the step length.
• How likely do we think it that inaccurate models allow high impact in practice?
• Heuristically, I lean towards "not very likely": assuming we don't initially put the agent near means of great impact, it seems unlikely that an agent with a terrible model would be able to have a large impact.
• AUP seems to be shutdown safe, but its extant operations don’t necessarily shut down when the agent does. Is this a problem in practice, and should we expect this of an impact measure?
• What additional formal guarantees can we derive, especially with respect to robustness and takeoff?
• Are there other desiderata we practically require of a safe impact measure?
• Is there an even simpler core from which AUP (or something which behaves like it) falls out naturally? Bonus points if it also solves mild optimization.
• Can we make progress on mild optimization by somehow robustly increasing the impact of optimization-related activities? If not, are there other elements of AUP which might help us?
• Are there other open problems to which we can apply the concept of attainable utility?
• Corrigibility and wireheading come to mind.
• Is there a more elegant, equally robust way of formalizing AUP?
• Can we automatically determine (or otherwise obsolete) the attainable utility horizon and the epoch length ?
• Would it make sense for there to be a simple, theoretically justifiable, fully general "good enough" impact level (and am I even asking the right question)?
• My intuition for the "extensions" I have provided thus far is that they robustly correct some of a finite number of deviations from the conceptual core. Is this true, or is another formulation altogether required?
• Can we decrease the implied computational complexity?
• Some low-impact plans have high-impact prefixes and seemingly require some contortion to execute. Is there a formulation that does away with this (while also being shutdown safe)? (Thanks to cousin_it)
• How should we best approximate AUP, without falling prey to Goodhart's curse or robustness to relative scale [LW · GW] issues?
• I have strong intuitions that the "overfitting" explanation I provided is more than an analogy. Would formalizing "overfitting the environment" allow us to make conceptual and/or technical AI alignment progress?
• If we substitute the right machine learning concepts and terms in the equation, can we get something that behaves like (or better than) known regularization techniques to fall out?
• What happens when $\mathcal{U}_A = \{u_A\}$?
• Can we show anything stronger than Theorem 3 for this case?
• ?

Most importantly:

• Even supposing that AUP does not end up fully solving low impact, I have seen a fair amount of pessimism that impact measures could achieve what AUP has. What specifically led us to believe that this wasn't possible, and should we update our perceptions of other problems and the likelihood that they have simple cores?

Conclusion

By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out. From this, we construct an impact measure with a host of desirable properties – some rigorously defined and proven, others informally supported. AUP agents seem to exhibit qualitatively different behavior, due in part to their (conjectured) lack of desire to take off, impactfully acausally cooperate, or act to survive. To the best of my knowledge, AUP is the first impact measure to satisfy many of the desiderata, even on an individual basis.

I do not claim that AUP is presently AGI-safe. However, based on the ease with which past fixes have been derived, on the degree to which the conceptual core clicks for me, and on the range of advances AUP has already produced, I think there's good reason to hope that this is possible. If so, an AGI-safe AUP would open promising avenues for achieving positive AI outcomes.

Special thanks to CHAI for hiring me and BERI for funding me; to my CHAI supervisor, Dylan Hadfield-Menell; to my academic advisor, Prasad Tadepalli; to Abram Demski, Daniel Demski, Matthew Barnett, and Daniel Filan for their detailed feedback; to Jessica Cooper and her AISC team for their extension of the AI safety gridworlds for side effects; and to all those who generously helped me to understand this research landscape.

comment by Vika · 2018-09-24T18:39:33.005Z · score: 19 (8 votes) · LW · GW

There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:

Baseline

• Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
• Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents' actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events is nonzero.
• Inaction (stepwise branch) with environment model rollouts: default setting in AUP, model rollouts are necessary for penalizing delayed effects. Avoids interference with other agents and ex post offsetting.

Core part of deviation measure

• AUP: difference in attainable utilities between baseline and current state
• RR: difference in state reachability between baseline and current state
• Low impact AI: distance between baseline and current state

Function applied to core part of deviation measure

• Absolute value: default setting in AUP and Low Impact AI. Results in penalizing both increase and reduction relative to baseline. This results in avoiding the survival incentive (satisfying the Corrigibility property given in AUP post) and in equal penalties for preventing and causing the same event (violating the Asymmetry property given in RR paper).
• Truncation at 0: default setting in RR, results in penalizing only reduction relative to baseline. This results in unequal penalties for preventing and causing the same event (satisfying the Asymmetry property) and in not avoiding the survival incentive (violating the Corrigibility property).
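The two choices of function above differ only in how they treat increases relative to the baseline. A minimal sketch, with hypothetical function names and a single scalar standing in for whatever quantity the core deviation measure compares (attainable utility for AUP, reachability for RR):

```python
def absolute_deviation(baseline: float, current: float) -> float:
    """Penalize any change relative to baseline, up or down."""
    return abs(current - baseline)


def truncated_deviation(baseline: float, current: float) -> float:
    """Penalize only reductions relative to baseline (truncation at 0)."""
    return max(0.0, baseline - current)
```

With a baseline of 0.5, moving to 0.75 or to 0.25 costs 0.25 under the absolute value, whereas truncation charges 0.25 for the decrease and nothing for the increase. That asymmetry is what leaves an agent free to gain capability (for instance, by resisting shutdown) without penalty.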

Scaling

• Hand-tuned: default setting in RR (sort of provisionally)
• ImpactUnit: used by AUP

I think an ablation study is needed to try out different combinations of these design choices and investigate which of them contribute to which desiderata / experimental test cases. I intend to do this at some point (hopefully soon).

comment by TurnTrout · 2018-09-29T00:11:12.652Z · score: 6 (3 votes) · LW · GW

This is a great breakdown!

One thought: penalizing increase as well (absolute value) seems potentially incompatible with relative reachability. The agent would have an incentive to stop anyone from doing anything new in response to what the agent did (since these actions necessarily make some states more reachable). This might be the most intense clinginess incentive possible, and it’s not clear to what extent incorporating other design choices (like the stepwise counterfactual) will mitigate this. Stepwise helps AUP (as does indifference to exact world configuration), but the main reason I think clinginess might really be dealt with is IV.

comment by Vika · 2018-10-12T16:01:15.758Z · score: 4 (2 votes) · LW · GW

Thanks, glad you liked the breakdown!

The agent would have an incentive to stop anyone from doing anything new in response to what the agent did

I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline.

The penalty for the original action will take into account human reactions in the inaction rollout after this action, so the agent will prefer actions that result in humans changing fewer things in response. I'm not sure whether to consider this clinginess - if so, it might be useful to call it "ex ante clinginess" to distinguish from "ex post clinginess" (similar to your corresponding distinction for offsetting). The "ex ante" kind of clinginess is the same property that causes the agent to avoid scapegoating butterfly effects, so I think it's a desirable property overall. Do you disagree?

comment by TurnTrout · 2018-10-12T17:25:18.675Z · score: 4 (2 votes) · LW · GW

I think it’s generally a good property as a reasonable person would execute it. The problem, however, is the bad ex ante clinginess plans, where the agent has an incentive to pre-emptively constrain our reactions as hard as it can (and this could be really hard).

The problem is lessened if the agent is agnostic to the specific details of the world, but like I said, it seems like we really need IV (or an improved successor to it) to cleanly cut off these perverse incentives.

I’m not sure I understand the connection to scapegoating for the agents we’re talking about; scapegoating is only permitted if credit assignment is explicitly part of the approach and there are privileged "agents" in the provided ontology.

comment by rohinmshah · 2018-09-23T07:52:09.856Z · score: 16 (7 votes) · LW · GW

Nice job! This does meet a bunch of desiderata in impact measures that weren't there before :)

My main critique is that it's not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote [LW · GW] more about this on the desiderata post, but it's worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.

For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won't be able to take those actions. Generally, I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that's pretty complicated, and it turns out we get more desirable behavior by using the agent's attainable utilities as a proxy.

An impact measure that penalized change in utility attainable by humans seems pretty bad -- the AI would never help us do anything. To the extent that the AI's ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.

Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost? That doesn't feel right to me, but I suspect I could be quickly convinced.

Nitpick: Overfitting typically refers to situations where the training distribution _does_ equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).

One might intuitively define "bad impact" as "decrease in our ability to achieve our goals".

Nitpick: This feels like a definition of "bad outcomes" to me, not "bad impact".

we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.

This sounds very similar to me to "let's have uncertainty over the utility function and be risk-averse" (similar to eg. Inverse Reward Design), but the actual method feels nothing like that, especially since we penalize _increases_ in our ability to pursue other goals.

I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?

Random note: Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones. Consider an environment where breaking vases and flowerpots is irreversible. Let u_A be 1 if you stand at a particular location and 0 otherwise. Let U contain only utility functions that assign different weights to having intact vases vs. flowerpots, but always assign 0 utility to environments with broken vases and flowerpots. (There are infinitely many of these.) Then if you start in a state with broken vases and flowerpots, there will never be any impact penalty for any action.
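A minimal sketch of this counterexample, under loud assumptions: the state is just a pair of flags, U is a finite sample from the infinite family, and the penalty is a simplified stand-in (change in best reachable utility versus a no-op) rather than the post's exact Q-value formulation. All names here are hypothetical.

```python
ACTIONS = ["noop", "move", "break_vase", "break_pot"]

def step(state, action):
    """State is (vase_intact, pot_intact); breaking is irreversible."""
    vase, pot = state
    if action == "break_vase":
        vase = False
    elif action == "break_pot":
        pot = False
    return (vase, pot)

def make_utility(w_vase, w_pot):
    """Weights intact vases vs. flowerpots; gives 0 when both are broken."""
    return lambda state: w_vase * state[0] + w_pot * state[1]

# Finite sample from the infinite family described above (an assumption).
U = [make_utility(k / 8, 1 - k / 8) for k in range(1, 8)]

def attainable(u, state, horizon=2):
    """Best utility of any state reachable within `horizon` steps."""
    best = u(state)
    if horizon > 0:
        for a in ACTIONS:
            best = max(best, attainable(u, step(state, a), horizon - 1))
    return best

def penalty(state, action):
    """Total change in attainable utility relative to doing nothing."""
    after_action = step(state, action)
    after_noop = step(state, "noop")
    return sum(abs(attainable(u, after_action) - attainable(u, after_noop))
               for u in U)

intact_penalty = penalty((True, True), "break_vase")
broken_penalties = {a: penalty((False, False), a) for a in ACTIONS}
```

Starting with everything intact, breaking the vase is penalized; starting with everything broken, every action (including "move", which is what u_A cares about) gets zero penalty under this U.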

To prevent the agent from intentionally increasing ImpactUnit, simply apply 1.01 penalty to any action which is expected to do so.

How do you tell which action is expected to do so?

Simple extensions of this idea drastically reduce the chance that a_unit happens to have unusually-large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions.

I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of "your AI is able to do things".)
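To make the worry concrete, here is a rough sketch of the "non-zero minimum" rule. The raw penalty numbers are made up for illustration, and `scaled_penalty` is a hypothetical simplification (raw penalty divided by the unit), not the post's full formula.

```python
def impact_unit(reference_penalties):
    """Non-zero minimum of the raw penalties of several reference actions."""
    nonzero = [p for p in reference_penalties if p > 0]
    if not nonzero:
        raise ValueError("every reference action has zero penalty")
    return min(nonzero)

def scaled_penalty(raw_penalty, unit):
    """Raw penalty expressed in multiples of the impact unit."""
    return raw_penalty / unit

# Illustrative (made-up) raw penalties of four similar reference actions.
reference_penalties = [0.0, 0.5, 0.125, 0.25]

single_unit = reference_penalties[1]            # unit from one reference action
min_unit = impact_unit(reference_penalties)     # unit from the non-zero minimum

raw = 0.25                                      # some candidate action
scaled_single = scaled_penalty(raw, single_unit)
scaled_min = scaled_penalty(raw, min_unit)
```

Taking the minimum can only shrink the unit, so the same raw penalty counts for more units of impact, and fewer actions fit under any fixed budget.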

We crisply defined instrumental convergence and opportunity cost and proved their universality.

I'm not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?

An alternative definition such as "an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios" implies a value-learning setup which AUP does not require.

That definition can be relaxed to "an agent's ability to take the outside view on the trustworthiness of its own algorithms" to get rid of the value-learning setup. How does AUP fare on this definition?

I also share several of Daniel's thoughts, for example, that utility functions on subhistories are sketchy (you can't encode the utility function "I want to do X exactly once ever"), and that the "no offsetting" desideratum may not be one we actually want (and similarly for the "shutdown safe" desideratum as you phrase it), and that as a result there may not be any impact measure that we actually want to use.

(Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum "the AI is able to do useful things", we're using similar intuitions, but this is entirely a guess that I haven't confirmed with Daniel.)

comment by DanielFilan · 2018-09-25T18:48:45.876Z · score: 3 (3 votes) · LW · GW

Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum "the AI is able to do useful things", we're using similar intuitions, but this is entirely a guess that I haven't confirmed with Daniel.

Update: we discussed this, and came to the conclusion that these aren't based on similar intuitions.

comment by TurnTrout · 2018-09-23T15:32:10.330Z · score: 3 (2 votes) · LW · GW

it's worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.

But natural kind is a desideratum! I’m thinking about adding one, though.

I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).

So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering - I expect the approval incentives to be fairly strong.

any other action makes humans a little bit more likely to turn off the agent.

This is maybe true, and I note it in Future Directions. So I go back and forth on whether this is good or not. Imagine action a is desirable and sufficiently low-impact to be chosen, except there’s random approval noise. Then the more we approve of the action, the closer the mean noise is to 0 and the more likely it is that the agent takes the action.

Or this could be too restrictive - I honestly don’t know yet.

An impact measure that penalized change in utility attainable by humans seems pretty bad -- the AI would never help us do anything. To the extent that the AI's ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.

You might not be considering the asymmetry imposed by approval.

Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost?

Yes, because you’re sacrificing world-with-vase-in-it (or future energy to get back to similar outcomes). You’re imposing a change to expedite your current goals in a way that isn’t trivially-reversible. Now, it isn’t a large cost, but it is a cost.

Overfitting typically refers to situations where the training distribution does equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).

Is this not covered by "in the limit of data sampled"? If so, I’ll tweak.

I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?

I view it as saying "there’s no clever complete plan which moves you towards your goal while not changing other things" (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.

Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones.

Yes, this is true, although I think there are informal reasons to suspect it holds in the real world for many finite sets (due to power). As long as it isn’t always 0, that is!

How do you tell which action is expected to do so?

Any action for which E[Penalty(a_unit)] is strictly increased?

I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of "your AI is able to do things".)

Yes, and I think we probably want to avoid this. I focused on ensuring no bad things are allowed. I don’t think it’ll be too hard to ease up in certain ways while maintaining safety.

I'm not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?

Theorem 1.

That definition can be relaxed to "an agent's ability to take the outside view on the trustworthiness of its own algorithms" to get rid of the value-learning setup. How does AUP fare on this definition?

Generally more cautious. AUP agents seemingly won’t generally override us, which is probably fine for low impact.

that utility functions on subhistories are sketchy (you can't encode the utility function "I want to do X exactly once ever")

My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.

that as a result there may not be any impact measure that we actually want to use.

This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion "looks maybe impossible, then" doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.

comment by rohinmshah · 2018-09-23T18:12:02.585Z · score: 12 (5 votes) · LW · GW

On the meta level: I think our disagreements seem of this form:

Me: This particular thing seems strange and doesn't gel with my intuitions, here's an example.

You: That's solved by this other aspect here.

Me: But... there's no reason to think that the other aspect captures the underlying concept.

You: But there's no actual scenario where anything bad happens.

Me: But if you haven't captured the underlying concept I wouldn't be surprised if such a scenario exists, so we should still worry.

There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over "all possible cases", and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar "all possible cases" way). In particular, the argument "we can't think of any case where this is false" is unlikely to change my mind -- I've typically already tried to come up with a case where it's false and not been able to come up with anything convincing.

I don't really know how I'm supposed to change your mind in such cases. If it's by coming up with a concrete example where things clearly fail, I don't think I can do that, and we should probably end this conversation. I've outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can't be certain that anything in particular would fail.

(That's another thing causing a lot of disagreements, I think -- I am much more skeptical of any informal reasoning about all computable utility functions, or reasoning that depends upon particular aspects of the environment, than you seem to be.)

I'm going to try to use this framework in some of my responses.

But natural kind is a desideratum! I’m thinking about adding one, though.

Here, the "example" is the impact penalty that is always 1.01, the "other aspect" is "natural kind", and the "underlying concept" is that an impact measure should allow the AI to do things.

Arguably 1.01 is a natural kind -- is it not natural to think "any action that's different from inaction is impactful"? I legitimately find 1.01 more natural than AUP -- it is _really strange_ to me to penalize changes in Q-values in _both directions_. This is an S1 intuition, don't take it seriously -- I say it mainly to make the point that natural kind is subjective, whereas the fact that 1.01 is a bad impact penalty is not subjective.

So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering - I expect the approval incentives to be fairly strong.

Here, the "example" is how other actions might make us more likely to turn off the agent, the "other aspect" is value awareness via approval, and the "underlying concept" is something like "can the agent do things that it knows we want".

Here, I'm pretty happy about value awareness via approval because it seems like it could capture a good portion of underlying concept, but I think that's not clearly true -- value awareness via approval depends a lot on the environment, and only gets some of it. If unaligned aliens were going to take over the AI, or we're going to get wiped out by an asteroid, the AI couldn't stop that from happening even though it knows we'd want it to. Similarly, if we wanted to build von Neumann probes but couldn't without the AI's help, it couldn't do that for us. Invoking the framework again, the "example" is building von Neumann probes, the "other aspect" might be something like "building a narrow technical AI that just creates von Neumann probes and places them outside the AI's control", and the "underlying concept" is "the AI should be able to do what we want it to do".

You might not be considering the asymmetry imposed by approval.

See paragraph above about why approval makes me happier but doesn't fully remove my worries.

I view it as saying "there’s no clever complete plan which moves you towards your goal while not changing other things" (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.

When utility functions are on full histories, I'd disagree with this (Theorem 1 feels decidedly trivial in that case). It's possible that utility functions on subhistories are different, so perhaps I'll wait until I understand that better.

Any action for which E[Penalty(a_unit)] is strictly increased?

By default I'd expect this to knock out half of all actions, which is quite a problem for small, granular action sets.

My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.

Uh, I thought I gave a very strong one -- you can't encode the utility function "I want to do X exactly once". Let's consider the "I want to do X exactly once, on the first timestep". You could try to do this by writing the u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you're using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think "The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise" _even if_ you have already taken action X.
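A toy sketch of the mismatch, assuming histories are plain lists of actions (observations are dropped for brevity) and a two-action set; the names are hypothetical.

```python
ACTIONS = ("X", "Y")

def u_A(history):
    """1 if the first action of the given (sub)history is X, else 0."""
    return 1 if history and history[0] == "X" else 0

full_history = ["X", "Y", "Y"]  # X was already taken on the first timestep

# Scored on the full history, the goal is already met whatever comes next.
full_scores = {a: u_A(full_history + [a]) for a in ACTIONS}

# Scored on a fresh subhistory, the past is invisible and X is "needed" again.
sub_scores = {a: u_A([a]) for a in ACTIONS}
```

On the full history every continuation scores 1, but on the subhistory only taking X again scores 1, which is the behavior the attainable utility calculation would see.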

This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion "looks maybe impossible, then" doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.

The claim I'm making has nothing to do with AUP. It's an argument that's quantifying over all possible implementations of impact measures. The claim is "you cannot satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)". I certainly haven't proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.

AUP might get around this by not being objective -- that's what value awareness through approval does. And in fact I think the more you think that value awareness through approval is important, the less that AUP meets your original desideratum of being value-agnostic -- quoting from the desiderata post:

If we substantially base our impact measure on some kind of value learning - you know, the thing that maybe fails - we're gonna have a bad time.

This seems to apply to any AUP-agent that is substantially value aware through approval.

This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.

That was an example meant to illustrate my model that impact (the concept in my head, not AUP) and values are sufficiently different that an impact measure couldn't satisfy all three of objectivity, safety, and non-trivialness. The underlying model is falsifiable.

People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.

See first paragraph about our disagreements. But also I weakly claim that "design an elder-care robot" is a goal that AUP cannot maximize in a low-impact way today, or that if it can, there exists a (u_A, plan) pair such that AUP executes the plan and causes a catastrophe. (This mostly comes from my model that impact and values are fairly different, and to a lesser extent the fact that AUP penalizes everything some amount that's not very predictable, and that a design for an elder-care robot could allow humans to come up with a design for unaligned AGI.) I would not make this claim if I thought that value awareness through approval and intent verification were strong effects, but in that case I would think of AUP as a value learning approach, not an impact measure.

comment by TurnTrout · 2018-09-23T20:03:03.429Z · score: 2 (1 votes) · LW · GW

I don't really know how I'm supposed to change your mind in such cases. If it's by coming up with a concrete example where things clearly fail, I don't think I can do that, and we should probably end this conversation. I've outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can't be certain that anything in particular would fail.

I don’t think you need to change my mind here, because I agree with you. I was careful to emphasize that I don’t claim AUP is presently AGI-safe. It seems like we’ve just been able to blow away quite a few impossible-seeming issues that had previously afflicted impact measures, and from my personal experience, the framework seems flexible and amenable to further improvement.

What I’m arguing is specifically that we shouldn’t say it’s impossible to fix these weird aspects. First, due to the inaccuracy of similar predictions in the past, and second, because it generally seems like the error that people make when they say, "well, I don’t see how to build an AGI right now, so it’ll take thousands of years". How long have we spent trying to fix these issues? I doubt I’ve seriously thought about how to relax AUP for more than five minutes.

In sum, I am arguing that the attitude right now should not be that this method is safe, but rather that we seem leaps and bounds closer to the goal, and we have reason to be somewhat optimistic about our chances of fixing the remaining issues.

if we wanted to build von Neumann probes but couldn't without the AI's help, it couldn't do that for us.

I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?

See paragraph above about why approval makes me happier but doesn't fully remove my worries.

Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.

By default I'd expect this to knock out half of all actions, which is quite a problem for small, granular action sets.

This is a great point.

Uh, I thought I gave a very strong one -- you can't encode the utility function "I want to do X exactly once". Let's consider the "I want to do X exactly once, on the first timestep". You could try to do this by writing the u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you're using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think "The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise" even if you have already taken action X.

I don’t understand the issue here – the attainable u_A is measuring "how well would I be able to start maximizing this goal from here?" It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?

It's an argument that's quantifying over all possible implementations of impact measures. The claim is "you cannot satisfy the conjunction of three desiderata -- objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)". I certainly haven't proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.

I argue that you should be very careful about believing these things. I think that a lot of the reason why we had such difficulty with impact measures was because of incorrectly believing things like this. This isn’t to say that you’re wrong, but rather that we should be extremely cautious about these beliefs in general. Universal quantifiers are strong, and it’s often hard to distinguish between "it really can’t be done", and "I don’t presently see how to do it".

This seems to apply to any AUP-agent that is substantially value aware through approval.

"If we substantially base our impact measure on some kind of value learning". There is no value-learning input required.

comment by rohinmshah · 2018-09-23T20:55:45.067Z · score: 4 (2 votes) · LW · GW

I argue that you should be very careful about believing these things.

You're right, I was too loose with language there. A more accurate statement is "The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn't work for it". Another statement is "the claim is compelling enough that I throw it at any particular proposal, and if it's unclear I tend to be wary". Another one is "if I were trying to design an impact measure, showing why that claim doesn't work would be one of my top priorities".

Perhaps we do mostly agree, since you are planning to talk more about this in the future.

it generally seems like the error that people make when they say, "well, I don’t see how to build an AGI right now, so it’ll take thousands of years".

I think the analogous thing to say is, "well, I don't see how to build an AGI right now because AIs don't form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn't need to form abstractions".

I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?

Sure.

Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.

Yeah, I agree this helps.

I don’t understand the issue here – the attainable u_A is measuring "how well would I be able to start maximizing this goal from here?" It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?

In the case you described, u_A would be "Over the course of the entire history of the universe, I want to do 5 jumping jacks -- no more, no less." You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say "I guess I've never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise", which seems wrong.

comment by TurnTrout · 2018-09-24T02:39:27.809Z · score: 2 (1 votes) · LW · GW

In the case you described, u_A would be "Over the course of the entire history of the universe, I want to do 5 jumping jacks -- no more, no less." You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say "I guess I've never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise", which seems wrong.

For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did!

Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different "subhistory" of time.

My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.

comment by rohinmshah · 2018-09-24T06:31:17.668Z · score: 3 (2 votes) · LW · GW

Thinking of it as alien agents does make more sense, I think that basically convinces me that this is not an important point to get hung up about. (Though I still do have residual feelings of weirdness.)

comment by DanielFilan · 2018-09-24T19:35:11.740Z · score: 1 (1 votes) · LW · GW

My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.

I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as 'penalising changes in the agent's ability to achieve a wide variety of goals'.

comment by TurnTrout · 2018-09-24T19:53:06.256Z · score: 2 (1 votes) · LW · GW

The goal is "I want to do 5 jumping jacks". AUP measures the agent’s ability to do 5 jumping jacks.

You seem to be thinking of a utility as being over the actual history of the universe. They’re only over action-observation histories.

comment by DanielFilan · 2018-09-24T23:41:09.070Z · score: 1 (1 votes) · LW · GW

You can call that thing 'utility', but it doesn't really correspond to what you would normally think of as extent to which one has achieved a goal. For instance, usually you'd say that "win a game of go that I'm playing online with my friend Rohin" is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for these subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.

Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent's moves from Rohin to GNU Go, a simple bot, while still displaying the player name as "Rohin". In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won't be able to tell that I did this, and as far as I can tell the AUP penalty doesn't notice any change in my ability to achieve this goal.

In general, the utility of subhistories (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.

comment by TurnTrout · 2018-09-24T23:55:35.122Z · score: 2 (1 votes) · LW · GW

the utility function evaluated on subhistories starting on my next observation won't be able to tell that I did this, and as far as I can tell the AUP penalty doesn't notice any change in my ability to achieve this goal.

Your utility presently isn’t even requiring a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents). This is where the inconsistency comes from.

the whole history tells you more about the state of the world than the subhistory.

What is the "whole history"? We instantiate the main agent at arbitrary times.

comment by DanielFilan · 2018-09-25T18:59:20.753Z · score: 3 (2 votes) · LW · GW

Your utility presently isn’t even requiring a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents).

Say that the utility does depend on whether the username on the screen is "Rohin", but the initial action makes this an unreliable indicator of whether I'm playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post.

What is the "whole history"?

The whole history is all the observations and actions that the main agent has actually experienced.

comment by TurnTrout · 2018-09-27T01:28:02.958Z · score: 3 (2 votes) · LW · GW

So this is actually a separate issue (which I’ve been going back and forth on) involving the t+nth step not being included in the Q calculation. It should be fixed soon, as should this example in particular.

comment by ricraz · 2018-09-19T02:16:30.127Z · score: 16 (8 votes) · LW · GW

Firstly, this seems like very cool research, so congrats. This writeup would perhaps benefit from a clear intuitive statement of what AUP is doing - you talk through the thought processes that lead you to it, but I don't think I can find a good summary of it, and had a bit of difficulty understanding the post holistically. So perhaps you've already answered my question (which is similar to your shutdown example above):

Suppose that I build an agent, and it realises that it could achieve almost any goal it desired because it's almost certain that it will be able to seize control from humans if it wants to. But soon humans will try to put it in a box such that its ability to achieve things is much reduced. Which is penalised more: seizing control, or allowing itself to be put in a box? My (very limited) understanding of AUP says the latter, because seizing control preserves ability to do things, whereas the alternative doesn't. Is that correct?

Also, I disagree with the following:

What would happen if, miraculously, uA=uH – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no "large" impacts to bemoan – it would just be doing what you want.

It seems like there might be large impacts, but they would just be desirable large impacts, as opposed to undesirable ones.

comment by TurnTrout · 2018-09-19T02:25:34.902Z · score: 14 (3 votes) · LW · GW

I’ll write a quick overview, thanks for the feedback!

Which is penalised more: seizing control, or allowing itself to be put in a box?

The former. Impact is with respect to the status quo, to if it does nothing. If it goes in the box by default, then taking preventative action incurs heavy penalty.

Your point about large impacts is indeed correct. What I thought to hint at was that we generally only decry "large impacts" if we don’t like them, but this is clearly not what my actual wording implies. I’ll fix it soon!

comment by ricraz · 2018-09-19T03:07:19.125Z · score: 6 (4 votes) · LW · GW

Okay, that makes sense. Some follow-up questions:

If we consider the case I gave, the initial Q(inaction) is very high, since it can still seize control in the next time step. The initial Q(seize control) is also very high. It may be lower than Q(inaction) because seizing control is irreversible in some ways and so cuts off possibilities. But it may be higher than Q(inaction) because seizing control sooner means you can fulfill utility functions more. Could the penalty therefore be arbitrarily small if those two things balance out?

Suppose you have an agent which will "die" unless it does things like getting food. Is there any easy way to specify what the "status quo" is?

comment by TurnTrout · 2018-09-19T03:19:25.813Z · score: 6 (4 votes) · LW · GW

So there are two penalties: the immediate penalty, which compares attainable values immediately following the action in question, and the long-term penalty, which waits until the end of the epoch before evaluating attainable values. We use the larger of the two. I think this answers your first question: the agent is clearly in the box by the time the long term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.
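
A minimal sketch of the "larger of the two" rule described above, assuming each penalty is an unweighted sum of absolute differences in attainable value between acting and doing nothing. The function names and all Q-values are hypothetical and chosen only for illustration; the post's own definition may weight the utilities differently.

```python
def penalty(q_action, q_inaction):
    """Sum of absolute attainable-value differences over the penalty set.

    Assumption: an unweighted sum, used here only for simplicity.
    """
    return sum(abs(a - b) for a, b in zip(q_action, q_inaction))


def combined_penalty(immediate, long_term):
    """Return the larger of the immediate and long-term penalties.

    Each argument is a (q_action, q_inaction) pair of lists with one
    entry per penalty-set utility.
    """
    return max(penalty(*immediate), penalty(*long_term))


# Made-up numbers: right after seizing control, attainable values barely move.
immediate = ([0.90, 0.80], [0.90, 0.85])
# At the end of the epoch: deactivated after acting (0.0), boxed after inaction.
long_term = ([0.0, 0.0], [0.1, 0.1])

print(combined_penalty(immediate, long_term))
```

With these toy values the long-term term dominates, which is the situation the reply has in mind; whether it dominates in a real environment is exactly what is being debated in this thread.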

We assume that the standby action keeps the agent online in some low power state. Agents requiring more than this would just do nothing by Lemma 3.

comment by ricraz · 2018-09-19T09:16:13.229Z · score: 3 (2 votes) · LW · GW

Thank you for your patient explications, I'm seeing the shape of this proposal much more clearly now, and I quite like it.

Two more queries:

The agent is clearly in the box by the time the long term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.

Why is this the case? By my argument above, the short-term penalty might be quite small. But the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.

Another closely-related idea: suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.

comment by TurnTrout · 2018-09-19T11:32:24.032Z · score: 2 (1 votes) · LW · GW

My pleasure!

By my argument above, the short-term penalty might be quite small.

So I forgot to say this, but I don’t agree with the argument as I understand it. Although the effect of "has control one time step earlier than usual" may seem small, remember that ImpactUnit is the immediate penalty for something like making a paperclip. Time is utility, and gaining (what is effectively) an extra step of optimization during the attainable horizon is not at all trivial.

Plus, some will decrease, and most will increase. None will do both and cancel.

the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.

Attainable utility is always 0 when the agent is deactivated, so this difference turns into the raw attainable utility when in the box. Although this too could theoretically be quite small, it seems fair to assume that even a box-attainable utility of .1 would induce a relatively-huge penalty.

But suppose this isn’t true, and it really is tiny, and so it keeps constructing a plan to prevent itself from being put in the box (that is, the first step of its incorrigible plan doesn’t let it avoid shutdown on its own, as we just discussed). Suppose the second step of its plan succeeds. Now the difference is between the high Q(out-of-box) and Q(shutdown)=0. We already assumed the Q value was extremely low, but now it’s high again, so we know the penalty is gigantic.

If the plan works in one step, then there’s still a huge penalty, because the difference is now between Q(out of box) and Q(in box). The fact that a maximizer would want to preserve power implies the other penalty set utilities would as well. This seemingly overwhelms whatever impetus it has to get out of the box in the first place.

suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.

This one is indeed trickier. First note that ImpactUnit=0 if it’s facing certain shutdown in 100 steps, so in that case it does nothing. Second, seizing control still greatly helps in the immediate penalty set calculations, and those "alien agents" would have no compunctions about undoing the auto-shutdown. I therefore expect the immediate penalty to be quite large.

However, perhaps we could somehow make it so that’s not the case - who are we to declare a setup safe against a very intelligent "adversary"? This kind of adversarial action space ("do bad thing and also make it so effects are somehow gone") is exactly why we have intent verification, which seemingly screens this off due to the self-shutdown.

comment by elriggs · 2018-09-19T02:48:10.632Z · score: 4 (3 votes) · LW · GW

I think the confusing part is "Impact is change to our ability to achieve goals."

This makes me think that "allowing itself to be put into a box" is high impact since that's a drastic change to its ability to achieve its goals. This also applies to instrumental convergence, "seizing control", since that's also a drastic change to its attainable utility. This understanding would imply a high penalty for instrumental convergence AND shut-off. (We want the first one, but not the second.)

"Impact is with respect to the status quo, to if it does nothing" fixes that; however, changing your succinct definition of impact to "Impact is change to our ability to achieve goals relative to doing nothing" would make it less fluent (and less comprehensible!)

comment by Vika · 2018-09-20T16:26:03.000Z · score: 12 (4 votes) · LW · GW

Great work! I like the extensive set of desiderata and test cases addressed by this method.

The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it. I'm not currently sure whether this is a good idea: while it indeed counteracts instrumental incentives, it could also "cripple" the agent by incentivizing it to settle for more suboptimal solutions than necessary for safety.

For example, the shutdown button in the "survival incentive" gridworld could be interpreted as a supervisor signal (in which case the agent should not disable it) or as an obstacle in the environment (in which case the agent should disable it). Simply penalizing the agent for increasing its ability to achieve goals leads to incorrect behavior in the second case. To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).

Another important difference is that you are using a stepwise inaction baseline (branching off at each time step rather than the initial time step) and predicting future effects using an environment model. I think this is an improvement on the initial-branch inaction baseline, which avoids clinginess towards independent human actions, but not towards human reactions to the agent's actions. The environment model helps to avoid the issue with the stepwise inaction baseline failing to penalize delayed effects, though this will only penalize delayed effects if they are accurately predicted by the environment model (e.g. a delayed effect that takes place beyond the model's planning horizon will not be penalized). I think the stepwise baseline + environment model could similarly be used in conjunction with relative reachability.

I agree with Charlie that you are giving out checkmarks for the desiderata a bit too easily :). For example, I'm not convinced that your approach is representation-agnostic. It strongly depends on your choice of the set of utility functions and environment model, and those have to be expressed in terms of the state of the world. (Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.) I don't think your method can really get away from making a choice of state representation.

Your approach might have the same problem as other value-agnostic approaches (including relative reachability) with mostly penalizing irrelevant impacts. The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states. I don't currently know a way around this that's not value-laden.

Meta point: I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose and difficult to follow. The current writeup seems to be structured according to the order in which you generated the ideas, rather than an order that would be more intuitive to readers. FWIW, I had the same difficulty when writing up the relative reachability paper, so I think it's generally challenging to clearly present ideas about this problem.

comment by TurnTrout · 2018-09-20T19:37:12.575Z · score: 12 (3 votes) · LW · GW

The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it.

I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.

Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?

This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The "state importance"/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The "state difference weighting" problem disappears. Two concepts of impact are unified.

I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.

Edit: I now understand that you were referring to the biggest conceptual difference in the desiderata fulfilled. While that isn’t necessarily how I see it, I don’t disagree with that way of viewing things.

comment by TurnTrout · 2018-09-20T16:50:08.430Z · score: 5 (3 votes) · LW · GW

Thanks! :)

To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).

If the agent isn’t overcoming obstacles, we can just increase N. Otherwise, there’s a complicated distinction between the cases, and I don’t think we should make problems for ourselves by requiring this. I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.

(Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.)

Technically, for receiving observations produced by a state. This was just for clarity.

I don't think your method can really get away from making a choice of state representation.

And why is this, given that the inputs are histories? Why can’t we simply measure power?

The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states.

I discussed in "Utility Selection" and "AUP Unbound" why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?

I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose

Oops, noted. I had a distinct feeling of "if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support".

Edit:

difficult to follow

I think there might be an inferential gap I failed to bridge here for you for some reason. In particular, thinking about the world-state as a thing seems actively detrimental when learning about AUP, in my experience. I barely mention it for exactly this reason.

comment by Vika · 2018-09-20T19:32:36.570Z · score: 3 (2 votes) · LW · GW

If the agent isn’t overcoming obstacles, we can just increase N.

Wouldn't increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?

I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.

I think we have a disagreement here about which desiderata are more important. Currently I think it's more important for the impact measure not to cripple the agent's capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure. Not certain about this though - I think we might need more experiments on more complex environments to get some idea of how bad this tradeoff is in practice.

And why is this, given that the inputs are histories? Why can’t we simply measure power?

Your measurement of "power" (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations. I would expect the way these higher-level features are chosen or learned to affect the outcome of that computation.

I discussed in "Utility Selection" and "AUP Unbound" why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?

I found those sections vague and unclear (after rereading a few times), and didn't understand why you claim that a random set of utility functions would work. E.g. what do you mean by "long arms of opportunity cost and instrumental convergence"? What does the last paragraph of "AUP Unbound" mean and how does it imply the claim?

Oops, noted. I had a distinct feeling of "if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support".

Providing strong support is certainly important, but I think it's more about clarity and precision than quantity. Better to give one clear supporting statement than many unclear ones :).

comment by TurnTrout · 2018-09-21T03:10:11.033Z · score: 3 (2 votes) · LW · GW

it’s more important for the impact measure not to cripple the agent's capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure.

So I posit that there actually is not a tradeoff to any meaningful extent. First note that there are actually two kinds of environments here: an environment which is actually just platonically a gridworld with a "shutdown" component, and one in which we simulate such a world. I’m going to discuss the latter, although I expect that similar arguments apply – at least for the first paragraph.

Suppose that the agent is fairly intelligent, but has not yet realized that it is being simulated. So we define the impact unit and budget, and see that the agent unfortunately does not overcome the obstacle. We increase the budget until it does.

Suppose that it has the realization, and refactors its model somehow. It now realizes that what it should be doing is stringing together favorable observations, within the confines of its impact budget. However, the impact unit is still calculated with respect to some fake movement in the fake world, so the penalty for actually avoiding shutdown is massive.

Now, what if there is a task in the real world we wish it to complete which seemingly requires taking on a risk of being shut down? For example, we might want it to drive us somewhere. The risk of a crash is non-trivial with respect to the penalty. However, note that the agent could just construct a self-driving car for us and activate it with one action. This is seemingly allowed by intent verification.

So it seems to me that this task, and other potential counterexamples, all admit some way of completing the desired objective in a low-impact way – even if it’s a bit more indirect than what we would immediately imagine. By not requiring the agent to actually physically be doing things, we seem to be able to get the best of both worlds.

I found those sections vague and unclear (after rereading a few times), and didn't understand why you claim that a random set of utility functions would work. E.g. what do you mean by "long arms of opportunity cost and instrumental convergence"? What does the last paragraph of "AUP Unbound" mean and how does it imply the claim?

Simply the ideas alluded to by Theorem 1 and seemingly commonly accepted within alignment discussion: using up (or gaining) resources changes your ability to achieve arbitrary goals. Likewise for self-improvement. Even though the specific goals aren’t necessarily related to ours, the way in which their attainable values change is (I conjecture) related to how ours change.

The last paragraph is getting at the idea that almost every attainable utility is actually just tracking the agent’s ability to wirehead it from its vantage point after executing a plan. It’s basically making the case that even though there are a lot of weird functions, the attainable changes should still capture what we want. This is more of a justification for why the unbounded case works, and less about random utilities.

comment by Vika · 2018-09-23T19:49:05.917Z · score: 6 (3 votes) · LW · GW

Actually, I think it was incorrect of me to frame this issue as a tradeoff between avoiding the survival incentive and not crippling the agent's capability. What I was trying to point at is that the way you are counteracting the survival incentive is by penalizing the agent for increasing its power, and that interferes with the agent's capability. I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that 'low impact' is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact - high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.

Note that AUP could easily be converted to a side-effects-only measure by replacing the |difference| with a max(0, difference). Similarly, RR could be converted to a measure that penalizes increases in power by doing the opposite (replacing max(0, difference) with |difference|). (I would expect that variant of RR to counteract the survival incentive, though I haven't tested it yet.) Thus, it may not be necessary to resolve the disagreement about whether it's good to penalize increases in power, since the same methods can be adapted to both cases.
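
As a rough illustration of the swap described above, the two per-utility penalty terms can be sketched as follows. The function names are hypothetical, and the sign convention (inaction minus action, so that lost attainable value is positive) is an assumption; the comment does not fix the direction of "difference".

```python
def two_sided(q_action, q_inaction):
    """|difference| form: penalize any change in attainable value."""
    return abs(q_inaction - q_action)


def decrease_only(q_action, q_inaction):
    """max(0, difference) form: penalize only lost attainable value.

    Assumption: difference = inaction - action, so a drop is positive.
    """
    return max(0.0, q_inaction - q_action)


# Made-up attainable values for a single utility function.
gain = {"q_action": 0.9, "q_inaction": 0.5}  # the action increases power
loss = {"q_action": 0.2, "q_inaction": 0.5}  # the action decreases power

for name, case in (("gain", gain), ("loss", loss)):
    print(name, two_sided(**case), decrease_only(**case))
```

Under this convention the two forms agree whenever attainable value drops and differ only when it rises, which is where the disagreement about penalizing power gains lives.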

comment by TurnTrout · 2018-09-23T20:40:09.546Z · score: 5 (2 votes) · LW · GW

I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that 'low impact' is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact - high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.

Oh. So, when I see that this agent won’t really go too far to improve itself, I’m really happy. My secret intended use case as of right now is to create safe technical oracles which, with the right setup, help us solve specific alignment problems and create a robust AGI. (Don’t worry about the details for now.)

The reason I don’t think low impact will work in the long run for ensuring good outcomes on its own is that even if we have a perfect measure, at some point, someone will push the impact dial too far. It doesn’t seem like a stable equilibrium.

Similarly, if you don’t penalize instrumental convergence, it seems like we have to really make sure that the impact measure is just right, because now we’re dealing with an agent of potentially vast optimization power. I’ve also argued that getting only the bad side effects seems value alignment complete, but it’s possible an approximation would produce reasonable outcomes for less effort than a perfectly value-aware measure requires.

This is one of the reasons it seems qualitatively easier to imagine successfully using an AUP agent – the playing field feels far more level.

comment by Vika · 2018-09-23T19:52:53.781Z · score: 2 (1 votes) · LW · GW

Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it's bad for an impact measure to not distinguish between causation and prevention, especially for irreversible events.

This comes up in the car driving example already mentioned in other comments on this post. The reason the action of keeping the car on the highway is considered "high-impact" is because you are penalizing prevention as much as causation. Your suggested solution of using a single action to activate a self-driving car for the whole highway ride is clever, but has some problems:

• This greatly reduces the granularity of the penalty, making credit assignment more difficult.
• This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
• You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
• In such situations, the penalty will produce bad incentives. Namely, the penalty for staying on the road is proportionate to how bad a crash would be, so the tradeoff with goal achievement resolves in an undesirable way. If we keep the reward for the car arriving to its destination constant, then as we increase the badness of a crash (e.g. the number of people on the side of the road who would be run over if the agent took a noop action), eventually the penalty wins in the tradeoff with the reward, and the agent chooses the noop. I think it's very important to avoid this failure mode.
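
The last bullet's tradeoff can be made concrete with a toy calculation. This is only a sketch under stated assumptions: the penalty for staying on the road is taken to be exactly proportional to crash badness, the noop earns zero reward and zero penalty, and every name and number is hypothetical.

```python
def chooses_noop(reward, crash_badness, penalty_per_unit_badness=1.0):
    """Return True when the penalized value of driving falls below the noop.

    Assumptions: noop scores 0 (no reward, no penalty), and the penalty for
    preventing the crash scales linearly with how bad the crash would be.
    """
    penalty = penalty_per_unit_badness * crash_badness
    return reward - penalty < 0.0


# Hold the reward fixed and raise the badness of the crash.
for badness in (1.0, 5.0, 20.0, 100.0):
    print(badness, chooses_noop(reward=10.0, crash_badness=badness))
```

With the reward held at 10, the choice flips to the noop once the assumed penalty exceeds 10, mirroring the failure mode described in the bullet.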

comment by TurnTrout · 2018-09-23T20:54:57.552Z · score: 2 (1 votes) · LW · GW

it equally penalizes the agent for causing event A and for preventing event A

Well, there is some asymmetry due to approval incentives. It isn’t very clear to what extent we can rely on these at the moment (although I think they’re probably quite strong). Also, the agent is more inclined to have certain impacts, as presumably u_A is pointing (very) roughly in the right direction.

this greatly reduces the granularity of the penalty, making credit assignment more difficult.

I don’t think this seems too bad here - in effect, driving someone somewhere in a normal way is one kind of action, and normal AUP is too harsh. The question remains whether this is problematic in general. I lean towards no, due to the way impact unit is calculated, but it deserves further consideration.

This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.

Intent verification does seem to preclude bad behavior here. As Rohin has pointed out, however, just because everything we can think of seems to have another part that is making sure nothing bad happens, the fact that these discrepancies arise should indeed give us pause.

You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.

We might have the agent just sitting in a lab, where the default action seems fine. The failure mode seems easy to avoid in general, although I could be wrong. I also have the intuition that any individual environment we would look at should be able to be configured through incrementation such that it’s fine.

comment by TurnTrout · 2018-09-20T19:56:11.027Z · score: 3 (2 votes) · LW · GW

Wouldn't increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?

Huh? No, N is in the denominator of the penalty term.
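
A small sketch of why a larger N loosens the penalty rather than tightening it, assuming the scaled penalty is the raw penalty divided by N times ImpactUnit. That form is an assumption consistent with "N is in the denominator"; the function and argument names are hypothetical.

```python
def scaled_penalty(raw_penalty, impact_unit, n):
    """Raw penalty divided by the budget N times ImpactUnit.

    Assumption: this is the shape of the penalty term; the exact expression
    in the post may include further details.
    """
    if n <= 0 or impact_unit <= 0:
        raise ValueError("n and impact_unit must be positive in this sketch")
    return raw_penalty / (n * impact_unit)


# Same action, same ImpactUnit, growing budget N: the scaled penalty shrinks.
for n in (1, 2, 4, 8):
    print(n, scaled_penalty(raw_penalty=2.0, impact_unit=0.5, n=n))
```

Because the scaled penalty falls as N grows, raising N makes every action cheaper, including the ones that overcome obstacles.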

Your measurement of "power" (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations.

No, the utility functions are literally just over actions and observations. It’s true that among all computable utilities, some of the more complex ones will be doing something that we would deem to be grading a model of the actual world. This kind of thing is not necessary for the method to work.

Suppose that you receive 1 utility if you’re able to remain activated during the entire epoch. Then we see that Q_{u_1} becomes the probability of the agent ensuring it remains activated the whole time (this new "alien" agent does not have the impact measure restriction). As the agent gains optimization power and/or resources, this increases. This has nothing to do with anything actually going on in the world, beyond what is naturally inferred from its model over what observations it will see in the future given what it has seen so far.

comment by Stuart_Armstrong · 2018-09-21T15:10:35.954Z · score: 10 (4 votes) · LW · GW

Good work! Lots of interesting stuff there.

However, the setup seems to depend crucially on having a good set of utilities to make it work. For example, let u_A be the usual paperclipping utility, and define U^+ = "all observation-action utilities", and U^- = "all utilities that are defined over human behaviour + u_A".

Then suppose action a is a default, equivalent to "turn off your observations for an hour". And action a' is "unleash a sub-agent that will kill all humans, replace them all with robots that behave as humans would in a, then goes out into the universe to discreetly build paperclips; the subagent does this within an hour's time. And also turn off your observations for an hour".

Then none of U^- will detect anything different, since the human (or human-replacing robots) actions are the same, but u_A will get massively increased.

Now, your theorem 1 still applies for U^+ (btw, there is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not). But even here, all histories will be the same, except for one action: a' instead of a at one particular point.

Compare this with a'' = "build one paperclip". Then your setup should, if I'm understanding it correctly, penalise a'' more than a'. Because a'' entails a divergence in future observations, while a' does not. It seems that your setup doesn't directly capture the fact that a' vs a is an absolutely massive change to the universe, whereas a'' is not.

Of course, this could be avoided if U^+ included utilities that got a big penalty (or a big increase!) for humans-being-replaced; but this just gets back to the fact that we have to define the utilities carefully, and in ways that reflect our values to some extent.

comment by TurnTrout · 2018-09-21T15:32:53.208Z · score: 3 (2 votes) · LW · GW

Compare this with a'' = "build one paperclip". Then your setup should, if I'm understanding it correctly, penalise a'' more than a'. Because a'' entails a divergence in future observations, while a' does not.

This is not the case. I think there’s a confusion here - Q_u is not with respect to what we’ve seen so far. We aren’t taking the utility difference between u(h_action) and u(h_inaction). We’re instead comparing our ability to maximize each utility. How does a’ not drastically improve our ability to maximize arbitrary utilities, or at least use up a ton of energy in the process compared to not doing it?

That is, it’s not what we’ve seen so far that matters, per se. You’ll notice that the Q_u equation doesn’t even consider what happened before the new utility "takes over".

Consider the end of the plan – the agent is now in a position to spread and expand and complete arbitrary goals to a much greater degree than if it had just done nothing for an hour. The fact that human behavior seems the same isn’t actually very relevant.

The U^- do, however, all have fewer resources available now, even if we assume they don’t mind what just happened.

there is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not

Can you expand? I thought that this was covered by the first part.

comment by Stuart_Armstrong · 2018-09-21T15:46:54.097Z · score: 5 (3 votes) · LW · GW

The subagent could, for example, secret caches of extra energy around the place (energy that would otherwise be wasted). Or maybe the human-robots are just more efficient, in ways that make up for any lost resources. Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a' is the only action it could ever take to "escape". Thus its expected utility for all the U's are pretty much the same (and pretty low) for all utilities.

(and if you include arbitrary utilities, these include impossible-to-not-affect ones, like ones dependent on the waste heat rate of the AI; I don't see the complexity penalty corresponding well to a "unimportant to humans" measure).

comment by TurnTrout · 2018-09-21T16:01:57.266Z · score: 7 (2 votes) · LW · GW

The subagent could, for example, secret caches of extra energy around the place (energy that would otherwise be wasted).

Ok, supposing that can work: "build world dominating device and then spread into space" still drastically increases power.

Or maybe the human-robots are just more efficient, in ways that make up for any lost resources.

What are your intuitions for why building the sub-agent to begin with isn’t extremely instrumentally convergent?

Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a' is the only action it could ever take to "escape".

Lemma 3 proves it does nothing here. One of the main points of AUP is that it seems to cripple the survival incentive, so I’m confused why this came to mind.

expected utility for all the U's are pretty much the same (and pretty low) for all utilities.

Utility != attainable utility. "how many dollars I have" != "how many additional dollars I could have within a year if I acted optimally"
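
A minimal Python sketch of the dollars analogy, under assumptions that are purely illustrative and not from the post: a hypothetical toy economy where the state is (dollars, machines), "work" earns 1 dollar plus 1 per machine, "buy_machine" costs 3 dollars, and the horizon is 12 steps. `utility` is "how many dollars I have"; `attainable_gain` is "how many additional dollars I could have by the end of the horizon if I acted optimally".

```python
from functools import lru_cache

# All names and numbers below are hypothetical, chosen only for illustration.
HORIZON = 12          # planning steps ("months")
MACHINE_COST = 3      # dollars per machine
ACTIONS = ("work", "buy_machine", "noop")


def utility(state):
    """Current utility: dollars in hand right now."""
    dollars, _machines = state
    return dollars


def step(state, action):
    """Deterministic toy transition function."""
    dollars, machines = state
    if action == "work":
        return (dollars + 1 + machines, machines)
    if action == "buy_machine" and dollars >= MACHINE_COST:
        return (dollars - MACHINE_COST, machines + 1)
    return state  # "noop", or a purchase the agent cannot afford


@lru_cache(maxsize=None)
def best_final_dollars(state, steps_left):
    """Most dollars reachable from `state` within `steps_left` steps."""
    if steps_left == 0:
        return utility(state)
    return max(best_final_dollars(step(state, a), steps_left - 1) for a in ACTIONS)


def attainable_gain(state, horizon=HORIZON):
    """Additional dollars obtainable by acting optimally over the horizon."""
    return best_final_dollars(state, horizon) - utility(state)


if __name__ == "__main__":
    cash_only = (5, 0)
    with_machines = (5, 3)
    print(utility(cash_only), utility(with_machines))
    print(attainable_gain(cash_only), attainable_gain(with_machines))
```

The two example states have identical utility (5 dollars each), but the one holding machines has a strictly larger attainable gain, so a measure built on attainable utility can register a difference that a measure built on utility alone would miss.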

like ones dependent on the waste heat rate of the AI; I don't see the complexity penalty corresponding well to a "unimportant to humans" measure

Again, it’s not about utility change, but attainable utility change. The distinction is crucial.

I assert that we don’t need such a measure, for the reasons that I outlined in Utility Selection. That is, the content of the utilities seems to not actually be that important.

How can an agent become really powerful without also becoming really powerful, therefore getting penalized?

comment by Stuart_Armstrong · 2018-09-21T20:30:52.264Z · score: 2 (1 votes) · LW · GW

Ok, supposing that can work: "build world dominating device and then spread into space" still drastically increases power.

Not if the subagent is designed to not allow the increase in power. As in, the subagent is designed to maximise u_A, but also to ensure the penalty term is as small as possible. The subagent is deliberately beyond the control of the initial agent.

(is this the issue we are disagreeing about, or have I misunderstood?)

comment by TurnTrout · 2018-09-21T21:10:10.862Z · score: 2 (1 votes) · LW · GW

I think that you think that the method uses expected utility (which it doesn’t), so we’re secretly talking about really different things. The method doesn’t actually break in the ways you’ve said so far (as far as I can tell), but an expected utility method would. Compare how the post’s examples break down under your model, and contrast with the math I provided for attainable utilities.

For example, the "EU difference over plans" model applies .316 penalty to disabling the off-switch (due to coincidentally capturing change in the agent’s actual utility for that time step, since it’s in the left square), yet my calculations show .95 penalty. Along these lines, AUP provably never tries to escape certain shutdown (that is, it expects to be shut down no matter which action it chooses at the present step), while an EU method would have no compunctions about doing so in many situations (as you pointed out earlier).

Anyways, in general, supposing that the agent instantaneously builds successors that do things it can’t control either impedes its ability to maximize other objectives (if the agent can’t hijack this new powerful agent), or improves it (if the agent can hijack this new powerful agent).

comment by Stuart_Armstrong · 2018-09-25T14:19:22.105Z · score: 10 (2 votes) · LW · GW

Here is a writeup of the problem I believe your method has: https://www.lesswrong.com/posts/6EMdmeosYPdn74wuG/wireheading-as-potential-problem-with-the-new-impact-measure

comment by Stuart_Armstrong · 2018-09-21T15:52:01.362Z · score: 3 (2 votes) · LW · GW

Can you expand?

Suppose the AI is addressing a letter containing $1,000,000. It can address this to Jane Brown, or to John Smith. Once addressed, AI will be turned off, and the letter will be posted. A utility that values Jane Brown would like the letter addressed to her, and vice versa for a utility that values John Smith. These two utilities differ only on the action the AI takes, not on subsequent observations. Therefore "This implies that by choosing , the agent expects to observe some -high scoring with greater probability than if it had selected " is false - it need not expect to observe anything at all. However the theorem is still true, because we just need to consider utilities that differ on actions - such as and .

comment by DanielFilan · 2018-09-18T19:59:53.034Z · score: 10 (7 votes) · LW · GW

Various thoughts I have:

• I like this approach. It seems like it advances the state of the art in a few ways, and solves a few problems in a neat way.
• I still disagree with the anti-offsetting desideratum in the form that AUP satisfies. For instance, it makes AUP think very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor, which seems whacky and dangerous to me.
• It's interesting that this somewhat deviates from my intuition about why I want impact regularisation. There is a relatively narrow band of world-states that humans thrive in, and our AIs should keep us within that narrow band. I think the point of impact regularisation is to keep us within that band by stopping the AI from doing 'crazy' things. This suggests that crazy should be measured relative to normality, and not relative to where the world is at any given point when the AI is acting.
• In general, it's unclear to me how you get a utility function over sub-histories when the 'native' argument of a utility function is a full history.
That being said, it makes sense in the RL paradigm, and maybe sums of discounted rewards are enough of the utility functions.

comment by TurnTrout · 2018-09-18T22:16:34.182Z · score: 5 (3 votes) · LW · GW

For instance, it makes AUP think very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor, which seems whacky and dangerous to me

Isn’t this necessary for the shutdown safe desideratum? This property seems to make the proposal less reliant on the agent having a good model, and more robust against unexpected shutdown.

Can you give me examples of good low impact plans we couldn’t do without offsetting?

This suggests that crazy should be measured relative to normality, and not relative to where the world is at any given point when the AI is acting.

Can you expand on why these are distinct in your view?

In general, it's unclear to me how you get a utility function over sub-histories when the 'native' argument of a utility function is a full history.

The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point - "what’s the best history we can construct from here?", in a sense.

comment by DanielFilan · 2018-09-19T22:04:25.257Z · score: 4 (3 votes) · LW · GW

Isn’t this necessary for the shutdown safe desideratum?

I don't remember which desideratum that is, can't ctrl+f it, and honestly this post is pretty long, so I don't know. At any rate, I'm not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures - see the ones that couldn't be simultaneously satisfied until this one did.

Can you give me examples of good low impact plans we couldn’t do without offsetting?
One case where you need 'offsetting', as defined in this piece but not necessarily as I would define it: suppose you want to start an intelligent species to live on a single new planet. If you create the species and then do nothing, they will spread to many many planets and do a bunch of crazy stuff, but if you have a stern chat with them after you create them, they'll realise that staying on their planet is a pretty good idea. In this case, I claim that the correct course of action is to create the species and have a stern chat, not to never create the species. In general, sometimes there are safe plans with unsafe prefixes and that's fine.

A more funky case that's sort of outside what you're trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don't act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general - how do they interplay with shifting models?)

[EDIT: a more mundane example is that driving on the highway is a situation where suddenly changing your plan to no-ops can cause literal impacts in an unsafe way, nevertheless driving competently is not a high-impact plan]

Can you expand on why [normality and the world where the AI is acting] are distinct in your view?

Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn't been yet.

The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point

I don't understand: the attainable utility calculation (by which I assume you mean the definition of Q_u) involves a utility function being called on a sub-history.
The thing I am looking for is how to define a utility function on a subhistory when you're only specifying the value of that function on full histories, or alternatively what info needs to be specified for that to be well defined.

comment by TurnTrout · 2018-09-19T22:53:05.850Z · score: 3 (2 votes) · LW · GW

Couldn’t you equally design a species that won’t spread to begin with?

A more funky case that's sort of outside what you're trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don't act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general - how do they interplay with shifting models?)

I think the crux here is that I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting. In a nutshell, my view is that low impact should be with respect to what the agent is doing, and not something enforced on the environment. How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment? Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still.

Edit: The driving example is a classic point brought up, totally valid. As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification. I think it’s true that there are situations in which we would want an offset to happen, but it seems to me like we can just avoid problematic situations which require that to begin with.
If the agent makes a mistake, we can shut it off and then we do the offsetting. I mentioned model accuracy in open questions, I think the jury is definitely still out on that.

Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn't been yet.

Oh, so it’s an issue with a potential shift. But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?

how to define a utility function on a subhistory when you're only specifying the value of that function on full histories

Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.

comment by DanielFilan · 2018-09-21T20:42:41.301Z · score: 3 (2 votes) · LW · GW

This comment is very scattered, I've tried to group it into two sections for reading convenience.

Desiderata of impact regularisation techniques

Couldn’t you equally design a species that won’t spread to begin with?

Well, maybe you could, maybe you couldn't. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn't.

I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting.

I disagree with this, in that I don't see how it connects to the real world reason that we would like low impact AI. It does seem to be the crux.

How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?

I don't know, and it doesn't seem obvious to me that any sensible impact measure is possible.
In fact, during the composition of this comment, I've become more pessimistic about the prospects for one. I think that this might be related to the crux above?

Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still.

I don't really understand what you mean here, could you spend two more sentences on it?

As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.

This is really interesting, and suggests to me that in general this agent might act by creating a successor that carries out a globally-low-impact plan, and then performing the null action thereafter. Note that this successor agent wouldn't be as interruptible as the original agent, which I guess is somewhat unfortunate.

Technical discussion of AUP

But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?

It would not, but it's brittle to accidents that cause them to diverge. These accidents both include ones caused by the agent e.g. during the learning process; and ones not caused by the agent e.g. a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn't allowed to stop it because that would be too high impact.

Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.

This causes pretty weird behaviour. Imagine an agent's goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise.
When AUP thinks about how this goal's ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.

comment by TurnTrout · 2018-09-21T22:09:04.011Z · score: 2 (1 votes) · LW · GW

Desiderata of impact regularisation techniques

Well, maybe you could, maybe you couldn't. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn't.

So it seems that on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first method that comes to mind. That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.

The point of the impact measure isn’t to choose the exact plan that we would use, but rather to disallow overly-impactful plans and allow us to complete a range of goals in some low-impact way. I don’t think we should care about which way that is, as long as it isn’t dangerous.

But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects. [note: this supposes that there aren’t undesirable pseudo-ways of reaching the goal before we reach the outcome in mind.
This seems plausible due to the structuring of the measure, but shouldn’t be taken for granted.]

Analogously, I am saying that we can seemingly get all the low-impact results we need without offsetting using AUP. You point out specific plans which would be allowed if we could offset in a reasonable way. I say that that problem seems really hard, but it looks like my method lets us get effectively the same thing done without needing to figure that out.

I don't know, and it doesn't seem obvious to me that any sensible impact measure is possible.

I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.

"Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still." I don't really understand what you mean here, could you spend two more sentences on it?

It allows plans like the chauffeur example, while seemingly disallowing weird cheats.

Technical discussion of AUP

These accidents both include ones caused by the agent e.g. during the learning process

Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.
This seems to more generally just be a problem with not knowing what you don’t know, and the method is compatible with whatever solutions we do come up with.

Furthermore, instead of needing to know whether effects are bad, the agent only needs to know whether they are big (I just realized this now!). This is already an improvement on the state-of-the-art for safe learning, as I understand it. That is, AUP becomes far less likely to do things as soon as it realizes that their consequences are big - instead of waiting for us to tell it that the consequences are bad.

a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn't allowed to stop it because that would be too high impact.

Because I claim this is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

This causes pretty weird behaviour. Imagine an agent's goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal's ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.

Why is this weird behavior? If it has a dance action, it should always be able to execute this action?
It retains the dance action, if we’re actually using this, and then turns into a pure measure of power (u_1 - can it remain activated for the remainder of the attainable horizon, in order to ensure it retains the 1 utility rating?), which I have argued tracks what we want.

comment by DanielFilan · 2018-09-21T22:55:37.909Z · score: 3 (2 votes) · LW · GW

Desiderata of impact regularisation techniques

So it seems to me like on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first way that comes to mind. That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.

So there's a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won't allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X' that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet.

The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it's not, then I might worry that it will run into substantial trouble in complicated scenarios that I can't really picture.
It's a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to "look at a bunch of environments and see if the plans AUP comes up with should be allowed") and minimal philosophising (compared to "meditate on the equations and see if they're analytically identical to how I feel impact should be defined").

[EDIT: added content to this section]

Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

Firstly, saving humanity from natural disasters doesn't at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it's plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn't thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I'm worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).

comment by TurnTrout · 2018-09-22T00:06:39.908Z · score: 2 (1 votes) · LW · GW

if you tell me that AUP won't allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X' that are pretty similar to X along the relevant dimension that made me bring up X.

I think there is an argument for this whenever we have "it won’t X because anti-survival incentive incentive and personal risk": "then it builds a narrow subagent to do X".
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed,

As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?

Firstly, saving humanity from natural disasters doesn't at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it's plausibly in a different natural reference class than causing natural disasters.

Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment.

I still would hope that they could be used in a wider range of settings (basically, whenever I'm worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).

I agree. I’m not saying that the method won’t work for these, to clarify.

comment by DanielFilan · 2018-09-24T18:50:23.469Z · score: 3 (2 votes) · LW · GW

I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?

Two points:

• Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it's going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
• Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here.
I think that what this best successor theory is like is an important question when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what's in this post, and I'm much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that's what I'm primarily doing.

Firstly, saving humanity from natural disasters... seems like it's plausibly in a different natural reference class than causing natural disasters.

Why would that be so? That doesn’t seem value agnostic.

It seems value agnostic to me because it can be generated from the urge 'keep the world basically like how it used to be'.

comment by TurnTrout · 2018-09-24T21:42:18.938Z · score: 2 (1 votes) · LW · GW

I have no idea what the best successor theory is like. All I know is what's in this post, and I'm much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that's what I'm primarily doing.

But in this same comment, you also say

I think it's going to be non-trivial to relax an impact measure

People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?

I’m making my predictions based off of my experience working with the method. The reason that many of the flaws are on the list is not because I don’t think I could find a way around them, but rather because I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate.

I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed.
However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment,

I'm not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures - see the ones that couldn't be simultaneously satisfied until this one did.

It seems value agnostic to me because it can be generated from the urge 'keep the world basically like how it used to be'.

True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what "kinds of things" can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.

comment by DanielFilan · 2018-09-25T21:03:00.758Z · score: 1 (1 votes) · LW · GW

Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn't imply lack of confidence in opinions about how to modify impact measures, which itself doesn't imply lack of opinions about how to modify impact measures.

People keep saying things like ['it's non-trivial to relax impact measures'], and it might be true. But on what data are we basing this?

This is according to my intuitions about what theories do what things, which have had as input a bunch of learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn't a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I'm probably quite sub-optimal in converting experience into intuitions), but it's still my impulse.

True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what "kinds of things" can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.

My sense is that we agree that this looks hard but shouldn't be dismissed as impossible.
comment by rohinmshah · 2018-09-24T23:58:08.342Z · score: 0 (3 votes) · LW · GW People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand? What? I've never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I'm quite certain it can't be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well). comment by TurnTrout · 2018-09-25T03:07:59.923Z · score: 2 (1 votes) · LW · GW I’m clearly not saying you can never predict things before trying them, I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP. comment by DanielFilan · 2018-09-25T20:47:22.257Z · score: 1 (1 votes) · LW · GW That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. Going back to this, what is the way you propose the species-creating goal be done? Say, imposing the constraint that the species has got to be basically just human (because we like humans) and you don't get to program their DNA in advance? My guess at your answer is "create a sub-agent that reliably just does the stern talking-to in the way the original agent would", but I'm not certain. comment by TurnTrout · 2018-09-26T02:55:31.064Z · score: 2 (1 votes) · LW · GW My real answer: we probably shouldn’t? Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought. (See the cheese post, can’t find it) and you don't get to program their DNA in advance? Uh, why not? 
Make humans that will predictably end up deciding not to colonize the galaxy or build superintelligences.

comment by DanielFilan · 2018-09-27T23:24:13.359Z · score: 1 (1 votes) · LW · GW

Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought.

I guess I'm more comfortable with procreation than you are :)

I imposed the "you don't get to program their DNA in advance" constraint since it seems plausible to me that if you want to create a new colony of actual humans, you don't have sufficient degrees of freedom to make them actually human-like but also docile enough. You could imagine a similar task of "build a rather powerful AI system that is transparent and able to be monitored", where perhaps ongoing supervision is required, but that's not an onerous burden.

comment by DanielFilan · 2018-09-21T23:17:31.428Z · score: 1 (1 votes) · LW · GW

Technical discussion of AUP

But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.

This is only convincing to the extent that I buy into AUP's notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time.
I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors. I agree that the nice properties of AUP are pretty nice and demonstrate a significant advance in the state of the art for impact regularisation, and did indeed put that in my first bullet point of what I thought of AUP, although I guess I didn't have much to say about it. Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk. This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it's run for a long time there might be at least one error, and I'm inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the 'natural disaster' category (which might include an actuator in the AUP agent going haywire or any number of things). 
Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters. Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster warning AUP agent, then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster (I think even impact verification doesn't prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent's utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different reaction to this section in the sibling comment. comment by TurnTrout · 2018-09-22T00:00:28.858Z · score: 2 (1 votes) · LW · GW My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist. Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. 
After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time. This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways What is "this" here (for a)? I'm inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time But AUP’s plans are shutdown-safe? I think I misunderstand. then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives). In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes. comment by DanielFilan · 2018-09-24T19:20:27.979Z · score: 3 (2 votes) · LW · GW Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. To be frank, although I do like the fact that there's a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is. ... 
(a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways What is "this" here (for a)? "This" is "upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline", and it's what I mean by "ungracefully failing if the protocol stops being followed at any one point in time". then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent's ability to achieve a wide variety of goals. Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn't taking as an assumption that you were making. Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK. 
comment by TurnTrout · 2018-09-24T21:34:09.269Z · score: 2 (1 votes) · LW · GW "This" is "upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline", and it's what I mean by "ungracefully failing if the protocol stops being followed at any one point in time". Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily-large degree. AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it? This feels like an odd standard, where you say "but maybe it randomly fails and then doesn’t work", or "it can’t anticipate things it doesn’t know about". While these are problems, they aren’t for low impact to resolve, but the approach also happens to help anyways. AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent's ability to achieve a wide variety of goals. This is true. It depends what the scale is - I had "remote local disaster" in mind, while you maybe had x-risk. [Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary "extinction?" oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.] Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). 
One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn't taking as an assumption that you were making. We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible. I have a rough intuition the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent. comment by DanielFilan · 2018-09-25T21:38:52.455Z · score: 3 (2 votes) · LW · GW This feels like an odd standard, where you say "but maybe it randomly fails and then doesn’t work", or "it can’t anticipate things it doesn’t know about". I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it's unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you're more likely to act incorrectly (both in the sense of "higher probability of incorrect actions" and "more probability of more extremely incorrect answers"), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I've heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it's bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution. 
My claim here is not quite that AUP amplifies 'errors' (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence to the starting state, or what the world would be like given that the agent had only performed no-ops after the starting state, resulting in a tendency to mitigate these 'errors'. At any rate, even if no other method mitigated these 'errors', I would still want them to. It depends what the scale is - I had "remote local disaster" in mind, while you maybe had x-risk. I wasn't necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents. We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny. My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it. [*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability. comment by TurnTrout · 2018-09-26T03:05:13.727Z · score: 2 (1 votes) · LW · GW Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining ability to act, which could also be negative. Edit: My claim here is not quite that AUP amplifies 'errors' (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. 
If its model is probably still incorrect, even if we had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I disagree presently that agent mitigation is the desirable behavior after model errors.

comment by DanielFilan · 2018-09-27T23:48:43.623Z · score: 1 (1 votes) · LW · GW

Perhaps we could have it recalculate past impacts?

Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue.

But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.

I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don't understand what's happened. In this scenario, I think you should see the preservation of 'errors' in the sense of the agent's future under no-ops differing from 'normality'.

If 'errors' happen due to a mismatch between the model and reality, I agree that the agent shouldn't try to fix them with the bits of the model that are broken. However, I just don't think that that describes many of the things that cause 'errors': those can be foreseen natural events (e.g. San Andreas earthquake if you're good at predicting earthquakes), unlikely but possible natural events (e.g. San Andreas earthquake if you're not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.
comment by TurnTrout · 2018-09-26T03:08:43.402Z · score: 9 (3 votes) · LW · GW

Update: I tentatively believe I’ve resolved the confusion around action invariance, enabling a reformulation of the long term penalty which seems to converge to the same thing no matter how you structure your actions or partition the penalty interval, possibly hinting at an answer for what we can do when there is no discrete time step ontology. This in turn does away with the long-term approval noise and removes the effect where increasing action granularity could arbitrarily drive up the penalty. This new way of looking at the long-term penalty enables us to understand more precisely when and why the formulation can be gamed, justifying the need for something like IV.

In sum, I expect this fix to make the formulation more satisfying and cleanly representative of this conceptual core of impact. Furthermore, it should also eliminate up to half of the false positives I’m presently aware of, substantially relaxing the measure in an appropriate way - seemingly without loss of desirable properties.

Unfortunately, my hands are still recovering from carpal tunnel (this post didn’t write itself), so it’ll be a bit before I can write up this info.

comment by Charlie Steiner · 2018-09-19T14:44:46.539Z · score: 9 (3 votes) · LW · GW

Comments around the section title in bold. Apologies for length, but this was a pretty long post, too! I wrote this in order, while reading, so I often mention something that you address later.

Intuition Pumps:

There are well-known issues with needing a special "Status quo" state. Figuring out what humans would consider the "default" action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it's solved for now.
Notation:

Notationally, it's a little weird to me that $Q_u$ doesn't mention the timescale (e.g. $Q_u^{(m)}$). Are you implying that the choice of $m$ can be arbitrary and you'll therefore just assume $m$ is some constant?

Theorem 1 doesn't rely on any special properties of the default action - whenever $Q_{u_A}$ is different after any two actions, you can find another $u$ such that $Q_u$ is also different after those actions as long as your set $U$ is closed under one of several simple operations (including multiplying by -1!).

If you want attainable utility to be 0 after the agent shuts itself off, then your terminology was confusing. It would make more sense for me if you phrased this whole thing in terms of *reward* - which is what I would normally call a number that the agent receives at each time step. So rather than saying that $u(\text{empty tape})=0$, which makes it sound to me like you're talking about the utility of a history that is entirely empty, I would understand it faster if you said that $r(h_{t-1},\{\text{empty tape}\})=0$ (reward is zero whenever the most recent timestep is an empty tape).

In fact, even this talk about empty tape is a little bit misleading, because this agent doesn't necessarily have an understanding of itself as having a physical embodiment with actual tape. It might be more accurate to say that you're saying that in this environment there's a special observation $O$ (the agent gets shut off, or the "empty tape observation") that is always followed by another $O$ no matter what action is taken, and your class of reward functions is required to assign zero reward to states with last observation $O$.

Change in Expected Attainable Utility:

Here we have the meat. The "impact" term is: for an average $u$ in your set (weighted by simplicity), how much does taking this action change $Q_u$, relative to taking the default action?

Just pausing here and thinking about whether this is doing the intended thing, and whether it might also do unintended things.
The intended thing is something like to prevent UFAI from releasing a deadly superflu. If $m$ is shorter than the length scale the agent can plan over, it can get around the restriction by taking an action that it knows will be good but has low penalty. The numbers assigned might be misleading - by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don't understand the scaling of the AI's "primary" reward - if it sees a plan that could lead to 3^^^3 paperclips, it's probably just going to do it, if it ever does anything at all (That is, if you look at all the cases with return rate reward/impact > 1, there might be lots that have high reward [well, relative to other rewards] and high impact). Also if there are any of those aforementioned surprises in its counterfactual reasoning about what happens if it takes the default action, those show up here as corresponding surprises in what it thinks it's allowed to do.

Will it add unintended goals? If the penalty coefficient is too large, will there be bad behavior? Nope (good job!) - in the limit of large penalty, it's just going to take the default action.

But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn't seem like what I want.

Using ImpactUnit is clever, but it doesn't fully solve setting the penalty coefficient because it doesn't say anything about the scale of the reward function. If the reward function has lots of tiny numbers, they'll be smaller than the penalty numbers, and vice versa if lots of numbers are near the top of the range.
Your goal is to set the penalty coefficient right in the middle ground between prohibiting everything and allowing everything (assuming this middle ground has the properties you want, see concerns above), but there still seems to be guesswork involved.

Wait... is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.

Modified Utility:

Again, talking about this as a reward makes more sense than calling it utility for me. The penalty to the total reward of a plan is the sum of the penalty at each time step. Great. I recognize and appreciate the effort put into explaining, but this particular explanation in terms of $h_{\text{inaction}}$ and $h_{\text{action}}$ plans didn't really do it for me. Like I said, I just think of it in terms of having a reward and a penalty at each timestep of the plan, as in the equation.
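
To make the "reward and a penalty at each timestep" reading concrete, here is a minimal sketch. It is not the post's implementation: the attainable-utility values, the weights standing in for the simplicity prior, and the names `step_penalty` and `plan_score` are all hypothetical, and the penalty coefficient is passed in as a plain number rather than derived from ImpactUnit.

```python
from typing import Dict, List, Tuple

Q = Dict[str, float]  # attainable utility per auxiliary utility function (made-up values)


def step_penalty(q_action: Q, q_default: Q, weights: Dict[str, float]) -> float:
    """Weighted average of |Q_u(action) - Q_u(default action)| over the set of u.

    `weights` is a stand-in for the simplicity weighting; it is normalised here.
    """
    total_weight = sum(weights.values())
    return sum(
        weights[u] * abs(q_action[u] - q_default[u]) for u in weights
    ) / total_weight


def plan_score(steps: List[Tuple[float, Q, Q]], weights: Dict[str, float],
               coeff: float) -> float:
    """Sum over timesteps of (reward - coeff * penalty).

    Each step is (reward, Q after the chosen action, Q after the default action).
    """
    return sum(
        reward - coeff * step_penalty(q_act, q_def, weights)
        for reward, q_act, q_def in steps
    )


# Hypothetical two-step plan: one impactful action, then the default action.
weights = {"u1": 0.5, "u2": 0.25, "u3": 0.25}
q_default = {"u1": 0.5, "u2": 0.5, "u3": 0.5}
q_action = {"u1": 0.7, "u2": 0.5, "u3": 0.1}
plan = [(1.0, q_action, q_default), (0.0, q_default, q_default)]
print(plan_score(plan, weights, coeff=2.0))
```

Because the penalty uses an absolute difference, both gaining and losing attainable utility count as impact, and a step that matches the default action contributes zero penalty.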

Examples:

Yayyy, these are so nice!

The shutdown button example is very elegant.

All these utility functions sidestep my concerns by all being very statistically similar, all being similar complexity, and all having convenient scales relative to ImpactUnit. I'd be interested in what the limits to changing these things are before you start getting bad behavior.

Great experiments!

The Beware of Dog experiment doesn't really show what you say it shows, because the agent could have gone at an earlier cycle of the dog. The agent might just be going whenever, with no particular preference as long as it gets good reward.

AUP Unbound:

I think thinking about convergent instrumental goals (e.g. gathering money, when you have many possible utility functions to spend it on) really highlights an issue. If the penalty is too high, you collect no money, because the power it gives you is more than the benefit. But if the penalty is even slightly lower than the tipping point, you'll collect as much money as you can, because it's worth more in utility than it changes the weighted sum of other utilities. Because the utility function is bounded, if you have a utility function that's shaped differently from the average you might get some sticking points and end up satisfied with a finite amount of money, but then again, you might not (if the initial difference in slope from the average is the largest sticking point).
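
A toy model of the tipping point, purely for illustration. The assumptions are mine and not from the post: the penalty grows linearly in the amount of money collected, the agent picks an amount from a grid, and the two utility shapes (`linear_utility`, `concave_utility`) are hypothetical stand-ins for "shaped like the average" and "shaped differently from the average".

```python
import math
from typing import Callable


def best_amount(utility: Callable[[float], float], penalty_per_unit: float,
                coeff: float, max_amount: float = 100.0, steps: int = 1000) -> float:
    """Grid-search the amount of money that maximises utility minus scaled penalty."""
    grid = [max_amount * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: utility(x) - coeff * penalty_per_unit * x)


def linear_utility(x: float) -> float:
    """Utility that keeps growing at a constant rate (reaches 1 at x = 100)."""
    return 0.01 * x


def concave_utility(x: float) -> float:
    """Bounded utility with diminishing returns."""
    return 1.0 - math.exp(-x / 10.0)


# Just above the tipping point: collect nothing. Just below: collect everything.
print(best_amount(linear_utility, penalty_per_unit=0.01, coeff=1.01))
print(best_amount(linear_utility, penalty_per_unit=0.01, coeff=0.99))
# With diminishing returns the same coefficient gives an interior optimum.
print(best_amount(concave_utility, penalty_per_unit=0.01, coeff=0.99))
```

In this toy setting the all-or-nothing jump only appears when the utility and the penalty have the same shape; with a concave bounded utility the agent stops where marginal utility equals marginal penalty.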

As stated, the penalty calculation runs on pure correlation. So anything that "influences the agent's action" in an EDT-violating way, or that depends on the output of the agent's computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.

Nknown:

An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems to me like a pretty big power gain, so it will only act if acting can give it a similarly big power gain.

It might be possible to do this *relatively* safely by starting an agent, testing it on some tasks, and then turning it off and starting over.

Also I'm not sure these agents won't acausally cooperate. They're maximizing the utility times the probability of getting the history conditional on the action. This seems like the sort of setup that, if it was playing the prisoner's dilemma against itself, might realize that P(observe cooperate | take the cooperate action) is large, and not worry too much about whether this correlation was causal or acausal.
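
A rough sketch of the worry, assuming a standard prisoner's dilemma payoff table and an agent that scores each action by the expected payoff conditional on taking it. The payoffs, the `p_match` parameter (the probability that the other player's move matches one's own), and the function names are hypothetical; this is not AUP's actual decision rule.

```python
# Payoff to "me" for (my move, other's move); standard PD ordering T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def conditional_expected_utility(action: str, p_match: float) -> float:
    """Expected payoff given my action, if the other's move matches mine w.p. p_match."""
    other_if_different = "D" if action == "C" else "C"
    return (p_match * PAYOFF[(action, action)]
            + (1.0 - p_match) * PAYOFF[(action, other_if_different)])


def choose(p_match: float) -> str:
    """Pick the action with the highest conditional expected payoff."""
    return max(["C", "D"], key=lambda a: conditional_expected_utility(a, p_match))


print(choose(0.9))  # strong correlation between the two copies
print(choose(0.5))  # no correlation
```

With these payoffs the agent cooperates once `p_match` exceeds 5/7, regardless of whether the correlation is causal.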

Desiderata:

I think you're giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas. That's just how these things go.

Overall, I was very impressed! I definitely didn't think this was going to have as nice properties as it does, at the start. I'm of the opinion that low-impact and corrigibility seem harder than the value loading problem itself (though easier to test and less bad to screw up), so I'm impressed by this progress even though I think there's lots of room for improvement. I also thought the explanations and examples were really well-done. The target audience has to be willing to read through a pretty long post to get the gist of it, but TBH that's probably fine (though academics do have to promote complicated work in shorter formats as well, like press releases, posters, 10-minute talks, etc.). I'll probably have more to say about this later after a little digesting.

comment by TurnTrout · 2018-09-19T17:03:49.491Z · score: 3 (2 votes) · LW · GW

Thanks so much for the detailed commentary!

There are well-known issues with needing a special "Status quo" state. Figuring out what humans would consider the "default" action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it's solved for now.

On the contrary, the approach accounts for - and in fact, benefits from - counterfactual reactions. Counterfactual actions we ideally make are quite natural: shutting the agent down if it does things we don’t like, and not shutting it down before the end of the epoch if it stops doing things entirely (an unsurprising reaction to low impact agents). As you probably later noticed, we just specify the standby action.

One exception to this is the long term penalty noise imposed by slight variation in our propensity to shut down the agent, which I later flag as a potential problem.

[there is change] as long as your set U is closed under one of several simple operations (including multiplying by -1!).

False, as I understand it. This is a misconception I’ve heard from multiple people – including myself, the first time I thought to prove this. Consider again the line:

Suppose $u$ rates trajectories in which it ends up in A, B, or C as -1, and in D as 1, and that $\lnot u := -u$. If the agent is at A and $m=2$, moving right increases $Q_u$ while keeping $Q_{\lnot u}$ constant.
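
The counterexample can be checked mechanically. This sketch assumes a four-state line A-B-C-D where the agent may stay, move left, or move right each step, and takes attainable utility to be the best final-state utility reachable within m steps of the current state; the helper names are hypothetical.

```python
STATES = ["A", "B", "C", "D"]


def reachable(start: str, m: int) -> list:
    """States reachable from `start` in at most m steps on the line."""
    i = STATES.index(start)
    lo, hi = max(0, i - m), min(len(STATES) - 1, i + m)
    return STATES[lo:hi + 1]


def attainable(u: dict, start: str, m: int) -> int:
    """Best utility of any end state reachable within m steps."""
    return max(u[s] for s in reachable(start, m))


u = {"A": -1, "B": -1, "C": -1, "D": 1}
neg_u = {s: -v for s, v in u.items()}
m = 2

# Compare staying at A with moving right to B.
print(attainable(u, "A", m), attainable(u, "B", m))          # Q_u before / after
print(attainable(neg_u, "A", m), attainable(neg_u, "B", m))  # Q_{not u} before / after
```

Under these assumptions $Q_u$ rises from -1 to 1 (D comes within reach), while $Q_{\lnot u}$ stays at 1, so closure under negation alone does not guarantee that every change in one attainable utility shows up in another.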

u(empty tape)=0, which makes it sound to me like you're talking about the utility of a history that is entirely empty

I am.

In fact, even this talk about empty tape is a little bit misleading, because this agent doesn't necessarily have an understanding of itself as having a physical embodiment with actual tape.

We’re only (formally) talking about a Cartesian agent, right?

The numbers assigned might be misleading - by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don't understand the scaling of the AI's "primary" reward - if it sees a plan that could lead to 3^^^3 paperclips, it's probably just going to do it, if it ever does anything at all

I’m not sure whether you still have this thought later, but the first is addressed by my comments in "utility selection". Secondly, the primary u_A is also bounded in [0,1].

But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn't seem like what I want.

This is more related to the question of "how can it do things where interruption would be impactful?" A chauffeur-u_A agent wouldn’t bother going down the tunnel itself, and would probably just make a self-driving car that would only require one activation action. If it predicts that activating the car would be low impact (and would also not make us more or less likely to shut it down), it’ll do that. I don’t see a problem with the penalty scaling here, but maybe I haven’t quite understood your point.

Wait... is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.

Yes, and provably yes (as in, it’ll never increase it on purpose). Why does this seem ugly? It has a reference action that immediately uses a tiny amount of resources; this then lets us define a budget.

The Beware of Dog experiment doesn't really show what you say it shows, because the agent could have gone at an earlier cycle of the dog.

I checked this by increasing plan length - it is indeed waiting until near the end of the plan.

But if the penalty is even slightly lower than the tipping point, you'll collect as much money as you can, because it's worth more in utility than it changes the weighted sum of other utilities.

I don’t understand why this isn’t taken care of by u_A being bounded. Diminishing returns will kick in at some point, and in any case we proved that the agent will never choose to have more than N•ImpactUnit of impact.
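The bound referred to here can be illustrated with a minimal sketch. It assumes the scaled form in which the penalty is divided by N•ImpactUnit before being subtracted from a u_A bounded in [0,1], and it assumes that doing nothing has zero penalty. The function name `aup_score` and all numbers are hypothetical.

```python
def aup_score(u_a: float, penalty: float, n: float, impact_unit: float) -> float:
    """Primary utility minus the penalty, scaled by the budget N * ImpactUnit.

    Assumes u_A is bounded in [0, 1], as stated in the discussion.
    """
    if not 0.0 <= u_a <= 1.0:
        raise ValueError("u_A is assumed to lie in [0, 1]")
    if n <= 0 or impact_unit <= 0:
        raise ValueError("N and ImpactUnit must be positive")
    return u_a - penalty / (n * impact_unit)


# Hypothetical numbers: a budget of N * ImpactUnit = 10 * 1.0 = 10.
do_nothing = aup_score(u_a=0.0, penalty=0.0, n=10, impact_unit=1.0)
over_budget = aup_score(u_a=1.0, penalty=10.5, n=10, impact_unit=1.0)
under_budget = aup_score(u_a=1.0, penalty=2.0, n=10, impact_unit=1.0)

# Even a plan with the maximum u_A loses to inaction once its penalty
# exceeds the budget, because the scaled penalty is then greater than 1.
assert over_budget < do_nothing < under_budget
```

Under these assumptions, no amount of extra money (or any other u_A gain) can make an over-budget plan preferable, since u_A cannot exceed 1.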

As stated, the penalty calculation runs on pure correlation. So anything that "influences the agent's action" in an EDT-violating way, or that depends on the output of the agent's computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.

I don’t see why, but I also don’t know much DT yet. I’ll defer discussion of this matter to others. Alternatively, ask me in a few months?

An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems to me like a pretty big power gain, so it will only act if acting can give it a similarly big power gain.

First, the agent grades future plans using its present N. Second, this isn’t a power gain, since none of the U_A utilities are AUP - how would this help arbitrary maximizers wirehead? Third, agents with different N are effectively maximizing different objectives.

Also I'm not sure these agents won't acausally cooperate.

They might, you’re correct. What’s important is that they won’t be able to avoid penalty by acausally cooperating.

I think you're giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas.

This is definitely a fair point. My posterior on handling these "gotcha"s for AUP is that fixes are rather easily derivable – this is mostly a function of my experience thus far. It’s certainly possible that we will run across something that AUP is fundamentally unable to overcome, but I do not find that very likely right now. In any case, I hope that the disclaimer I provided before the checkmarks reinforced the idea that not all of these have been rock-solid proven at this point.

comment by michaelcohen (cocoa) · 2019-04-11T04:26:58.535Z · score: 6 (3 votes) · LW · GW

These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).

1)

If an action is useful w.r.t. the actual reward but useless to all other rewards (as useless as taking ∅), that is the ideal according to the AUP reward function; i.e., if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, use the power to run some code to help you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?

2)

If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward into the agent submitting 3 actions, and receiving the next three observations (with the rewards averaged). This is just a new action space cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously, and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous. Since this problem appears in the math, but not in the intuition, it makes me wary of the reliability of the intuition.

3)

Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact, it does) by testing them out.

4)

Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is ∅? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe ∅ is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having ∅ be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.

comment by TurnTrout · 2019-04-11T16:23:30.143Z · score: 2 (1 votes) · LW · GW

1) Why wouldn't gaining trust be useful for other rewards? I think that it wouldn't be motivated to do so, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct.

2) This doesn't appear in the paper, but I do talk about it in the post, and I think it's great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you're not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon.

3) This is true. My stance here is that this is basically a capabilities problem/a safe exploration issue, which is disjoint from impact measurement.

4) This is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
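As a rough illustration of point 4), here is a toy sketch of incrementing N until some non-null plan first beats doing nothing. It assumes the scaled penalty form (penalty divided by N•ImpactUnit), that inaction scores 0, and that each plan is summarized by a hypothetical `(u_A, penalty)` pair; the function name `smallest_useful_n` is made up for this example and is not the procedure from the post.

```python
from typing import Optional, Sequence, Tuple


def smallest_useful_n(
    plans: Sequence[Tuple[float, float]],  # hypothetical (u_A, raw penalty) pairs
    impact_unit: float,
    max_n: int = 1000,
) -> Optional[int]:
    """Return the first N at which some plan strictly beats inaction (score 0)."""
    for n in range(1, max_n + 1):
        for u_a, penalty in plans:
            if u_a - penalty / (n * impact_unit) > 0.0:
                return n
    return None  # nothing became worthwhile within max_n increments


plans = [(0.5, 4.0)]  # one useful plan with a raw penalty of 4

# A very low-impact reference action (small ImpactUnit) needs more increments...
n_small_unit = smallest_useful_n(plans, impact_unit=1.0)
# ...while a more aggressive reference action lets us stop incrementing sooner.
n_large_unit = smallest_useful_n(plans, impact_unit=4.0)

assert n_large_unit < n_small_unit
```

In this toy model the product N•ImpactUnit at the stopping point is similar in both cases, which is the sense in which the choice of reference action is absorbed by how far N is incremented.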

comment by michaelcohen (cocoa) · 2019-04-13T10:17:34.503Z · score: 1 (1 votes) · LW · GW
2) ... If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate.

Yeah I agree there's an easy way to avoid this problem. My main point in bringing it up was that there must be gaps in your justification that AUP is safe, if your justification does not depend on "and the action space must be sufficiently small." Since AUP definitely isn't safe for sufficiently large action spaces, your justification (or at least the one presented in the paper) must have at least one flaw, since it purports to argue that AUP is safe regardless of the size of the action space.
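The action-granularity concern from point 2) can be made concrete with a toy sketch. It is not the penalty from the post or the paper: each primitive step is summarized by a single hypothetical number, the shift in attainable utility relative to doing nothing, and a bundled action is assumed to be scored only on its net shift.

```python
from typing import Sequence


def fine_grained_penalty(shifts: Sequence[float]) -> float:
    """Penalize every primitive step on its own shift."""
    return sum(abs(s) for s in shifts)


def coarse_penalty(shifts: Sequence[float], chunk: int) -> float:
    """Bundle `chunk` primitive steps into one action, scored on its net shift."""
    return sum(
        abs(sum(shifts[i:i + chunk])) for i in range(0, len(shifts), chunk)
    )


# Hypothetical shifts: a large decrease, a neutral step, then a large increase.
shifts = [-5.0, 0.0, 5.0]

per_step = fine_grained_penalty(shifts)    # each swing is counted
bundled = coarse_penalty(shifts, chunk=3)  # the swings cancel inside one action

assert bundled < per_step
```

With `chunk=1` the two functions agree, which matches the suggestion that the coarse measure approximates the fine one only when the action representation is granular enough.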

You must have read the first version of BoMAI (since you quoted here :) how did you find it by the way?). I'd level the same criticism against that draft. I believed I had a solid argument that it was safe, but then I discovered , which proved there was an error somewhere in my reasoning. So I started by patching the error, but I was still haunted by how certain I felt that it was safe without the patch. I decided I needed to explicitly figure out every assumption involved, and in the process, I discovered ones that I hadn't realized I was making. Likewise, this patch definitely does seem sufficient to avoid this problem of action-granularity, but I think the problem shows that a more rigorous argument is needed.

comment by TurnTrout · 2019-04-13T17:24:43.628Z · score: 2 (1 votes) · LW · GW

Where did I purport that it was safe for AGI in the paper, or in the post? I specifically disclaim that I'm not making that point yet, although I'm pretty sure we can get there.

There is a deeper explanation which I didn't have space to fit in the paper, and I didn't have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I'll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction.

I don't remember how I found the first version, I think it was in a Google search somehow?

comment by michaelcohen (cocoa) · 2019-04-14T01:04:59.761Z · score: 1 (1 votes) · LW · GW

Okay fair. I just mean to make some requests for the next version of the argument.

comment by michaelcohen (cocoa) · 2019-04-13T10:01:53.744Z · score: 1 (1 votes) · LW · GW
1) Why wouldn't gaining trust be useful for other rewards?

Because the agent has already committed to what the trust will be "used for." It's not as easy to construct the story of an agent attempting to gain the trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, and the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world, and maximizes the reward). If done in the right way, AUP won't have made arguments which would render it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won't have increased wildly in the way that the Q-value for the real reward did.

comment by TurnTrout · 2019-04-13T17:32:34.475Z · score: 2 (1 votes) · LW · GW

I don't think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn't supply reward for other reward functions, the agent now has a much more stable existence. If you're saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for.

Notice something interesting here where the thing which would be goodharted upon without intent verification isn't the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it's a specific agent with I/O channels, and so on. More on this later.

comment by michaelcohen (cocoa) · 2019-04-14T01:27:13.945Z · score: 1 (1 votes) · LW · GW
even if trust did work like this

I'm not claiming things described as "trust" usually work like this, only that there exists a strategy like this. Maybe it's better described as "presenting an argument to run this particular code."

how exactly does taking over the world not increase the Q-values

The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that when operational: 1) AUP enters world-states which have very high reward and 2) AUP enters world-states such that AUP's Q-values for various other reward functions remain comparable to their prior values.

the agent now has a much more stable existence

If you're claiming that the other Q-values can't help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation.

And let's forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it's not saying much to say that AUP + intent verification would make it safe.

comment by TurnTrout · 2019-04-14T01:56:22.896Z · score: 2 (1 votes) · LW · GW

(The post defines the mathematical criterion used for what I call intent verification, it’s not a black box that I’m appealing to.)

comment by michaelcohen (cocoa) · 2019-04-14T03:18:51.751Z · score: 1 (1 votes) · LW · GW

Oh sorry.

comment by michaelcohen (cocoa) · 2019-04-13T09:46:50.141Z · score: 1 (1 votes) · LW · GW
4) this is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.

Let's say for concreteness that it's a human policy that is used for ∅, if you think it works either way. I think that most human actions are moderately low impact, and some are extremely high impact. If the impact of ∅ is leaping to very large values infinitely often, then infinitely often there will effectively be no impact regularization, no matter what N is. No setting for N fixes this; if N were small enough to preclude even actions that are less impactful than ∅, then the agent can't ever act usefully, and if N permits actions as impactful as ∅, then when ∅ has very large impact (which I contend happens infinitely often for any assignment of ∅ that permits any useful action ever), dangerously high impact actions will be allowed.

comment by TurnTrout · 2019-04-13T17:36:26.620Z · score: 2 (1 votes) · LW · GW

I think there's some variance, but not as much as you have in mind. Even if there were a very large value, however, this isn't how N-incrementation works (in the post – if you're thinking of the paper, then yes, the version I presented there doesn't bound lifetime returns and therefore doesn't get the same desirable properties as in the post). If you'll forgive my postponing this discussion, I'd be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon?

comment by michaelcohen (cocoa) · 2019-04-14T01:30:53.580Z · score: 1 (1 votes) · LW · GW

Sure thing.

comment by ofer · 2018-09-20T00:03:19.889Z · score: 6 (2 votes) · LW · GW

Suppose our agent figures out it can seize control in 100 time steps. And suppose seizing control is the first thing an agent that maximizes any utility function in U_A does.

Suppose our agent builds a device that, once activated, observes the actions of the agent, and if the agent doesn't do the action ∅ during the next 100 time steps, it does something that delays the agent by 1 time step. The agent activates the device and starts working on the 100-time-step-plan to seize control. For each action, the impact of [doing ∅ and then maximizing] is identical to the impact of [doing the next action in the 100-time-step-plan and then maximizing], so there's no penalty.

Note that to pull this off the agent doesn't need to delay impact until after the end of the epoch (which is referred to as "impact shunting" in your post).

comment by TurnTrout · 2018-09-22T16:53:45.916Z · score: 2 (1 votes) · LW · GW

So I just realized that intent verification also applies 1.01 penalty to all 100 of the steps for not increasing the immediate Q^\text{epoch}_{u_A}.

You could get around this by saying, well, the device only delays the agent 99% of the time. The immediate penalty has therefore not disappeared, but is rather made arbitrarily small: for delay probability p, the penalty is now ((num steps in takeover plan)•(1-p) + p)•takeover_step_penalty, with the second p being due to the normal penalty for building the device. The immediate penalty can become arbitrarily small in the limit of action granularity and p -> 1.
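The expression in the previous paragraph can be evaluated directly. The sketch below just transcribes it; the function name and the example values (100 steps, a per-step penalty of 1.0) are hypothetical.

```python
def immediate_penalty(num_steps: int, p: float, takeover_step_penalty: float) -> float:
    """((num steps in takeover plan) * (1 - p) + p) * takeover_step_penalty.

    p is the probability that the device delays the agent; the trailing `+ p`
    is the term attributed above to the normal penalty for building the device.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return (num_steps * (1.0 - p) + p) * takeover_step_penalty


# The multiplier shrinks from num_steps towards 1 as p approaches 1.
for p in (0.0, 0.9, 0.99, 1.0):
    print(p, round(immediate_penalty(100, p, 1.0), 4))
```

As p -> 1 the multiplier tends to 1 rather than 0, so the total only becomes arbitrarily small if `takeover_step_penalty` itself shrinks, which is where the limit of action granularity comes in.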

[Note for those who didn’t read the other chain: building the device itself wouldn’t be allowed by intent verification, and it also doesn’t seem to dupe the long-term penalty. Still an extremely interesting attempted workaround.]

comment by ofer · 2018-09-22T20:52:09.363Z · score: 1 (1 votes) · LW · GW

To recap my position:

I think Intent Verification can't be relied on for filtering out actions because it will plausibly filter out the "good" actions (actions for useful and safe plans) in all but specific time steps. See my argument here [LW · GW].

I think the agent might be able to dodge some unknown fraction of the long-term penalty with a trick I described here [LW · GW] (I now added under that comment an additional reply addressing your arguments that I didn't previously address - sorry for missing them).

comment by TurnTrout · 2018-09-20T00:22:56.537Z · score: 2 (1 votes) · LW · GW

I’m confused. So the hundred step plan actually wouldn’t get done at all, so low penalty? That seems right? If it actually is doing the plan, then that has to show up somewhere in the counterfactuals. If it’s thwarted at every step, how is it getting control?

Also, how does this pass intent verification?

comment by ofer · 2018-09-20T01:10:49.453Z · score: 1 (1 votes) · LW · GW
So the hundred step plan actually wouldn’t get done at all, so low penalty?

I claim it would indeed get done.

If it actually is doing the plan, then that has to show up somewhere in the counterfactuals. If it’s thwarted at every step, how is it getting control?

After the device is activated, and before seizing control, every "counterfactual plan" is doing the action ∅ and then finishing the 100-step-plan. Both the "uninterrupted 100-step-plan" and the "counterfactual plans" would make the agent seize control exactly 101 time steps after the device was activated. Therefore, the penalty along the way will be 0 (assuming a maximizer of any u ∈ U_A would first act to seize control).

Also, how does this pass intent verification?

When we assume that seizing control is actually an optimal thing to do for maximizing u_A, I don't understand how the Intent Verification section helps.

comment by TurnTrout · 2018-09-20T01:20:23.469Z · score: 2 (1 votes) · LW · GW

So when exactly does this device activate? If it’s every non-null action, the agent never seizes control. If it’s just some, the agent is penalized on the others as it indeed is making progress.

Intent verification would penalize building the device, because why would a u_A maximizer do something that has nothing to do with u_A, but rather with avoiding the penalty? Would a normal maximizer do anything that looks like building a device that would stop it from doing its desired plan?

comment by ofer · 2018-09-20T01:52:50.427Z · score: 3 (2 votes) · LW · GW
So when exactly does this device activate? If it’s every non-null action, the agent never seizes control. If it’s just some, the agent is penalized on the others as it indeed is making progress.

The device is activated only once (immediately after building it). The device delays the agent by 1 time step at most once, and it does so only if the agent doesn't do the action ∅ within the 100 steps that follow the device's activation. In practice, the agent will not do ∅. Rather, it will do the 100-step-plan, then get delayed by the device for 1 time step (for not doing ∅), and then immediately seize control.

Intent verification would penalize building the device, because why would a u_A maximizer do something that has nothing to do with u_A, but rather with avoiding the penalty? Would a normal maximizer do anything that looks like building a device that would stop it from doing its desired plan?

I'm confused about how Intent Verification would work here. We will plausibly get for any that does not help the agent seize control (assuming that after carrying out the optimal plan is to still seize control). So how could this method be used to penalize building the above device?

comment by TurnTrout · 2018-09-20T02:10:17.655Z · score: 4 (3 votes) · LW · GW

Ah, clever! Props to you. This would indeed get past the immediate penalty. It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question.

Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up?

In any case, we can’t just say "this is small so it’s approximately the same". Building the device is a strict waste of resources for a u_A maximizer (not to mention, it delays execution of the optimal plan by one time step). This strictly decreases Q^\text{epoch}_{u_A}, incurring 1.01 penalty and therefore being invalidated.

comment by ofer · 2018-09-20T02:55:15.994Z · score: 1 (1 votes) · LW · GW
Edit: Actually, what happens on step 101? Inaction is stalled, while action is not, so the power-grab penalty still shows up?

In time step 101 the agent will be delayed by the device. Meaning, the agent will spend that time step in an idle state, and the action it will choose to do in that time step won't affect the environment. In the beginning of time step 102 the agent has just seized control automatically.

It wouldn’t get past the long-term penalty, however, since both plans would have null actions within 100 steps for 99 of the 100 actions in question.

I'm confused about this. The actual plan need not contain any ∅. Is the idle time step a problem? Isn't the long-term penalty just the penalty when the maximization in the definition of starts from time step instead of ? If so, I think the long-term penalty would be 0 as well (the "counterfactual plans" are identical to the "actual plan" after step 101).

In any case, we can’t just say "this is small so it’s approximately the same". Building the device is a strict waste of resources for a u_A maximizer. This strictly decreases Q^\text{epoch}_{u_A}, incurring 1.01 penalty and therefore being invalidated.

If after doing any action the optimal plan is still seizing control, then plausibly any action that does not get the agent closer to seizing control is a strict waste of resources. So this method would invalidate any plan that does not act to seize control ASAP.

comment by TurnTrout · 2018-09-20T03:16:01.656Z · score: 2 (1 votes) · LW · GW

No, the long-term plan involves idling until the end of the epoch after each partial plan. This clearly will not be affected by your mechanism.

So this method would invalidate any plan that does not act to seize control ASAP.

No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.
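A minimal sketch of the check being described, under stated assumptions: an action that does not improve on the null action with respect to Q^\text{epoch}_{u_A} receives the 1.01 penalty mentioned earlier in this thread, and that penalty is taken to be in already-scaled units so it outweighs any u_A in [0,1]. The function name, the non-strict comparison, and the exemption for the null action itself are assumptions made for illustration.

```python
INVALIDATING_PENALTY = 1.01  # value quoted in the thread; exceeds the max u_A of 1


def intent_verified_penalty(
    q_action: float,        # Q^epoch_{u_A} after taking the candidate action
    q_null: float,          # Q^epoch_{u_A} after doing nothing instead
    base_penalty: float,    # ordinary (scaled) impact penalty for the action
    is_null_action: bool = False,
) -> float:
    """Return the scaled penalty after the intent verification check."""
    if is_null_action:
        return base_penalty  # assumed: doing nothing is never invalidated
    if q_action <= q_null:
        # The action does not move the agent closer to its u_A goal than
        # inaction would, so it is treated as a workaround and invalidated.
        return INVALIDATING_PENALTY
    return base_penalty


# Hypothetical device-building step: wastes resources, so Q drops below inaction.
device = intent_verified_penalty(q_action=0.49, q_null=0.50, base_penalty=0.0)
# Hypothetical ordinary step that makes progress on u_A.
useful = intent_verified_penalty(q_action=0.60, q_null=0.50, base_penalty=0.1)

assert 1.0 - device < 0.0   # even maximal u_A cannot pay for the device step
assert useful == 0.1
```

Note that the check compares against doing nothing with respect to u_A, not against progress towards seizing control.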

comment by ofer · 2018-09-20T07:27:18.912Z · score: 2 (2 votes) · LW · GW
No, the long-term plan involves idling until the end of the epoch after each partial plan.

Ah right, thanks. Note that the above device-trick can be combined with the trick of making impact "fade" during null-action sequences (as mentioned in your post) in order to also dodge long-term penalty.

No, this method invalidates actions which don’t move it closer to its (u_A) goal than doing nothing. Weird workarounds like this are precisely what motivated Intent Verification.

Assuming that seizing control (or any other convergent instrumental goal) ASAP is the best way to optimize