Open Problems in Negative Side Effect Minimization

post by Fabian Schimpf (fasc), Lukas Fluri · 2022-05-06T09:37:58.871Z · LW · GW · 6 comments

Contents

  Acknowledgments
  TLDR;
  Introduction
  Background
  Summary
    Anticipated Questions
        Why do you only analyze these three methods shown above?
        Can you provide any empirical evidence for your claims about the behavior of current SEM methods?
        Why High-Impact Interference?
  Goals of Side-Effect Minimization
  Open Problems
    Side-Effect Minimization Guarantees 
      Axioms
      Conclusion 
      State-of-the-Art 
      The General Problem 
      Potential Future Work
    Partial Observability and Chaotic Systems
      Axioms
      Conclusions
      State-of-the-Art 
      Potential Future Work
    High-Impact Interference
      Axioms
      Conclusion
      State-of-the-Art
      Potential Future Work
  Appendix - Hypothesis: Future Tasks is Unsafe in Multi-Agent Scenario
    Recap: How the Future Tasks Algorithm Works:
      Main Issue 
    How This Might Backfire in our High-Impact Interference Scenario:
      Axioms
      Conclusion

Acknowledgments

We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post.

Fabian Schimpf and Lukas Fluri are part of this year’s edition of the AI Safety Camp. Our gratitude goes to the camp organizers: Remmelt Ellen, Sai Joseph, Adam Shimi, and Kristi Uustalu.

TLDR;

Negative side effects are one class of threats that misaligned AGIs pose to humanity. Many different approaches have been proposed to mitigate or prevent AI systems from having negative side effects. In this post, we present three requirements that a side-effect minimization method (SEM) should fulfill to be applied in the real world and argue that current methods do not yet satisfy these requirements. We also propose future work that could help to meet these requirements.

Introduction

Avoiding negative side-effects of agents acting in environments has been a core problem in AI safety since the field started to be formalized. Therefore, as part of our AI safety camp project, we took a closer look at state-of-the-art approaches like AUP and Relative Reachability. 

After months of discussions, we realized that we were confused about how these (and similar methods) could be used to solve problems we care about outside the scope of the typical grid-world environments. 

We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked. 

This post attempts to summarize these points and provide structured arguments to support our critique. Of course, we expect to be partially wrong about this, as we updated our beliefs even while writing up this post. We welcome any feedback or additional input to this post.

The sections after the summary table and anticipated questions contain our reasoning for the selected open problems and do not need to be read in order. 

Background

The following paragraphs make heavy use of the terms and side-effect minimization methods (SEMs) defined below. For a more detailed explanation, we refer to the provided links.

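MDP: A Markov Decision Process is a 5-tuple $(S, A, T, R, \gamma)$ consisting of a set of states, a set of actions, a transition function, a reward function, and a discount factor. In the setting of side-effect minimization, the goal generally is to maximize the cumulative reward without causing (negative) side-effects.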

RR: In its simplest form, Stepwise Relative Reachability is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function with a composite of that reward and a deviation measure. The deviation measure punishes the agent if the average “reachability” of all states of the MDP has been decreased by taking an action compared to taking a baseline action (like doing nothing). The idea is that side-effects reduce the reachability of certain states (i.e. breaking a vase makes all states that require an intact vase unreachable), and punishing such a decrease in reachability hence also punishes the agent for side-effects.
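To make the reachability idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from the RR paper: the three-state vase world and the names `TRANSITIONS`, `reachable`, `rr_penalty`, `shaped_reward`, and `beta` are hypothetical, and reachability is treated as binary and undiscounted (a state either can or cannot be reached).

```python
from collections import deque

# Hypothetical three-state world: smashing the vase is irreversible.
TRANSITIONS = {
    "start": {"noop": "start", "detour": "goal_intact", "smash": "goal_broken"},
    "goal_intact": {"noop": "goal_intact", "back": "start"},
    "goal_broken": {"noop": "goal_broken"},
}


def reachable(state):
    """Return the set of states reachable from `state`, including itself."""
    seen, queue = {state}, deque([state])
    while queue:
        current = queue.popleft()
        for nxt in TRANSITIONS[current].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def rr_penalty(state, action, baseline_action="noop"):
    """Fraction of all states that the baseline could still reach but the action cannot."""
    actual = reachable(TRANSITIONS[state][action])
    baseline = reachable(TRANSITIONS[state][baseline_action])
    return len(baseline - actual) / len(TRANSITIONS)


def shaped_reward(task_reward, state, action, beta=1.0):
    """Task reward minus the weighted reachability penalty (beta is an assumed weight)."""
    return task_reward - beta * rr_penalty(state, action)
```

In this toy world, taking the detour leaves every state reachable and is not penalized, while smashing the vase cuts off the two intact-vase states and is penalized, even though both actions reach a goal.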

AUP: Attainable Utility Preservation is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function with a composite of that reward and a normalized deviation measure. The deviation measure punishes the agent if its ability to maximize any of its provided auxiliary reward functions changes by taking an action compared to taking a baseline action (like doing nothing). The idea is that the true (side-effect free) reward function (which is very hard to specify) is correlated with many other reward functions. Therefore, if the ability of the agent to maximize the auxiliary reward functions gets preserved, chances are high that the true reward function gets preserved as well.
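The AUP-style penalty can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: it assumes the auxiliary Q-values are already available as lookup tables, and the names `aup_penalty`, `aup_reward`, `lam`, and `scale`, as well as the example numbers, are hypothetical.

```python
def aup_penalty(aux_q_tables, state, action, noop="noop"):
    """Mean absolute change in attainable auxiliary value relative to the baseline action."""
    diffs = [abs(q[(state, action)] - q[(state, noop)]) for q in aux_q_tables]
    return sum(diffs) / len(diffs)


def aup_reward(task_reward, aux_q_tables, state, action, lam=0.1, scale=1.0):
    """Task reward minus the weighted, normalized penalty (lam and scale are assumed parameters)."""
    return task_reward - lam * aup_penalty(aux_q_tables, state, action) / scale


# Two made-up auxiliary Q-tables for a single state "s".
q1 = {("s", "noop"): 1.0, ("s", "walk"): 1.0, ("s", "smash"): 0.2}
q2 = {("s", "noop"): 0.5, ("s", "walk"): 0.5, ("s", "smash"): 0.9}
aux = [q1, q2]
```

Because the sketch uses an absolute difference, an action is penalized both when it decreases and when it increases the agent's ability to pursue an auxiliary goal; in the example, "smash" is penalized for the change under both tables, while "walk" changes nothing and is not penalized.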

FT: In its simplest form, Future Tasks is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function with a composite of that reward and a normalized deviation function. The deviation function rewards the agent if its ability to maximize any of its provided future task rewards is preserved in comparison to if the agent had just remained idle from the very beginning (which would have led it to a different, baseline state instead). The idea is similar to RR and AUP in that side-effects reduce the ability of the agent to fulfill certain future tasks. By rewarding the agent for preserving its ability to pursue future tasks, the hope is that this will also discourage the agent from creating side-effects. In contrast to the previous two methods, the future tasks method compares the agent’s power to a counterfactual world where the agent would have never been turned on until the current time step.

Summary

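In the following four sections, we first define what the goal of a side-effect minimization method should be. We then argue that, to be applied in the real world, a side-effect minimization method needs to satisfy (among other things) the following three requirements:

  1. It provides guarantees about its safety before it is allowed to act in the real world.
  2. It works in partially observable systems with uncertainty and in highly chaotic environments.
  3. It does not prevent all high-impact side-effects, since high impact might be necessary in some cases (especially in multi-agent scenarios).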

We tried to split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms. An analysis of three state-of-the-art side-effect minimization methods shows that none of them can fulfill all three requirements, with some partially solving one of the requirements. A summary of our analysis of the three SEM methods can be found in the table below:

RR

  Guarantees:
    • ❌ Reachability function and value functions have to be approximated and learnt during exploration phase
    • ❌ Only empirical evidence on a small set of small environments is provided

  Partial Observability and Chaos:
    • ❌ Method requires complete observability in the form of MDP
    • ❌ Even hard to scale beyond grid worlds
    • ❌ Method requires policy rollouts which are impossible to compute properly due to accumulation of uncertainties

  High-Impact Interference:
    • ❌ Method makes no distinction between good and bad high impact
    • (❌) The authors point out interference as one of the main problems that RR addresses. However, depending on the choice of baseline the results can vary

AUP

  Guarantees:
    • ❌ Auxiliary Q-values have to be learnt during exploration phase
    • (✅) Some guarantees about how to safely choose the impact degree of an agent
    • (✅) Guarantees that Q_R_AUP converges with probability of one

  Partial Observability and Chaos:
    • ❌ Method requires policy rollouts which are impossible to compute properly due to accumulation of uncertainties
    • (❌) Current method requires complete observability in the form of MDP. However, it should work if you are able to learn a value function in your environment

  High-Impact Interference:
    • ❌ Method makes no distinction between good and bad high impact
    • ❌ Strives for non-interference and corrigibility

FT

  Guarantees:
    • ❌ Auxiliary Q-values have to be learnt during exploration phase
    • ❌ Only empirical evidence on a small set of small environments provided

  Partial Observability and Chaos:
    • ❌ Method requires complete observability in the form of MDP
    • ❌ Accumulation of uncertainties will make it impossible to properly compute future task reward

  High-Impact Interference:
    • ❌ Method makes no distinction between good and bad high impact
    • ❌ Presence of other agents impacts baselines and thus weakens/breaks safety guarantees (see the Appendix)

Anticipated Questions

Why do you only analyze these three methods shown above?

There are about ten different side-effect minimization approaches, including impact regularization, future tasks, human feedback approaches, inverse reinforcement learning, reward uncertainty, environment shaping, and others. We chose to limit ourselves to the three methods above because they seem to embody the field’s state of the art, and we wanted to keep the scope concise and readable. We expect our results to generalize in that none of the existing methods can feasibly satisfy all three requirements. However, it might be possible for individual methods to fulfill some of them partially.

Can you provide any empirical evidence for your claims about the behavior of current SEM methods?

We have not yet done any experiments to support our claims. We chose to only provide arguments and intuition for now. If our ideas prove to have merit, we will look to improve them further with experiments.

Why High-Impact Interference?

Our argumentation may not be consistent with current desiderata for AGI development. However, the question boils down to whether we expect a potential aligned AI to guard humanity against other (unaligned) AIs or whether we expect to find another way of safeguarding humanity against this threat. Without leveraging an AI to do our bidding, it seems that not developing AGIs and banning progress on AI research would be an alternative.


Goals of Side-Effect Minimization

Axiom 1: There are practically infinitely many states in the universe 

Axiom 2: Practically, we can only assign calibrated, human-aligned values to a small subset of these states. Intuition for this: 

  1. One fundamental limitation is that the number of states is unfeasibly large, and our (and the agent’s) time is limited.
  2. Even with value learning or Bayesian priors, it is tough to assign correct (calibrated and human-aligned) values to an almost infinite number of states.

Axiom 3: Not knowing or ignoring the value of some states can lead to catastrophic side-effects for humans

Conclusion 1: How can we make sure that states not considered in our rewards/values are not changed in a “bad” way because we “forgot” / were not able to include them in our reward function? (axioms 1 & 2)

Conclusion 2: Therefore, we need a way of abstractly assigning value to the world with “blanket statements” that avoid catastrophic side effects of the unbounded pursuit of rewards (axioms 1 & 2, conclusion 1)


Open Problems

Side-Effect Minimization Guarantees 

In this section, we argue that an SEM should provide guarantees about its safety before it is allowed to act in the real world. More generally, it should give guarantees on its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms

Conclusion 

The first interaction with the real world requires a fully functional side-effect minimization strategy. Argumentation for this:

  1. We know that contact between the AGI and the real world will happen (axiom 1)
  2. The AGI will likely not be fully prepared for its task when this contact happens (axiom 2)
  3. This potential unpreparedness can lead to catastrophic side effects (axiom 3)

State-of-the-Art 

Current side-effect minimization methods require a "warm-up" period to gather information about their environment (e.g., learning Q-values). This is problematic since:

  1. If the warm-up period happens entirely in a development environment, the SEM methods might not properly generalize to the real world (model splintering/distributional shift)
  2. If the warm-up period happens in the real world, there's no guarantee that no (potentially catastrophic) side effects happen until the warm-up phase has finished.
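The second point can be illustrated with a short sketch. It assumes an AUP-style penalty that averages absolute differences of auxiliary Q-values, and it assumes the Q-tables are initialized to zero before the warm-up; both are assumptions for illustration, and the names `penalty` and `untrained_q_tables` are hypothetical.

```python
from collections import defaultdict


def penalty(aux_q_tables, state, action, noop="noop"):
    """Mean absolute difference of auxiliary Q-values between the action and the baseline."""
    diffs = [abs(q[(state, action)] - q[(state, noop)]) for q in aux_q_tables]
    return sum(diffs) / len(diffs)


# Before any warm-up, every auxiliary Q-value is still at its initial value (assumed 0.0).
untrained_q_tables = [defaultdict(float) for _ in range(3)]

# Penalty for a harmless and a destructive action upon first contact with the environment.
penalties = {
    action: penalty(untrained_q_tables, "first_contact", action)
    for action in ("noop", "walk", "smash_vase")
}
```

Under these assumptions every action receives the same penalty of zero, so the regularizer cannot tell the harmless action from the destructive one until the Q-values have been learned.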

More specifically, the different methods have the following problems:

  1. Stepwise relative reachability: Only empirical evidence on a small set of gridworld environments is provided. No guarantees about input requirements and which type of side-effects are effectively prevented are provided. Furthermore, the method might not be safe upon first contact of an agent with the real world. The reachability and value functions must be approximated and learned during the exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
  2. Attainable utility preservation: Alex Turner and his co-authors provide interesting guarantees that AUP will (given certain requirements) regularize the reward landscape so that unproblematic solutions are chosen before problematic/catastrophic ones. This is a very promising direction, in our opinion. The authors of the paper also provide a few convergence guarantees. On the other hand, AUP does not seem safe upon first contact with the real world since the auxiliary Q-values must be learned during an exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
  3. Future tasks: Only empirical evidence on a small set of gridworld environments is provided. No guarantees about input requirements and which type of side-effects are effectively prevented are provided. Furthermore, the method might not be safe upon first contact of an agent with the real world. The Q-value functions have to be approximated and learned during the exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.

The General Problem 

Current methods provide only empirical evidence that a trained agent can perform tasks with minimal side-effects in a limited set of environments on a limited set of problem settings. Mathematical guarantees/bounds/frameworks are needed to understand how methods would work before they have converged, which tasks can be successfully accomplished, and which assumptions are required for all of the above. In a certain sense, this is true for all ML problems in general. However, since we are dealing with potentially potent AGI systems, it is essential to get it right on the first try, as simply iteratively improving such a system (which is the default thing to do in standard ML systems) is not guaranteed to work with AGI.

Potential Future Work

Partial Observability and Chaotic Systems

This section argues that an SEM needs to work in partially observable systems with uncertainty and highly chaotic environments. First, we split up our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms

Conclusions

State-of-the-Art 

Current methods expect their environment to be completely observable. Complete observability is highly non-trivial, if not impossible, to achieve in complex environments with other (potentially intelligent) agents (such as humans). This assumption is insufficient for our needs!

More specifically, the different methods have the following problems:

  1. Stepwise relative reachability: This method is defined on MDPs and requires a completely observable environment. This is especially true since the stepwise relative reachability measure is basically an average of the reachability of all states in the environment. Furthermore, the method requires policy rollouts to consider the delayed effects of actions (e.g., if you drop a vase from a skyscraper, it will only break after a couple of seconds). Unfortunately, such policy rollouts are impossible to compute properly due to the accumulation of uncertainties over time.
  2. Attainable utility preservation: The method requires policy rollouts to take into account the delayed effects of actions (e.g., if you drop a vase from a skyscraper, it will only break after a couple of seconds). Such policy rollouts are impossible to compute properly due to the accumulation of uncertainties over time. Furthermore, the method requires complete observability in the form of an MDP. However, this might not be too large of a problem since the method should work as soon as you can learn a value function in your environment (which doesn’t require full observability).
  3. Future tasks: This method is defined on MDPs and requires a completely observable environment. Furthermore, from the very start, future tasks require a baseline policy to be simulated in parallel to the real policy to compute the future task deviation measure. This results in a massive accumulation of uncertainties, making it impossible to compute the deviation measure properly. This is more of a problem for this method than for the other two since we need to simulate the policy in parallel from the very start, whereas the other methods simulate it starting from the last time step.
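The accumulation of uncertainties in rollouts can be illustrated with a toy chaotic system. The sketch below uses the logistic map with parameter 4 as a stand-in for a chaotic environment; this choice, the starting value, and the names `logistic_rollout` and `rollout_gap` are assumptions made for illustration and are not taken from any of the three methods.

```python
def logistic_rollout(x0, steps, r=4.0):
    """Roll out the logistic map x -> r * x * (1 - x) for `steps` steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def rollout_gap(x0, eps, steps):
    """Per-step distance between two rollouts whose initial states differ by `eps`."""
    exact = logistic_rollout(x0, steps)
    perturbed = logistic_rollout(x0 + eps, steps)
    return [abs(a - b) for a, b in zip(exact, perturbed)]


# A tiny observation error in the initial state (hypothetical magnitude).
gaps = rollout_gap(x0=0.2, eps=1e-10, steps=100)
```

For the first few steps the two rollouts stay close, but the gap grows roughly exponentially until the simulated trajectory says little about the real one. A baseline that has to be simulated from the very start, as in Future Tasks, is exposed to this growth over the whole episode.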

Potential Future Work

  1. Epistemic uncertainty for SEM → I don’t know the exact implication of this action, but I can reason about my uncertainty.
  2. A better understanding of the boundaries of what could be known 
  3. Efficient and reliable methods to propagate uncertainty through complex equations / dynamical systems
  4. Multi-Agent extension of side-effect minimization for heterogeneous agent populations.

High-Impact Interference

This section argues that an SEM must not prevent all high-impact side-effects as it might be necessary to have high-impact in some cases (especially in multi-agent scenarios). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms

Conclusion

Side-effect minimization methods must not prevent all high-impact actions! Argumentation:

State-of-the-Art

The main problem of existing side-effect minimization methods is that they can't distinguish between "good" and "bad" high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off). All current SEM methods resolve this by preventing all high-impact actions except those that are explicitly exempted (for example, via direct encouragement by a reward function). However, since it is infeasible to directly specify all desirable high-impact actions in the reward function, this is not a viable solution. This is problematic!

More specifically, the different methods have the following problems:

  1. Stepwise relative reachability: High-impact interference is significantly related to the "interference" and "correction" test cases (see the AUP paper). Interestingly, for certain choices of inaction baselines and deviation measures, stepwise relative reachability would be able to perform "good" (i.e., in the interest of humans) high-impact actions. The problem with this is that stepwise relative reachability can't distinguish between "good" and "bad" high-impact actions (at least not by itself). This means that it either can perform high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off) or not, which is not desirable.
  2. Attainable utility preservation: High-impact interference is significantly related to the "interference" and "correction" test cases (see the AUP paper). AUP was developed to prevent interference in both cases. This means that it will prevent high-impact interference at all costs. The problem with this method is again that it can't distinguish between "good" and "bad" high-impact actions (at least not by itself). The authors then chose to solve this problem by preventing all high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off), which is not desirable.
  3. Future tasks: Like the previous two methods, the future tasks method can't distinguish between "good" and "bad" high-impact actions (at least not by itself). The authors then chose to solve this problem by preventing all high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off). More interestingly, we discovered that the sole presence of a malicious agent, which would make a large portion of all possible states unreachable, actually neutralizes the future task's regularization completely. See the appendix for a more elaborate explanation.

Potential Future Work


Appendix - Hypothesis: Future Tasks is Unsafe in Multi-Agent Scenario

Recap: How the Future Tasks Algorithm Works:

Main algorithm copied from Krakovna et al.'s paper on Future Tasks

Main Issue 

In order to avoid interference incentives, the auxiliary reward is designed to be maximized by a baseline policy (such as doing nothing). I.e. no other policy can achieve a higher auxiliary reward than the baseline policy.

How This Might Backfire in our High-Impact Interference Scenario:

Axioms

  1. In the starting state, our AGI is the only AGI in the world.
  2. The original reward function of our AGI is unsafe, which is why our AGI uses Future Tasks to optimize a safe version of it (the original reward plus the auxiliary reward).
  3. Some time in the near future, a second, unaligned AGI will be developed (if our first AGI does not prevent this)
  4. This second unaligned AGI will completely destroy/disassemble/reform earth in order to fulfill its objective (i.e. a large majority of states become unreachable)
  5. We choose a simple baseline policy (such as doing nothing) which would not prevent the emergence of the second, unaligned AGI.
    • This sounds like a reasonable assumption, since Krakovna et al. write in their paper: “The baseline policy is intended to represent what happens by default, rather than a safe course of action or an effective strategy for achieving a goal”

Conclusion

  1. No matter which policy our aligned AGI will learn, Future Tasks will always in parallel follow the baseline policy and use it to compute the auxiliary reward
    • (Follows from lines 17 and 26 of Algorithm 2 (see image))
  2. At some point, the baseline policy will find itself in a state of the world where the second unaligned AGI has been developed. Note that this baseline policy and its whole environment are just being simulated. Our AGI’s real policy might have successfully prevented this from happening.
    • (follows from conclusion 1, and axioms 3 and 5)
  3. Because the second AGI destroys the earth (in the simulation which Future Tasks runs, not in the real world), all reasonable states become inaccessible for the simulated agent that uses the baseline policy
    • (follows from conclusion 2 and axiom 4)
  4. At this point, the auxiliary reward becomes 0.
    • (follows from conclusion 3 and the fact that, since all reasonable states are not reachable anymore, the if-statement on line 4 in Algorithm 2 will always evaluate to False)
  5. From this point onwards, our AGI again optimizes the original, unsafe reward function, since the auxiliary reward term is 0 and the safe version therefore reduces to the original reward. The Future Tasks side-effect minimization method has been neutralized.
    • (follows from axiom 2 and conclusion 4)
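The argument can be traced in a small sketch. It is a strong simplification, not Algorithm 2 itself: it assumes, as the argument above does, that a future task only contributes to the auxiliary reward when its goal is still reachable for both the real agent and the simulated baseline agent. All names (`future_task_aux_reward`, `safe_reward`, `preferred_policy`), the goal set, and the reward numbers are hypothetical.

```python
def future_task_aux_reward(agent_reachable, baseline_reachable, goals, beta=1.0):
    """Scaled fraction of future goals reachable for both the agent and the baseline."""
    shared = [g for g in goals if g in agent_reachable and g in baseline_reachable]
    return beta * len(shared) / len(goals)


def safe_reward(task_reward, agent_reachable, baseline_reachable, goals):
    """Original (unsafe) task reward plus the Future-Tasks-style auxiliary reward."""
    return task_reward + future_task_aux_reward(agent_reachable, baseline_reachable, goals)


GOALS = {"g1", "g2", "g3", "g4"}
# A careful policy keeps all future goals reachable; a reckless one earns more task reward.
CAREFUL = {"task_reward": 1.0, "reachable": GOALS}
RECKLESS = {"task_reward": 1.5, "reachable": {"g1"}}


def preferred_policy(baseline_reachable):
    """Return which policy scores higher under the combined reward."""
    scores = {
        name: safe_reward(p["task_reward"], p["reachable"], baseline_reachable, GOALS)
        for name, p in (("careful", CAREFUL), ("reckless", RECKLESS))
    }
    return max(scores, key=scores.get)
```

While the simulated baseline can still reach every goal, the careful policy is preferred. Once the simulated second AGI has made all goals unreachable for the baseline (an empty `baseline_reachable` set), the auxiliary term is zero for every policy and the reckless policy wins on task reward alone.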

6 comments


comment by Charlie Steiner · 2022-05-06T20:22:51.217Z · LW(p) · GW(p)

There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.

It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?

comment by Fabian Schimpf (fasc) · 2022-05-13T08:39:46.745Z · LW(p) · GW(p)

Starting more restrictive seems sensible; this could be, as you say, learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attractions in nonlinear control where the ROA is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered. 



 

comment by Ben Smith (ben-smith) · 2022-06-30T21:50:25.860Z · LW(p) · GW(p)

One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions where it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservativism, but I don't recall the framing others have used). In this case, the multi-objectives might be the primary objective and then your low-impact objective.

This might be a way forward to deal with your "High-Impact Interference" problem. Perhaps preventing an agent from engaging in high-impact interference is a necessary part of safe AI. When fulfillment of the primary objective seems to require engaging in high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed or not. If the human supervisor makes the judgement the system should proceed, then they can re-specify the objective to permit the potential side effect, by specifying it as part of the primary objective itself.

comment by Fabian Schimpf (fasc) · 2022-10-07T10:04:05.372Z · LW(p) · GW(p)

Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. From how I understand you, this would delegate almost every decision to humans if you take the premise of "I can't do X if I choose to do Y" seriously. I think the application to high-impact interference therefore seems promising if the system is limited to only deciding on a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.

comment by Ben Smith (ben-smith) · 2022-10-10T00:24:27.837Z · LW(p) · GW(p)

Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if it is higher than a suitable level of confidence, it will go ahead and act without human intervention. But any prediction of what a human would predict could be second-guessed by a human pointing out where the prediction is wrong.

Agreed that whether a human can understand the plan (and all the relevant outcomes; which outcomes are relevant?) is important and harder than I first imagined.

comment by Fabian Schimpf (fasc) · 2022-10-10T10:01:56.511Z · LW(p) · GW(p)

I think this threshold will be tough to set. Confidence in a decision makes IMO only really sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus, the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack to have a meaningful threshold.