# Generalizing the Power-Seeking Theorems

post by TurnTrout · 2020-07-27

## Contents

• Normal amounts of sightedness
• Relaxation summary
• Conclusion
• Appendix: Proofs
  • Discount rates
  • Reward distribution generalization


Thanks to Rohin Shah, Michael Dennis, Josh Turner, and Evan Hubinger for comments.

It sure seems like gaining power over the environment is instrumentally convergent (optimal for a wide range of agent goals). You can turn this into math and prove things about it. Given some distribution over agent goals, we want to be able to formally describe how optimal action tends to flow through the future.

Does gaining money tend to be optimal? Avoiding shutdown? When? How do we know?

"Optimal Farsighted Agents Tend to Seek Power" proved that, when you distribute reward fairly and evenly across states (IID), it's instrumentally convergent to gain access to lots of final states (which are absorbing, in that the agent keeps on experiencing the final state). The theorems apply when you don't discount the future (you're "infinitely farsighted").

While it's good to understand the limiting case, what if the agent, you know, isn't infinitely farsighted? That's a pretty unrealistic assumption. Eventually, we want this theory to help us predict what happens after we deploy RL agents with high-performing policies in the real world.

# Normal amounts of sightedness

But what if we care about the journey? What if $\gamma \in (0,1)$?

We can view Frank as traversing a Markov decision process, navigating between states with his actions:

It sure seems like Frank is more likely to start with the blue or green gems. Those give him way more choices along the way, after all. But the previous theorems only said "at $\gamma = 0$, he's equally likely to pick each gem. At $\gamma = 1$, he's equally likely to end up in each terminal state".

Let me tell you, finding the probability that one tangled web of choices is optimal over another web is generally a huge mess. You're finding the measure of reward functions which satisfy some messy system of inequalities.

And that's in the simple tiny environments!

How do we reason about instrumental convergence – how do we find those sets of trajectories which are more likely to be optimal for a lot of reward functions?

We exploit symmetries.

The blue gem makes available all of the same options as the red gems, and then some. Since the blue gem gives you strictly more options, it's strictly more likely to be optimal! When you toss back in the green gem, avoiding the red gems becomes yet more likely.

So, we can prove that for all $\gamma \in (0,1)$, most agents don't choose the red gems. Agents are more likely to pick blue than red. Easy.
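The claim can be sanity-checked numerically. Below is a minimal Monte Carlo sketch, not the post's actual environment: the layout (one red gem leading to a single absorbing terminal, one blue gem leading to a choice of two absorbing terminals), the Uniform(0, 1) reward distribution, and the name `first_gem_probabilities` are all assumptions made for illustration.

```python
import random

def first_gem_probabilities(gamma, n_samples=20000, seed=0):
    """Estimate how often each first move is optimal under IID Uniform(0, 1) reward.

    Hypothetical layout (not the post's original figure):
      start -> red gem  -> red terminal (absorbing)
      start -> blue gem -> blue terminal A or blue terminal B (both absorbing)
    """
    rng = random.Random(seed)
    # Discounted weight of sitting in an absorbing state forever, measured
    # from the step at which the gem is collected.
    tail = gamma / (1.0 - gamma)
    blue_wins = 0
    for _ in range(n_samples):
        red_gem, red_end, blue_gem, blue_a, blue_b = (rng.random() for _ in range(5))
        red_return = red_gem + tail * red_end
        # On the blue branch the agent picks the better of two terminals.
        blue_return = blue_gem + tail * max(blue_a, blue_b)
        if blue_return > red_return:
            blue_wins += 1
    p_blue = blue_wins / n_samples
    return {"red": 1.0 - p_blue, "blue": p_blue}

probs = first_gem_probabilities(gamma=0.5)
```

In this toy layout the blue branch contains a copy of the red branch plus one extra terminal, so the estimate for blue should come out above one half for any discount rate strictly between 0 and 1.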

Plus, this reasoning mirrors why we think instrumental convergence exists to begin with:

Sure, the goal could incentivize immediately initiating shutdown procedures. But if you stay active, you could still shut down later, plus there are all these other states the agent might be incentivized to reach.

This extends further. If the symmetry occurs twice over, then you can conclude the agent is at least twice as likely to do the instrumentally convergent thing.
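As a rough illustration of the "twice over" case, the sketch below assumes a hypothetical start state with one red branch and several other branches, each an exact copy of the red one (a gem followed by an absorbing terminal). The function name and layout are illustrative assumptions, and rewards are again taken to be IID Uniform(0, 1).

```python
import random

def avoid_red_probabilities(gamma, n_copies=2, n_samples=20000, seed=0):
    """Return (P(red branch optimal), P(some other branch optimal)).

    Hypothetical layout: branch 0 is red; branches 1..n_copies are
    isomorphic copies of it (gem -> absorbing terminal).
    """
    rng = random.Random(seed)
    tail = gamma / (1.0 - gamma)  # weight of staying in a terminal forever
    red_wins = 0
    for _ in range(n_samples):
        returns = [rng.random() + tail * rng.random() for _ in range(n_copies + 1)]
        if returns[0] == max(returns):
            red_wins += 1
    p_red = red_wins / n_samples
    return p_red, 1.0 - p_red

p_red, p_other = avoid_red_probabilities(gamma=0.5)
```

With two exact copies and nothing else, symmetry puts the ratio at 2 up to sampling noise, which is the lower bound the text describes. Giving the non-red branches any extra options would push it higher.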

# Relaxation summary

My initial work made a lot of simplifying assumptions:

• The agents are infinitely farsighted: they care about average reward over time, and don't prioritize the present over the future.
  • Relaxed. See above.
• The environment is deterministic.
  • Relaxed. The paper is already updated to handle stochastic environments. The new techniques in this post also generalize straightforwardly.
• Reward is distributed IID over states, where each state's reward distribution is bounded and continuous.
  • Relaxed. We can immediately toss out boundedness, as none of our reasoning about instrumental convergence relies on it. It just ensured certain unrelated equations were well-defined.
  • With a bit of work, I could probably toss out continuity in general (and instead require only non-degeneracy), but I haven't done that yet.
  • If you can prove instrumental convergence under IID reward, and then you have another reward function distribution $\mathcal{D}'$ which improves reward for instrumentally convergent trajectories while worsening reward for already-unlikely trajectories, then there's also instrumental convergence under $\mathcal{D}'$.
    • For example, if you double reward in instrumentally convergent states and halve it in unlikely states, then you still have instrumental convergence.
• The environment is Markov.
  • Relaxed. $n$-step Markovian environments are handled by conversion into isomorphic Markov environments.
• The agent is optimal.
• The environment is finite and fully observable.
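The conversion of an n-step Markovian environment into a Markov one can be sketched with the standard history-augmentation construction: treat the tuple of the last n states as the new state. This is an assumed construction for illustration (the post does not spell out its conversion), restricted to deterministic dynamics, and `to_markov`, `step`, and `toy_step` are hypothetical names.

```python
from itertools import product

def to_markov(states, actions, n, step):
    """Convert an n-step Markovian deterministic environment into a Markov one.

    `step(history, action)` is a hypothetical transition function mapping the
    last n states (a tuple, oldest first) and an action to the next state.
    The converted environment uses those length-n tuples as its states.
    """
    transitions = {}
    for history in product(states, repeat=n):
        for action in actions:
            next_state = step(history, action)
            # Slide the window: drop the oldest state, append the newest.
            transitions[(history, action)] = history[1:] + (next_state,)
    return transitions

# Toy 2-step environment: "toggle" flips the current bit only if the
# previous two states differ; otherwise the state is unchanged.
def toy_step(history, action):
    previous, current = history
    if action == "toggle" and previous != current:
        return 1 - current
    return current

markov = to_markov(states=(0, 1), actions=("stay", "toggle"), n=2, step=toy_step)
```

The resulting `markov` dictionary depends only on the current augmented state and action, so the usual MDP machinery applies to it unchanged.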

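With the relaxations above, the power-seeking theorems now apply to optimal policies in finite MDPs with respect to reward distributed continuously over states. The original requirements of infinite farsightedness, determinism, and boundedness are dropped, and the IID requirement is weakened as described above.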

# Conclusion

We now have a few formally correct strategies for showing instrumental convergence, or lack thereof.

• In deterministic environments, there's no instrumental convergence at $\gamma = 0$ for IID reward.
• When $\gamma \in (0,1)$, you're strictly more likely to navigate to parts of the future which give you strictly more options (in a graph-theoretic sense). Plus, these parts of the future give you strictly more power.
• When $\gamma = 1$, it's instrumentally convergent to access a wide range of terminal states.
  • This can be seen as a special case of having "strictly more options", but you no longer require an isomorphism on the paths leading to the terminal states.

# Appendix: Proofs

This work builds off of my initial paper on power-seeking; I'll refer to that as [1].

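Definition. Let $\mathbf{f}$ be a visitation distribution of state $s$, and let $F \subseteq \mathcal{F}(s)$ contain $\mathbf{f}$. Then $\mu_{\mathcal{D}}(\mathbf{f} \mid F, \gamma)$ denotes the measure of reward functions $\mathbf{r}$ under distribution $\mathcal{D}$ satisfying $\mathbf{f} \in \arg\max_{\mathbf{f}' \in F} \mathbf{f}'(\gamma)^\top \mathbf{r}$ at discount rate $\gamma$.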
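For readers who want something concrete: in [1], a policy's visitation distribution from a state is the discounted state-visitation vector, and the policy's return is its dot product with the reward vector. The sketch below assumes a deterministic policy on a hypothetical three-state chain; the function names are illustrative and not taken from the paper.

```python
def visitation_distribution(next_state, start, gamma, tol=1e-12):
    """Discounted visitation vector f, with f[s] = sum over t of gamma**t * [s_t == s].

    `next_state[s]` is the state a fixed deterministic policy moves to from s.
    The infinite sum is truncated once the discount weight falls below `tol`.
    """
    f = [0.0] * len(next_state)
    state, weight = start, 1.0
    while weight > tol:
        f[state] += weight
        state, weight = next_state[state], weight * gamma
    return f

def discounted_return(f, reward):
    """Return of the policy: the dot product of f with the reward vector."""
    return sum(f_s * r_s for f_s, r_s in zip(f, reward))

# Hypothetical chain 0 -> 1 -> 2, where state 2 is absorbing.
f = visitation_distribution(next_state=[1, 2, 2], start=0, gamma=0.5)
value = discounted_return(f, reward=[1.0, 0.0, 2.0])
```

Under this reading, the measure in the definition is the probability, over reward vectors drawn from the distribution, that a given visitation distribution attains the largest such dot product among the candidates.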

Non-dominated visitation distributions have positive measure and "take" positive measure from every other non-dominated visitation distribution.

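Lemma 1. If $\mathbf{f} \in \mathcal{F}_{nd}(s)$, then $\mu_{\mathcal{D}}(\mathbf{f} \mid \mathcal{F}(s), \gamma) > 0$ for all $\gamma \in (0,1)$. Furthermore, $\mu_{\mathcal{D}}(\mathbf{f} \mid F, \gamma) > 0$ for all $F \subseteq \mathcal{F}(s)$ containing $\mathbf{f}$.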

Proof. The first claim was proven in [1]. The second claim follows by observing that visitation distributions $\mathbf{f}$ which are non-dominated with respect to all of $\mathcal{F}(s)$ are also non-dominated with respect to subsets $F \subseteq \mathcal{F}(s)$ (as taking subsets winnows the set of constraints). Then, use the fact that non-dominated visitation distributions always have positive measure (in particular, with respect to $F$). QED.

## Discount rates

Definition. The graph induced by a set $F$ of visitation distributions consists of the states visited and actions taken by at least one of the policies generating the visitation distributions. This is also referred to as the $F$-graph.

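Theorem 2 [Strictly more meaningful options means strict instrumental convergence and strict power increase]. Let $F, F' \subseteq \mathcal{F}_{nd}(s)$ be subsets of non-dominated visitation distributions. If the $F$-graph is isomorphic to a subgraph of the $F'$-graph, such that the isomorphism fixes $s$, then $\mu_{\mathcal{D}}(F \mid F \cup F', \gamma) \leq \mu_{\mathcal{D}}(F' \mid F \cup F', \gamma)$ for all $\gamma \in (0,1)$ (here the measure of a set is the measure of reward functions for which some element of the set is optimal). If the subgraph of the $F'$-graph is strict, then so is the inequality.

Furthermore, $s$ is more powerful in the $F'$-graph than in the $F$-graph. If the subgraph of the $F'$-graph is strict, then $s$ is strictly more powerful in the $F'$-graph.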

Proof. The $\leq$ claim follows from symmetry; measure must be invariant to state relabelling, because reward is IID. The strict inequality follows from lemma 1: adding another non-dominated visitation distribution to $F'$ must strictly increase $\mu_{\mathcal{D}}(F' \mid F \cup F', \gamma)$ and decrease $\mu_{\mathcal{D}}(F \mid F \cup F', \gamma)$.

Similarly, the first power claim follows from symmetry. Adding another non-dominated visitation distribution must strictly increase the power [1]. QED.

## Reward distribution generalization

We derive a sufficient condition for instrumental convergence for (certain) non-IID reward function distributions.

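Definition. Distribution $D_1$ (with CDF $G_1$) dominates distribution $D_2$ (with CDF $G_2$) when $G_1(x) \leq G_2(x)$ for all $x$ (when $G_1$ minorizes $G_2$). Similarly, distribution $D_1$ (with CDF $G_1$) is dominated by distribution $D_2$ (with CDF $G_2$) when $G_1(x) \geq G_2(x)$ for all $x$ (when $G_1$ majorizes $G_2$).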
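A quick way to get a feel for this definition is to compare CDFs pointwise on a grid. The sketch below is only an illustration: the helper names are hypothetical, the check is confined to the chosen grid points, and the uniform distributions are picked because doubling or halving a Uniform(0, 1) reward gives Uniform(0, 2) or Uniform(0, 0.5).

```python
def dominates(cdf_a, cdf_b, grid):
    """True if cdf_a(x) <= cdf_b(x) at every grid point.

    A CDF that lies below another puts more probability on large values,
    so distribution A dominates distribution B (on this grid).
    """
    return all(cdf_a(x) <= cdf_b(x) for x in grid)

def uniform_cdf(upper):
    """CDF of the Uniform(0, upper) distribution."""
    return lambda x: min(max(x / upper, 0.0), 1.0)

grid = [i / 100 for i in range(-50, 251)]
doubled = uniform_cdf(2.0)   # reward doubled
original = uniform_cdf(1.0)  # baseline reward
halved = uniform_cdf(0.5)    # reward halved

doubled_dominates_original = dominates(doubled, original, grid)
original_dominates_halved = dominates(original, halved, grid)
```

Both comparisons hold here, matching the intuition that doubling reward yields a dominant distribution and halving it yields a dominated one.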

The following insight is simple: if you can prove instrumental convergence under IID rewards, and then you have another reward function distribution $\mathcal{D}'$ which improves reward for instrumentally convergent trajectories while worsening reward for already-unlikely trajectories, then there's also instrumental convergence under $\mathcal{D}'$.

For example, if you double reward in instrumentally convergent states and halve it in unlikely states, then you still have instrumental convergence.
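The doubling-and-halving example can be checked on a toy case. This is a sketch under assumptions, not the post's environment: it uses a hypothetical layout in which a blue branch (gem, then a choice of two absorbing terminals) competes with a red branch (gem, then one absorbing terminal), with IID Uniform(0, 1) reward before rescaling. The same random seed is used for both runs so the two estimates are computed on identical reward samples.

```python
import random

def p_blue_optimal(gamma, blue_scale=1.0, red_scale=1.0, n_samples=20000, seed=0):
    """Estimate P(blue branch is optimal) after rescaling each branch's reward.

    Hypothetical layout: red gem -> one terminal; blue gem -> two terminals.
    """
    rng = random.Random(seed)
    tail = gamma / (1.0 - gamma)  # weight of staying in a terminal forever
    wins = 0
    for _ in range(n_samples):
        red_gem, red_end, blue_gem, blue_a, blue_b = (rng.random() for _ in range(5))
        red_return = red_scale * (red_gem + tail * red_end)
        blue_return = blue_scale * (blue_gem + tail * max(blue_a, blue_b))
        if blue_return > red_return:
            wins += 1
    return wins / n_samples

iid_estimate = p_blue_optimal(gamma=0.5)
reshaped_estimate = p_blue_optimal(gamma=0.5, blue_scale=2.0, red_scale=0.5)
```

Because returns are non-negative, scaling the blue branch up and the red branch down can only turn red wins into blue wins on any given sample, so the reshaped estimate cannot fall below the IID one.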

The logic goes:

Suppose that, for example, avoiding shutdown was instrumentally convergent for this more generous IID distribution, while realistic distributions are far less likely to reward shutdown and a few other trajectories are even more likely to be rewarded. Then it's still instrumentally convergent to avoid shutdown for the more realistic task-based distribution we have in mind.

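Theorem 3. Let $F, F' \subseteq \mathcal{F}(s)$. Suppose that under reward function distribution $\mathcal{D}$, $\mu_{\mathcal{D}}(F \mid F \cup F', \gamma) \geq \mu_{\mathcal{D}}(F' \mid F \cup F', \gamma)$.

If all $\mathbf{f} \in F$ have dominant return distributions at discount rate $\gamma$ under distribution $\mathcal{D}'$ compared to under $\mathcal{D}$, and all $\mathbf{f}' \in F'$ have dominated return distributions at discount rate $\gamma$ under distribution $\mathcal{D}'$ compared to under $\mathcal{D}$, then $\mu_{\mathcal{D}'}(F \mid F \cup F', \gamma) \geq \mu_{\mathcal{D}'}(F' \mid F \cup F', \gamma)$ under $\mathcal{D}'$.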

Proof. Consider the process of starting with the initial $\mathcal{D}$ return distributions, and iteratively swapping them to their more generous $\mathcal{D}'$ counterparts. If any such swap strictly increases the measure of some $\mathbf{f} \in F$, it strictly increases $\mu(F \mid F \cup F', \gamma)$ and strictly decreases $\mu(F' \mid F \cup F', \gamma)$ by lemma 1. Clearly such a swap cannot strictly decrease $\mu(F \mid F \cup F', \gamma)$.

Similar logic applies to the less generous return distributions for $F'$ under $\mathcal{D}'$. QED.