P₂B: Plan to P₂B Better
by Ramana Kumar (ramana-kumar), Daniel Kokotajlo (daniel-kokotajlo) · 2021-10-24T15:21:09.904Z
tl;dr: Most good plans involve taking steps to make better plans. Making better plans is the convergent instrumental goal, of which all familiar convergent instrumental goals are instances. This is key to understanding what agency is and why it is powerful.
Planning means using a world model to predict the consequences of various courses of action one could take, and taking actions that have good predicted consequences. (We think of this with the handle “doing things for reasons,” though we acknowledge this may be an idiosyncratic use of “reasons.”)
We take “planning” to include things that are relevantly similar to this procedure, such as following a bag of heuristics that approximates it. We’re also including actually following the plans, in what might more clunkily be called “planning-acting.”
Planning, in this broad sense, seems essential to the kind of goal-directed, consequential, agent-like intelligence that we expect to be highly impactful. This sequence explains why.
One Convergent Instrumental Goal to Rule them All
Consider the maxim
“make there be more and/or better planning towards your goal.”
This section argues that all the classic convergent instrumental goals are special cases of this maxim.
To flesh this out a little, here are some categories of ways to follow the maxim. Remember that a planner is typically close (in terms of what it might affect via action) to at least one planner, namely itself, so these directions can typically be applied in the first instance to the planner itself.
- Make the planners with your goal better at planning. For example, get them new relevant data to work with¹, get them to run faster or more effective algorithms, build protections against value drift, etc.
- Make the planners with your goal have better options. For example, move them to better locations, get them more resources, get them more power or a greater number of options to select from, have them take steps in an object-level plan towards the goal.
- Make there be more planners with your goal. For example, keep yourself running and aligned with your goal, acquire delegates and subordinates, convince followers and converts, build successors.
Reviewing Omohundro's “The Basic AI Drives” and Bostrom’s “The Superintelligent Will,” we extract a list of convergent instrumental goals and find that they are all instances of the maxim “make there be more/better planners for the current goal”:
- Self-preservation / self-protection:
- Make there be more planners that have your goals, focusing on reusing the existing planner, that is, preventing its destruction.
- Self-improvement:
- Make there be better planners with your goals, focusing on making the existing one better.
- Resource Acquisition:
- Make there be better planners with your goals, focusing on making the existing one able to take more effective actions.
- Goal-content Integrity:
- Make there be more planners that have your goals, focusing on ensuring that the existing planners with your goals keep those goals rather than having them changed.
- Resource-use Efficiency:
- Same as self-improvement
- Cognitive Enhancement:
- Same as self-improvement
- Same as self-improvement
- Technological Perfection:
- Same as resource acquisition and/or self-improvement
- Rationality:
- Omohundro takes this to be something like “make the utility function explicit” along with “maximize expected utility.”
- Thus it is similar to goal-content integrity and self-improvement.
- Utility-function preservation:
- Similar to goal-content integrity.
- Prevent counterfeit utility:
- Essentially this is avoiding wireheading. Omohundro: “An important class of vulnerabilities arises when the subsystems for measuring utility become corrupted.”
- Thus it is similar to goal-content integrity.
Seeking a concise, memorable-yet-accurate name for this maxim, this convergent-instrumental-goal-to-rule-them-all, we settled on P2B:
P2B ≔ Plan to P2B Better
This name emphasizes the recursive, feedback-loopy aspect of the phenomenon, which is only implicit in the idea of “better plans.”
Why P2B Works
There are several ways for planning to be ineffective, such as an inaccurate or unwieldy world model, a limited selection of actions to choose from, or inefficient use of time or other resources in predicting or assessing consequences. But often planners can and will address these issues: planning is self-correcting, thanks to P2B. A planner that didn’t recognize the importance of P2B, or was unable to do it for some reason, would not be self-correcting.
Instrumental goals are about passing the buck: if you are a planner, and you can’t achieve your final goal with a single obvious action (or sequence of actions), you can instead pass the buck to something else, typically your future self.² There will often be obvious available actions that put the receiver of the buck “closer” to achieving the final goal than you.³
P2B is what it means to generically pass the buck, closing some distance along the way. The “better” in P2B means being closer to the goal and/or generally able to close distance faster. Convergent instrumental goals are ways to close distance that work for almost any final goal, hence they are instances of P2B.
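As a rough illustration of "closing distance faster," here is a toy model that is not from the post itself. It assumes a planner sits some `distance` from its goal and, on each step, either moves (closing distance by its current speed) or upgrades (doubling its speed). The function names, the doubling rule, and the numbers are all hypothetical choices made for the sketch.

```python
import math


def steps_to_goal(distance: int, upgrades: int) -> int:
    """Total steps if the planner upgrades `upgrades` times, then moves.

    Assumption for this toy model: speed starts at 1 and doubles with
    each upgrade, and each upgrade costs one step.
    """
    speed = 2 ** upgrades
    return upgrades + math.ceil(distance / speed)


def best_number_of_upgrades(distance: int, max_upgrades: int = 10) -> int:
    """Return the smallest upgrade count that minimizes total steps."""
    return min(range(max_upgrades + 1), key=lambda k: steps_to_goal(distance, k))


# With the goal 10 units away, improving first beats moving straight away,
# but upgrading indefinitely is worse than stopping to move.
far_direct = steps_to_goal(10, 0)
far_best = steps_to_goal(10, best_number_of_upgrades(10))

# With the goal 1 unit away, the obvious direct action is already optimal.
near_best_upgrades = best_number_of_upgrades(1)
```

In this sketch the far-away planner does best by passing the buck to a faster future self a couple of times, while the nearby planner does best by simply acting. Real planners face nothing this tidy, but the trade-off has the same shape.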
Objections & Nuances
Isn’t this trivial?
One might complain that we’ve defined P2B so broadly that our claim about it being the convergent instrumental goal is trivial: true by definition, since we defined “planner” and “better” so broadly. Fair enough; we are doing this because we think it's a useful framing/foundation for answering questions about agency, not because we think it is important or interesting on its own. We agree that the more detailed taxonomies of instrumentally convergent goals are useful. We just think it is also useful to have this unified frame. We intend to write subsequent posts making use of this frame.
Most agents aren’t planners though?
Yes they are — remember, we said above that we are defining “planner” broadly to include relevantly similar algorithms/procedures. Let’s flesh that idea out a bit more…
We think it’s OK to talk loosely about families of algorithms. When we say “planning,” for example, we are gesturing at a vague cluster of algorithms that has “for each action, imagine the expected consequences of that action, then evaluate how good those consequences are, then pick the action that had the best expected consequences” as a central example.
We are not saying that every planning algorithm must be exactly of that form. Examining exactly where the boundaries of these concepts lie is an interesting and potentially valuable rabbit hole that we don’t feel the need to go down yet.
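The central example above can be written down as a minimal sketch. This is only one possible rendering: the names `plan`, `world_model`, and `utility` are ours, the world model is assumed to return a probability for each predicted outcome, and the toy numbers are invented for illustration.

```python
from typing import Callable, Dict, Iterable


def plan(
    actions: Iterable[str],
    world_model: Callable[[str], Dict[str, float]],
    utility: Callable[[str], float],
) -> str:
    """Pick the action whose predicted consequences score best in expectation."""

    def expected_utility(action: str) -> float:
        # Imagine the consequences of the action, then evaluate how good they are.
        return sum(prob * utility(outcome)
                   for outcome, prob in world_model(action).items())

    return max(actions, key=expected_utility)


# Hypothetical toy world: outcome probabilities for each available action.
def toy_world_model(action: str) -> Dict[str, float]:
    predictions = {
        "walk": {"arrive": 0.9, "late": 0.1},
        "wait": {"arrive": 0.2, "late": 0.8},
    }
    return predictions[action]


def toy_utility(outcome: str) -> float:
    return {"arrive": 1.0, "late": 0.0}[outcome]


best_action = plan(["walk", "wait"], toy_world_model, toy_utility)
```

Here `best_action` is `"walk"`, since its expected utility (0.9) exceeds that of `"wait"` (0.2). Anything that behaves relevantly like this loop counts as planning in the broad sense used here, whether or not it is implemented this way.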
One thing we do wish to say is that we intend to include algorithms which behave similarly to the paradigmatic planning algorithm mentioned above. One easy way to generate algorithms like this is by automating bits of the process with heuristics. For example, maybe instead of calculating the expected consequences of every action all the time, the algorithm has a bag of heuristics that tell it when to calculate and when to not bother (and what to do instead) and the bag of heuristics tends to yield similar results for less computational expense, at least in some relevant class of environments.⁴
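One way to picture such a bag of heuristics is the sketch below. It is an illustration under our own assumptions, not a claim about how any real agent is built: each heuristic is assumed to either return an action for a situation it recognizes or return `None`, and the full calculation runs only when no heuristic fires. All names here are hypothetical.

```python
from typing import Callable, List, Optional

# A heuristic maps a situation to an action, or returns None if it does not apply.
Heuristic = Callable[[str], Optional[str]]


def act(situation: str,
        heuristics: List[Heuristic],
        full_planner: Callable[[str], str]) -> str:
    """Use a cheap heuristic when one applies; otherwise fall back to full planning."""
    for heuristic in heuristics:
        action = heuristic(situation)
        if action is not None:
            return action
    return full_planner(situation)


# Count how often the expensive procedure actually runs.
calls = {"full_planner": 0}


def toy_full_planner(situation: str) -> str:
    calls["full_planner"] += 1
    return "think carefully about " + situation


def familiar_door(situation: str) -> Optional[str]:
    return "reach for handle" if situation == "closed door" else None


routine = act("closed door", [familiar_door], toy_full_planner)
novel = act("unfamiliar lobby", [familiar_door], toy_full_planner)
```

In this sketch the routine situation is handled without any explicit calculation, and the expensive planner runs once, for the novel case. The results match the full procedure only within the environments the heuristics were tuned for.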
(We haven’t defined “agents” yet, but you can probably guess from what we’ve said that our definition is going to resemble Dennett’s Intentional Stance.)
What about the procrastination paradox?
A planner that “P2Bs forever”, without ever taking “object-level” actions in plans that aren’t about making better future plans/planners, won’t be very effective at achieving its goal. But P2B is not the only strategy a planner should pursue — we have only said that P2B is the convergent instrumental goal. Whenever there are obvious actions that directly lead towards the goal, a planner should take them instead.
The danger of taking instrumental actions forever can show up in some toy decision problems. However, in realistic cases and for realistic planners, this is not so much of an issue: one can pursue convergent instrumental goals without ceasing to keep an eye out for opportunities to achieve terminal goals. Nevertheless, due to the automation-of-bits-of-the-core-algorithm phenomenon described above, it’s not uncommon for agents to end up pursuing P2B as a terminal goal, or even pursuing sub-sub-subgoals of P2B such as “acquire money” as final goals. As Richard Ngo pointed out, we should expect mesa-optimizers to develop terminal goals for power, survival, learning, etc. because such things are useful in a wide range of environments and therefore probably useful in the particular environment they are being trained in.
1. This was an “aha” moment for me: Even such everyday actions as “briefly glance up from your phone so you can see where you are going when walking through a building” are instances of following this maxim! You are looking up from your phone so that you can acquire more relevant data (the location of the door, the location of the door handle, etc.) for your immediate-future-self to make use of. Your immediate-future-self will have a slightly better world-model as a result, and thus be better than you at making plans. In particular, your immediate future self will be able, e.g., to choose the correct moment & location to grab the door handle, by contrast with your present self who is looking at Twitter and does not know where to grab. ↩
2. This phrasing makes it sound like your goal is binary and permanent, either achieved or not. For more typical goals, which look more like utility functions, we think the same point would apply but would be more unwieldy to state. ↩
3. To make this analogy to covering distance from a target more precise, consider something like the edit distance between the world as it is and any variation satisfying one’s final goal. The world is always changing, but the fraction of changes that are reducing this edit distance increases when there are more capable and effective planners with that goal. ↩
4. Or perhaps it’s even heuristics all the way down, but it’s a sophisticated bag of heuristics that behaves as if it were following the calculate-expected-consequences-then-pick-the-best procedure, at least in some relevant class of environments. Note that we have the intuition that, generally speaking, substituting heuristics for bits of the core algorithm risks increasing “brittleness”/“narrowness,” i.e., problematically reducing the range of environments in which the system behaves like a planner. ↩