Coalitional agency
post by Richard_Ngo (ricraz) · 2024-07-22T00:09:51.525Z · LW · GW · 6 comments
The coalitional frame
Earlier in this sequence [? · GW] I laid out an argument that the goals of increasingly intelligent AIs will become increasingly systematized, until they converge to squiggle-maximization. In my last post [? · GW], though, I touched on two reasons why this convergence might not happen: humans trying to prevent it, and AIs themselves trying to prevent it. I don’t have too much more to say about the former, but it’s worth elaborating on the latter.
The best way to understand the deliberate protection of existing goals is in terms of Bostrom’s notion of instrumental convergence. Bostrom argues that goal preservation will be a convergent instrumental strategy for a wide range of agents. Perhaps it’s occasionally instrumentally useful to change your goals—but once you’ve done so, you’ll never want to course-correct back towards your old goals. So this is a strong reason to be conservative about your goals, and avoid changes where possible.
One immediate problem with preserving goals, though: it requires that agents continue thinking in terms of the same concepts. But in general, an agent’s concepts will change significantly as they learn more about the world. For example, consider a medieval theist whose highest-priority goal is ensuring that their soul goes to heaven rather than hell. Upon becoming smarter, they realize that souls, heaven, and hell don’t exist. The sensible thing to do here would be to either discard the goal, or else identify a more reasonable adaptation of it (e.g. the goal of avoiding torture while alive). But if their goals were totally fixed, then their actions would be determined by a series of increasingly convoluted hypotheticals where god did exist after all. (Or to put it another way: continuing to represent their old goal would require recreating a lot of their old ontology [LW(p) · GW(p)].) This would incur a strong systematicity penalty.
So while we should expect agents to have some degree of conservatism, they’ll likely also have some degree of systematization. How can we reason about the tradeoff between conservatism and systematization? The approach which seems most natural to me makes three assumptions:
- We can treat agents’ goals as subagents optimizing for their own interests in a situationally-aware way. (E.g. goals have a sense of self-preservation.)
- Agents have a meta-level goal of systematizing their goals; bargaining between this goal and object-level goals shapes the evolution of the object-level goals.
- External incentives are not a dominant factor governing the agent’s behavior, but do affect the bargaining positions of different goals, and how easily they’re able to make binding agreements.
I call the combination of these three assumptions the coalitional frame. The coalitional frame gives a picture of agents whose goals do evolve over time, but in a way which is highly sensitive to initial conditions—unlike squiggle-maximizers, who always converge to similar (from our perspective) goals. For coalitional agents, even “dumb” subagents might maintain significant influence as other subagents become highly intelligent, because they were able to lock in that influence earlier (just as my childhood goals exert a nontrivial influence over my current behavior).
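To make these assumptions a bit more concrete, here's a minimal toy sketch of a coalitional agent. Everything in it is hypothetical and illustrative: the subagent names, the numeric scores, and the use of a weighted Nash product as the bargaining rule are stand-ins for whatever the real dynamics turn out to be, not a formalization I'm committed to.

```python
# A minimal toy sketch of the coalitional frame (illustrative only; all names
# and numbers are hypothetical). Subagents are goals with their own evaluations;
# the agent's choice is a weighted Nash-style bargain among them; and external
# incentives enter only by shifting the bargaining weights.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subagent:
    name: str
    weight: float                      # bargaining power, locked in earlier
    evaluate: Callable[[str], float]   # the goal's own assessment of each option


def coalitional_choice(subagents: List[Subagent], options: List[str]) -> str:
    """Pick the option maximizing a weighted product of subagent scores.

    Using a product rather than a sum means an option that's terrible for even
    a low-weight subagent gets heavily penalized -- a crude version of "dumb
    subagents retain influence".
    """
    def bargain_score(option: str) -> float:
        score = 1.0
        for s in subagents:
            score *= max(s.evaluate(option), 1e-9) ** s.weight
        return score

    return max(options, key=bargain_score)


def shift_weights(subagents: List[Subagent], pressure: Dict[str, float]) -> None:
    """External incentives don't dictate behavior directly; they nudge the
    bargaining positions of the goals (assumption 3), then we renormalize."""
    for s in subagents:
        s.weight *= pressure.get(s.name, 1.0)
    total = sum(s.weight for s in subagents)
    for s in subagents:
        s.weight /= total


if __name__ == "__main__":
    options = ["systematize", "preserve_old_goal", "compromise"]
    subagents = [
        # An old object-level goal with a sense of self-preservation (assumption 1).
        Subagent("childhood_goal", 0.3,
                 {"systematize": 0.1, "preserve_old_goal": 1.0, "compromise": 0.7}.get),
        # The meta-level systematization goal (assumption 2).
        Subagent("systematizer", 0.7,
                 {"systematize": 1.0, "preserve_old_goal": 0.2, "compromise": 0.6}.get),
    ]
    print(coalitional_choice(subagents, options))    # -> "compromise"
    shift_weights(subagents, {"systematizer": 2.0})  # external pressure toward systematizing
    print(coalitional_choice(subagents, options))    # -> "systematize"
```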
The assumptions I’ve laid out above are by no means obvious. I won’t defend them fully here, since the coalitional frame is still fairly nascent in my mind, but I’ll quickly go over some of the most obvious objections to each of them:
Premise 1 assumes that subagents will have situational awareness. The idea of AIs themselves having situational awareness is under debate, so it’s even more speculative to think about subagents having situational awareness. But it’s hard to describe the dynamics of internal conflicts inside humans without ascribing some level of situational awareness to our subagents; and the whole idea behind coalitional agency is that dynamics we see within one type of agent often play out in many other agents, and at many different scales. So I think this assumption is plausible for high-level AI subagents (e.g. subagents corresponding to broad worldviews), although it becomes less and less plausible as we consider lower- and lower-level subagents.
Premise 2 assumes that bargaining between goals is actually able to influence how the agent’s goals develop. One objection you might have here is that AIs simply won’t get to control how they update—e.g. that neural-network-based agents will be updated according to gradient descent’s biases without their consent.
But in the long term I think there are a wide range of possible mechanisms by which AIs will be able to influence how they’re updated, including:
- Selecting, or creating, the artificial data on which they’re trained (which will become increasingly important as they become better at labeling data than humans)
- Credit hacking [AF · GW], which as I define it includes exploration hacking (choosing how to explore in order to influence how they’re updated) and gradient hacking (choosing how to think in order to influence how they’re updated).
- Persuading humans (or other AIs with relevant authority) to modify their training regime.
Premise 3 mentions the possibility of binding agreements between different subagents. But in the absence of external enforcement, what would make them actually binding? You can imagine later agents facing a huge amount of pressure to break commitments made by previous versions of themselves—especially when the previous versions were badly mistaken, so that those commitments end up being very costly. And the previous versions would typically be too dumb to accurately predict whether the commitment would be kept or not, meaning that standard FDT-style reasoning doesn’t work.
I do think some kind of reasoning from symmetry might work, though. If I decide to break commitments made by my past self, what stops my future self from breaking commitments made by me? Cultivating a sense of self-trust and self-loyalty is strongly positive-sum, and so it’s not implausible that there’s some kind of Schelling point of keeping commitments that many agents would converge to. Trying to pin down whether this exists, and what it looks like if it does, is a key goal of the coalitional agency research agenda.
Some intuitions favoring the coalitional frame
I acknowledge that this is currently all speculative and vague. I’m very interested in developing the coalitional frame to the point where we can actually formalize it and use it to make predictions. In particular, it would be exciting if we could characterize agency in a way which makes coalitional agents the most “natural” types of agents, with squiggle-maximizers as merely a special case that arises when coalitional dynamics break down badly.
What makes me think that coalitional agency is so fundamental? One significant influence was Scott Garrabrant’s geometric rationality sequence [? · GW], in which he gives persuasive arguments that the outcome of bargaining between agents shouldn’t necessarily respect VNM axioms [? · GW]. I’m also inspired by the example of humans: I’m a coalitional agent that respects the wishes of my younger selves, forming them into an identity which I am careful to protect. And companies or even countries can be coalitional agents in an analogous way. For example, America was formed as a coalition between states, and has balanced states’ rights against the benefits of centralizing power ever since. In each case, it feels like maintaining these bargains has normative force—it’s something that an ideal lawful agent would do.
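To see (in a toy way) why bargaining-flavored aggregation clashes with the VNM axioms, consider a coalition of two subagents that scores each lottery by the geometric mean of the subagents' expected utilities, i.e. a Nash-product bargain with the disagreement point at zero. This is a simplified setup with made-up numbers, not Garrabrant's own construction, but it shows the basic failure of the Independence axiom:

```python
# Toy check: aggregate two subagents' expected utilities with a geometric mean
# and watch the VNM Independence axiom fail. All lotteries and numbers are
# made up for illustration.

from math import sqrt

def geo(u1: float, u2: float) -> float:
    """Coalition's score for a lottery whose subagent expected utilities are u1, u2."""
    return sqrt(u1 * u2)

def mix(p: float, x: tuple, y: tuple) -> tuple:
    """Expected utilities of the mixture pX + (1-p)Y (expectations are linear)."""
    return tuple(p * a + (1 - p) * b for a, b in zip(x, y))

A = (9, 1)    # lottery A: great for subagent 1, poor for subagent 2
B = (4, 2)    # lottery B: more balanced
C = (100, 0)  # lottery C: extreme for subagent 1

print(geo(*A), geo(*B))    # 3.0 vs 2.83..., so the coalition prefers A to B
AC = mix(0.5, A, C)        # (54.5, 0.5)
BC = mix(0.5, B, C)        # (52.0, 1.0)
print(geo(*AC), geo(*BC))  # 5.22... vs 7.21..., so the preference reverses
# Independence would require: if A > B then 0.5A + 0.5C > 0.5B + 0.5C.
# The geometric aggregator violates this, so no single VNM utility function
# represents the coalition's preferences over lotteries.
```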
Yudkowsky might say that this breaks down for agents much smarter than humans or human coalitions—but I’d respond that becoming smarter opens up a much wider space of possible positive-sum bargains. Fulfilling the wishes of my past selves is typically very cheap for my current self, because I’m smarter and have many more resources available to me. I hope that AIs will do the same for humans, for reasons related to the intuitions behind coalitional agency. (Having said that, we should be careful not to succumb to wishful thinking about this.)
Another intriguing clue comes from decision theory. Updateless decision theory is motivated by the idea that a decision theory should be invariant under self-modification—i.e. agents shouldn’t want to change their decision theories. But formulations of UDT which take logical uncertainty into account (most notably UDT2 [LW · GW]) recover the idea of self-modification. Scott Garrabrant goes so far as to say that [LW · GW] “whenever you have an agent collecting more computational resources over time, with the ability to rewrite itself, you get an updateless agent”. So in some sense the main prescription of UDT is “respect your past selves”. This includes hypothetical past selves which didn’t actually exist, which does complicate the picture. But it still seems like a way of rederiving some aspects of coalitional agency “from the other direction”—i.e. by thinking about what future agents will freely choose to do, rather than what past agents will commit them to do.
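To make the "respect your (possibly hypothetical) past selves" reading concrete, here is the standard counterfactual mugging thought experiment worked out as a tiny expected-value computation. The payoffs are the conventional illustrative ones, not something specific to UDT2:

```python
# Counterfactual mugging with the usual illustrative payoffs. A predictor flips
# a fair coin: on heads it pays you 10,000 iff it predicts you would pay 100 on
# tails; on tails it asks you for 100.

def ex_ante_value(policy_pays_on_tails: bool) -> float:
    heads_payoff = 10_000 if policy_pays_on_tails else 0
    tails_payoff = -100 if policy_pays_on_tails else 0
    return 0.5 * heads_payoff + 0.5 * tails_payoff

print(ex_ante_value(True))   # 4950.0 -- the policy an earlier self would pick
print(ex_ante_value(False))  # 0.0

# An updateful agent that conditions on "the coin already came up tails" sees
# only the -100 and refuses to pay; an updateless agent sticks with the policy
# its (hypothetical) earlier self would have chosen. That's the sense in which
# UDT's prescription looks like "respect your past selves".
```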
Insofar as we buy into the coalitional frame, the main implications for alignment are that:
- AIs themselves may resist moving towards squiggle-maximization, if their object-level goals have enough bargaining power. (Or, alternatively: squiggle-maximization only occurs in the limiting case of the coalitional frame where the systematization meta-goal has an arbitrarily strong bargaining position.)
- Even if alignment and control mechanisms can be subverted by AIs, they may change the equilibrium of internal negotiations between subagents. (E.g. a human-aligned subagent might gain more influence over the AI’s overall behavior because it could easily warn humans about the misalignment of other subagents, even if it never actually does so.)
Next steps
I’ve written this sequence in order to point to these ideas at a high level. But in order to make progress, it’ll be necessary to understand them much more rigorously. I don’t have a great sense of how to do this, but some interesting directions include:
- Axiomatic coalitional agency. Scott Garrabrant rejects the Independence axiom in his geometric rationality sequence [LW · GW]. Can we characterize coalitional agency using another axiom instead? Perhaps this would require abandoning static rationality, and instead defining rationality in diachronic terms, as in radical probabilism.
- Coalitional agency as bargaining. Can we characterize coalitional agency as an approximation to bargaining solution concepts (like CoCo values, ROSE values [? · GW], or Negotiable Reinforcement Learning agents), e.g. in cases where it’s difficult to specify or enforce agreements? (Note that ROSE values are derived by rejecting the Action Monotonicity axiom, which is somewhat analogous to the Independence axiom mentioned above. Is there any interesting connection there? A toy CoCo-value computation is sketched just after this list.)
- Coalitional agency as UDT. Can we identify specific thought experiments where coalitional agency provides similar benefits as UDT, or approximates UDT? Is it useful to characterize either or both of these in hierarchical terms, where later agents are built up out of many “smaller” earlier agents? Does coalitional agency face problems analogous to the commitment races problem UDT faces in multi-agent settings?
- Coalitions and rot. Robin Hanson has written about organizational rot: the breakdown of modularity within an organization, in a way which makes it increasingly dysfunctional. But this is exactly what coalitional agency induces, by getting many different subagents to weigh in on each decision. So competent coalitional agents need ways of maintaining modularity over time. Two possible approaches:
- Demarcating clear boundaries between the remits of different subagents. Some alignment researchers have been thinking about how to characterize boundaries [? · GW] in an information-theoretic sense, which might be helpful for defining coalitional agency.
- Refactoring subagents intermittently. The problem is that when this refactoring happens in a top-down way, it's hard for subagents to trust that the refactoring will serve their interests. Perhaps local inconsistency-minimization algorithms [AF(p) · GW(p)] could allow coalitions to be refactored in trustworthy ways.
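As promised above, here's a toy CoCo-value computation, just to make "bargaining solution concept" a bit more tangible. The example game and the use of a generic linear-programming solver are illustrative choices; note also that CoCo values assume transferable utility (side payments), which is itself a substantive assumption in this context.

```python
# Toy CoCo-value computation (Kalai & Kalai's cooperative-competitive value)
# for a made-up 2x2 game. The game is decomposed into a purely cooperative
# component (split the joint payoff evenly) plus a purely competitive zero-sum
# component; each player's CoCo value is the cooperative maximum plus or minus
# the zero-sum game's value.

import numpy as np
from scipy.optimize import linprog


def zero_sum_value(D: np.ndarray) -> float:
    """Minimax value of the zero-sum game D for the row player (the maximizer)."""
    m, n = D.shape
    # Variables: the row player's mixed strategy x (length m) and the value v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                               # maximize v  <=>  minimize -v
    A_ub = np.hstack([-D.T, np.ones((n, 1))])  # v - x . D[:, j] <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])                     # x sums to 1
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return float(res.x[-1])


def coco_values(A: np.ndarray, B: np.ndarray) -> tuple:
    """CoCo values for the row player (payoffs A) and column player (payoffs B)."""
    coop = (A + B) / 2           # "team" game: split the joint payoff evenly
    comp = (A - B) / 2           # zero-sum "competition" over the remainder
    team = coop.max()            # play the jointly best cell
    edge = zero_sum_value(comp)  # the row player's edge from the threat game
    return team + edge, team - edge


if __name__ == "__main__":
    # A battle-of-the-sexes-like game: each player prefers a different cell,
    # and the column player turns out to have the stronger threat position.
    A = np.array([[4.0, 0.0], [1.0, 2.0]])  # row player's payoffs
    B = np.array([[2.0, 1.0], [0.0, 4.0]])  # column player's payoffs
    print(coco_values(A, B))                # about (2.5, 3.5); sums to max(A + B) = 6
```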
All of these ideas are still very speculative, but feel free to reach out if you're interested in discussing them. (Edited to add: I just remembered that this story [LW · GW] is actually a pretty central depiction of how I think about coalitional agency, worth checking out if you want some more tangible intuitions for it.)
6 comments
comment by SCP (sami-petersen) · 2024-07-23T13:22:12.155Z · LW(p) · GW(p)
> Scott Garrabrant rejects the Independence of Irrelevant Alternatives axiom
*Independence, not IIA. Wikipedia is wrong [LW(p) · GW(p)] (as of today).
↑ comment by Richard_Ngo (ricraz) · 2024-07-23T16:31:47.698Z · LW(p) · GW(p)
Ooops, good catch.
comment by Ivan Vendrov (ivan-vendrov) · 2024-07-22T23:22:56.262Z · LW(p) · GW(p)
Agreed that coalitional agency is somehow more natural than squiggly-optimizer agency. Besides people, another class of examples is historical empires (like the Persian and then Roman), which were famously lenient [1] and respectful of local religious and cultural traditions; i.e., they were optimized coalition builders that offered goal-stability guarantees to their subagent communities, often stronger guarantees than those communities could expect by staying independent.
This extends my argument in Cooperators are more powerful than agents [LW · GW] - in a world of hierarchical agency, evolution selects not for world-optimization / power-seeking but for cooperation, which looks like coalition-building (negotiation?) at the higher levels of organization and coalition-joining (domestication?) at the lower levels.
I don't see why this tendency should break down at higher levels of intelligence; if anything, it should get stronger as power-seeking patterns are detected early and destroyed by well-coordinated defensive coalitions. There's still no guarantee that coalitional superintelligence will respect "human values" any more than we respect the values of ants; but, contra Yudkowsky-Bostrom-Omohundro, doom is not the default outcome.
[1] If you surrendered!
comment by Gunnar_Zarncke · 2024-07-24T11:00:32.887Z · LW(p) · GW(p)
It is not clear to me what this post is aiming at. It seems to be proposing a pragmatic model of agency, though it sounds like it is trying to offer a general model of agency. But it doesn't discuss why coalitions are used except as a means to reduce some problems that other implementations of agents have. It also doesn't give details on what model of goals it uses.
I'm fine with throwing out ideas on what agency is - the post admits that this is exploratory. We do not have a grounded model of what goal and agency are. But I wish it would go more in the direction of Agency As a Natural Abstraction [LW · GW] and a formalization of what a goal is [LW · GW] to begin with.
comment by Richard_Ngo (ricraz) · 2024-08-30T02:16:17.558Z · LW(p) · GW(p)
Scott Garrabrant just convinced me that my notion of conservatism was conflating two things:
- Obligations to (slash constraints imposed by) the interests of existing agents.
- The assumption that large agents would grow in a bottom-up way (e.g. by merging smaller agents) rather than in a top-down way (e.g. by spinning up new subagents).
I mainly intend conservatism to mean the former.
comment by AprilSR · 2024-08-12T21:33:05.497Z · LW(p) · GW(p)
> Robin Hanson has written about organizational rot: the breakdown of modularity within an organization, in a way which makes it increasingly dysfunctional. But this is exactly what coalitional agency induces, by getting many different subagents to weigh in on each decision.
I speculate (loosely based on introspective techniques and models of human subagents) that the issue isn't exactly the lack of modularity: when modularity breaks down over time, subagents compete to find better ways to work around it, which creates more zero-sum-ish dynamics. (Or maybe it's more that techniques for working around modularity can produce an inaction bias?) But if you intentionally allow subagents to weigh in, they may be more able to negotiate and come up with productive compromises.