Gradual Disempowerment, Shell Games and Flinches
post by Jan_Kulveit · 2025-02-02T14:47:53.404Z · LW · GW · 5 comments
Over the past year and a half, I've had numerous conversations about the risks we describe in Gradual Disempowerment [LW · GW]. (The shortest useful summary of the core argument: to the extent human civilization is human-aligned, most of the reason for that alignment is that humans are extremely useful to various social systems - as parts of the economy and states, or as the substrate of cultural evolution. When human cognition ceases to be useful, we should expect these systems to become less aligned, leading to human disempowerment.) This post is not about repeating that argument - it might be quite helpful to read the paper first, since it has more nuance and more than just the central claim - but mostly me sharing some parts of the experience of working on this and discussing it.
What fascinates me isn't just the substance of these conversations, but the relatively consistent patterns in how people avoid engaging with the core argument. I don't mean the cases where people confused about AI progress repeat "stochastic parrots"-style claims about what AIs can't do that were experimentally refuted half a year ago, but the cases where smart, thoughtful people who can engage with other arguments about existential risk from AI display surprisingly consistent barriers when confronting this particular scenario.
I found this frustrating, but over time, I began to see these reactions as interesting data points in themselves. In this post, I'll try to make explicit several patterns I've observed. This isn't meant as criticism. Rather, I hope that by making these patterns visible, we can better understand the epistemics of the space.
Before diving in, I should note that this is a subjective account, based on my personal observations and interpretations. It's not something agreed on with or shared by the paper's coauthors, although when we compared notes on this, we sometimes found surprisingly similar patterns. Think of this as one observer's attempt to make legible some consistently recurring dynamics. Let's start with what I call "shell games [LW · GW]", after an excellent post by TsviBT.
Shell Games
The core principle of the shell game in alignment [LW · GW] is that when people propose alignment strategies, the hard part of aligning superintelligence always ends up happening in some component of the system other than the one being analyzed. In gradual disempowerment scenarios, the shell game manifests as shifting the burden of maintaining human influence between different societal systems.
When you point out how automation might severely reduce human economic power, people often respond "but the state will handle redistribution." When you explain how states might become less responsive to human needs as they rely less on human labor and taxes, they suggest "but cultural values and democratic institutions will prevent that." When you point out how cultural evolution might drift memeplexes away from human interests once human minds stop being the key substrate, the suggestion is that maybe this has an economic or governance solution.
What makes this particularly seductive is that each individual response is reasonable. Yes, states can regulate economies. Yes, culture can influence states. Yes, economic power can shape culture. The shell game exploits the tendency to think about these systems in isolation, missing how the same underlying dynamic - decreased reliance on humans - affects all of them simultaneously, and how shifting the burden puts more strain on the system which ultimately has to keep humans in power.
I've found this pattern particularly common among people who work within one of these individual domains. Their framework gives them sophisticated tools for thinking about how one social system works, but the gradual disempowerment dynamic usually undermines some of the assumptions they start from, because multiple systems can fail in correlated ways.
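To make the "correlated failure" point concrete, here is a deliberately crude toy model. It is my own illustrative sketch, not something from the paper: the three systems, the update rule, the coupling constant, and the "patch" bonus are all made-up assumptions. The only thing it is meant to show is that when every system's alignment is partly driven by the same falling quantity (human relevance), propping up one system does not stop the overall decline.

```python
# Toy model (illustrative assumptions only): three coupled systems whose
# alignment to humans is partly driven by a shared, declining quantity.

def step(relevance, alignment, patch=None, decay=0.05, coupling=0.5):
    """One time step: human relevance falls, and each system's alignment drifts
    toward a level set by (human relevance + support from the other systems)."""
    relevance = max(0.0, relevance - decay)  # humans become less load-bearing
    new_alignment = {}
    for name, level in alignment.items():
        others = [v for k, v in alignment.items() if k != name]
        support = coupling * sum(others) / len(others)
        target = min(1.0, relevance + support)
        if patch == name:                    # e.g. "the state will handle it"
            target = min(1.0, target + 0.2)
        new_alignment[name] = 0.8 * level + 0.2 * target
    return relevance, new_alignment

relevance = 1.0
alignment = {"economy": 0.9, "state": 0.9, "culture": 0.9}
for t in range(30):
    relevance, alignment = step(relevance, alignment, patch="state")

# The patched system does a bit better, but all three decline together,
# because the shared driver keeps falling.
print(round(relevance, 2), {k: round(v, 2) for k, v in alignment.items()})
```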
The Flinch
Another interesting pattern in how people sometimes encounter the gradual disempowerment argument is a kind of cognitive flinch away from really engaging with it. It's not disagreement exactly; it's more like their attention suddenly slides elsewhere, often to more “comfortable”, familiar forms of AI risk.
This happens even with (maybe especially with) very smart people who are perfectly capable of understanding the argument. A researcher might nod along as we discuss how AI could reduce human economic relevance, but bounce off the implications for states or cultural evolution. Instead, they may want to focus on technical details of the economic model, like how likely it is that machines will outcompete humans at virtually all tasks, including massages, or something like that.
Another flinch is something like rounding it off to some other well-known story - like "oh, you are discussing a multipolar scenario" or "so you are retelling Paul's story about influence-seeking patterns [LW · GW]." (Because the top comment on the G.D. LessWrong post is a bit like that, it is probably worth noting that while it fits the pattern, it is not the only or strongest piece of evidence.)
Delegating to Future AI
Another response, particularly from alignment researchers, is "This isn't really a top problem we need to worry about now - either future aligned AIs will solve it or we are doomed anyway."
This invites a rather unhelpful reaction of the type "Well, so the suggestion is we keep humans in control by humans doing exactly what the AIs tell them to do, and this way human power and autonomy are preserved?". But this is a strawman, and there's something deeper here - maybe it really is just another problem, solvable by better cognition.
I think this is where the 'gradual' assumption is important. How did you get to the state of having superhuman intelligence aligned to you? If the current trajectory continues, it's not the case that the AI you have is a faithful representative of you, personally, run in your garage. Rather it seems there is a complex socio-economic process leading to the creation of the AIs, and the smarter they are, the more likely it is they were created by a powerful company or a government.
This process itself shapes what the AIs are "aligned" to. Even if we solve some parts of the technical alignment problem, we still face the question of which sociotechnical process acts as the "principal". By the time we have superintelligent AI, the institutions creating them will have already been permeated by weaker AIs, decreasing human relevance and changing the incentive landscape.
The idea that the principal is you, personally, implies that a somewhat radical restructuring of society somehow happened before you got such AI and that individuals gained a lot of power currently held by super-human entities like bureaucracies, states or corporations.
Also yes: it is true that capability jumps can lead to much sharper left turns. I think that risk is real and unacceptably high. I can easily agree that gradual disempowerment is most relevant in worlds where rapid loss of control does not happen first, but note that the gradual problem makes the risk of coups go up. There is actually substantial debate here I'm excited about.
Local Incentives
Let me get a bit more concrete and personal here. If you are a researcher at a frontier AI lab, I think it's not in your institution's self-interest for you to engage too deeply with gradual disempowerment arguments. The institutions were founded based on worries about power and the technical risks of AGI, not worries about AI and the macroeconomy. They have some influence over technical development, and their 'how we win' plans were mostly crafted in a period when it seemed this was sufficient. It is very unclear if they are helpful or have much leverage in the gradual disempowerment trajectories.
To give a concrete example: in my read of Dario Amodei's "Machines of Loving Grace", one of the more important things to notice is not what is there, like the fairly detailed analysis of progress in biology, but what is not there, or is extremely vague. I appreciate that it is at least gestured at:
At that point (...a little past the point where we reach "a country of geniuses in a datacenter"...) our current economic setup will no longer make sense, and there will be a need for a broader societal conversation about how the economy should be organized.
So, we will have nice, specific things, like prevention of Alzheimer's, or a safer, more reliable descendant of CRISPR curing most genetic disease in existing people. Also, we will need to have "a broader societal conversation", because the human economy will be obsolete, and the incentives for states to care about people will be obsolete too.
I love that it is a positive vision. Also, IDK, it seems like a kind of forced optimism about certain parts of the future. Yes, we can acknowledge specific technical challenges. Yes, we can worry about deceptive alignment or capability jumps. But questioning where the whole enterprise ends, even if everything works as intended? Seems harder to incorporate into institutional narratives and strategies.
Even for those not directly employed by AI labs, there are similar dynamics in the broader AI safety community. Careers, research funding, and professional networks are increasingly built around certain ways of thinking about AI risk. Gradual disempowerment doesn't fit neatly into these frameworks. It suggests we need different kinds of expertise and different approaches than what many have invested years developing. Academic incentives also currently do not point here - there are likely fewer than ten economists taking this seriously, and the trans-disciplinary nature of the problem makes it a hard sell as a grant proposal.
To be clear, this isn't about individual researchers making bad choices. It's about how institutional contexts shape what kinds of problems feel important or tractable, how the funding landscape shapes what people work on, and how memeplexes or 'schools of thought' shape attention. In a way, this itself illustrates some of the points about gradual disempowerment - how systems can shape human behavior and cognition in ways that reinforce their own trajectory.
Conclusion
Actually, I don't know what's really going on here. Mostly, in my life, I've seen a bunch of case studies of epistemic distortion fields - cases where incentives like money or power shape what people have trouble thinking about, or where memeplexes protect themselves from threatening ideas. The flinching moves I've described look quite similar to those patterns.
5 comments
Comments sorted by top scores.
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-02-02T15:14:08.855Z · LW(p) · GW(p)
I have been discussing thoughts along these lines. My essay A Path to Human Autonomy [LW · GW] argues that we need to slow AI progress and speed up human intelligence progress. My plan for how to accomplish slowing AI progress is to use novel decentralized governance mechanisms aided by narrow AI tools. I am working on fleshing out these governance ideas in a doc. Happy to share.
comment by the gears to ascension (lahwran) · 2025-02-02T15:43:13.069Z · LW(p) · GW(p)
I would guess that the range of things people propose for the shell game is tractable to get a good survey of. It'd be interesting to try to plot out the system as a causal graph with recurrence so one can point to, "hey look, this kind of component is present in a lot of places", and see if one can get that causal graph visualization to show enough that it starts to feel clear to people why this is a problem. I doubt I'll get to this, but if I play with this, I might try to visualize it with arrays of arrows vaguely like,
```
a -> b -> c_1 -> c_1
          ...    ...
          c_n -> c_n
            |
            v
          d_1 ... d_n
           ^  |  /
           |  v v
           f <- e
```
where c might be, idk, people's bank accounts or something, d might be people's job decisions, e might be an action by some single person, etc. there's a lot of complexity in the world, but it's finite, and not obviously beyond us to display the major interactions. being able to point to the graph and say "I think there are arrows missing here" seems like it might be helpful. it should feel like, when one looks at the part of the causal graph that contains one's own behavior, "oh yeah, that's pretty much got all the things I interact with in at least an abstract form that seems to capture most of what goes on for me", and that should be generally true for basically anyone with meaningful influence on the world.
ideally then this could be a simulation that can be visualized as a steppable system. I've seen people make sim visualizations for public consumption - https://ncase.me/, https://www.youtube.com/@PrimerBlobs - it doesn't exactly look trivial to do, but it seems like it'd allow people to grok the edges of normality better to see normality generated by a thing that has grounding, and then see that thing in another, intuitively-possible parameter setup. It'd help a lot with people who are used to thinking about only one part of a system.
But of course trying to simulate abstracted versions of a large fraction of what goes on on earth sounds like it's only maybe at the edge of tractability for a team of humans with AI assistance, at best.
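In case it helps, here is a minimal sketch of what "a causal graph with recurrence, plus a steppable simulation" could look like in code. The node names follow the a / b / c_i / d_i / e / f labels in the parent comment, but the specific edges, initial values, and update rule are placeholders I made up; the point is the shape of the exercise, not any claim about the real structure.

```python
# Sketch of a causal graph with recurrence and a simple "step" dynamic.
# Edges and the update rule are placeholder assumptions for illustration.

N = 3  # size of the c_i and d_i arrays

edges = {
    "a": ["b"],
    "b": [f"c{i}" for i in range(N)],          # e.g. policy -> bank accounts
    **{f"c{i}": [f"d{i}"] for i in range(N)},  # bank accounts -> job decisions
    **{f"d{i}": ["e"] for i in range(N)},      # job decisions -> some actor's action
    "e": ["f"] + [f"d{i}" for i in range(N)],  # the action feeds back: recurrence
    "f": [],
}

# Arbitrary initial activations; perturb "a" to watch the change propagate.
state = {node: (0.0 if node == "a" else 1.0) for node in edges}

def step(state, damping=0.5):
    """One tick: each node moves toward the average of its parents' values."""
    parents = {n: [p for p, kids in edges.items() if n in kids] for n in edges}
    new = {}
    for n in edges:
        if parents[n]:
            incoming = sum(state[p] for p in parents[n]) / len(parents[n])
            new[n] = (1 - damping) * state[n] + damping * incoming
        else:
            new[n] = state[n]  # exogenous nodes keep their value
    return new

for _ in range(10):
    state = step(state)
print({k: round(v, 2) for k, v in state.items()})
```

Being able to point at a specific edge here and say "I think this arrow is missing" is the kind of interaction the parent comment describes; the Ncase/Primer-style visualization would sit on top of something like this.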
comment by Dagon · 2025-02-02T16:14:39.120Z · LW(p) · GW(p)
Do we have a good story about why this hasn't already happened to humans? Systems don't actually care about the individuals they comprise, and certainly don't care about individuals who are not taxpayers, part of the selectorate, contributors, or customers.
Why do modern economies support so many non-participants? Leaving aside the marginal and slightly sub-marginal workers, who don't cost much and may have option value or be useful for keeping money moving in some way, there are a lot of people who are clearly a drain on resources.
↑ comment by Archimedes · 2025-02-02T18:34:58.205Z · LW(p) · GW(p)
Participants are at least somewhat aligned with non-participants. People care about their loved ones even if they are a drain on resources. That said, in human history, we do see lots of cases where “sub-marginal participants” are dealt with via genocide or eugenics (both defined broadly), often even when it isn’t a matter of resource constraints.
When humans fall well below marginal utility compared to AIs, will their priorities matter to a system that has made them essentially obsolete? What happens when humans become the equivalent of advanced Alzheimer’s patients who’ve escaped from their memory care units trying to participate in general society?
comment by Noosphere89 (sharmake-farah) · 2025-02-02T15:26:35.969Z · LW(p) · GW(p)
I think a big part of the issue is not just the assumptions people use, but also that your scenario doesn't really lead to existential catastrophe in most worlds, if only because a few very augmented humans determine a lot of what the future does hold, at least under single-single alignment scenarios. Also, a lot of AI thought has been directed towards worlds where AI does pose an existential risk, and a lot of this is because of the values of the first thinkers on the topic.
More below:
https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from#GChLyapXkhuHaBewq [LW(p) · GW(p)]