Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
post by Jan_Kulveit, Raymond D, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet), David Duvenaud · 2025-01-30T17:03:45.545Z · LW · GW · 4 comments
This is a link post for https://gradual-disempowerment.ai/
Full version on arXiv | X
Executive summary
AI risk scenarios usually portray a relatively sudden loss of human control to AIs outmaneuvering individual humans and human institutions, brought about by a sudden increase in AI capabilities or a coordinated betrayal. However, we argue that even an incremental increase in AI capabilities, without any coordinated power-seeking, poses a substantial risk of eventual human disempowerment. This loss of human influence will be centrally driven by the emergence of more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.
A gradual loss of control of our own civilization might sound implausible. Hasn't technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable only because of the necessity of human participation for thriving economies, states, and cultures. Once this human participation gets displaced by more competitive machine alternatives, our institutions' incentives for growth will be untethered from a need to ensure human flourishing. Decision-makers at all levels will soon face pressures to reduce human involvement across labor markets, governance structures, cultural production, and even social interactions. Those who resist these pressures will eventually be displaced by those who do not.
Still, wouldn't humans notice what's happening and coordinate to stop it? Not necessarily. What makes this transition particularly hard to resist is that pressures on each societal system bleed into the others. For example, we might attempt to use state power and cultural attitudes to preserve human economic power. However, the economic incentives for companies to replace humans with AI will also push them to influence states and culture to support this change, using their growing economic power to shape both policy and public opinion, which will in turn allow those companies to accrue even greater economic power.
Once AI has begun to displace humans, existing feedback mechanisms that encourage human influence and flourishing will begin to break down. For example, states funded mainly by taxes on AI profits instead of their citizens' labor will have little incentive to ensure citizens' representation. This could occur at the same time as AI provides states with unprecedented influence over human culture and behavior, which might make coordination amongst humans more difficult, thereby further reducing humans' ability to resist such pressures. We describe these and other mechanisms and feedback loops in more detail in this work.
Though we provide some proposals for slowing or averting this process, and survey related discussions, we emphasize that no one has a concrete plausible plan for stopping gradual human disempowerment, and that methods of aligning individual AI systems with their designers' intentions are not sufficient to prevent it. Because this disempowerment would be global and permanent, and because human flourishing requires substantial resources in global terms, it could plausibly lead to human extinction or similar outcomes.
4 comments
comment by ryan_greenblatt · 2025-01-30T23:32:20.867Z · LW(p) · GW(p)
I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).
This has previously been argued by Vanessa here [LW(p) · GW(p)] and Paul here [LW(p) · GW(p)] in response to a post making a similar claim.
I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However, I don't think this matches the story you describe and the mitigations seem substantially different than what you seem to be imagining.
I'm also somewhat skeptical of the threat model you describe in the case where alignment isn't solved. I think the difference between the story you tell and something more like We get what we measure [AF · GW] is important.
I worry I'm misunderstanding something because I haven't read the paper in detail.
comment by ryan_greenblatt · 2025-01-30T23:42:48.704Z · LW(p) · GW(p)
The paper says:
Christiano (2019) makes the case that sudden disempowerment is unlikely,
This isn't accurate. The post What failure looks like [LW · GW] includes a scenario involving sudden disempowerment [LW · GW]!
The post does say:
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like,
But I think it is mostly arguing against threat models involving fast AI capability takeoff (where the level of capabilities takes its creators and others by surprise, and fast capabilities progress allows AIs to suddenly become powerful enough to take over), rather than threat models involving sudden disempowerment from a point where AIs are already well known to be extremely powerful.
comment by Davidmanheim · 2025-01-30T21:26:15.491Z · LW(p) · GW(p)
I think this is correct, but it doesn't seem to note the broader trend towards human disempowerment in favor of bureaucratic and corporate systems, which this gradual disempowerment would continue, and hence it elides or ignores why AI risk is distinct.
comment by Martin Fell (martin-fell) · 2025-01-30T19:53:45.303Z · LW(p) · GW(p)
In my opinion this kind of scenario is very plausible and deserves a lot more attention than it seems to get.