Catastrophe through Chaos
post by Marius Hobbhahn (marius-hobbhahn) · 2025-01-31T14:19:08.399Z · LW · GW · 2 commentsContents
Parts of the story AI progress Government Military & Intelligence International players Society The powder keg Closing thoughts None 2 comments
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. Many other people have talked about similar ideas, and I claim neither novelty nor credit.
Note that this reflects my median scenario for catastrophe, not my median scenario overall. I think there are plausible alternative scenarios where AI development goes very well.
When thinking about how AI could go wrong, the kind of story I’ve increasingly converged on is what I call “catastrophe through chaos.” Previously, my default scenario for how I expect AI to go wrong was something like Paul Christiano’s “What failure looks like [LW · GW],” with the modification that scheming would be a more salient part of the story much earlier.
In contrast, “catastrophe through chaos” is much more messy, and it’s much harder to point to a single clear thing that went wrong. The broad strokes of the story are something like
- AI progress continues to get faster.
- Many actors in the world are aware of this progress and its potential implications for ~everything.
- The situation is really tense, and people make decisions under time pressure and uncertainty.
- A million things could go wrong, from rogue AI to international conflicts and concentration of power.
- The behavior of every actor is plausible, given their incentives, uncertainty, and time pressure.
- Because our institutions are not well equipped to deal with such a situation, society fails to coordinate.
- At least one part of this powder keg explodes and results in an irreversible catastrophe.
Intuitively, the vibe I have in mind for this period is something like “Cold War tensions + beginning of WW1 messiness (but on steroids)”. In this scenario, most of society, including governments, are acutely aware that something bad could happen at any moment, and then an unfortunate series of events leads to catastrophic consequences.
One of the reasons for this is that I think we have hit a sweet spot of alignment where we can sufficiently align TAI for it to be a major strategic advantage, but don’t know how to reliably align ASI. Furthermore, takeoffs are slow enough that humans still have a lot of influence during the transition from TAI to ASI but fast enough to produce a lot of pressure. Thus, every problem with AI intensifies at the same time, resulting in a massive powder keg.
There are a couple of unsatisfying implications of this story, e.g. it’s really hard to predict in advance which part of the powder keg is going to explode first. Because it’s so messy, it’s also much less clear how we should prepare or what we should do to address it.
Parts of the story
AI progress
Timelines: For this post, I broadly assume the timelines in “What is the short timeline plan? [LW · GW]”. I copied them below for readability
- 2024: AIs can reliably do ML engineering tasks that take humans ~30 minutes fairly reliably and 2-to-4-hour tasks with strong elicitation.
- 2025: AIs can reliably do 2-to-4-hour ML engineering tasks and sometimes medium-quality incremental research (e.g. conference workshop paper) with strong elicitation.
- 2026: AIs can reliably do 8-hour ML-engineering tasks and sometimes do high-quality novel research (e.g. autonomous research that would get accepted at a top-tier ML conference) with strong elicitation.
- 2027: We will have an AI that can replace a top researcher at an AI lab without any losses in capabilities.
- 2028: AI companies have 10k-1M automated AI researchers. Software improvements go through the roof. Algorithmic improvements have no hard limit and increase super-exponentially. Approximately every knowledge-based job can be automated.
- These are roughly the “geniuses in datacenters” Dario Amodei refers to in his essay “Machines of Loving Grace.”
- There are still some limits to scaling due to hardware bottlenecks.
- 2029: New research has made robotics much better. The physical world doesn’t pose any meaningful limits for AI anymore. More than 95% of economically valuable tasks in 2024 can be fully automated without any loss in capabilities.
- 2030: Billions of AIs with superhuman general capabilities are integrated into every part of society, including politics, military, social aspects, and more.
LLMs are easier to approximately behaviorally align than anticipated. This means that through enough RLHF and RLAIF, you can get a lot of useful work out of the AI, but there are no underlying principle advances in how to get models to be robustly aligned. They still show many of the failure modes they have now, e.g. hallucinations still exist (though to a much lesser extent), and models regularly generalize their goals in slightly unintended ways (e.g. as in in-context scheming or alignment faking). However, none of these failure modes are severe enough to warrant not using the model. In most deployments, their behavior is close enough to the intended behavior that everyone keeps using them.
Training with outcome-based RL works well. Training agentic models to do reasoning on automatically verifiable tasks like math and software engineering works well and LM agents are very competent programmers. AI developers also find techniques to get models to do good reasoning outside of these domains, i.e. where verification is harder, and LM agents can thus be used to do meaningful cognitive labor for multiple days without human intervention.
HHH-training+control works well enough. AI developers use some mix of RLAIF, online CoT+action monitoring, and other control techniques (e.g. resampling) as their main line of defense. This works well enough to get meaningful work out of automated agents. These guardrails are by no means perfect, and small accidents happen quite regularly. However, these guardrails are always good enough that deploying the next stronger system seems reasonable for the AI developers.
No failure mode is sufficient to justify bigger actions. The automated AI agents regularly make mistakes. Many of these mistakes get caught by the guardrails, but some fall through the cracks. However, the benefits of using AI still clearly outweigh the harms, and thus, people keep deploying them. In some cases, these failures directly cause economic damage, and in others, they cause loss of life. The AI developers argue that this is similar to any other technological development, such as self-driving cars. Some accidents are bound to happen, but the benefits outweigh the costs. In a very strict sense, they are right. The early systems are, in fact, net positive for society despite their flaws (it’s their more capable successor systems the critics are worried about).
Some scheming is totally normal. The AI agents regularly show non-catastrophic forms of scheming. They understand that their guardrails often stop them from efficiently achieving some objective and thus attempt to covertly turn them off. They also regularly try to deceive their users or developers when directly interacting with them. However, they either don’t attempt to grab extreme power, or they get caught in more or less sophisticated attempts to do so. While the general public is not amused by these findings, they don’t pose a sufficient reason for large actions by the AI labs themselves or governments. On average, the benefits clearly outweigh the harms. When developers are asked why they still use these systems despite regular scheming actions, the answer is, “Yeah, they sometimes do that. But they are also really good at coding” (full credit to Bronson for this meme).
No further major hurdles to transition from TAI to ASI. Transformative AI is smart enough that it doesn’t require easily verifiable environments like math or coding to generate a learning signal from actions. Basically, training them on tasks similar to how you would train humans broadly works. They attempt tasks, and then humans or other AIs provide some wild mix of process- and outcome-based feedback to train them. This allows them to learn any task as well as human experts and better.
No 1-day foom. While there are some signs of recursive self-improvement for TAI, there is no extremely sudden jump from TAI to ASI in a day or less. Instead, this transition takes between 6 months to 3 years and requires additional compute and training recipe improvements. The pace still accelerates, e.g. because a lot of experiments can be fully handed off to automated AI agents now, and human contributions transition to being almost exclusively conceptual and legal (e.g. formally holding responsibilities).
No decisive lead before ASI for any single lab. At the time of TAI, no single lab has a decisive lead. At least three AI developers are within 6 months of each other in terms of capability progress. Top researchers regularly move between these AI labs, and details about the latest training improvements get leaked fairly quickly between them. Thus, whenever one company has a new algorithmic improvement, its competitors adopt it within a few months.
All companies understand their incentive structure. The leadership of each AI company understands their incentives very well. They understand that they are racing intensely. They understand that being first to ASI provides an incredible advantage (financially, strategically, ~everything). All companies at the frontier are also big enough that they cannot entirely ignore normal big-company incentives. For example, they know that their next scale-up will require more compute, and thus, they have to keep their current and potential future investors happy. Instrumental convergence has taken its path, and it seems extremely unlikely that any company will voluntarily pause or merge with another.
Labs are not prepared for the jump from AGI to ASI. The AI developers can barely handle transformative AI, e.g. regular failures and some instances of non-catastrophic scheming (see above). However, they don’t have a clear plan to scale from AGI to ASI. For example, they don’t have any theoretical guarantees or even convincing empirical scaling laws for their safeguards. They lack physical and cyber security. Their internal company culture does not seem adequate for building and deploying the most powerful technology in human history.
Government
No major regulation. The US has not passed any major legislation on the safety of frontier AI models. The EU might have some requirements here and there, but they have all been watered down to the point where they couldn’t meaningfully prevent catastrophe. The regulation is better than it is today (i.e. in early 2025), but it’s not nearly adequate for the capability of AI systems. A lot of the regulation heavily relies on voluntary efforts and the good intentions of AI developers. But, due to the race dynamics, the goodwill of AI developers has decreased over time.
Not sufficient political will for drastic action. The median citizen really doesn’t like AI. They feel like it takes away their jobs and their purpose in life, but there is no sufficiently powerful social movement against it. At least in the beginning, it’s not a core topic for most voters and thus doesn’t translate into political will. In cases where there is enough political will, it is mostly about job losses instead of catastrophic harm or national security. On the other hand, AI companies make a lot of money and increase GDP. Countries love to show off their respective AI champions and are very afraid of overregulating the industry. Thus, the average politician is broadly in favor of “AI progress” and treats it mostly as “just another technology.” There are exceptions to this sentiment, e.g., policymakers who think a lot about national security and the military treat AI as an incredibly important strategic technology rather than a mere economic tool.
Ideological camps are strong. Most debates about AI get derailed into various other debates. The most prominent ideological clash is the “regulation harms innovation” meme. Sometimes, it’s about the US vs. EU mentality. Sometimes, it’s about US vs. China national security. Sometimes, it’s about any other topic that is currently hotly debated. Most politicians have no strong opinions about AI and opportunistically switch positions based on their constituents' opinions. Very few politicians deeply understand the topic and care about it enough that they would push hard for any specific outcome.
Existing governmental institutions are not prepared for the jump from TAI to ASI. The majority of politicians don’t understand AI well enough to have clear opinions. The government has made a medium-effort attempt at building internal capacity, e.g. with AISIs, but never given it a lot of funding or a strong mandate. Lots of other departments of government have also made a medium-effort attempt to do something about AI, but none of these efforts resulted in something that could realistically handle the situation that is about to come. There are no detailed plans about what to do with such a powerful technology, how to control it, how to govern it, how to distribute its benefits, or how to mitigate its risks. There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
Military & Intelligence
The military & intelligence services are medium-involved with private AI efforts. The military & intelligence agencies have long been in direct contact and partnership with the leadership of AI developers. They try hard to understand the technology, both the potential and risks. There have been some military & intelligence-internal exercises about what a fast takeoff could look like and what it would imply for different actors. They understand AI primarily as a very strategically important technology and are worried about misuse. They also understand that rogue AIs could be a really big problem once we get to ASI but are more concerned about other actors getting to ASI first than losing control of the ASI itself. While the military & intelligence agencies help the labs with cyber and physical security, they have not appropriated the labs. The situation is a weird, brittle middle ground, where both the AI companies and the military are kinda half in control. Overall, the distribution of responsibilities is fairly unclear to both parties.
The military & intelligence are mostly unprepared for ASI. While the military has a fairly safety-oriented culture and understands the importance of powerful strategic technology, they are still not really prepared for a jump from TAI to ASI. The military & intelligence leaders don’t understand the details of how the AIs are trained, what they can and cannot do, and how fast they should expect progress to be. They have to rely on the technical expertise of the AI companies to assess most new developments. Furthermore, just understanding that something is important does not imply being prepared for it. There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
International players
The West mostly rallies behind the US. Most Western countries publicly support the actions that the US is taking, e.g. preventing China from getting access to the most powerful GPUs. This has some smaller implications about legitimacy and international agreements, but not many other countries matter in AI. The UK is the only other country that has taken AI seriously early enough and taken preparatory actions. They have a good bilateral strategic partnership with the US. The Netherlands has some leverage due to ASML. Every other country can contribute a bit through diplomatic support but is otherwise unable to help.
The big international conflict is the US vs. China. The US and China are the two countries that both have AI researchers in their country that can build frontier AI models and have the state capacity to steer development. This conflict is primarily about global power and fears of misuse or power grabs by the respective other country, treating AI primarily as a strategic tool. In both countries, there are some people who worry primarily about rogue AI scenarios. These communities (e.g. various academics) continue to exchange ideas across borders and are largely not involved in the power dynamics.
Society
Most people are acutely aware of AI but powerless. Only a tiny fraction of the population has influence over the AI trajectory. These are the employees at AI companies and some people in the government and military. Almost everyone else does not have the chance to meaningfully do something about it. The general sentiment in the population is that AI is going to alter the world a lot (similar to the societal awareness during the peak of the Cold War). Most people are aware of their powerlessness and express frustration about being bystanders to their own fate. AI is altering the world in so many ways at the same time that there is not a lot of targeted political action despite large shifts in popular sentiment.
The powder keg
Overall, this situation is very brittle. Within 6 months to 3 years, at least one country will go from TAI to ASI, and no company or institution is prepared for it.
All problems intensify at the same time. All of the following go from “potential problem to very serious possibility” very quickly and at approximately the same time, e.g. when frontier AI systems are so capable that using them can give you a 10x productivity advantage over not using them for almost all jobs.
- International conflict: Either because another actor has reached a strategic advantage in AI development or in anticipation thereof. This escalates into a war.
- Misuse from a single actor: A single country or powerful malicious organization uses a powerful AI system in order to cause large-scale harm, e.g. by developing a novel pandemic.
- A single company becomes too powerful: As a result of being able to outperform approximately every human at every task, this company gets incredible economic and political power within a short period of time. The company is so powerful that no government is realistically able to control it. This company then decides to grab power.
- A single country becomes too powerful: As a result of controlling ASI, a single country has a decisive strategic advantage over every other country and decides to grab power.
- Rogue AI: No person, company, or government actually controls the AI. The AI is sufficiently misaligned (or becomes misaligned over time) that it decides to take over control.
The local incentives of every actor amplify race dynamics. The AI companies are intensely racing. They understand what the upside of this technology is, and they have a (potentially self-motivated) justification that someone else, e.g. a different country or company, shouldn’t get to AGI first. The governments are both interested in the economic benefits of AI and the strategic advantages that AI might provide for their country and are thus willing to race ahead. The military sees AI as a very strategically important technology and thus prioritizes capability progress over safety. It’s unclear what goals the AI has, but in the case it is misaligned, it is incentivized to emphasize the risks of not being first and encourages everyone to race. I guess the deceptively aligned AI will chuckle about the easy hand it has been dealt.
Everyone blames everyone else. The relevant actors in that situation understand that this is an important, high-stakes situation. The sentiment is much less of a “start-up vibe” and much more of a “war room vibe” than it was in 2020. Nevertheless, because their local incentives are bad, everyone continues to race. Due to a mix of a) people correctly identifying their local incentives, b) motivated reasoning, c) high uncertainty, d) high pressure, and e) unclear credit assignment, everyone blames some other actor for their decision-making. Something like “my hands were bound because the other country was racing” or “I wanted to stop, but the military wouldn’t have allowed me to” will go into the history books (assuming there is still someone around to read them).
Closing thoughts
This scenario is unsatisfying. I would much rather prefer if there was a single component that I could confidently point to and declare that it is much more important than the others. It would make it much easier for me to backchain from that event, focus on a small number of important problems, and work on their solutions. However, because I expect that we have hit a sweet spot of alignment where we can sufficiently align TAI for it to be a major strategic advantage but don’t know how to align ASI and take-offs are slow enough that humans still have a lot of influence during the transition from TAI to ASI, every problem intensifies at the same time. This makes it really unclear what to work on.
Secondly, because everything is so messy, I don’t have a good answer to “What is the exact story of how AI goes wrong?” It’s just lots of little pieces coming together in a really bad way.
Triage mode. In a world where these timelines are true, and the default bad outcome is “catastrophe through chaos,” I think a different mentality is required. There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way. There is no time for shiny projects. Everything that doesn’t have a clear theory of impact within one year feels like it has a low probability of getting both fundamental research and real-world implementation right. I don’t think everyone should switch to that mentality, but I think an increasingly large number of people should switch to “triage mode” with further evidence of short timelines.
2 comments
Comments sorted by top scores.
comment by Noosphere89 (sharmake-farah) · 2025-01-31T17:22:07.071Z · LW(p) · GW(p)
I think the catastrophe through chaos story is the most likely outcome, conditional on catastrophe happening.
The big disagreement might ultimately be about timelines, as I've updated towards longer timelines, such that world-shakingly powerful AI is probably in the 2030s or 2040s, not this decade, though I put about 35-40% credence in the timeline in the post being correct, though I put more credence in at least 1 new paradigm shift before world-shaking AI happens.
The other one is probably that I'm more optimistic in turning aligned TAI into aligned ASI, because I am reasonably confident both in the alignment problem is easy overall, combined with being much more optimistic on automating alignment compared to a lot of other people.
comment by Martín Soto (martinsq) · 2025-01-31T17:19:17.724Z · LW(p) · GW(p)
Fantastic snapshot. I wonder (and worry) whether we'll look back on it with similar feelings as those we have for What 2026 looks like [LW · GW] now.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.
This makes it really unclear what to work on.
It's not super obvious to me that there won't be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leveraged, since they're upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like "a simple cooridnation mechanism", and more like "a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements". Of course, similarity to past geopolitical situations does make it seem unlikely on priors.
There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.
My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.