ASI existential risk: Reconsidering Alignment as a Goal

post by habryka (habryka4) · 2025-04-15T19:57:42.547Z · LW · GW · 4 comments

This is a link post for https://michaelnotebook.com/xriskbrief/index.html

Contents

  Biorisk scenario
  The Vulnerable World Hypothesis
  Loss of control to ASI
  Conclusion
  Acknowledgements
4 comments

A post by Michael Nielsen that I found quite interesting. I decided to reproduce the full essay content here, since I think Michael is fine with that, but he should feel free to let me know if he'd prefer I only excerpt it.


This is the text for a talk exploring why experts disagree so strongly about whether artificial superintelligence (ASI) poses an existential risk to humanity. I review some key arguments on both sides, emphasizing that the fundamental danger isn't about whether "rogue ASI" gets out of control: it's the raw power ASI will confer, and the lower barriers to creating dangerous technologies. This point is not new, but has two underappreciated consequences. First, many people find rogue ASI implausible, and this has led them to mistakenly dismiss existential risk. Second, much work on AI alignment, while well-intentioned, speeds progress toward catastrophic capabilities, without addressing our world's potential vulnerability to dangerous technologies.

Why do thoughtful, well-informed people disagree so strongly about existential risk (xrisk) from ASI? As I'm sure you're aware, there is an enormous gap between distinguished scientists and technologists on this issue: on one side are those who believe ASI poses an xrisk, and on the other are many who are dismissive of such concerns.

Why is there such strong disagreement between well-informed people, on an issue so crucial for humanity? An explanation I often hear is that it's due to differences in short-term self-interest – this is sometimes coarsely described as the xrisk-concerned wanting publicity or making money through safety not-for-profits, while the xrisk-skeptics want to get richer and more powerful through control of AGI. But while those explanations are convenient, they're too glib. Many prominent people concerned about xrisk are making large financial sacrifices, sometimes forgoing fortunes they could make working toward ASI, in favour of merely well-paid work on safety7. And while it's true xrisk-skeptics have an incentive to make money and acquire power, they have far more interest in remaining alive. I believe the disagreement is due primarily to sincerely held differences in underlying assumptions. In this text we'll try to better understand those differences.

For this discussion, we'll assume ASI has been achieved. That is, we have systems with superhuman performance across most domains of human endeavour. Thus, this isn't about the systems in widespread use today, such as ChatGPT, Claude, and Gemini. It's about systems which can outperform humans at virtually any cognitive task, while also being able to act in the physical world8. We're already seeing early hints of what this might entail: AlphaGo Master so thoroughly mastered Go that Ke Jie, one of the greatest ever human players, described it as a "God of Go," and said it revealed "not a single human has touched the edge of the truth of Go." Similarly, DeepMind's AlphaChip system has helped design several generations of Google's TPU chips, and "generates superhuman or comparable chip layouts in hours, rather than taking weeks or months of human effort". But ASI would have abilities far beyond these narrow examples – imagine AlphaGo's Move 37, not as a one-off insight, but a trillion-fold, pervasive across multiple domains in the world9. Of course, such systems will still be subject to constraints from physics, chemistry, biology, economics, computational complexity, and so on. But they will demonstrate capabilities that seem near-magical today, much as many human abilities appear incomprehensible to other primates. Some skeptics question whether 'LLMs can ever become ASI', but for us it doesn't matter whether ASI has been achieved through LLMs or some different approach. Others counter that ASI is many decades away – which may or may not be true, but doesn't change the fundamental concerns if we eventually achieve it.

Finally, two notes on the approach of the text. First, this talk explores topics that have been investigated by many researchers. The relevant literature is vast, and my citations are more representative than complete. My apologies to those whose work is unfairly omitted. Second, my overall framing emphasizes existential risk ("xrisk"), but much of the discussion involves catastrophic risks – scenarios that won't directly cause extinction, but might "only" kill hundreds of millions or billions of people. Obviously, there's a spectrum from catastrophic risks through to truly existential risks, and catastrophic events might trigger further crises (war, infrastructure breakdown, ecological collapse) ultimately leading to extinction. Because the analysis of these risks overlaps, with insights from one informing our understanding of the other, I'll use "xrisk" as a shorthand throughout, distinguishing between them when necessary for the specific argument at hand.

Biorisk scenario

Let's discuss a concrete xrisk scenario involving bioweapons. While you may have heard such examples before, it's valuable as a setting to develop both skeptical arguments and reasons for concern. As is common in hypothetical scenarios, it begins as a simplified sketch, but we'll then iteratively improve it through critical examination. The scenario is not intended to convince anyone to be concerned about existential risk. Rather, it's a vehicle to help understand and illustrate patterns of argument and counter-argument common in discussions of ASI xrisk.

The scenario begins with a doomsday cultist, like those in the Aum Shinrikyo cult, asking an ASI: "Please design an airborne virus which spreads as rapidly as measles, is as lethal as ebola, undergoes asymptomatic spread like Covid, and which is good at evading known vaccine techniques. And please help me fabricate and release it."

An immediate skeptical response is: "But we will align our AI systems to prevent ASI from helping with dangerous requests." Certainly, this has been done with some success by ChatGPT, Claude, and Gemini. However, there is tension between such safety guardrails and the desire to build systems which seek out truths without barrier. How do we decide the boundary between "safe" truths and dangerous truths the system should not reveal? And who decides where that boundary lies? This point has been made succinctly by Elon Musk, who said "What we need is TruthGPT" that "tries to understand the nature of the universe". National militaries seem unlikely to accept sanitized truth-seeking capabilities. Neither will many open source groups, or idealists who deeply value understanding truth and nature. All these groups may set the boundary of acceptable truths very differently than in current consumer products. Techniques used to build "safe" systems are easily repurposed to build less safe models, reducing the expense of alignment work, while increasing truth-seeking power. We already see this with DeepSeek, which appears to have few or no biorisk guardrails10.

The underlying problem is that it is intrinsically desirable to build powerful truth-seeking ASIs, because of the immense benefits helpful truths bring to humanity. The price is that such systems will inevitably uncover closely-adjacent dangerous truths. Deep understanding of reality is intrinsically dual use. We've seen this repeatedly through history – a striking case was the development of quantum mechanics in the early twentieth century, which helped lead to many wonderful things (including much of modern molecular biology, materials science, biomedicine, and semiconductors), but also helped lead to nuclear weapons. Indeed, you couldn't get those benefits without the negative consequences. Should we have avoided quantum mechanics – and much of modern medicine and materials and computation – in order to avoid nuclear weapons? Some may argue the answer is "yes", but it's a far from universal position. And the key point is: sufficiently powerful truths grant tremendous power for both good and ill.

Even if such truth-revealing models are never built, it's likely easy to remove alignment barriers. The biologist Kevin Esvelt found it cost only a few hundred dollars to finetune an existing open source model so it would be far more helpful in generating pandemic agents, concluding11:

Our results suggest that releasing the weights of future, more capable foundation models, no matter how robustly safeguarded, will trigger the proliferation of capabilities sufficient to acquire pandemic agents and other biological weapons.

The fundamental asymmetry is that "understand reality as deeply as possible" is a simple, well-defined goal, grounded in objective reality, while creating "aligned" systems requires building complex, subjective, hard-to-agree guardrails on top of that reality. So truth-seeking has a clear target, while alignment requires constantly shifting definitions based on social consensus. This asymmetry makes it intrinsically far easier to build unrestricted systems than aligned ones. And so alignment is intrinsically an unstable situation, ripe for proliferation.

Another skeptical response to the initial scenario is: "Isn't such a virus biologically implausible? How do we know it can exist?" It's true that we don't know for sure that it can exist. But there's a considerable chance. As the query implies, viruses with any single characteristic are known. And we're gradually getting better at engineering combinations. For example, in the 2000s, humans accidentally engineered a mousepox variant that was 100% lethal to mice vaccinated against mousepox12. Even without ASI, we are gradually understanding how to engineer increased spread, lethality, immune evasion, and other qualities. Indeed, it may be possible to do even worse than the query implies — viruses are known which have case fatality rates near 100%, such as the MSW strain of Californian myxoma virus in European rabbits13. And, in any case, "only" a 50% or 90% or 99% death rate seems like small comfort, and would leave humanity appallingly vulnerable to further death due to failed infrastructure and war.

Another skeptical response is: "But can't you just Google this, and put it together out of the scientific literature? How does AI buy you anything?" With early AI systems, this was a reasonable response. Anyone who has seriously used today's models (much less far more capable future ASI) knows the response is becoming more reliant on the provision of the unstable guardrails mentioned above. Today's models are increasingly good at synthesizing a wide variety of sources, drawing together knowledge in easy-to-act-on ways. They are still subject to hallucination and have trouble sorting reliable results from unreliable. But they are getting better, and ASI will be far better, especially as it attains more capacity to do physical experimentation and to act in the world. Even if it's "merely putting together knowledge out of the scientific literature", that is in any case much of what an outstanding scientist does. Indeed, AlphaFold and similar systems surpassed the best human scientists at the design of proteins, merely by training on widely-available public data. The underlying tension is: the better the AI systems, the more capable they intrinsically ought to be at solving problems like that in the query. If you are pessimistic about AI's ceiling, and think such systems won't ever be very capable, then you won't be concerned. But if you believe ASI will be extraordinarily capable, like having a team of brilliant scientific advisors, then you must rely on guardrails built into the system, not any fundamental lack of capability.

Another skeptical response: "But are such cults for real? Surely the Aum Shinrikyo cult was a strange one off, or exaggerated or misunderstood? No-one really wants the end of the world!" In fact, there is a long history of individuals or groups who have sought to bring about apocalypse. Often this involves significant mental illness, and the individuals or groups involved aren't especially competent (fortunately). But it doesn't change the fact that world-ending intent is more common than one might a priori assume. Sometimes the efforts involve intended collective suicide by the people hoping to bring apocalypse about; other times those people expect to form part of a small elect who will be saved, often to help make a better world. I won't attempt a history of such efforts here. But I will mention a little about the remarkable history of Aum Shinrikyo, as illustration.

Aum Shinrikyo began in 1987, and grew rapidly, acquiring more than 10,000 recruits by the mid-1990s. Many were educated Japanese in their 20s and 30s, some with significant expertise in science and technology. The cult initially predicted apocalypse in 1997, starting with a nuclear conflict between the US and Japan, which would escalate and cause the destruction of most of humanity. Aum members were to be an elect who were saved. After a failed attempt to gain political power in 1990, top cult leaders shifted from merely predicting apocalypse to actively bringing it about14. They ran programs to develop chemical and biological weapons and missiles, and purchased a sheep station in Australia to prospect for uranium. Their chemical and biological weapons efforts were especially successful, and according to historian Charles Townshend15, "police raids found enough sarin in Aum's possession to kill over four million people." They made multiple attempts to deploy sarin gas and other chemical and biological weapons. Fortunately, all had limited success, although dozens of people were killed, and thousands affected. While most cult members were unaware of these efforts, some cult members genuinely were working to instigate apocalypse. Aum were more competent (and thus are better known) than many other groups with apocalyptic intent; part of the concern with ASI is that it may greatly increase the supply of apocalyptic competence.

We've merely explored the surface of this scenario, to give a sample of thinking about ASI xrisk. Of course, this scenario can be developed through many more iterations, going deeper and deeper. That would be valuable, but isn't the point of this text. Instead, I'll finish the section by mentioning the best skeptical response of all: "Even if this scenario is biologically and psychologically plausible, and alignment of single AI systems doesn't buy us much, humanity will prevent it in other ways. For instance, we can do it by controlling biological synthesis equipment, ensuring viral sequences are screened, involving law enforcement and intelligence agencies, and generally deploying all the different institutions we humans use to ensure safety. And we'll have ASI to help us too!"

This is the strategy humanity has historically used to defend against threats: we (mostly) allow innovation, and respond to problems as they arise. This just-in-time coevolutionary response strategy works well, since it means we get the benefits of innovation, while mitigating the risks. It's true there are major technological problems which have been challenging for humanity to deal with in this way – consider asbestos, leaded petrol, climate change, and similar examples. But for each such example there are thousands of innovations where existing institutions largely overcame early problems. In the short term such a just-in-time safety strategy will work well. Indeed, AI and AGI and ASI will lead to many extraordinary beneficial innovations. But over the medium term, this strategy is much harder to defend, for reasons I'll discuss below.

The Vulnerable World Hypothesis

Underlying the biorisk scenario is a more fundamental question: "Is there a 'recipe for ruin', some cheap-and-easy-to-make-and-deploy technology capable of ending humanity? Or causing say > 90% human deaths?" The hypothesis that such a recipe exists is what Nick Bostrom has called "the Vulnerable World Hypothesis"16. While we have survived technological advances so far, the universe offers no guarantee all such advances will be manageable. Like explorers in uncharted territory, we could suddenly find ourselves facing an insurmountable danger. Fortunately, we don't yet know of any recipes for ruin. But an unpleasant parlour game is identifying candidates that seem plausible with foreseeable technology. We've briefly discussed one already: the engineered virus of the last section. Let's look even more briefly at a few others. As in the last section, the purpose here isn't to develop detailed, convincing scenarios. It is rather to begin to explore the range of possibilities.

One plausible example is mirror bacteria: hypothetical human-engineered organisms where some existing bacterium has been duplicated, but all molecules have been reversed into their chiral mirror images. Some scientists believe conventional immune defenses would fail to recognize such mirror bacteria, and they would spread unchecked, overwhelming the biosphere, and leading to a mass extinction of many species. Until recently, some basic elements of mirror life were being actively pursued by curiosity-driven scientists. However, many of those scientists recently joined together to publish an important piece in Science17, explaining the dangers, and publicly abandoning their work on mirror life.

Another example begins with easy-to-make nuclear weapons. This sounds implausible. But Ted Taylor, the leading American designer of nuclear weapons, thought they were very plausible, telling the writer John McPhee there is "a way to make a bomb… so simple that I just don't want to describe it". Taylor's remarks, published in a book by McPhee, stimulated at least two people to develop plausible designs for DIY nuclear weapons. These have not, so far as we know, been built, perhaps because the availability of fissile material remains a bottleneck. If that bottleneck can be removed, then it may be very difficult to avoid the widespread proliferation of weapons. Of course, on their own, easy-to-make nukes aren't directly an existential risk. However, they are bad news, and may well trigger other instabilities. A world with fewer than ten nuclear powers can perhaps avoid the use of nuclear weapons; that won't be true in a world with thousands of rogue nuclear actors. And it is easy to develop plausible scenarios in which "small-scale" use by rogue actors escalates to trigger a major nuclear exchange. Whether such an exchange could truly destroy humanity is debated; however, the combination of immediate deaths, nuclear winter, and collapse of global infrastructure would be catastrophic, and at least plausibly threaten human extinction. Taylor believed "Every civilization must go through [its nuclear crisis]… Those that don't make it destroy themselves. Those that do make it wind up cavorting all over the universe"18.

A skeptic of ASI xrisk may point out: "It's not really ASI you're concerned about here. It's increased scientific and technological ingenuity, harnessed for destruction." This is correct. The Vulnerable World Hypothesis is a broad question about the nature of the universe and technology. It is not about ASI specifically. The fundamental concern is whether dangerous capabilities are latent in our universe, and likely to be discovered. However, ASIs which speed up science and technology will act as a supercharger, perhaps able to rapidly uncover recipes for ruin that might otherwise have taken centuries to discover, or never have been discovered at all. Today, it takes considerable expertise and resources to work on superviruses, mirror life, or nuclear weapons. But ASI will greatly reduce those requirements. And, as discussed earlier, while single aligned ASIs may refuse to help, proliferation makes this unstable. It's the existence of ASI which is the problem here, regardless of the alignment of single systems. You cannot put an impermeable barrier around understanding and controlling reality when you have built systems with intellectual capacity beyond von Neumann or Einstein, which are also exceptionally capable of building and operating in the physical world. And if you can make one such artificial mind, you can scale up to a million, and make them each a thousand times faster. We should expect a major discontinuity in individual capability, unless individuals are denied access to the fruits of these systems.

A skeptical response to the Vulnerable World Hypothesis is: "Well, most people don't want to blow up the world. So provided you can keep tabs on the relatively few actors who perhaps do, you should be okay." That's fair enough. Unfortunately, a lot of people do strongly desire power and the ability to dominate others. It seems to be a strong inbuilt instinct, visible everywhere from the everyday minutiae of human behaviour to the very large: e.g., colonial powers destroying or oppressing indigenous populations, often not even out of malice, but indifference: they are inconvenient, and brushed aside because they can be. We humans collectively have a considerable innate desire for power over people who are defenseless to stop it. But in a Vulnerable World, the competitive drive to ratchet up that exercise of power to dangerous levels will be enormous.

A common response to concerns about ASI risk is to propose "uplift" approaches – using brain-computer interfaces, genetic engineering, or other enhancement technologies to augment human capabilities. While this seems superficially promising, it misunderstands the fundamental issue posed by the Vulnerable World Hypothesis. The problem isn't whether intelligence is carbon or silicon-based, but that increased intellectual capability leads to increased power and access to catastrophic technologies. Uplift approaches seem likely to increase rather than decrease the danger of developing recipes for ruin. And they create other problems – e.g., BCI companies (or regulators) able to re-shape human thought in dictatorial ways. The issue isn't which substrate intelligence runs on, but the dual-use nature of deep understanding of reality19.

Let's return to what I described in the last section as the best skeptical response, but in the more general context of the Vulnerable World Hypothesis: "Sure, maybe we live in a Vulnerable World, and unrestricted ASI may help discover recipes for ruin. However, those technologies won't arise in a vacuum. Humanity will see many smaller problems along the way, and respond to manage the issue, same as we ever have. Capabilities and safety naturally co-evolve. Let's address problems as they arise, just as we did with the ozone hole and the Montreal Protocol, or nuclear weapons and the non-proliferation treaty. Let's not worry prematurely about abstract future concerns. Institutional safety and real-world barriers will keep us safe."

Such actions are important. But this response assumes the capabilities of ASI will improve slowly enough for our institutions to adapt. History suggests technological discontinuities can shock even top experts and leave institutions out of their depth. Recall Admiral William Leahy telling President Truman that the atomic bomb project was 'the biggest fool thing we have ever done. The bomb will never go off, and I speak as an expert in explosives.' Leahy had taught physics and chemistry at Annapolis and was Chairman of the Joint Chiefs. He was completely wrong. Sometimes discontinuities happen. ASI will be a genie whose consequences, some positive and some very negative, can't easily be put back in the bottle. And, as the nuclear example suggests, relatively simple offensive technologies can sometimes be very difficult to defend against; there is no intrinsic requirement for symmetry.

Loss of control to ASI

The most discussed instance of the Vulnerable World Hypothesis is the rogue ASI takeover scenario. This is sometimes summarized briefly as: "Just as chimpanzees ought to be wary of human beings, so human beings ought to be wary of creating more capable potential successor entities". Insofar as ASIs are agents with their own agenda, and have some freedom to act, they may end up taking more and more control, and eventually using Earth for their own ends, even if those ends damage humanity. Power, once ceded, is difficult to reclaim.

These loss-of-control arguments have been widely discussed, in influential writings like those of Nick Bostrom, Stuart Russell, and Eliezer Yudkowsky, and in more recent work like the scenario released by Daniel Kokotajlo et al, as well as in many other places20. And so I'll merely briefly examine a few of the standard skeptical responses, mostly to deepen the context of discussion. With that context, I'll then discuss ways in which over-emphasizing loss-of-control may lead people to miss the broader challenge posed by the Vulnerable World Hypothesis, and how this has led to mistakes in the strategy taken toward existential risk.

A standard skeptical response is21: "But we won't make agents with their own interests and power-seeking drive". This seems implausible for several reasons. Creating such agents will be intellectually satisfying for developers, and economically profitable for investors, at least in the short run. Romance-bots, personal assistants, agents of persuasion, military robots, and financial trading systems will all function "better" when endowed with considerable agency. This process is already well underway, driven by powerful market incentives that reward increasing agency without equivalent rewards for safety.

Another skeptical response is: "We won't be willing to cede much power to inhuman entities". History suggests otherwise. We already delegate life-and-death capability to guided missiles, which sometimes make fatal errors. We entrust financial decisions to trading algorithms, which contribute to events like the 2010 Flash Crash, where markets lost more than a trillion dollars in 36 minutes. Perhaps most strikingly, the Soviet Union's Dead Hand system was designed to launch nuclear counterstrikes automatically if the Soviet leadership was killed. We have already delegated world-ending authority to automated systems.

Another skeptical response is: "We can simply turn them off". This is unrealistic. Could we "turn off" platforms like Facebook and X if they were net negative for our society? It's not so easy: they are deeply integrated into our society, and powerful interests become aligned with them over time, making them harder to control. A better skeptical response is that we will control these ASI systems using many of the same mechanisms we use to control other powerful actors: institutions, norms, laws, and education. However, that is only viable if those governance mechanisms keep pace with the capabilities of the powerful actors. That's going to be an enormous challenge!

The rogue ASI scenario is the most commonly discussed xrisk from ASI. This is a mistake, for two reasons. One reason is that people fixate on whether ASIs getting out of control is plausible. Many people are used to technology being something human beings ultimately control, and find it hard to believe that might one day no longer be true. Unfortunately, if they don't buy loss-of-control scenarios they may then dismiss xrisk22. But the fundamental underlying issue isn't machines going rogue (or not), it's the power conferred by the machines, whether that power is then wielded by humans or by out-of-control machines. That is, the issue is with the Vulnerable World Hypothesis, considered broadly, not rogue ASI.

A second reason to dislike overemphasis on rogue ASI scenarios is that it leads to badly mistaken actions. Many well-intentioned people, worried about rogue ASI, have gone into alignment and interpretability work. Organizations fund companies developing "aligned" AGI and ASI. These actions make sense if retaining control of ASI is the key issue. However, if the fundamental challenge is the dual-use nature of reality, then those actions will ultimately be counterproductive. It is not control that fundamentally matters: it's the power conferred. All those people working on alignment or control are indeed reducing certain kinds of risk. But they're also making the commercial development of ASI far more tractable, speeding our progress toward catastrophic capabilities. In the short term, these well-intentioned researchers will help create many extraordinary new things in the world (and be well rewarded for it). But they will also help build a largely invisible latent overhang in dangerous capabilities, which our safety-supplying institutions will struggle to keep pace with. In this view, alignment work is a form of market-supplied safety. But it leaves critically undersupplied the crucial non-market parts of safety – preventing the proliferation of the means to easily develop dangerous technologies, up to and including recipes for ruin.

When I point this out people sometimes misunderstand me as saying "Oh, you're saying rogue ASI won't occur, and that's wrong because […]". Or they say "Oh, you're worried about misuse, not agentic ASI". I am emphatically not saying either of these things. I'm saying the issue of whether ASI gets out of control is not fundamental to the discussion of whether ASI poses an xrisk or how to avert it. Treating it as fundamental has led to bad mistakes: overemphasis on this particular scenario (a) has led many people to mistakenly reject xrisk; and (b) has led many of those who accept xrisk to focus on alignment work, reasoning that it prevents rogue ASI, while paying insufficient attention to the fact that it speeds up ASI without addressing the fundamental existential risks.

Conclusion

That's a small taste of scenarios for existential risk from ASI. Such scenarios can be developed much further, but uncertainties remain. Some argue there's insufficient evidence to justify serious concern, wanting concrete, direct proof of catastrophe before they're persuaded. Others, myself included, see such scenarios, despite some uncertainties, as sufficiently compelling to warrant deep concern. Thus, strong disagreement about ASI xrisk arises from differing thresholds for conviction, and differing comfort with reasoning based in part on toy models and heuristic arguments.

A similar disagreement marked the early study of anthropogenic climate change. In the early twentieth century, leading scientists like Arrhenius and Angstrom reached opposite conclusions because they believed different toy models. The result was sharp disagreement among scientists about the plausibility of climate change. It wasn't until the 1950s and 1960s that satisfactory models began to be developed. Much observation and improvement of those models led in the 1980s and 1990s to the modern consensus that human emissions of greenhouse gases are warming the climate. By contrast, for ASI we're still at the toy model stage. Furthermore, while climate can plausibly be predicted using detailed physical models, ASI is subject to a wildcard factor: ASI acting in some decisive way that we intrinsically can't predict in advance, since ASI is by definition far superior to humans in intellect. There is, as Vernor Vinge so memorably put it, an "opaque wall across the future".

There are many other challenges. Much writing about ASI xrisk has been bombastic, overly speculative, or alarmist without being constructive23. It's sometimes full of overconfident assertions or outright holes, and may confuse narratively plausible fiction with fact. None of this inspires confidence, and many people dismiss ASI xrisk as just another misleading doom narrative, often thinking of earlier panics over the population bomb or the Y2K bug.

On the other side, I've met many ASI xrisk skeptics who are dismissive, but who have also never seriously attempted to understand the best forms of the arguments they confidently dismiss. They often either refuse to engage at all, or only engage defensively, looking for minor holes that can justify dismissal, rather than trying to understand the best form of the argument.

Reasons for dismissal vary from person to person, but a few motives recur. One barrier is the strangeness of the idea of xrisk. Usually it's a good idea to ignore wild-sounding scenarios! But sometimes the world really does change. One thinks of aristocrats in 1780s France, utterly confident in their position, and unable to conceive of the guillotining of the King and the French Republic, just a few years away.

Another barrier is the sheer unpleasantness of engaging with scenarios involving large-scale death. People naturally don't want to believe it's plausible, or to think about it24. Add to that the difficulty of assessment: if you're not a biologist or chemist or physicist or computer security researcher, it's difficult to know how seriously to take many of the scenarios. You may lack important expert intuitions for how much power is latent in the world25.

These barriers to engagement are reinforced by powerful incentives in the opposite direction. In the very near future, AI will (mostly) lead to improvements in the world, some enormous – new medical treatments! Wondrous new materials! More productivity in many sectors! These benefits will be real and profound, not hypothetical – they're natural extensions of capabilities already emerging. We'll solve many problems along the way, proving the just-in-time approach to safety works for many challenges. It's going to be amazing, and very exciting. In the short term, the people who worry about xrisk are mostly going to keep seeming wrong, and like they're holding back progress. It's natural to align with the seemingly optimistic vision, and then dismiss or ignore your apparent outgroup. You're aligning with a tribe that appears optimistic, and this also appears good for you personally – betting on AI will increase your wealth, power, and status, at least in the short term. Furthermore, economic growth is powerful! Far more capital will flow to companies making capability advances than to organizations focusing on non-market safety. Human beings naturally align to anticipated power and "success," often confusing power and glamour with right action26, especially when the downsides currently seem intangible. Capital reshapes people, aligning their beliefs with short-term corporate interests rather than with humanity's interests as a whole. What we see in the world is what gets amplified. It's no wonder so many people are so dismissive.

But reality doesn't care about human psychology. When alignment to anticipated power will lead to unhealthy outcomes, a thriving civilization requires people willing to act in defiance of the zeitgeist, not merely follow the incentive gradient of immediate rewards. I believe the arguments for xrisk are good enough that there is a moral obligation for anyone working on AGI to investigate this risk with deep seriousness, and to act even if it means giving up their own short-term interests.

So, what to do? Many ongoing efforts aim to reshape institutions and broader incentives to increase the supply of safety. These go under different rubrics with different foci, including differential technological development27, d/acc28, and coceleration29. All take seriously the potential for recipes for ruin and prioritize safety to head off such a possibility. The challenge is that in the short term, this work will often be less well-paid and less prestigious than working on AI models. We have institutions which reward work improving technology (including aligning it); when deeper understanding of reality is dual-use, this inadvertently rewards actions that increase existential risk. Making these efforts work will be extremely difficult, and success is not guaranteed. But the most meaningful contributions to humanity have often come from those willing to do what is right, even when it wasn't incentivized in the short term. Thanks for your attention.

Acknowledgements

Thanks to Ted Chiang, Toby Ord, and Hannu Rajaniemi for conversations which improved this piece.

4 comments


comment by Leon Lang (leon-lang) · 2025-04-15T23:14:19.892Z · LW(p) · GW(p)

I enjoyed reading the article! I have two counterarguments to some main points:


1. The article argues that alignment research might make things worse, and I think for many kinds of alignment research, that's a good point. (Especially the kind of alignment research that made its way into current frontier AI!). On the other hand, if we are really careful with our AI research, we might manage to align AI to powerful principles such as "Don't develop technologies which our human institutions cannot keep up with" etc. 

My impression is that for very dangerous technologies that humans have historically developed, it *was* possible for humans to predict possible negative effects quite early (e.g., the nuclear chain reaction was conceived much earlier than nuclear fission; we don't have mirror life afaik but people already predict that it could be really harmful). Thus, I guess ASI would also be capable of predicting this, and so if we align it to a principle such as "don't develop marginally dangerous technology", then I think this should in principle work.

2. I think the article neglects the possibility that AI could kill us all by going rogue *without* this requiring any special technology. E.g., the AI might simply be broadly deployed in physical robots that are much like the robots we have today, and control more and more of our infrastructure. Then, if all AIs collude, they could one day decide to deprive us of basic needs, making humans go extinct.

Possibly Michael Nielsen would counter that this *is* in line with the vulnerable world hypothesis since the technology that kills us all is then the AI itself. But that would stretch it a bit since the vulnerable world hypothesis usually assumes that something world-destroying can be deployed with minor resources, which isn't the case here.

comment by Mitchell_Porter · 2025-04-16T07:19:11.376Z · LW(p) · GW(p)

I knew the author (Michael Nielsen) once but didn't stay in touch... I had a little trouble figuring out what he actually advocates here, e.g. at the end he talks about increasing "the supply of safety", and lists "differential technological development" (Bostrom), "d/acc" (Buterin), and "coceleration" (Nielsen) as "ongoing efforts" that share this aim, without defining any of them. But following his links, I would define those in turn as "slowing down dangerous things, and speeding up beneficial things"; "focusing on decentralization and individual defense"; and "advancing safety as well as advancing capabilities".

In this particular essay, his position seems similar to contemporary MIRI. MIRI gave up on alignment in favor of just stopping the stampede towards AI, and here Michael is also saying that people who care about AI safety should work on topics other than alignment (e.g. "institutions, norms, laws, and education"), because (my paraphrase) alignment work is just adding fuel to the fire of advances in AI. 

Well, let's remind ourselves of the current situation. There are two AI powers in the world, America and China (and plenty of other nations who would gladly join them in that status). Both of them are hosting a capabilities race in which multiple billion-dollar companies compete to advance AI, and "making the AI too smart" is not something that either side cares about. We are in a no-brakes race towards superintelligence, and alignment research is the only organized effort aimed at making the outcome human-friendly. 

I think plain speaking is important at this late stage, so let me also try to be as clear as possible about how I see our prospects. 

First, the creation of superintelligence will mean that humanity is no longer in control, unless human beings are somehow embedded in it. Superintelligence may or may not coexist with us, I don't know the odds of it emerging in a human-friendly form; but it will have the upper hand, we will be at its mercy. If we don't intend to just gamble on there being a positive outcome, we need alignment research. For that matter, if we really didn't want to gamble, we wouldn't create superintelligence until we had alignment theory perfectly worked out. But we don't live in that timeline. 

Second, although we are not giving ourselves time to solve alignment safely, that still has a chance of happening, if rising capabilities are harnessed to do alignment research. If we had no AI, maybe alignment theory would take 20 or 50 years to solve, but with AI, years of progress can happen in months or weeks. I don't know the odds of alignment getting fully solved in that way, but the ingredients are there for it to happen. 

I feel I should say something on the prospect of a global pause or a halt occurring. I would call it unlikely but not impossible. It looks unlikely because we are in a decentralized no-holds-barred race towards superintelligence already, and the most advanced AIs are looking pretty capable (despite some gaps e.g. 1 [LW · GW] 2 [LW · GW]), and there's no serious counterforce on the political scene. It's not impossible because change, even massive change, does happen in politics and geopolitics, and there's only a finite number of contenders in the race (though that number grows every year). 

comment by Davey Morse (davey-morse) · 2025-04-16T07:37:02.034Z · LW(p) · GW(p)

I'm saying the issue of whether ASI gets out of control is not fundamental to the discussion of whether ASI poses an xrisk or how to avert it.

I only half agree.

The control question is not fundamental to discussion of whether ASI poses x-risk—agreed. But I believe the control question is fundamental to discussion of how to avert x-risk.

Humanity's optimal strategy for averting x-risk depends on whether we can ultimately control ASI. If control is possible, then the best strategy for averting x-risk is coordination of ASI development across companies and nations. If control is not possible, then the best strategy is very different and even less well-defined (e.g., pausing ASI development, attempting to seed ASI so that it becomes benevolent, making preparations so humans can live alongside self-directed ASI, etc).

So while it's possible that emphasis on the control question turns many people away from the xrisk conversation, I think the control question remains key for conversation about xrisk solutions.

comment by Jonas Hallgren · 2025-04-16T07:20:00.469Z · LW(p) · GW(p)

I would have wanted more emphasis on institutional capacity as part of the solution, but I think it is a very good way of describing a more generalised re-focus away from goodharting on sub-parts of the problem.

Now, that I've said something serious I can finally comment on what I wanted to comment on:

Thanks to Ted Chiang, Toby Ord, and Hannu Rajaniemi for conversations which improved this piece.

Ted Chiang!!! I'm excited for some banger sci-fi based on things more relating to x-risk scenarios, that is so cool!