Planning for Extreme AI Risks

post by joshc (joshua-clymer) · 2025-01-29T18:33:14.844Z

Contents

  The tl;dr
  1. Assumptions
  2. Outcomes
    2.1. Outcome #1: Human researcher obsolescence
    2.2. Outcome #2: A long coordinated pause
    2.3. Outcome #3: Self-destruction
  3. Goals
  4. Prioritization heuristics
  5. Heuristic #1: Scale aggressively until meaningful AI software R&D acceleration
  6. Heuristic #2: Before achieving meaningful AI software R&D acceleration, spend most safety resources on preparation
  7. Heuristic #3: During preparation, devote most safety resources to (1) raising awareness of risks, (2) getting ready to elicit safety research from AI, and (3) preparing extreme security.
    Category #1: Nonproliferation
    Category #2: Safety distribution
    Category #3: Governance and communication
    Category #4: AI defense
  8. Conclusion
  Appendix
    Appendix A: What should Magma do after meaningful AI software R&D speedups

This post should not be taken as a polished recommendation to AI companies and instead should be treated as an informal summary of a worldview. The content is inspired by conversations with a large number of people, so I cannot take credit for any of these ideas.

For a summary of this post, see the thread on X.

One of the timelines I explore in this post.


Many people write opinions about how to handle advanced AI, which can be considered “plans.”

There’s the “stop AI now plan.”

On the other side of the aisle, there’s the “build AI faster plan.”

Some plans try to strike a balance with an idyllic governance regime.

And others have a “race sometimes, pause sometimes, it will be a dumpster-fire” vibe.

So why am I proposing another plan?

Existing plans provide a nice patchwork of the options available, but I’m not satisfied with any.

Some plans describe what the world would do if everyone were on the ball. Others are ruthlessly practical, to the point where they dismiss potentially essential interventions like international governance. And still others include a hodgepodge of interventions that seem promising at the moment but might not hold up.

This plan is comparatively wide-ranging, concise, and selective in its claims.

The plan is for a responsible AI developer, which I call “Magma.” It does not directly apply to governments, non-profits, or external researchers. Magma has the sole objective of minimizing risks on the scale of AI or human ‘takeover,’ even if advancing this goal requires dangerous actions.

This is a dangerous plan for a dangerous world.

The plan includes two parts:

  1. Goals and strategies for achieving them.
  2. Prioritization heuristics that specify how the developer might allocate their limited resources between goals over time.

The plan is quite different from an “RSP” or a “safety framework.” It is not meant to be enforceable or allow an AI company to transparently justify its actions. It is more like a rough-and-ready playbook.

Like any playbook, it cannot be followed to the letter. As the game evolves, the players must respond to new information. While I try to cover strategies that are most applicable across scenarios, I am still obligated to quote General Eisenhower’s famous disclaimer:

“Plans are useless, but planning is indispensable.”

I plan in order to prepare. I wrote this post so that you and I can better decide how to use our meager time to get through what may be the most consequential years in human history. The stakes have never been higher, nor the need for plans greater.

The tl;dr

There are several outcomes that define a natural boundary for Magma’s planning horizon: human researcher obsolescence, a long coordinated pause, and self-destruction (each is described in section 2).

This plan considers what Magma might do before any of these outcomes are reached.

At any given point in time, Magma might advance the following goals:

These goals trade off, so how should developers prioritize between them?

I argue that before meaningful (e.g. >3x)[1] AI software R&D acceleration, Magma should scale aggressively, spend most safety resources on preparation rather than immediate mitigations, and devote most of that preparation to raising awareness of risks, getting ready to elicit safety research from AI, and preparing extreme security.

After meaningful AI software R&D acceleration, Magma might orient toward one of the previously mentioned outcomes (safely automated scaling to human researcher obsolescence, a long coordinated pause, or self-destruction). The question of which outcome(s) Magma should target depends on complex factors (Appendix A), so I mostly describe what Magma might do in the period leading up to meaningful AI R&D speedups.

1. Assumptions

The plan assumes the following:

To add color to these risks, imagine that future AI systems are like “a country of geniuses in a data center.” Now imagine these geniuses run at 100x the speed of normal humans. From the perspective of these AI systems, your fingers would move like glaciers as they fall toward your keyboard. In a few years, these AI systems might develop technologies that would otherwise not have arrived until 2050, granting them (or the humans who control them) overwhelming military force.

Figure 1. Synthetic bees pressing against a window pane. This is a scene from the Black Mirror episode “Hated in the Nation” that I often picture when I think about AI takeover. In the scene, self-replicating bee-like machines attack a human population – breaking windows and slipping through cracks in doors to deliver a fatal payload. It is a visceral depiction of a novel technology utterly dominating existing defenses.

I believe these dangers are plausible enough to dominate most common-sense cost-benefit analyses, which is why they are Magma’s sole focus in this post.

Additional assumptions include the following:

2. Outcomes

Before describing a ‘plan,’ it’s worth considering where Magma might eventually end up, since these outcomes affect what Magma should be planning for.

A few outcomes are salient [2]: human researcher obsolescence, a long coordinated pause, and self-destruction.

These outcomes define a natural boundary for Magma’s near-term planning horizon because, once they are achieved, Magma staff might have much less influence, or their situation will be substantially different.

By the time these outcomes are reached, Magma might not be distinctly Magma anymore. Magma might be part of a joint project; however, in this case, all the same considerations apply to their involvement in the collaboration.

2.1. Outcome #1: Human researcher obsolescence

If superhuman AI systems can be constructed safely, these systems would be extraordinarily useful. Safe superhuman AI could cook up safety recipes and continuously distribute them to other developers. They could build up defenses against the onslaught of dangerous systems following in their wake. The list goes on.

However, Magma does not need to create safe superhuman AI. Magma only needs to build an autonomous AI researcher that finishes the rest of the job as well as we could have. This autonomous AI researcher would be able to scale capabilities and safety much faster with humans out of the loop.

Leaving AI development up to AI agents is safe relative to humans if:

(1) These agents are at least as trustworthy as human developers and the institutions that hold them accountable.

(2) The agents are at least as capable as human developers along safety-relevant dimensions, including ‘wisdom,’ anticipating the societal impacts of their work, etc.

On the path to obsoleting human researchers, Magma might pass through several waypoints, shown in Figure 2.

Figure 2. A timeline in which Magma obsoletes human researchers. The specific dates are illustrative and unimportant.

Each milestone is explained below:

2.2. Outcome #2: A long coordinated pause

Creating wildly superhuman minds – and furthermore, placing this perilous task into the hands of AI agents that have only recently come into existence – sounds like a crazy thing to do.

Hopefully, society can agree this is crazy and coordinate a pause.

Someone once called this an “AI summer”: a chance to take a breather and enjoy the benefits of human-competitive AI agents before rocketing into the much more dangerous regime of superintelligence.

A long, coordinated pause is the most preferred outcome if it can be achieved.

Figure 3. A timeline in which Magma coordinates with competitors to slow down. The specific dates are illustrative and unimportant, and the pause could happen at any time (not necessarily after TEDAI).

2.3. Outcome #3: Self-destruction

If Magma does not automate scaling and other developers continue barreling forward, then Magma’s final option is to unilaterally slow down and, eventually, be decisively overtaken.

Figure 4. A timeline in which Magma self-destructs. The specific dates are illustrative and unimportant.

Ryan Greenblatt once called this “self-immolation”: to set oneself on fire, typically as an act of protest. I find this a fitting term. By giving up their top spot, Magma not only frees up resources for safety and reduces competitive pressure on the frontrunners; they also send a costly signal to the world that makes their advocacy more credible.

3. Goals

The previous section explained the endpoints where Magma might be headed. But what actions might Magma take along the way, and which of these endpoints should they steer toward?

At any given point in time, Magma might advance the following goals:

All of these goals can be advanced to differing degrees depending on the level of investment Magma applies to them.

4. Prioritization heuristics

The goals in the previous section trade off against each other, both because some are directly in tension and because Magma must distribute limited resources between them. So how might Magma prioritize between goals?

Magma might turn three “knobs”:

Magma adjusts these knobs over time like they are playing a perilous arcade game.

Figure 5. Prioritization knobs.

You might recall the three heuristics I introduced for controlling these knobs.

Before achieving meaningful (e.g. >3x) AI software R&D acceleration, they are: scale aggressively (Heuristic #1); spend most safety resources on preparation rather than immediate mitigations (Heuristic #2); and, within preparation, focus on raising awareness of risks, getting ready to elicit safety research from AI, and preparing extreme security (Heuristic #3).

After meaningful (e.g. >3x) AI software R&D acceleration, I find heuristics more difficult to specify. Magma might begin to orient toward one of the aforementioned outcomes (human researcher obsolescence, a long coordinated pause, or self-destruction). Deciding between these options requires weighing many factors (how willing are other actors to coordinate, whether slowing down would actually make advocacy more credible, and so on).

As a result, the remaining sections focus on what Magma might do in the period before meaningful (e.g. >3x) AI software R&D acceleration. Later periods are discussed in Appendix A.

5. Heuristic #1: Scale aggressively until meaningful AI software R&D acceleration

This heuristic recommends that Magma initially scale at roughly the maximum speed they can. Magma might spend some resources on safety preparation, but not enough to take meaningful (e.g. >5%) hits to velocity.

This strategy sure sounds reckless, so why is this a reasonable heuristic?

To determine how quickly to scale, Magma might compare the costs and benefits of scaling.

Before Magma achieves meaningful AI software R&D acceleration, the benefits of scaling probably outweigh the costs, mostly because the costs of scaling are low:

As capabilities improve, the costs of advancing capabilities increase, and the benefits decrease, such that slowing down becomes increasingly helpful for safety.

Combining these considerations implies that significantly slowing down before meaningful (e.g. >3x) AI software R&D acceleration is probably not beneficial; however, significantly slowing down at some point after meaningful AI software R&D acceleration is likely a good strategy. The question of when and whether to slow down after this point is hard to answer. I describe some considerations that affect this in Appendix A.

6. Heuristic #2: Before achieving meaningful AI software R&D acceleration, spend most safety resources on preparation

Figure 6. A very rough picture of how Magma might divide resources between safety preparation and safety mitigations over time.

7. Heuristic #3: During preparation, devote most safety resources to (1) raising awareness of risks, (2) getting ready to elicit safety research from AI, and (3) preparing extreme security.

Magma might prioritize preparing for interventions that rank highest on the following axes:

  1. The amount of safety that the intervention buys given its costs.
  2. The extent to which the intervention is applicable to other AI developers.
  3. The extent to which the intervention benefits from taking actions in advance.

The table below includes a long list of interventions with (weakly held) views on how they score on these three dimensions.

Table 1. Weakly held views on the importance of early action to prepare various interventions.

The top-scoring interventions in this table are:

The next sections explain why I think these interventions are most deserving of early action.

Category #1: Nonproliferation

Disclaimer: I like to read about security as a hobby, but I am not a security expert. Many of my views here come from talking to people with more expertise than I have, but please still take my opinions in this section with a grain of salt.

There are many ways Magma might slow down AI proliferation, but the class of interventions with the longest lead times is arguably security. Security measures can help (1) prevent theft of critical AI software (proliferation to human actors) and (2) prevent self-exfiltration (proliferation to rogue AI).

I’ll focus on what companies can do to secure model weights in particular. Model weights are both more tractable [4] and arguably more important [5] to secure than algorithmic insights.

Promising approaches to preparing model weight security include both (1) incrementally improving security and (2) performing forward-looking R&D.

Experts I’ve spoken with say that preparing extreme security will require close collaboration with government agencies. These agencies need to be involved in red teaming because they’ve been trying to break into companies for many years and have advanced internal exploits. Close government collaboration is not something that can wait until the last minute. Government agencies need to be involved while extreme security measures are being prepared and implemented.

Category #2: Safety distribution

There are two main strategies for making AI systems more safe:

How might Magma prepare these categories of safety measures?

An assumption of this plan is that Magma will achieve TEDAI in less than 3 years. Under short timelines, most safety progress will likely happen in the final years, when human researchers have access to AI assistance. For example, if AI assistance allows 10x progress for 8 months, this period would effectively contain roughly 80 months (nearly 7 years) of research advances. Therefore, under short timelines, there is huge upside in preparing to leverage accelerated AI safety work.

To prepare to leverage AI assistance, Magma might:

Figure 7. A pipeline for extracting work from AI agents.

All of these are active areas of research and potentially include a broad range of work. For example, a paper proposing an interpretability technique might improve risk evaluation methodologies by providing a concrete example of a mitigation to reason about. However, in general, I think most AI safety research would be significantly more useful if researchers explicitly prioritized preparing to leverage AI assistance.

The plans and methodologies for creating a trustworthy AI agent researcher are nascent. If AI developers suddenly had access to powerful AI assistance, I think a lot of it would be wasted because very few people have thought deeply about safety evaluation methodologies.

Category #3: Governance and communication

Magma might try to set the global stage for responsible AI governance by:

Perhaps the lowest-hanging fruit in this category is clearly communicating the nature of extreme risks, how prepared we are, and what needs to be done to address them. I believe the most important fact about our situation is that developers will plausibly create wildly superhuman AI in the next four years that will be able to overpower humanity as swiftly and easily as a modern military would overpower a musket-bearing 18th-century militia – and no AI company seems to be saying this and clearly explaining the implications.

Category #4: AI defense

AI defense benefits less from preparation than the other categories do.

Perhaps a more promising way to prepare for AI defense is to acquire market share and relationships with customers. For example, Magma might win contracts for AI military applications or build AI advising software for members of government – both of which are especially high-stakes deployments where responsible AI development is critical.

8. Conclusion

Even if developers follow a plan like this one, the situation is frankly horrifying. Developers might have only a couple of years before they need to thread the needle of AI alignment and thrust humanity’s destiny into the hands of newly born minds.

This is not the kind of situation one should stumble into – yet, as far as I am aware, AI companies generally do not have concrete plans for addressing extreme AI risks.[6]

A popular argument against planning is that predicting the future is hard. Perhaps AI companies are better off improving safety myopically rather than fooling themselves into thinking they can reason about more distant outcomes.

I agree that planning is challenging!

But it might be necessary.

Extreme risks might be hard to see coming, and much of the work to mitigate them has to begin years in advance.

It might be time to consider – research agenda by research agenda, security measure by security measure, policy by policy – what needs to be done, how it is going to happen, and whether we are on track to get there. The pages here obviously do not achieve this goal, but I’m hoping they are a step in the right direction.

I encourage readers to either:

I think preparing and prioritizing more effectively is quite tractable, and could substantially improve humanity’s chances of success.

Appendix

Appendix A: What should Magma do after meaningful AI software R&D speedups

Recall the three outcomes that Magma might eventually encounter: human researcher obsolescence, a long coordinated pause, and self-destruction.

Which of these outcomes might Magma steer toward?

Heuristics #2 and #3 are particularly weakly held. There are many other factors that might affect which outcome to target, such as the extent to which Magma dropping out of the race would reduce pressure on the leader, how much sending a costly signal might amplify advocacy efforts, etc. See section 5 for factors that might affect this cost-benefit analysis.

However, I think the extent to which Magma is “in the pack” or lagging behind is perhaps the most decisive factor in whether Magma should self-destruct.

If Magma is “in the pack” of leaders, then there are potentially big returns to remaining in the race.

However, if Magma is lagging, then there is a good reason to think their window of influence is concentrated around the time of TEDAI.

In this scenario, Magma might ramp up their spending on safety while they have influence.

  1. ^

     I selected this number for concreteness. It is fairly arbitrary.

  2. ^

     This list of outcomes is not comprehensive. A fourth possibility is that “AI progress hits a wall,” so a substantial ‘pause’ happens naturally; however, Magma focuses on scenarios where software-driven progress scales to superhuman AI within a few years. These scenarios are both plausible (see assumptions) and arguably where most near-term addressable risk is concentrated.

    There might also be a short coordinated pause; however, this scenario quickly leads to “human researcher obsolescence,” which is a better boundary for Magma’s planning horizon in this case.

  3. ^

     I expect the main risk AI systems begin to pose after this point is proliferating AI capabilities by speeding up other researchers. At 3x AI software R&D speedups, this risk is very non-trivial, but also probably not very high compared to risks that are looming ahead of the developer.

  4. ^

     Model weights are a giant stack of numbers. Attackers cannot steal them by asking a human “what are the model weights?” However, an attacker can steal algorithmic insights out of the minds of people.

  5. ^

     After strong AI R&D capabilities, model weights might be more important to secure than algorithmic insights because, if a developer can steal model weights, they also potentially steal a factory for generating algorithmic insights on demand; however, if a cyber threat actor can only steal algorithmic insights, they are restricted by whether Magma actually uses model weights to generate these insights. Said another way, once model weights are stolen, there isn’t much Magma might be able to do to prevent reckless scaling to higher capabilities, whereas they retain greater control if model weights are secure.

  6. ^

     There are a few documents that are somewhat plan-like such as The Checklist and some documents published by OpenAI in 2022, and safety frameworks are plans if you squint; however, the material published online does not scream “people at AI companies have thought very carefully about how to prepare for superdangerous AI.” My impression is that very few people explicitly think about this at AI companies.

2 comments


comment by Chris_Leong · 2025-01-30T05:13:47.621Z · LW(p) · GW(p)

Nice article, I especially love the diagrams!

In Human Researcher Obsolescence you note that we can't completely hand over research unless we manage to produce agents that are at least as "wise" as the human developers.

I agree with this, though I would love to see a future version of this plan include an expanded analysis of the role that wise AI would play in the strategy of Magma, as I believe this could be a key aspect of making the plan work.

In particular:

• We likely want to be developing wise AI advisors to advise us during the pre-hand-off period. In fact, I consider this likely to be vital to successfully navigating this period given the challenges involved.

• It's possible that we might manage to completely automate the more objective components of research without managing to completely automate the more subjective components. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can't defer to them.

• When developing AI capabilities, there's an additional lever in terms of how much Magma focuses on direct capabilities vs. focusing on wisdom.

comment by RussellThor · 2025-01-30T03:55:29.486Z · LW(p) · GW(p)

Thanks for this article, upvoted.

Firstly, Magma sounds most like Anthropic, especially the combination of Heuristic #1 (scale AI capabilities aggressively) with publishing safety work.

In general I like the approach, especially the balance between realism and not embracing fatalism. This is opposed to, say, MIRI, Pause AI, and at the other end, e/acc. (I belong to EA; however, they don’t seem to have a coherent plan I can get behind.) I like the realization that in a dangerous situation, doing dangerous things can be justified. It’s easy to be “moral” and just say “stop”; however, it’s another matter entirely whether that helps now.

I consider the pause around TEDAI to be important, though I would like to see it just before TEDAI (>3x alignment speed), not after. I am unsure how to achieve such a thing; do we have to lay the groundwork now? When I suggest such a thing elsewhere on this site, however, it gets downvoted:

https://www.lesswrong.com/posts/ynsjJWTAMhTogLHm6/?commentId=krYhuadYNnr3deamT

Goal #2: Magma might also reduce risks posed by other AI developers

In terms of what people not directly doing AI research can do, I think a lot can be done to reduce risks from other AI models. To me it would be highly desirable if AI(N-1) is deployed as quickly as possible into society and understood while AI(N) is still being tested. This clearly isn’t the case with critical security. Similarly,

AI defense: Harden the world against unsafe AI 

In terms of preparation, it would be good if critical companies were required to quickly deploy AGI security tools as they become available. That is, have the organization set up so that when new capabilities emerge and the new model finds potential vulnerabilities, experts in the company quickly assess them and deploy timely fixes.

Your idea of acquiring market share in high-risk domains? I haven't seen that mentioned before. It seems hard to pull off - hard to gain share in electricity grid software or similar.

Someone will no doubt bring up the more black hat approach to harden the world:

Soon after a new safety tool is released, a controlled hacking agent takes down a company in a neutral country with a very public hack, with the message that if they don’t use these security tools ASAP, then all other similar companies will suffer, and they have been warned.