Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination

post by Alex Lintz (alex-lintz) · 2022-08-03

Contents

  Summary
  The three pillars
    Pillar 1: Technical alignment
    Pillar 2: Safety-conscious deployment decisions
    Pillar 3: Coordination between potential AI deployers
  Hypothetical scenarios (for illustration) 
    Failure on one pillar, and partial success in others, leads to overall failure
      Partial technical and deployment decision success, but coordination failure 
      Partial technical and coordination success, but deployment decisions failure
      Partial coordination and deployment decision success, but technical failure
    Modest success in all pillars has unclear results
      Sufficient partial success on all pillars
      Insufficient partial success on all pillars
    Strong success on one pillar is unlikely, but could lead to overall success
      Very strong technical success sets a very low bar for deployment decision and coordination success
      Strong deployment decision success sets a low bar for technical and coordination success
  How likely are we to succeed? 
    Some variables that affect the difficulty of the pillars
  How to use the model
    Testing paths to victory
      Ideally, our community would have multiple redundant efforts to partially or fully solve each pillar.
    Describing strategic views
  Imperfections of the model and future research

Cross-posted to the EA Forum: Three pillars for avoiding AGI catastrophe [EA · GW]

Epistemic status: This model is loosely inspired by a conversation with Nate Soares but has since warped into something perhaps significantly different from his original intent. I think this model is a useful thinking tool when it comes to examining potential interventions to mitigate AI risk and getting a grasp of the problem we face. That said, I’m sure that the model has its flaws - I just haven’t spent enough time to find them. 

I’ve tried to write this post so it’s very skimmable (except the hypothetical scenarios section) - reading the first sentence or two of each heading and skipping the bullet points should get you the gist of the model. 

Summary

The three pillars model attempts to describe the conditions needed to successfully avoid the deployment of unaligned AGI. It proposes that, to succeed, we need to achieve some sufficient combination of success on all three of the following: 

  1. Technical alignment research
  2. Safety-conscious deployment decisions
  3. Coordination between potential AI deployers

While the overall difficulty of success depends on how hard each individual pillar is to solve, this model points toward why we may well fail to avoid AGI catastrophe: we need to simultaneously succeed at three difficult problems. 

More generally, the model aims to help longtermists flesh out our mental pictures of what success on AGI risk looks like. In particular, it suggests that a strategy aimed solely at a single pillar is unlikely to be sufficient, and our community might need to take ambitious actions in several directions at once. 

This model is intended as an imperfect but hopefully useful thinking tool. Further work in this area could add nuance to the model, formalize it, and try to use it to describe different viewpoints (e.g. Christiano’s stories of AI failure) or strategies (e.g. raising awareness of AI risk, which might affect several pillars).

 

The three pillars

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

We need sufficient success on some combination of three pillars in order to reach a future where AGI does not kill us: technical alignment, safety-conscious deployment decisions, and coordination between potential deployers. Success on any given pillar can, to some extent, substitute for success on another. For example, we could avoid doom by being extremely successful on one pillar, quite successful on two, or fairly successful on all three. That said, we generally need at least a minimal amount of progress on each pillar to succeed. For example, even a low-cost and easy-to-implement technical solution to alignment still needs to be adopted by leading AI developers. 

One conceptualization which might be useful for driving intuition is to set the bar for victory at 100 pillar points. We can have partial success on all pillars: e.g. 33, 33, and 34 points from pillars 1, 2, and 3 respectively; or we can get almost everything from one pillar: e.g. a 90-5-5 split with pillar 1 supporting most of the weight. 
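To make this toy framing concrete, here is a minimal sketch in Python. The 100-point bar, the per-pillar minimum, and the specific splits are all illustrative assumptions for intuition-building, not part of any formal model.

```python
# A minimal sketch of the "pillar points" intuition above.
# All numbers are illustrative assumptions, not claims about the real world.

VICTORY_BAR = 100   # hypothetical threshold for "sufficient combined success"
PILLAR_MINIMUM = 5  # illustrative: some minimal progress is needed on every pillar

def is_sufficient(technical: int, deployment: int, coordination: int) -> bool:
    """Return True if the combined pillar points clear the bar and no pillar is neglected."""
    scores = (technical, deployment, coordination)
    return sum(scores) >= VICTORY_BAR and all(s >= PILLAR_MINIMUM for s in scores)

print(is_sufficient(33, 33, 34))  # True: partial success on all three pillars
print(is_sufficient(90, 5, 5))    # True: one pillar carries most of the weight
print(is_sufficient(0, 50, 50))   # False: one pillar is entirely neglected
```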

In practice, we should expect progress on the pillars to be correlated. Success on each pillar is more likely if we’re in a world with more competence, more well-calibrated AGI risk awareness, a more effective and influential longtermist community, etc. In addition, success on one pillar can increase the return on investment in other pillars. For example, if technical success were to give us the foolproof ability to mind-read AGIs, it would be trivial to succeed on deployment decisions, because it would be obvious whether an AGI is or isn’t aligned. So, on top of the 60 pillar points we get for achieving robust mind-reading, we could then get 60 points on deployment decisions without much extra effort by spreading knowledge about how to mind-read.
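As a rough illustration of why this correlation matters (see also footnote 6), here is a toy Monte Carlo sketch in which all three pillars share an underlying “world competence” factor. The weights, thresholds, and resulting probabilities are made up purely for illustration.

```python
# Toy sketch: when pillar successes share an underlying factor, the chance of
# succeeding on all three is much higher than the product of the marginal chances.
import random

def simulate(n_trials: int = 100_000, shared_weight: float = 0.7) -> None:
    all_three = 0
    singles = [0, 0, 0]
    for _ in range(n_trials):
        competence = random.random()  # shared factor affecting every pillar
        outcomes = []
        for _ in range(3):
            # Each pillar's outcome mixes the shared factor with its own luck.
            score = shared_weight * competence + (1 - shared_weight) * random.random()
            outcomes.append(score > 0.8)  # arbitrary per-pillar difficulty
        for i, ok in enumerate(outcomes):
            singles[i] += ok
        all_three += all(outcomes)
    marginals = [s / n_trials for s in singles]
    print("Per-pillar success rates:", [round(p, 3) for p in marginals])
    print("Naive independent product:", round(marginals[0] * marginals[1] * marginals[2], 4))
    print("Actual joint success rate:", round(all_three / n_trials, 4))

simulate()
```

With these made-up numbers, each pillar succeeds roughly 10% of the time, yet the joint success rate comes out far above the naive 0.1% you would get by treating the pillars as independent.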

Success could happen in a number of ways, but I have provided some specific examples for each pillar. I’ve chosen to give examples of three levels of success/failure for each pillar to build intuition (though note there is actually a continuous success/failure spectrum). 

Pillar 1: Technical alignment

The extent to which actors are able to make technical progress on the alignment problem (i.e. creating an AGI that won’t kill us all when deployed), and the ability to identify whether we’ve solved it (i.e. recognizing if a system is misaligned before deployment). 

 

Pillar 2: Safety-conscious deployment decisions

The extent to which organizations capable of deploying AGI are (a) cognizant of AI risk and (b) able to decide whether to deploy a system (including deciding not to deploy) based on whether or not they think it’s safe to deploy. 

 

Pillar 3: Coordination between potential AI deployers

The extent to which conditions and relations between potential deployers of AGI favor or disfavor deploying unaligned AGI.  

 

Hypothetical scenarios (for illustration) 

These very rough hypothetical future scenarios are intended both to illustrate what the above pillars mean by sketching examples of how each could succeed or fail, and to illustrate some plausible failure and success scenarios.
 

Failure on one pillar, and partial success in others, leads to overall failure

Partial technical and deployment decision success, but coordination failure 

Scenario 1: DeepMind has a giant team working on alignment. They are properly incentivized and are almost ready to deploy an aligned AGI. DeepMind gets inaccurate intel that Microsoft has an AGI that they are very nearly ready to deploy. Demis Hassabis, CEO of DeepMind, reasons that DeepMind’s AGI is more likely to go right than the one created by Microsoft, and so he chooses to deploy even though he’s not quite sure it’s ready. The unaligned AGI becomes impossible to control and kills us all. 

Scenario 2: Google, OpenAI, Anthropic, and a few other major AI companies get together and form a coalition. With their combined resources they are dramatically ahead of the competition. They have competent leadership that can distinguish between true and false solutions to the alignment problem. The US government breaks up the coalition because it violates antitrust law. The various parties split apart. One party chooses to deploy an AGI early, fearing that others will do the same. Humanity loses.

Partial technical and coordination success, but deployment decisions failure

Scenario 3: The US government dominates the AI ecosystem. They’ve managed to create a surveillance system which can tell them who is building AGI, and they attack anyone who does with immense cyber offensive capabilities. They’re quite sure their projects are the only ones capable of AGI. A new president is elected who has optimistic views on AGI safety. He believes that we need only be nice to the AGI while training it and then it will be safe to deploy.[5] He commands the most submissive lab to train up such an AGI and deploy it. Humanity loses. 

Partial coordination and deployment decision success, but technical failure

Scenario 4: The US government dominates the AI ecosystem. They’ve got an eye on all the competing projects and are shutting them down as they crop up. They’ve got people working round the clock on technical safety solutions. The lab is fairly responsible, but after several years of being able to deploy AGI and choosing not to, pressure increases. The team doesn’t have access to tools which can tell them whether the AGI will be misaligned, nor does it have a foolproof alignment technique. Under public pressure after a decade of holdups, the leader of the AI lab eventually declares that he’s pretty sure he has trained an AGI which will not kill us all. He deploys and it kills us all. 

Modest success in all pillars has unclear results

Sufficient partial success on all pillars

Just a few labs are clearly in the lead. They are fairly sure of their lead, are inclined to spend time aligning AGI before deploying it, and have some degree of cooperation among them. As pressure mounts to deploy AGI one year after the labs are capable of doing so, OpenAI develops a solution to technical alignment that is easily understood and verified by third parties. OpenAI deploys an aligned AGI.  

Insufficient partial success on all pillars

Just a few Western labs are clearly in the lead. They are fairly sure of their lead, are inclined to spend time aligning AGI before deploying it, and have some degree of cooperation among them. Quality work is ongoing to solve the technical alignment problem but two years after leading labs are capable of deploying AGI, there is no foolproof solution in sight. Feeling threatened by the potential for a US-dominated post-AGI world, China starts its own secret project to build AGI. They are able to copy a leading lab’s model and, cognizant of the risks, decide it’s worth taking the chance to deploy the model even though alignment is not certain. The AI is unaligned and humanity loses. 

Strong success on one pillar is unlikely, but could lead to overall success

Very strong technical success sets a very low bar for deployment decision and coordination success

Scenario 5: Researchers at MIRI stumble across a total solution to the alignment problem. It can be easily adapted to new systems and they send it out to every lab in the world. Despite intense competition and poor leadership, leaders are able to recognize that this is in fact a solution to the problem and that applying it to their systems is a good idea. Labs apply it and Baidu deploys an aligned AGI. 

Strong deployment decision success sets a low bar for technical and coordination success

Scenario 6: The US government and Chinese government are totally bought into AGI risk and treat it as a truly existential concern. They get their competence hats on and are able to pick leaders for their national projects who can distinguish between true and false solutions to the alignment problem. But the two countries fail to reach any real agreements between themselves, and multilateral efforts fail. Also, barely anyone works on the right problem.[2] A mutually assured destruction (MAD) dynamic develops where the US and China both know they can destroy the world by releasing their AGI, but they do not. This leads to a stalemate for decades where AGI deployment is seen as the most dangerous possible thing and hope for alignment is low. After 50 years like this we figure out how to make emulated minds, and they are able to figure out the solution to the alignment problem. The US deploys an aligned AGI. 

 

How likely are we to succeed? 

A key factor for determining whether things are likely to go well, and on which pillars we should focus our attention, is how easy or difficult each pillar is to solve. It’s possible that one or more pillars are trivial and it’s also possible that one or more pillars are impossible.[6]

Some variables that affect the difficulty of the pillars

There are many variables which affect the difficulty of solving the pillars, some of which we can intervene on in order to make the problem ahead easier. Below are some examples of variables that could affect the difficulty of pillars, and my rough best-guess at their overall effect:

 

How to use the model

Testing paths to victory

One way to use this model is to get a sense of what combinations of strategies might be necessary for success; i.e., the model could facilitate building more complex theories of victory. 

For example, perhaps the longtermist community needs to simultaneously:

Ideally, our community would have multiple redundant efforts to partially or fully solve each pillar.

Describing strategic views

We can describe a range of strategic views on AI risk with a few parameters:

In practice, teasing out the implications of pillar difficulty is not straightforward. If you believe a particular pillar is necessary for success but not very tractable, it might still be worth working on. By contrast, if you think each pillar is likely to meet the minimum viable threshold, it might make more sense to work on particularly tractable areas. Since we don’t know exactly what minimum viability looks like, prioritization is hard!  

Some interventions look more promising when we consider their contributions across multiple pillars. For example, building consensus about the importance of AI safety among technical ML researchers is likely to get some important decision-makers on board, and could make it easier to develop an international “epistemic community” around AI risk reduction, improving international coordination.

 

Imperfections of the model and future research

I have some intuition that upon further scrutiny this model might break down. However, I do think it’s a useful thinking tool! As the saying goes, “all models are wrong, but some are useful”. The main issue is that the pillars are deeply interconnected. For instance, progress on technical issues will be crucial for helping decision-makers determine whether a given alignment solution is real. Also, labs that have AGI-concerned leaders are more likely to coordinate.  

I think there are a lot of potentially interesting things to do with the model in follow-up work. For example: 


 

Acknowledgements: Ashwin Acharya and Michael Aird for significant feedback, Nate Soares for providing the initial idea, as well as Abi Olvera, Fynn Heide, Max Räuker, and Ben Cottier.

  1. ^

     Though we still can’t drop the ball entirely on the other pillars, we are just at a point where business as usual is probably fine.

  2. ^

     For Nate Soares’ take on what the right problem is, see his post here: On how various plans miss the hard bits of the alignment challenge [EA · GW]

  3. ^

     While the ability to determine whether an alignment solution will actually work is a technical problem, the leadership of the organization deciding whether to deploy AGI will face the problem of figuring out whom to listen to about whether the system is aligned, and/or doing the technical thinking themselves. 

  4. ^

    This set of actors could be kept small by (a) one actor deploying safe AGI in such a way that prevents other actors from deploying any AGI or from deploying unsafe AGI specifically and (b) there being a large (perceived) lead in AI development between one or a small set of actors and all other actors. 

    The expected behavior of actors who might end up with the option of deploying unsafe AGI matters because that may affect what other actors do, such as how much those other actors cut corners to get to AGI first. 

  5. ^

     There are apparently real, powerful people who hold this opinion, so it’s not as ridiculous as it sounds. 

  6. ^

     Note that calculating a probability of success is not as simple as multiplying, say, 10%*10%*10% to get the correct odds, because success and failure on different pillars are correlated. For instance, if success were perfectly correlated across the pillars, three 10% chances would combine to an overall 10% chance rather than 0.1%. 

  7. ^

     This framing really helped me understand why MIRI folk tend to be extremely pessimistic about our odds of survival. 
