AI Risk & Policy Forecasts from Metaculus & FLI's AI Pathways Workshop

post by _will_ (Will Aldred) · 2023-05-16T18:06:54.931Z · LW · GW · 4 comments

Contents

  Summary
  Introduction
  Methods
  Findings
  Scenarios
  Updates
  Feedback
  Acknowledgements
4 comments

Crossposted from the Metaculus Journal.

Summary

Over a two-day workshop, Metaculus Pro Forecasters and subject-matter experts (SMEs) from several organizations evaluated policy directions for reducing AI risk using scenario planning and forecasting methodologies. The key conclusions are presented in the Findings section below.

Introduction

The increasing capabilities of artificial intelligence have created a growing sense of urgency around the potential risks and opportunities of these powerful systems, which are now garnering attention in public and political spheres. Notably, the Future of Life Institute’s open letter calling for a six-month pause on the training of AI systems more powerful than GPT-4 collected over 25,000 signatures within five weeks of release, and senior White House officials have met with AI leaders to discuss risks from AI.

This report presents the findings of a structured process, organized by Metaculus and the Future of Life Institute, which brought together Pro Forecasters and SMEs to begin identifying, quantitatively, the most impactful policy directions for reducing existential risk from misaligned AI [? · GW] (hereafter: “AI risk”).

Methods

The AI Pathways workshop combined a range of plausible scenarios for an AI future with probabilistic forecasts, leveraging the judgments of both SMEs and Metaculus Pro Forecasters. The goal of the exercise was to identify the most impactful actions for steering toward a positive future and away from a negative future.

We—the Metaculus AI Forecasting team, in collaboration with the FLI policy team—began by generating four possible states of the world in 2030, focusing on the impact of AI, and asking experts to identify developments which would likely play a large role in driving the world toward each of these four scenarios. The scenarios were developed with two key areas of uncertainty in mind: 1) takeoff speed and the related question of unipolarity versus multipolarity; 2) cooperation versus competition—between labs and between countries. The “Scenarios [LW(p) · GW(p)]” section below outlines these scenarios in more depth. We worked with the SMEs to identify “indicators” or “drivers” of these scenarios, ranked these, identified potential U.S. government policies and leading lab coordination actions that could pertain to the most important twenty or so indicators, and developed corresponding forecasting questions.

Workshop teams, made up of integrated groups of technical AI experts, AI policy experts, and at least one Pro Forecaster, then forecasted on these questions. In practice, most of the forecasting questions asked about either:

  1. The odds of a policy or coordination response (e.g., compute restrictions), conditional on whether or not an intermediate development (e.g., a warning shot [? · GW]) occurs.
  2. The odds of a more terminal development (e.g., recursive self-improvement [? · GW]), conditional on whether or not a policy or coordination action is implemented.

All probabilities quoted in this report are the median forecast from the workshop and include forecasts from Pro Forecasters and SMEs. There were around 20 workshop participants, and the forecasting questions in this report were forecasted on by 5 or 6 participants on average. Quoted forecasts should be taken as having low to moderate resilience [? · GW], and not as being fully authoritative.
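
For readers unfamiliar with how such panel forecasts are typically aggregated, here is a minimal Python sketch of the median-and-quartiles summary used in this report. The individual forecasts are made up for illustration; individual-level workshop submissions are not published here.

```python
from statistics import median, quantiles

# Hypothetical individual forecasts on one binary question (illustration only;
# these are not actual workshop submissions).
panel = [0.05, 0.10, 0.15, 0.20, 0.30]

report_value = median(panel)        # the statistic quoted throughout this report
q1, _, q3 = quantiles(panel, n=4)   # quartiles, as used for the forefront-actors estimate below
print(f"median = {report_value:.0%}, interquartile range = [{q1:.0%}, {q3:.0%}]")
```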

Findings

  1. Policy interventions
    1. Hardware
      1. Participants were pessimistic that any of the potential hardware-related policies they looked into would be implemented in the next few years. They forecasted a 15% chance of compute capacity restrictions being implemented by 2027, a 15% chance of required reporting of large training runs by 2026, and a 7% chance of usage-tracking firmware updates for chips being enforced by 2028.
      2. Participants’ forecasts indicate an expectation that compute restrictions would marginally increase the number of frontier labs,[1] with part of the stated reasoning being that such restrictions would prevent the top labs from racing ahead on compute. More frontier labs could be interpreted as meaning intensified race dynamics, though the workshop group that examined this topic concurred that the positive effect of compute restrictions on slowing capabilities advances would likely outweigh any negative effects on race dynamics. Participants expected the other hardware policies to have little effect on the number of frontier labs.
    2. Software
      1. One category of risk, which has entered the AI safety discourse over the past month or so following the release of “agentized [LW · GW]” large language models (LLMs) like Auto-GPT and BabyAGI, and which was the focus of participants’ forecasting on software-related policy interventions, concerns the possibility that these agentized LLMs might recursively self-improve. (Recursive self-improvement has been part of the AI safety discussion since the early days; however, agentized LLMs appear to be the first systems that might be capable of enacting some form of it.)
        1. The default probability of an API restriction policy by 2027, according to the participants’ forecasts, is 7%. The probability of recursive self-improvement, via agentized LLMs improving their agentic “wrappers”, given no API restrictions is 90%, whereas the probability of recursive self-improvement in this way given API restrictions is 67%. This means that if one thinks self-improving agentized LLMs are a significant worry, then raising the chance of a policy to prevent LLMs from accessing their own APIs is potentially a high-impact intervention (these conditional forecasts are combined in a worked sketch after this list).
        2. The default probability of both API restrictions and LLM export controls by 2027 is 5%, according to the workshop participants. (Note: LLM export controls would make it illegal for anyone in the U.S. to open-source an LLM more powerful than some stated level.) The probability of recursive self-improvement, via agentized LLMs improving their agentic “wrappers”, given just one or none of API restrictions and LLM export controls is 90%, whereas the probability of recursive self-improvement in this way given both API restrictions and LLM export controls is 60%. Therefore, if one thinks that self-improving agentized LLMs are a significant worry, then increasing the chance of an LLM export control policy is also potentially an impactful intervention.
    3. International
      1. Participants were pessimistic about an international AI treaty involving China. Specifically, they were at 16% that by 2030 the US and China will sign a treaty requiring mutual transparency over large training runs.
      2. Participants were also pessimistic—8.5% and 5%, respectively—that the U.S. or Chinese President will, by 2025, publicly adopt a stance against AI racing.
  2. State of AI
    1. The first transformative AI (TAI) will be developed by a for-profit company: 70% chance.
      1. The following definition of TAI was used: AI that precipitates a transition comparable to, or more significant than, the agricultural or industrial revolution.
      2. The remaining 30% was spread fairly equally over these possibilities: non-profit company, government project, nationalized company. The first TAI coming from academia or an open-source collective was judged as unlikely.
        1. A related workshop forecast: 45% chance that the open source ecosystem will have produced a GPT-4 equivalent by 2027.
    2. The number of actors at the TAI forefront will likely be small, though participants don’t expect only one forefront actor. The median estimate for the number of forefront actors was 4, and the lower and upper quartiles were 2 and 7, respectively.[2]
    3. 40% chance that AI will be meaningfully discussed by both major candidates in the 2024 U.S. Presidential debate.
  3. State of AI lab coordination
    1. Participants were somewhat pessimistic about leading AI labs coordinating with each other to slow the capabilities race.
      1. “Will three of the largest AI labs issue a joint statement committing to constrain capabilities development or deployment before 2027?” received a median forecast of 50%.
      2. Key figures seem unlikely to publicly adopt a stance against AI racing by 2025. The median forecast was around 25% for Sam Altman and for Demis Hassabis, and around 10% for each of Eric Schmidt, Jeff Dean, and Yann LeCun.
  4. Risks from AI
    1. The median forecast on the chance of AI-caused global catastrophe, meaning >10% of the global population being killed, by 2030 was 2%.
    2. Meanwhile, the median forecast regarding a moderate AI catastrophe by 2030 was 80%.[3]
    3. Interestingly, the forecasted chance of U.S. compute restrictions by 2025 (restricting the amount of compute allowed in training runs is one of the most commonly discussed policy interventions in AI safety) was substantially higher, 30% versus 10%, conditional on a moderate AI catastrophe occurring before then. The reasoning that came up repeatedly here was that a warning shot, such as a moderate AI catastrophe, could act as a galvanizing event, causing proposed policies that may currently be outside the Overton window to gain support and be implemented. Indeed, the notion that a warning shot might significantly raise the chance of policy action was a recurring theme in the workshop as a whole.
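
To make the leverage of these policy conditionals concrete, the sketch below (referenced in the software-policy findings above) combines the quoted forecasts via the law of total probability. It is an illustration, not a workshop output, and the 14% figure is a purely hypothetical doubling of the policy's odds, not a workshop forecast.

```python
def p_outcome(p_policy, p_if_policy, p_if_no_policy):
    """Law of total probability: combine a policy forecast with the two conditional forecasts."""
    return p_policy * p_if_policy + (1 - p_policy) * p_if_no_policy

# Workshop medians for recursive self-improvement (RSI) via agentized LLMs:
p_api_restrictions = 0.07   # API restriction policy in place by 2027
p_rsi_if_policy = 0.67      # P(RSI | API restrictions)
p_rsi_if_no_policy = 0.90   # P(RSI | no API restrictions)

baseline = p_outcome(p_api_restrictions, p_rsi_if_policy, p_rsi_if_no_policy)
boosted = p_outcome(0.14, p_rsi_if_policy, p_rsi_if_no_policy)  # hypothetical doubling of policy odds

print(f"implied P(RSI): {baseline:.1%} at 7% policy odds, {boosted:.1%} at 14%")
# Each percentage point added to the policy's odds lowers the implied P(RSI) by
# 0.90 - 0.67 = 0.23 percentage points; the same arithmetic applies to the
# warning-shot conditional above (30% vs. 10% for compute restrictions by 2025).
```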

Scenarios

We—Metaculus and FLI—began this project by constructing and discussing a set of four scenarios set in 2030:

  1. In “The Pause Perdures”:[4]
    1. Despite general stability, and the ability to build them, there are no AI systems significantly more capable than GPT-4 (i.e., maybe GPT-5 level; no PASTA, no superintelligence).
    2. AI is broadly integrated into society and used for many things. It has been a big adjustment.
    3. Both expert consensus and some minor AI-related disasters have created a compelling case that AGI/superintelligence is a bad idea in the foreseeable future.
    4. The limit on AI has been chosen by society and is enforced by human norms, laws, and institutions.
  2. In “Gatekeeper”:
    1. There is a single arguably superintelligent AI system, which is substantially more competent than all others.
    2. It is owned/controlled by a multinational NGO with shared global governance.
    3. It generates both scientific research and software products for the global public good (including help in governing less powerful AI systems).
    4. It supports a multilateral global agreement, administered by an international IAEA-like agency, that prevents rival superintelligences from being developed.
    5. GPT-5 level systems are used broadly for many things, but are strictly capability-capped and carefully managed.
  3. In “Accelerando, Diminuendo”:
    1. There are many powerful and economically useful AI systems by 2025, and a few that could be called general intelligences by 2027. All of these are developed by major competing AI companies, and remain associated with them (though at some point the tail starts to wag the dog).
    2. Between 2025 and 2026 people broadly benefit greatly from AI, and AI systems seem to be coming up with methods of collaborating amongst themselves and with human institutions.
    3. From 2026 to 2027, decisions are increasingly delegated to AI systems, and there is a general feeling that nobody knows exactly what they are doing and why. Still, productivity is high, there is an endless stream of entertainment, and new technologies are appearing at a rapid rate.
    4. Around 2028, things start to get markedly worse worldwide for humans, but few if any humans really understand what is going wrong—or even that things might be going wrong at all.
    5. In 2030, the human population is trending sharply downward, though this is not widely known.
  4. In “Crash on Takeoff”:
    1. Timelines have been quite short, with the Metaculus (harder) AGI question resolving positive in 2026 after an AGI was developed in a massive year-long training run. This AGI is, however, kept tightly controlled by its highly cautious developers.
    2. In 2028, a company secretly uses a somewhat less powerful but less constrained system to do human-assisted recursive self-improvement.
    3. In 2029, the humans fall out of the loop. There is brief conflict with the existing AGI, then intelligence explosion and classic instrumental drives/power-seeking behavior.
    4. Humanity is extinct by 2030.

Updates

We plan to run subsequent workshops as part of the overall AI Pathways project; the next is scheduled for June.

Feedback

The AI Pathways project represents what is, as far as we’re aware, a new angle of attack for anticipating the future of AI risk and judging the impact of high-stakes policy decisions. We think that some meaningful progress was made in this first workshop, and we hope that future workshops will be even more directly useful for policy decisions. We would welcome your thoughts on the utility of this project. Please direct any feedback to will@metaculus.com.

Acknowledgements

AI Pathways is an initiative developed by Metaculus in partnership with the Future of Life Institute. It is designed to support U.S. policymakers as they navigate risks from advanced AI. We thank the SMEs and Pro Forecasters who contributed to this work.

This report is a project of Metaculus. It was authored by Will Aldred, with feedback from Lawrence Phillips, Nate Morrison, Dan Schwarz, Christian Williams, and Gaia Dempsey.

Metaculus, a public benefit corporation, is an online forecasting platform and aggregation engine working to improve human reasoning and coordination on topics of global importance.

  1. ^

     51% vs 68% that there’ll be more than 10 actors at the forefront in 2030. Number of forefront actors is certainly not the only noteworthy effect of hardware policies, but it’s what we chose to focus on in this workshop.

  2. ^

     We count an actor as being at the TAI forefront if they carried out a training run within one order of magnitude of the training compute used in the run that produced the TAI.

  3. ^

     We define a moderate catastrophe as a disaster (or series of disasters occurring within a 30-day span) that triggers a public statement of concern from at least one head of state, cabinet member, foreign minister, national security advisor, or secretary of state from one of the P5 (the five permanent members of the UN Security Council).

  4. ^

     The main text outlines the scenario in broad strokes. Below, one concrete story for the scenario is given. (We thank the experts and forecasters in the Pause Perdures group for constructing this story.)

    GPT-5 is developed by OpenAI in mid-2025.

    In 2026, a terrorist cell deliberately misuses a GPT-5 level LLM to create an auto-blackmail system. The system uses spear-phishing campaigns to hack email accounts, CCTV cameras, etc., uncovering a wide variety of genuine blackmail material on politicians across the US, UK, China, South Korea, and Japan. Some of the blackmail material is also fabricated via deepfake audio and video, and it’s nearly impossible to tell the difference between the real and fake material. Much of the material is then released, prompting a wave of political resignations, several notable suicides, and one assassination.

    This AI campaign is eventually discovered by an Interpol taskforce and gets wall-to-wall coverage in every major news outlet. This coverage galvanizes international coordination towards a pause.

    Throughout this story, China is at least five years behind the LLM state of the art due to increasingly strong and increasingly well-coordinated export controls. China’s access to cloud compute has also been heavily restricted. At the time of the pause, China has not even trained an LLM as good as GPT-4, despite significant investment.

    Generally speaking, AI capabilities have remained very capital intensive, so the set of actors has remained small. At the time of the pause, only (1) OpenAI + Microsoft, (2) Google + DeepMind, (3) Anthropic, and (4) the US government are capable of producing a GPT-5-or-better model. Leading AI labs continue to be private and domiciled in democracies.

    The US starts leading the world towards an international treaty banning AI training runs that use over ~1E28 FLOP. China initially resists these measures, but turns around and complies on account of the carrot offered: its access to AI compute supply chains is tied to participating in the treaty. (Note: an alternative story with a similar outcome could involve some successful "boxing" of China, instead of a treaty.) Secure chip technologies are developed such that centralized tracking of chips, knowledge of how they are being used, and remote shutdown are all possible. Installation of these features is mandated through supply chain controls from the U.S. government, with support from Japan, South Korea, and the Netherlands. The US government also invests heavily in cybersecurity for these actors to prevent exfiltration, and engages in extensive compute governance.

4 comments


comment by SteveZ (steve-zekany) · 2023-05-16T20:23:50.995Z · LW(p) · GW(p)

Thanks for posting this. I am a bit surprised that the forecasts for hardware-related restrictions are so low. Are there any notes or details available on what led the group to those numbers?

In particular, the spread between firmware-based monitoring (7%) and compute capacity restrictions (15%) seems too small to me. I would have expected either a higher chance of restrictions or a lower chance of on-chip monitoring, because both are predicated on similar decision-making steps, but implementing and operating an end-to-end firmware monitoring system has many technical hurdles.

comment by _will_ (Will Aldred) · 2023-05-17T12:42:25.541Z · LW(p) · GW(p)

Thanks for this question.

Firstly, I agree with you that firmware-based monitoring and compute capacity restrictions would require similar amounts of political will to happen. Then, in terms of technical challenges, I remember one of the forecasters saying they believe that "usage-tracking firmware updates being rolled out to 95% of all chips covered by the 2022 US export controls before 2028" is 90% likely to be physically possible, and 70% likely to be logistically possible. (I was surprised at how high these stated percentages were, but I didn't have time then to probe them on why exactly they were at these percentages—I may do so at the next workshop.)

Assuming the technical challenges of compute capacity restrictions aren't significant, fixing compute capacity restrictions at 15% likely, and applying the following crude calculation:

P(firmware) = P(compute) x P(firmware technical challenges are met)

= 0.15 x (0.9 x 0.7) = 0.15 x 0.63 = 0.0945 ~ 9%

9% is a little above the reported 7%, which I take as meaning that the other forecasters on this question believe the firmware technical challenges are a little, but not massively, harder than the 90%–70% breakdown given above.
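
For concreteness, here is the same back-of-envelope calculation as a short Python sketch (same figures as above; the final comment is just the reverse arithmetic, not a workshop forecast).

```python
# Back-of-envelope estimate from the thread above.
p_compute_restrictions = 0.15   # workshop forecast: compute capacity restrictions by 2027
p_physically_possible = 0.90    # one forecaster's estimate for the firmware rollout
p_logistically_possible = 0.70

p_firmware_crude = p_compute_restrictions * p_physically_possible * p_logistically_possible
print(f"crude estimate: {p_firmware_crude:.2%}")   # 9.45%, vs. the reported 7%

# Working backwards, the reported 7% would imply a combined technical-feasibility
# factor of roughly 0.07 / 0.15 ≈ 0.47, versus the 0.63 assumed above.
```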

comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-05-17T03:29:49.931Z · LW(p) · GW(p)

For what it's worth, my median scenario looks like:

Leading AI labs continue doing AI-assisted but primarily human-driven improvement to AI over the next 1-3 years. At some point during this time, a sufficiently competent general reasoning & coding model is created that shifts the balance of AI versus human inputs. So the next generation of AI starts getting an increasing share of its contributions from the previous AI. With this new help, the lab releases a new model in somewhat less than a year. This new model contributes even more. In just a few months, a yet newer model is designed and trained. A few months after that, another model. Then the lab says, 'hold up, this thing is now getting scary powerful. How can we be sure we trust it?' (maybe a few more or fewer model generations will be required).

Then we have a weird stalemate situation where the lab has this new powerful model which is clearly superior to what it had a year ago, but is unsure about how trustworthy it is. It is safely contained and extensive tests are run, but the tests are far from definitive.

Meanwhile, the incautious open-source community continues to advance... The countdown is ticking until the open-source community catches up.

So there we are in 2026, being like, "What do we do now? We have definitely created a model powerful enough to be dangerous, but still don't have a sure way to align it. We know what advances and compute it took for us to get this far, and we know that the open-source efforts will catch up in around 3-4 years. Can we solve the alignment problem before that happens?"

I don't have a clear answer to what happens then. My guess is that a true complete solution to the alignment problem won't be found in time, and that we'll have to 'make do' with some sort of incomplete solution which is hopefully adequate to prevent disaster.

I also think there's an outside possibility of a research group not associated with one of the main labs making and publishing an algorithmic breakthrough which brings recursive self-improvement into reach of the open-source community suddenly and without warning. If that happens, what does humanity do about that? If we have kicked off this process in an open-source, anyone-can-do-it scenario, and we suspect we have only a few months before further advances push the open-source models to dangerous levels of competence, what then? I don't know. If anything, I'm hoping that the big labs, with their reasonable safety precautions (which hopefully get improved over the next year or two), do actually manage to come in first. Just because then it'll be at least temporarily contained, and the world's governments and large corporate actors will have the opportunity to officially verify for themselves that 'yes, this is a real thing that exists and is dangerous now'. That seems like a scenario more likely to go well than the sudden open-source breakthrough.

comment by _will_ (Will Aldred) · 2023-05-17T12:53:11.913Z · LW(p) · GW(p)

Thanks for this comment. I don't have much to add, other than: have you considered fleshing out and writing up this scenario in a style similar to "What 2026 looks like [LW · GW]"?