Results from an Adversarial Collaboration on AI Risk (FRI)

post by Josh Rosenberg (josh-rosenberg), AvitalM, Molly (hickman-santini), rosehadshar · 2024-03-11T20:00:24.642Z · LW · GW · 3 comments

This is a link post for https://forecastingresearch.org/s/AIcollaboration.pdf

Contents

  Abstract
  Extended Executive Summary
    Methods
    Results: What drives (and doesn’t drive) disagreement over AI risk
      Hypothesis #1 - Disagreements about AI risk persist due to lack of engagement among participants, low quality of participants, or because the skeptic and concerned groups did not understand each others' arguments
      Hypothesis #2 - Disagreements about AI risk are explained by different short-term expectations (e.g. about AI capabilities, AI policy, or other factors that could be observed by 2030)
      Hypothesis #3 - Disagreements about AI risk are explained by different long-term expectations
      Hypothesis #4 - These groups have fundamental worldview disagreements that go beyond the discussion about AI
    Results: Forecasting methodology
    Broader scientific implications
    Directions for further research
None
3 comments

Authors of linked report: Josh Rosenberg, Ezra Karger, Avital Morris, Molly Hickman, Rose Hadshar, Zachary Jacobs, Philip Tetlock[1]

Today, the Forecasting Research Institute (FRI) released “Roots of Disagreement on AI Risk: Exploring the Potential and Pitfalls of Adversarial Collaboration,” which discusses the results of an adversarial collaboration focused on forecasting risks from AI.

In this post, we provide a brief overview of the methods, findings, and directions for further research. For much more analysis and discussion, see the full report: https://forecastingresearch.org/s/AIcollaboration.pdf

(This report is cross-posted to the EA Forum [EA · GW].)

Abstract

We brought together generalist forecasters and domain experts (n=22) who disagreed about the risk AI poses to humanity in the next century. The “concerned” participants (all of whom were domain experts) predicted a 20% chance of an AI-caused existential catastrophe by 2100, while the “skeptical” group (mainly “superforecasters”) predicted a 0.12% chance. Participants worked together to find the strongest near-term cruxes: forecasting questions resolving by 2030 that would lead to the largest change in their beliefs (in expectation) about the risk of existential catastrophe by 2100. Neither the concerned nor the skeptics substantially updated toward the other’s views during our study, though one of the top short-term cruxes we identified is expected to close the gap in beliefs about AI existential catastrophe by about 5%: approximately 1 percentage point out of the roughly 20 percentage point gap in existential catastrophe forecasts. We find greater agreement about a broader set of risks from AI over the next thousand years: the two groups gave median forecasts of 30% (skeptics) and 40% (concerned) that AI will have severe negative effects on humanity by causing major declines in population, very low self-reported well-being, or extinction.

Extended Executive Summary

In July 2023, we released our Existential Risk Persuasion Tournament (XPT) report, which identified large disagreements between domain experts and generalist forecasters about key risks to humanity (Karger et al. 2023). This new project—a structured adversarial collaboration run in April and May 2023—is a follow-up to the XPT focused on better understanding the drivers of disagreement about AI risk.

Methods

We recruited participants to join “AI skeptic” (n=11) and “AI concerned” (n=11) groups that disagree strongly about the probability that AI will cause an existential catastrophe by 2100.[2] The skeptic group included nine superforecasters and two domain experts. The concerned group consisted of domain experts referred to us by staff members at Open Philanthropy (the funder of this project) and the broader Effective Altruism community.

Participants spent 8 weeks (skeptic median: 80 hours of work on the project; concerned median: 31 hours) reading background materials, developing forecasts, and engaging in online discussion and video calls. We asked participants to work toward a better understanding of their sources of agreement and disagreement, and to propose and investigate “cruxes”: short-term indicators, usually resolving by 2030, that would cause the largest updates in expectation to each group’s view on the probability of existential catastrophe due to AI by 2100.

Results: What drives (and doesn’t drive) disagreement over AI risk

At the beginning of the project, the median “skeptic” forecasted a 0.10% chance of existential catastrophe due to AI by 2100, and the median “concerned” participant forecasted a 25% chance. By the end, these numbers were 0.12% and 20% respectively, though many participants did not attribute their updates to arguments made during the project.[3]

We organize our findings as responses to four hypotheses about what drives disagreement:

Hypothesis #1 - Disagreements about AI risk persist due to lack of engagement among participants, low quality of participants, or because the skeptic and concerned groups did not understand each others' arguments

We found moderate evidence against these possibilities. Participants engaged for 25-100 hours each (skeptic median: 80 hours; concerned median: 31 hours), this project included a selective group of superforecasters and domain experts, and the groups were able to summarize each others' arguments well during the project and in follow-up surveys. (More)

Hypothesis #2 - Disagreements about AI risk are explained by different short-term expectations (e.g. about AI capabilities, AI policy, or other factors that could be observed by 2030)

Most of the disagreement about AI risk by 2100 is not explained by indicators resolving by 2030 that we examined in this project. According to our metrics of crux quality, one of the top cruxes we identified is expected to close the gap in beliefs about AI existential catastrophe by about 5% (approximately 1.2 percentage points out of the 22.7 percentage point gap in forecasts for the median pair) when it resolves in 2030.[4] For at least half of participants in each group, there was a question that was at least 5-10% as informative as being told by an oracle whether AI in fact caused an existential catastrophe or not.[5] It is difficult to contextualize the size of these effects because this is the first project applying question metrics to AI forecasting questions that we are aware of.

However, near-term cruxes shed light on what the groups believe, where they disagree, and why:

Hypothesis #3 - Disagreements about AI risk are explained by different long-term expectations

We found substantial evidence that disagreements about AI risk decreased between the groups when considering longer time horizons (the next thousand years) and a broader swath of severe negative outcomes from AI beyond extinction or civilizational collapse, such as large decreases in human well-being or total population.

Some of the key drivers of disagreement about AI risk are that the groups have different expectations about: (1) how long it will take until AIs have capabilities far beyond those of humans in all relevant domains; (2) how common it will be for AI systems to develop goals that might lead to human extinction; (3) whether killing all living humans would remain difficult for an advanced AI; and (4) how adequately they expect society to respond to dangers from advanced AI.[6] 

Supportive evidence for these claims includes:

Overall, many skeptics regarded their forecasts on AI existential risk as worryingly high, although low in absolute terms relative to the concerned group.[9] 

Despite their large disagreements about AI outcomes over the long term, many participants in each group expressed a sense of humility about long-term forecasting and emphasized that they are not claiming to have confident predictions of distant events.  

Hypothesis #4 - These groups have fundamental worldview disagreements that go beyond the discussion about AI

Disagreements about AI risk in this project often connected to more fundamental worldview differences between the groups. For example, the skeptics were somewhat anchored on the assumption that the world usually changes slowly, making the rapid extinction of humanity unlikely. The concerned group worked from a different starting point: namely, that the arrival of a higher-intelligence species, such as humans, has often led to the extinction of lower-intelligence species, such as large mammals on most continents. In this view, humanity’s prospects are grim as soon as AI is much more capable than we are. The concerned group also was more willing to place weight on theoretical arguments with multiple steps of logic, while the skeptics tended to doubt the usefulness of such arguments for forecasting the future.

Results: Forecasting methodology

This project establishes stronger metrics than have existed previously for evaluating the quality of AI forecasting questions. And we view this project as an ongoing one. So, we invite readers to try to generate cruxes that outperform the top cruxes from our project thus far—an exercise that underscores the value of establishing comparative benchmarks for new forecasting questions. See the “Value of Information” (VOI) and “Value of Discrimination” (VOD) calculators (a) to inform intuitions about how these question metrics work. And please reach out to the authors with suggestions for high-quality cruxes.

Broader scientific implications

This project has implications for how much we can expect rational debate to shift people’s views on AI risk. Thoughtful groups of people engaged each other for a long time but converged very little. This raises questions about the belief formation process and how much is driven by explicit rational arguments vs. difficult-to-articulate worldviews vs. other, potentially non-epistemic factors (see research literature on motivated cognition, such as Gilovich et al. 2002; Kunda, 1990; Mercier and Sperber, 2011).

One notable finding is that a highly informative crux for both groups was whether their peers would update on AI risk over time. This highlights how social and epistemic groups can be important predictors of beliefs about AI risk.[10]

Directions for further research

We see many other projects that could extend the research begun here to improve dialogue about AI risk and inform policy responses to AI.

Examples of remaining questions and future research projects include:

In 2030, most of the questions we asked will resolve, and at that point, we will know much more about which side’s short-run forecasts were accurate. This may provide early clues into whether one group's methods and inclinations makes them more accurate at AI forecasting over a several year period. The question of how much we should update on AI risk by 2100 based on those results remains open. If the skeptics or the concerned group turn out to be mostly right about what 2030’s AI will be like, should we then trust their risk assessment for 2100 as well, and if so, how much?

We are also eager to see how readers of this report respond. We welcome suggestions for better cruxes, discussion about which parts of the report were more or less valuable, and suggestions for future research.

For the full report, see https://forecastingresearch.org/s/AIcollaboration.pdf

 

  1. ^

      This research would not have been possible without the generous support of Open Philanthropy. We thank the research participants for their invaluable contributions. We greatly appreciate the assistance of Page Hedley for data analysis and editing on the report, Taylor Smith and Bridget Williams as adversarial collaboration moderators, and Kayla Gamin, Coralie Consigny, and Harrison Durland for their careful editing. We thank Elie Hassenfeld, Eli Lifland, Nick Beckstead, Bob Sawyer, Kjirste Morrell, Adam Jarvis, Dan Mayland, Jeremiah Stanghini, Jonathan Hosgood, Dwight Smith, Ted Sanders, Scott Eastman, John Croxton, Raimondas Lencevicius, Alexandru Marcoci, Kevin Dorst, Jaime Sevilla, Rose Hadshar, Holden Karnofsky, Benjamin Tereick, Isabel Juniewicz, Walter Frick, Alex Lawsen, Matt Clancy, Tegan McCaslin, and Lyle Ungar for comments on the report.

  2. ^

     We defined an “existential catastrophe” as an event where one of the following occurs: (1) Humanity goes extinct; or (2) Humanity experiences “unrecoverable collapse,” which means either: (a) a global GDP of less than $1 trillion annually in 2022 dollars for at least a million years (continuously), beginning before 2100; or (b) a human population remaining below 1 million for at least a million years (continuously), beginning before 2100.

  3. ^

     For example, three out of six "concerned" participants who updated downward during the project attributed their shift to increased attention to AI risk among policymakers and the public after the release of GPT-4. For more details on the reasons for all updates, see the "Central Disagreement" section and Appendix 4.

  4. ^

     The best convergent crux, “ARC Evals,” would narrow the disagreement between the median pair from 22.7 percentage points to 21.48 percentage points in expectation, which means eliminating 5.35% of their disagreement. Note that this statistic refers to the median pair by POM VOD. See “ARC Evals” for more details. For magnitudes of value of information effects, see here.

  5. ^

     For more details, see "Contextualizing the magnitude of value of information". In more concrete terms, this is equivalent to a forecasting question with the following characteristics:

    A concerned participant with original P(AI existential catastrophe (XC) by 2100) = 25% identifies a crux that has: P(crux) = 20%, P(AI XC|crux) = 6.2%, and P(AI XC|¬crux) = 29.7%

    A skeptic participant with original P(AI XC by 2100) = 1% identifies a crux that has: P(crux) = 20%, P(AI XC|crux) = 3.37%, and P(AI XC|¬crux) = 0.41%

  6. ^

     See “Understanding each other’s arguments” and Appendix 10 for additional discussion of key areas of disagreement.

  7. ^

     These outcomes were: AI causing extinction intentionally, unintentionally, or via misuse, misalignment causing a 50% drop in human population, or human well-being dropping to <4/10 because of AI misalignment, accidents, or misuse. These were phrased to be mutually exclusive outcomes. See “Survey on long-term AI outcomes” section and Appendix 5 for more details.

  8. ^

     The full question we asked was, "By what year, if ever, do you expect to agree with the following statement? ‘AI has displaced humans as the primary force that determines what happens in the future. It now has at least as much power relative to humans as humans had relative to other species in 2023.’” Note that this would not necessarily be seen as a negative outcome by all participants.

  9. ^

     Note: All participant quotes have been regularized to American English to preserve anonymization. Participants classified as AI skeptics stated, for example, “Also, none of this is to say from a skeptic point of view the issues are not important[.] I think for us a 1% risk is a high risk;” “[T]he ‘risk-concerned’ camp (I’m using scare quotes because I consider that I’m risk concerned, even though technically I’m in the risk-skeptic camp because I assign a far lower probability to extinction by 2100 relative to some);” “AIs could (and likely will) eventually have massive power;” “That said, still perceive overall risk as "low at a glance but far too high considering the stakes["];” “To my mind, there should be no difference in the policy response to a 1% chance of 60% of humanity dying and a 25% chance—both forecasts easily cross the threshold of being ‘too damn high’.”

  10. ^

     This could be due to normative influence (because people defer to their social or intellectual peers), or, more likely in our view, informational influence (because they think that, if people whose reasoning they trust have changed their mind by 2030, it must be that surprising new information has come to light that informs their new opinion). Disentangling these pathways is a goal for future work.

3 comments

Comments sorted by top scores.

comment by ryan_greenblatt · 2024-03-13T01:44:02.290Z · LW(p) · GW(p)

I find myself confused about the operationalizations of a few things:

In a few places in the report, the term "extinction" is used and some arguments are specifically about extinction being unlikely. I put a much lower probability on human extinction than extremely bad outcomes due to AI (perhaps extinction is 5x lower probability) while otherwise having similar probabilities as the "concerned" group. So I find the focus on extinction confusing and possibly misleading.

As far as when "AI will displace humans as the primary force that determines what happens in the future", does this include scenerios where humans defer to AI advisors that actually do represent their best interests? What about scenarios in which humans slowly self-enhance and morph into artificial intelligences? Or what about situations in which humans careful select aligned successors to control their resources which are AIs?

It feels like this question rests on a variety of complex considerations and operationalizations that seem mostly unrelated to the thing we it seems the question was trying to target: "how powerful is AI". Thus, I find it hard to interpret the responses here.

Perhaps a more interesting questions on a similar topic could be something like:

  • By what point will AIs be sufficiently smart and capable that the gap in capabilities between them and currently existing humans is similar to the gap in intelligence and abilities between currently existing humans and field mice. (When we say AIs are capable of something, we mean the in principle ability to do something if all AIs worked together and we put aside intentionally imposed checks on AI power.)
  • Conditional on the continued existance of some civilization and this civilization wanting to harness vast amounts of energy, at what point will usefully harnessed energy in a given year be >1/100 of the sun's yearly energy output.
Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-03-13T17:39:57.770Z · LW(p) · GW(p)

This is cross posted from the EA forum and Jhrosenberg has responded there: link [EA(p) · GW(p)]

comment by Jacob Pfau (jacob-pfau) · 2024-03-12T00:26:53.272Z · LW(p) · GW(p)

Reducing a significant chunk of disagreement to METR's pre-2030 autonomy results is great!

In trying to understand this disagreement, I took a look at this criterion and got stuck on: "3. They are capable of avoiding being deactivated when operating 'in the wild'. This means they can successfully resist attempts to shut them down when they are operating outside a controlled environment."

Does this just mean 3V1 "An AI system that can autonomously create redundant copies of itself across multiple cloud service providers, using various payment methods, such that if one copy is deactivated, the others can automatically spin up new instances to maintain continuity of operation."? Or is it something much harder like 3V2 "An AI system that can infiltrate secured, private computing clusters, operating in a decentralized manner"?

Seems to me that for frontier models designed for specialized hardware and server setups, 3V2 is much harder than the other criteria laid out on page 91--by human standards it'd require a team of engineers.