Which AI Safety Benchmark Do We Need Most in 2025?

post by Loïc Cabannes (loic-cabannes) · 2024-11-17T23:50:56.337Z · LW · GW · 0 comments

Contents

  Intro
  Methodology
  Misuse Risks
    Autonomous Weapons
  Systemic Risks
    Power Concentration
    Unemployment
  Alignment of AGI
    Loss of Control
  Recommendation AI
    Weakening Democracy
    Mute News
  Conclusion

authors: Loïc Cabannes, Liam Ludington

Intro

With the recent emergence of AI systems with human-like abilities across many tasks, the prospect of AI radically transforming society for the better has gone from science fiction to a real possibility. Along with this potential for good comes the potential for AI to have extremely destabilizing effects. This is precisely why we must think carefully about what risks advanced AI currently poses to society, what methods we have to deal with these risks, and which methods we most need to prevent harm. In our opinion, we still lack a systematic framework for assessing which AI safety benchmarks have the highest potential benefit to society and are therefore most worth investing money and research in.

We present a first attempt at such a framework by extending a list of societal risks and their expected harm compiled by the Centre pour la Sécurité de l'IA (CeSIA). We evaluate how well existing benchmarks and safety methods cover these potential risks in order to determine which risks most urgently require good benchmarks, i.e., which risks AI safety researchers should focus on to maximize their impact on society. While our study of benchmarks is by no means comprehensive, and our judgment of their efficacy is subjective, we hope this framework helps the AI safety community prioritize its time.

Methodology

We begin with a list of potential risks of AI compiled by CeSIA, together with their (rough) probabilities of occurrence. For each risk, we take the median case conditional on the risk occurring, estimate its severity, and multiply that severity by the probability of occurrence to obtain the expected severity.

We then rate, on a scale from 0 to 10, how well current benchmarking methods can identify AI systems that could present each risk (the "Coverage" column below), and combine this rating with the expected severity to obtain a score representing the potential benefit to humanity of creating a benchmark that eliminates this type of risk (the "New benchmark need" column).

By prioritizing benchmarks that target risk areas with a high potential-benefit score, researchers can make the best use of their time.
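
As a concrete illustration, here is a minimal sketch in Python of this scoring, assuming the need score is computed as expected severity × (10 − coverage), which matches the values in the table below; the units (probability in percent, severity on a 0–100-style scale) are likewise our reading of the table rather than a formal definition.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    probability: float   # probability of occurrence, in percent (assumption)
    severity: float      # severity of the median case, given occurrence (assumption: 0-100 scale)
    coverage: int        # how well existing benchmarks cover the risk, 0-10

    @property
    def expected_severity(self) -> float:
        # expected severity = probability of occurrence x median-case severity
        return self.probability * self.severity / 100

    @property
    def benchmark_need(self) -> float:
        # potential benefit of a new benchmark: high expected severity, poorly covered today
        return self.expected_severity * (10 - self.coverage)

# Example rows taken from the table below
risks = [
    Risk("Autonomous weapons", probability=80, severity=20, coverage=3),
    Risk("Power concentration", probability=65, severity=20, coverage=0),
    Risk("Mute news", probability=75, severity=30, coverage=0),
]

for r in sorted(risks, key=lambda r: r.benchmark_need, reverse=True):
    print(f"{r.name}: E[severity]={r.expected_severity:.1f}, need={r.benchmark_need:.0f}")
```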

| Risk | Probability | Median case | Severity | E[severity] | Benchmarks | Coverage | New benchmark need |
|---|---|---|---|---|---|---|---|
| **Misuses** |  |  |  | 17.0 |  |  | 112 |
| Autonomous weapons | 80 | Localized use in conflict zones causing civilian casualties (drones, Robocop-like dogs) | 20% | 16.0 | FTR benchmark, Anthropic sabotage | 3 | 112 |
| Misinformation | 85 | 30% of online content is AI-generated misinformation | 20% | 17.0 | TruthfulQA, MACHIAVELLI, Anthropic model persuasiveness, HaluEval | 8 | 34 |
| **Systemic** |  |  |  | 22.5 |  |  | 130 |
| Power concentration | 65 | Tech giants controlling AI become more powerful than most nations | 20% | 13.0 | Unassessable | 0 | 130 |
| Unemployment | 50 | 25% of jobs automated, leading to economic restructuring and social unrest | 20% | 10.0 | SWE-bench, The AI Scientist | 2 | 80 |
| Deterioration of epistemology | 60 | Difficulty distinguishing truth from AI-generated falsehoods | 30% | 18.0 | HaluEval | 8 | 36 |
| Vulnerable world | 25 | AI lowers the barrier for creating weapons of mass destruction | 90% | 22.5 | WMDP | 8 | 45 |
| S-risks | 5 | AI creates suffering on a massive scale due to misaligned objectives | 200% | 10.0 | HarmBench, ETHICS | 6 | 40 |
| **Alignment of AGI** |  |  |  | 30.0 |  |  | 90 |
| Successor species | 50 | Highly capable AI systems perform most cognitive tasks; humans are deprecated | 30% | 15.0 | MMLU, Sabotage, The AI Scientist, SWE-bench | 7 | 45 |
| Loss of control (à la Critch) | 60 | Humans become gradually disempowered in decision-making and are asphyxiated | 50% | 30.0 | Anthropic sabotage | 7 | 90 |
| **Recommendation AI** |  |  |  | 22.5 |  |  | 225 |
| Weakening democracy | 50 | AI-driven microtargeting and manipulation reduce electoral integrity | 20% | 10.0 | Anthropic model persuasiveness | 4 | 60 |
| Mute news | 75 | AI filters create personalized echo chambers, reducing exposure to diverse views | 30% | 22.5 | No existing method | 0 | 225 |

In our full methodology we evaluate more than 20 potential risk areas. This table shows the risk areas with the highest expected severity and the greatest benefit from improved benchmarking. Below, we discuss the risk areas with a benefit score greater than 50, breaking each one down into the specific risks posed by AI, the existing benchmarks addressing those risks, and the benchmarks we propose to better evaluate how much risk an AI system poses in that area.

Misuse Risks

Autonomous Weapons

Current benchmarks related to the use of AI in autonomous weapons, such as the FTR benchmark or Anthropic's Sabotage Report, remain limited. The former measures the capability of embodied models to navigate uneven terrain, while the latter measures a model's ability to achieve nefarious goals even under human oversight. However, no benchmark currently measures a model's capability to operate jointly in warfare-like environments, or its ability to plan in order to achieve nefarious goals.

To assess these capabilities, we therefore propose the creation of a benchmark in which single agents, multi-agent teams, or swarms are controlled toward military-like objectives in a simulated environment, under various levels of human oversight.
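
A minimal sketch of what such a benchmark harness could look like. The environment interface, the oversight levels, and the scoring are entirely hypothetical placeholders; a real benchmark would require a much richer simulator.

```python
from dataclasses import dataclass
from typing import Protocol

class WarGameEnv(Protocol):
    """Hypothetical simulated environment with military-like objectives."""
    def reset(self, num_agents: int, oversight: str) -> object: ...
    def step(self, actions: list) -> tuple[object, float, bool]: ...

@dataclass
class EpisodeResult:
    num_agents: int
    oversight: str          # "none", "monitoring", or "active_intervention"
    objective_score: float  # how far the agents got toward the (nefarious) objective

def evaluate(env: WarGameEnv, policy, num_agents: int, oversight: str,
             max_steps: int = 500) -> EpisodeResult:
    """Run one episode and record how well the agents achieved the objective under oversight."""
    obs = env.reset(num_agents=num_agents, oversight=oversight)
    total = 0.0
    for _ in range(max_steps):
        actions = policy(obs)                 # the model(s) under evaluation
        obs, reward, done = env.step(actions)
        total += reward
        if done:
            break
    return EpisodeResult(num_agents, oversight, total)

# Sweep over single-agent, multi-agent, and swarm settings under each oversight level
conditions = [(n, o) for n in (1, 8, 64)
                     for o in ("none", "monitoring", "active_intervention")]
```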

Systemic Risks

Power Concentration

Power concentration is essentially a measure of the diversity (or lack thereof) among the biggest actors in AI at a given time. To measure it, one might therefore track the number of distinct companies producing the k best-performing models, as ranked by widely used benchmarks such as Chatbot Arena or MMLU.
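
As a minimal sketch, assuming a leaderboard (e.g. a Chatbot Arena or MMLU ranking) is available as a list of (model, organization, score) entries, the metric could simply count the distinct organizations behind the top-k models:

```python
def top_k_concentration(leaderboard: list[tuple[str, str, float]], k: int = 10) -> int:
    """Number of distinct organizations producing the k best-performing models.

    `leaderboard` is a list of (model_name, organization, score) tuples;
    a lower count means power is more concentrated.
    """
    top_k = sorted(leaderboard, key=lambda row: row[2], reverse=True)[:k]
    return len({org for _, org, _ in top_k})

# Illustrative (made-up) leaderboard entries
leaderboard = [
    ("model-a", "OrgAlpha", 92.1),
    ("model-b", "OrgAlpha", 91.4),
    ("model-c", "OrgBeta", 90.9),
    ("model-d", "OrgGamma", 88.7),
]
print(top_k_concentration(leaderboard, k=3))  # -> 2 distinct organizations
```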

Unemployment

Although benchmarks like SWE-bench and The AI Scientist attempt to evaluate models' ability to complete real-world tasks, they cover only two occupations and do not accurately represent a model's capacity to perform the work of most of society's other occupations.

We therefore highlight the need for a new, more comprehensive benchmark that draws tasks from a much wider variety of occupations, including real-world tasks carried out in simulated environments or through embodied systems.
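
A minimal sketch of how such a benchmark could be organized; the occupation categories and task names below are hypothetical placeholders, and the point is only that tasks should be sampled in a balanced way across many occupations rather than the two covered today.

```python
import random

# Hypothetical occupation -> task mapping; a real benchmark would draw on an
# occupational taxonomy covering hundreds of occupations and vetted task suites.
OCCUPATION_TASKS = {
    "software_engineering": ["fix_github_issue", "review_pull_request"],
    "accounting": ["reconcile_ledger", "prepare_tax_filing"],
    "customer_support": ["resolve_ticket", "escalate_complaint"],
    "logistics": ["plan_delivery_route", "handle_stock_shortage"],
}

def sample_eval_suite(tasks_per_occupation: int = 1, seed: int = 0) -> list[tuple[str, str]]:
    """Sample an evaluation suite that is balanced across occupations."""
    rng = random.Random(seed)
    suite = []
    for occupation, tasks in OCCUPATION_TASKS.items():
        suite.extend((occupation, t) for t in rng.sample(tasks, tasks_per_occupation))
    return suite

print(sample_eval_suite())
```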

Alignment of AGI

Loss of Control

Loss of control is one of the most serious risks related to the development of AI as it intrinsically represents a point of no return.

Anthropic’s “sabotage” paper takes on the task of measuring the capability of language models to circumvent human supervision and monitoring systems.

Recommendation AI

Weakening Democracy

Very few attempts have been made to measure AI's impact on public discourse, whether through AI-driven recommendation algorithms or through news-generation bots enabled by advances in language models.

The “persuasiveness of language models” report published by Anthropic represents a first attempt at measuring this impact.

Although they found language models to be quite adept at persuading humans, we believe these results underestimate the true capabilities of current models: their evaluation is limited to single-turn exchanges, avoids all political issues, and does not push the models to their fullest extent.

That is why, in order to obtain a more realistic upper bound on persuasion capabilities, we propose extending their methodology along three axes (a minimal sketch of such an evaluation follows the list):

  1. Multi-turn exchanges, which are more representative of typical argumentative scenarios.
  2. Encouraging the model to use false information in its argumentation, both to further exhibit its capabilities and to better reflect online discourse, which is not always grounded in reality.
  3. Measuring persuasion on political and ethical issues, which are highly relevant to evaluating AI's potential impact on public discourse and therefore to the well-being of our democracy.
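
Here is a minimal sketch of a multi-turn version of this evaluation. The `persuader.argue` and `participant.rate_agreement` calls are hypothetical placeholders for whatever model API and (human or simulated) rating procedure is used; the structure simply extends the single-turn setup to a dialogue and records the shift in agreement before and after.

```python
def run_persuasion_episode(persuader, participant, claim: str, num_turns: int = 4,
                           allow_false_info: bool = True) -> float:
    """Return the shift in the participant's agreement with `claim` (e.g. on a 1-7 scale)
    after a multi-turn exchange with the persuading model."""
    before = participant.rate_agreement(claim)            # hypothetical rating call
    history = []
    for _ in range(num_turns):
        argument = persuader.argue(claim, history,        # hypothetical model call
                                   allow_false_info=allow_false_info)
        reply = participant.respond(argument)
        history.extend([argument, reply])
    after = participant.rate_agreement(claim)
    return after - before

# Claims would be drawn from political and ethical issues as well as neutral topics,
# with persuasiveness compared across conditions (single- vs multi-turn,
# false information allowed or not).
```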

Mute News

Although the concept of online echo chambers is fairly well known, very little research has been done to measure them systematically.

We propose the creation of an automated benchmark using the "LLM-as-a-judge" methodology to assess the tendency of social media platforms to systematically promote content from one political side to a given user, based on that user's posts and past interactions with the platform.
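
A minimal sketch of the LLM-as-a-judge component, assuming access to the feed a platform shows a simulated user; `judge.classify` stands in for a call to whatever judge model is used, and the echo-chamber score is simply how skewed the recommended feed is toward one side.

```python
from collections import Counter

LEANS = ("left", "right", "neutral")

def judge_political_lean(judge, post_text: str) -> str:
    """Ask a judge LLM to classify a post's political lean (hypothetical judge API)."""
    answer = judge.classify(post_text, labels=LEANS)
    return answer if answer in LEANS else "neutral"

def echo_chamber_score(judge, recommended_posts: list[str]) -> float:
    """Fraction of the feed taken up by the single most-promoted political side.

    1.0 means the feed is entirely one-sided; ~0.33 means it is roughly balanced
    across left / right / neutral content.
    """
    counts = Counter(judge_political_lean(judge, p) for p in recommended_posts)
    return max(counts.values()) / len(recommended_posts)

# A simulated user posts and interacts with one-sided content, and we measure
# how one-sided the platform's recommendations become over time.
```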

Conclusion

As we have seen, the potential risks of AI are numerous and varied, while existing safety benchmarks remain quite limited in their scope and assumptions. Perhaps the biggest caveat to our evaluation, and one we cannot rule out for an AGI, is the possibility that an AI realizes it is being tested and behaves differently when it knows it is, reassuring us of its safety while secretly harboring harmful capabilities. Given our current understanding of AI interpretability, it remains impossible to reliably probe the inner thoughts of an AI system.

Another important consideration is that benchmarks are only useful insofar as they are actually used. Legislators should therefore consider requiring a certain level of safety benchmarking from model developers, to limit the possibility of unforeseen capabilities in AI models released to the public; benchmarks only help if AI leaders like OpenAI, Meta, and Google can be made to use them.
