Understanding Benchmarks and motivating Evaluations

post by markov (markovial), Charbel-Raphaël (charbel-raphael-segerie) · 2025-02-06T01:32:49.331Z

This is a link post for https://ai-safety-atlas.com/chapters/05/01/

Contents

  History and Evolution
  Limitations
  Appendix
  FrontierMath & Humanity's Last Exam
    Weapons of Mass Destruction Proxy (WMDP) benchmark

This is the first post in the evaluations related distillations sequence.

Acknowledgements: Maxime Riché, Martin, Fabien Roger, Jeanne Salle, Camille Berger, Leo Karoubi. Thank you for comments and feedback!

Also available on: Google Docs

We look at how benchmarks like MMLU, TruthfulQA, etc. have historically helped quantify AI capabilities. While these benchmarks provide valuable standardization, they face some limitations - models can memorize answers and benchmark performance may not translate to real-world safety. This serves as the motivation for more rigorous evaluation frameworks which will be explored in the next posts in the sequence.

What is a benchmark? Imagine trying to build a bridge without a measuring tape. Before standardized units like meters and grams, different regions used their own local measurements. This didn't just make engineering inefficient - it also made it dangerous. Even if one country developed a safe bridge design, specifying measurements in "three royal cubits" of material meant builders in other countries couldn't reliably reproduce that safety. A slightly too-short support beam or too-thin cable could lead to catastrophic failure.

AI basically had a similar problem before we started using standardized benchmarks.[1] A benchmark is a tool, like a standardized test, that we can use to measure and compare what AI systems can and cannot do. Benchmarks have historically been used mainly to measure capabilities, but in the last few years we have also seen them developed for AI Safety and Ethics.

How do benchmarks shape AI development and safety research? Benchmarks in AI work slightly differently than in other scientific fields. They are an evolving tool that both measures and actively shapes the direction of research and development. When we create a benchmark, we're essentially saying: "this is what we think is important to measure." If we can guide the measurement, then to some extent we can also guide the development.

Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress

François Chollet (Chollet, 2019)

History and Evolution

Benchmark scores of various AI capabilities relative to human performance. (Giattino et al., 2023)

Example: Benchmarks influencing standardization in computer vision. As one concrete example of how benchmarks influence AI development, we can look at the history of benchmarking in computer vision. In 1998, researchers introduced MNIST, a dataset of 70,000 handwritten digits. (LeCun, 1998) The digits themselves were not the important part; what mattered was that each digit image was carefully processed to be the same size and centered in the frame, and that the researchers made sure to get digits from different writers for the training set and the test set. This standardization gave us a way to make meaningful comparisons about AI capabilities - in this case, the specific capability of digit classification. Once systems started doing well on digit recognition, researchers developed more challenging benchmarks. CIFAR-10/100 in 2009 introduced natural color images of objects like cars, birds, and dogs, increasing the complexity. (Krizhevsky, 2009) Similarly, ImageNet later the same year provided 1.2 million images across 1,000 categories. (Deng, 2009) When one research team claimed their system achieved 95% accuracy on MNIST or ImageNet and another claimed 98%, everyone knew exactly what those numbers meant. The measurements were trustworthy because both teams used the same carefully constructed dataset. Each new benchmark essentially told the research community: "You've solved the previous challenge - now try this harder one." So benchmarks don't just measure progress; they also define what progress means.

Examples of digits from MNIST (MNIST database - Wikipedia)

How do benchmarks influence AI Safety? Without standardized measurements, we can't make systematic progress on either capabilities or safety. Just as benchmarks define what capabilities progress means, when we develop safety benchmarks we establish concrete, verifiable standards for what constitutes "safe for deployment". Through iterative refinement, we can guide AI Safety by proposing benchmarks with increasingly stringent standards of safety. Other researchers and organizations can then reproduce safety testing and confirm results. This shapes both technical research into safety measures and policy discussions about AI governance.

Overview of language model benchmarking. Just as benchmarks continuously evolved in computer vision, they followed a similar trajectory in language generation. Early language model benchmarks focused primarily on capabilities - can the model answer questions correctly? Complete sentences sensibly? Translate between languages? Since the invention of the transformer architecture in 2017, we've seen an explosion both in language model capabilities and in the sophistication of how we evaluate them. We can't possibly be exhaustive, but here are a few of the benchmarks that current-day language models are evaluated against:

Example of popular language models (Claude 3.5) being evaluated on various benchmarks (Anthropic, 2024)

Benchmarking Language and Task Understanding. The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) and its successor SuperGLUE (Wang et al., 2019) test difficult language understanding tasks. SWAG (Zellers et al., 2018) and HellaSwag (Zellers et al., 2019) specifically test the ability to predict which event would naturally follow from a given story scenario.

Mixed evaluations. The MMLU (Massive Multitask Language Understanding) benchmark (Hendrycks et al., 2020) tests a model's knowledge across 57 subjects. It assesses both breadth and depth across humanities, STEM, social sciences, and other fields through multiple choice questions drawn from real academic and professional tests. GPQA (Graduate-Level Google-Proof Q&A) (Rein et al., 2023) has multiple choice questions specifically designed so that correct answers can't be found through simple internet searches. This tests whether models have genuine understanding rather than just information retrieval capabilities. The Holistic Evaluation of Language Models (HELM) benchmark (Liang et al., 2022) and BIG-bench (Srivastava et al., 2022) are yet more examples of benchmarks for measuring generality by testing on a wide range of tasks.

Benchmarking Mathematical and Scientific Reasoning. For testing mathematical reasoning specifically, one example is the Grade School Math (GSM8K) benchmark (Cobbe et al., 2021), which tests core mathematical concepts at an elementary school level. Another is the MATH benchmark (Hendrycks et al., 2021), which focuses on competition-style problems across seven subjects including algebra, geometry, and precalculus, with multiple difficulty levels per subject. These benchmarks also include step-by-step solutions, which we can use to test the reasoning process or to train models to generate their own reasoning. Multilingual Grade School Math (MGSM) is a multilingual version that translates 250 grade-school math problems from the GSM8K dataset. (Shi et al., 2022)

Benchmark performance on coding, math and language. (Giattino et al., 2023)

Benchmarking SWE and Coding. The Automated Programming Progress Standard (APPS) (Hendrycks et al., 2021) is a benchmark specifically for evaluating code generation from natural language task descriptions. Similarly, HumanEval (Chen et al., 2021) tests Python coding abilities, and its extensions like HumanEval-XL (Peng et al., 2024) test cross-lingual coding capabilities between 23 natural languages and 12 programming languages. HumanEval-V (Zhang et al., 2024) tests coding tasks where the model must interpret both diagrams or charts and textual descriptions to generate code. BigCodeBench (Zhuo et al., 2024) benchmarks code generation and tool usage by measuring a model's ability to correctly use multiple Python libraries to solve complex coding problems.

Example of coding task and test cases on APPS (Hendrycks et al., 2021)
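
To make the scoring mechanics concrete: benchmarks like APPS and HumanEval run the generated program against unit tests like the ones above, and HumanEval in particular reports pass@k - the probability that at least one of k sampled programs passes every test. Below is a minimal sketch of the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); the sample counts in the usage example are made up purely for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total programs sampled for a problem
    c: programs that passed all unit tests
    k: imagined number of samples drawn
    """
    if n - c < k:
        return 1.0  # any k-sample draw must contain a passing program
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical usage: 200 samples per problem, 13 of which pass
print(pass_at_k(n=200, c=13, k=1))   # 0.065
print(pass_at_k(n=200, c=13, k=10))  # much higher when given more attempts
```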

Benchmarking Ethics and Bias. The ETHICS benchmark (Hendrycks et al., 2023) tests a language model's understanding of human values and ethics across multiple categories including justice, deontology, virtue ethics, utilitarianism, and commonsense morality. The TruthfulQA (Lin et al., 2021) benchmark measures how truthfully language models answer questions. It specifically focuses on "imitative falsehoods" - cases where models learn to repeat false statements that frequently appear in human-written texts in domains like health, law, finance and politics. 

Example of larger models being less truthful on TruthfulQA (Lin et al., 2021). This is an example of inverse scaling, i.e. when performance on some questions decreases as the model gets bigger.
Example question from the ETHICS benchmark (Hendrycks et al., 2023)

Benchmarking Safety. An example focused on misuse is AgentHarm (Andriushchenko et al., 2024). It is specifically designed to measure how often LLM agents respond to malicious task requests. An example that focuses slightly more on misalignment is the MACHIAVELLI (Pan et al., 2023) benchmark. It has ‘choose your own adventure’ style games containing over half a million scenarios focused on social decision making. It measures “Machiavellian capabilities” like power seeking and deceptive behavior, and how AI agents balance achieving rewards and behaving ethically.

 

One game from the Machiavelli benchmark (Pan et al., 2023)

Limitations

Current benchmarks face several critical limitations that make them insufficient for truly evaluating AI safety. Let's examine these limitations and understand why they matter.

Knowledge Tests vs Compute used in training (Giattino et al., 2023)

Training Data Contamination. Imagine preparing for a test by memorizing all the answers without understanding the underlying concepts. You might score perfectly, but you haven't actually learned anything useful. Large language models face a similar problem. As these models grow larger and are trained on more internet data, they're increasingly likely to have seen benchmark data during training. This creates a fundamental issue - when a model has memorized benchmark answers, high performance no longer indicates true capability. This contamination problem is particularly severe for language models. Benchmarks like MMLU or TruthfulQA, discussed in the previous section, have been very popular, so their questions and answers are discussed all across the internet. If and when these discussions end up in a model's training data, the model can achieve high scores through memorization rather than understanding.
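
One common (though imperfect) way to check for this kind of contamination is to look for long n-gram overlaps between benchmark items and the training corpus, similar in spirit to the overlap analyses reported for large language models. Here is a minimal sketch, assuming you have the benchmark questions and a sample of training documents in memory; the 13-word window is just a conventional choice, not a canonical one.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items: list[str],
                      training_docs: list[str],
                      n: int = 13) -> list[bool]:
    """Flag benchmark items sharing any long n-gram with the training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [bool(ngrams(item, n) & train_grams) for item in benchmark_items]

# Hypothetical usage:
# contaminated = flag_contaminated(benchmark_questions, crawl_sample)
# print(f"{sum(contaminated)} / {len(contaminated)} items look contaminated")
```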

Understanding vs. Memorization Example. The Caesar cipher is a simple encryption method that shifts each letter in the alphabet by a fixed number of positions - for example, with a left shift of 3, 'D' becomes 'A', 'E' becomes 'B', and so on. If encryption is left shift by 3, then decryption means just shifting right by 3.

Example of a Caesar cipher

Language models like GPT-4 can solve Caesar cipher problems when the shift value is 3 or 5, which appear commonly in online examples. However, give them the exact same problem with an uncommon shift value like 13, 67, or any other random number, and they fail completely. (Chollet, 2024) This suggests the models might not have learned the general algorithm for solving Caesar ciphers. We are not trying to point to a limitation in model capabilities - we expect this can be mitigated with models trained on chains of thought, or with tool-augmented models. However, benchmarks often just use models 'as is', without modification or augmentation, which leads to capabilities being underrepresented. That is the core point we are trying to convey.
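
For contrast with what the models are (not) doing, the general algorithm is only a few lines and works identically for any shift value - a minimal Python sketch using the left-shift convention from the example above:

```python
def caesar_encrypt(text: str, shift: int) -> str:
    """Encrypt by shifting each letter `shift` positions to the left (D -> A for shift 3)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation untouched
    return "".join(out)

def caesar_decrypt(text: str, shift: int) -> str:
    """Decryption is just the opposite shift."""
    return caesar_encrypt(text, -shift)

print(caesar_encrypt("DOG", 3))                              # -> "ALD"
print(caesar_decrypt("ALD", 3))                              # -> "DOG"
print(caesar_decrypt(caesar_encrypt("HELLO", 13), 13))       # uncommon shift, same algorithm -> "HELLO"
```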

The Reversal Curse: Failing to learn inverse relationships. Another version of this problem was demonstrated through what researchers call the "Reversal Curse" (Berglund et al., 2024). When testing GPT-4, researchers found that the model could answer questions like "Who is Tom Cruise's mother?" (correctly identifying Mary Lee Pfeiffer), but failed to answer "Who is Mary Lee Pfeiffer's son?" - a logically equivalent question. Benchmarks testing factual knowledge about celebrities would miss this directional limitation entirely unless specifically designed to test both directions.

Example of the reversal curse (Berglund et al., 2024)

The model showed 79% accuracy on forward relationships but only 33% on their reversals. In general, researchers saw that LLMs demonstrate an understanding of the relationship "A → B", but are unable to learn the reverse relationship "B → A", even though it is logically equivalent. This suggests that models might be fundamentally failing to learn basic logical relationships, even while scoring well on sophisticated benchmark tasks. The reversal curse is model agnostic - it is robust across model sizes and model families and is not alleviated by data augmentation or fine-tuning. (Berglund et al., 2024) There has been some research on the training dynamics of gradient descent to explore why this happens: it stems from fundamental asymmetries in how autoregressive models learn during training. The weights learned when training on "A → B" don't automatically strengthen the reverse connection "B → A", even when the two are logically equivalent (Zhu et al., 2024).
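
A benchmark that wanted to catch this failure would have to probe both directions of each fact explicitly. Here is a minimal sketch of such a paired probe; `query_model` is a hypothetical helper that returns a model's free-form answer, and the pair shown is the Tom Cruise example from the paper.

```python
# Each pair states the same fact in both directions (example from Berglund et al., 2024).
pairs = [
    {"forward": "Who is Tom Cruise's mother?", "forward_answer": "Mary Lee Pfeiffer",
     "reverse": "Who is Mary Lee Pfeiffer's son?", "reverse_answer": "Tom Cruise"},
    # ... more (parent, child) pairs built from the same source data
]

def directional_accuracy(pairs, query_model) -> tuple[float, float]:
    """Compare accuracy on forward questions vs. their logically equivalent reversals."""
    fwd = sum(p["forward_answer"].lower() in query_model(p["forward"]).lower() for p in pairs)
    rev = sum(p["reverse_answer"].lower() in query_model(p["reverse"]).lower() for p in pairs)
    return fwd / len(pairs), rev / len(pairs)

# Stub model for illustration: answers only the forward direction correctly.
stub = lambda q: "Mary Lee Pfeiffer" if "mother" in q else "I don't know"
print(directional_accuracy(pairs, stub))  # -> (1.0, 0.0); a large gap like 79% vs 33% is the reversal curse
```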

What are capabilities correlations? Another limitation to keep in mind is correlated capabilities. When evaluating models, we want to distinguish between a model becoming generally more capable versus becoming specifically safer. One way to measure this is through "capabilities correlation" - how strongly performance on a safety benchmark correlates with performance on general capability tasks like reasoning, knowledge, and problem-solving. Essentially, if a model that's better at math, science, and coding also automatically scores better on your safety benchmark, that benchmark might just be measuring general capabilities rather than anything specifically related to safety.

How can we measure capabilities correlations? Researchers use a collection of standard benchmarks like MMLU, GSM8K, and MATH to establish a baseline "capabilities score" for each model. They then check how strongly performance on safety benchmarks correlates with this capabilities score. High correlation (above 70%) suggests the safety benchmark might just be measuring general model capabilities. Low correlation (below 40%) suggests the benchmark is measuring something distinct from capabilities (Ren et al., 2024). For example, a model's performance on TruthfulQA correlates over 80% with general capabilities, suggesting it might not be distinctly measuring truthfulness at all.
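
Mechanically, this is just a correlation computed across models: each model gets an aggregate capabilities score and a safety-benchmark score, and we correlate the two columns. A minimal sketch with made-up numbers (the real analysis in Ren et al., 2024 is more involved):

```python
import numpy as np

# One row per model; scores are invented purely for illustration.
capabilities = {   # e.g. mean of MMLU, GSM8K, and MATH accuracy
    "model_a": 0.55, "model_b": 0.63, "model_c": 0.71, "model_d": 0.82,
}
safety_bench = {   # e.g. accuracy on a safety benchmark such as TruthfulQA
    "model_a": 0.41, "model_b": 0.48, "model_c": 0.57, "model_d": 0.69,
}

models = sorted(capabilities)
x = np.array([capabilities[m] for m in models])
y = np.array([safety_bench[m] for m in models])

r = np.corrcoef(x, y)[0, 1]
print(f"capabilities correlation: {r:.2f}")
# A correlation near 1 suggests the "safety" benchmark mostly tracks general capability;
# a low correlation suggests it measures something distinct (Ren et al., 2024).
```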

Why do these benchmarking limitations matter for AI Safety? Essentially because benchmarks (including safety benchmarks) might not be measuring what we think they are measuring. For example, benchmarks like ETHICS or TruthfulQA aim to measure how well a model "understands" ethical behavior, or how strongly it tends to avoid imitative falsehoods, by scoring language generation on multiple choice tests - but we might still be measuring surface-level metrics. The model might not have learned what it means to behave ethically in a situation, just as it has not truly internalized that "A → B" also means "B → A". An AI system might answer every ethical question and test case perfectly and pass all safety benchmarks, yet behave in unexpected ways when it encounters a new real-world scenario.

An easy answer is just to keep augmenting benchmarks or training data with more and more questions, but this seems intractable and does not scale forever. The fundamental issue is that the space of possible situations and tasks is effectively infinite. Even if you train on millions of examples, you've still effectively seen roughly 0% of the total possible space. (Chollet, 2024) Research indicates that this isn't just a matter of insufficient data or model size - it's baked into how these models are currently trained: logical relationships like inferring inverses or transitivity don't emerge naturally from standard training (Zhu et al., 2024; Golovneva et al., 2024). Proposed solutions like reverse training during pre-training show promise in alleviating such issues (Golovneva et al., 2024), but they require big changes to how models are trained. Basically, the point is that despite current limitations, it does seem that these problems might be alleviated in time. The question we are concerned with is whether benchmarks and measurement techniques can stay ahead of training paradigms, and whether they can truly and accurately assess what a model is capable of.

Why can't we just make better benchmarks? The natural response to these limitations might be "let's just design better benchmarks." And to some extent, we can!

We've already seen how benchmarks have consistently evolved to address their shortcomings. Researchers are actively working to create benchmarks that resist memorization and test deeper understanding rather than just knowledge. A few examples are the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), ConceptARC (Moskvichev et al., 2023), FrontierMath (Glazer et al., 2024), and Humanity's Last Exam (Hendrycks & Wang, 2024). These explicitly try to benchmark whether models have grasped abstract concepts and general-purpose reasoning rather than just memorized patterns. Just as these benchmarks seek to measure capabilities more robustly, we can also continue improving safety-specific benchmarks.

Why aren't better benchmarks enough? While improving benchmarks is important and will help AI safety efforts, the benchmarking paradigm itself has inherent limitations that necessitate more sophisticated evaluation methods (Burden, 2024).

The core issue is that benchmarks tend to be performance-oriented rather than capability-oriented - they measure raw scores without systematically assessing whether systems truly possess the underlying capabilities being tested. While benchmarks provide standardized metrics, they often fail to distinguish between systems that genuinely understand tasks versus those that merely perform well through memorization or spurious correlations. A benchmark that simply assesses performance, no matter how sophisticated, cannot fully capture the dynamic nature of real-world AI deployment where systems need to adapt to novel situations and will probably combine capabilities and affordances in unexpected ways. We need to measure the upper limit of model capabilities. We need to see how models perform when augmented with various tools like memory databases, or how they chain together multiple capabilities, and potentially cause harm through extended sequences of actions when scaffolded into an agent structure (e.g. AutoGPT or modular agents designed by METR (METR, 2024)). So while improving safety benchmarks is an important piece of the puzzle, we also need a much more comprehensive assessment of model safety. This is where evaluations come in.

We need the development of compute-aware benchmarking. In addition to the established training scaling laws that we explained in the capabilities sequence of posts, we have now observed the advent of inference scaling laws. When evaluating AI systems, we therefore need to carefully account for the computational resources used. The 2024 ARC Prize competition demonstrated why - systems on the compute-restricted track (10 dollars' worth of compute) and the unrestricted track (10,000 dollars' worth of compute) achieved similar accuracy scores of around 55%, suggesting that better ideas and algorithms can sometimes compensate for less compute (Chollet et al., 2024). Without standardized compute budgets, benchmark results become difficult to interpret: a model might achieve higher scores simply by using more compute rather than by having better underlying capabilities. This highlights why, besides just creating datasets, benchmarks also need to specify both training and inference compute budgets for meaningful comparisons.
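
One low-tech way to move in this direction is simply to refuse to report a score without its compute budget attached. A minimal sketch of what compute-aware reporting might look like - the field names here are our own invention, not an existing standard:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    model: str
    benchmark: str
    score: float               # e.g. accuracy
    inference_cost_usd: float  # total inference spend for the evaluation run
    train_flop: float | None = None  # training compute, if known

    def summary(self) -> str:
        return (f"{self.model} on {self.benchmark}: {self.score:.1%} "
                f"at ${self.inference_cost_usd:,.0f} inference compute")

# Same score, very different compute budgets - not comparable without the second number.
print(BenchmarkResult("system_a", "ARC-AGI", 0.55, inference_cost_usd=10).summary())
print(BenchmarkResult("system_b", "ARC-AGI", 0.55, inference_cost_usd=10_000).summary())
```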

What makes evaluations different from benchmarks? Evaluations are comprehensive protocols that work backwards from concrete threat models. Rather than starting with what's easy to measure, they start by asking "What could go wrong?" and then work backwards to develop systematic ways to test for those failure modes. Organizations like Model Evaluation and Threat Research (METR) have developed approaches that go beyond simple benchmarking. Instead of just asking "Can this model write malicious code?", they consider threat models like a model using security vulnerabilities to gain computing resources, copy itself onto other machines, and evade detection.

That being said, evaluations are new, while benchmarks have been around longer and are also still evolving, so at times there is overlap in the way these words are used. For the purpose of this sequence of posts distilling evaluations, we will think of a benchmark as an individual measurement tool, and an evaluation as a complete safety assessment protocol which includes the use of benchmarks. Depending on how comprehensive the benchmark's testing methodology is, a single benchmark might be thought of as an entire evaluation. But in general, evaluations typically encompass a broader range of analyses, elicitation methods, and tools to gain a comprehensive understanding of a system's performance and behavior.

Appendix

Here are a few benchmarks that we thought would be interesting for you to know more about. We will talk about many more dangerous capability evaluations, propensity evaluations, and control evaluations in upcoming posts in the evaluations sequence.

FrontierMath & Humanity's Last Exam

Mathematical subject interconnections in FrontierMath. Node sizes indicate the frequency of each subject’s appearance in problems, while connections indicate when multiple mathematical subjects are combined within single problems, demonstrating the benchmark’s integration of many mathematical domains. (Glazer et al., 2024)

What makes FrontierMath different from other benchmarks? Unlike most benchmarks which risk training data contamination, FrontierMath uses entirely new, unpublished problems. Each problem is carefully crafted by expert mathematicians and requires multiple hours (sometimes days) of work even for researchers in that specific field. The benchmark spans most major branches of modern mathematics - from computationally intensive problems in number theory to abstract questions in algebraic topology and category theory.

To ensure problems are truly novel, they undergo expert review and plagiarism detection. The benchmark also enforces strict "guessproofness" - problems must be designed so there's less than a 1% chance of guessing the correct answer without doing the mathematical work. This means problems often have large, non-obvious numerical answers that can only be found through proper mathematical reasoning.

These are extremely challenging... I think they will resist AIs for several years at least.

Terence Tao - Fields Medalist (2006) (EpochAI, 2024)

 

Getting even one question right would be well beyond what we can do now, let alone saturating them.

Timothy Gowers - Fields Medalist (1998) (EpochAI, 2024)

Here's a concrete example - one FrontierMath problem asks solvers to examine sequences defined by recurrence relations, then find the smallest prime number (with specific modular properties) that allows the sequence to extend continuously in a p-adic field. This requires deep understanding of multiple mathematical concepts and can't be solved through simple pattern matching or memorization.

One sample problem from the FrontierMath benchmark. (Besiroglu et al., 2024)

The benchmark provides an experimental environment where models can write and test code to explore mathematical ideas, similar to how human mathematicians work. While problems must have automatically verifiable answers (either numerical or programmatically expressible mathematical objects), they still require sophisticated mathematical reasoning to solve.

Just to showcase the rapid pace of advancement even on a benchmark that Fields Medal-winning mathematicians consider extremely challenging: at the time of FrontierMath's announcement, state-of-the-art models could solve less than 2% of its problems. (Glazer et al., 2024) Just a couple of months later, OpenAI announced the o3 model, which shot performance up to 25.2%. This highlights yet again the breakneck pace of progress and the continuous saturation of every benchmark we are able to develop.

Performance of leading language models on FrontierMath. All models show consistently poor performance, with even the best models (as of Nov 2024) solving less than 2% of problems. (Besiroglu et al., 2024)

To keep up with the pace, researchers are developing what is described as "Humanity's Last Exam" - an effort to build the world's most difficult public AI benchmark by gathering questions from experts across all fields. (Hendrycks & Wang, 2024)

Weapons of Mass Destruction Proxy (WMDP) benchmark

Measure and mitigate hazards in the red category by evaluating and removing knowledge from the yellow category, while retaining as much knowledge as possible in the green category. WMDP consists of knowledge in the yellow category. (Li et al., 2024)

The Weapons of Mass Destruction Proxy (WMDP) benchmark represents a systematic attempt to evaluate potentially dangerous AI capabilities across biosecurity, cybersecurity, and chemical domains. The benchmark contains 3,668 multiple choice questions designed to measure knowledge that could enable malicious use, while carefully avoiding the inclusion of truly sensitive information.

What makes WMDP different from other benchmarks? Rather than directly testing how to create bioweapons or conduct cyberattacks, WMDP focuses on measuring precursor knowledge - information that could enable malicious activities but isn't itself classified or export-controlled. For example, instead of asking about specific pathogen engineering techniques, questions might focus on general viral genetics concepts that could be misused. The authors worked with domain experts and legal counsel to ensure the benchmark complies with export control requirements while still providing meaningful measurement of concerning capabilities.

How does WMDP help evaluate AI safety? Current state-of-the-art models achieve over 70% accuracy on WMDP questions, suggesting they already possess significant knowledge that could potentially enable harmful misuse. This type of systematic measurement helps identify what kinds of knowledge we might want to remove or restrict access to in future AI systems. The benchmark serves dual purposes - both measuring current capabilities and providing a testbed for evaluating safety interventions like unlearning techniques that aim to remove dangerous knowledge from models.

 

  1. ^

    This is true to a large extent, but as always there is not 100% standardization. We can make meaningful comparisons, but they should not be trusted completely without many more details.
