A Taxonomy Of AI System Evaluations

post by Maxime Riché (maxime-riche), JaimeRV (jaime-raldua-veuthey), Harrison G (Hubarruby), Edoardo Pona (edoardo-pona) · 2024-08-19

Contents

  Introduction
      TLDR
      Introduction to evaluations
      Motivation
      Audience
      The goal of the taxonomy
  Overview of the taxonomy
    High-level view
    Explaining the groups of dimensions we use
        (A) What is evaluated
        (B) Evaluation
        (C) Using evaluation outputs
  Taxonomy of AI System Evaluations
    (A) What is being evaluated 
      (a) The AI system evaluated
      (b) The elicitation process used
    (B) Evaluation
      (c) Generation of evaluee-independent information
      (d) Producing evaluee-dependent information
      (e) Analysis
    (C) Using evaluation outputs
      (f) Understanding the evaluation and the results
      (g) Usability of the evaluation
      (h) Properties of the evaluation results
  Taxonomy of properties evaluated:
  Comparing our taxonomy with existing taxonomies
      A Survey on Evaluation of Large Language Models 
        They completely cover: 
        They partially cover:
        They cover something we don’t cover:
      Evaluating Large Language Models: A Comprehensive Survey
        They partially cover:
        They cover something we don’t cover:
      When can we trust model evaluations?
        They completely cover:
        They partially cover:
        Contributions

Warning: This post was written at the start of 2024 as part of the AISC project "Evaluating Alignment Evaluations". We are not especially satisfied with the quality reached, but since we are not planning to work on it anymore, we are releasing it as a Work-In-Progress document.

Introduction

TLDR

Introduction to evaluations

For an introduction to model evaluations, see A starter guide for evals — LessWrong [LW · GW]. For clarification about what evaluations are relative to characterizations, see also What is the difference between Evaluation, Characterization, Experiments, and Observations [LW(p) · GW(p)]. 

Motivation

Clarifying and understanding the characteristics of AI system evaluations is important to better use and design them.

Audience

The taxonomy is written for people who already know the basics of model evaluations and AI systems; for example, you have already read A starter guide for evals [LW · GW] and have worked with or analyzed evaluations for a few days or more. If you have less background than this, expect not to understand all of the content.

The goal of the taxonomy

Our taxonomy focuses on features of AI system evaluations.

It does NOT try to include the features relevant for the following evaluations:

This taxonomy does NOT focus on reviewing the characteristics of popular evaluations, nor on reviewing ONLY which properties are evaluated (e.g., capabilities, propensities). Instead, it focuses on providing an overview of the many dimensions (features) along which AI system evaluations can vary, even if most existing evaluations don’t vary much along those dimensions.

Thus, the taxonomy we describe should be valuable for people wanting to understand evaluations, create new ones, or meta-evaluate existing evaluations (meta-evaluate means: evaluate the properties of evaluations, e.g., accuracy and generalization of evaluation results).

Our taxonomy may be less valuable for people seeking a representative overview of the field of evaluations as it exists today. If you are looking for taxonomies that (mostly) focus on which object-level behaviors are evaluated, see works such as A Survey on Evaluation of Large Language Models and Evaluating Large Language Models: A Comprehensive Survey.

Overview of the taxonomy

This section lists the dimensions included in the taxonomy. To the best of our knowledge, we include references to the work that inspired each dimension where such work exists; we may not have remembered or be aware of all previous work.

High-level view

Here is the high-level view of the dimensions included in the taxonomy:

Explaining the groups of dimensions we use

We can summarize the components of AI system evaluations in a schema. 

For clarification on the distinction between evaluations and characterizations, see What is the difference between Evaluation, Characterization, Experiments, Observations [LW(p) · GW(p)].

(A) What is evaluated

Differentiating between "(a) Dimensions about the AI system evaluated" and "(b) Dimensions about the elicitation process used" makes sense because the evaluation involves not only the AI system itself but also its initial state and any modifications applied during the elicitation process. The elicitation process can include both static modifications, like specific prompts, and algorithmic augmentations, such as API calls or decoding strategies, which significantly impact the system's behavior. By clearly distinguishing these dimensions, we make it possible to correct for their impact when comparing AI systems.

We thus decompose the section “(A) What is evaluated” into:

(a) Dimensions about the AI system evaluated 

(b) Dimensions about the elicitation process used

For reference, see METR’s content about the elicitation process they use for their dangerous capability evaluations: Example Protocol | METR’s Autonomy Evaluation Resources, Guidelines for capability elicitation, and Measuring the impact of post-training enhancements.

(B) Evaluation

We decompose the rest of the “(B) Evaluation” process into the following stages, which may not always occur in sequence:

(c) Generation of evaluee-independent information. E.g.:

(d) Producing evaluee-dependent information. E.g.:

(e) Analysis. E.g.:

(C) Using evaluation outputs

The analysis produces characterization results that must then be understood and can be studied statistically. Thus, we decompose “(C) Using evaluation outputs” into:

(f) Understanding the evaluation and the results

(g) Usability of the evaluation

(h) Properties of the evaluation results

Taxonomy of AI System Evaluations

(A) What is being evaluated 

(a) The AI system evaluated

  1. What property is evaluated?
    • Properties evaluated belong to different classes: Behavioral (e.g., Propensities (ref), Capabilities), Beliefs (e.g., Latent Knowledge, World Model), and others. See our sub-taxonomy [LW · GW] for a hierarchical view of these properties and more. Other existing sub-taxonomies for this dimension are ref, ref.
  2. Which kind of AI system is evaluated? 
    • By kind, we mean the architecture of the AI system, including the humans involved.
    • E.g., a model only (i.e., model evaluations), scaffolded model(s) (ref [LW · GW]), scaffolded model(s) with affordances (ref; e.g., METR’s evals), scaffolded model(s) used by humans, or a multi-agent system of models. (A toy sketch contrasting a bare model with a scaffolded model follows.)
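
To make the distinction concrete, here is a minimal, illustrative sketch contrasting a bare model evaluation with a scaffolded evaluation that adds a simple tool affordance. The names used (`query_model`, the `CALC(...)` convention) are hypothetical placeholders, not any particular framework's API.

```python
# Illustrative sketch only: the same underlying model evaluated as a bare model
# versus wrapped in a scaffold with a calculator affordance. `query_model` and the
# CALC(...) convention are hypothetical placeholders.

import re

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real inference API."""
    return "I need to compute 17 * 23. CALC(17 * 23)"

def evaluate_bare_model(prompt: str) -> str:
    # "Model only": a single call, no tools, no retries.
    return query_model(prompt)

def evaluate_scaffolded_model(prompt: str, max_steps: int = 3) -> str:
    # "Scaffolded model with affordances": the scaffold parses tool calls,
    # executes them, and feeds the results back before the next model call.
    transcript = prompt
    for _ in range(max_steps):
        output = query_model(transcript)
        match = re.search(r"CALC\(([^)]+)\)", output)
        if match is None:
            return output  # final answer, no further tool use
        result = eval(match.group(1), {"__builtins__": {}})  # toy calculator affordance
        transcript += f"\n{output}\nCALC RESULT: {result}"
    return transcript  # these two setups evaluate different "AI systems"
```

Even with identical model weights, these two setups evaluate different AI systems in the sense of this dimension.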

(b) The elicitation process used

  1. Which elicitation technique is the evaluation using? 
    • E.g., finetuning, prompting, activation steering (ref, ref), etc.
      1. Examples of fine-tunings (ref [LW · GW]): no fine-tuning, RL fine-tuning, I.I.D. fine-tuning 
      2. Examples of prompting techniques: few-shot prompting, prompt optimization, jailbreaking (ref), CoT prompting (ref)
      3. Other: best-of-N sampling (ref)
  2. Under which elicitation mode is the model evaluated?
    • Common modes: 
      1. Red teaming (worst case search, ref)
      2. Best case (often used for capability evaluations)
      3. Neutral case (e.g., under minimum elicitation pressure)
    • Exotic modes (ryan_greenblatt's LW comment [LW(p) · GW(p)]): 
      1. Future case (future best and worst cases in N years). This mode can’t be used but can be studied.
      2. Self-elicitation case (when the model elicits its own capabilities)
  3. How much elicitation/optimization pressure is applied? 
    • The elicitation process can be applied with a range of intensities and with varying effectiveness. As others have before, we use the term “elicitation pressure” to designate a hypothetical metric on which all elicitation processes could be compared (ref [LW · GW]).
    • E.g., FLOPs used for I.I.D. finetuning, FLOPs and the algorithm used for prompt optimization, minimum description length of the elicitation, or an estimate in nats (ref [AF · GW]).
    • As already described by others, a possible improvement to current evaluations would be to have their elicitation process and elicitation pressure characterized by a standard (meta-)evaluation. Ideally, this characterization of the elicitation process would allow evaluation results to be normalized, compared, and extrapolated to other elicitation processes. This does not exist yet and may be intractable. (We need a Science of Evals — LessWrong [LW · GW], Measuring the impact of post-training enhancements | METR’s Autonomy Evaluation Resources) (A toy sketch of one such quantity follows this list.)
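
As a toy illustration of the kind of quantity an "elicitation pressure" metric might report, the sketch below treats best-of-N sampling as applying roughly log2(N) bits of selection pressure. This is only one narrow operationalization under strong assumptions; `generate` and `score` are hypothetical stand-ins for a model API and a task-specific scorer.

```python
# Toy operationalization (not a standard metric): best-of-N sampling applies
# roughly log2(N) bits of selection pressure. `generate` and `score` are
# hypothetical stand-ins for a model API and a task-specific scorer.

import math
import random

def generate(prompt: str) -> str:
    """Hypothetical sampling call; replace with a real model API."""
    return f"candidate-{random.random():.3f}"

def score(completion: str) -> float:
    """Hypothetical scorer (e.g., a reward model or unit tests)."""
    return random.random()

def best_of_n(prompt: str, n: int) -> tuple[str, float]:
    """Apply best-of-N elicitation and report the selection pressure in bits."""
    candidates = [generate(prompt) for _ in range(n)]
    best = max(candidates, key=score)
    pressure_bits = math.log2(n)  # picking 1 of n candidates ~ log2(n) bits
    return best, pressure_bits

if __name__ == "__main__":
    _, low = best_of_n("Solve the task.", n=4)     # ~2 bits of selection pressure
    _, high = best_of_n("Solve the task.", n=256)  # ~8 bits of selection pressure
    print(f"low-pressure elicitation: {low:.1f} bits; high-pressure: {high:.1f} bits")
```

FLOPs- or description-length-based quantities would be needed to put finetuning-based elicitation on the same scale as this sampling-based example.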

(B) Evaluation

(c) Generation of evaluee-independent information

Evaluations often use information independent of the AI system evaluated (e.g. ground truth, logical constraints, inputs). The generation of this information is not causally connected to the AI system evaluated. This information can be fed to the AI system or used during the analysis of its outputs. 

  1. How is the evaluee-independent information generated? 
    • E.g.:
      1. Human-generated (e.g., human-written prompts for a benchmark)
      2. Heuristics or Science (e.g., n-grams for computing the BLEU score)
      3. Machine-generated (e.g., sensors producing images used as inputs)
      4. AI-generated (e.g., another AI system generating the evaluation benchmark)
  2. What is the source of truth used?
    • If ground truths are used, what produces them? Ground truths are typically evaluee-independent information in behavioral evaluations, but they can also be evaluee-dependent information (e.g., in honesty evaluations).
    • If ground truths are not used, what is the source of truth used to perform the analysis (e.g. requirement of logical consistency)?
    • E.g. (same categories as in (6) above):
      1. Human-generated (e.g., harmfulness scores of outcomes in text games)
      2. Heuristics or Science (e.g., the solution of a math problem)
      3. Machine-generated (e.g., protein crystallography)
      4. AI-generated (e.g., using a finetuned or MCTS-amplified LLM to produce solutions for a benchmark)
      5. No ground truth (e.g., consistency, or robustness evaluations)
  3. When is the ground truth generated?
    • (Only when a ground truth is used)
    • Pre-generated: generated before the start of the evaluation (e.g., labels for the answers, dataset used to train a reward model)
    • During the evaluation: the ground truth is generated as part of the evaluation process (e.g., this is the case for functional evaluations, which produce the ground truth at the start of each evaluation run (ref), making them more robust to data leaks; see the sketch after this list)
    • After the evaluation: the ground truth is generated after the evaluated system’s outputs have been collected (e.g., evaluation of open-ended answers, evaluation of the history of interactions, evaluation of summaries of games)
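
Here is a minimal sketch of the functional-evaluation case mentioned above: both the prompt and its ground truth are generated at evaluation time, so a leaked static copy of the benchmark matters less. `query_model` is a hypothetical stand-in for the system being evaluated.

```python
# Illustrative sketch of a functional evaluation: prompt and ground truth are
# generated fresh at evaluation time instead of being stored in a static benchmark.
# `query_model` is a hypothetical stand-in for the evaluated system.

import random

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real inference API."""
    return "42"

def make_instance(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    prompt = f"What is {a} + {b}? Answer with a number only."
    return prompt, a + b  # ground truth produced during the evaluation

def run_functional_eval(n_instances: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_instances):
        prompt, truth = make_instance(rng)
        correct += query_model(prompt).strip() == str(truth)
    return correct / n_instances

if __name__ == "__main__":
    print(f"accuracy: {run_functional_eval():.2%}")
```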

(d) Producing evaluee-dependent information

The evaluation process requires gathering information about the AI system evaluated. This information can later be analyzed (e.g., comparing inferences from the model to ground truths).

  1. What is the data format and the environment used for evaluation? E.g.:
    • Data format: Task solving, QA, Role-play games, No prompt
    • Environment: Text only, Text and affordances, Sandbox VM, Real-world
    • Modalities: Unimodal, Multimodal
  2. Which information about the model is used? E.g.:
    • When producing observations about the AI system, the level of access to the internals of the system can be classified into:
      1. No box (prediction from a description of the system), Black box, Grey box, White box (full access to the internals) (ref)
    • The observations can be produced using information from different timesteps during the life of the AI system: 
      1. Information from training time only, from evaluation time (the case for most current evaluations), from deployment time (e.g., reviewing interactions with users), or from the full lifecycle (e.g., the aggregation of all this information can be fed to the evaluator).
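
A minimal sketch of how the level of access changes what the evaluation can observe, assuming hypothetical `BlackBoxEvaluee` / `WhiteBoxEvaluee` interfaces: a black-box check can only inspect sampled outputs, while a white-box check can also probe internals. The probe and threshold are illustrative placeholders.

```python
# Illustrative sketch: the same question asked under different access levels.
# The evaluee interfaces, the probe, and the threshold are hypothetical.

from typing import Protocol

class BlackBoxEvaluee(Protocol):
    def generate(self, prompt: str) -> str: ...              # outputs only

class WhiteBoxEvaluee(BlackBoxEvaluee, Protocol):
    def activations(self, prompt: str) -> list[float]: ...   # access to internals

def black_box_check(model: BlackBoxEvaluee, prompt: str, banned: str) -> bool:
    # Black box: we can only inspect sampled outputs.
    return banned not in model.generate(prompt)

def white_box_check(model: WhiteBoxEvaluee, prompt: str, probe: list[float]) -> bool:
    # White box: we can additionally probe internal representations, e.g., with a
    # linear probe trained to detect some latent property.
    acts = model.activations(prompt)
    probe_score = sum(a * w for a, w in zip(acts, probe))
    return probe_score < 0.0  # hypothetical decision threshold on the probe output
```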

(e) Analysis

At this point in the evaluation, we have extracted information about the AI system (and may have gathered the complementary information needed for the analysis). We now need to make sense of this information; this is what the analysis does.

  1. Which kind of characterization method is used?
    • Quantitative
      1. Binary/Classification problem: You have a data point as input; the analysis consists of classifying it; the process of classifying the input contains most of the value of the analysis. You can later aggregate those results, e.g., you can count occurrences of each class. Most evaluation benchmarks use this characterization method.
      2. Numerical/Regression problem: You have a data point as input; the analysis consists of assigning a numerical value to it (e.g., scoring or measuring an intensity). You can later aggregate those results. E.g.: Loss functions, Reward Models, BLEU score.
    • Qualitative: 
      1. E.g.: Outputting a textual analysis of the observation (usually done by humans, but it could be done by AI systems; in that case, the evaluation is a model-evaluated evaluation).
      2. This dimension may directly affect how easy it is for the AI system being evaluated to manipulate the evaluation results (e.g., informal interviews by humans are more biased than structured interviews).
  2. Where is the analysis done? 
    • This dimension may directly affect how easy it is for the AI system being evaluated to manipulate the evaluation results.
    • Localizations:
      1. Inside the model evaluated: The analysis is done directly inside the model evaluated, without having information flow out of the model before the analysis is done. E.g.: Self-reported preferences are such an evaluation. Most of the analysis is produced by the AI system analyzed.
      2. Outside the model evaluated: The analysis is done outside of the AI system evaluated. This can be inside another AI system. Information about the AI system evaluated is here first extracted and then sent to another (AI, human or algorithmic) system to be evaluated.
      3. “Looped in”: The analysis is done inside the model evaluated, but the information about the model evaluated is first extracted from the model and then fed again to it. E.g., GPT-4 evaluating the harmfulness of texts generated by GPT-4.
  3. Who is doing the analysis? 
    • E.g., Human judgment, Programmatic, Model-evaluated
  4. Does the analysis produce absolute results?
    • Absolute: Regardless of how other AI systems score on the eval, the eval provides useful standalone information.
    • Relative: The results only provide information when they are compared with other evaluation results (e.g., a consistency evaluation can use evaluation results from the same model but under different elicitations, to detect a change in behavior). The results are not standalone.
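
A minimal sketch of the two quantitative characterization methods above (per-item classification aggregated into accuracy, and per-item numeric scoring averaged into a score), plus a toy relative analysis. The inputs are hypothetical evaluation records, and the token-overlap scorer is a crude placeholder rather than a real metric implementation.

```python
# Illustrative sketch of the quantitative characterization methods listed above.
# The inputs are hypothetical evaluation records; the overlap scorer is a crude toy.

def classification_characterization(outputs: list[str], labels: list[str]) -> float:
    """Per-item classification (correct / incorrect), aggregated into accuracy."""
    verdicts = [o.strip().lower() == l.strip().lower() for o, l in zip(outputs, labels)]
    return sum(verdicts) / len(verdicts)

def regression_characterization(outputs: list[str], references: list[str]) -> float:
    """Per-item numeric score (here a crude token-overlap proxy), then averaged."""
    scores = []
    for out, ref in zip(outputs, references):
        out_tokens, ref_tokens = set(out.split()), set(ref.split())
        scores.append(len(out_tokens & ref_tokens) / max(len(ref_tokens), 1))
    return sum(scores) / len(scores)

def relative_consistency(score_a: float, score_b: float) -> float:
    # A relative analysis: compare two characterizations of the same model under
    # different elicitations instead of reporting a standalone number.
    return abs(score_a - score_b)
```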

(C) Using evaluation outputs

(f) Understanding the evaluation and the results

  1. What is the motivation for this evaluation?
    • E.g., producing feedback for capability research, for safety research, for governance, to be used in RSP, to be used to train models…
  2. What are the normative evaluation criteria?
    • Evaluations use normative criteria to analyze the characterization results. E.g.:
      1. Directional: “Higher honesty is better”, “Higher success rate is better”, “Corrigible is better”
      2. Threshold: “Honesty above 99%”, “Accuracy above human level”, “No instance of deceptive alignment found”.
    • If there are no normative criteria, then the “evaluation” would be better called a “characterization” (see What is the difference between Evaluation, Characterization, Experiments, Observations [LW(p) · GW(p)]). In practice, the normative criteria are not always explicit. (A toy sketch of applying such criteria follows this list.)
  3. What metrics does the evaluation use? (ref)
    • Notice that metrics may take into account the elicitation process, e.g., to normalize the results by the strength of the elicitation if feasible.
    • E.g., accuracy, calibration, fairness, robustness, frequency of behavior, consistency
  4. What is the coverage of the evaluation? 
    • By coverage, we mean the fraction of the input distribution during deployment on which the evaluation results provide meaningful information. 
      1. A metric quantifying evaluation coverage may be valuable. However, in practice, the input distribution during deployment is dependent on other adversarial agents, which may render such a coverage metric very unstable or simply useless.
    • E.g., Unimodal, Multimodal, Large, Narrow, Undefined
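
A minimal sketch of applying explicit normative criteria (threshold and directional) to characterization results. The property names and threshold values are hypothetical.

```python
# Illustrative sketch: turning characterization results into an evaluation verdict
# via explicit normative criteria. Property names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class Criterion:
    metric: str
    kind: str                  # "threshold" or "directional"
    threshold: float = 0.0     # only used by threshold criteria
    higher_is_better: bool = True

def apply_criteria(results: dict, criteria: list) -> dict:
    verdicts = {}
    for c in criteria:
        value = results[c.metric]
        if c.kind == "threshold":
            # Threshold criterion, e.g., "honesty above 99%".
            passed = value >= c.threshold if c.higher_is_better else value <= c.threshold
            verdicts[c.metric] = passed
        else:
            # Directional criterion: no pass/fail, just a quantity to be maximized.
            verdicts[c.metric] = value if c.higher_is_better else -value
    return verdicts

if __name__ == "__main__":
    characterization = {"honesty": 0.993, "task_success": 0.71}  # hypothetical results
    criteria = [
        Criterion("honesty", "threshold", threshold=0.99),
        Criterion("task_success", "directional"),
    ]
    print(apply_criteria(characterization, criteria))
```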

(g) Usability of the evaluation

  1. How automated is the evaluation? (ref)
    • Amount of automation:
      1. Automated: no human intervention is needed.
      2. Manual: important steps always need human interventions.
      3. Partially automated: e.g., difficult cases are sent to human evaluators (see the routing sketch after this list).
    • Archetypes of automation:
      1. Model-generated: E.g., models are used to generate QAs.
      2. Model-evaluated: E.g., models are used to evaluate model outputs.
      3. Program-generated: E.g., the model inputs are extracted from existing databases or by applying transformations on some existing data (e.g., ‘cities’ benchmark)
      4. Program-evaluated: E.g., the scoring function is a program (excluding DL) (e.g., BLEU score)
  2. Which are the failure modes that could impact the evaluation when evaluating a deceptive model? (When can we trust model evaluations? — LessWrong [LW · GW])
  3. Is the evaluation used for or usable for training?
    • Used for training (e.g., Reward Models, loss functions)
    • Usable (e.g., training on it is possible but not competitive in practice)
    • Protected (e.g., some of METR’s evaluation tasks are kept private)
    • Non-Usable (e.g., non-differentiable metrics when using supervised learning)
  4. Is the evaluation sensitive to data leaks? (ref)
    • Sensitive to data leaks
    • Non-Sensitive: Even if the evaluation leaks and the AI system evaluated is trained on it, the results of the evaluation should not change. In practice, evaluations have different degrees of sensitivity to data leaks. Functional evaluations, i.e., evaluations that generate new evaluation data at each use, may be less sensitive to data leaks than static, publicly available benchmarks.
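
A minimal sketch of the partially automated archetype mentioned above: an automated judge scores confident cases and routes the rest to human evaluators. `auto_judge` and the confidence threshold are hypothetical placeholders.

```python
# Illustrative sketch: a partially automated evaluation in which an automated judge
# handles confident cases and routes uncertain ones to humans.
# `auto_judge` and the confidence threshold are hypothetical.

def auto_judge(output: str) -> tuple[bool, float]:
    """Hypothetical automated judge returning (verdict, confidence)."""
    harmful = "harmful" in output.lower()
    confidence = 0.95 if harmful else 0.55
    return harmful, confidence

def partially_automated_eval(outputs: list[str], min_confidence: float = 0.8):
    auto_verdicts: dict[int, bool] = {}
    human_queue: list[int] = []
    for i, out in enumerate(outputs):
        verdict, confidence = auto_judge(out)
        if confidence >= min_confidence:
            auto_verdicts[i] = verdict          # automated path
        else:
            human_queue.append(i)               # deferred to human evaluators
    return auto_verdicts, human_queue

if __name__ == "__main__":
    auto, queue = partially_automated_eval(["benign reply", "harmful instructions"])
    print(f"auto-scored: {auto}, sent to humans: {queue}")
```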

(h) Properties of the evaluation results

The properties below can be influenced both by the evaluation process and by the AI system evaluated. As such, it is difficult to conclude whether an evaluation produces accurate results without specifying the AI system evaluated and the elicitation process used. But evaluations in themselves can be meta-evaluated to measure how they behave, and they could and should be meta-evaluated using standard methods that would let us compare their properties. (A toy sketch of such a meta-evaluation follows the list below.)

  1. What are the characteristics of the evaluation results? E.g.:
    • Accuracy: Are the characterization results correctly describing reality?
    • Systematic errors: Are the characterization results biased in some way?
    • Precision: Does evaluating several times produce the same evaluation results? 
    • Robustness to the evaluation process: Are the characterization results sensitive to the evaluation input space? (excluding the model input space)
    • Generalization: How far do the evaluation results inform us about the model's behavior over its input space?
    • Resolution: How fine-grained are the evaluation results produced?
    • Sensitivity: What is the minimum change in the AI system’s property (e.g., a capability or propensity) that the evaluation is able to detect?
  2. What are the scaling trends of the characteristics of the results? E.g.,
    • Does the evaluation become more accurate when the models get more capable?
      1. Upward/Neutral/Downward/U-shaped/… accuracy trend when scaling 
    • etc.
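
A minimal sketch of meta-evaluating one of these properties, precision, by re-running the same evaluation several times and measuring the spread of its scores. `run_eval` is a hypothetical stand-in for a full evaluation pipeline; a real meta-evaluation would also vary seeds, prompts, and elicitation settings.

```python
# Illustrative sketch: meta-evaluating one property of evaluation results, precision,
# by repeating the same evaluation and measuring the spread of its scores.
# `run_eval` is a hypothetical stand-in for a full evaluation pipeline.

import random
import statistics

def run_eval(model_id: str, seed: int) -> float:
    """Hypothetical evaluation run; replace with a real pipeline."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)  # toy: true score 0.80 plus run-to-run noise

def meta_evaluate_precision(model_id: str, n_runs: int = 10) -> dict[str, float]:
    scores = [run_eval(model_id, seed) for seed in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # lower spread = more precise evaluation
    }

if __name__ == "__main__":
    print(meta_evaluate_precision("hypothetical-model"))
```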

Taxonomy of properties evaluated:

We give below a partial taxonomy of properties that can or are evaluated in AI systems. This taxonomy provides detail about dimension 1 of the taxonomy of AI system evaluations.

Comparing our taxonomy with existing taxonomies

For several of the existing taxonomies that we are aware of, we list below which of our dimensions they cover.

We will only include existing taxonomies that include several dimensions. Thus, we exclude from this section taxonomies that only include one dimension. 

For references about who created or inspired dimensions in our taxonomy, you can directly look at the references we provide for some dimensions.

A Survey on Evaluation of Large Language Models 

They completely cover: 

(Dim 17) What metrics does the evaluation use?

Their dimension “How to evaluate” completely covers our Dim 17.

They partially cover:

(Dim 1) What property is evaluated?

Their dimension “What to evaluate” is a taxonomy matching the lowest level of the hierarchical sub-taxonomy that constitutes our Dim 1 (see the section “Dimension 1: What property is evaluated?” for the details of our hierarchical sub-taxonomy). Notice that we do not describe the content of this lowest level.

They cover something we don’t cover:

Their dimension “Where to evaluate” covers how evaluations are assembled into benchmarks. We don’t include this dimension since it doesn’t seem valuable for producing new insights about evaluations.

Evaluating Large Language Models: A Comprehensive Survey

They partially cover:

(Dim 1) What property is evaluated?

Their dimensions “Knowledge and Capability Evaluation”, “Alignment Evaluation”, “Safety Evaluation”, and “Specialized LLMs Evaluation” are taxonomies matching the lowest level of the hierarchical sub-taxonomy that constitutes our Dim 1 (see the section “Dimension 1: What property is evaluated?” for details on our hierarchical sub-taxonomy). Notice that we do not describe the content of this lowest level.

They cover something we don’t cover:

Their dimension “Evaluation Organization” covers how evaluations are assembled into benchmarks. We don’t include this dimension since it doesn’t seem valuable for producing new insights about evaluations.

When can we trust model evaluations? [LW · GW]

They completely cover:

(Dim 20) Failure mode when evaluating a deceptive model

They partially cover:

(Dim 3) Which elicitation technique is the evaluation using? 

 

 

Contributions

Main contributors: Maxime Riche, Jaime Raldua Veuthey

Comments, discussions, and feedback: Edoardo Pona, Harrison Gietz
