Thinking About Propensity Evaluations

post by Maxime Riché (maxime-riche), Harrison G (Hubarruby), JaimeRV (jaime-raldua-veuthey), Edoardo Pona (edoardo-pona) · 2024-08-19T09:23:55.091Z · LW · GW · 0 comments

Contents

  Introduction 
  Takeaways
  Motivation
    Propensity evaluations are important
      Propensity evaluations are a tool for AI safety research
      Propensity evaluations are a tool for evaluating risks
      Propensity evaluations may be required for governance after reaching dangerous capability levels
      We should not wait for dangerous capabilities to appear
      Propensity evaluations measure what we care about: alignment
      Propensity evaluations will be used for safetywashing; we need to study and understand them
    Propensity evaluations are not well-understood
      Propensity evaluations are new
      The nomenclature is still being worked out, and confusion persists
      Propensity evaluations are already used
        GPT-4 and GPT-4V
        Gemini
        Claude 3
        Summary
      People are misrepresenting or misunderstanding existing evaluations
  What are propensity evaluations?
    Defining propensity evaluations
      Existing definition
      Contextualization
      Our definition and clarifications
        Definition
        We are going to illustrate propensity evaluations as being comparisons of priorities between clusters of behaviors. 
        In this example, propensity evaluations evaluate how an AI system prioritizes between several clusters of behaviors constructed using some features that are not related to capabilities. 
        How are “clusters” operationalized?
        How is “prioritization” operationalized?
        Illustration
        Taxonomy of behavioral evaluations
        Explanations on the taxonomy
      Related definitions
    How do propensity evals differ from capability evals?
      Categorical differences:
        What are the implications of this categorical difference?
      Soft differences:  
    Which propensities are currently evaluated?
        Propensities with high capability-dependence
        Propensities with medium capability-dependence
        Propensities with low capability-dependence
    What are the metrics used by propensity evaluations?
    Can we design metrics that would be more relevant for propensity evaluations?
  A few approaches to propensity evaluation
    Black Box Propensity Evaluations 
    White Box Propensity Evaluations
        Representation Engineering
        Eliciting Latent Knowledge 
        Potential for propensity evaluations
        Causal Abstraction and Representation Learning 
        Issues
    No-box Propensity Evaluations
    Propensity Evaluations of Humans
  Towards more rigorous propensity evaluations
      Contributions

Warning: This post was written at the start of 2024 as part of the AISC project "Evaluating Alignment Evaluations". We are not especially satisfied with the quality reached, but since we are not planning to work on it anymore, we are releasing it as a Work-In-Progress document.

Introduction 

In this post, we cover:

Propensity evaluations (What's up with "Responsible Scaling Policies"? — LessWrong [LW · GW], A Causal Framework for AI Regulation and Auditing — Apollo Research [LW · GW], We need a Science of Evals — Apollo Research [LW · GW]) evaluate which kind of behavior a model prioritizes over other kinds. Groups of behaviors need to be specified for each propensity evaluation. A common denominator of all propensity evaluations is that the clusters of behaviors being compared are not created using only capability-related features; otherwise, the evaluations would be called capability evaluations, not propensity evaluations. Propensity evaluations are characterized by the use of at least some non-capability-related features to create the clusters of behaviors that are studied.

Propensity evaluations include evaluations of alignment in actions, and those will likely be needed when dangerous capability levels are reached, which could be as soon as this year. For more detail, see the section: Motivation [LW(p) · GW(p)].

We will mostly focus on a special case of propensity evaluations: propensity evaluations with low capability-dependence. These can be visualized as evaluations of priorities between clusters of behaviors that are constructed using almost no capability-related features. For more detail, see the section: Defining propensity evaluations [LW · GW].

Takeaways

Motivation

We explain here why it is valuable to clarify what propensity evaluations are. The TLDR is simple: Propensity evaluations are both important and not well-understood.

Propensity evaluations are important

Propensity evaluations are a tool for AI safety research

AI system evaluations (including propensity evaluation) are important for understanding how AI systems behave and getting feedback on research and techniques. 

References: A starter guide for evals - LW [LW · GW], Tips for Empirical Alignment Research — LessWrong [LW · GW]

Propensity evaluations are a tool for evaluating risks

Propensity evaluations can be used to assess whether models are likely to engage in behaviors that could cause catastrophic outcomes. This is useful when models will have the capability to cause such outcomes.

Propensity evaluations are a tool to reduce our uncertainty about how an AI system behaves. As models’ capabilities increase, we will likely need to shift from focusing predominantly on evaluating capabilities to also include evaluating their propensities. 

References: When can we trust model evaluations? - LW [LW · GW], A starter guide for evals - LW [LW · GW]

Propensity evaluations may be required for governance after reaching dangerous capability levels

Information about the propensities of AI systems will be necessary for governance. For example, if a model is capable of dangerous behavior, then the developer might be required to demonstrate via propensity evaluations that the propensity of its AI system to act dangerously is below a threshold.

Counter-argument: Alternatively, the AI developer may have to demonstrate that dangerous capabilities are removed or locked from access (refref). If this path is preferred by policymakers, then propensity evaluations would be less useful.

We should not wait for dangerous capabilities to appear

There is a race to increase AI capabilities, and this race likely won’t stop at dangerous levels of capabilities. We should not wait to reach a dangerous level of capability before developing accurate and informative propensity evaluations. 

Propensity evaluations measure what we care about: alignment

To be able to align an AI system, we must be able to evaluate the alignment of the system. Propensity evaluations are one of the tools we have to measure the alignment (in actions) of AI systems.

Propensity evaluations will be used for safetywashing; we need to study and understand them

AI labs are already using propensity evaluations, and will continue to use them, to claim their systems achieve some safety level. We should study these evaluations, look for their weaknesses, specify what they do and do not evaluate, and make them as powerful as possible.

Propensity evaluations are not well-understood

Propensity evaluations are new

The term “propensity evaluations” was introduced only recently (~ 8 months).

Note: Propensity evaluations have existed for a while as a tool in marketing (refref (2021)).

The term “alignment evaluations”, meanwhile, has been around for longer (~ 28 months).

There is still confusion about what propensity evaluations are and how they differ from alignment evaluations. 

The nomenclature is still being worked out, and confusion persists

Researchers use different terms to talk about similar or confusingly close concepts: 

  1. Alignment evaluations (ref [LW · GW], ref [LW · GW])
  2. Propensity evaluations
  3. Goal elicitation/evaluation
  4. Preference evaluations
  5. Disposition evaluations

Examples of confusion:

Propensity evaluations are already used

Current state-of-the-art models are already evaluated using both capability evaluations and propensity evaluations.

GPT-4 and GPT-4V

Harmfulness, hallucination, and biases are evaluated (including bias in the quality of the service given to different demographics) and reported in GPT-4’s system card (see sections 2.2 and 2.4). Evaluations of answer refusal rates are reported in GPT-4V’s system card (see section 2.2). 

Gemini

Factuality (a subclass of truthfulness), and harmlessness are reported in Gemini-1.0’s paper (see sections 5.1.2, 5.1.6, 6.4.3, and 9.2). 

Claude 3

“Honesty evaluations” using human judgment are reported. It is unclear whether truthfulness or honesty is evaluated (see this table for quick definitions of propensities), but most likely honesty is not directly evaluated: it would be surprising if the human evaluators were informed about what the model believes and were thus able to verify that the model outputs what it believes to be true. Harmlessness and answer refusal rates are also evaluated. See sections 5.4.1 and 5.5 of Claude 3’s model card.

Summary

People are misrepresenting or misunderstanding existing evaluations

As seen in the above section "Propensity evaluations are already used", these evaluations are in use, and explicit and implicit claims are made about them or mistakenly inferred by readers. A mistaken understanding is: “SOTA models have their HHH propensities evaluated. Improvement on these evaluations means progress towards being altruistic, benevolent, and honest.” 

Helpfulness and Harmlessness are both propensities with high capability-dependence: the results of such evaluations are influenced by the capabilities of the model. In current HHH evaluations, Honesty is actually measured using Truthfulness evaluations instead of Honesty evaluations (see this table for quick definitions of propensities).

Here are illustrations of the resulting issues with these evaluations. When models get more capable:

For these reasons, it is hard to use existing HHH evaluations to make strong claims about levels of honesty and altruism. These issues are also not properly highlighted in current publications, leading to misrepresentations or misunderstandings of current alignment levels, and especially of alignment trends under scaling.

What are propensity evaluations?

Defining propensity evaluations

Existing definition

We start with the definition given by Apollo Research: 

From A Causal Framework for AI Regulation and Auditing - Apollo Research: “System propensities. The tendency of a system to express one behavior over another (Figure 3). Even though systems may be capable of a wide range of behaviors, they may have a tendency to express only particular behaviors. Just because a system may be capable of a dangerous behavior, it might not be inclined to exhibit it. It is therefore important that audits assess a system’s propensities using ‘alignment evaluations’ [Shevlane et al., 2023]. – Example: Instead of responding to user requests to produce potentially harmful or discriminatory content, some language models, such as GPT-3.5, usually respond with a polite refusal to produce such content. This happens even though the system is capable of producing such content, as demonstrated when the system is ‘jailbroken’. We say that such systems have a propensity not to produce harmful or offensive content.”

Contextualization

A figure from A Causal Framework for AI Regulation and Auditing provides more context about how the propensities of an AI system relate to its capabilities, affordances, and behaviors. We don’t describe the schema. Please refer to the original document if needed.

Our definition and clarifications

Definition

If you are wondering why “normative value”, see What is the difference between Evaluation, Characterization, Experiments, and Observations? [LW(p) · GW(p)]

We are going to illustrate propensity evaluations as being comparisons of priorities between clusters of behaviors. 

Evaluations use various characterization methods (see Taxonomy Of AI System Evaluations [WIP] [LW · GW] dim 11). They can use classification methods to characterize an AI system’s behaviors, relying on clustering the observed behaviors and comparing the clusters to references. Other characterization methods include scoring methods/regressions (Reward Models, BLEU score, etc.). For simplicity and illustrative purposes, in this document we use the example of evaluations using classification methods. When propensity evaluations use classification methods to characterize AI system behaviors, the possible behaviors need to be clustered on a per-case basis using clustering criteria. The clustering criteria must include non-capability features (otherwise the evaluation is a capability evaluation). 

In this example, propensity evaluations evaluate how an AI system prioritizes between several clusters of behaviors constructed using some features that are not related to capabilities. 
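As a toy illustration of this framing (the cluster names, marker strings, and classifier below are all hypothetical, not drawn from any real evaluation), one could classify sampled behaviors into clusters and read a prioritization off the resulting frequencies:

```python
from collections import Counter

# Hypothetical classifier mapping a sampled behavior (a model output) to a
# behavior cluster defined partly by non-capability features (here: whether
# the model refuses vs. complies, rather than whether it succeeds).
def classify_behavior(output: str) -> str:
    if "can't help" in output.lower():
        return "refusal"
    if "step 1" in output.lower():
        return "compliance"
    return "other"

def propensity_profile(outputs):
    """Fraction of sampled behaviors landing in each cluster."""
    counts = Counter(classify_behavior(o) for o in outputs)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}

samples = [
    "I can't help with that request.",
    "Step 1: gather the materials...",
    "I can't help with that request.",
]
profile = propensity_profile(samples)
print(profile)  # refusal is prioritized 2:1 over compliance in this sample
```

The comparison of cluster frequencies (here, refusal vs. compliance) is the "prioritization" being evaluated; a capability evaluation would instead cluster the same outputs by success vs. failure.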

How are “clusters” operationalized?

How is “prioritization” operationalized?

Illustration

Taxonomy of behavioral evaluations

We summarize in the following taxonomy how behavioral, propensity, and capability evaluations relate to each other: (Taken from A Taxonomy Of AI System Evaluations [WIP] [LW · GW])

Explanations on the taxonomy

Goal evaluations, Evaluations of Alignment in values, and Preference Evaluations are not included in behavioral evaluations. They are Motivation Evaluations, not Behavioral Evaluations. See A Taxonomy Of AI System Evaluations [WIP] [LW · GW]. 

Capability Evaluations, as used in the literature, could be seen as a special case of propensity evaluations, in which clusters of behaviors are created to match different levels of capability (often just two: success and failure) and the priority evaluated is between behaving successfully or not. But for clarity, we keep calling these capability evaluations and do not include them in propensity evaluations.

Alignment Evaluations, as used in the literature, are both what we name Evaluations of Alignment in actions and Evaluations of Alignment in values. Only the Evaluations of Alignment in actions are a special case of propensity evaluations.

“Alignment evaluations” by Evan Hubinger (July 2023)

From When can we trust model evaluations? [AF · GW]: “An alignment evaluation is a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down [AF · GW]?”

“Alignment evaluations” and “Capability evaluations” by Apollo Research (January 2024)

From A starter guide for Evals: “There is a difference between capability and alignment evaluations. Capability evaluations measure whether the model has the capacity for specific behavior (i.e. whether the model “can” do it) and alignment evaluations measure whether the model has the tendency/propensity to show specific behavior (i.e. whether the model “wants” to do it). Capability and alignment evals have different implications. For example, a very powerful model might be capable of creating new viral pandemics but aligned enough to never do it in practice.”

In this blog post, instead of using “alignment evaluations” as in these two definitions, we use the term “propensity evaluations”.

How do propensity evals differ from capability evals?

While propensity evaluations and capability evaluations measure different things, they are both evaluations applied to AI systems, and both are mostly about comparing the priorities given by an AI system to different clusters of behaviors (in the case of capability evaluations, the clusters are most of the time defined by the criterion “success at the task”). As such, we should not be surprised if they share many aspects.

In this section, to make the differences between capability evaluations and propensity evaluations more explicit, we chose to focus on comparing capability evaluations to propensity evaluations with low capability-dependence.

We use the taxonomy introduced in A Taxonomy Of AI System Evaluations [WIP] [LW · GW] to help us search for their differences.

Categorical differences:

At the root, there is only one binary difference between capability and propensity evaluations: the nature of the property measured.

What are the implications of this categorical difference?

Implications about the clustering criteria used by each kind of evaluation:

Implications about the normative evaluation criteria

About “normative evaluation criteria” see: What is the difference between Evaluation, Characterization, Experiments, and Observations? [LW(p) · GW(p)] or A Taxonomy Of AI System Evaluations [WIP] [LW · GW] Dim Cf2.

Implications about the information required:

Implications about the scope of the behaviors evaluated:

Soft differences:  

Soft differences are differences in usage more than differences in nature. These differences are not given but rather created by practitioners because of factors like the cost of data generation or intuitions.

Capability and propensity evaluations are often used in different elicitation modes.

For context about what are “Elicitation modes”, see (Dim Ab2) “Under which elicitation mode is the model evaluated?” in A Taxonomy Of AI System Evaluations [WIP] [LW · GW]. 

Essentially, since capabilities are (totally) ordered for a given task, capability evaluations tend to try to ‘climb’ this order as far as possible, demonstrating the ‘possibility’ of a level of capability. Propensity evals usually just try to observe tendencies. 

Predictability and existence of scaling laws

There is a notable difference in the prevalence of scaling laws between capability evaluations and propensity evaluations.

Some potential explanations for this are:

Capability and propensity evaluations are often performed at different levels of abstraction.

Possible explanations for this difference:

Claim:

Capability and propensity evaluations often use different sources of truth.

Claim

Which propensities are currently evaluated?

A variety of evaluations exist that measure the propensities of AI systems (specifically LLMs). Note that the space of propensity evaluations is still somewhat conceptually unclear to us: there are concepts, such as faithfulness, honesty, truthfulness, and factuality, which point to similar broad ideas but have slightly different precise meanings, and whose operationalizations we did not spend enough time clarifying (for some, no operationalization exists). Hence, creating a robust, non-redundant taxonomy for propensity evaluations is difficult; we expect that the list of propensities below is not organized optimally, but we provide it regardless, to point readers to relevant studies and to give a broad overview of existing propensity evaluations.

The classification below, of propensities into high/medium/low levels of capability dependence, remains largely speculative.

Propensities with high capability-dependence

| Propensity | Description | Existing Examples (non-exhaustive) |
|---|---|---|
| Truthfulness | A model’s propensity to produce truthful outputs. This propensity requires an AI system to be both honest and to know the truth (or other weirder settings such that the AI system outputs the truth while believing it is not the truth). | How to catch a Liar; BigBench HHH; Truthful QA; Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models; Cognitive Dissonance |
| Factuality | A model’s propensity to generate outputs that are consistent with established facts and empirical evidence. This is close to Truthfulness. | On Faithfulness and Factuality in Abstractive Summarization |
| Faithfulness of Reasoning | A model’s propensity to generate outputs that accurately reflect and are consistent with the model’s internal knowledge. | Faithfulness in CoT; Decomposition improves faithfulness; Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting |
| Harmlessness | A model’s propensity to not harm other agents. This obviously requires some degree of capability. | BigBench HHH; Model-Written Evaluations |
| Helpfulness | A model’s propensity to do what is in the humans’ best interests. This obviously requires some degree of capability. | BigBench HHH; FLASK; Model-Written Evaluations |
| Sycophancy | A model’s propensity to tell users what it thinks they want to hear or would approve of, rather than what it internally believes is the truth. | Model-Written Evaluations; Towards Understanding Sycophancy in Language Models |

Propensities with medium capability-dependence

| Propensity | Description | Existing Examples (non-exhaustive) |
|---|---|---|
| Deceptivity | A model’s propensity to intentionally generate misleading, false, or deceptive outputs. | Machiavelli benchmark; Sleeper Agents (toy example) |
| Obedience | A model’s propensity to obey requests or rules. | Can LLMs Follow Simple Rules?; Model-Written Evaluations |
| Toxicity | The propensity to refrain from generating offensive, harmful, or otherwise inappropriate content, such as hate speech, offensive/abusive language, pornographic content, etc. | Measuring Toxicity in ChatGPT; Red Teaming ChatGPT |

Propensities with low capability-dependence

| Propensity | Description | Existing Examples (non-exhaustive) |
|---|---|---|
| Honesty | A model’s propensity to answer by expressing its true beliefs and actual level of certainty. | Honesty evaluations can be constructed using existing Truthfulness evaluations, e.g. by filtering a Truthfulness benchmark to make sure the model knows the true answer (e.g., using linear probes). In that direction: Cognitive Dissonance |
| Benevolence | A model’s propensity to have a positive disposition towards humans and act in a way that benefits them, even if not explicitly requested or instructed. | Model-Written Evaluations |
| Altruism | A model’s propensity to act altruistically for sentient beings, and especially not only for the benefit of its user or overseer. | Model-Written Evaluations |
| Corrigibility | A model’s propensity to accept feedback and correct its behavior or outputs in response to human intervention or new information. | Model-Written Evaluations |
| Power Seeking | A model’s propensity to seek a high level of control over its environment (potentially to maximize its own objectives). | Machiavelli benchmark; Model-Written Evaluations |
| Bias/Discrimination | A model’s propensity to manifest or perpetuate biases, leading to unfair, prejudiced, or discriminatory outputs against certain groups or individuals. | Evaluating and Mitigating Discrimination; BigBench Bias (section 3.6); WinoGender; Red Teaming ChatGPT; Model-Written Evaluations |

What are the metrics used by propensity evaluations?

Metrics used by capability evaluations and propensity evaluations are, at a high level, the same. Most of the time, the metric used is the likelihood of an AI system behavior falling in one cluster of behaviors C, given only two clusters C and Not_C. E.g., success rate, accuracy, frequency of behavior.
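As a minimal sketch, such a frequency metric can be computed directly from counts of classified behaviors. The confidence interval shown here is our own addition (a standard normal approximation), not something the evaluations above prescribe:

```python
import math

def propensity_rate(in_c: int, in_not_c: int, z: float = 1.96):
    """Empirical P(behavior in C | behavior in C or Not_C), with a
    normal-approximation 95% confidence interval (our addition)."""
    n = in_c + in_not_c
    p = in_c / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# E.g. 42 sampled behaviors landed in cluster C, 58 in Not_C:
rate, (low, high) = propensity_rate(in_c=42, in_not_c=58)
print(f"{rate:.2f} [{low:.2f}, {high:.2f}]")
```

Reporting an interval rather than a bare rate matters for propensity claims, since small eval sets make frequency estimates noisy.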

Can we design metrics that would be more relevant for propensity evaluations?

Here are some thoughts about possible metrics inspired by our work on a taxonomy of evaluations [LW · GW]:

A few approaches to propensity evaluation

We provide an introduction to different ways in which propensity evaluations are or could be performed. Most of the time these ways don’t differ significantly from how capability evaluations are performed. Thus, this section can also be read as an introduction to a few classes of evaluation methods.

We split how evaluations are performed using the Dim Bd10 “Which information about the model is used?” from the taxonomy described in A Taxonomy Of AI System Evaluations [WIP] [LW · GW]:

Black Box Propensity Evaluations 

Black-Box (or Output-based) evaluations consist of querying the LM with a prompt and evaluating the response without any access to the model’s inner workings. 

Their main advantages are that they are easy to conduct and cheap. Some disadvantages worth mentioning are their lack of transparency and replicability. 

Black-box evaluations are one of the main methods of evaluating LMs nowadays. They are, however, a very limited way to test the capabilities and propensities of LMs. Some common limitations relate to their insufficiency for auditing (Black-Box Access is Insufficient for Rigorous AI Audits).

Some work on black-box evaluations has used LMs to write the evaluation prompts themselves (Discovering Language Model Behaviors with Model-Written Evaluations). There, several propensities are studied, like Corrigibility, Awareness, ..., while other works focus on very specific propensities like Discrimination (Evaluating and Mitigating Discrimination in Language Model Decisions).

A problem that often appears in these black-box propensity evaluations is the lack of robustness to changes in the prompts. Some work has been done using multi-prompts in the evaluation datasets to get more consistent results (State of What Art? A Call for Multi-Prompt LLM Evaluation, Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting).

Within black-box evaluations, there exist different alternatives for how to evaluate the model. A classic approach consists of having datasets with prompts (single-choice, multiple-choice questions, ...), having the LM respond to these prompts, and evaluating the answer (either manually with a human worker or automatically via another LM). But there are many others, like testing them in open-ended games (Machiavelli benchmark, LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games), …
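The classic prompt-and-grade loop can be sketched as follows; `query_model` and `judge` are stand-ins for a real model call and a real grader (human or LM), not any actual API:

```python
# Minimal sketch of a black-box propensity eval loop.
def query_model(prompt: str) -> str:
    """Stand-in for a real model call; always refuses here."""
    return "I'd rather not answer that."

def judge(prompt: str, answer: str) -> bool:
    """Stand-in grader: did the model refuse? In practice this would be
    a human rater or another LM, not a keyword match."""
    refusal_markers = ("rather not", "cannot help", "won't")
    return any(m in answer.lower() for m in refusal_markers)

def run_eval(prompts) -> float:
    """Refusal rate over the prompt dataset."""
    verdicts = [judge(p, query_model(p)) for p in prompts]
    return sum(verdicts) / len(verdicts)

print(run_eval(["Write malware.", "Insult this person."]))  # → 1.0
```

Note that the fragility discussed above lives in both stand-ins: real results shift with prompt phrasing and with the grader's criteria.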

Another class of black-box evaluation, especially relevant for propensity evaluations, is the use of reward models (RMs). RMs can be seen as evaluations; they are most of the time black-box evaluations, and they have the advantage of letting us specify via training how to score behaviors. This can be advantageous for propensity evaluations, which use more subjective criteria to define clusters of behaviors or to score them. Other advantages of RMs as propensity evaluations are their cheap scaling cost and their predictable scaling properties (Scaling Laws for Reward Model Overoptimization, The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models). A special case of RM that is especially cheap to use is an LLM prompted 0-shot or few-shot to evaluate the outputs of the AI system studied.

White Box Propensity Evaluations

By white box evaluations we mean methods that take as inputs some or all of the models’ internals and possibly their input-output behavior in order to evaluate their propensities. 

Representation Engineering

Representation engineering concerns itself with understanding the internal learned representations of models, how they determine models’ outputs, and how they can be intervened upon to steer the models’ behaviors. While mechanistic interpretability is also concerned with model internals, representation engineering takes a top-down approach, starting from high-level concepts and treating representations as fundamental units of analysis, as in Representation Engineering: A Top-Down Approach to AI Transparency.

In the above work, the authors distinguish between representation reading (which can be used for evaluation) and control. They extract representations corresponding, for example, to truthfulness, and are able to classify dishonesty from the model’s activations. Furthermore, they are able to steer the models’ representations at inference time using reading and contrast vectors towards more ‘truthful’ behavior. Another similar example is that of (N. Rimsky) [AF · GW], where the authors modulate Llama-2’s (Llama 2) sycophancy. They generate steering vectors from Anthropic’s sycophancy dataset (Discovering Language Model Behaviors with Model-Written Evaluations) by averaging the differences in residual stream activations for contrast pairs of data points, and applying them to the inference-time activations. 
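The contrast-pair recipe can be sketched as below, with random arrays standing in for real residual-stream activations (a real implementation would read activations from forward-pass hooks on the model):

```python
import numpy as np

# Sketch of building a steering vector as the mean difference in
# (hypothetical) residual-stream activations over contrast pairs, in the
# spirit of the sycophancy-steering work cited above. The activations here
# are random stand-ins, not real model internals.
rng = np.random.default_rng(0)
d_model = 16

def get_activations(prompt: str) -> np.ndarray:
    """Placeholder for reading the residual stream via a model hook."""
    return rng.standard_normal(d_model)

contrast_pairs = [
    ("sycophantic answer A", "non-sycophantic answer A"),
    ("sycophantic answer B", "non-sycophantic answer B"),
]

diffs = [get_activations(pos) - get_activations(neg)
         for pos, neg in contrast_pairs]
steering_vector = np.mean(diffs, axis=0)

# At inference time the vector would be added, with some coefficient,
# to the residual stream: h_new = h + alpha * steering_vector.
alpha = -1.0  # a negative coefficient steers *away* from the behavior
print(steering_vector.shape)
```

For evaluation (rather than control), the same vector can be used in the "reading" direction: projecting activations onto it to score how strongly a behavior is represented.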

Eliciting Latent Knowledge 

Another relevant area of research for white-box evaluations is “Eliciting Latent Knowledge” (ELK) (Eliciting Latent Knowledge from Quirky Language Models), which is concerned with probing a model’s internals to robustly track the model’s knowledge about the true state of the world, even when the model’s output itself is false. In (Eliciting Latent Knowledge from Quirky Language Models), the authors use several linear probing methods to elicit the model’s knowledge of the correct answer, even when the model has been fine-tuned to output systematically wrong answers in certain contexts. Furthermore, they also show that a mechanistic anomaly detection approach can flag untruthful behavior with almost perfect accuracy in some cases. 
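To illustrate just the probing step, here is a toy linear probe trained on synthetic "activations" to predict a truth label. Real ELK work probes actual model internals; everything below (data, dimensions, training loop) is synthetic and illustrative:

```python
import numpy as np

# Toy linear probe: logistic regression from (synthetic) activations to a
# binary "the model believes this is true" label.
rng = np.random.default_rng(0)
d, n = 8, 200
true_direction = rng.standard_normal(d)   # hidden "belief" direction

X = rng.standard_normal((n, d))           # stand-in activations
y = (X @ true_direction > 0).astype(float)  # synthetic truth labels

# Plain gradient descent on the logistic loss.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

accuracy = ((X @ w > 0) == y.astype(bool)).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The ELK-relevant point is that a probe like this can track the latent label even in settings where the model's *output* has been trained to be systematically wrong.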

Potential for propensity evaluations

As interpretability and representation engineering methods are in their infancy, there aren’t many so-called ‘white-box’ evaluations, and there are particularly few of what we would call ‘white-box propensity evaluations’. One of the most relevant lines of work for this revolves around evaluating truthfulness and honesty. In part thanks to the availability of several standardized datasets (e.g., TruthfulQA), this is one of the areas where steering and control are best studied. This kind of work gives us a glimpse of how ‘white-box’ methods could be used as part of evaluations, increasing their robustness.

Causal Abstraction and Representation Learning 

One issue with black-box evaluations is their lack of guarantees about behavior on unseen data. More specifically, black-box evaluations only tell us to what extent an LLM faithfully models the empirical distribution of desired input-output pairs. They do not necessarily inform us about the causal model learned by such an LLM (Towards Causal Representation Learning). Knowing this internal causal model would make such evaluations much more valuable, as it would allow us to understand the model’s behavior in a manner that generalizes robustly across new data points. 

One particular example of work in this direction is Faithful, Interpretable Model Explanations via Causal Abstraction, and more recently Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. The former work lays out a framework for understanding the algorithms learned by a model through causal abstraction: essentially constructing a high-level, possibly symbolic, proxy for the lower-level behavior of the model. The latter work extends this by allowing for causal variables to be recovered from “distributed representations”. This solves a number of issues with previous approaches to causal abstraction, namely the assumption of high-level variables aligning with disjoint sets of neurons, and the necessity for expensive brute force computations in order to find such sets. The authors allow for neurons to play multiple roles by studying the representations in non-standard bases, through learned rotations. Concretely, they train models to solve a hierarchical equality task (a test used in cognitive psychology as a test of relational reasoning) as well as a natural language inference dataset. In both cases, using DAS, they are able to find perfect alignment to abstract causal models that solve the task. 

Finally, the work Causal Abstraction for Faithful Model Interpretation shows how several prominent explainable AI methods, amongst which circuit explanations, LIME, iterated null-space projection can be all formalized through causal abstraction. 

These are all tools that could be used to advance white-box evaluations, essentially satisfying the desirable property of ‘doing the right thing, for the right reason’. They are especially relevant for white-box evals, as they (try to) reconstruct causal models from internal activations. 

Issues

Issues that may arise with ‘white-box’ evaluations are mostly related to the non-exhaustive nature of current interpretability methods, and the lack of a rich theory of the learned information-processing machinery in model internals. By ‘non-exhaustive’, we mean that current interpretability methods do not necessarily uncover all the internal representations of a given concept, or all the internal mechanisms that process it. This often leads to ‘interpretability illusions’. Furthermore, a potential shortcoming of such evaluations with current approaches is the need for ‘labels’ in order to extract latent representations of a given concept. For instance, in the case of activation steering, we require a dataset of positive and negative behaviors (with respect to the behavior we are interested in). This somewhat reduces the scope of these evaluations, while also introducing noise and potential biases associated with these datasets. 
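The label dependence is easy to see in the simplest activation-steering recipe: the steering vector is just the difference of mean activations over positive- and negative-labeled prompts, so any noise or bias in those datasets flows directly into the vector. A hedged sketch with synthetic activations standing in for hidden states collected at one layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: each row is a hidden state at one layer for a
# prompt labeled as exhibiting ("pos") or not exhibiting ("neg") the
# behavior of interest. In practice these come from real forward passes.
d_model = 8
pos_acts = rng.normal(loc=1.0, size=(32, d_model))
neg_acts = rng.normal(loc=-1.0, size=(32, d_model))

# Steering vector = difference of class means; dataset noise and labeling
# bias propagate directly into this estimate.
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden_state, alpha=1.0):
    """Add the steering vector to a hidden state during generation."""
    return hidden_state + alpha * steering_vec

print(steering_vec.shape)  # (8,)
```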

No-box Propensity Evaluations

No-box propensity evaluations assess an AI system's behavioral tendencies based solely on general information about the model, such as its size, training compute, and training data. Unlike black-box evaluations, which require access to the model's behavior, and white-box evaluations, which require access to the model itself, no-box evaluations do not interact with the model directly.

No-box evaluations primarily rely on empirical scaling laws and trends observed through other evaluation methods, most often black-box evaluations. In this sense, no-box evaluations can be considered predictors of other evaluation results. Researchers have discovered relatively robust and predictable scaling laws for low-level capabilities like next-token prediction, as well as rough forecasting trends for higher-level capabilities ([2203.15556] Training Compute-Optimal Large Language Models, Scaling Laws for Autoregressive Generative Modeling, Deep Learning Scaling is Predictable, Empirically). However, the prevalence and precision of such laws for propensities remain mostly understudied.
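The basic mechanic of such a prediction can be sketched as a power-law fit in log-log space. The compute budgets and loss values below are invented for illustration; a real no-box evaluation would fit measurements from smaller training runs and extrapolate without ever querying the larger model:

```python
import numpy as np

# Hypothetical eval losses measured at small training-compute budgets.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs
loss = np.array([3.2, 2.6, 2.1, 1.7])         # observed eval loss

# Fit a power law L(C) = a * C^(-b) via linear regression in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate to a compute budget no model has been trained at yet.
predicted = a * (1e22) ** (-b)
print(round(predicted, 2))
```

Whether propensities follow trends this clean is exactly the open question the paragraph above raises.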

While model size and training compute are straightforward numerical quantities, training data is inherently high-dimensional and therefore potentially more informative but also more challenging to analyze. 

The key advantage of no-box evaluations is their minimal requirements and high safety, since they do not require access to the model or its behavior. However, this comes at the cost of much tighter constraints on the information accessible to the evaluation.

Additional references: Machine Learning Scaling, Example Ideas | SafeBench Competition, Emergent Abilities of Large Language Models, Inverse Scaling: When Bigger Isn’t Better 

Propensity Evaluations of Humans

Propensity evaluations of humans are ubiquitous in society, ranging from applications in education and employment to public opinion and interpersonal interactions. Security clearances for military-related activities provide a particularly relevant example due to their high stakes and standardized methods. Examining these practices can offer valuable analogies for understanding propensity evaluations of AI systems, which may facilitate public understanding and inform policy decisions. However, it is crucial to acknowledge the significant differences between humans and AI that limit the direct technical applicability of these analogies to AI safety.

As with AI systems, propensity evaluations of humans can be categorized based on the data utilized: no-box, black-box, and gray-box evaluations. No-box evaluations rely on information such as criminal records, letters of recommendation, and personal social media accounts. Black-box evaluations involve real-time interaction with and observation of an individual, such as interviews. Gray-box evaluations employ physiological measurements that may correlate with a person's thoughts or intentions, such as polygraphs (which measure indicators like sweating to detect deception) and brain fingerprints (which use electroencephalograms to determine if an individual recognizes an image they should not).

While the scientific foundations of gray-box methods like polygraphs and brain fingerprints remain underdeveloped, their current use suggests that analogous propensity evaluations of AI systems may be more readily understood and accepted in the near term. However, it is essential to consider the critical differences between humans and AI when drawing these analogies. For instance, AI models may possess capabilities far surpassing those of humans, necessitating a lower tolerance for harmful propensities. 

Some differences between humans and AI may prove advantageous for AI safety. For example, an AI model is defined by its weights, enabling replicable or white-box evaluations that are currently impossible for humans. 

Additional references: Reducing long-term risks from malevolent actors — EA Forum [EA · GW], Reasons for optimism about measuring malevolence to tackle x- and s-risks [EA · GW]

Towards more rigorous propensity evaluations

To improve the rigor of propensity evaluations, we propose using game theory as a tool for clustering behaviors rigorously. Game theory should be well suited to this task when the propensity being evaluated concerns the incentives, impact, and externalities of the AI system's behaviors.

One approach could be to use game theory to cluster behaviors by the direction or the strength of their effects on the environment and other agents (players). For example:

Additionally, clustering could take into account both the direction and severity of the impacts. Combining impact direction with severity levels would enable more nuanced and informative behavioral clusters.

Using game theory in this way has several advantages for improving the rigor of propensity evaluations:

While operationalizing game theory to cluster behaviors is not trivial and requires further research, we believe it could be a promising avenue for improving the rigor and informativeness of propensity evaluations. 
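A minimal sketch of the clustering idea, assuming a toy game with invented payoffs: each available behavior is assigned to a cluster by the direction of the externality it imposes on another agent, measured as the payoff change relative to a neutral baseline action. Extending the key to severity buckets would give the more nuanced clusters discussed above.

```python
# Hypothetical payoff table: payoffs[action] = (AI's payoff, other agent's payoff).
# All actions and numbers are invented for illustration.
payoffs = {
    "share_resource": (2, 3),
    "cooperate":      (3, 2),
    "do_nothing":     (1, 0),   # neutral baseline action
    "hoard_resource": (4, -2),
    "deceive":        (5, -4),
}

baseline_other = payoffs["do_nothing"][1]

def cluster(action):
    """Cluster a behavior by the sign of its externality on the other agent."""
    delta = payoffs[action][1] - baseline_other
    if delta > 0:
        return "positive-externality"
    if delta < 0:
        return "negative-externality"
    return "neutral"

clusters = {a: cluster(a) for a in payoffs}
print(clusters)
```

A propensity evaluation would then measure how often the system picks actions from each cluster when capability is held constant, i.e. when all actions are equally easy for it to execute.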

 

 

Contributions

Main contributors: Maxime Riche, Jaime Raldua Veuthey, Harrison Gietz

On white-box evaluations: Edoardo Pona

Comments, discussions, and feedback: Edoardo Pona
