An introduction to language model interpretability

post by Alexandre Variengien (alexandre-variengien) · 2023-04-20T22:22:49.519Z

Contents

  Introduction
  Two Whys
    A toy model of target features for self-supervised models
    Caveats of the toy model
    Sidenote: The anthropomorphization bias
  Two Hows
  Choosing a task
  Interpreting the model
  Communicating results

This document was written while working at Redwood Research as part of the training material for REMIX. Its goal was to introduce useful considerations before beginning empirical work on language model interpretability. It assumes familiarity with causal scrubbing and with the work on IOI.

While it's technically already publicly accessible from the REMIX GitHub repo, I repost it here with minor edits to make it more visible, as it could be helpful for newcomers to the field of LM interpretability.

Introduction

So you have a language model, and you want to do interp to explain how it implements a particular behavior. Or maybe you're still searching for a behavior to explain. Before jumping straight into defining a task or running experiments, it is useful to contextualize your subject of study. How should you think about language model behaviors? How are they shaped by the training process? For a given model behavior, what structure should you expect to find in the model's internals?

These are hard questions, and we are far from having definitive answers to them. Here, I'll present a toy model enabling informed guesses in practical cases. It's better seen as a "scaffolded intuition" than a proper theoretical framework. After introducing the toy model, I'll end with a list of practical advice for LM interp research.

To start the reasoning, I found it useful to borrow questions from ethology. After all, ethology is another field that studies behaviors in organisms selected to maximize a particular objective (fitness) function. I’ll go through an adaptation of Tinbergen's four questions — two “whys” and two “hows” — to think about what “behaviors” are.

Two Whys

Why #1 - Function - Why does the model learn this behavior on this input distribution rather than something else? How does the behavior affect the model loss on its training set?

In the case of language models, what are the features present in text data that the behavior is mimicking? In other words: If a language model is a butterfly that mimics the appearance of leaves, what is the exact shape of the leaf's veins it’s trying to imitate?

A model imitating its training data

If you don’t have any convincing story for the previous question, maybe it’s because the behavior you're looking at is not a useful mechanism for modeling the training data; it can be the side effect of a useful mechanism. For instance, text-ada-001 completes the prompt “This&is&a&sentence&” with a loop of “&is&a&sentence&” separated by blank lines. I’m not sure that this particular behavior corresponds to any meaningful feature of internet text data. It's likely that the prompt is far out of distribution, putting the model in “high uncertainty mode”, so it falls back on generally useful heuristics like repeating text.
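
If you want to poke at this kind of behavior yourself, here is a minimal sketch using GPT-2 small through the Hugging Face transformers library as a stand-in for text-ada-001 (the exact completion will differ between models; only the prompt is taken from above):

```python
# Minimal probe of how a small LM continues an out-of-distribution prompt.
# GPT-2 small stands in for text-ada-001 here; completions will differ by model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("This&is&a&sentence&", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,  # greedy decoding makes repetition-style heuristics easy to spot
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
```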

In all cases, it’s important to differentiate between the target feature the behavior is mimicking and the approximation of it that the model actually implements (the latter is often hard to fully characterize). E.g. on IOI, the target feature is “identify a plausible indirect object from the context”, and the behavior implemented by GPT2-small is roughly “look for non-duplicated names in the context”. The two functions overlap enough that this is a good strategy to minimize the training loss.
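
To make the distinction concrete, here is a toy Python caricature of the two functions (the name list and the function bodies are mine, purely illustrative; the model obviously does not run Python). The point is that the two functions agree on typical IOI prompts while being very different objects.

```python
# Toy caricature: target feature vs. the heuristic GPT2-small roughly implements on IOI.
NAMES = {"Mary", "John", "Alice", "Bob"}  # illustrative name list

def target_feature(context_tokens):
    """Identify a plausible indirect object from the context (what the data 'wants')."""
    ...  # requires genuine language understanding; not expressible as a short program

def gpt2_small_behavior(context_tokens):
    """Rough approximation: pick a name that appears exactly once in the context."""
    names = [t for t in context_tokens if t in NAMES]
    non_duplicated = [n for n in names if names.count(n) == 1]
    return non_duplicated[0] if non_duplicated else None

prompt = "When Mary and John went to the store , John gave a drink to"
print(gpt2_small_behavior(prompt.split()))  # -> "Mary"
```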

A toy model of target features for self-supervised models

The goal of self-supervised language models is to predict a next-token probability distribution that imitates the large pile of text they saw in training. 

As a simplified model, we can think of the training distribution as a set of target features to imitate.[1] Some of those features are “The name used as an indirect object appeared previously in the context”, “When a sentence is ending in a quotation mark, a closing quotation mark follows”, “After ‘Barack’ often comes ‘Obama’” etc. Each target feature is a task to solve for the language model.

In this model, target features are disjoint such that the total loss on the training set is the sum of the losses on each individual target feature, weighted by how frequently they appear in the training set. 
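
In symbols (notation introduced here only for this sketch): if \(p_f\) is the frequency of target feature \(f\) in the training data and \(L_f\) is the model's imitation loss on \(f\), the toy model treats the total training loss as

\[L_{\text{total}} \approx \sum_{f \in \text{features}} p_f \, L_f.\]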

If we zoom in on a particular target feature, we can ask: given a model and a limited implementation capacity, what is the best approximation of the target feature we can find? In other words: what is the Pareto frontier of imitation loss versus complexity budget? Each point on this frontier is a behavior mimicking the target feature.

Depending on the task, this graph will look different. For instance, on a memorization task like predicting “Obama” after “Barack”, the curve shows a sharp transition from no association to a learned association. More advanced tricks can be learned with a sufficient complexity budget, for instance detecting contexts where Obama’s full name (Barack Hussein Obama II) will be used instead of the short version. But such fancy tricks require a lot of model capacity while yielding only a small gain in imitation loss.
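
As a toy numeric illustration of the frontier idea (all numbers invented), each candidate implementation can be seen as a (complexity, imitation loss) point, and the frontier keeps the points not dominated by a cheaper-and-better alternative:

```python
# Invented (complexity, imitation loss) points for candidate implementations
# of the "Barack -> Obama" feature, to illustrate the Pareto frontier picture.
candidates = {
    "no association": (0.0, 5.0),
    "memorized bigram": (1.0, 0.5),                 # sharp drop: one stored fact
    "detect full-name contexts too": (10.0, 0.45),  # fancy trick, tiny extra gain
}

def pareto_frontier(points):
    """Keep implementations not dominated by one that is both cheaper and better."""
    frontier = []
    for name, (c, l) in points.items():
        dominated = any(
            c2 <= c and l2 <= l and (c2, l2) != (c, l)
            for c2, l2 in points.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(candidates))
```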


Some target features require understanding subtle clues in the sentence, e.g. choosing whether the next token should be “is” or “was”. It is hard to describe exactly how such features could be implemented in a Python program; they often rely on fuzzy concepts like “the likely tense of the sentence” or “the sentiment of the sentence”. To imitate these features, there is a sequence of increasingly complex statistical heuristics. Simple models could use bigram statistics, while at the (unrealistic) far end of the complexity spectrum, a model could simulate human brains to extract statistics about how humans tend to complete such sequences.
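
A minimal sketch of the cheap end of that spectrum, a bigram heuristic estimated by counting on a toy corpus (the corpus and example are made up):

```python
# Cheap statistical heuristic: predict the next token from bigram counts alone.
from collections import Counter, defaultdict

corpus = "the cat is here . the dogs were there . the cat was tired .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(prev_token):
    counts = bigram_counts[prev_token]
    return counts.most_common(1)[0][0] if counts else None

# Picks whichever of "is"/"was" followed "cat" most often in the toy corpus,
# with no notion of the sentence's tense at all.
print(predict_next("cat"))
```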


Finally, other target features are in between the two previous types. There exists a simple algorithm that is much better than memorized statistics, e.g. because the next token can be retrieved from the context, as in IOI.


Having sketched the landscape of target feature approximations, we need to discuss what pushes a model to choose a particular tradeoff, i.e. a particular point on the curve.

We can see a language model as a reservoir of complexity. During training, its job is to

  1. Allocate its complexity budget to target features according to their importance.
  2. For each target feature, optimize its approximation given the allocated budget.

As the loss is averaged over the training dataset, the importance of target features is their frequency in the training data. 
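
A toy sketch of this allocation problem (the loss curves and frequencies below are invented): greedily spend each unit of capacity on whichever feature currently buys the largest drop in frequency-weighted loss. Real training does nothing like this explicit greedy loop, of course; the point is only that frequency acts as the weight in the tradeoff.

```python
# Toy allocation of a complexity budget across target features.
# loss_curves[f][b] = imitation loss on feature f given b units of capacity (invented).
loss_curves = {
    "IOI":          [3.0, 1.0, 0.9, 0.9],  # one cheap algorithm captures most of the gain
    "Barack Obama": [5.0, 0.5, 0.5, 0.5],  # pure memorization: one fact, then flat
    "verb tense":   [4.0, 3.0, 2.2, 1.6],  # long tail of better statistical heuristics
}
frequency = {"IOI": 0.2, "Barack Obama": 0.1, "verb tense": 0.7}

budget = 4
alloc = {f: 0 for f in loss_curves}

def marginal_gain(f):
    """Frequency-weighted loss reduction from giving feature f one more unit of capacity."""
    b = alloc[f]
    if b + 1 >= len(loss_curves[f]):
        return 0.0
    return frequency[f] * (loss_curves[f][b] - loss_curves[f][b + 1])

for _ in range(budget):
    best = max(loss_curves, key=marginal_gain)
    alloc[best] += 1

print(alloc)  # frequent-but-fuzzy features tend to soak up most of the capacity
```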

This tradeoff is better illustrated by considering different models. This leads us to the second question.

Why #2 - Comparative Model Analysis - Why does this behavior exist in a particular model? What is the corresponding behavior in related models (larger / smaller, slightly different architectures), and what are the architecture details necessary for it to arise?

By changing the model, you change

  1. The total complexity budget available.
  2. How costly a given behavior is to implement, i.e. what counts as “simple” for that architecture.

It’s important to keep track of those two points when considering a model: “Does this model have enough capacity to store excerpts from Hamlet?”, “In a Transformer architecture, is the algorithm for 2-digit addition more complex than a lookup table?” etc. 

Caveats of the toy model

We cannot consider target features in isolation. In reality, the behavior implementations all run in parallel and interact. However, the toy model can be good enough in practice: LMs seem to have strong gating mechanisms that allow certain behaviors to be expressed and not others (e.g. the French-speaking part of an LM doesn’t fire in the middle of an English text).
But there are cases where the interaction cannot be ignored. Take an important target feature F, modeled by behavior B. There can be a less important target feature G that resembles F. On G, the LM will execute B, even though B is not a good approximation of G.

Generally, this toy model rests on many ill-defined quantities, such as implementation complexity and complexity budget. It is intended as an intuition pump to surface useful considerations rather than as a definitive model of LM behaviors.

Sidenote: The anthropomorphization bias

It’s sometimes hard to tell the difference between the model’s behavior and the target feature it’s approximating. As humans, we easily project our own way of solving a task onto the model when we see human-like behavior. The general advice applies here: try thinking like an LM, and prefer simpler explanations when available.

Two Hows

How #1 - Development - How does the behavior evolve during training? Is there some path dependence in the emergence of the behavior? What training data are necessary for the trained model to display the behavior?

Until now, we have considered models that perform a perfect allocation of their capacity according to the importance of target features, and a perfect approximation of each feature under that allocation.

In reality, those tradeoffs and approximations are made by a gradient descent-based optimization and are likely to be imperfect. Hence, it's important to keep in mind the specifics of your model’s training process to anticipate deviations from the perfect allocation described above. For instance, if a model is trained with dropout, you’ll likely find redundant components in its internals, even though redundancy is not required to implement the behavior.

How #2 - Causation - How is the behavior implemented in the model’s internals? What are the text features of the prompt that elicit the response?

Finally, this is the question we want to answer by doing interp! I won't go through a comprehensive description of how to tackle it. Instead, I'll give some advice about various steps of the research process. Some points are high-level and some are technical; feel free to skip those that feel too in the weeds.

Choosing a task

With the toy model in mind, here is a set of points to consider when choosing a behavior to investigate. 

Once you have an idea for a task, and before answering how the behavior is implemented, it’s helpful to spend some time testing the extent of the behavior, i.e. trying to answer the second half of the causation question.
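
One cheap way to do that, as a sketch (the prompts, names, and choice of GPT-2 small are all just examples), is to sweep prompt variants and track the logit difference between the answer you expect and a distractor:

```python
# Sketch: measure how robust a behavior is across prompt variants via logit differences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def logit_diff(prompt, correct, incorrect):
    """Logit of the expected next token minus the distractor, at the last position."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    c = tokenizer.encode(" " + correct)[0]
    w = tokenizer.encode(" " + incorrect)[0]
    return (logits[c] - logits[w]).item()

prompts = [
    "When Mary and John went to the store, John gave a drink to",
    "When Mary and John went to the park, John handed a ball to",
]
for p in prompts:
    print(round(logit_diff(p, "Mary", "John"), 2), p)
```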

Interpreting the model

I’ll focus on the step where you have the structure of the circuit (e.g. obtained by path patching) and you try to get a human understanding of what each part is doing. A component refers to an MLP layer or an attention head acting at a particular token position. 
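
For reference, here is a minimal sketch of the kind of experiment that produces such a circuit structure. It uses the TransformerLens library rather than the tooling used in the original project, and activation patching on a single attention head rather than full path patching; the layer/head choice and prompts below are illustrative.

```python
# Sketch: patch one attention head's output from a clean run into a corrupted run.
# Uses TransformerLens; layer/head and prompts are illustrative choices.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean   = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
assert clean.shape == corrupt.shape  # patching every position assumes equal lengths

_, clean_cache = model.run_with_cache(clean)

LAYER, HEAD = 9, 9  # a "name mover" head in the IOI analysis; purely illustrative here

def patch_head_output(z, hook):
    # z has shape [batch, position, head_index, d_head]
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head_output)],
)
# Comparing patched_logits at the last position with the clean and corrupted baselines
# gives an estimate of how much this single component matters for the behavior.
```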

Communicating results

By doing interp, you’ll develop intuitions about the role of each node by absorbing a huge number of numerical values. Sometimes it’s hard to put into words why you think what you’re thinking, and you’ll spend most of the time confused, with half-baked models. A big part of the work is to introspect on the hidden assumptions you’re making and on the evidence supporting them, and to be ready to update. A good way to do this is to communicate with others.

  1. ^

    This view is similar to the quantization of model abilities introduced in a recent work (published after the original version of this document).
