We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

post by robertzk (Technoguyrob), Connor Kissane (ckkissane), Arthur Conmy (arthur-conmy), Neel Nanda (neel-nanda-1) · 2024-03-06T05:03:09.639Z · LW · GW · 0 comments

Contents

  Executive Summary
  Introduction
  Technique: Attention Head Attribution via Attention Layer SAEs
  Overview of Attention Heads Across Layers
  Investigating Attention Head Polysemanticity
  Discovering Plausibly Monosemantic Heads
  Case Study: Long Prefix Induction Head
  Limitations
  Appendix: Attention Heads Feature Map
  Citing this work
  Author Contributions Statement

This is an interim report that we are currently building on. We hope this update will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.

Executive Summary

Introduction

In previous work [LW · GW], we trained and open sourced a set of attention SAEs on all 12 layers of GPT-2 Small. We found that random SAE features in each layer were highly interpretable, and highlighted a set of interesting feature families. We’ve since leveraged our SAEs as a tool to interpret the roles of attention heads. The key idea is that, although our SAEs are trained to reconstruct the entire layer, the contribution of each individual head to a feature can still be inferred. This allows us to find the top 10 features most salient to a given head, and to note any pattern among them that may suggest a role for that head. We then used this to manually inspect the role of every head in GPT-2 Small, and we spend the rest of this post exploring various implications of our findings and the technique.

In the spirit of An Overview of Early Vision in InceptionV1, we start with a high-level, guided tour of the different behaviors implemented by heads across every layer, building better intuitions for what attention heads learn in a real language model. 

To validate that the technique is teaching something real about the roles of these heads, we confirm that our interpretations match previously studied heads. We note that our annotator mostly did not know a priori which heads had previously been studied. We find:

In addition to building intuition about what different heads are doing, we use our SAEs to better understand the prevalence of attention head polysemanticity at a high level. Our technique suggests that the vast majority of heads (>90%) are doing multiple tasks. This implies that we’ll either need to understand each head’s multiple roles, or primarily use features (rather than heads) as better units of analysis. However, our work also suggests that not all heads are fully polysemantic. We find that 14 of 144 heads (about 10%) are plausibly monosemantic and a further 8 (about 6%) are mostly monosemantic, though we note that, because our technique only explores the top 10 most salient features per head, it is not reliable enough to prove monosemanticity on its own.

Finally, we’re excited to find that our SAE-based technique helps enable faster mech interp research on non-SAE questions. A long open question concerns why there are so many induction heads. Inspecting salient features for the two induction heads in Layer 5 immediately motivated the hypothesis that one head specializes in “long prefix induction”, wanting the last several tokens to match, while the other performs “standard induction”, only needing a single token to match. We then verified this hypothesis with more rigorous non-SAE based experiments. With so much of the field investing effort in sparse autoencoders, it is nice to have signs of life that these are a legitimately useful tool for advancing the field’s broader agenda.

Technique: Attention Head Attribution via Attention Layer SAEs

Recall that we train our SAEs for GPT-2 small [LW(p) · GW(p)] on the “z” vectors (aka the mixed values aka the attention outputs before a linear map) concatenated across all heads for each layer. See our first post [LW · GW] for more technical details.

We develop a technique specific to this setup: decoder weight attribution by head. For each layer, our attention SAEs are trained to reconstruct the concatenated outputs of each head. Thus each SAE decoder direction has n_heads * d_head entries, and we can think of each of its n_heads slices (of d_head entries each) as specifically reconstructing the output of the corresponding head. We then compute the norm of each slice as a proxy for how strongly each head writes the feature. Finally, for each head, we sort features by their decoder weight attribution to get a sense of which features that head is most responsible for outputting.
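The attribution step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names, the decoder layout (head h owning a contiguous block of d_head columns), and the optional normalization across heads are assumptions, and the random matrix merely stands in for a trained decoder. The sizes n_heads = 12 and d_head = 64 are those of GPT-2 Small.

```python
import numpy as np


def decoder_weight_attribution(W_dec, n_heads, d_head, normalize=False):
    """Per-head decoder norms, shape (n_features, n_heads).

    W_dec is assumed to have shape (n_features, n_heads * d_head), with
    head h owning columns h * d_head through (h + 1) * d_head - 1.
    """
    n_features = W_dec.shape[0]
    per_head = W_dec.reshape(n_features, n_heads, d_head)
    norms = np.linalg.norm(per_head, axis=-1)
    if normalize:
        # Assumption: express each norm as a share of the feature's total.
        norms = norms / norms.sum(axis=1, keepdims=True)
    return norms


def top_features_for_head(attribution, head, k=10):
    """Indices of the k features with the largest attribution to `head`."""
    return np.argsort(-attribution[:, head])[:k]


# Stand-in decoder: random weights, NOT a trained SAE.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(256, 12 * 64))
attribution = decoder_weight_attribution(W_dec, n_heads=12, d_head=64)
print(top_features_for_head(attribution, head=9))
```

With a real decoder matrix in place of the random one, the printed indices would be the ten features to inspect for that head.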

For example, for head 7.9, Feature 29342 has a maximal decoder weight attribution of 0.49. 

A better metric for determining attribution of SAEs to attention heads might be to look at the fraction of variance explained (FVE) by each head via DFA, as we did in our first post [LW · GW]. However, this is more computationally expensive and an examination of decoder weight attribution against FVE on random heads shows these are correlated and should roughly preserve ranking order. One example for H4.5 is shown below.

There are a few heads, such as H8.5, where the two metrics are less well correlated for the top-ranking features, and using FVE may have produced better results. We recommend using FVE for any reproductions.
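One way to check whether two attribution metrics preserve ranking order for a given head is a rank correlation together with a top-k overlap. The sketch below is an assumption about how such a check could be run, not the comparison the authors performed; the helper names and the six scores are hypothetical, and ties are not averaged.

```python
import numpy as np


def spearman_rank_correlation(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks.

    Ties are broken by position rather than averaged (a simplification).
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def top_k_overlap(a, b, k=10):
    """Fraction of the top-k items under `a` that are also top-k under `b`."""
    top_a = set(np.argsort(-np.asarray(a))[:k].tolist())
    top_b = set(np.argsort(-np.asarray(b))[:k].tolist())
    return len(top_a & top_b) / k


# Hypothetical scores for six features of one head (illustrative only).
weight_attr = [0.45, 0.41, 0.33, 0.30, 0.12, 0.05]
fve = [0.80, 0.55, 0.60, 0.35, 0.10, 0.02]
print(spearman_rank_correlation(weight_attr, fve))
print(top_k_overlap(weight_attr, fve, k=3))
```

A head like H8.5 would show up here as a low top-k overlap even when the overall rank correlation looks acceptable.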

Overview of Attention Heads Across Layers

Based on a review of the top 10 SAE features by decoder weight attribution for each attention head, we observed that features become more abstract up to layer 9 and then less so after that.

A car context feature, one of 6/10 features attributed to L0H8 focused on context tracking
L6H9: 10/10 induction features
L7H9: 8/10 are reasoning and justification related, with DFA on “believe”, “which”, “clearly”, etc.
L9H0: The features activate on more abstract concepts, like actions related to eyes.
L11H4: 9/10 features involved copying a single letter in the context.

The table below summarizes the kind of feature groupings that we identified across the various layers. Notable heads indicated with an asterisk (*) were particularly interesting as possibly monosemantic due to exhibiting the same feature grouping for all top 10 features.

Layer 0
Feature groups / possible roles:
- Single-token (“of”)
- Bi-gram features (following “S”)
- Micro-context features (cars, Apple tech, solar)
Notable heads:
H0.1: Top 6 features are all variants capturing “of”
H0.5: Identified as duplicate token head from 9/10 features
H0.9: Long range context tracking family (headlines, nutrition facts, lists of nouns)

Layer 1
Feature groups / possible roles:
- Single-token (Roman numerals)
- Bi-gram features (following “L”)
- Specific noun tracking (choice, refugee, gender, film/movie)
Notable heads:
H1.5*: Succession or pairs related behavior
H1.8: Long range context tracking with very weak weight attribution

Layer 2
Feature groups / possible roles:
- Short phrases (“never been…”)
- Entity features (court, media, govt)
- Bi-gram & tri-gram features (“un-”)
- Physical direction and logical relationships (“under”)
- Entities followed by what happened (govt)
Notable heads:
H2.0: Short phrases following a predicate like not/just/never/more
H2.3*: Short phrases following a quantifier (both, all, every, either) or spatial/temporal predicate (after, before, where)
H2.5: Subject tracking for physical directions (under, after, between, by), logical relationships (then X, both A and B)
H2.7: Groups of context tracking features
H2.9*: Entity followed by a description of what it did

Layer 3
Feature groups / possible roles:
- Entity-related fragments (“world’s X”)
- Tracking of a characteristic (ordinality or extremity) (“whole/entire”)
- Single-token and double-token (eg)
- Tracking following commas (while, though, given)
Notable heads:
H3.0: Identified as duplicate token head from 8/10 features
H3.2*: Subjects of predicates (so/of/such/how/from/as/that/to/be/by)
H3.6: Government entity related fragments, extremity related phrases
H3.11: Tracking of ordinality or entirety or extremity

Layer 4
Feature groups / possible roles:
- Active verbs (do, share)
- Specific characterizations (the same X, so Y)
- Context tracking families (story highlights)
- Single-token (predecessor)
Notable heads:
H4.5: Characterizations of typicality or extremity
H4.7*: Weak/non-standard duplicate token head
H4.11*: Identified as a previous token head based on all features

Layer 5
Feature groups / possible roles:
- Induction (F)
Notable heads:
H5.1: Long prefix Induction head
H5.5: Induction head

Layer 6
Feature groups / possible roles:
- Induction (M)
- Active verbs (want to, going to)
- Local context tracking for certain concepts (vegetation)
Notable heads:
H6.3: Active verb tracking following a comma
H6.5: Short phrases related to agreement building
H6.7: Local context tracking for certain concepts (payment, vegetation, recruiting, death)
H6.9*: Induction head
H6.11: Suffix completions on specific verb and phrase forms

Layer 7
Feature groups / possible roles:
- Induction (al-)
- Active verbs (asked/needed)
- Reasoning and justification phrases (because, for which)
Notable heads:
H7.2*: Non-standard induction
H7.5: Highly polysemantic but still some groupings like family relationship tracking
H7.8: Phrases related to how things are going or specific action taken (decision to X, issue was Y, situation is Z)
H7.9: Reasoning and justification related phrasing (of which, to which, just because, for which, at least, we believe, in fact)
H7.10*: Induction head

Layer 8
Feature groups / possible roles:
- Active verbs (“hold”)
- Compound phrases (either)
- Time and distance relationships
- Quantity or size comparisons or specifiers (larger/smaller)
- Url completions (twitter)
Notable heads:
H8.1*: Prepositions copying (with, for, on, to, in, at, by, of, as, from)
H8.5: Grammatical compound phrases (either A or B, neither C nor D, not only Z)
H8.8: Quantity or time comparisons or specifiers

Layer 9
Feature groups / possible roles:
- Complex concept completions (time, eyes)
- Specific entity concepts
- Assertions about characteristics (big/large)
- Grammatical relationship joiners (between)
Notable heads:
H9.0*: Complex tracking on specific concepts (what is happening to time, where focus should be, actions done to eyes, etc.)
H9.2: Complex concept completions (death, diagnosis, LGBT discrimination, problem and issue, feminism, safety)
H9.9*: Copying, usually names, with some induction
H9.10: Grammatical relationship joiners (from X to, Y with, aided by, from/after, between)

Layer 10
Feature groups / possible roles:
- Grammatical adjusters
- Physical or spatial property assertions
- Counterfactual and timing/tense assertions (would have, hoped that)
- Certain prepositional expressions (along, under)
- Capital letter completions (“B”)
Notable heads:
H10.1: Assertions about a physical or spatial property (up/back/down/over/full/hard/soft/long/lower)
H10.4: Various separator character (colon for time, hyphen for phone, period for quantifiers)
H10.5: Counterfactual and timing/tense assertions (if/than/had/since/will/would/until/has X/have Y)
H10.6: Official titles
H10.10*: Capital letter completions with some context tracking (possibly non-standard induction)
H10.11: Certain conceptual relationships

Layer 11
Feature groups / possible roles:
- Grammatical adjustments
- Bi-grams
- Capital letter completions
- Long range context tracking
Notable heads:
H11.3: Late layer long range context tracking, possibly for output confidence calibration

Investigating Attention Head Polysemanticity

While our technique is not sufficient to prove that a head is monosemantic, we believe that having multiple unrelated features attributed to a head is evidence that the head is doing multiple tasks. We also note that there is a possibility we missed some monosemantic heads due to missing patterns at certain levels of abstraction (e.g. some patterns might not be evident from a small sample of SAE features, and in other instances an SAE might have mistakenly learned some red herring features).

During our investigations of each head, we found 14 monosemantic candidates (i.e. all of the top 10 attributed features for these heads were closely related). This suggests that over 90% of the attention heads in GPT-2 small are performing at least two different tasks.

To explicitly show an example of a polysemantic head, we use what the SAE features attributed to the head have learned as evidence that the head performs multiple tasks. In H10.2, we find both integer copying behavior and url completion behavior in the top SAE features. Zero ablating each head in Layer 10 and recording the mean change in loss on synthetic datasets[1] for each task shows a clear jump for H10.2 relative to other heads in L10, confirming that this head really is doing both of these tasks:

Average change in loss when ablating Layer 10 heads on prompts of the form “ {noun} {digit}, {filler text} {noun} {digit}” when predicting the repeated digit.
Average change in loss when ablating Layer 10 heads on prompts of the form “ https://t.co/{random_resource_id}” for the resource id predictions.
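A synthetic dataset for the integer copying task can be built by filling in the prompt template quoted in the caption above. The sketch below is only an assumption about how such prompts might be generated and scored: the noun and filler lists are hypothetical placeholders, and the per-prompt losses would have to come from running the model with and without the head zero ablated, which is not shown here.

```python
import random

# Hypothetical vocabulary for the template; not the authors' lists.
NOUNS = ["page", "room", "chapter", "item"]
FILLERS = [
    "as noted in the earlier section on",
    "which is described again under",
    "and later the text returns to",
]


def make_digit_copy_prompt(rng):
    """Return (prompt, target) for ' {noun} {digit}, {filler} {noun}' -> ' {digit}'."""
    noun = rng.choice(NOUNS)
    digit = rng.choice("0123456789")
    filler = rng.choice(FILLERS)
    prompt = f" {noun} {digit}, {filler} {noun}"
    target = f" {digit}"
    return prompt, target


def mean_loss_change(clean_losses, ablated_losses):
    """Average of (ablated - clean) loss over prompts for one head."""
    if len(clean_losses) != len(ablated_losses):
        raise ValueError("loss lists must have the same length")
    diffs = [a - c for c, a in zip(clean_losses, ablated_losses)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
for _ in range(3):
    print(make_digit_copy_prompt(rng))
```

Computing `mean_loss_change` once per head in a layer gives the kind of per-head comparison summarized in the captions.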

Note that the line between polysemantic and monosemantic heads is blurry, and we had a high bar for considering monosemantic candidates. For example, consider H5.10: all 10 features look like context features, boosting the logits of tokens related to that context. We labeled this as polysemantic since some of the contexts are unrelated, but we could plausibly think of it as a general monosemantic “context” head. We also think that head polysemanticity is on a spectrum (e.g. heads with 2 roles are less polysemantic than heads with 10). If we can understand the multiple roles of a polysemantic head, it still might be worth trying to fully understand it in the style of McDougall et al.

Discovering Plausibly Monosemantic Heads

While our technique doesn’t prove that heads are monosemantic, we are excited that our SAEs might narrow down the search space for monosemantic heads, and reveal new, messier behaviors that have been historically harder to understand in comparison to cleaner algorithmic tasks. 

For the monosemantic candidates, we see a range of different behaviors:

For example, consider H8.1. For all top 10 features, the top DFA suggests that the head is attending to a different preposition (including ‘with’, ‘from’, ‘for’, ‘to’), and the top logits boost that same preposition. This suggests that this might be a monosemantic “Preposition mover” head that specifically looks for opportunities to copy prepositions in the context.

As shown in the table below, we also found instances where a head was almost monosemantic or plausibly bisemantic.

Head Type | Fraction of Heads
Plausibly monosemantic: All top 10 features were deemed conceptually related by our annotator. | 9.7% (14/144)
Plausibly monosemantic, with minor exceptions: All features were deemed conceptually related with just one or two exceptions. | 5.5% (8/144)
Plausibly bisemantic: All features were clearly in only two conceptual categories. | 2.7% (4/144)

Case Study: Long Prefix Induction Head

To further verify that our SAE features are teaching us something real about the roles of heads, we show that we can distinguish meaningfully different roles for two induction heads in L5 of GPT-2 small, shedding some light on why there are so many induction heads in language models. We find that one head seems to be specializing in “long prefix induction”, while one head mostly does “standard induction”.

Notably, this hypothesis was motivated by our SAEs, as glancing at the top features attributed to 5.1 shows “long induction” features, defined as features that activate on examples of induction with at least two repeated prefix matches (eg 2-prefix induction: ABC … AB -> C). We spot these by comparing tokens at (and before) each top feature activation with the tokens preceding the corresponding top DFA[2] source position. While previous work from Goldowsky-Dill et al found similar “long induction” behavior in a 2-layer model, we (Connor and Rob) were not aware of this during our investigations, showing that our SAEs can teach us novel insights about attention heads.
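The comparison described above, matching the tokens at and before the activation against the tokens preceding the attended source position, can be written as a small helper. This is a minimal sketch under assumed conventions: `dest` is the position where the feature activates, `src` is the position being attended to (the token that induction would copy), and the function name is hypothetical.

```python
def prefix_match_length(tokens, dest, src):
    """Count matching tokens walking backwards from `dest` and from `src - 1`.

    Returns 0 if tokens[dest] != tokens[src - 1] (not induction at all),
    1 for standard induction (AB ... A -> B), and n >= 2 for n-prefix
    "long induction" (e.g. ABC ... AB -> C gives 2).
    """
    n = 0
    while (
        src - 1 - n >= 0
        and dest - n >= 0
        and tokens[dest - n] == tokens[src - 1 - n]
    ):
        n += 1
    return n


# ABC ... AB -> C: the feature fires on the final "B" and attends to "C".
long_example = ["A", "B", "C", "x", "y", "A", "B"]
print(prefix_match_length(long_example, dest=6, src=2))

# AB ... A -> B: only the current token matches.
standard_example = ["A", "B", "x", "y", "A"]
print(prefix_match_length(standard_example, dest=4, src=1))
```

Applying the helper to each top activation and its top DFA source position separates long prefix examples (length 2 or more) from standard ones (length 1).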

As an illustrative example, we compare two “‘-’ is next by induction” features attributed to heads 5.1 and 5.5 respectively. Notice that all of the top examples for 5.1’s feature are examples of long prefix induction, while almost all of the examples in 5.5’s feature are standard (AB ... A -> B) induction. For example, comparing the top DFA to the feature activation for 5.1’s top example shows a 4-prefix match (.| ”|Stop| victim), while 5.5’s top example is 1-prefix ( center).

L5H1 F19293: ‘-’ is next by long induction. Notice the same 4-token prefix on the left (activations) and right (DFA) examples.
L5H5 F35744: ‘-’ is next by induction. Notice that for most examples there is no common prefix other than the preceding token.

To confirm that this isn’t just an artifact of our SAEs, we reinforce this hypothesis with independent lines of evidence. We first generate synthetic induction datasets with random repeated tokens of varying prefix lengths. We confirm that while both induction scores rise as we increase prefix length, 5.1 has a much more dramatic phase change as we transition to long prefixes (i.e. >= 2):
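One way to build such a synthetic example with an exact prefix length is sketched below. The layout (random context, prefix, target, random filler, repeated prefix), the parameter names, and the choice to sample all tokens without replacement are assumptions made for illustration, not a description of the authors' dataset. Sampling distinct token ids guarantees that the match is exactly `prefix_len` tokens long; 50257 is the GPT-2 vocabulary size.

```python
import random


def make_induction_example(prefix_len, pre_len=5, filler_len=20,
                           vocab_size=50257, rng=None):
    """Build token ids: pre + prefix + [target] + filler + prefix.

    Returns (tokens, target, target_pos). A head doing induction at the
    final position should attend to target_pos and predict `target`.
    """
    if min(prefix_len, pre_len, filler_len) < 1:
        raise ValueError("prefix_len, pre_len and filler_len must be >= 1")
    rng = rng or random.Random()
    n_needed = pre_len + prefix_len + 1 + filler_len
    # Distinct ids, so no accidental repeats lengthen the matching prefix.
    ids = rng.sample(range(vocab_size), n_needed)
    pre = ids[:pre_len]
    prefix = ids[pre_len:pre_len + prefix_len]
    target = ids[pre_len + prefix_len]
    filler = ids[pre_len + prefix_len + 1:]
    tokens = pre + prefix + [target] + filler + prefix
    target_pos = pre_len + prefix_len
    return tokens, target, target_pos


rng = random.Random(0)
for n in (1, 2, 4):
    tokens, target, target_pos = make_induction_example(n, rng=rng)
    print(n, len(tokens), target_pos)
```

An induction score for a head could then be read off as its attention from the final position to `target_pos`, averaged over many such examples per prefix length.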



We also check each head’s average direct logit attribution (DLA) to the correct next token as a function of prefix length. We again see that 5.1’s DLA skyrockets as we enter the long prefix regime, while 5.5’s DLA remains relatively constant:
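For readers unfamiliar with DLA, the quantity for a single head at a single position can be sketched as the head's contribution to the residual stream projected onto the unembedding direction of the correct token. The sketch below is a simplification under stated assumptions: the weight shapes follow a common convention (W_O of shape d_head by d_model for one head, W_U of shape d_model by d_vocab), the final LayerNorm scaling and all biases are ignored, and the random arrays are stand-ins rather than GPT-2 weights.

```python
import numpy as np


def head_direct_logit_attribution(z_head, W_O_head, W_U, token_id):
    """Logit contribution of one head's output to `token_id`.

    z_head:   (d_head,)          mixed values for this head at one position
    W_O_head: (d_head, d_model)  this head's slice of the output projection
    W_U:      (d_model, d_vocab) unembedding matrix
    Simplification: final LayerNorm scaling and biases are omitted.
    """
    head_out = z_head @ W_O_head  # the head's write to the residual stream
    return float(head_out @ W_U[:, token_id])


# Stand-in weights with small dimensions (NOT real model weights).
rng = np.random.default_rng(0)
d_head, d_model, d_vocab = 4, 8, 16
z = rng.normal(size=d_head)
W_O = rng.normal(size=(d_head, d_model))
W_U = rng.normal(size=(d_model, d_vocab))
print(head_direct_logit_attribution(z, W_O, W_U, token_id=3))
```

Averaging this value over the examples at each prefix length gives one curve per head.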


We now check that these results hold on a random sample of the training distribution. We first filter for examples where the heads are attending non-trivially to some token[3] (i.e. not just attending to BOS), and check how often these are examples of n-prefix induction. We find that 5.1 will mostly attend to tokens in long prefix induction, while 5.5 is mostly doing normal 1-prefix induction.


We intervene on the long induction examples from the training distribution, corrupting them so that only a one-token prefix matches, and show that 5.1’s average induction score plummets from ~0.55 to ~0.05, while 5.5 still maintains an induction score of ~0.45.


Finally, we see hints of universality: checking average induction scores on our synthetic induction dataset for a larger model, GPT-2 Medium, reveals signs of both “long prefix” and “standard” induction heads. 


Limitations

Appendix: Attention Heads Feature Map

We recorded the groupings for all heads in this Google Sheet based on the corresponding attention head feature dashboards. We thank Callum McDougall for providing the visualization codebase on top of which these dashboards were constructed. The code for generating the dashboards is available by messaging either of the first two authors.

Citing this work

Feel free to use the citation from the first post [LW · GW], or this citation specifically for this current post:

@misc{gpt2_attention_saes_3,
  author= {Robert Krzyzanowski and Connor Kissane and Arthur Conmy and Neel Nanda},
  url = {https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don},
  year = {2024},
  howpublished = {Alignment Forum},
  title = {We Inspected Every Head in GPT-2 Small Using SAEs So You Don’t Have To}
}

Author Contributions Statement

Connor and Rob were core contributors on this project. Rob performed high-level grouping analysis of every attention head in GPT-2 Small and some corresponding shallow dives. Connor performed the long prefix induction deep dive and the H10.2 polysemanticity experiment. Arthur and Neel gave guidance and feedback throughout the project. The original project idea was suggested by Neel.

We would like to thank Georg Lange and Joseph Bloom for extremely helpful criticism about our claims on polysemanticity in an earlier draft of this work.

[1] For other heads, we also used proxies to detect examples in OpenWebText with the hypothesized head behaviors, but had messier results. Our initial hypotheses were often too broad (eg "succession" vs "succession for integers"), which led to false negatives. Synthetic data was helpful to filter these.

[2] Note that DFA is attention weighted, so you can think of it as similar to an attention pattern.

[3] We show a threshold of 0.3. The results generally hold for a range of thresholds.
