How Do Induction Heads Actually Work in Transformers With Finite Capacity?

post by Fabien Roger (Fabien) · 2023-03-23T09:09:33.002Z

Contents

  How Are Induction Heads Supposed to Work
    The usual explanation
    Fine-grained Hypothesis 1
    Fine-grained Hypothesis 2
  Experiments
    Setup
    Result 1: Induction Works Much Better for Some Tokens Than for Others
    Result 2: Induction Makes Tokens Near to B More Likely
  Appendix
    Spread of Induction Strengths
    Top and Bottom 10 Tokens for A and B Powers

Thanks to Marius Hobbhahn for helpful discussions and feedback on drafts.

I think the usual explanation of how induction heads work in actual Transformers, as described in Anthropic’s Mathematical Framework, is incomplete. Concretely, it seems like the following two statements are in contention:

The main reason is that there isn’t enough capacity to copy “A” to the second position and “B” to the last position without very lossy compression.

In this post:

I would appreciate it if someone could find better fine-grained hypotheses. I think that could lead to a better understanding of how Transformers deal with capacity constraints.

Further experiments about real induction behavior could be nice exercises for people starting to do Transformer interpretability.

How Are Induction Heads Supposed to Work

The usual explanation

Let’s say you have a sequence AB…A. The previous-token head copies the A at position 1 to position 2. This copied A token is then used as the key of an attention head, and it matches the query computed from the A at the last position. B at position 2 is used as the value, and therefore gets copied to the last position, enabling the network to predict B.
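
To make this concrete, here is a toy numpy illustration of the story above (my own sketch, not the trained model's weights): tokens are one-hot vectors, the previous-token head is simulated by shifting the residual stream by one position, and the induction head's QK circuit matches the current token against each position's previous token.

```python
import numpy as np

# Toy illustration: one-hot embeddings, no compression.
vocab = 10
E = np.eye(vocab)

seq = [3, 7, 1, 5, 3]   # A B ... A, with A=3 and B=7
resid = E[seq]          # residual stream after embedding, shape (position, vocab)

# Previous-token head: write each position's previous token into its residual.
prev = np.roll(resid, 1, axis=0)
prev[0] = 0

# Induction head QK circuit: queries read the current token, keys read the
# previous-token subspace, so position t attends to positions whose
# previous token equals token t.
scores = resid @ prev.T
print(scores[-1])  # [0. 1. 0. 0. 0.]: the last A attends to B's position
                   # (index 1), whose value the OV circuit copies forward.
```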

Fine-grained Hypothesis 1

Here is a detailed explanation of how induction heads work, which tracks capacity constraints:

What is weird in this detailed hypothesis:

Note: token A can be compressed heavily without induction heads failing too badly; this even helps with “fuzzy matching”.

Fine-grained Hypothesis 2

Here is another detailed explanation:

This too requires extreme and efficient compression:

Experiments

Setup

I use sequences of the form …AB…A, where A and B are random tokens and each “…” is a sequence of 10 random tokens. I compare the logprobs at the last position of these “induction sequences” with the logprobs of the same sequences in which all tokens except the last one have been shuffled (so that B is almost never right after A). If there are induction heads that work regardless of context and tokens, then B should be more probable after an “induction” sequence than after a “shuffled” sequence.
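
As a minimal sketch of this setup (the helper name `make_sequences` and the token-drawing details are mine, not from the post's linked code):

```python
import torch

def make_sequences(vocab_size: int, seed: int = 0):
    """Build one induction sequence ...AB...A and its shuffled control."""
    g = torch.Generator().manual_seed(seed)
    filler1 = torch.randint(vocab_size, (10,), generator=g)  # first "..."
    filler2 = torch.randint(vocab_size, (10,), generator=g)  # second "..."
    a, b = torch.randint(vocab_size, (2,), generator=g)
    induction = torch.cat([filler1, a[None], b[None], filler2, a[None]])

    # Shuffle everything except the last token, so that B almost never
    # directly follows A in the control sequence.
    perm = torch.randperm(len(induction) - 1, generator=g)
    shuffled = torch.cat([induction[:-1][perm], induction[-1:]])
    return induction, shuffled, a.item(), b.item()
```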

The metric I use is the difference of the log odds of B before and after shuffling:

$$\text{induction strength} = \log \frac{p_{\text{induction}}(B)}{1 - p_{\text{induction}}(B)} - \log \frac{p_{\text{shuffled}}(B)}{1 - p_{\text{shuffled}}(B)}$$

I call this the induction strength. Bigger induction strength means the network predicts B more strongly after A when there has been AB in the context.
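
In code, given the model's probability of B at the last position of each sequence, this metric is simply (helper names are mine):

```python
import math

def log_odds(p: float) -> float:
    return math.log(p) - math.log(1.0 - p)

def induction_strength(p_b_induction: float, p_b_shuffled: float) -> float:
    """Difference of the log odds of B before and after shuffling."""
    return log_odds(p_b_induction) - log_odds(p_b_shuffled)
```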

Experiments are done with Neel Nanda’s 2-layer attention-only model, using the TransformerLens library. The code can be found here.
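
Below is a hedged sketch of how these pieces could fit together with TransformerLens; "attn-only-2l" is the name under which Neel Nanda's 2-layer attention-only model is registered in the library (check your version's model list), and `make_sequences` / `induction_strength` are the hypothetical helpers sketched above, not the post's actual code.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-2l")

def last_pos_prob(tokens: torch.Tensor, target: int) -> float:
    """Probability the model assigns to `target` at the final position
    (BOS handling is ignored for brevity)."""
    with torch.no_grad():
        logits = model(tokens[None])  # (1, seq_len, d_vocab)
    return logits[0, -1].softmax(-1)[target].item()

ind, shuf, a, b = make_sequences(model.cfg.d_vocab)
strength = induction_strength(last_pos_prob(ind, b), last_pos_prob(shuf, b))
print(strength)  # > 0 if induction makes B more likely after the final A
```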

Result 1: Induction Works Much Better for Some Tokens Than for Others

For a given token, I measure its average induction strength across 256 sequences where it is token A (its “A power”), and its average induction strength across 256 sequences where it is token B (its “B power”). I plot the histogram of A and B powers across the first 10000 tokens.
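
A sketch of how these per-token averages could be computed, reusing the hypothetical helpers above (the slot indices 10, 11, and -1 follow the layout of `make_sequences`):

```python
def token_power(token: int, role: str, n: int = 256) -> float:
    """Average induction strength with `token` fixed as A ("A power")
    or as B ("B power") across n random sequences."""
    total = 0.0
    for seed in range(n):
        ind, _, a, b = make_sequences(model.cfg.d_vocab, seed=seed)
        if role == "A":
            ind[10] = token   # first A slot
            ind[-1] = token   # final A slot
            target = b
        else:
            ind[11] = token   # B slot
            target = token
        # Rebuild the shuffled control so it matches the edited sequence.
        perm = torch.randperm(len(ind) - 1)
        shuf = torch.cat([ind[:-1][perm], ind[-1:]])
        total += induction_strength(last_pos_prob(ind, target),
                                    last_pos_prob(shuf, target))
    return total / n
```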

To understand what this means in terms of log odds, I plot the actual log odds before and after shuffling with “ Fab” as the fixed A token and “ien” as the fixed B token, over 4096 sequences.

Observations and interpretation:

Result 2: Induction Makes Tokens Near to B More Likely

I measured the induction strength as above, but instead of using the log odds of B, I used the log odds of B*, where B* is a token near B in the unembedding space (according to cosine similarity). I also measured it with the log odds of random tokens R (drawing new random tokens for each experiment). I measure induction strengths on 4096 sequences.
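
A sketch of how B* could be picked, using TransformerLens's unembedding matrix `W_U` (shape d_model × d_vocab); the helper name is mine:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_token(b: int, model) -> int:
    """Token closest to b in unembedding space, by cosine similarity."""
    dirs = F.normalize(model.W_U, dim=0)  # normalize each token's column
    sims = dirs[:, b] @ dirs              # cosine similarity of b with all tokens
    sims[b] = -1.0                        # exclude b itself
    return sims.argmax().item()
```

The induction strength of B* is then measured exactly as for B, just with B* as the target token at the last position.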

The induction strengths of tokens near B are roughly halfway between the induction strength of B and the induction strengths of random tokens.

Observations and interpretation:

Appendix

Spread of Induction Strengths

Top and Bottom 10 Tokens for A and B Powers

std is the 1-sigma standard deviation divided by sqrt(number of samples), i.e. the standard error of the mean, computed over 4096 samples.


 
