Positional kernels of attention heads
post by Alex Gibson · 2025-03-03
Introduction:
When working with attention heads in the later layers of transformer models, there is often an implicit assumption that models handle position in a similar manner to the first layer. That is, attention heads can have a positional decay, attend uniformly, attend to the previous token, or take on any manner of different positional kernels of a similar form.
By staring at attention heads in later layers for long enough, this assumption seems reasonable. That is, even in the later layers of a model like gpt2-small, attention heads seem to fall into categories like "attending uniformly to positions across the sequence", or "attending just to the previous 10 positions", etc.
This positional kernel, we assume, lies in the background of the content-dependent operations attention heads may perform. For instance, you can equally well imagine an induction head which attends uniformly to occurrences of "a b ... a" across the sequence, and an induction head which only attends to "a b ... a" occurrences over the previous 20 tokens.
It seems useful to make this assumption concrete, because if such a positional kernel exists, then finding it would be a great starting point for narrowing down hypotheses about the function of an attention head. It would also allow us to see gaps in our implicit assumptions about heads. For example, you might assume an induction head attends uniformly by default, but the positional kernel could tell you otherwise.
It turns out if we make an additivity assumption on positional embeddings in later layers, we can define a reasonable notion of a positional pattern, and this positional pattern seems to match empirically observed attention patterns in models like "TinyStories" and "GPT2-small". This assumption doesn't work for RoPE models, unfortunately.
The assumption we make is that the model encodes the "same content" at different positions identically, up to a static layer-specific positional embedding. For example, if the bigram "Hot Dog" appears at positions $i$ and $j$, we expect the embeddings at $i$ and $j$, once the static positional embeddings are subtracted off, to have similar directions, rather than depending on position in a non-additive way (such as through position-dependent rotation).
This feels like a reasonable assumption to make. We don't test it here, but we instead go from this assumption to a definition of a positional kernel/pattern. We find that it's easy to compute an estimate for this positional kernel just by sampling about 500 texts from OpenWebText and taking an average.
Then we look at the positional kernels observed in practice, and discuss ways models could use different positional kernels in different situations. We argue that these positional kernels can often be adequately summarised by a single-variable summary statistic, and then produce a heatmap showing the kinds of positional kernels across all layers and heads of models like gpt2-small and TinyStories. We conclude by speculating on the roles of different layers of these models, informed by the heatmap.
Attention decomposition:
We write the post-ln1 embedding of the input at position $i$ in layer $\ell$ as:
$$x_i^{\ell} = p_i^{\ell} + c_i^{\ell}, \qquad p_i^{\ell} := \mathbb{E}[x_i^{\ell}]$$
Where $p_i^{\ell}$ is a static positional embedding, and $c_i^{\ell} := x_i^{\ell} - p_i^{\ell}$ is the content-dependent remainder. The expected value is taken over a large text distribution, such as OpenWebText.
This decomposition is always possible, but without the above assumption on translation-invariance of content representation, it is not intuitive to interpret.
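As a concrete illustration of the decomposition, here is a minimal numpy sketch using random stand-in data (not the post's actual pipeline); `emb` plays the role of a batch of post-ln1 embeddings sampled from a text distribution:

```python
import numpy as np

# Toy stand-in: emb[s, i] is the post-ln1 embedding at position i for text sample s.
rng = np.random.default_rng(0)
n_samples, n_positions, d_model = 500, 128, 16
emb = rng.normal(size=(n_samples, n_positions, d_model))

p = emb.mean(axis=0)   # static positional embedding: p_i estimates E[x_i] over the distribution
c = emb - p            # content-dependent remainder: c_i = x_i - p_i, per sample
```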
Decomposition of Attention Scores:
For a particular attention head $h$, consider an input sequence $t_0, \dots, t_N$, where $N$ is the current destination position. For any position $i \le N$, the attention score $A^h_{N,i}$ measures the weight that position $N$ places on position $i$:
$$A^h_{N,i} = x_N^{\top} Q^{\top} K\, x_i = x_N^{\top} Q^{\top} K\, (p_i + c_i)$$
Where concatenation of letters denotes matrix multiplication, and $Q$, $K$ are the query and key matrices of head $h$.
Decomposition of Attention Probabilities:
For conciseness, we often refer to $x_i^{\ell}$ simply as $x_i$, not to be confused with the token embedding of $t_i$.
The exponentiated attention score decomposes into two independent components:
$$\exp(A^h_{N,i}) = \exp(x_N^{\top} Q^{\top} K\, p_i) \cdot \exp(x_N^{\top} Q^{\top} K\, c_i)$$
Where:
$\exp(x_N^{\top} Q^{\top} K\, p_i)$ depends solely on the input embedding of the current position $N$ and the position $i$, and
$\exp(x_N^{\top} Q^{\top} K\, c_i)$ depends on the input embedding at position $N$ and the content at position $i$.
Positional Patterns:
We define the positional pattern as:
$$P^h_N(i) = \frac{\exp(x_N^{\top} Q^{\top} K\, p_i)}{\sum_{j \le N} \exp(x_N^{\top} Q^{\top} K\, p_j)}$$
This represents how much attention each position receives based solely on its position, independent of content.
The final softmax probabilities are:
$$\mathrm{softmax}_i(A^h_{N,:}) = \frac{P^h_N(i)\, \exp(x_N^{\top} Q^{\top} K\, c_i)}{\sum_{j \le N} P^h_N(j)\, \exp(x_N^{\top} Q^{\top} K\, c_j)}$$
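As a sanity check on this decomposition, here is a small self-contained numpy sketch (random stand-in weights and embeddings, so the names are illustrative) verifying that the positional pattern, reweighted by the content factor and renormalized, reproduces the ordinary softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, N = 8, 4, 12

Q = rng.normal(size=(d_model, d_head))     # stand-in query matrix of the head
K = rng.normal(size=(d_model, d_head))     # stand-in key matrix of the head
p = rng.normal(size=(N + 1, d_model))      # static positional embeddings p_i
c = rng.normal(size=(N + 1, d_model))      # content-dependent remainders c_i
x = p + c                                  # full embeddings x_i = p_i + c_i
x_N = x[N]                                 # embedding at the destination position N

scores = (x_N @ Q) @ (K.T @ x.T)           # full attention scores A_{N,i}
pos_scores = (x_N @ Q) @ (K.T @ p.T)       # positional component
cont_scores = (x_N @ Q) @ (K.T @ c.T)      # content component

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

P = softmax(pos_scores)                    # positional pattern P_N(i)

# Reweight the positional pattern by the content factor and renormalize.
probs = P * np.exp(cont_scores - cont_scores.max())
probs /= probs.sum()

assert np.allclose(probs, softmax(scores))
```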
Computing positional patterns:
We compute $p_i^{\ell}$ for each position $i$ by sampling from OpenWebText and averaging post-ln1 embeddings. About 500 samples are sufficient for stable patterns.
While positional patterns depend on $x_N$, they remain largely consistent in practice, except when attention heads attend to the <end-of-text> token, causing emphasis on the first sequence token. We handle this in practice by omitting the first couple of sequence positions from our softmax. This is a bit hacky, but the resulting kernels seem to be more consistent and to accurately represent the nature of the attention heads.
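Here is a rough sketch of that estimation, assuming TransformerLens and a list `texts` of roughly 500 OpenWebText documents; the hook names, context length, and filtering are my assumptions rather than code taken from the linked Colab:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, N_CTX = 0, 512
hook_name = f"blocks.{LAYER}.ln1.hook_normalized"   # post-ln1 embeddings for this layer

running_sum, count = None, 0
with torch.no_grad():
    for text in texts:  # texts: ~500 OpenWebText documents, loaded elsewhere
        tokens = model.to_tokens(text)[:, :N_CTX]
        if tokens.shape[1] < N_CTX:
            continue    # keep only full-length samples so every position gets averaged
        _, cache = model.run_with_cache(tokens, names_filter=hook_name)
        acts = cache[hook_name][0]                  # [N_CTX, d_model]
        running_sum = acts if running_sum is None else running_sum + acts
        count += 1

p = running_sum / count   # estimated static positional embeddings p_i for this layer
```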
Visualization of Positional Patterns:
We identify three common types of positional patterns, shown below. The x-axis represents the key position, and the y-axis shows the query position. We take $x_N$ to be the embedding at the $N$th position of a chapter from the Bible. As you can see, the patterns are quite consistent across different values of $N$, and for the local positional pattern, you can observe the translation equivariance discussed earlier. Similar equivariance emerges for the slowly decaying positional pattern, but the context window required to demonstrate this is too large to show here.
Positional kernels of the first layer:
The observed translation equivariance and weak dependence on $x_N$ make it reasonable to talk about the positional kernel of an attention head, rather than a kernel that is a function of the embedding at position $N$.
Already there are interesting things we can learn from the positional kernels. For instance, in the IOI circuit work and in subsequent work, Head 0.1, Head 0.10, and Head 0.5 were all identified as duplicate token heads. However, the positional kernels make it clear that Head 0.1 and Head 0.10 will attend most to duplicates occurring locally, whereas Head 0.5 will attend to duplicates close to uniformly across the sequence.
Head 0.1 and Head 0.10 were the duplicate token heads most active in the IOI circuit, suggesting these heads are used for more grammatical tasks requiring local attention, whereas Head 0.5 is perhaps used for detecting repeated tokens far back in the sequence, for instance to feed later induction heads.
I show just the first-layer positional kernels not because later layers are particularly different, but because there are too many layers to show them all; the later layers' positional kernels all fall into these same basic categories.
Uses of different positional kernels:
Local positional pattern: Ideal for detecting n-grams in early layers. The equivariant pattern ensures n-grams obtain consistent representations regardless of position. Strong positional decay prevents interference from irrelevant parts of the sequence. Generally useful for "gluing together" adjacent position representations.
Slowly decaying positional pattern: Useful for producing local context summaries by averaging over the sequence. Since there are exponentially many possible sequences within a context window, these heads likely produce linear summaries rather than distinguishing specific sequences. Of course, these heads can also be used for other tasks, as with Head 0.1.
Uniform positional pattern: Used by heads that summarize representations across the entire sequence without positional bias, such as duplicate token heads or induction heads. Also useful for global context processing.
Ruling out superposition:
It's often hypothesised that attention heads within the same layer may be working in superposition with each other. If attention heads have dramatically different positional kernels, it seems we can immediately rule out superposition between these heads. It doesn't feel coherent to talk about superposition between an attention head attending locally and an attention head attending globally.
Contextual attention heads:
We now examine a common property of certain heads with slow positional decay/uniform positional kernels: the stability of softmax denominators within a fixed context.
Approximation of softmax denominator:
For conciseness, we refer to $x_i^{\ell}$ as $x_i$, not to be confused with the token embedding of $t_i$.
From the softmax probability formula:
$$\mathrm{softmax}_i(A^h_{N,:}) = \frac{\exp(x_N^{\top} Q^{\top} K\, x_i)}{\sum_{j \le N} \exp(x_N^{\top} Q^{\top} K\, x_j)}$$
This is difficult to analyze because the denominator involves the entire previous sequence.
However, within a fixed context, we can model the sequence as drawn from i.i.d. representations according to some distribution. While nearby representations will correlate, distant representations should have low correlation within a fixed context.
This is a key place where we make use of the assumption that the $x_i$ terms are translation-invariant up to a static positional embedding. Without this assumption, the representations drawn from a fixed context can't be modeled as identically distributed across different positions.
Under these assumptions:
$$\mathrm{Var}\left[\sum_{i \le N} P^h_N(i)\, \exp(x_N^{\top} Q^{\top} K\, c_i)\right] = \left(\sum_{i \le N} P^h_N(i)^2\right) \cdot \mathrm{Var}\left[\exp(x_N^{\top} Q^{\top} K\, c)\right]$$
Two key factors determine this variance:
$\sum_{i \le N} P^h_N(i)^2$: Measures the spread of the positional pattern. For uniform attention across $k$ tokens, this equals $1/k$.
$\mathrm{Var}\left[\exp(x_N^{\top} Q^{\top} K\, c)\right]$: Quantifies variation in the content-dependent component.
Heads with slow decay / global positional patterns have small values of $\sum_{i} P^h_N(i)^2$ because they are spread out. If they also have low content-dependent variance within a context, the softmax denominator will have low variance.
This means the softmax denominator will concentrate around its expected value, effectively becoming a context-dependent constant.
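A quick numerical illustration of this concentration effect, using i.i.d. Gaussian stand-ins for the content scores rather than measurements from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 512, 2000

def rel_std_of_denominator(P, sigma=1.0):
    """Relative std of D = sum_i P(i) * exp(g_i), with g_i i.i.d. N(0, sigma^2)
    standing in for the content scores within a fixed context."""
    g = rng.normal(scale=sigma, size=(trials, N))
    D = (P * np.exp(g)).sum(axis=1)
    return D.std() / D.mean()

uniform = np.full(N, 1.0 / N)            # broad pattern: sum of squares = 1/N
local = np.zeros(N)
local[-8:] = 1.0 / 8                     # local pattern: sum of squares = 1/8

print(rel_std_of_denominator(uniform))   # small: denominator is nearly constant
print(rel_std_of_denominator(local))     # much larger relative variation
```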
If an attention head has low content-dependent variance across almost all contexts and values of $N$, and a broad positional pattern (as measured by $\sum_{i} P^h_N(i)^2$), we call it a "contextual attention head." Some have very small content-dependent components, appearing visually like fixed positional kernels averaging over previous positions, meaning that we can drop the context-dependent factor. Others are less well behaved, for instance weighing keywords above stopwords, while still preserving the overall low content-dependent variance.
These contextual attention heads can be interpreted as taking a linear summary over previous positions, with potentially a context-dependent scaling factor. I'm unclear on what precisely to take context-dependent to mean. The key point is that models are able to reduce an exponential state space by performing a linear summary, and there should be few enough "contexts" that the model can handle them on a case by case basis. Each "context" refers to an exponential number of input sequences.
Contextual attention heads in the same layer that have similar positional kernels are natural candidates for attention head superposition. Within each fixed context, each of these heads is effectively computing a positionally weighted, weakly content-modulated, linear summary of the text. We can combine these linear summaries across contextual attention heads with similar positional kernels to form a large "contextual circuit."
Metric for spread of positional pattern:
The above analysis naturally suggests $\sum_{i} P^h_N(i)^2$ as a metric for the spread of positional patterns. Although, as previously mentioned, later-layer attention heads often turn themselves off by attending to <end-of-text>, so we should exclude the first few positions and take the softmax over the remaining positions for this metric.
We define the Effective Token Count (ETC) of a positional kernel $P$ to be $1 / \sum_{i} P(i)^2$. If the positional kernel attends uniformly to $k$ tokens, it will have an ETC of $k$, giving us a natural interpretation of this definition.
Now we expect local positional patterns to have a low ETC, as they attend to just the previous few tokens. A slow positional decay will have a higher ETC, and a uniform positional kernel will have an ETC close to the full sequence length.
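A minimal helper for this statistic; the example kernels below are synthetic, chosen only to illustrate the two extremes:

```python
import numpy as np

def effective_token_count(P):
    """ETC of a positional kernel P, given as a probability vector over key positions."""
    P = np.asarray(P)
    return 1.0 / np.sum(P ** 2)

N = 512
uniform = np.full(N, 1.0 / N)
local = np.exp(-np.arange(N)[::-1] / 5.0)   # sharp decay away from the final position
local /= local.sum()

print(effective_token_count(uniform))       # ~512: uniform over all N tokens
print(effective_token_count(local))         # ~10: only the last few positions matter
```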
Language models aren't very creative with their positional kernels, so the ETC gives a good summary of the type of positional kernel at an attention head.
Reducing the spread to a single summary statistic allows us to produce a single graph giving an idea for the positional kernels across all heads and layers of a single language model.
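Here is a sketch of how such a heatmap could be produced, assuming `etc` is an `[n_layers, n_heads]` array of Effective Token Counts computed as above; the `plasma` colormap roughly matches the palette described below, but the exact plotting choices are my assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values; replace with the real per-head ETCs for the model.
etc = np.random.default_rng(0).uniform(1.0, 512.0, size=(12, 12))

log_etc = np.log10(etc)                   # log scale separates local / decaying / uniform regimes
fig, ax = plt.subplots()
im = ax.imshow(log_etc.T, aspect="auto", cmap="plasma")   # columns = layers, rows = heads
ax.set_xlabel("Layer")
ax.set_ylabel("Head")
fig.colorbar(im, ax=ax, label="log10(ETC)")
plt.show()
```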
Below is the heatmap of the ETC across all layers and heads of GPT2-Small and TinyStories Instruct 33M, for a fixed value of $N$. I found it best to plot the ETC on a log scale for visualization purposes. Orange-Yellow corresponds to uniform positional patterns, Pink-Purple corresponds to slow positional decay, and Blue corresponds to local positional patterns.
For reference, compare the first column of the GPT2-Small heatmap with the plots of the positional patterns above. Heads 3, 4, and 7 are in blue because they have local positional patterns. Heads 0, 1, 2, 6, 8, 9, and 10 are in magenta as they have a slow positional decay. And Heads 5 and 11 are yellow because they are close to uniform.
We can visually observe interesting things about the layers of GPT2-Small. For instance, notice the density of local positional patterns in the third and fourth layers. Potentially this is from the model extracting initial local grammatical structure from the text.
On the other hand, the second layer has more slow positional decay / uniform positional patterns. In fact, on closer inspection, the second layer has lots of attention heads which act purely as fixed positional kernels, falling into the category of "contextual attention heads" discussed earlier. This suggests the model initially builds a linear summary of the surrounding text, and then begins to build more symbolic representations in the third and fourth layer.
Layer 5 is known for having lots of induction heads. Heads 5.0, 5.1, and 5.5 are known to be induction heads. These stick out visually as having close to uniform positional patterns, which validates our intuition that induction heads tend not to care about position.
In general, heads with close to uniform positional patterns seem like good places to search for heads with "interesting" functional behaviour. It'd be interesting to investigate what role Heads 2.1 and 3.0 perform, for instance.
Conclusion:
It seems like positional kernels are a useful notion to look at when first assessing attention heads, and they suggest many different lines of inquiry. One interesting piece of future work could be looking at how these positional kernels develop over the course of training.
However, the assumption made at the start of the post has not been validated, and it'd be important to look at this in future work.
This Google Colab contains the code required to reproduce the results found here.