Positional kernels of attention heads
post by Alex Gibson · 2025-03-03
Introduction:
When working with attention heads in the later layers of transformer models, there is often an implicit assumption that models handle position in a similar manner to the first layer. That is, attention heads can have a positional decay, attend uniformly, attend to the previous token, or take on any manner of different positional kernels of a similar form.
By staring at attention heads in later layers for long enough, this assumption seems reasonable. That is, even in the later layers of a model like gpt2-small, attention heads seem to fall into categories like "attending uniformly to positions across the sequence", or "attending just to the previous 10 positions", etc.
This positional kernel, we assume, lies in the background of the content-dependent operations attention heads may perform. For instance, you can equally well imagine an induction head which attends uniformly to occurrences of "a b ... a" across the sequence, and an induction head which only attends to "a b ... a" occurrences over the previous 20 tokens.
It seems useful to make this assumption concrete, because if such a positional kernel exists, then finding it would be a great starting point for narrowing down hypotheses about the function of an attention head. It would also allow us to see gaps in our implicit assumptions about heads. For example, you might assume an induction head attends uniformly by default, but the positional kernel could tell you otherwise.
It turns out if we make an additivity assumption on positional embeddings in later layers, we can define a reasonable notion of a positional pattern, and this positional pattern seems to match empirically observed attention patterns in models like "TinyStories" and "GPT2-small". This assumption doesn't work for RoPE models, unfortunately.
The assumption we make is that the model encodes the "same content" at different positions identically, up to a static layer-specific positional embedding. For example, if the bigram "Hot Dog" appears at positions $i$ and $j$, we expect the embeddings at $i$ and $j$, once the static positional embeddings are subtracted off, to have similar directions, rather than depending on position in a non-additive way (such as through position-dependent rotation).
This feels like a reasonable assumption to make. We don't test it here, but we instead go from this assumption to a definition of a positional kernel/pattern. We find that it's easy to compute an estimate for this positional kernel just by sampling about 500 texts from OpenWebText and taking an average.
Then we look at the positional kernels observed in practice, and discuss ways models could use different positional kernels in different situations. We argue that these positional kernels can often be adequately summarised by a single-variable summary statistic, and then produce a heatmap showing the kinds of positional kernels across all layers and heads of models like gpt2-small and TinyStories. We conclude by speculating on the roles of different layers of these models, informed by the heatmap.
Attention decomposition:
We write the post-ln1 embedding of the input at position $i$ in layer $\ell$ as:
$$x_i^{\ell} = p_i^{\ell} + c_i^{\ell}, \qquad p_i^{\ell} := \mathbb{E}[x_i^{\ell}]$$
Where $p_i^{\ell}$ is a static positional embedding, and $c_i^{\ell} := x_i^{\ell} - p_i^{\ell}$ is the content-dependent remainder. The expected value is taken over a large text distribution, such as OpenWebText.
This decomposition is always possible, but without the above assumption on translation-invariance of content representation, it is not intuitive to interpret.
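As a concrete illustration of the decomposition, here is a minimal numpy sketch using random stand-in data (not the post's actual pipeline); `emb` plays the role of a batch of post-ln1 embeddings sampled from a text distribution:

```python
import numpy as np

# Toy stand-in: emb[s, i] is the post-ln1 embedding at position i for text sample s.
rng = np.random.default_rng(0)
n_samples, n_positions, d_model = 500, 128, 16
emb = rng.normal(size=(n_samples, n_positions, d_model))

p = emb.mean(axis=0)   # static positional embedding: p_i estimates E[x_i] over the distribution
c = emb - p            # content-dependent remainder: c_i = x_i - p_i, per sample
```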
Decomposition of Attention Scores:
For a particular attention head $h$, consider an input sequence $t_0, \dots, t_N$, where $N$ is the current destination position. For any position $i \le N$, the attention score $A^h_{N,i}$ measures the weight that position $N$ places on position $i$:
$$A^h_{N,i} = x_N^{\top} Q^{\top} K\, x_i = x_N^{\top} Q^{\top} K\, (p_i + c_i)$$
Where concatenation of letters denotes matrix multiplication, and $Q$, $K$ are the query and key matrices of head $h$.
Decomposition of Attention Probabilities:
For conciseness, we often refer to $x_i^{\ell}$ simply as $x_i$, not to be confused with the token embedding of $t_i$.
The exponentiated attention score decomposes into two independent components:
$$\exp(A^h_{N,i}) = \exp(x_N^{\top} Q^{\top} K\, p_i) \cdot \exp(x_N^{\top} Q^{\top} K\, c_i)$$
Where:
$\exp(x_N^{\top} Q^{\top} K\, p_i)$ depends solely on the input embedding of the current position $N$ and the position $i$, and
$\exp(x_N^{\top} Q^{\top} K\, c_i)$ depends on the input embedding at position $N$ and the content at position $i$.
Positional Patterns:
We define the positional pattern as:
$$P^h_N(i) = \frac{\exp(x_N^{\top} Q^{\top} K\, p_i)}{\sum_{j \le N} \exp(x_N^{\top} Q^{\top} K\, p_j)}$$
This represents how much attention each position receives based solely on its position, independent of content.
The final softmax probabilities are:
$$\mathrm{softmax}_i(A^h_{N,:}) = \frac{P^h_N(i)\, \exp(x_N^{\top} Q^{\top} K\, c_i)}{\sum_{j \le N} P^h_N(j)\, \exp(x_N^{\top} Q^{\top} K\, c_j)}$$
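As a sanity check on this decomposition, here is a small self-contained numpy sketch (random stand-in weights and embeddings, so the names are illustrative) verifying that the positional pattern, reweighted by the content factor and renormalized, reproduces the ordinary softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, N = 8, 4, 12

Q = rng.normal(size=(d_model, d_head))     # stand-in query matrix of the head
K = rng.normal(size=(d_model, d_head))     # stand-in key matrix of the head
p = rng.normal(size=(N + 1, d_model))      # static positional embeddings p_i
c = rng.normal(size=(N + 1, d_model))      # content-dependent remainders c_i
x = p + c                                  # full embeddings x_i = p_i + c_i
x_N = x[N]                                 # embedding at the destination position N

scores = (x_N @ Q) @ (K.T @ x.T)           # full attention scores A_{N,i}
pos_scores = (x_N @ Q) @ (K.T @ p.T)       # positional component
cont_scores = (x_N @ Q) @ (K.T @ c.T)      # content component

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

P = softmax(pos_scores)                    # positional pattern P_N(i)

# Reweight the positional pattern by the content factor and renormalize.
probs = P * np.exp(cont_scores - cont_scores.max())
probs /= probs.sum()

assert np.allclose(probs, softmax(scores))
```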
Computing positional patterns:
We compute $p_i^{\ell}$ for each position $i$ by sampling from OpenWebText and averaging post-ln1 embeddings. About 500 samples are sufficient for stable patterns.
While positional patterns depend on $x_N$, they remain largely consistent in practice, except when attention heads attend to the <end-of-text> token, causing emphasis on the first sequence token. We handle this in practice by omitting the first couple of sequence positions from our softmax. This is a bit hacky, but the resulting kernels seem to be more consistent and to accurately represent the nature of the attention heads.
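Here is a rough sketch of that estimation, assuming TransformerLens and a list `texts` of roughly 500 OpenWebText documents; the hook names, context length, and filtering are my assumptions rather than code taken from the linked Colab:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, N_CTX = 0, 512
hook_name = f"blocks.{LAYER}.ln1.hook_normalized"   # post-ln1 embeddings for this layer

running_sum, count = None, 0
with torch.no_grad():
    for text in texts:  # texts: ~500 OpenWebText documents, loaded elsewhere
        tokens = model.to_tokens(text)[:, :N_CTX]
        if tokens.shape[1] < N_CTX:
            continue    # keep only full-length samples so every position gets averaged
        _, cache = model.run_with_cache(tokens, names_filter=hook_name)
        acts = cache[hook_name][0]                  # [N_CTX, d_model]
        running_sum = acts if running_sum is None else running_sum + acts
        count += 1

p = running_sum / count   # estimated static positional embeddings p_i for this layer
```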
Visualization of Positional Patterns:
We identify three common types of positional patterns, shown below. The x-axis represents the key position, and the y-axis shows the query position. We take $x_N$ to be the embedding at the $N$th position of a chapter from the Bible. As you can see, the patterns are quite consistent across different values of $N$, and for the local positional pattern, you can observe the translation equivariance discussed earlier. Similar equivariance emerges for the slowly decaying positional pattern, but the context window required to demonstrate this is too large to show here.
Positional kernels of the first layer:
The observed translation equivariance and weak dependence on $x_N$ make it reasonable to talk about the positional kernel of an attention head, rather than a kernel that is a function of the embedding at position $N$.
Already there are interesting things we can learn from the positional kernels. For instance, in the IOI circuit work and in subsequent work, Head 0.1, Head 0.10, and Head 0.5 were all identified as duplicate token heads. However, the positional kernels make it clear that Head 0.1 and Head 0.10 will attend most to duplicates occurring locally, whereas Head 0.5 will attend to duplicates close to uniformly across the sequence.
Head 0.1 and Head 0.10 were the duplicate token heads most active in the IOI circuit, suggesting these heads are used for more grammatical tasks requiring local attention, whereas Head 0.5 is perhaps used for detecting repeated tokens far back in the sequence, for instance to feed later induction heads.
I show just the first-layer positional kernels not because later layers are particularly different, but because there are too many layers to show them all; the later layers' positional kernels all fall into these same basic categories.
Uses of different positional kernels:
Local positional pattern: Ideal for detecting n-grams in early layers. The equivariant pattern ensures n-grams obtain consistent representations regardless of position. Strong positional decay prevents interference from irrelevant parts of the sequence. Generally useful for "gluing together" adjacent position representations.
Slowly decaying positional pattern: Useful for producing local context summaries by averaging over the sequence. Since there are exponentially many possible sequences within a context window, these heads likely produce linear summaries rather than distinguishing specific sequences. Of course, these heads can also be used for other tasks, as with Head 0.1.
Uniform positional pattern: Used by heads that summarize representations across the entire sequence without positional bias, such as duplicate token heads or induction heads. Also useful for global context processing.
Ruling out superposition:
It's often hypothesised that attention heads within the same layer may be working in superposition with each other. If attention heads have dramatically different positional kernels, it seems we can immediately rule out superposition between these heads. It doesn't feel coherent to talk about superposition between an attention head attending locally and an attention head attending globally.
Contextual attention heads:
We now examine a common property of certain heads with slow positional decay/uniform positional kernels: the stability of softmax denominators within a fixed context.
Approximation of softmax denominator:
For conciseness, we refer to $x_i^{\ell}$ as $x_i$, not to be confused with the token embedding of $t_i$.
From the softmax probability formula:
$$\mathrm{softmax}_i(A^h_{N,:}) = \frac{\exp(x_N^{\top} Q^{\top} K\, x_i)}{\sum_{j \le N} \exp(x_N^{\top} Q^{\top} K\, x_j)}$$
This is difficult to analyze because the denominator involves the entire previous sequence.
However, within a fixed context, we can model the sequence as drawn from i.i.d. representations according to some distribution. While nearby representations will correlate, distant representations should have low correlation within a fixed context.
This is a key place where we make use of the assumption that the $x_i$ terms are translation-invariant up to a static positional embedding. Without this assumption, the representations drawn from a fixed context can't be modeled as identically distributed across different positions.
Under these assumptions:
$$\mathrm{Var}\left[\sum_{i \le N} P^h_N(i)\, \exp(x_N^{\top} Q^{\top} K\, c_i)\right] = \left(\sum_{i \le N} P^h_N(i)^2\right) \cdot \mathrm{Var}\left[\exp(x_N^{\top} Q^{\top} K\, c)\right]$$
Two key factors determine this variance:
$\sum_{i \le N} P^h_N(i)^2$: Measures the spread of the positional pattern. For uniform attention across $k$ tokens, this equals $1/k$.
$\mathrm{Var}\left[\exp(x_N^{\top} Q^{\top} K\, c)\right]$: Quantifies variation in the content-dependent component.
Heads with slow decay / global positional patterns have small values of $\sum_{i} P^h_N(i)^2$ because they are spread out. If they also have low content-dependent variance within a context, the softmax denominator will have low variance.
This means the softmax denominator will concentrate around its expected value, effectively becoming a context-dependent constant.
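A quick numerical illustration of this concentration effect, using i.i.d. Gaussian stand-ins for the content scores rather than measurements from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 512, 2000

def rel_std_of_denominator(P, sigma=1.0):
    """Relative std of D = sum_i P(i) * exp(g_i), with g_i i.i.d. N(0, sigma^2)
    standing in for the content scores within a fixed context."""
    g = rng.normal(scale=sigma, size=(trials, N))
    D = (P * np.exp(g)).sum(axis=1)
    return D.std() / D.mean()

uniform = np.full(N, 1.0 / N)            # broad pattern: sum of squares = 1/N
local = np.zeros(N)
local[-8:] = 1.0 / 8                     # local pattern: sum of squares = 1/8

print(rel_std_of_denominator(uniform))   # small: denominator is nearly constant
print(rel_std_of_denominator(local))     # much larger relative variation
```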
If an attention head has low content-dependent variance across almost all contexts and values of $N$, and a broad positional pattern (as measured by $\sum_{i} P^h_N(i)^2$), we call it a "contextual attention head." Some have very small content-dependent components, appearing visually like fixed positional kernels averaging over previous positions, meaning that we can drop the context-dependent factor. Others are less well behaved, for instance weighing keywords above stopwords, while still preserving the overall low content-dependent variance.
These contextual attention heads can be interpreted as taking a linear summary over previous positions, with potentially a context-dependent scaling factor. I'm unclear on what precisely to take context-dependent to mean. The key point is that models are able to reduce an exponential state space by performing a linear summary, and there should be few enough "contexts" that the model can handle them on a case by case basis. Each "context" refers to an exponential number of input sequences.
Contextual attention heads in the same layer that have similar positional kernels are natural candidates for attention head superposition. Within each fixed context, each of these heads is effectively computing a positionally weighted, weakly content-modulated, linear summary of the text. We can combine these linear summaries across contextual attention heads with similar positional kernels to form a large "contextual circuit."
Metric for spread of positional pattern:
The above analysis naturally suggests $\sum_{i} P^h_N(i)^2$ as a metric for the spread of positional patterns. Although, as previously mentioned, later-layer attention heads often turn themselves off by attending to <end-of-text>, so we should exclude the first few positions and take the softmax over the remaining positions for this metric.
We define the Effective Token Count (ETC) of a positional kernel $P$ to be $1 / \sum_{i} P(i)^2$. If the positional kernel attends uniformly to $k$ tokens, it will have an ETC of $k$, giving us a natural interpretation of this definition.
Now we expect local positional patterns to have a low ETC, as they attend to just the previous few tokens. A slow positional decay will have a higher ETC, and a uniform positional kernel will have an ETC close to the full sequence length.
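A minimal helper for this statistic; the example kernels below are synthetic, chosen only to illustrate the two extremes:

```python
import numpy as np

def effective_token_count(P):
    """ETC of a positional kernel P, given as a probability vector over key positions."""
    P = np.asarray(P)
    return 1.0 / np.sum(P ** 2)

N = 512
uniform = np.full(N, 1.0 / N)
local = np.exp(-np.arange(N)[::-1] / 5.0)   # sharp decay away from the final position
local /= local.sum()

print(effective_token_count(uniform))       # ~512: uniform over all N tokens
print(effective_token_count(local))         # ~10: only the last few positions matter
```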
Language models aren't very creative with their positional kernels, so the ETC gives a good summary of the type of positional kernel at an attention head.
Reducing the spread to a single summary statistic allows us to produce a single graph giving an idea for the positional kernels across all heads and layers of a single language model.
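Here is a sketch of how such a heatmap could be produced, assuming `etc` is an `[n_layers, n_heads]` array of Effective Token Counts computed as above; the `plasma` colormap roughly matches the palette described below, but the exact plotting choices are my assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values; replace with the real per-head ETCs for the model.
etc = np.random.default_rng(0).uniform(1.0, 512.0, size=(12, 12))

log_etc = np.log10(etc)                   # log scale separates local / decaying / uniform regimes
fig, ax = plt.subplots()
im = ax.imshow(log_etc.T, aspect="auto", cmap="plasma")   # columns = layers, rows = heads
ax.set_xlabel("Layer")
ax.set_ylabel("Head")
fig.colorbar(im, ax=ax, label="log10(ETC)")
plt.show()
```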
Below is the heatmap of the ETC across all layers and heads of GPT2-Small and TinyStories Instruct 33M, for a fixed value of $N$. I found it best to plot the ETC on a log scale for visualization purposes. Orange-Yellow corresponds to uniform positional patterns, Pink-Purple corresponds to slow positional decay, and Blue corresponds to local positional patterns.
For reference, compare the first column of the GPT2-Small heatmap with the plots of the positional patterns above. Heads 3, 4, and 7 are in blue because they have local positional patterns. Heads 0, 1, 2, 6, 8, 9, and 10 are in magenta as they have a slow positional decay. And Heads 5 and 11 are yellow because they are close to uniform.
We can visually observe interesting things about the layers of GPT2-Small. For instance, notice the density of local positional patterns in the third and fourth layers. Potentially this is from the model extracting initial local grammatical structure from the text.
On the other hand, the second layer has more slow positional decay / uniform positional patterns. In fact, on closer inspection, the second layer has lots of attention heads which act purely as fixed positional kernels, falling into the category of "contextual attention heads" discussed earlier. This suggests the model initially builds a linear summary of the surrounding text, and then begins to build more symbolic representations in the third and fourth layer.
Layer 5 is known for having lots of induction heads. Heads 5.0, 5.1, and 5.5 are known to be induction heads. These stick out visually as having close to uniform positional patterns, which validates our intuition that induction heads tend not to care about position.
In general, heads with close to uniform positional patterns seem like good places to search for heads with "interesting" functional behaviour. It'd be interesting to investigate what role Heads 2.1 and 3.0 perform, for instance.
Conclusion:
It seems like positional kernels are a useful notion to look at when first assessing attention heads, and they suggest many different lines of inquiry. One interesting piece of future work could be looking at how these positional kernels develop over the course of training.
However, the assumption made at the start of the post has not been validated, and it'd be important to look at this in future work.
This Google Colab contains the code required to reproduce the results found here.