The Residual Expansion: A Framework for thinking about Transformer Circuits
post by Daniel Tan (dtch1997) · 2024-08-02T11:04:56.347Z · LW · GW · 13 comments
Edit: The math here has turned out to be wrong. See Joseph Miller's reply in the comments. I will revise the main content of this post at some point to reflect this.
This is an informal note describing my current approach for thinking about transformer circuits. I've not spent a lot of time thinking deeply about this but I believe the overall claims here are correct.
Note: A lot of the high-level ideas I include here are not really original, but I haven't seen the specific framing here applied to transformers, and I would like more people to think about this / tell me whether this is obviously flawed in some way.
AMFOTC and its Limitations
A Mathematical Framework for Transformer Circuits (AMFOTC) is one of my favourite mech interp papers ever, and has spawned a very successful subfield of circuit analysis. I really like it because it provides a general framework for how to think about transformers and circuits.
However, there are some notable limitations to this framework:
- It doesn't consider MLP blocks
- It doesn't consider layer-norm
- The analysis stops at 2-layer transformers
  - possibly because it becomes too onerous to write the equations for larger models
  - which in turn suggests that better abstraction / notation is needed
Also, this framework centres on the 'model basis' (attention heads, residual stream) and fails to incorporate other ideas (superposition, SAEs).
So I spent some time thinking about how we might extend this framework and here's what I came up with.
The Residual Expansion
A very old idea in machine learning, dating all the way back to ResNets, is that a sequence of residual operations can be 'expanded out' into a set of feedforward operations
- These feedforward components act like an ensemble, in that they sum together to exactly reconstruct the model's output
To make this concrete, let's consider a 1-layer transformer with attention and MLP blocks.
- Let MLP denote the MLP block and Att denote the attention block.
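- We can write this as: $y = (\text{Id} + \text{MLP})(\text{Id} + \text{Att})\,x$, where $x$ is the token embeddings, $y$ is the output and $\text{Id}$ is the identity
- We can convert product-of-sum into a sum-of-products: $y = (\text{Id} + \text{Att} + \text{MLP} + \text{MLP} \circ \text{Att})\,x$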
- Each of the four terms represents a single (nonlinear) computational path through the model
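A quick numerical check can make the role of nonlinearity in this four-path expansion concrete (compare the edit note at the top of the post). This is only a sketch under stated assumptions: the 2-dimensional weights `W_att` and `W_mlp`, the input `x`, and the helper names are all hypothetical hand-picked stand-ins, and the "attention" block is replaced by a plain linear map.

```python
import numpy as np

# Toy stand-ins, chosen by hand for illustration (not real transformer weights).
W_att = np.array([[0.0, 1.0], [1.0, 0.0]])
W_mlp = np.array([[1.0, -1.0], [0.5, 2.0]])
x = np.array([1.0, -2.0])

def att(v):
    # Linear stand-in for the attention block.
    return W_att @ v

def mlp_linear(v):
    # A purely linear "MLP".
    return W_mlp @ v

def mlp_relu(v):
    # The same linear map followed by a ReLU.
    return np.maximum(W_mlp @ v, 0.0)

def forward(mlp, v):
    # Actual residual computation: attention first, then the MLP
    # reads the updated residual stream.
    h = v + att(v)
    return h + mlp(h)

def four_term_sum(mlp, v):
    # Sum of the four candidate paths: identity, Att, MLP, MLP after Att.
    return v + att(v) + mlp(v) + mlp(att(v))

linear_gap = np.abs(forward(mlp_linear, x) - four_term_sum(mlp_linear, x)).max()
relu_gap = np.abs(forward(mlp_relu, x) - four_term_sum(mlp_relu, x)).max()
print("gap with linear MLP:", linear_gap)
print("gap with ReLU MLP:  ", relu_gap)
```

With the linear stand-in the two computations agree, while with the ReLU they do not, because the MLP applied to a sum is not the sum of the MLP applied to each part.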
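More generally, for an $n$-layer transformer, we can write a big summation of terms. Writing $f_1, \dots, f_{2n}$ for the attention and MLP blocks in the order they are applied:
$$y = \Big[\text{Id}\Big]x + \Big[\sum_{i} f_i\Big]x + \Big[\sum_{i<j} f_j \circ f_i\Big]x + \dots$$
- The first bracket contains degree 0 terms. This is simply the identity, i.e. a no-op on the token embeddings
- The second bracket contains degree 1 terms. This captures the direct effect of each block in isolation
- And so on...
- Generally there will be terms of up to degree $2n$, since there are 2 blocks per layer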
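To get a feel for how many candidate paths there are at each degree, here is a minimal sketch that enumerates them. It assumes two blocks per layer (attention, then MLP), as in the 1-layer example; the function name `enumerate_paths` and the block labels are hypothetical. It only lists the paths; it makes no claim about whether they sum to the model's output.

```python
from itertools import combinations

def enumerate_paths(n_layers):
    # Blocks in the order they are applied: Att0, MLP0, Att1, MLP1, ...
    blocks = []
    for layer in range(n_layers):
        blocks.append(f"Att{layer}")
        blocks.append(f"MLP{layer}")
    # A path of degree k picks k blocks and visits them in forward order.
    # combinations() keeps the input order, so each tuple is a valid path.
    return {
        degree: list(combinations(blocks, degree))
        for degree in range(len(blocks) + 1)
    }

paths = enumerate_paths(2)
for degree, ps in paths.items():
    print(f"degree {degree}: {len(ps)} paths")
```

The number of degree-k paths is a binomial coefficient, so the total grows as 2 to the power of the number of blocks, which is one reason to care about which degrees matter in practice.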
What does this get us?
The residual decomposition gives us a sum of feedforward paths through the model, each of which is nonlinear.
- Maybe these paths are interpretable!
Circuits. A circuit could possibly be represented as a sum of a small number of these feedforward paths.
- An appropriate summation over paths can be "re-factorized" into a graph
- Exercise 1 (for me, and for others): Write the IOI circuit in the residual decomposition
AMFOTC. The residual decomposition is fully compatible with AMFOTC
- AMFOTC simply explains how to decompose a single attention operation into linear (or almost-linear) terms
- The residual decomposition tells you how to build lots of attention operations up into a bigger picture
- Exercise 2 (for me, and for others): Write "bigrams" and "skip-trigrams" using the residual expansion
Individual terms. Generally, many other ideas in interpretability can be thought of as attempts to understand individual terms in the residual decomposition
- E.g. MLP transcoders decompose the MLP block into a sparsely-activating set of paths through SAE features.
- This decomposition can be 'substituted back' into the residual decomposition to yield finer-grained circuits.
- Ditto for attention-out SAEs, bilinear SAEs, etc.
- If we take "SAE-fication" to its logical extreme by replacing everything with SAEs, we get sparse feature circuits
Other remarks
- Layernorm. We've handwaved layernorm here, but it's simple to modify our equations to account for it. Notably, because layernorm is not a residual operation, we don't end up with any more terms than we did originally
Tl;dr I think this is a nice unifying way to think about lots of circuit analysis work.
Open Questions / Ideas
Here are some ideas motivated by this line of thinking.
- Hybrid paths / circuits. As discussed above, if we take 'paths' through the model to be the unit of analysis, we can easily decompose paths (in the model space) further using SAE features. I'm generally very excited to work on more extensive efforts to do circuit analysis using such paths.
- Path attributions. How much does each path contribute to the output? Do we notice patterns?
- Pick a reasonable notion of 'how much does X path contribute to the output' (e.g. attributions, Shapley values, etc)
- Task complexity. The degree of a path (i.e. the number of blocks it uses) could be used as a rough notion of 'task complexity'. Does this align with our intuition about what tasks are more or less complex? E.g. I would expect IOI and Docstrings to be more complex (as they're somewhat algorithmic) than other tasks like gender bias and hypernymy (as they're simply variants of lookup)
- Parallel processing. Lucius Bushnaq has argued to me that different blocks of the model could be implementing different parts of a higher-level operation. E.g. to calculate AND(a,b), two different MLP blocks may be responsible for computing a and b respectively. If so we'd expect them not to interact in any way. Then we could formalise this as
Conclusion
- AMFOTC is great, but doesn't go far enough IMO, and is outdated
- 'Residual expansion' explains how to extend AMFOTC to full transformers and fit it in with SAE-based circuit analysis
I'm very interested to hear takes on this!
13 comments
comment by Buck · 2024-08-05T15:17:00.857Z · LW(p) · GW(p)
You might enjoy this paper about a related idea.
comment by Joseph Miller (Josephm) · 2024-08-02T22:53:35.294Z · LW(p) · GW(p)
If I understand correctly, this residual decomposition is equivalent to the edge / factorized view of a transformer described here.
Update: actually the residual decomposition is incorrect - see my other comment.
↑ comment by Daniel Tan (dtch1997) · 2024-08-05T10:20:51.229Z · LW(p) · GW(p)
I agree, this seems like exactly the same thing, which is great! In hindsight it's not surprising that you / other people have already thought about this
Do you think the 'tree-ified view' (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis?
↑ comment by Joseph Miller (Josephm) · 2024-08-05T18:50:48.453Z · LW(p) · GW(p)
The treeified view is different from the factorized view! See figure 1 here.
I think the factorized view is pretty useful. But on the other hand I think MLP + Attention Head circuits are too coarse-grained to be that interpretable.
↑ comment by Oliver Daniels (oliver-daniels-koch) · 2024-08-07T05:22:34.921Z · LW(p) · GW(p)
Just to make it explicit and check my understanding - the residual decomposition is equivalent to the edge / factorized view of the transformer in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
$x$ = input -> output
$\text{Attn}_{4.3}(\text{MLP}_{2}(\text{Attn}_{1.0}(x)))$ = input -> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG.
↑ comment by Joseph Miller (Josephm) · 2024-08-08T05:07:42.537Z · LW(p) · GW(p)
Actually I think the residual decomposition is incorrect - see my other comment.
comment by Joseph Miller (Josephm) · 2024-08-07T07:50:24.268Z · LW(p) · GW(p)
@Oliver Daniels-Koch's reply to my comment made me read this post again more carefully and now I think that your formulation of the residual expansion is incorrect.
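Given $y = x + \text{Att}(x) + \text{MLP}(x + \text{Att}(x))$, it does not follow that $y = x + \text{Att}(x) + \text{MLP}(x) + \text{MLP}(\text{Att}(x))$, because $\text{MLP}$ is a non-linear operation. It cannot be decomposed like this.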
My understanding of your big summation (with $f_i$ representing any MLP or attention head):
$$y = x + \sum_{i} f_i(x) + \sum_{i<j} f_j(f_i(x)) + \dots$$
again does not hold because the $f_i$s are non-linear.
There are two similar ideas which do hold, namely (1) the treeified / unraveled view and (2) the factorized view (both of which are illustrated in figure 1 here), but your residual expansion / big summation is not equivalent to either.
The treeified / unraveled view is the most similar. It separates each path from input to output, but the difference is that this does not claim that the output is the sum of all separate paths.
The factorized view follows from the treeified view and is just the observation that any point in the residual stream can be decomposed into the outputs of all previous components.
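The factorized-view observation can be checked numerically on a toy residual network. The sketch below is only an illustration under stated assumptions: the blocks are hypothetical random `tanh` layers (not real attention heads or MLPs), and `make_block` is a made-up helper. Each block reads the full residual stream, so the components stay non-linear and interacting, yet the stream is still exactly the input plus the sum of all recorded component outputs.

```python
import numpy as np

def make_block(seed, d):
    # Hypothetical non-linear component: a random linear map followed by tanh.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    return lambda v: np.tanh(W @ v)

d = 8
blocks = [make_block(seed, d) for seed in range(4)]
x = np.random.default_rng(100).normal(size=d)

resid = x.copy()
component_outputs = []
for block in blocks:
    # Each component reads the current residual stream ...
    out = block(resid)
    component_outputs.append(out)
    # ... and writes its output back additively.
    resid = resid + out

# Factorized view: final stream = input + sum of every component's output.
reconstructed = x + np.sum(component_outputs, axis=0)
factorized_gap = np.abs(resid - reconstructed).max()
print("max abs difference:", factorized_gap)
```

Note that this decomposes the stream into component outputs computed on the actual stream, which is different from summing separate input-to-output paths.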
↑ comment by Daniel Tan (dtch1997) · 2024-08-07T08:14:02.049Z · LW(p) · GW(p)
If I understand correctly, you're saying that my expansion is wrong, because $\text{MLP}(x + \text{Att}(x)) \neq \text{MLP}(x) + \text{MLP}(\text{Att}(x))$, which I agree with.
- Then isn't it also true that
- Also, if the output is not a sum of all separate paths, then what's the point of the unraveled view?
↑ comment by Joseph Miller (Josephm) · 2024-08-07T19:19:09.843Z · LW(p) · GW(p)
Yes, $\text{MLP}(x + \text{Att}(x)) \neq \text{MLP}(x) + \text{MLP}(\text{Att}(x))$ is what I'm saying.
- Yes I agree
- (Firstly, note that it can be true without being useful.) In the "Residual Networks Behave Like Ensembles of Relatively Shallow Networks" paper, they discover that long paths are mostly not needed for the model. In Causal Scrubbing they intervene on the treeified view to understand which paths are causally relevant for particular behaviors.
↑ comment by Daniel Tan (dtch1997) · 2024-08-08T08:34:12.400Z · LW(p) · GW(p)
That makes sense to me. I guess I'm dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent; whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could also (in principle) change the value computed by the other path). This seems to be the mechanism that explains redundancy.
comment by mishka · 2024-08-02T11:45:38.277Z · LW(p) · GW(p)
I think this makes sense.
I am not sure how new this approach is (for simplified Transformers, the original AMFOTC paper has several sections called "* Path Expansion *", which seem to do something very similar for a reduced set of transformations, and their formalism of "virtual attention heads" seems also to be in that spirit).
↑ comment by Daniel Tan (dtch1997) · 2024-08-02T13:28:47.794Z · LW(p) · GW(p)
Fair point, and I should amend the post to point out that AMFOTC also does 'path expansion'. However, I think this is still conceptually distinct from AMFOTC because:
- In my reading of AMFOTC, the focus seems to be on understanding attention by separating the QK and OV circuits, writing these as linear (or almost linear) terms, and fleshing this out for 1-2 layer attention-only transformers. This is cool, but also very hard to use at the level of a full model
- Beyond understanding individual attention heads, I am more interested in how the whole model works; IMO this is very unlikely to be simply understood as a sum of linear components. OTOH residual expansion gives a sum of nonlinear components and maybe each of those things is more interpretable.
- I think the notion of path 'degrees' hasn't been explicitly stated before and I found this to be a useful abstraction to think about circuit complexity.
Maybe this post is better framed as 'reconciling AMFOTC with SAE circuit analysis'.
↑ comment by mishka · 2024-08-02T14:26:59.467Z · LW(p) · GW(p)
Yes, I think this makes sense.
Here is one aspect which might be useful to keep in mind.
If we think about all this as some kind of "generalized Taylor expansion", there are some indications that the deviations from linearity might be small.
E.g. there is this rather famous post, https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear.
Another indication pointing to "almost linearity" is that "model merge" works pretty well. Although, interestingly enough, people often prefer to approach "model merge" in a more subtle fashion than just linear interpolation, so, presumably, non-linearity does matter quite a bit as well, e.g. https://huggingface.co/blog/mlabonne/merge-models.