The Residual Expansion: A Framework for thinking about Transformer Circuits

post by Daniel Tan (dtch1997) · 2024-08-02T11:04:56.347Z · LW · GW · 13 comments

Contents

  AMFOTC and its Limitations
  The Residual Expansion
  What does this get us? 
  Open Questions / Ideas
  Conclusion
13 comments

Edit: The math here has turned out to be wrong. See Joseph Miller's reply here [LW(p) · GW(p)]. I will revise the main content of this post at some point to reflect this. 

This is an informal note describing my current approach for thinking about transformer circuits. I've not spent a lot of time thinking deeply about this but I believe the overall claims here are correct. 

Note: A lot of the high-level ideas I include here are not really original, but I haven't seen the specific framing here applied to transformers, and I would like more people to think about this / tell me whether this is obviously flawed in some way. 

AMFOTC and its Limitations

A Mathematical Framework is one of my favourite mech interp papers ever, and has spawned a very successful subfield of circuit analysis. I really like it because it provides a general framework for how to think about transformers and circuits. 

However, there are some notable limitations to this framework: 

Also, this framework centres on the 'model basis' (attention heads, residual stream) and fails to incorporate other ideas (superposition, SAEs). 

So I spent some time thinking about how we might extend this framework and here's what I came up with. 

The Residual Expansion

A very old idea in machine learning, dating all the way back to ResNets, is that a sequence of residual operations can be 'expanded out' into a set of feedforward operations.

To make this concrete, let's consider a 1-layer transformer with attention and MLP blocks. 
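As a rough sketch of the 1-layer case, assuming an idealised model without LayerNorm in which each block reads the residual stream and adds its output back. The symbols $x$, $h$, $y$, $\mathrm{Attn}$ and $\mathrm{MLP}$ are notation assumed for this sketch, not recovered from the post's original formulas:

```latex
\begin{align}
h &= x + \mathrm{Attn}(x) \\
y &= h + \mathrm{MLP}(h)
   = x + \mathrm{Attn}(x) + \mathrm{MLP}\bigl(x + \mathrm{Attn}(x)\bigr)
\end{align}
```

Splitting the last term further into $\mathrm{MLP}(x) + \mathrm{MLP}(\mathrm{Attn}(x))$ would require the MLP to be additive, which is the step the edit note at the top of the post flags as wrong.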

More generally, for an n-layer transformer, we can write a big summation of terms:

What does this get us? 

The residual decomposition gives us a sum of feedforward paths through the model, each of which is nonlinear. 

Circuits. A circuit could possibly be represented as a sum of a small number of these feedforward paths. 

AMFOTC. The residual decomposition is fully compatible with AMFOTC.

Individual terms. Generally, many other ideas in interpretability can be thought of as attempts to understand individual terms in the residual decomposition.

Other remarks

Tl;dr I think this is a nice unifying way to think about lots of circuit analysis work. 

Open Questions / Ideas

Here are some ideas motivated by this line of thinking. 

Conclusion

I'm very interested to hear takes on this! 

13 comments

Comments sorted by top scores.

comment by Buck · 2024-08-05T15:17:00.857Z · LW(p) · GW(p)

You might enjoy this paper about a related idea.

comment by Joseph Miller (Josephm) · 2024-08-02T22:53:35.294Z · LW(p) · GW(p)

If I understand correctly, this residual decomposition is equivalent to the edge / factorized view of a transformer described here [LW · GW].

Update: actually the residual decomposition is incorrect - see my other comment [LW(p) · GW(p)].

Replies from: dtch1997
comment by Daniel Tan (dtch1997) · 2024-08-05T10:20:51.229Z · LW(p) · GW(p)

I agree, this seems like exactly the same thing, which is great! In hindsight it's not surprising that you / other people have already thought about this.

Do you think the 'tree-ified view' (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis? 

Replies from: Josephm
comment by Joseph Miller (Josephm) · 2024-08-05T18:50:48.453Z · LW(p) · GW(p)

The treeified view is different from the factorized view! See figure 1 here.

I think the factorized view is pretty useful. But on the other hand I think MLP + Attention Head circuits are too coarse-grained to be that interpretable.

Replies from: oliver-daniels-koch
comment by Oliver Daniels (oliver-daniels-koch) · 2024-08-07T05:22:34.921Z · LW(p) · GW(p)

Just to make it explicit and check my understanding - the residual decomposition is equivalent to the edge / factorized view of the transformer in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.

 = input -> output

 = input -> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output

And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG. 
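To make the notion of a "path" concrete, here is a minimal sketch that only enumerates input-to-output paths in a small factorized DAG. It assumes one attention head per layer, so every component can read from the input and from every earlier component. The component names are hypothetical. It does not compute any values, and whether the output equals a sum over these paths is a separate question.

```python
from itertools import combinations

def enumerate_paths(components):
    """List every input -> output path through an ordered list of components.

    Assumes each component can read from the input or any earlier component,
    and that the output can read from anything. Under that assumption a path
    is just an ordered subset of the components.
    """
    paths = []
    for degree in range(len(components) + 1):
        for subset in combinations(components, degree):
            paths.append(("input", *subset, "output"))
    return paths

# Hypothetical component names, ordered by position in the model.
components = ["Attn 1.0", "MLP 1", "Attn 2.0", "MLP 2"]
paths = enumerate_paths(components)

for path in paths:
    degree = len(path) - 2  # number of components the path passes through
    print(degree, " -> ".join(path))
```

With several heads per layer the enumeration would need an extra check, since heads in the same layer cannot read each other's outputs.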

Replies from: Josephm
comment by Joseph Miller (Josephm) · 2024-08-08T05:07:42.537Z · LW(p) · GW(p)

Actually I think the residual decomposition is incorrect - see my other comment [LW(p) · GW(p)].

comment by Joseph Miller (Josephm) · 2024-08-07T07:50:24.268Z · LW(p) · GW(p)

@Oliver Daniels-Koch [LW · GW]'s reply [LW(p) · GW(p)] to my comment made me read this post again more carefully and now I think that your formulation of the residual expansion is incorrect.

Given   it does not follow that   because  is a non-linear operation. It cannot be decomposed like this.

My understanding of your big summation (with  representing any MLP or attention head):

again does not hold because the s are non-linear.

There are two similar ideas which do hold, namely (1) the treeified / unraveled view and (2) the factorized view (both of which are illustrated in figure 1 here), but your residual expansion / big summation is not equivalent to either.

  1. The treeified / unraveled view is the most similar. It separates each path from input to output, but the difference is that this does not claim that the output is the sum of all separate paths.

  2. The factorized view follows from the treeified view and is just the observation that any point in the residual stream can be decomposed into the outputs of all previous components.
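The distinction can be checked numerically. The sketch below uses a toy ReLU MLP with random weights and a random vector standing in for the attention output. All names and shapes are assumptions for illustration, not taken from any real model. It compares the factorized view (a sum of component outputs) with a sum over separate paths that treats the MLP as if it were additive.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 8, 32

# Hypothetical toy weights; a real model would also have LayerNorm and biases.
W_in = rng.normal(size=(d, d_hidden))
W_out = rng.normal(size=(d_hidden, d))

def mlp(h):
    """Toy one-hidden-layer ReLU MLP (no bias)."""
    return np.maximum(h @ W_in, 0.0) @ W_out

x = rng.normal(size=d)         # stand-in for the token embedding
attn_out = rng.normal(size=d)  # stand-in for Attn(x); its exact value is arbitrary

# Actual forward pass of the toy 1-layer residual model.
h = x + attn_out
y = h + mlp(h)

# Factorized view: the output is the sum of the outputs of all components.
factorized = x + attn_out + mlp(x + attn_out)

# Sum over separate paths: treats the MLP as if it were additive.
path_sum = x + attn_out + mlp(x) + mlp(attn_out)

factorized_error = np.abs(y - factorized).max()
path_sum_error = np.abs(y - path_sum).max()
print(f"factorized view error: {factorized_error:.3e}")
print(f"path-sum error:        {path_sum_error:.3e}")
```

The factorized sum reproduces the forward pass, while the path sum does not, because ReLU(a + b) generally differs from ReLU(a) + ReLU(b).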

Replies from: dtch1997
comment by Daniel Tan (dtch1997) · 2024-08-07T08:14:02.049Z · LW(p) · GW(p)

If I understand correctly, you're saying that my expansion is wrong, because , which I agree with. 

  1. Then isn't it also true that 
  2. Also, if the output is not a sum of all separate paths, then what's the point of the unraveled view? 
Replies from: Josephm
comment by Joseph Miller (Josephm) · 2024-08-07T19:19:09.843Z · LW(p) · GW(p)

Yes  is what I'm saying.

  1. Yes I agree  
  2. (Firstly note that it can be true without being useful). In the Residual Networks Behave Like Ensembles of Relatively Shallow Networks paper, they discover that long paths are mostly not needed for the model. In Causal Scrubbing [AF · GW] they intervene on the treeified view to understand which paths are causally relevant for particular behaviors.
Replies from: dtch1997
comment by Daniel Tan (dtch1997) · 2024-08-08T08:34:12.400Z · LW(p) · GW(p)

That makes sense to me. I guess I'm dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent; whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could also (in principle) change the value computed by the other path). This seems to be the mechanism that explains redundancy [LW(p) · GW(p)]. 
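A minimal sketch of that interaction, assuming a toy two-component residual network with random weights (the components `comp_a` and `comp_b` are hypothetical stand-ins, not real attention or MLP blocks): zero-ablating the first component changes what the second component computes, even though the second component also has a direct route from the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical toy components: a random linear map followed by ReLU.
W_a = rng.normal(size=(d, d))
W_b = rng.normal(size=(d, d))

def comp_a(h):
    return np.maximum(h @ W_a, 0.0)

def comp_b(h):
    return np.maximum(h @ W_b, 0.0)

def forward(x, ablate_a=False):
    """Two-component residual network; returns each component's output and the final stream."""
    a_out = np.zeros_like(x) if ablate_a else comp_a(x)
    h = x + a_out
    b_out = comp_b(h)  # B reads the full stream, so it sees A's output too
    return a_out, b_out, h + b_out

x = rng.normal(size=d)
_, b_clean, _ = forward(x)
_, b_ablated, _ = forward(x, ablate_a=True)

# B's contribution is not independent of A: removing A shifts B's output.
change_in_b = np.abs(b_clean - b_ablated).max()
print(f"change in B's output after ablating A: {change_in_b:.3e}")
```

This is only an illustration that components sharing a residual stream are coupled; it says nothing about how large the effect is in trained models.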

comment by mishka · 2024-08-02T11:45:38.277Z · LW(p) · GW(p)

I think this makes sense.

I am not sure how new this approach is (for simplified Transformers, the original AMFOTC paper has several sections called "* Path Expansion *", which seem to do something very similar for a reduced set of transformations, and their formalism of "virtual attention heads" seems also to be in that spirit).

Replies from: dtch1997
comment by Daniel Tan (dtch1997) · 2024-08-02T13:28:47.794Z · LW(p) · GW(p)

Fair point, and I should amend the post to point out that AMFOTC also does 'path expansion'. However, I think this is still conceptually distinct from AMFOTC because: 

  • In my reading of AMFOTC, the focus seems to be on understanding attention by separating the QK and OV circuits, writing these as linear (or almost linear) terms, and fleshing this out for 1-2 layer attention-only transformers. This is cool, but also very hard to use at the level of a full model
  • Beyond understanding individual attention heads, I am more interested in how the whole model works; IMO this is very unlikely to be simply understood as a sum of linear components. OTOH residual expansion gives a sum of nonlinear components and maybe each of those things is more interpretable. 
  • I think the notion of path 'degrees' hasn't been explicitly stated before and I found this to be a useful abstraction to think about circuit complexity. 

Maybe this post is better framed as 'reconciling AMFOTC with SAE circuit analysis'. 

Replies from: mishka
comment by mishka · 2024-08-02T14:26:59.467Z · LW(p) · GW(p)

Yes, I think this makes sense.

Here is one aspect which might be useful to keep in mind.

If we think about all this as some kind of "generalized Taylor expansion", there are some indications that the deviations from linearity might be small.

E.g. there is this rather famous post, https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear [LW · GW].

Another indication pointing to "almost linearity" is that "model merge" works pretty well. Although, interestingly enough, people often prefer to approach "model merge" in a more subtle fashion than just linear interpolation, so, presumably, non-linearity does matter quite a bit as well, e.g. https://huggingface.co/blog/mlabonne/merge-models.
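One way to read the "generalized Taylor expansion" remark, as a sketch under the assumption that a component $f$ is smooth (true for GELU-style MLPs, only piecewise for ReLU), with $x$ the direct input and $a$ the contribution written by an earlier component. The symbols $f$, $x$, $a$ and $J_f$ are notation assumed here, not taken from the thread:

```latex
\begin{align}
f(x + a) &= f(x) + J_f(x)\,a + O\bigl(\lVert a \rVert^2\bigr)
\end{align}
```

Here $J_f(x)$ is the Jacobian of $f$ at $x$. If the remainder term is small, the component acts approximately linearly on $a$ around $x$. Note that the linearised map $J_f(x)$ still depends on $x$, so this is weaker than the additive split $f(x + a) = f(x) + f(a)$.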