200 COP in MI: Exploring Polysemanticity and Superposition

post by Neel Nanda (neel-nanda-1) · 2023-01-03T01:52:46.044Z · LW · GW · 6 comments

Contents

  Background
  Motivation
  Resources
  Tips
  Problems

Important Note: Since writing this, there's been a lot of exciting work on understanding superposition via training sparse autoencoders to take features out of superposition. I recommend reading up on that work, since it substantially changes the landscape of what problems matter here.
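To give a very rough sense of what that line of work involves (my own sketch, not any specific paper's implementation; the sizes and L1 coefficient are placeholder choices): a sparse autoencoder is trained to reconstruct a model's activations through an overcomplete, sparsely-activating hidden layer, and the hidden units are then treated as candidate features.

```python
import torch

d_model, d_sae = 512, 4096   # placeholder sizes: far more SAE units than model dimensions
enc = torch.nn.Linear(d_model, d_sae)
dec = torch.nn.Linear(d_sae, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
l1_coeff = 1e-3              # placeholder sparsity penalty

def train_step(acts: torch.Tensor) -> torch.Tensor:
    """One training step on a batch of model activations, shape (batch, d_model)."""
    feats = torch.relu(enc(acts))      # sparse feature activations
    recon = dec(feats)                 # reconstruction of the original activations
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

# Stand-in for real activations; in practice these come from running the model on data.
print(train_step(torch.randn(64, d_model)).item())
```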

This is the fifth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.

Motivating papers: Toy Models of Superposition, Softmax Linear Units

Background

If you're familiar with polysemanticity and superposition, skip to Motivation or Problems.

Neural networks are very high dimensional objects, in both their parameters and their activations. One of the key challenges in Mechanistic Interpretability is to somehow resolve the curse of dimensionality, and to break them down into lower dimensional objects that can be understood (semi-)independently. 

Our current best understanding of models is that, internally, they compute features: specific properties of the input, like "this token is a verb" or "this is a number that describes a group of people" or "this part of the image represents a car wheel". Early in the model there are simpler features, which are later used to compute more complex features by being connected up into a circuit (see the example circuit in the linked source). Further, our guess is that features correspond to directions in activation space. That is, for any feature that the model represents, there is some vector corresponding to it, and if we take the dot product of the model's activations with that vector, we get out a number representing whether that feature is present (these are known as decomposable, linear representations).
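As a minimal illustration of the "features as directions" picture (my own toy numbers, not from any real model): reading off a feature is just a dot product with its direction.

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)

# Hypothetical feature direction, e.g. "this token is a verb" (unit norm).
verb_direction = rng.normal(size=d_model)
verb_direction /= np.linalg.norm(verb_direction)

# An activation containing 2.5 "units" of the verb feature, plus unrelated stuff
# that is orthogonal to the feature direction.
other_stuff = rng.normal(size=d_model)
other_stuff -= (other_stuff @ verb_direction) * verb_direction
activation = 2.5 * verb_direction + other_stuff

# If the representation is linear, projecting onto the direction recovers the feature.
print(activation @ verb_direction)  # ~2.5
```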

This is an extremely useful thing to be true about a model! An even more helpful thing to be true would be if neurons correspond to features (i.e. the individual outputs of an activation function like ReLU). Naively, this is natural for the model to do, because a non-linearity like ReLU acts element-wise - each neuron's activation is computed independently (this is an example of a privileged basis). Concretely, if a single neuron has to represent both feature A and feature B, then that neuron fires differently on "feature A and NOT feature B" than on "feature A and feature B", meaning that the presence of B interferes with the model's ability to compute with A. But if each feature gets its own neuron, we're fine!
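Here is a toy numerical version of that interference argument (my own illustration, with made-up weights): if A and B share a neuron, the element-wise ReLU mixes them; with one neuron per feature, it doesn't.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0)

# Features A and B each take values in {0, 1}; compare "A without B" to "A with B".
inputs = [(1.0, 0.0), (1.0, 1.0)]

# Shared neuron: A is stored along +1 and B along -1 of the same neuron.
for a, b in inputs:
    print(f"shared neuron,    A={a}, B={b}: reads A as {relu(a - b)}")
# A=1, B=0 reads 1.0 but A=1, B=1 reads 0.0 - B's presence corrupts the readout of A.

# Dedicated neurons: one neuron per feature, ReLU applied element-wise.
for a, b in inputs:
    print(f"dedicated neurons, A={a}, B={b}: reads A as {relu(np.array([a, b]))[0]}")
# The readout of A is 1.0 in both cases - no interference.
```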

If features correspond to neurons, we're playing interpretability on easy mode - we can focus on just figuring out which feature corresponds to each neuron. In theory we could even show that a feature is not present by verifying that it's not present in each neuron! However, reality is not as nice as this convenient story. A countervailing force is the phenomenon of superposition: a network represents more features than it has dimensions, and squashes them all into a lower dimensional space. You can think of superposition as the model simulating a larger model.
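To get some intuition for why this is even possible (my own sketch, not from the post): in a high dimensional space you can fit many more nearly-orthogonal directions than you have dimensions, so many features can share the space with only a small amount of interference between any pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 100, 1000   # 10x more features than dimensions

# Random unit vectors as stand-ins for feature directions.
dirs = rng.normal(size=(n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Interference between two features = the dot product of their directions.
overlaps = dirs @ dirs.T
off_diag = np.abs(overlaps[~np.eye(n_features, dtype=bool)])
print(f"typical interference:    {off_diag.mean():.3f}")   # ~0.08, i.e. small
print(f"worst-case interference: {off_diag.max():.3f}")    # still well below 1
```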

Anthropic's Toy Models of Superposition paper is a great exploration of this. They build a toy model that learns to use superposition (notably different from a toy language model!). The model starts with a bunch of independently varying features, needs to compress these to a low dimensional space, and then is trained to recover each feature from the compressed mess. And it turns out that it does learn to use superposition! 

Specifically, it makes sense to use superposition for sufficiently rare (sparse) features, if the model is given non-linearities to clean up the interference. Further, the use of superposition can be modelled as a trade-off between the costs of interference and the benefits of representing more features. And digging further into their toy models, they find all kinds of fascinating motifs regarding exactly how superposition occurs, notably that features are sometimes compressed in geometric configurations, e.g. 5 features being compressed into two dimensions as the vertices of a pentagon (see the figures in the paper).
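For concreteness, here is a minimal sketch of that setup as I understand it (a paraphrase in code, not the authors' implementation; the sizes, sparsity, and importance schedule are placeholder choices): n sparse features are compressed into d < n dimensions by a matrix W, and the model is trained to recover them as ReLU(WᵀWx + b).

```python
import torch

n_features, d_hidden = 20, 5   # more features than dimensions
sparsity = 0.9                 # each feature is absent 90% of the time

W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
importance = 0.9 ** torch.arange(n_features).float()   # earlier features matter more

opt = torch.optim.Adam([W, b], lr=1e-2)
for step in range(5000):
    # Sparse features: uniform in [0, 1] when present, usually absent.
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity)

    h = x @ W.T                      # compress to d_hidden dimensions
    x_hat = torch.relu(h @ W + b)    # try to recover every feature
    loss = (importance * (x - x_hat) ** 2).mean()

    opt.zero_grad(); loss.backward(); opt.step()

# Columns of W are the learned feature directions. With high sparsity, the model
# typically gives non-trivial norm to more than d_hidden of them - superposition.
print((W.norm(dim=0) > 0.5).sum().item(), "features represented in", d_hidden, "dimensions")
```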

Motivation

Zooming out, what does this mean for what research actually needs to be done? To me, when I imagine what real progress here might look like, I picture the following:

  1. Crisp conceptual frameworks: I still feel pretty confused about what is even going on with superposition! How much does it occur? The Toy Models paper significantly clarified my intuitions, but it's far from complete. I expect progress here to mostly look like identifying the aspects of transformers and superposition that we're still confused about, building toy models to model those, and seeing what insights can be learned.
  2. Empirical data from real models: It's all well and good to have beautiful toy models and conceptual frameworks, but it's completely useless if we aren't learning anything about real models! I would love to have some well-studied cases of superposition and polysemanticity in real models, and to know whether any of the toy model's predictions transfer. 
    1. Can we find any truly monosemantic neurons? Can we find a pentagon of features in a real residual stream? Can we reverse engineer a feature represented by several neurons? 
  3. Dealing with superposition in practice: Understanding superposition is only useful in that it allows us to better understand networks, so we need to know how to deal with it in practice! Can we identify all the directions that correspond to features? Can we detect whether a feature is at all neuron-aligned, or just an arbitrary direction in space? (A crude neuron-alignment check is sketched right after this list.)
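As one crude starting point for the neuron-alignment question in problem 3 (my own sketch; the metric and threshold are arbitrary choices, not from the post), you can measure how much of a candidate feature direction's norm is concentrated on its largest neuron coordinates:

```python
import numpy as np

def neuron_alignment(direction: np.ndarray, top_k: int = 1) -> float:
    """Fraction of the direction's squared norm on its top_k largest-magnitude neurons.

    Close to 1.0 means essentially a single neuron (monosemantic-looking);
    close to top_k / d means an arbitrary direction spread across all d neurons.
    """
    sq = direction ** 2
    sq = sq / sq.sum()
    return float(np.sort(sq)[-top_k:].sum())

d = 512
rng = np.random.default_rng(0)
print(neuron_alignment(np.eye(d)[7]))         # 1.0: perfectly neuron-aligned
print(neuron_alignment(rng.normal(size=d)))   # ~0.02: an arbitrary direction
```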

The direction I'm most excited about is a combination of 1 and 2, to form a rich feedback loop between toy models and real models - toy models generate hypotheses to test, and exploring real models generates confusions to study in toy models. 

Resources

Tips

Problems

This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work, or reach out to other people listed there! (Thanks to Jay Bailey for making it.)

Notation: the ReLU output model is the main model in the Toy Models of Superposition paper, which compresses features through a linear bottleneck; the absolute value model is the variant studied with a ReLU hidden layer and a ReLU output layer, which uses neuron superposition.
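The ReLU output model is the one sketched in the Background section above; for contrast, here is a rough sketch of the absolute value model's forward pass as I understand it from the paper (a paraphrase, not the authors' code; the sizes are placeholder choices):

```python
import torch

n_features, d_hidden = 20, 40   # placeholder sizes
W_in = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
W_out = torch.nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))

def abs_value_model(x: torch.Tensor) -> torch.Tensor:
    """ReLU hidden layer and ReLU output layer, trained so the output matches |x|.

    Because the hidden layer is non-linear, features have to live in the neurons
    themselves, so any superposition here is superposition over neurons rather than
    over an unconstrained linear bottleneck."""
    h = torch.relu(x @ W_in.T)
    return torch.relu(h @ W_out.T + b)

# Sparse signed features in [-1, 1]; the training target would be x.abs().
x = (2 * torch.rand(8, n_features) - 1) * (torch.rand(8, n_features) > 0.9)
print(abs_value_model(x).shape)  # torch.Size([8, 20])
```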

6 comments

Comments sorted by top scores.

comment by LawrenceC (LawChan) · 2023-12-27T02:52:53.539Z · LW(p) · GW(p)

I strongly upvoted this post because of the "Tips" section, which is something I've come around on only in the last ~2.5 months. 

comment by sudo · 2023-07-13T21:06:43.416Z · LW(p) · GW(p)

neuron has

I was confused by the singular "neuron." 

I think the point here is that if there are some neurons which have low activation but high direct logit attribution after layernorm, then this is pretty good evidence for "smuggling."

Is my understanding here basically correct?

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-07-13T22:05:13.707Z · LW(p) · GW(p)

Yes that's correct

comment by neverix · 2023-11-21T22:30:07.622Z · LW(p) · GW(p)

This happens in transformer MLP layers. Note that the hidden dimen

Is the point that transformer MLPs blow up the hidden dimension in the middle?

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-11-21T22:59:05.736Z · LW(p) · GW(p)

Thanks for the catch, I deleted "Note that the hidden dimen". Transformers do blow up the hidden dimension, but that's not very relevant here - they have many more neurons than residual stream dimensions, and they have many more features than neurons (as shown in the recent Anthropic paper)

Replies from: neverix
comment by neverix · 2023-11-23T22:34:42.504Z · LW(p) · GW(p)

To clarify, I thought it was about superposition happening inside the projection afterwards.