[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

post by chanind, TomasD (tomas-dulka), hrdkbhatnagar, Joseph Bloom (Jbloom) · 2024-09-25T09:31:03.296Z · LW · GW · 6 comments

This is a link post for https://arxiv.org/abs/2409.14507

Contents

  We pose two questions:
  Feature splitting vs feature absorption:
6 comments

This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by Joseph Bloom (Decode Research). Find out more about the programme and express interest in upcoming iterations here.

This high-level summary will be most accessible to those with relevant context, including an understanding of SAEs. The importance of this work rests in part on the surrounding hype and potential philosophical issues [LW · GW]. We encourage readers seeking technical details to read the paper on arXiv.

Explore our interactive app here.

TLDR: This is a short post summarising the key ideas and implications of our recent work studying how character information represented in language models is extracted by SAEs. Our most important result shows that SAE latents can appear to classify some feature of the input, but actually turn out to be quite unreliable classifiers (much worse than linear probes). We think this unreliability is in part due to a difference between what we actually want (an "interpretable decomposition") and what we train against (sparsity + reconstruction). We think there are many potentially productive follow-up investigations.

We pose two questions:

  1. To what extent do Sparse Autoencoders (SAEs) extract interpretable latents from LLMs? The success of SAE applications (such as detecting safety-relevant features or efficiently describing circuits) will rely on whether SAE latents are reliable classifiers and provide an interpretable decomposition. 
  2. How does varying the hyperparameters of the SAE affect its interpretability? Much time and effort is being invested in iterating on SAE training methods; can we provide a guiding signal for these endeavours? 

To answer these questions, we tested SAE performance on a simple first-letter identification task using over 200 Gemma Scope SAEs. By focussing on a task with ground-truth labels, we measured the precision and recall of SAE latents tracking first-letter information.
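
As a rough illustration of the kind of measurement involved, here is a minimal sketch (not our evaluation code; the array names and the convention of treating any positive activation as a positive prediction are assumptions):

```python
import numpy as np

def latent_precision_recall(latent_acts: np.ndarray, starts_with_s: np.ndarray):
    """Score a single SAE latent as a classifier for "token starts with s".

    latent_acts:    activations of one latent on each token, shape [n_tokens]
    starts_with_s:  ground-truth boolean labels, shape [n_tokens]
    """
    fired = latent_acts > 0  # treat any positive activation as a positive prediction
    tp = np.sum(fired & starts_with_s)
    fp = np.sum(fired & ~starts_with_s)
    fn = np.sum(~fired & starts_with_s)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```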

Our results revealed a novel obstacle to using SAEs for interpretability, which we term 'Feature Absorption'. This phenomenon is a pernicious, asymmetric form of feature splitting in which a seemingly token-aligned latent (e.g. a "snake" latent) absorbs the direction of a more general feature (e.g. "starts with S"), so the general latent fails to fire on exactly those tokens.

Feature splitting vs feature absorption: in ordinary feature splitting, a concept splits into several narrower latents that each remain interpretable; in feature absorption, the general latent keeps its apparent interpretation but silently stops firing on the tokens whose dedicated latents have absorbed the feature.

Feature Absorption is problematic: the general latent becomes an unreliable classifier with hard-to-predict false negatives, and the absorbing latents give little indication that they encode the general feature, so the failure is difficult to spot from interpretations alone.

While our study has clear limitations, we think there’s compelling evidence for the existence of feature absorption. It’s important to note that our results use a single model (Gemma-2-2b), one SAE architecture (JumpReLU) and one task (the first-letter identification task). However, the qualitative results are striking (which readers can explore for themselves using our streamlit app) and causal interventions are built directly into our metric for feature absorption.
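
For intuition only, a heavily simplified flag for candidate absorption cases might look like the sketch below. This is not the metric from the paper; in particular, the cosine-similarity threshold between decoder directions and the probe direction is our own assumption.

```python
import numpy as np

def candidate_absorption(main_latent_act: float,
                         probe_score: float,
                         other_latent_acts: np.ndarray,  # [n_latents]
                         decoder_dirs: np.ndarray,       # [n_latents, d_model], unit rows
                         probe_dir: np.ndarray,          # [d_model], unit norm
                         cos_threshold: float = 0.5) -> bool:
    """Flag a token as a possible absorption case: the expected "starts with s"
    latent stays silent even though a linear probe says the information is present,
    while some other active latent's decoder direction is aligned with the probe."""
    if main_latent_act > 0 or probe_score <= 0:
        return False  # the main latent fired, or the probe says the feature isn't there
    cosines = decoder_dirs @ probe_dir
    aligned_and_active = (other_latent_acts > 0) & (cosines > cos_threshold)
    return bool(aligned_and_active.any())
```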

We’re excited about future work in a number of areas (ordered from most concrete and important to most speculative and exciting!):

  1. Validating feature absorption.
    1. Is it reduced in different SAE architectures? (We think likely no). 
    2. Does it appear for other kinds of features? (We think likely yes). 
  2. Refining feature absorption metrics, such as removing the need to train a linear probe.
  3. Exploring strategies for mitigating feature absorption.
    1. We think Meta-SAEs [LW · GW] may be a part of the answer.
    2. We’re also excited about fine-tuning against attribution sparsity with task-relevant metrics/datasets. 
  4. Building better toy models and theories. Most prior toy models assume independent features, but can we construct “Toy Models of Feature Absorption” with features that co-occur, and subsequently construct SAEs which effectively decompose them (such as via the methods described in point 3)? A minimal sketch of such a data-generating process follows this list.
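
As a starting point for point 4, here is a minimal sketch of a data-generating process with co-occurring features; the directions, token set, and dimensions are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical ground-truth directions: a broad "starts with s" feature and
# token-level features whose presence always implies the broad feature.
starts_with_s = rng.normal(size=d_model)
snake = rng.normal(size=d_model)
sun = rng.normal(size=d_model)
for v in (starts_with_s, snake, sun):
    v /= np.linalg.norm(v)

def sample_activation():
    """Sample an activation in which token features co-occur with "starts with s"."""
    x = np.zeros(d_model)
    token = rng.choice(["snake", "sun", "other_s_word", "non_s_word"])
    if token != "non_s_word":
        x += starts_with_s          # the broad feature is present
    if token == "snake":
        x += snake                  # token feature, always co-occurring with the broad one
    elif token == "sun":
        x += sun
    return x, token

# An SAE trained on samples like these can lower L0 by folding the
# "starts with s" direction into its "snake" and "sun" latents.
```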

We believe our work provides a significant contribution by identifying and characterising the feature absorption phenomenon, highlighting a critical challenge in the pursuit of interpretable AI systems.


6 comments

Comments sorted by top scores.

comment by Joseph Miller (Josephm) · 2024-09-25T14:03:17.876Z · LW(p) · GW(p)

Nice post. I think this is a really interesting discovery.

[Copying from messages with Joseph Bloom]
TLDR: I'm confused what is different about the SAE input that causes the absorbed feature not to fire.

Me:

Summary of your findings

  • Say you have a “starts with s” feature and a “snake” feature.
  • You find that for most words, “starts with s” correctly categorizes words that start with s. But for a few words that start with s, like snake, it doesn’t fire.
  • These exceptional tokens where it doesn’t fire all have another feature that corresponds very closely to the token. For example, there is a “snake” feature that corresponds strongly to the snake token.
  • You say that the “snake” feature has absorbed the “starts with s” feature because the concept of snake also contains/entails the concept of ‘starts with s’.
  • Most of the features that absorb other features correspond to common words, like “and”.

So why is this happening? Well, it makes sense that the model can do better on L1 on the snake token by just firing a single “snake” feature (rather than the “starts with s” feature and, say, the “reptile” feature). And it makes sense that it would only have enough space to have these specific token features for common tokens.
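
A toy numerical illustration of that L1 argument, with made-up (roughly orthogonal) unit directions, nothing from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
starts_with_s, reptile = rng.normal(size=(2, d))
starts_with_s /= np.linalg.norm(starts_with_s)
reptile /= np.linalg.norm(reptile)

# Suppose the model's "snake" activation contains both underlying features.
snake_activation = starts_with_s + reptile

# Option A: reconstruct with the two general latents -> L0 = 2, L1 = 2.0
# Option B: reconstruct with one dedicated "snake" latent whose decoder direction
#           points along the whole sum -> L0 = 1, L1 ~= 1.4, same reconstruction.
snake_dir = snake_activation / np.linalg.norm(snake_activation)
coef = snake_activation @ snake_dir
error = np.linalg.norm(snake_activation - coef * snake_dir)
print(coef, error)  # ~1.4 and ~0.0: sparser and cheaper in L1 for the same reconstruction
```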

Joseph Bloom:

rather than the “starts with s” feature and, say, the “reptile” feature

We found cases of seemingly more general features getting absorbed in the context of spelling, but they are rarer / probably the exception. It's worth noting that we suspect feature absorption is just easiest to find for token-aligned features, but conceptually it could occur any time a similar structure exists between features.

And it makes sense it would only have enough space to have these specific token features for common tokens.

I think this needs further investigation. We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token). I predict there is a strong density effect but it could be non-trivial.

Me:

We found cases of seemingly more general features getting absorbed in the context of spelling

What’s an example?

We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token)

You mean like the “starts with s” feature could be absorbed into the “snake” feature on the french word for snake?

Does this only happen if the French word also starts with s?

Joseph Bloom:

What’s an example?

You mean like the “starts with s” feature could be absorbed into the “snake” feature on the french word for snake?

Yes

Does this only happen if the French word also starts with s?

More likely. I think the process is stochastic so it's all distributions.

↓[Key point]↓

Me:

But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? How is it able to fire on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature? I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?

Joseph Bloom:

I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?

I think the success of the linear probe is why we think the snake token does have the starts with s direction. The linear probe has much better recall and doesn't struggle with obvious examples. I think the feature absorption work is not about how models really work, it's about how SAEs obscure how models work.
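
A minimal sketch of that probe-vs-latent recall comparison (the sklearn probe and the array names are stand-ins, not the code from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compare_recall(resid_acts: np.ndarray,       # [n_tokens, d_model] activations at the SAE hook point
                   sae_latent_acts: np.ndarray,  # [n_tokens] activations of the "starts with s" latent
                   labels: np.ndarray):          # [n_tokens] booleans: token starts with s
    """Recall of a linear probe vs recall of a single SAE latent on the same task."""
    probe = LogisticRegression(max_iter=1000).fit(resid_acts, labels)
    probe_recall = np.sum(probe.predict(resid_acts) & labels) / labels.sum()
    latent_recall = np.sum((sae_latent_acts > 0) & labels) / labels.sum()
    return probe_recall, latent_recall
```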

But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? Like what is the mechanism by which it fires on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature?

Short answer, I don't know. Long answer - some hypotheses:

  1. Linear probes can easily do calculations of the form "A AND B". In large vector spaces, it may be possible to learn a direction of the form "(^S.*) AND not (snake) and not (sun) ...". Note that "snake" has a component separate from "starts with s", so this is possible. To the extent this may be hard, that's possibly why we don't see more absorption, but my own intuition says that in large vector spaces this should be perfectly possible to do.
  2. Encoder weights and decoder weights aren't tied. If they were, you can imagine that choosing these exceptions for absorbed examples would damage reconstruction performance. Since we don't tie the weights, the model can detect "(^S.*) AND not (snake) and not (sun) ..." but write "(^S.*)". (A toy numerical sketch of this follows below.) I'm interested to explore this further and am sad we didn't get to this in the project.
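
A minimal numerical sketch of hypothesis 2, with made-up orthogonal directions (purely illustrative, not fit to any real model): the encoder can read "starts with s AND NOT snake" while the decoder still writes the clean direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Two orthogonal unit directions: the "starts with s" feature, and the part of
# the "snake" representation that is *not* explained by "starts with s".
starts_with_s = rng.normal(size=d)
starts_with_s /= np.linalg.norm(starts_with_s)
snake_specific = rng.normal(size=d)
snake_specific -= (snake_specific @ starts_with_s) * starts_with_s
snake_specific /= np.linalg.norm(snake_specific)

# Untied weights: the decoder writes the clean "starts with s" direction...
w_dec = starts_with_s
# ...while the encoder reads "starts with s AND NOT snake" by subtracting the
# snake-specific component.
w_enc = starts_with_s - 3.0 * snake_specific

generic_s_word = starts_with_s                 # e.g. "sport": just the broad feature
snake_token = starts_with_s + snake_specific   # "snake": broad feature + its own content

relu = lambda a: max(a, 0.0)
print(relu(w_enc @ generic_s_word))  # 1.0 -> the latent fires on ordinary s-words
print(relu(w_enc @ snake_token))     # 0.0 -> suppressed on "snake", even though the
                                     #        starts-with-s direction is present
```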
Replies from: Jbloom
comment by Joseph Bloom (Jbloom) · 2024-09-25T15:32:30.712Z · LW(p) · GW(p)

This thread reminds me that comparing feature absorption in SAEs with tied encoder/decoder weights and in end-to-end SAEs seems like a valuable follow-up.

comment by Bart Bussmann (Stuckwork) · 2024-09-25T13:06:47.585Z · LW(p) · GW(p)

Great work! Using spelling is a very clear example of how information gets absorbed into SAE latents, and indeed in Meta-SAEs [LW · GW] we found many spelling/sound-related meta-latents.

I have been thinking a bit about how to solve this problem, and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.

Potentially, this would remove the "Starts-with-L"-component from the "lion"-token direction and activate the "Starts-with-L" latent instead. Although this would come at the cost of worse sparsity/reconstruction.
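
A rough sketch of what the two losses in such a GAN-style setup could look like (the SAE/meta-SAE interfaces, sae.W_dec, and the coefficients are all assumptions, not an existing implementation):

```python
import torch.nn.functional as F

def adversarial_losses(sae, meta_sae, x, l1_coef=1e-3, adv_coef=1e-2):
    """Losses for one step of joint SAE / meta-SAE training.

    Assumes sae(x) returns (latents, x_hat) and exposes sae.W_dec of shape
    [n_latents, d_model]; meta_sae(dirs) returns (meta_latents, dirs_hat).
    """
    latents, x_hat = sae(x)
    dirs = sae.W_dec

    # Meta-SAE objective: decompose the SAE's (frozen) decoder directions.
    meta_latents, dirs_hat = meta_sae(dirs.detach())
    meta_loss = F.mse_loss(dirs_hat, dirs.detach()) + l1_coef * meta_latents.abs().mean()

    # SAE objective: the usual reconstruction + sparsity terms, plus an adversarial
    # bonus for decoder directions the current meta-SAE reconstructs poorly.
    _, dirs_hat_adv = meta_sae(dirs)
    sae_loss = (F.mse_loss(x_hat, x)
                + l1_coef * latents.abs().mean()
                - adv_coef * F.mse_loss(dirs_hat_adv, dirs))
    return sae_loss, meta_loss
```

These would then be minimised with alternating optimiser steps, as in a standard GAN loop (zeroing the meta-SAE's gradients before its own step, since the adversarial term also backpropagates through it).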

Replies from: Jbloom
comment by Joseph Bloom (Jbloom) · 2024-09-25T15:08:22.042Z · LW(p) · GW(p)

Great work! Using spelling is a very clear example of how information gets absorbed into SAE latents, and indeed in Meta-SAEs [LW · GW] we found many spelling/sound-related meta-latents.

 

Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future.

I have been thinking a bit about how to solve this problem, and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.

Potentially, this would remove the "Starts-with-L"-component from the "lion"-token direction and activate the "Starts-with-L" latent instead. Although this would come at the cost of worse sparsity/reconstruction.

I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth's ideas here [LW · GW]. 

Rather than doubling down on a single, single-layered decomposition for all activations, why not go with a multi-layered decomposition (i.e. some combination of SAE and meta-SAE, preferably as unsupervised as possible)? Or alternatively, maybe the decomposition that is most useful in each case changes, and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.

comment by eggsyntax · 2024-09-25T13:58:46.043Z · LW(p) · GW(p)

What a great discovery, that's extremely cool. Intuitively, I would worry a bit that the 'spelling miracle [LW · GW]' is such an odd edge case that it may not be representative of typical behavior, although just the fact that 'starts with _' shows up as an SAE feature assuages that worry somewhat. I can see why you'd choose it, though, since it's so easy to mechanically confirm what tokens ought to trigger the feature. Do you have some ideas for non-spelling-related features that would make good next tests?

Replies from: Jbloom
comment by Joseph Bloom (Jbloom) · 2024-09-25T15:31:24.693Z · LW(p) · GW(p)

Thanks Egg! Really good question. Short answer: look at MetaSAEs for inspiration.

Long answer:

There are a few reasons to believe that feature absorption won't just be a thing for graphemic information:

  • People have noticed SAE latent false negatives in general, beyond just spelling features. For example, this quote from the Anthropic August update (I think they also make a comment about feature coordination being important in the July update):

If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way.  Empirically, however, we often find this not to be the case – often a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts.

  • MetaSAEs are highly suggestive of lots of absorption. "Starts with letter" features are found by MetaSAEs [LW · GW] along with lots of others (my personal favorite is a " Jerry" feature on which a Jewish meta-feature fires. I wonder what that's about!?) 🤔
  • Conceptually, being token- or character-specific doesn't play a big role. As Neel mentioned in his tweet here, once you understand the concept, it's clear that this is a strategy for generating sparsity in general when you have this kind of relationship between concepts. Here's a latent that's a bit less token-aligned in the MetaSAE app which can still be decomposed into meta-latents.

In terms of what I really want to see people look at: what wasn't clear from Meta-SAEs (which I think is clearer here) is that absorption is important for interpretable causal mediation. That is, for the spelling task, absorbing features look like a kind of mechanistic anomaly (but this is actually an artefact of the method) where the spelling information is absorbed. But if we found absorption in a case where we didn't know the model knew a property of some concept (or we didn't know it was a property), but saw it in the meta-SAE, that would be very cool. Imagine seeing attribution to a latent tracking something about a person, but then the meta-latents tell you that the model was actually leveraging some very specific fact about that person. This might be really important for understanding things like sycophancy...