Exploring Decomposability of SAE Features

post by Vikram_N (viknat) · 2024-09-30T18:28:09.348Z

Contents

  TL;DR
  Motivation
  Prior Work
  Methodology
  Results
  Limitations and Future Work

TL;DR

SAE features are often less decomposable than their feature descriptions imply. Using a prompting technique to test potential subcomponents of individual SAE features (for example, decomposing an Einstein feature into "German", "physics", "relativity" and "famous"), I found highly divergent behaviour in how decomposable features were. I built an interactive visualization to explore these differences feature by feature. The key finding is that although many features can be decomposed in a human-intuitive way, as in the Einstein example above, many cannot, and those features point to more opaque model behaviour.

Motivation

The goal of this writeup is to explore the atomicity and decomposability of SAE features. How precisely do feature descriptions capture the sets of inputs that cause a feature to activate? Are there cases where the inputs that activate a feature are non-intuitive and unrelated to concepts we would expect to be related?

Apart from being an interesting area of exploration, I think this is also an important question for alignment. SAE features represent our current best attempt at inspecting model behaviour in a human-interpretable way. Non-intuitive feature decompositions might indicate the potential for alignment failures.

Prior Work

I was partly inspired by this work [LW · GW] on "meta-SAEs" (training SAEs on the decoder directions of other SAEs), which clearly demonstrated that SAE features aren't necessarily atomic and can be decomposed into more granular latent dimensions. I was curious whether it was possible to come at this problem from a different direction: given a particular SAE feature, can we generate inputs that "should" activate it in areas a human would consider related, and check whether the feature actually activates?

Methodology

I used the pretrained Gemma Scope SAEs and randomly sampled features from layer 20 of the 2B-parameter model. The choice of layer 20 was somewhat arbitrary: as a layer deeper in the model, the hope was that it would contain more high-level, abstract features. I prompted ChatGPT to produce, for each feature:

- A set of human-intuitive subcomponents of the feature's description (as in the Einstein example above)
- Three short phrases per subcomponent that should activate the feature if that subcomponent is genuinely part of what the feature represents

The results were returned in JSON in order to be machine-consumable.
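As an illustration, the prompting step can be sketched roughly as below; the prompt wording and JSON field names here are illustrative rather than the exact ones used:

```python
# Illustrative sketch of the prompting step; the wording and JSON field names
# are not the literal prompt used and are only meant to show the structure.
import json

PROMPT_TEMPLATE = """You are given the description of a sparse autoencoder (SAE) feature:

"{feature_description}"

1. Break this description into a small set of human-intuitive subcomponents
   (e.g. an Albert Einstein feature -> "German", "physics", "relativity", "famous").
2. For each subcomponent, write 3 short phrases that should activate the feature
   if that subcomponent is genuinely part of what the feature represents.

Respond with JSON of the form:
{{"subcomponents": [{{"name": "...", "activating_phrases": ["...", "...", "..."]}}]}}
"""


def build_prompt(feature_description: str) -> str:
    return PROMPT_TEMPLATE.format(feature_description=feature_description)


def parse_response(response_text: str) -> list[dict]:
    # response_text is the chat model's reply; each entry has a "name"
    # and a list of activating phrases.
    return json.loads(response_text)["subcomponents"]
```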

I then measured, for each of the activating phrases (15 in total, 3 per subcomponent), how strongly the original feature activated in the original SAE. In practice I ended up running most of this through the Neuronpedia API because I was compute-constrained.
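For concreteness, measuring a phrase's activation locally looks roughly like the sketch below, using SAE Lens and TransformerLens. The release name, SAE ID and hook point are my best guesses for the layer-20 Gemma Scope residual SAE and should be read as assumptions rather than the exact setup used:

```python
# Sketch of scoring one phrase against one Gemma Scope feature locally.
# The SAE release/ID and hook name are assumptions for the layer-20
# residual-stream SAE, not necessarily the exact ones used.
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gemma-2-2b")
loaded = SAE.from_pretrained(
    release="gemma-scope-2b-pt-res-canonical",
    sae_id="layer_20/width_16k/canonical",
)
# Older sae_lens versions return (sae, cfg, sparsity); newer ones return the SAE.
sae = loaded[0] if isinstance(loaded, tuple) else loaded
sae = sae.to(model.cfg.device)


def max_feature_activation(phrase: str, feature_idx: int) -> float:
    """Max activation of a single SAE feature over the tokens of a phrase."""
    with torch.no_grad():
        _, cache = model.run_with_cache(phrase)
        resid = cache["blocks.20.hook_resid_post"]  # [1, seq_len, d_model]
        feature_acts = sae.encode(resid)            # [1, seq_len, d_sae]
    return feature_acts[0, :, feature_idx].max().item()


# Scoring all 15 phrases for one feature:
# scores = {sub["name"]: [max_feature_activation(p, feature_idx)
#                         for p in sub["activating_phrases"]]
#           for sub in subcomponents}
```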

Results

Streamlit visualization to view this breakdown by feature
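The core of such a view can be sketched in a few lines of Streamlit; the CSV name and column layout below are hypothetical and only show the shape of the app:

```python
# Minimal sketch of a Streamlit view: pick a feature and see how strongly each
# subcomponent's phrases activate it. Assumes a precomputed CSV with columns
# feature_id, subcomponent, phrase, activation (hypothetical names).
import pandas as pd
import streamlit as st

df = pd.read_csv("feature_activations.csv")

feature_id = st.selectbox("Feature", sorted(df["feature_id"].unique()))
subset = df[df["feature_id"] == feature_id]

st.subheader("Max activation per subcomponent")
st.bar_chart(subset.groupby("subcomponent")["activation"].max())

st.subheader("Per-phrase activations")
st.dataframe(subset[["subcomponent", "phrase", "activation"]])
```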

I found significant differences in how likely a given feature was to be cleanly decomposable. There were three main classes of behaviour: 

Limitations and Future Work

This prompting technique has limitations, as it doesn't directly analyze SAE or model internals to capture feature meaning. It also has advantages: it is more model- and technology-agnostic. Some specific limitations:
