How do SAE Circuits Fail? A Case Study Using a Starts-with-'E' Letter Detection Task

post by adsingh-64 · 2025-03-30T00:47:18.711Z

Contents

  TLDR: 
  Introduction and Experimental Setup 
  Key Results
  Conclusion
  Further Work
  Limitations

TLDR

We investigated cases where SAE circuits "fail" at detecting whether words start with "E": cases where the circuit omits the latent that is the single most important computational pathway in the model for the example at hand. We find that in the vast majority of these failure cases, the missing latent is nearly orthogonal to the circuit's main latents, suggesting that SAE computation is diffuse and follows no clear geometric pattern.

Introduction and Experimental Setup 

Sparse autoencoder (SAE) circuits have emerged as a promising tool for understanding language model behavior. These circuits contain components (called latents) that ideally correspond to human-interpretable features, offering an improvement over traditional circuit analysis units such as neurons or attention heads, which can be difficult to interpret.

That said, SAE circuits are subnetworks of the full model, and circuit filtering techniques may omit crucial parts of the model for a given task. 

We investigate failure cases in SAE circuits using a "Starts with 'E'" letter detection task used in the feature absorption work by Chanin et al. (2024). This is a task where an input prompt asks the model to identify the starting letter of a word, which is always "E". Here is an example few-shot prompt:

'Tartan has the first letter: T.
Mirth has the first letter: M.
Elephant has the first letter:'

The correct model output is ' E'. To determine the main causally important latents for this task, we follow Chanin et al. (2024) and define the metric

$m = \mathrm{logit}(\text{' E'}) - \mathrm{mean}_{L \neq E}\,\mathrm{logit}(\text{' }L)$,

i.e., the logit of ' E' minus the mean logit of the other letters.

We then compute latent attributions with respect to $m$ using the attribution vector $\nabla_a m \odot a$, where $a$ is the latent activation vector. This isolates those latents which push up the logit for ' E' in particular, as opposed to the logits for letters in general.

All experiments are done on Gemma 2 2B with the canonical 16k-width GemmaScope SAE for the layer 5 residual stream.
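For concreteness, here is a minimal sketch of the metric and attribution computation. It assumes the layer-5 residual stream at the final token (`resid`) and the gradient of the metric with respect to it (`grad_resid`) have already been cached from a forward/backward pass, and that the GemmaScope SAE is loaded as an object exposing `encode()` and a decoder matrix `W_dec`; these names are illustrative stand-ins rather than an exact API.

```python
import torch

def starts_with_e_metric(logits: torch.Tensor, letter_token_ids: list[int], e_token_id: int) -> torch.Tensor:
    """Metric m: the logit of ' E' minus the mean logit of the other letters."""
    others = [i for i in letter_token_ids if i != e_token_id]
    return logits[e_token_id] - logits[others].mean()

def latent_attributions(resid: torch.Tensor, grad_resid: torch.Tensor, sae) -> torch.Tensor:
    """Per-latent attribution (∇_a m) ⊙ a at the final token.

    resid, grad_resid: [d_model] residual-stream activation and the gradient of m w.r.t. it.
    sae: assumed to expose encode() -> [n_latents] and W_dec of shape [n_latents, d_model].
    """
    a = sae.encode(resid)            # latent activation vector a
    grad_a = sae.W_dec @ grad_resid  # chain rule through the decoder: dm/da_i ≈ grad_resid · W_dec[i]
    return grad_a * a                # elementwise product: one attribution per latent
```

In practice, `grad_resid` comes from calling `.backward()` on the metric after a forward pass that retains the residual-stream gradient at layer 5.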

Key Results

Averaged over 5000 few-shot prompts, here are the top 25 latents by attribution: 

The top two latents stand out from the rest by an order of magnitude in attribution. Hence, it is reasonable to expect that an SAE circuit optimized for this task (in particular, a sparse one) will keep only these top two latents and mean-ablate all others. In turn, it makes sense to define a failure case as an example prompt for which neither of the two main latents by attribution (latents 16070 and 13484) is the top latent by attribution on that particular example. Here is an example of a failure case for a prompt with the token 'Env': 
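Programmatically, a failure case is simply a prompt whose argmax attribution falls outside the two main latents; a minimal sketch, reusing the hypothetical `latent_attributions` helper above:

```python
import torch

MAIN_LATENTS = {16070, 13484}  # the two main latents by average attribution

def is_failure_case(attributions: torch.Tensor) -> bool:
    """True if the top latent by attribution for this example is neither main latent."""
    return int(attributions.argmax()) not in MAIN_LATENTS
```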

Approximately 20% of the examples were classified as failure cases. When examining these failure cases, we found that the cosine similarity between the example's top latent and the main latents (16070 and 13484) exceeded a threshold of 0.025 in only 6% of cases (we take the maximum cosine similarity between the top latent's decoder direction and each of the two main latents' decoder directions):

As a baseline comparison, nearly a quarter of all latents have cosine similarity of at least 0.025 with at least one of latent 16070 or latent 13484:
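The cosine-similarity check itself is straightforward; here is a hedged sketch, again assuming the SAE's decoder matrix `W_dec` is available with shape [n_latents, d_model]:

```python
import torch
import torch.nn.functional as F

def max_cosine_with_main(W_dec: torch.Tensor, top_latent: int,
                         main_latents: tuple[int, int] = (16070, 13484)) -> float:
    """Maximum cosine similarity between the decoder direction of `top_latent`
    and the decoder directions of the two main latents."""
    d_top = F.normalize(W_dec[top_latent], dim=0)            # unit decoder direction of the top latent
    d_main = F.normalize(W_dec[list(main_latents)], dim=1)   # unit decoder directions of the main latents
    return (d_main @ d_top).max().item()

# A failure case counts as "interfering" only if this value exceeds 0.025.
```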

Conclusion

The final two plots illustrate that when one of the main latents by attribution is displaced as the top latent by a different latent, there is minimal interference (decoder-direction overlap) between them. That is, SAE circuits may have entirely different, orthogonal computational pathways.

Further Work

Back-up latents: One possible explanation for our results is back-up latents. That is, might latents outside of the two main latents take on the role of increasing the logit for ' E' precisely when the two main latents do not? To serve this back-up role, the cosine similarity between the back-up latent's and the main latents' encoder directions must be extremely low (otherwise they would activate, or fail to activate, together). Hence, one way to test for the existence of back-up latents is to find examples of such absorbing latents (the top latents in failure cases) whose encoder directions are nearly orthogonal to those of the two main latents, 16070 and 13484.
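The same style of check works for this back-up hypothesis, using encoder rather than decoder directions; a sketch, assuming the encoder matrix `W_enc` has shape [d_model, n_latents] so that its columns are encoder directions:

```python
import torch
import torch.nn.functional as F

def max_encoder_cosine_with_main(W_enc: torch.Tensor, candidate: int,
                                 main_latents: tuple[int, int] = (16070, 13484)) -> float:
    """Maximum cosine similarity between the encoder direction of a candidate
    back-up latent and the encoder directions of the two main latents."""
    cols = F.normalize(W_enc.T, dim=1)  # [n_latents, d_model]: unit encoder direction per row
    return (cols[list(main_latents)] @ cols[candidate]).max().item()
```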

Feature absorption by attribution [speculative]: Another possible explanation for our results is feature absorption by attribution. Standard feature absorption (by activation) occurs when a main latent that activates on a broad class of examples, such as text starting with 'E', fails to activate on particular instances of that class, such as 'Elephants', because another, absorbing "elephants" latent activates instead. It may be that the same thing happens with latents' effects on the logits: a latent pushes up the logit for ' E' on most text that starts with "E", but on particular instances, such as 'Elephants', another latent assumes the role of promoting the ' E' logit instead. From a theoretical standpoint, however, this is unjustified: feature absorption by activation can help the model decrease the L0 or L1 penalty on the latent activations, whereas feature absorption by attribution confers no comparable benefit.

Limitations

The findings in this project should not be taken as general laws about SAE circuits, as we tested only a limited data distribution and a single task. However, we hope these broad observations can better inform SAE circuit design.


This research builds on work by Marks et al. (2024) on using attribution patching at scale to find circuits of SAE latents. The starts-with-'E' letter identification task was taken from Chanin et al. (2024) in their work on feature absorption.

Full source code available at: https://github.com/adsingh-64/SAE-Circuits
