A Bunch of Matryoshka SAEs

post by chanind, TomasD (tomas-dulka), Adrià Garriga-alonso (rhaps0dy) · 2025-04-04T14:53:56.805Z

Contents

    I don't care about any of that, just give me the SAEs!
    Neuronpedia
  Training info
    Snap loss
    Balancing absorption and hedging
  SAEs and stats
    Gemma-2-2b
      Snap Matryoshka SAEs
      Standard Matryoshka SAEs
    Gemma-2-9b
    Gemma-3-1b

This work was done as part of MATS 7.0.

MATS provides a generous compute stipend, and towards the end of the program we found we had some unspent compute. Rather than let it go to waste, we trained batch top-k Matryoshka SAEs on all residual stream layers of Gemma-2-2b, Gemma-2-9b, and Gemma-3-1b, and are now releasing them publicly. The hyperparams for these SAEs were not aggressively optimized, but they should hopefully be decent. Below we describe how and why these SAEs were trained, along with the stats for each SAE. Key decisions:

- Batch top-k SAEs of width 32k, trained on 750M tokens from the Pile, with Matryoshka levels of 128, 512, 2048, 8192, and 32768 latents.
- Snap loss variants of all Gemma-2-2b SAEs, released alongside the standard variants.
- No gradient stopping between Matryoshka levels, allowing some absorption pressure to counteract feature hedging.

I don't care about any of that, just give me the SAEs!

You can load all of the SAEs using SAELens via the following releases:

- gemma-2-2b-res-matryoshka-dc (Gemma-2-2b, standard)
- gemma-2-2b-res-snap-matryoshka-dc (Gemma-2-2b, snap loss)
- gemma-2-9b-res-matryoshka-dc (Gemma-2-9b)
- gemma-3-1b-res-matryoshka-dc (Gemma-3-1b)

For each release, the SAE ID is just the corresponding Transformer Lens post residual stream hook point, e.g. blocks.5.hook_resid_post for the layer 5 residual stream SAE.

Each SAE can be loaded in SAELens as follows:

from sae_lens import SAE

sae = SAE.from_pretrained("<release>", "<sae_id>")[0]

For instance, to load the layer 5 snap variant SAE for gemma-2-2b, this would look like the following:

sae = SAE.from_pretrained("gemma-2-2b-res-snap-matryoshka-dc", "blocks.5.hook_resid_post")[0]

Neuronpedia

Neuronpedia has generously hosted some of these SAEs, with more coming soon. Check them out at: https://www.neuronpedia.org/res-matryoshka-dc.

Matryoshka SAEs should be much better than standard SAEs at finding general, high-frequency concepts like parts of speech. In standard SAEs, latents tracking these concepts get shot to pieces by feature absorption, since they co-occur with so many other concepts. As Matryoshka SAEs should be much more resilient to absorption, we expect to find more meaningful high-density latents in them (although these latents may be messed up by feature hedging instead). For instance, there is a high-density latent in the first Matryoshka level of our layer 12 SAE for Gemma-2-2b which appears to (very noisily) perform a grammatical function similar to Treebank's IN (preposition or subordinating conjunction) part of speech.

Higher-frequency concepts should be concentrated in earlier latent indices: the highest-frequency concepts should be in latents 0-127, the next highest in latents 128-511, and so on.
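As an illustration, here is a minimal sketch of how one might look only at the first Matryoshka level of one of these SAEs. Slicing the activation tensor at index 128 is our own illustrative approach based on the level boundaries above, not an API of the release:

from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gemma-2-2b")
sae = SAE.from_pretrained("gemma-2-2b-res-matryoshka-dc", "blocks.12.hook_resid_post")[0]

# Grab residual stream activations at the SAE's hook point.
_, cache = model.run_with_cache("The cat sat on the mat")
resid = cache["blocks.12.hook_resid_post"]  # [batch, seq, d_model]

acts = sae.encode(resid)       # [batch, seq, 32768]
first_level = acts[..., :128]  # latents 0-127: the highest-frequency concepts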

Training info

All Matryoshka SAEs in this release were trained on 750M tokens from the Pile using a modified version of SAELens. The SAEs are all 32k width with the following Matryoshka levels: 128, 512, 2048, 8192, and 32768. We included two levels (128, 512) that are much narrower than the model residual stream to make it easier to study the first features the SAE learns. These are all batch top-k SAEs, following the original Matryoshka SAEs work. We largely did not optimize hyperparams for these SAEs, so it is likely possible to squeeze out more performance with better choices of learning rate and more training tokens, but these SAEs should hopefully still be decent. A sketch of the core training objective is shown below.
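To make the setup concrete, here is a minimal sketch (not the actual training code) of a batch top-k Matryoshka forward pass and loss. The shapes, initialization, and exact reductions are illustrative assumptions; the key points are that top-k is taken across the whole batch, and that every Matryoshka prefix must reconstruct the input on its own:

import torch

d_model, d_sae, k = 2304, 32768, 40
levels = [128, 512, 2048, 8192, 32768]

W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def matryoshka_forward(x):  # x: [batch, d_model]
    pre_acts = torch.relu(x @ W_enc + b_enc)
    # Batch top-k: keep the k * batch_size largest activations across the
    # whole batch, rather than exactly k per example.
    threshold = pre_acts.flatten().topk(k * x.shape[0]).values.min()
    acts = torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))
    # Matryoshka loss: each nested prefix of the dictionary must
    # reconstruct x using only its own latents.
    loss = x.new_zeros(())
    for level in levels:
        recon = acts[:, :level] @ W_dec[:level] + b_dec
        loss = loss + (recon - x).pow(2).mean()
    return acts, loss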

Snap loss

One of the notable components of this release is the addition of snap loss variants of all SAEs for Gemma-2-2b. Snap loss is described in our post on Feature Hedging [LW · GW], and involves switching the reconstruction loss of the SAE from MSE to L2 midway through training. In practice, we do not see much difference in SAEs trained on LLMs using snap loss, but we are releasing these variants regardless, since we had trained them anyway and others may be curious to investigate the effect of snap loss. If you notice a meaningful difference in practice between the snap loss and standard variants of these SAEs, please let us know! A sketch of the loss switch follows.
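For concreteness, here is a minimal sketch of the snap, under our reading of the description above; the snap_step parameter and the mean reductions are illustrative assumptions:

import torch

def snap_recon_loss(x, recon, step, snap_step):
    err = recon - x
    if step < snap_step:
        # Before the snap: standard MSE reconstruction loss.
        return err.pow(2).mean()
    # After the snap: mean unsquared L2 norm of the per-token error.
    return err.norm(dim=-1).mean()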

Balancing absorption and hedging

Intuitively, it might seem like we would want the inner levels of Matryoshka SAEs to be insulated from gradients from outer levels: outer levels pull the inner latents towards absorption, which defeats the purpose of a Matryoshka SAE! However, in toy models, hedging and absorption have opposite effects on the SAE encoder, so allowing some absorption pressure can help counteract the hedging of the SAE and improve performance. We note that the dictionary_learning implementation of Matryoshka SAEs also does not stop gradients between levels, likely because stopping gradients lets hedging mess up the SAE more severely. The sketch below shows what gradient stopping would look like, and why leaving it off matters.
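As an illustrative contrast (not the release code), here is how the Matryoshka reconstruction loss from the earlier sketch could optionally stop gradients between levels; the released SAEs correspond to stop_grads=False:

import torch

def matryoshka_recon_loss(acts, x, W_dec, b_dec, levels, stop_grads=False):
    loss = x.new_zeros(())
    prev = 0
    for level in levels:
        prefix = acts[:, :level]
        if stop_grads and prev > 0:
            # Detach the inner latents so this outer level's gradient cannot
            # pull them towards absorption (at the cost of more hedging).
            prefix = torch.cat([prefix[:, :prev].detach(), prefix[:, prev:]], dim=-1)
        recon = prefix @ W_dec[:level] + b_dec
        loss = loss + (recon - x).pow(2).mean()
        prev = level
    return loss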

For a further investigation of balancing hedging and absorption in Matryoshka SAEs, check out this colab.

We suspect that it may be possible to intentionally balance hedging with absorption in a more optimal way, and we plan to investigate this in future work. 

SAEs and stats

Below we list all the SAEs trained along with some core stats.

Gemma-2-2b

We trained both snap and standard variants of SAEs for Gemma-2-2b. These SAEs have the release ID gemma-2-2b-res-snap-matryoshka-dc for the snap-loss variant and gemma-2-2b-res-matryoshka-dc for the standard variant.

Snap Matryoshka SAEs

| layer | SAE ID | width | L0 | explained variance |
|-------|--------|-------|----|--------------------|
| 0 | blocks.0.hook_resid_post | 32768 | 40 | 0.919964 |
| 1 | blocks.1.hook_resid_post | 32768 | 40 | 0.863969 |
| 2 | blocks.2.hook_resid_post | 32768 | 40 | 0.858767 |
| 3 | blocks.3.hook_resid_post | 32768 | 40 | 0.815844 |
| 4 | blocks.4.hook_resid_post | 32768 | 40 | 0.821094 |
| 5 | blocks.5.hook_resid_post | 32768 | 40 | 0.797083 |
| 6 | blocks.6.hook_resid_post | 32768 | 40 | 0.79815 |
| 7 | blocks.7.hook_resid_post | 32768 | 40 | 0.78946 |
| 8 | blocks.8.hook_resid_post | 32768 | 40 | 0.779236 |
| 9 | blocks.9.hook_resid_post | 32768 | 40 | 0.759022 |
| 10 | blocks.10.hook_resid_post | 32768 | 40 | 0.743998 |
| 11 | blocks.11.hook_resid_post | 32768 | 40 | 0.731758 |
| 12 | blocks.12.hook_resid_post | 32768 | 40 | 0.725974 |
| 13 | blocks.13.hook_resid_post | 32768 | 40 | 0.727936 |
| 14 | blocks.14.hook_resid_post | 32768 | 40 | 0.727065 |
| 15 | blocks.15.hook_resid_post | 32768 | 40 | 0.757408 |
| 16 | blocks.16.hook_resid_post | 32768 | 40 | 0.751874 |
| 17 | blocks.17.hook_resid_post | 32768 | 40 | 0.763654 |
| 18 | blocks.18.hook_resid_post | 32768 | 40 | 0.77644 |
| 19 | blocks.19.hook_resid_post | 32768 | 40 | 0.768622 |
| 20 | blocks.20.hook_resid_post | 32768 | 40 | 0.761658 |
| 21 | blocks.21.hook_resid_post | 32768 | 40 | 0.765593 |
| 22 | blocks.22.hook_resid_post | 32768 | 40 | 0.741098 |
| 23 | blocks.23.hook_resid_post | 32768 | 40 | 0.729718 |
| 24 | blocks.24.hook_resid_post | 32768 | 40 | 0.754838 |

Standard Matryoshka SAEs

| layer | SAE ID | width | L0 | explained variance |
|-------|--------|-------|----|--------------------|
| 0 | blocks.0.hook_resid_post | 32768 | 40 | 0.91832 |
| 1 | blocks.1.hook_resid_post | 32768 | 40 | 0.863454 |
| 2 | blocks.2.hook_resid_post | 32768 | 40 | 0.841324 |
| 3 | blocks.3.hook_resid_post | 32768 | 40 | 0.814794 |
| 4 | blocks.4.hook_resid_post | 32768 | 40 | 0.820418 |
| 5 | blocks.5.hook_resid_post | 32768 | 40 | 0.796252 |
| 6 | blocks.6.hook_resid_post | 32768 | 40 | 0.797322 |
| 7 | blocks.7.hook_resid_post | 32768 | 40 | 0.787601 |
| 8 | blocks.8.hook_resid_post | 32768 | 40 | 0.779433 |
| 9 | blocks.9.hook_resid_post | 32768 | 40 | 0.75697 |
| 10 | blocks.10.hook_resid_post | 32768 | 40 | 0.745011 |
| 11 | blocks.11.hook_resid_post | 32768 | 40 | 0.732177 |
| 12 | blocks.12.hook_resid_post | 32768 | 40 | 0.726209 |
| 13 | blocks.13.hook_resid_post | 32768 | 40 | 0.719405 |
| 14 | blocks.14.hook_resid_post | 32768 | 40 | 0.719056 |
| 15 | blocks.15.hook_resid_post | 32768 | 40 | 0.756888 |
| 16 | blocks.16.hook_resid_post | 32768 | 40 | 0.742889 |
| 17 | blocks.17.hook_resid_post | 32768 | 40 | 0.757294 |
| 18 | blocks.18.hook_resid_post | 32768 | 40 | 0.76921 |
| 19 | blocks.19.hook_resid_post | 32768 | 40 | 0.766661 |
| 20 | blocks.20.hook_resid_post | 32768 | 40 | 0.760939 |
| 21 | blocks.21.hook_resid_post | 32768 | 40 | 0.759883 |
| 22 | blocks.22.hook_resid_post | 32768 | 40 | 0.740612 |
| 23 | blocks.23.hook_resid_post | 32768 | 40 | 0.729678 |
| 24 | blocks.24.hook_resid_post | 32768 | 40 | 0.747313 |

Gemma-2-9b

These SAEs have the release ID gemma-2-9b-res-matryoshka-dc.

| layer | SAE ID | width | L0 | explained variance |
|-------|--------|-------|----|--------------------|
| 0 | blocks.0.hook_resid_post | 32768 | 60 | 0.942129 |
| 1 | blocks.1.hook_resid_post | 32768 | 60 | 0.900656 |
| 2 | blocks.2.hook_resid_post | 32768 | 60 | 0.869154 |
| 3 | blocks.3.hook_resid_post | 32768 | 60 | 0.84077 |
| 4 | blocks.4.hook_resid_post | 32768 | 60 | 0.816605 |
| 5 | blocks.5.hook_resid_post | 32768 | 60 | 0.826656 |
| 6 | blocks.6.hook_resid_post | 32768 | 60 | 0.798281 |
| 7 | blocks.7.hook_resid_post | 32768 | 60 | 0.796018 |
| 8 | blocks.8.hook_resid_post | 32768 | 60 | 0.790385 |
| 9 | blocks.9.hook_resid_post | 32768 | 60 | 0.775052 |
| 10 | blocks.10.hook_resid_post | 32768 | 60 | 0.756327 |
| 12 | blocks.12.hook_resid_post | 32768 | 60 | 0.718319 |
| 13 | blocks.13.hook_resid_post | 32768 | 60 | 0.714065 |
| 14 | blocks.14.hook_resid_post | 32768 | 60 | 0.709635 |
| 15 | blocks.15.hook_resid_post | 32768 | 60 | 0.706622 |
| 16 | blocks.16.hook_resid_post | 32768 | 60 | 0.687879 |
| 17 | blocks.17.hook_resid_post | 32768 | 60 | 0.695821 |
| 18 | blocks.18.hook_resid_post | 32768 | 60 | 0.691723 |
| 19 | blocks.19.hook_resid_post | 32768 | 60 | 0.690914 |
| 20 | blocks.20.hook_resid_post | 32768 | 60 | 0.684599 |
| 21 | blocks.21.hook_resid_post | 32768 | 60 | 0.691355 |
| 22 | blocks.22.hook_resid_post | 32768 | 60 | 0.705531 |
| 23 | blocks.23.hook_resid_post | 32768 | 60 | 0.702293 |
| 24 | blocks.24.hook_resid_post | 32768 | 60 | 0.707655 |
| 25 | blocks.25.hook_resid_post | 32768 | 60 | 0.721022 |
| 26 | blocks.26.hook_resid_post | 32768 | 60 | 0.721717 |
| 27 | blocks.27.hook_resid_post | 32768 | 60 | 0.745809 |
| 28 | blocks.28.hook_resid_post | 32768 | 60 | 0.753267 |
| 29 | blocks.29.hook_resid_post | 32768 | 60 | 0.76466 |
| 30 | blocks.30.hook_resid_post | 32768 | 60 | 0.763025 |
| 31 | blocks.31.hook_resid_post | 32768 | 60 | 0.765932 |
| 32 | blocks.32.hook_resid_post | 32768 | 60 | 0.760822 |
| 33 | blocks.33.hook_resid_post | 32768 | 60 | 0.73323 |
| 34 | blocks.34.hook_resid_post | 32768 | 60 | 0.746912 |
| 35 | blocks.35.hook_resid_post | 32768 | 60 | 0.738031 |
| 36 | blocks.36.hook_resid_post | 32768 | 60 | 0.730805 |
| 37 | blocks.37.hook_resid_post | 32768 | 60 | 0.722875 |
| 38 | blocks.38.hook_resid_post | 32768 | 60 | 0.715494 |
| 39 | blocks.39.hook_resid_post | 32768 | 60 | 0.7044 |
| 40 | blocks.40.hook_resid_post | 32768 | 60 | 0.711277 |

Gemma-3-1b

These SAEs have the release ID gemma-3-1b-res-matryoshka-dc.

| layer | SAE ID | width | L0 | explained variance |
|-------|--------|-------|----|--------------------|
| 0 | blocks.0.hook_resid_post | 32768 | 40 | 0.99118 |
| 1 | blocks.1.hook_resid_post | 32768 | 40 | 0.985819 |
| 2 | blocks.2.hook_resid_post | 32768 | 40 | 0.981468 |
| 3 | blocks.3.hook_resid_post | 32768 | 40 | 0.979252 |
| 4 | blocks.4.hook_resid_post | 32768 | 40 | 0.973719 |
| 5 | blocks.5.hook_resid_post | 32768 | 40 | 0.977229 |
| 6 | blocks.6.hook_resid_post | 32768 | 40 | 0.982247 |
| 7 | blocks.7.hook_resid_post | 32768 | 40 | 0.989271 |
| 8 | blocks.8.hook_resid_post | 32768 | 40 | 0.985447 |
| 9 | blocks.9.hook_resid_post | 32768 | 40 | 0.985869 |
| 10 | blocks.10.hook_resid_post | 32768 | 40 | 0.98235 |
| 11 | blocks.11.hook_resid_post | 32768 | 40 | 0.980853 |
| 12 | blocks.12.hook_resid_post | 32768 | 40 | 0.977682 |
| 13 | blocks.13.hook_resid_post | 32768 | 40 | 0.969005 |
| 14 | blocks.14.hook_resid_post | 32768 | 40 | 0.956484 |
| 15 | blocks.15.hook_resid_post | 32768 | 40 | 0.937399 |
| 16 | blocks.16.hook_resid_post | 32768 | 40 | 0.928849 |
| 17 | blocks.17.hook_resid_post | 32768 | 40 | 0.912209 |
| 18 | blocks.18.hook_resid_post | 32768 | 40 | 0.904198 |
| 19 | blocks.19.hook_resid_post | 32768 | 40 | 0.895405 |
| 20 | blocks.20.hook_resid_post | 32768 | 40 | 0.883044 |
| 21 | blocks.21.hook_resid_post | 32768 | 40 | 0.868396 |
| 22 | blocks.22.hook_resid_post | 32768 | 40 | 0.831975 |
| 23 | blocks.23.hook_resid_post | 32768 | 40 | 0.793732 |
| 24 | blocks.24.hook_resid_post | 32768 | 40 | 0.7452 |
