Posts
Comments
Makes sense and totally agree, thanks!
Excellent work and I think you raise a lot of really good points, which help clarify for me why this research agenda is running into issues, and I think ties in to my concerns about activation space work engendered by recent success in latent obfuscation (https://arxiv.org/abs/2412.09565v1).
In a way that does not affect the larger point, I think that your framing of the problem of extracting composed features may be slightly too strong: in a subset of cases, e.g. if there is a hierarchical relationship between features (https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in) SAEs might be able to pull out groups of latents that act compositionally (https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and). The relationship to any underlying model compositional encoding is unclear, this probably only works in a few cases, and generally does not seem like a scalable approach, but I think that SAEs may be doing something more complex/weirder than only finding composed features.
I agree that comparing tied and untied SAE might be a good way to separate cases where the underlying features are inherently co-occurring. I have wondered if this might lead to a way to better understand the structure of how the model makes decisions, similar to the work of Adam Shai (https://arxiv.org/abs/2405.15943). It may be that cases where the tied SAE has to just not represent a feature, are a good way of detecting inherently hierarchical features (to work out if something is an apple you first decide if it is a fruit for example), if LLM learn to think that way.
I think what you say about clustering of activation densities makes sense, though in the case of Gemma I think the JumpReLU might need to be corrected for to 'align' them.
In terms of classifying 'uncertainty' vs 'compositional' cases of co-occurrence, I believe there is a in the graph structure of what features co-occured with one another, but have not yet nailed down how much structure implies function and vice-versa.
Compositionality seemed to correlate with a 'hub and spoke' type of structure (see here, top left panel: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740 and https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740 .
We also found a cluster in layer 18 that mirrors the first example above in layer 12 of Gemma-2-2b. It has worse compostional encoding, but also a slightly less hub-like structure: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201.
For ambiguity, we normally see a close to fully connected graph e.g. https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201
This is clearly not perfect, as https://feature-cooccurrence.streamlit.app/?model=gpt2-small&subgraph=125&sae_release=res-jb-feature-splitting&sae_id=blocks_8_hook_resid_pre_24576&size=5&point_x=-31.171305&point_y=-6.12931 does not fit this pattern, looking like there is a compositional encoding of position of a token in the url, but the graph is not in the hub/spoke pattern.
Nevertheless, I think this points to a way that could likely quantify composition vs ambiguity.
Regarding non-linear / circular projections my intuition is that this goes hand in hand with compositionality, but I would not say this for certain.
But trying to nail down the relation between the co-occurrence graph structure and the type of co-occurrence is certainly something I would like to look into further.
Fascinating post! I (along with Hardik Bhatnagar and Joseph Bloom) recently completed a profile of cases of SAE latent co-occurrence in GPT2-small and Gemma-2-2b (see here) and I think that this is a really compelling driver for a lot of the behaviour that we see, such as the link to SAE width. In particular, we observe a lot of cases with apparent parent-child relations between the latents (e.g. here).
We also see a similar 'splitting' of activation strength in cases of composition e.g. we find a case where the child latents are all days of the week (e.g. 'Monday'), but the activation (of lack thereof) of the parent latent corresponds to whether there is a space in the token (e.g. ' Monday') (see here). When the parent and child are active, both have roughly half the activation strength of the child when it is active alone, which I think is similar to what you observe, although made more complex because we do not know the underlying true features in this case. If this holds in general, perhaps it would be possible to improve your method for preventing co-occurrence/absorption by looking not only for cases of splits in the activation density, but for the activation strengths between pairs of features being strongly coupled/proportional in such a manner?
How would your approach handle techniques to obfuscate latents and thus frustrate SAEs e.g. https://arxiv.org/html/2412.09565v1 ?
PIBBSS was a fantastic experience, I highly recommend people apply to the 2025 Fellowship! Huge thanks to the whole team and especially my mentor Joseph Bloom!