Posts

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs 2024-08-23T18:52:31.019Z

Comments

Comment by Kola Ayonrinde (kola-ayonrinde) on Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs · 2024-09-19T20:35:38.920Z · LW · GW

Yeah, we hope others take on this approach too!

have you considered quantizing different features’ activations differently?

Stay tuned for our upcoming work 👀

do the rate-distortion curves of different SAEs intersect? I.e. is it the case that some SAE A achieves a lower loss than SAE B at a low bitrate, but then at a high bitrate, SAE B is better than SAE A? If so, then this might suggest a way to infer hierarchies of features from a set of SAEs: use SAE A to get low-resolution information about your input, and then use SAE B for the high-res detailed information.

This is an interesting perspective. My initial hypothesis, before reading your comment, was that allowing variable bitrates within a single SAE would get around this issue, but I agree this would be interesting to test and is something we should definitely check!

With the constant bit-rate version, I do expect that we would see something like this, though we haven't rigorously tested that hypothesis.
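A minimal sketch of how one might test for such a crossover, given sampled rate-distortion points for two SAEs (all names and numbers here are fabricated for illustration, not from our experiments):

```python
def crossover_bitrates(bitrates, losses_a, losses_b):
    """Return the bitrates at which the sign of (loss_a - loss_b) flips,
    i.e. where one SAE's rate-distortion curve overtakes the other's.
    Assumes the loss lists are sampled at the same, sorted bitrates."""
    crossings = []
    prev_a_worse = None
    for rate, loss_a, loss_b in zip(bitrates, losses_a, losses_b):
        a_worse = loss_a > loss_b
        if prev_a_worse is not None and a_worse != prev_a_worse:
            crossings.append(rate)
        prev_a_worse = a_worse
    return crossings

# Fabricated rate-distortion samples for two hypothetical SAEs:
bitrates = [8, 16, 32, 64, 128]
losses_a = [0.50, 0.30, 0.22, 0.18, 0.16]  # SAE A: better at low bitrate
losses_b = [0.60, 0.35, 0.20, 0.14, 0.11]  # SAE B: better at high bitrate

print(crossover_bitrates(bitrates, losses_a, losses_b))  # -> [32]
```

If the curves do cross, the crossover bitrate would mark where you'd want to hand off from the coarse SAE to the fine one in the hierarchical scheme you describe.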

I know that others are keen to have a suite of SAEs at different resolutions. My (possibly controversial) instinct is that we should be looking for a single SAE which appropriately captures the properties we want; if we want something more coarse-grained for a different level of analysis, perhaps we should switch to Representation Engineering, or work at an even coarser grain, e.g. at the level of attention heads. Perhaps SAEs don't have to be all things to all people! I'd be interested to hear any opposing views that we really might want many SAEs at different resolutions, though*


Thanks for your questions and thoughts! We're really interested in pushing this further and hope to have some follow-up work out in the not-too-distant future.


EDIT: *I suspect some of the reason that people want SAEs at different levels is that they accept undesirable feature splitting as a fact of life, and so want to be able to zoom in and out of features which may not be "atomic". I'm hoping that if we can address the feature-splitting problem, that reason will have somewhat less pull.