Sparse Autoencoder Features for Classifications and Transferability

post by Shan23Chen (shan-chen) · 2025-02-18T22:14:12.994Z · LW · GW · 0 comments

This is a link post for https://arxiv.org/abs/2502.11367


A few months ago, we explored whether Sparse Autoencoder (SAE) features from a base model remained meaningful when transferred to a multimodal system—specifically, LLaVA—in our preliminary post Are SAE Features from the Base Model still meaningful to LLaVA? [LW · GW]. Today, I'm excited to share how that initial work has evolved into our new arXiv paper, Sparse Autoencoder Features for Classifications and Transferability.

 

Our study makes three key contributions to interpretable AI and feature extraction in Large Language Models (LLMs). First, it establishes classification benchmarks, introducing a robust methodology for evaluating and selecting Sparse Autoencoder (SAE) features in safety-critical classification tasks and demonstrating their superior performance over traditional baselines. Second, it provides a multilingual transfer analysis, examining the cross-lingual transferability of SAE features in multilingual toxicity detection; the results show that SAE features outperform all in-domain methods and generalize promisingly across languages. Finally, it extends behavioral analysis and model oversight by exploring whether LLMs can predict their own correctness and that of larger models, underscoring the potential for scalable oversight mechanisms in AI systems. Together, these contributions advance the understanding of SAE-based feature extraction and support its deployment in transparent, interpretable, high-stakes AI applications.
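To make the feature-selection idea concrete, here is a minimal sketch of one plausible approach: score each SAE feature by the absolute difference in mean activation between the two classes, keep the top-k, and fit a linear probe on just those features. All shapes, the planted toy signal, and the mean-difference scoring rule are assumptions for illustration, not necessarily the selection procedure used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy SAE feature matrix: 300 texts x 128 features (hypothetical sizes).
n, d_sae, k = 300, 128, 16
y = rng.integers(0, 2, size=n)
X = np.maximum(rng.normal(size=(n, d_sae)), 0.0)  # SAE activations are non-negative
X[:, :8] += y[:, None] * 1.0  # plant a class signal in the first 8 features

# Score features by absolute mean-activation difference between classes,
# keep the top-k, and fit a small linear probe on only those features.
scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
top_k = np.argsort(scores)[-k:]
probe = LogisticRegression(max_iter=1000).fit(X[:, top_k], y)
print(f"probe accuracy on {k} selected features: {probe.score(X[:, top_k], y):.2f}")
```

Restricting the probe to a handful of high-scoring features is what keeps the resulting classifier inspectable: each retained feature can be looked up and interpreted individually.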

 

Overall, an SAE can be seen as a large adapter on top of the existing residual stream. Classifiers built on SAE features can outperform those built on the raw residual stream, while also being potentially more interpretable, since each weight is tied to a named SAE feature. We also found a simple yet effective way to represent chunks of text using SAE features, which performs well at least on classification tasks. We are excited to share our results and to hear feedback from the community!
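The pipeline described above can be sketched end to end: encode per-token residual-stream activations with an SAE, pool the sparse feature activations over the tokens of each text chunk, and train a linear classifier on the pooled vectors. The SAE weights, the mean pooling, and all shapes below are stand-ins for illustration; the paper's actual encoder, layer choice, and pooling scheme may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical sizes: residual width, SAE width, number of texts, tokens per text.
d_model, d_sae, n_texts, n_tokens = 64, 256, 200, 12

# Stand-in SAE encoder: ReLU(x @ W_enc + b_enc) gives sparse feature activations.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(d_sae)  # negative bias encourages sparsity

def sae_encode(resid):  # resid: (n_tokens, d_model)
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def chunk_features(resid_stream):
    # Represent a chunk of text by pooling SAE features over its tokens
    # (mean pooling here is an assumption, not the paper's stated choice).
    return sae_encode(resid_stream).mean(axis=0)

# Toy data: residual activations for 200 "texts", with a planted class signal.
labels = rng.integers(0, 2, size=n_texts)
X = np.stack([
    chunk_features(rng.normal(size=(n_tokens, d_model)) + labels[i] * 0.5)
    for i in range(n_texts)
])

# Linear probe on pooled SAE features: each weight corresponds to one SAE
# feature, which is what makes this classifier easier to inspect than a
# probe trained directly on the residual stream.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(f"train accuracy: {clf.score(X, labels):.2f}")
```

The "big adapter" framing shows up directly in the code: the SAE sits between the frozen residual stream and the probe, re-expressing the same activations in an overcomplete, sparse, and hopefully more interpretable basis.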
