SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

post by Can Rager, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks, Neel Nanda · 2024-12-11

This is a link post for https://www.neuronpedia.org/sae-bench/info

Contents

  TL;DR
  Introduction

Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

TL;DR

🔍 Explore the Benchmark & Rankings

📊 Evaluate your SAEs with SAEBench

✉️ Contact Us

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI interpretability. A lot of recent interpretability work has focused on studying and improving SAEs, e.g. the Gated SAE, TopK SAE, BatchTopK SAE, ProLU SAE, JumpReLU SAE, Layer Group SAE, Feature Choice SAE, Feature Aligned SAE, and Switch SAE. But how well do any of these improvements actually work?

The core challenge is that we don't know how to measure how good an SAE is. The fundamental premise of SAEs is that they are a useful interpretability tool that unpacks concepts from model activations. The lack of ground-truth labels for a model's internal features has led the field to measure and optimize the proxy of sparsity instead. This objective has successfully produced interpretable SAE latents. But sparsity has known problems as a proxy, such as feature absorption and composition of independent features. Yet most SAE improvement work merely measures whether reconstruction is improved at a given sparsity, potentially missing problems like uninterpretable high-frequency latents or increased composition.
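
For context, here is a minimal sketch of the standard sparsity-penalized SAE setup described above. The architecture, names, and hyperparameters are illustrative assumptions, not SAEBench code; the point is just that the training objective rewards reconstruction at a given sparsity, which is exactly the proxy being critiqued.

```python
# Minimal sketch of a standard sparsity-penalized SAE (illustrative, not SAEBench code).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Latent activations are encouraged to be sparse via an L1 penalty.
        latents = torch.relu(self.encoder(x))
        recon = self.decoder(latents)
        return recon, latents


def sae_loss(x, recon, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus the sparsity proxy discussed above:
    # lowering L1 makes latents sparser, but sparsity alone does not
    # guarantee interpretable, non-absorbing features.
    recon_loss = (recon - x).pow(2).mean()
    sparsity_loss = latents.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity_loss
```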

In the absence of a single, ideal metric, we argue that the best way to measure SAE quality is to give a more detailed picture with a range of diverse metrics. In particular, SAEs should be evaluated according to their performance on downstream tasks, a robust signal of usefulness.

Our comprehensive benchmark provides insight into fundamental questions about SAEs, such as what the ideal sparsity, training time, and other hyperparameters are. To showcase this, we've trained a custom suite of 200+ SAEs of varying dictionary size, sparsity, training time, and architecture (holding all else constant). Browse the evaluation results covering Pythia-70m and Gemma-2-2B on Neuronpedia.
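
As an illustration of what such a sweep looks like, the snippet below builds a grid of training configurations while holding everything else constant. The specific values and architecture names are hypothetical placeholders, not the exact grid used for our suite.

```python
# Illustrative sketch of a hyperparameter sweep for an SAE training suite.
# The values below are hypothetical, not the grid used for the SAEBench suite.
from itertools import product

dict_sizes = [4_096, 16_384, 65_536]
sparsity_penalties = [1e-4, 5e-4, 1e-3]
architectures = ["standard", "gated", "topk"]
training_tokens = [100_000_000, 500_000_000]

sweep = [
    {
        "dict_size": d,
        "l1_coeff": l1,
        "architecture": arch,
        "training_tokens": toks,
    }
    for d, l1, arch, toks in product(
        dict_sizes, sparsity_penalties, architectures, training_tokens
    )
]
print(f"{len(sweep)} SAE training configurations")
```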

SAEBench enables a range of use cases, such as measuring progress with new SAE architectures, revealing unintended SAE behavior, tuning training hyperparameters, and selecting the best SAE for a particular task. We find that these evaluation results are nuanced and there is no one ideal SAE configuration - instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.

We are releasing a beta version of SAEBench, including a convenient demonstration notebook that evaluates custom SAEs on multiple benchmarks and plots the results. Our flexible codebase allows you to easily add your own evaluations.
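
To give a feel for the workflow, the sketch below shows one way to run a custom SAE through several evaluations and collect per-eval metrics. The function names and toy evals are hypothetical placeholders, not the real SAEBench interface; see the repository and demonstration notebook for the actual API.

```python
# Hypothetical sketch of a multi-eval workflow (NOT the real SAEBench API;
# see the demonstration notebook in the repo for the actual interface).
# Each eval is a callable that returns a dict of metrics rather than a
# single score, reflecting that SAE quality cannot be reduced to one number.
from typing import Callable, Dict

EvalFn = Callable[[object], Dict[str, float]]


def evaluate_sae(sae, evals: Dict[str, EvalFn]) -> Dict[str, Dict[str, float]]:
    """Run a custom SAE through several evaluations and collect per-eval metrics."""
    return {name: eval_fn(sae) for name, eval_fn in evals.items()}


if __name__ == "__main__":
    # Toy stand-in evals with dummy values; real evals would load models and data.
    toy_evals: Dict[str, EvalFn] = {
        "reconstruction": lambda sae: {"mse": 0.0},
        "sparse_probing": lambda sae: {"top_1_accuracy": 0.0},
    }
    print(evaluate_sae(sae=None, evals=toy_evals))
```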


Check out the original post with interactive plots for more details on metrics and takeaways!
