SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

post by Can Rager, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks, Neel Nanda · 2024-12-11

This is a link post for https://www.neuronpedia.org/sae-bench/info

Contents

  TL;DR
  Introduction

Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

TL;DR

🔍 Explore the Benchmark & Rankings

📊 Evaluate your SAEs with SAEBench

✉️ Contact Us

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI interpretability. A lot of recent interpretability work has focused on studying and improving SAEs, e.g. the Gated SAE, TopK SAE, BatchTopK SAE, ProLU SAE, JumpReLU SAE, Layer Group SAE, Feature Choice SAE, Feature Aligned SAE, and Switch SAE. But how well do any of these improvements actually work?

The core challenge is that we don't know how to measure how good an SAE is. The fundamental premise of SAEs is that they are a useful interpretability tool that unpacks concepts from model activations. The lack of ground-truth labels for a model's internal features has led the field to measure and optimize the proxy of sparsity instead. This objective has successfully produced interpretable SAE latents. But sparsity has known problems as a proxy, such as feature absorption and composition of independent features. Yet most SAE improvement work merely measures whether reconstruction is improved at a given sparsity, potentially missing problems like uninterpretable high-frequency latents or increased composition.
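
For context, here is a minimal sketch of the standard sparsity-penalized SAE setup described above. The architecture, names, and hyperparameters are illustrative assumptions, not SAEBench code; the point is just that the training objective rewards reconstruction at a given sparsity, which is exactly the proxy being critiqued.

```python
# Minimal sketch of a standard sparsity-penalized SAE (illustrative, not SAEBench code).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Latent activations are encouraged to be sparse via an L1 penalty.
        latents = torch.relu(self.encoder(x))
        recon = self.decoder(latents)
        return recon, latents


def sae_loss(x, recon, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus the sparsity proxy discussed above:
    # lowering L1 makes latents sparser, but sparsity alone does not
    # guarantee interpretable, non-absorbing features.
    recon_loss = (recon - x).pow(2).mean()
    sparsity_loss = latents.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity_loss
```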

In the absence of a single, ideal metric, we argue that the best way to measure SAE quality is to give a more detailed picture with a range of diverse metrics. In particular, SAEs should be evaluated according to their performance on downstream tasks, a robust signal of usefulness.

Our comprehensive benchmark provides insight into fundamental questions about SAEs, such as what the ideal sparsity, training time, and other hyperparameters are. To showcase this, we've trained a custom suite of 200+ SAEs of varying dictionary size, sparsity, training time, and architecture (holding all else constant). Browse the evaluation results covering Pythia-70m and Gemma-2-2B on Neuronpedia.
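
As an illustration of what such a sweep looks like, the snippet below builds a grid of training configurations while holding everything else constant. The specific values and architecture names are hypothetical placeholders, not the exact grid used for our suite.

```python
# Illustrative sketch of a hyperparameter sweep for an SAE training suite.
# The values below are hypothetical, not the grid used for the SAEBench suite.
from itertools import product

dict_sizes = [4_096, 16_384, 65_536]
sparsity_penalties = [1e-4, 5e-4, 1e-3]
architectures = ["standard", "gated", "topk"]
training_tokens = [100_000_000, 500_000_000]

sweep = [
    {
        "dict_size": d,
        "l1_coeff": l1,
        "architecture": arch,
        "training_tokens": toks,
    }
    for d, l1, arch, toks in product(
        dict_sizes, sparsity_penalties, architectures, training_tokens
    )
]
print(f"{len(sweep)} SAE training configurations")
```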

SAEBench enables a range of use cases, such as measuring progress with new SAE architectures, revealing unintended SAE behavior, tuning training hyperparameters, and selecting the best SAE for a particular task. We find that these evaluation results are nuanced and there is no one ideal SAE configuration - instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.

We are releasing a beta version of SAEBench, including a convenient demonstration notebook that evaluates custom SAEs on multiple benchmarks and plots the results. Our flexible codebase allows you to easily add your own evaluations.
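
To give a feel for the workflow, the sketch below shows one way to run a custom SAE through several evaluations and collect per-eval metrics. The function names and toy evals are hypothetical placeholders, not the real SAEBench interface; see the repository and demonstration notebook for the actual API.

```python
# Hypothetical sketch of a multi-eval workflow (NOT the real SAEBench API;
# see the demonstration notebook in the repo for the actual interface).
# Each eval is a callable that returns a dict of metrics rather than a
# single score, reflecting that SAE quality cannot be reduced to one number.
from typing import Callable, Dict

EvalFn = Callable[[object], Dict[str, float]]


def evaluate_sae(sae, evals: Dict[str, EvalFn]) -> Dict[str, Dict[str, float]]:
    """Run a custom SAE through several evaluations and collect per-eval metrics."""
    return {name: eval_fn(sae) for name, eval_fn in evals.items()}


if __name__ == "__main__":
    # Toy stand-in evals with dummy values; real evals would load models and data.
    toy_evals: Dict[str, EvalFn] = {
        "reconstruction": lambda sae: {"mse": 0.0},
        "sparse_probing": lambda sae: {"top_1_accuracy": 0.0},
    }
    print(evaluate_sae(sae=None, evals=toy_evals))
```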


Check out the original post with interactive plots for more details on metrics and takeaways!
