[Summary] Progress Update #1 from the GDM Mech Interp Team

post by Neel Nanda (neel-nanda-1), Arthur Conmy (arthur-conmy), lewis smith (lsgos), Senthooran Rajamanoharan (SenR), Tom Lieberum (Frederik), János Kramár (janos-kramar), Vikrant Varma (amrav) · 2024-04-19

Contents

  Introduction
  Summaries

Introduction

This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team’s excellent monthly updates! Our goal was to write up a series of snippets covering a range of things that we thought would be interesting to the broader community, but that didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results.

Our team’s two main current goals are to scale sparse autoencoders (SAEs) to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One exception is our infrastructure snippet, which we think could be useful to mechanistic interpretability researchers more broadly. We present preliminary results in a range of areas to do with SAEs, from improving and interpreting steering vectors, to improving ghost grads, to replacing SAE encoders with an inference-time sparse approximation algorithm.

Where possible, we’ve tried to clearly state our level of confidence in our results, and the evidence that led us to these conclusions, so you can evaluate them for yourself. We expect to be wrong about at least some of the things in here! Please take this in the spirit of an interesting idea shared by a colleague at a lab meeting, rather than as polished pieces of research we’re willing to stake our reputation on. We hope to turn some of the more promising snippets into more fleshed-out and rigorous papers at a later date.

We also have a forthcoming paper on an updated SAE architecture that seems to be a moderate Pareto improvement; stay tuned!

How to read this post: This is a short summary post, accompanying the much longer post with all the snippets. We recommend reading the summaries of each snippet below, and then zooming in to whichever snippets seem most interesting to you. They can be read in any order.

Summaries
