SAE-VIS: Announcement Post

post by CallumMcDougall (TheMcDouglas), Joseph Bloom (Jbloom) · 2024-03-31T15:30:49.079Z · 8 comments

Contents

  Summary
  Other links
8 comments

This post officially announces the sae-vis library, which is designed to create sparse autoencoder (SAE) feature dashboards like those from Anthropic's research.

Summary

This library supports two types of visualisation: feature-centric and prompt-centric.

The feature-centric vis is the standard format from Anthropic’s post; it looks like the image below. You can navigate between features via a dropdown in the top left.

You can see an interactive version in the GitHub repo, in the file _feature_vis_demo.html.
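At its core, a feature-centric dashboard answers "where in the data does this feature fire most strongly?". The snippet below is a minimal, hypothetical sketch of that one step, assuming you already have a single feature's activation on every token of a small dataset (as nested Python lists). The function name and the `window` parameter are illustrative only and are not part of the sae-vis API; the real dashboards show much more (activation histograms, logit effects, and so on).

```python
from typing import List, Tuple


def top_activating_examples(
    acts: List[List[float]],  # acts[seq][pos]: one feature's activation on each token (assumed input)
    k: int = 3,
    window: int = 2,  # hypothetical: how many tokens of context to show on each side
) -> List[Tuple[int, int, float, Tuple[int, int]]]:
    """Return the k highest-activating (seq, pos) tokens, each with a [start, end) context window."""
    # Flatten to (activation, sequence index, position) triples.
    flat = [(act, s, p) for s, seq in enumerate(acts) for p, act in enumerate(seq)]
    # Sort by activation, largest first (Python's sort is stable, so ties keep dataset order).
    flat.sort(key=lambda t: t[0], reverse=True)
    results = []
    for act, s, p in flat[:k]:
        # Clip the context window to the bounds of the sequence.
        start = max(0, p - window)
        end = min(len(acts[s]), p + window + 1)
        results.append((s, p, act, (start, end)))
    return results


acts = [[0.0, 0.5, 2.0, 0.1], [1.5, 0.0, 0.0, 3.0]]
print(top_activating_examples(acts, k=2, window=1))
```

Here the strongest activation (3.0) is the last token of the second sequence, so its context window is clipped at the sequence end.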

The prompt-centric vis is centred on a single user-supplied prompt rather than a single feature. It shows the features that score highest on that prompt, according to a variety of metrics. It looks like the image below. A dropdown in the top left lets you switch between metrics and between tokens in your prompt.

You can see an interactive version in the GitHub repo, in the file _prompt_vis_demo.html.
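To make "features which score highest on that prompt, according to different metrics" concrete, here is a small hypothetical sketch. It assumes you have each feature's activation at each token of the prompt, picks one token position, and ranks features by either their raw activation or their activation as a fraction of that feature's maximum over some dataset. Both metric names (`raw_activation`, `fraction_of_max`) and the `dataset_max` input are illustrative assumptions, not the metrics or API that sae-vis actually uses.

```python
from typing import List, Optional


def rank_features_on_prompt(
    prompt_acts: List[List[float]],  # prompt_acts[pos][feature]: activations on the prompt (assumed input)
    pos: int,  # which token of the prompt to inspect
    metric: str = "raw_activation",  # hypothetical metric name
    dataset_max: Optional[List[float]] = None,  # hypothetical: per-feature max activation over a dataset
    k: int = 3,
) -> List[int]:
    """Return the indices of the k highest-scoring features at token `pos`."""
    row = prompt_acts[pos]
    if metric == "raw_activation":
        scores = row
    elif metric == "fraction_of_max":
        if dataset_max is None:
            raise ValueError("fraction_of_max requires dataset_max")
        # Normalise each activation by that feature's dataset maximum (0 if the feature never fires).
        scores = [a / m if m > 0 else 0.0 for a, m in zip(row, dataset_max)]
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Sort feature indices by score, highest first.
    order = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return order[:k]


prompt_acts = [[0.0, 1.0, 0.2], [2.0, 0.5, 0.4]]
print(rank_features_on_prompt(prompt_acts, pos=1, k=2))
print(rank_features_on_prompt(prompt_acts, pos=1, metric="fraction_of_max", dataset_max=[10.0, 1.0, 0.5], k=2))
```

The two calls rank the same token differently, which is why it is useful to be able to switch between metrics: a feature with a small absolute activation can still be firing close to its maximum.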


Other links

Here are some more useful links:

You might also be interested in reading about Neuronpedia, which uses this library for its visualisations.

If you're interested in getting involved, please reach out to me or Joseph Bloom! We will also be publishing a post tomorrow, discussing some of the features we've discovered during our research.

8 comments


comment by Neel Nanda (neel-nanda-1) · 2024-03-31T15:38:16.690Z

Thanks for open sourcing this! We've already been finding it really useful on the DeepMind mech interp team, and it's saved us the effort of writing our own :)

Replies from: TheMcDouglas
comment by CallumMcDougall (TheMcDouglas) · 2024-03-31T15:44:02.284Z

Thanks so much, really glad to hear it's been helpful!

comment by Connor Kissane (ckkissane) · 2024-03-31T16:35:06.004Z

Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!

Replies from: TheMcDouglas
comment by CallumMcDougall (TheMcDouglas) · 2024-04-01T11:09:28.966Z

Thanks so much! (-:

comment by Johnny Lin (hijohnnylin) · 2024-03-31T18:45:48.893Z

Thanks Callum, and yep, we've been using SAE-Vis extensively at Neuronpedia. It's been extremely helpful for generating dashboards, and it's very well maintained. We'll soon be releasing a way to import SAE-Vis exports directly into Neuronpedia.

Replies from: TheMcDouglas
comment by CallumMcDougall (TheMcDouglas) · 2024-04-05T13:26:51.747Z

Thanks!! Really appreciate it

comment by Jonas Kgomo (jonas-kgomo) · 2024-03-31T18:25:35.792Z

Is this something that could work as a hosted web version (e.g. via npm i or an API)?

Replies from: Jbloom
comment by Joseph Bloom (Jbloom) · 2024-03-31T18:39:24.214Z

I'm a little confused by this question. What are you proposing?