Useful starting code for interpretability

eggsyntax

post by eggsyntax · 2024-02-13T23:13:47.940Z · LW · GW · 2 comments

Want to try your hand at neural network interpretability? A very nice way to get started is to find an existing Python notebook using one or more interpretability techniques, hopefully one written with beginners in mind. In a click or two you can make a copy of it, which you can typically run without any modification, and then start tweaking it to look at what you're interested in.

Fortunately, many such notebooks already exist, thanks to helpful members of the interp community! This post is just a list of them, mostly Colab notebooks. I have no personal experience with many of them, but all have been recommended by people who know what they're doing. This list will probably stay acceptably current through late 2024 or so; after that, look for a more up-to-date resource (if one had existed when I wrote this, I would have used it instead of writing my own, so there may or may not be one by then).

Suggestions for similarly useful starter notebooks, including ones for other areas, are very welcome!

Notebooks for understanding machine learning (as background): Transformers From Scratch, some other ML technique notebooks, reinforcement learning.

The main list is in no particular order, so there's no need to go top to bottom.

@Neel Nanda's exploratory analysis demo for TransformerLens walks you through many of the basic mech interp techniques and is highly recommended; he has others as well.
Another intro to mech interp from ARENA, along with several other excellent notebooks reproducing some important mech interp results:
Two activation steering notebooks, based on "Steering GPT-2-XL by adding an activation vector" (bonus: several different implementations from @Annah) (extra bonus: quick and dirty representation engineering on Mistral).
Developmental interpretability and singular learning theory notebooks, from @Jesse Hoogland.
A smallish notebook on using the tuned lens technique (successor to the logit lens).
Mech interp on Mamba using nnsight.
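
To give a feel for what the activation steering notebooks above are doing, here is a minimal sketch of the core idea: record activations at one layer for two contrasting inputs, take the scaled difference, and add it back in at that layer on a later forward pass. This is not code from any of the linked notebooks. The "model" is a toy stack of random residual layers in plain Python rather than a transformer, and every name in it (`make_layers`, `forward`, `steering_vector_from`, `LAYER`, `COEFF`) is made up for illustration.

```python
import math
import random


def make_layers(n_layers, dim, seed=0):
    """Build random dim x dim weight matrices, one per toy layer."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        for _ in range(n_layers)
    ]


def apply_layer(weights, x):
    """Residual-style update: x + tanh(W @ x)."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def forward(layers, x, steer_layer=None, steering_vector=None):
    """Run x through the layers.

    Returns (output, cache), where cache[i] is the input to layer i,
    recorded after any steering vector has been added at that layer.
    """
    cache = []
    for i, weights in enumerate(layers):
        if steer_layer == i and steering_vector is not None:
            x = [xi + si for xi, si in zip(x, steering_vector)]
        cache.append(list(x))
        x = apply_layer(weights, x)
    return x, cache


def steering_vector_from(layers, x_plus, x_minus, layer, coeff):
    """Scaled difference between the two inputs' activations at `layer`."""
    _, cache_plus = forward(layers, x_plus)
    _, cache_minus = forward(layers, x_minus)
    return [coeff * (p - m) for p, m in zip(cache_plus[layer], cache_minus[layer])]


# Toy stand-ins for a model, two contrasting prompts, and the prompt to steer.
layers = make_layers(n_layers=4, dim=3)
x_plus, x_minus, x_prompt = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.2, 0.2, 0.2]
LAYER, COEFF = 2, 5.0

vec = steering_vector_from(layers, x_plus, x_minus, LAYER, COEFF)
baseline_out, baseline_cache = forward(layers, x_prompt)
steered_out, steered_cache = forward(
    layers, x_prompt, steer_layer=LAYER, steering_vector=vec
)

print("baseline:", baseline_out)
print("steered: ", steered_out)
```

In a real notebook the same two steps (cache activations, then add a vector during the forward pass) are typically done with hooks on a transformer's residual stream, and the choice of layer and coefficient is something you tune by looking at the outputs.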

Thanks to @Jesse Hoogland and @CallumMcDougall for extremely useful input!

2 comments

comment by Neel Nanda (neel-nanda-1) · 2024-02-14T00:19:59.458Z · LW(p) · GW(p)

This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks: my guess is that many readers won't click through to the link, and are more likely to if they see the different names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list.

↑ comment by eggsyntax · 2024-02-14T03:58:44.046Z · LW(p) · GW(p)

That seems reasonable! When I get a minute I'll list out the individual ARENA notebooks and give them more emphasis. (I did personally really like that exploratory analysis demo because of how well it situates the techniques in the context of a concrete problem. Maybe the ARENA version does too; I haven't gone through it.)

[EDIT - done]
