Concrete Steps to Get Started in Transformer Mechanistic Interpretability

post by Neel Nanda (neel-nanda-1) · 2022-12-25T22:21:49.686Z · LW · GW · 7 comments

This is a link post for www.neelnanda.io/mechanistic-interpretability/getting-started

Contents

  Introduction
  Defining “Decent Baseline”
  Getting the Fundamentals
  Paths for Further Exploration
    Explore and Build On A Paper
    Work on a Concrete Problem
    Read Around the Field (i.e. Do a Lit Review)
  Appendix
    Advice on compute
    Other Resources

Disclaimer: This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field and there just aren't many people making educational resources tailored to it. Some links are to collations of other people's work, and I link to more in the appendix.

Introduction

Feel free to just skip the intro and read the concrete steps

The point of this post is to give concrete steps for how to get a decent level of baseline knowledge for transformer mechanistic interpretability (MI). I try to give concrete, actionable, goal-oriented advice that I think is enough to get decent outcomes - please take this as a starting point and deviate if something feels like a better fit for your background and precise goals!

A core belief I have about learning mechanistic interpretability is that you should spend at least a third of your time writing code and playing around with model internals, not just reading papers. MI has great feedback loops, and a large component of the skillset is the practical, empirical skill of being able to write and run experiments easily. Unlike normal machine learning, once you have the basics of MI down, you should be able to run simple experiments on small models within minutes, not hours or days. Further, because the feedback loops are so tight, I don’t think there’s a sharp boundary between reading and doing research. If you want to deeply engage with a paper, you should be playing with the model studied, and testing the paper’s basic claims. 
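To make this concrete: with a library like TransformerLens installed (pip install transformer_lens), loading a small model and checking a basic prediction takes only a few lines. A minimal sketch - the prompt is just an arbitrary example:

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)                      # inference only, no gradients needed
model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "The Eiffel Tower is in the city of"
logits = model(prompt)                             # [batch, position, d_vocab]
next_token = logits[0, -1].argmax().item()         # the model's top guess for the next token
print(model.to_string(next_token))                 # likely " Paris"
```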

The intended audience is people who are new-ish to Mechanistic Interpretability but know they want to learn about it - if you have no idea what MI is, check out this Mech Interp Explainer, Circuits: Zoom In, or Chris Olah’s overview post [LW · GW].

Defining “Decent Baseline”

Here’s an outline of what I mean when I say “a decent level of baseline knowledge”

Getting the Fundamentals

A set of goal-oriented steps that I think are important for getting the fundamental skills. If you feel comfortable meeting the success criteria of a step, feel free to skip it.

Callum McDougall has made a great set of tutorials for mechanistic interpretability and TransformerLens, with exercises, solutions and beautiful diagrams. This section is in large part an annotated guide to those!

  1. Learn general Machine Learning prerequisites: There's a certain baseline level of understanding about ML in general that's important context. It's also important to be familiar with an ML framework like PyTorch to actually write code, and to help ground your knowledge.
    • Resource: Read the Barebones Guide to Mechanistic Interpretability Prerequisites, and learn the pre-reqs you're missing (see step 2 for more on transformers)
    • Success criteria: Write and train an MLP in PyTorch to solve MNIST (a minimal sketch follows after the tips below)
      • I have deliberately set this success criterion to emphasise that your goal here is to get enough intuition and context that you can learn more. The goal here is not to gain deep knowledge, or to understand all of the sub-fields - ML contains a ton of niches, and this is not on the critical path to exploring MI! 
    • Tips:
      • Use PyTorch over other frameworks (JAX, TensorFlow, etc.) if you’re doing MI. It’s best to use einops and einsum for tensor manipulation - without them it’s easy to introduce subtle bugs.
      • It's very easy to overestimate how much you need to learn general pre-reqs - when in doubt, move on to MI, and go back when you notice something missing! 
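As a reference point for the MNIST success criterion above, a minimal version might look something like the sketch below - one reasonable approach, not the only one (it assumes torchvision is installed for the dataset):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A small MLP: flatten 28x28 images, one hidden layer, 10 output logits.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

train_data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for images, labels in train_loader:
        loss = loss_fn(model(images), labels)   # forward pass + cross-entropy loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```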
  2. Deeply understand transformers: A key skill in MI is having a gears-level model of the transformer in your head - what the parameters and activations are and mean, what all of the moving parts are, and how they fit together. You want to build some intuition for what kinds of algorithms a transformer can and cannot implement (the sketch of a single attention head below illustrates the level of detail to aim for). 
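For a flavour of what a gears-level model involves, here is a minimal sketch of the core computation of a single causal attention head, written with einops and the weight names used in much of the MI literature (W_Q, W_K, W_V, W_O). The function name and shapes are illustrative; a real transformer also has batching, multiple heads, biases, layer norm and MLP layers:

```python
import torch
import einops

def attention_head(resid, W_Q, W_K, W_V, W_O):
    """One causal attention head.
    resid: [seq, d_model] residual stream; W_Q, W_K, W_V: [d_model, d_head]; W_O: [d_head, d_model]."""
    q, k, v = resid @ W_Q, resid @ W_K, resid @ W_V                     # each [seq, d_head]
    scores = einops.einsum(q, k, "q_pos d_head, k_pos d_head -> q_pos k_pos")
    scores = scores / W_Q.shape[-1] ** 0.5                              # scale by sqrt(d_head)
    # Causal mask: a query position can only attend to itself and earlier positions.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    pattern = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)   # each row sums to 1
    z = einops.einsum(pattern, v, "q_pos k_pos, k_pos d_head -> q_pos d_head")
    return z @ W_O                                                      # [seq, d_model], added to the residual stream
```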
  3. Familiarise yourself with Mechanistic Interpretability (MI) Tooling and basics: You want to be able to easily write and run experiments as you learn and explore MI, so it's worth some up front investment to learn what's out there. The goal here is familiarity with the basics, not deep expertise - the best way to really learn tooling is by using it in practice to solve problems that you care about.
    • Resource: Callum McDougall’s tutorials for TransformerLens + Induction Heads
      • If you’re new to ML coding, I recommend doing your work in a Colab Notebook (with a GPU) unless you’re confident you know what you’re doing. You don’t want to be wasting time setting up infrastructure!
    • Success criteria: Work your way through the tutorial, compare your work to the answers, and check you understand 80% of the solutions
      • A core goal here is to learn how to use the TransformerLens library for doing mechanistic interpretability of GPT-style language models. 
        • Bonus: Use TransformerLens to load GPT-2 Small in a Colab notebook, run the model, and visualise the activations (see the sketch at the end of this step).
    • Bonus: Data visualization: A core skill in MI is being good at visualizing data. Neural networks are high dimensional objects, and you need to be able to understand what's going on! My research workflow looks like running an experiment, visualizing the data, staring at the data, being confused, forming more hypotheses, and iterating. Plot data often, and in a diversity of ways.
      • Familiarise yourself with a plotting library. My personal favourite is Plotly, but Matplotlib, Bokeh and Holoviews are other options.
        • Some concrete goals: Be able to plot several loss curves on the same plot, make a scatter plot, display a matrix (imshow), and add axis labels, titles, and line labels.
        • Callum McDougall has a great Plotly intro
        • Matplotlib is the most popular and easiest to google/use ChatGPT or Copilot for. But personally I find it really annoying, unintuitive and limiting. 
      • Play around with Alan Cooney's CircuitsVis library, which lets you pass in tensors to an interactive visualization and show it in a Jupyter notebook. See the existing visualizations here
        • You can write your own visualizations in React, but it's higher effort
    • Bonus: Other tutorials on mechanistic interpretability, to dig more into things. Just pick any of these that sound interesting, and move on if none of them feel shiny:
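To tie the bonus goals above together, here is a minimal sketch (intended for a Colab or Jupyter notebook, assuming transformer_lens, plotly and circuitsvis are installed) that loads GPT-2 Small, caches its activations on a prompt, and visualises one layer's attention patterns as a labelled Plotly heatmap and an interactive CircuitsVis widget. The prompt and the layer/head choices are arbitrary examples:

```python
import circuitsvis as cv
import plotly.express as px
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 Small

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)                    # [1, seq], with BOS prepended
logits, cache = model.run_with_cache(tokens)        # cache stores every intermediate activation

str_tokens = model.to_str_tokens(prompt)
tick_labels = [f"{i}: {tok}" for i, tok in enumerate(str_tokens)]  # unique axis labels
layer, head = 0, 7                                  # arbitrary choice, purely for illustration
pattern = cache["pattern", layer][0]                # [head, query_pos, key_pos]

# A labelled heatmap of one head's attention pattern.
px.imshow(
    pattern[head].cpu().numpy(),
    x=tick_labels, y=tick_labels,
    labels={"x": "Key (attended-to) position", "y": "Query position"},
    title=f"GPT-2 Small: layer {layer}, head {head} attention pattern",
).show()

# An interactive view of all heads in the layer (renders inline as the last expression in a notebook cell).
cv.attention.attention_patterns(tokens=str_tokens, attention=pattern)
```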
  4. (Optional) Learn key concepts in MI: Get your head around basic concepts in MI, and get an overview of the field. The goal of this step is a whirlwind overview, not to get a deep and perfect knowledge of everything!
    • Resource: A Comprehensive Mechanistic Interpretability Explainer & Glossary
    • Success criteria: Read through the whole explainer, and be able to follow the gist of most of it. 
      • Feel free to skip or skim sections or definitions that you don’t follow or find interesting.
      • Concrete goal: Be able to look at each section in the table of contents and, for 80% of them, have a rough intuition for what that section is about, why you might care about it, and what you would go back to it for

Paths for Further Exploration

A slightly less concrete collection of different strategies to further explore the field. These are different options, not a list to follow in order. Read all of them, notice if one jumps out at you, and dive into that. If nothing jumps out, start with the “exploring and building on a paper” route. If you spend several weeks in one of these modes and feel stuck, you should zoom out and try one of the other modes for a few days! Further, it’s useful to read through all of the sections - the tips and resources in one often transfer to the others!

Explore and Build On A Paper

Find a paper you’re excited about and try to really deeply engage with it: this means both running experiments and understanding the ideas, and ultimately trying to replicate (most of) it. If this goes well, it can naturally transition into building on the paper’s ideas, exploring confusions and dangling threads, and doing interesting original research.
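As a concrete example of testing a paper's basic claims: if you pick the induction heads work, a natural first experiment is to check which GPT-2 Small heads attend to the token right after the previous occurrence of the current token, on random sequences repeated twice. A minimal sketch using TransformerLens - the scoring scheme here is one simple choice, not the paper's exact methodology:

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 Small

# Random token sequences, repeated twice, with a BOS token at the start.
batch, seq_len = 4, 50
bos = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long)
rand = torch.randint(0, model.cfg.d_vocab, (batch, seq_len))
tokens = torch.cat([bos, rand, rand], dim=1)        # [batch, 2 * seq_len + 1]

_, cache = model.run_with_cache(tokens)

# "Induction score" per head: average attention from tokens in the repeated half back to the
# position (seq_len - 1) earlier, i.e. the token after the previous occurrence of the current token.
scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]               # [batch, head, query_pos, key_pos]
    diag = pattern.diagonal(offset=1 - seq_len, dim1=-2, dim2=-1)   # [batch, head, n_diag]
    scores[layer] = diag[..., 2:].mean(dim=(0, -1)).cpu()  # keep only queries in the second half

# Heads with scores near 1 attend almost entirely to the induction target.
top = scores.flatten().topk(5).indices
for idx in top:
    layer, head = divmod(idx.item(), model.cfg.n_heads)
    print(f"L{layer}H{head}: {scores[layer, head].item():.2f}")
```

If the paper's claims hold up, a handful of heads should stand out with much higher scores than the rest.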

Work on a Concrete Problem

Find a problem in MI that you’re excited about, and go and try to solve it! Note: A good explicit goal is learning and having fun, rather than doing important research. Doing important original research is just really hard, especially as a first project without significant mentorship! Note: I expect this to be a less gentle introduction than the other two paths - only do this one first if it feels actively exciting to you! Check out this blog post [AF · GW] and this walkthrough for accounts of what the actual mech interp research process can look like.

Read Around the Field (i.e. Do a Lit Review)

Go and read around the field. Familiarise yourself with the existing (not that many!) papers in MI, try to understand what’s going on, big categories of open problems, what we do and do not know. Focus on writing code as you read and aim to deeply understand a few important papers rather than skimming many. 

Appendix

Advice on compute

Other Resources

See Apart’s Interpretability Playground for a longer list, and a compilation of my various projects here

7 comments


comment by Alexander (alexander-1) · 2022-12-26T02:52:18.938Z · LW(p) · GW(p)

Bad question, but curious why it's called "mechanistic"?

Replies from: LawChan
comment by LawrenceC (LawChan) · 2022-12-26T04:25:52.078Z · LW(p) · GW(p)

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Or see this post by Daniel Filan [AF · GW].

Replies from: neel-nanda-1, alexander-1
comment by Neel Nanda (neel-nanda-1) · 2022-12-26T12:23:49.733Z · LW(p) · GW(p)

Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)

comment by Alexander (alexander-1) · 2022-12-26T04:30:00.868Z · LW(p) · GW(p)

Wonderful, thank you!

comment by Jay Bailey · 2022-12-26T06:00:02.301Z · LW(p) · GW(p)

This is awesome stuff. Thanks for all your work on this over the last couple of months! When SERI MATS is over, I am definitely keen to develop some MI skills!

comment by Kay Kozaronek (kay-kozaronek) · 2023-01-10T03:44:46.507Z · LW(p) · GW(p)

Thank you for your efforts in organizing and outlining the learning steps, Neel. I found the inclusion of concrete success criteria to be very helpful. I was wondering if you might be able to provide an estimated time indication for each step as well. I believe this would be useful not only to myself but to others as well. In particular, could you provide rough time estimates for the four steps in the "Getting the Fundamentals" part of the curriculum?

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-01-10T17:12:51.242Z · LW(p) · GW(p)

Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it's going to depend wildly on someone's prior knowledge and skillset and you can choose how deep to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:

ML pre-reqs: 10-40h
Transformer implementation: 10-20h
Mech interp tooling: 10-20h
Learning about the MI field: 5-20h

But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people to feel bad if they think they "should" get something done in <20h and it actually takes 60 to do right. I'd love to hear how long things take you if you try this!