200 COP in MI: Techniques, Tooling and Automation

post by Neel Nanda · 2023-01-06

Contents

  Motivation
  Further Thoughts
  Tips
  Resources
  Problems

This is the seventh post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.

Motivating papers: Causal Scrubbing, Logit Lens

Motivation

In Mechanistic Interpretability, the core goal is to form true beliefs about what’s going on inside a network. The search space of possible circuits is extremely large, and even once a circuit is found, we need to verify that that’s what’s really going on. These are hard problems, and having good techniques and tooling is essential to making progress. This is particularly important in mech interp because it’s such a young field that there isn’t an established toolkit and standard of evidence, and each paper seems to use somewhat different and ad-hoc techniques (pre-paradigmatic, in Thomas Kuhn’s language).

Getting better at this is important just to get traction interpreting circuits at all, even in a one-layer toy language model! But it’s particularly important for dealing with the problem of scale. Mech interp can be very labour intensive, and involves a lot of creativity and well-honed research intuitions. This isn’t the end of the world with small models, but ultimately we want to understand models with hundreds of billions to trillions of parameters! We want to leverage researcher time as much as possible, and the holy grail is to eventually automate finding circuits and understanding models. My guess is that the most realistic path to really understanding superhuman systems is to slowly automate more and more of the work with weaker systems, while making sure that we understand those systems and that they’re aligned with what we want.

This can be somewhat abstract, so here are my best guesses for what progress could look like:

A mechanistic perspective: This is an area of mech interp research where it’s particularly useful to study non-mechanistic approaches! People have tried a lot of techniques to understand models. Approaches that solely focus on the model’s inputs or outputs don’t seem that relevant, but there’s a lot of work digging into model internals (Rauker et al. is a good survey), and many of these ideas are fairly general and scalable! I think it’s likely that there are insights here that are underappreciated in the mech interp community.

A natural question is: what does mech interp have to add? I have a pretty skeptical prior on any claim about neural network internals, let alone that there’s a technique that works in general or can be automated without human judgement. In my opinion, one of the core things missing is grounding - concrete examples of systems and circuits that are well understood, where we can test these techniques and see how well they work, their limitations, and whether they miss anything important. My vision for research here is to take circuits we understand and use them as training data to figure out scalable techniques that work for those circuits, then use these refined techniques to search for new circuits, do our best to fully understand those, and build a feedback loop that validates how well the techniques generalise to new settings.

Further Thoughts

“Build better techniques” is a broad claim, and encompasses many approaches and types of techniques. Here’s my attempt to operationalise the key ways that techniques can vary - note that these are spectrums, not black and white! Important context is that I generally approach mech interp with two mindsets: exploration, where I’m trying to become less confused about a model and form hypotheses, and verifying/falsifying, where I’m trying to break the hypotheses I’ve formed and look for flaws, or for stronger evidence that I’m correct. Good research looks like regularly switching between these mindsets, but they need fairly different techniques. 

  1. General vs specific: 
    1. General techniques are a broad toolkit that work for many circuits, including ones we haven’t identified yet. 
      1. Eg: direct logit attribution - looking at which model components directly contribute to the logit for the correct next token. 
    2. Specific techniques focus on identifying a single type of circuit/circuit family
      1. Eg: Prefix matching, identifying induction heads by looking for the induction attention pattern on repeated random tokens (sketched in code after this list)
    3. All other things being the same, general techniques are much more exciting, but specific techniques can be much easier to create, and can still be very useful (it’s great that we can automatically identify all induction heads in a model!)
  2. Exploratory vs confirmatory: 
    1. Exploratory techniques are about getting information about confusing behaviour in a model - what can we do to learn more about model internals and get more data on what’s going on? These tend to be pretty general.
      1. Visualising model internals is a good example, eg looking at attention patterns, or plotting the first 2-3 principal components of activations. 
    2. Confirmatory techniques focus on taking a hypothesised circuit and confirming or falsifying whether that’s what’s actually going on - ideally in a way objective enough that other researchers can trust the result. These can be either specific or general. 
      1. Causal Scrubbing is the best attempt I’ve seen here
    3. This is emphatically a spectrum - good exploratory techniques also help verify circuits, and being able to quickly verify a circuit can significantly help exploration. 
    4. One key difference is how subjective vs objective the output is. Exploratory techniques can output high dimensional data and visuals for a researcher to subjectively interpret as they form hypotheses and iterate, while confirmatory techniques should give objective output, eg a specific metric of how good the circuit is.
  3. Rigorous vs suggestive: Techniques vary from the rigorous, with strong and reliable evidence, to the merely suggestive. 
    1. Rigorous example: Activation patching (also sketched in code after this list)
      1. If copying a single activation from input A to input B is sufficient to flip it from answer B to answer A, you can be pretty confident that that activation contained the key information distinguishing A from B
    2. Suggestive example: Interpreting a neuron by looking for patterns in its max activating dataset examples
      1. This is known to be misleading, and earlier parts of the neuron’s activation range could easily mean other things. But, equally, it definitely tells you something useful about that neuron, and can be strong evidence that a neuron is not monosemantic!
    3. This is emphatically a spectrum! No technique is perfect, and all of them have some pathological edge cases. Equally, even merely suggestive techniques can be useful for forming hypotheses and iterating. 
      1. In practice, my advice is to have a clear view of the strengths and weaknesses of each technique, and to take it as useful but limited data
      2. A particularly important thing is to track the correlations between technique failures. Activation patching may fail in similar ways to ablations, but likely fails in different ways from max activating dataset examples, so agreement between less correlated techniques is stronger evidence
    4. This can be unidirectional - if ablating a component kills model performance, that’s decent evidence that it matters, but if ablation has no effect, a backup head may be taking over
      1. Note that ablating a component may break things because it breaks all model behaviour (eg if it’s used as an important bias term) - this happens with MLP0 in GPT-2 Small
  4. Scalable vs labour intensive: Techniques vary from fully automated approaches that can be run on arbitrary models and produce a simple output, to painstakingly staring at neurons and looking at weights.
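To make a couple of these concrete, here is a minimal sketch of two of the techniques referenced above - prefix matching as an automated, specific detector of induction heads, and activation patching as a more rigorous, confirmatory check - using my TransformerLens library. The prompts, the 0.4 threshold, and the choice of layer to patch are illustrative assumptions rather than canonical choices (and exact helper names may shift between TransformerLens versions), so treat this as a starting point, not a recipe:

```python
from functools import partial

import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
torch.manual_seed(0)

# --- 1. Prefix matching: automatically flag induction heads ------------------
# Run the model on a random token sequence repeated twice. An induction head at
# destination position i attends back to position i - seq_len + 1 (the token
# *after* the previous occurrence of the current token), so we score each head
# by its average attention on that diagonal "induction stripe".
seq_len = 50
rand_tokens = torch.randint(0, model.cfg.d_vocab, (1, seq_len), device=model.cfg.device)
rep_tokens = torch.cat([rand_tokens, rand_tokens], dim=-1)
_, rep_cache = model.run_with_cache(rep_tokens)

for layer in range(model.cfg.n_layers):
    pattern = rep_cache["pattern", layer]  # [batch, head, dest_pos, src_pos]
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    induction_score = stripe.mean(dim=(0, -1))  # average over batch and positions
    for head in range(model.cfg.n_heads):
        if induction_score[head] > 0.4:  # arbitrary threshold - inspect flagged heads by hand!
            print(f"L{layer}H{head} looks like an induction head "
                  f"(score {induction_score[head].item():.2f})")

# --- 2. Activation patching: a more rigorous, confirmatory check -------------
# Copy the residual stream at one (layer, position) from a clean run into a run
# on a corrupted prompt, and check how much of the correct answer is recovered.
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
answer = model.to_single_token(" Mary")
diff_pos = (clean_tokens != corrupt_tokens).nonzero()[0, 1].item()  # where the prompts differ

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos):
    # Overwrite the corrupted residual stream at `pos` with the clean activation
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

layer = 6  # illustrative - in practice, sweep over every layer and position
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), partial(patch_resid, pos=diff_pos))],
)
corrupt_logits = model(corrupt_tokens)
print("Corrupted logit for ' Mary':", corrupt_logits[0, -1, answer].item())
print("Patched logit for ' Mary':  ", patched_logits[0, -1, answer].item())
```

The induction score here is just the average attention mass on the induction stripe, which is crude - per the point about correlated failures above, I’d eyeball the attention patterns of any flagged head before trusting the label. Similarly, patching a single (layer, position) tells you little on its own; the useful version sweeps over all layers and positions and looks at how much of the clean behaviour each patch restores.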

Finally, a cautionary note to beware premature optimization. A common reaction among people new to the field is to dismiss highly labour intensive approaches and jump to techniques that are obviously scalable. This isn’t crazy - scalable approaches are an important eventual goal! But if you skip over the stage of really understanding what’s going on, it’s very easy to trick yourself and produce techniques or results that don’t really work.

Further, I think it is a mistake to discard promising-seeming interpretability approaches for fear that they won’t scale - there’s a lot of fundamental work to do in getting to a point where we can even understand small toy models or specific circuits at all. I see a lot of the work to be done right now as basic science - building an understanding of the basic principles of networks and a collection of concrete examples of circuits (like, what is up with superposition?!) - and I expect this to then be a good foundation for thinking about scaling. We only have something like 3 examples of well understood circuits in real language models! It’s plausible to me that we shouldn’t focus too hard on automation or scalable techniques until we have at least 20 diverse example circuits and can get some real confidence in what’s going on!

But automated and scalable techniques remain a vital goal! 

Tips

Resources

Problems

This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work, or reach out to other people on there! (thanks to Jay Bailey for making it)
