200 COP in MI: Looking for Circuits in the Wild

post by Neel Nanda (neel-nanda-1) · 2022-12-29T20:59:53.267Z · LW · GW · 5 comments

Contents

  Motivation
  Resources
  Tips
  Problems
5 comments

This is the third post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.

Motivation

Motivating paper: Interpretability In The Wild

Our ultimate goal is to be able to reverse engineer real, frontier language models. And the next step up from toy language models is to look for circuits in the wild. That is, take a real model that was not designed to be tractable to interpret (albeit one much smaller than GPT-3), pick some specific capability it has, and try to reverse engineer how the model implements it. I think there’s a lot of exciting low-hanging fruit here, and that enough groundwork has been laid for this to be accessible to newcomers to the field! In particular, I think there’s a lot of potential to better uncover the underlying principles of models, and to leverage this to build better and more scalable interpretability techniques. We currently have maybe three published examples of well-understood circuits in language models - I want to have at least 20!

This section is heavily inspired by Redwood Research’s recent paper: Interpretability In The Wild. They analyse how GPT-2 Small solves the grammatical task of Indirect Object Identification: e.g. knowing that “John and Mary went to the store, then John handed a bottle of milk to” is followed by Mary, not John. And in a heroic feat of reverse engineering, they rigorously deciphered the algorithm. In broad strokes, the model first identifies which names are repeated, inhibits any repeated names, and then predicts that the name which is not inhibited comes next. The full circuit consists of 26 heads, sorted into 7 distinct groups - see more details in my MI explainer. A reasonable objection is that this is a cherry-picked task on a tiny (100M parameter) model, so why should we care about it? But I’ve learned a lot from this work!
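To make the task concrete, here is a minimal sketch (mine, not from the paper) of measuring this behaviour as a logit difference, assuming the TransformerLens library and its HookedTransformer interface; the prompt and the specific token strings are just the example above.

```python
# Minimal sketch: measure the IOI behaviour in GPT-2 Small as a logit difference,
# using TransformerLens (assumed installed). A clearly positive value means the
# model prefers the indirect object " Mary" over the repeated name " John".
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "John and Mary went to the store, then John handed a bottle of milk to"
tokens = model.to_tokens(prompt)
logits = model(tokens)  # shape [batch, seq_len, d_vocab]

final_logits = logits[0, -1]
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
print("Logit diff (Mary - John):", (final_logits[mary] - final_logits[john]).item())
```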

It’s helped uncover some underlying principles of networks. Notably, that the model has built-in redundancy: if the name mover heads (which attend to the correct name) are knocked out, there are backup heads which take over and do the task instead! And that the model has learned to repurpose existing functionality: the model has a component (induction heads) for the more complex task of continuing repeated subsequences, but here they also get re-purposed to detect whether a name is repeated. It would have been much harder to understand what was going on here without the prior work of reverse engineering induction circuits! And there are many open questions building on the work that could clarify other principles, like universality: when another model learns this task, does it learn the same circuit?

Further, it’s helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. If we have some well-understood circuits and can see how well these techniques work in those settings (and what they miss!), we have grounding for judging which techniques are good and how to interpret their results.

Finally, I think that enough groundwork has been laid that finding circuits in the wild is tractable, even by people new to the field! There are sufficient techniques and demos to build off of that you can make significant progress without much background. I’m excited to see what more we can learn! 

Resources

Tips

  1. Identify a behaviour (like Indirect Object Identification) in a model that you want to understand.
  2. Try to really understand the behaviour as a black box. Feed in a lot of inputs with many variations and see how the model’s behaviour changes. What does it take to break the model’s performance? Can you confuse it or trip it up?
  3. Now, approach the task as a scientist. Form hypotheses about what the model is doing - how could a transformer implement an algorithm for this? And then run experiments to support and to falsify these hypotheses. Aim to iterate fast, be exploratory, and get fast feedback.
    1. Copy my exploratory analysis tutorial and run the model through it on some good example prompts, and try to localise what’s going on - which parts of the model seem most important for the task? (A minimal activation patching sketch is given after this list.)
  4. Regularly red-team yourself and look for what you’re missing - if there’s a boring explanation for what’s going on, or a flaw in your techniques, what could it be? How could you falsify your hypothesis? Generating convoluted explanations for simple phenomena is one of the biggest beginner mistakes in MI!
  5. Once you have some handle on what’s going on, try to scale up and be more rigorous - look at many more prompts, use more refined techniques like path patching and causal scrubbing, try to actually reverse engineer the weights, etc. At this stage you’ll likely want to be thinking about more ad-hoc techniques that make sense for your specific problem.
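As a concrete starting point for step 3.1, here is a minimal activation patching sketch (plain activation patching, not the paper’s path patching), again assuming the TransformerLens library: run the model on a corrupted prompt with the names swapped, patch in the clean run’s residual stream at the final token position one layer at a time, and see at which layers this recovers the clean logit difference.

```python
# Minimal activation patching sketch with TransformerLens (assumed installed):
# for each layer, copy the clean run's residual stream at the final position
# into a corrupted run (indirect object swapped) and measure how much of the
# clean behaviour is restored. Layers where patching recovers the logit diff
# are where the "which name comes next" information lives at that position.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

clean_prompt = "John and Mary went to the store, then John handed a bottle of milk to"
corrupt_prompt = "John and Mary went to the store, then Mary handed a bottle of milk to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def logit_diff(logits):
    # Logit of " Mary" minus logit of " John" at the final position.
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

clean_logits, clean_cache = model.run_with_cache(clean_tokens)
print("clean logit diff:", logit_diff(clean_logits))
print("corrupted logit diff:", logit_diff(model(corrupt_tokens)))

def patch_final_pos(resid, hook):
    # Overwrite the corrupted residual stream at the final token position
    # with the cached clean activations for the same hook point.
    resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return resid

for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    patched_logits = model.run_with_hooks(
        corrupt_tokens, fwd_hooks=[(hook_name, patch_final_pos)]
    )
    print(f"layer {layer:2d}: patched logit diff = {logit_diff(patched_logits):.3f}")
```

Layers where the patched logit difference jumps back towards the clean value are good candidates for looking at individual components next, e.g. patching per-head outputs (hook_z) rather than the full residual stream.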

Problems

This spreadsheet lists each problem in the sequence. You can write down your contact details there if you're working on any of them and want collaborators, see any existing work, or reach out to other people working on them! (Thanks to Jay Bailey for making it.)

5 comments


comment by LawrenceC (LawChan) · 2022-12-30T00:06:39.338Z · LW(p) · GW(p)
  • C* What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?

So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases? 

comment by Noosphere89 (sharmake-farah) · 2022-12-29T21:12:08.118Z · LW(p) · GW(p)

Further, it’s helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. Having some

See a Twitter thread of some brief explorations I and Alex Silverstein did on this

I think you cut yourself off there both times.

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2022-12-29T21:17:22.713Z · LW(p) · GW(p)

Lol thanks. Fixed

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-12-29T21:18:58.333Z · LW(p) · GW(p)

You're welcome, though did you miss a period here or did you want to write more?

See a Twitter thread of some brief explorations I and Alex Silverstein did on this

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2022-12-29T23:55:29.562Z · LW(p) · GW(p)

Missed a period (I'm impressed I didn't miss more tbh, I find it hard to remember that you're supposed to have them at the end of paragraphs)