(OLD) An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

post by Neel Nanda (neel-nanda-1) · 2022-10-18T21:08:33.033Z · LW · GW · 5 comments

This is a link post for https://www.neelnanda.io/mechanistic-interpretability/favourite-papers


This post is out of date - see v2 here [AF · GW]

Introduction

This is an extremely opinionated list of my favourite mechanistic interpretability papers, annotated with my key takeaways, what I like about each paper, which bits to deeply engage with, which bits to skim (and what to focus on when skimming), and which bits I don’t care about and recommend skipping, along with fun digressions and various hot takes.

This is aimed at people trying to get into the field of mechanistic interpretability (especially Large Language Model (LLM) interpretability). I’m writing it because I’ve benefited a lot from hearing the unfiltered and honest opinions of other researchers, especially when first learning about something, and I think it’s valuable to make this kind of thing public! On the flip side, though, this post is explicitly about my personal opinions - I think some of these takes are controversial, and other people in the field would disagree.

The four top-level sections are priority ordered, but papers within each section are ordered arbitrarily - follow your curiosity.

Priority 1: What is Mechanistic Interpretability?

Priority 2: Understanding Key Concepts in the field

Priority 3: Expanding Understanding

Language Models

Algorithmic Tasks

Image Circuits

Priority 4: Bonus

5 comments

Comments sorted by top scores.

comment by aogara (Aidan O'Gara) · 2022-10-18T21:48:39.025Z · LW(p) · GW(p)

Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike work focusing on worst-case robustness or inner misalignment, I would expect generating human-understandable explanations of what neural networks are doing to be interesting to plenty of academics, and I would think that's what the academics are trying to do. Are they just bad at generating insights? Do they look for the wrong kinds of progress, perhaps motivated by different goals? Why is the large academic field of interpretability not particularly useful for x-risk motivated AI safety?

Replies from: neel-nanda-1, lahwran
comment by Neel Nanda (neel-nanda-1) · 2022-10-18T22:49:57.029Z · LW(p) · GW(p)

Honestly, I also feel fairly confused by this - mechanistic questions are just so interesting. Empirically, I've fairly rarely found academic interpretability all that interesting or useful, though I haven't read that widely (that said, there are definitely some awesome papers from academia, as linked in the post, some solid academics, and many more papers that contain some moderately useful insight).

To be clear, I am focusing on mechanistic interpretability - actually reverse engineering the underlying algorithms learned by a model - and I think there's legitimate and serious work to be done in other areas that could reasonably be called interpretability.
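
For concreteness, here's a minimal sketch of what this kind of reverse engineering can look like in practice: running a small model with activation caching and inspecting a single attention head's pattern. It uses the TransformerLens library; the model, prompt, and layer/head indices are purely illustrative choices, not a claim about any particular circuit.

```python
# Minimal illustrative sketch: run a small model, cache its activations, and
# look at one attention head's pattern. The layer/head indices are arbitrary.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

# Run the model while caching every intermediate activation
logits, cache = model.run_with_cache(tokens)

# Attention patterns for layer 9: shape [batch, n_heads, query_pos, key_pos]
pattern = cache["pattern", 9]

# Where does head 9 attend from the final token position?
print(model.to_str_tokens(prompt))
print(pattern[0, 9, -1])
```

Working out from there - which heads matter, what they attend to, and how ablating them changes the output - is the kind of question mechanistic work tries to answer.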

My take would roughly be that there are a few factors - but again, I'm fairly confused by this, and "it's actually great and I'm just being a chauvinist" is also a pretty coherent explanation (and I know some alignment researchers who'd argue for the latter hypothesis):

  • Doing rigorous mechanistic work is just fairly hard, and doesn't really fit the ML paradigm - it doesn't really work to frame it in terms of eg benchmarks, and it's often more qualitative than quantitative. It's thus both difficult to do and hard to publish.
  • Lots of interpretability/explainability work treats the ground truth as things like "do human operators rate this explanation as helpful" or "does this explanation help human operators understand the model's output better", which feel like fairly boring metrics to me, and not very relevant to mechanistic stuff.
  • Lots of work focuses too much on pretty abstractions (eg syntax trees and formal grammars) and not enough on grounding their work in what's actually going on inside the model.
  • Mechanistic interpretability is pre-paradigmatic - there just isn't an agreed upon way to make progress and find truth, nor an established set of techniques. This both makes it harder to do research in, and harder to judge the quality of work in (and thus also harder to publish in!).

I think ease of publishing is a pretty important point - even if an academic doesn't personally care about publications, their collaborators/students/supervisors often might, and there are strong career incentives to care. Eg, if a PhD student wants to work on this and it'd be much harder to publish, a good supervisor probably should discourage it (within reason), since part of their job is looking out for their student's career interests.

Though hopefully it's getting easier to publish in the field nowadays! There are a few mechanistic papers submitted to ICLR.

comment by the gears to ascension (lahwran) · 2022-10-19T01:27:32.899Z · LW(p) · GW(p)

I mean, personally I'd say it's the only hope we have of making any of the reflection algorithms have any shot at working. you can't do formal verification unless your network is at least interpretable enough that, when formal verification fails, you can work out what it is about the dataset that made your non-neural prover run out of compute time when you ask what the lowest margin to a misbehavior is. if the network's innards aren't readable enough to get an intuitive sense of why a subpath failed the verification, or of what computation the network performed that timed out the verifier, it's hard to move the data around to clarify.

of course, this doesn't help that much if it turns out that a strong planner can amplify a very slight value misalignment quickly, as expected by miri folks; afaict, miri is worried that the process of learning (their word is "self improving") can speed up a huge amount when the network can make use of the full self-rewriting possibility of its substrate and properly understands the information geometry of program updates (ie, afaict, they expect significant amounts of generalizing improvement of architecture or learning rule or such things once it's strong enough to become a strong quine as an incidental step of doing its core task).

and so interpretability would be expected to be made useless by the ai breaking into your tensorflow to edit its own compute graph or your pytorch to edit its own matmul invocation order or something. presumably that doesn't happen at the expected level until you have an ai strong enough to significantly exceed the generalization performance of current architecture search incidentally without being aimed at that, because the ais they're imagining wouldn't have even been trained on that specifically the way eg alphatensor was narrowly aimed at matmul itself.

wow this really got me on a train of thinking, I'm going to post more rambling to my shortform.

comment by infinitevoid (iyngkarran-kumar) · 2022-10-23T22:15:33.868Z · LW(p) · GW(p)

Thanks for writing this - I've found it useful in my current attempts to survey some key mechanistic interpretability literature.

From the post:

  "a decent survey paper on what’s up in the rest of interpretability."

  "I’m personally pretty meh about the majority of the academic field of interpretability"

A bit confused by this. This paper's abstract and intro claim to be focusing on inner interpretability methods - which they define as learned features and internal structure. This seems to fit my idea of what mechanistic interpretability is pretty well, but you seem to classify it as 'the rest of interpretability'.

Do you see a clear distinction between mechanistic interpretability methods and the methods reviewed in this paper? If so, what's the distinction?

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2022-10-24T14:46:24.801Z · LW(p) · GW(p)

This is a fair point! I honestly have only vaguely skimmed that survey, and got the impression there was a lot of stuff in there that I wasn't that interested in. But it's on my list to read properly at some point, and I can imagine updating this a bunch.