A Comprehensive Mechanistic Interpretability Explainer & Glossary

post by Neel Nanda (neel-nanda-1) · 2022-12-21T12:35:08.589Z · LW · GW · 6 comments

This is a link post for https://neelnanda.io/glossary

Contents

  Why does this doc exist?
  How to read this doc?
  Table of Contents
6 comments

This is a linkpost for a very long doc defining, explaining, and giving intuitions and conceptual frameworks for all the concepts I think you should know about when engaging with mechanistic interpretability. If you find the UI annoying, there's an HTML version here

Why does this doc exist?

How to read this doc?

Table of Contents

6 comments


comment by Dalcy (Darcy) · 2023-01-05T18:49:38.776Z · LW(p) · GW(p)

Just wanted to comment that this is an absolutely amazing resource and has saved me a ton of time trying to get into this field & better understand several of the core papers. Thank you so much for writing this!

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-01-05T20:31:30.905Z · LW(p) · GW(p)

Thanks :) I'm happy to hear that people are actually using it!

comment by Logan Riggs (elriggs) · 2023-02-23T16:44:26.903Z · LW(p) · GW(p)

Unfinished line here

Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature

comment by Jett (jett) · 2023-10-09T08:41:10.458Z · LW(p) · GW(p)

The activation patching, causal tracing and resample ablation terms seem to be out of date, compared to how you define them in your post on attribution patching [LW · GW].

comment by [deleted] · 2023-01-15T22:34:05.750Z · LW(p) · GW(p)

Thanks for writing this. A question:

Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input. 

Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"? 

Because some could just be intermediate calculations on the way to higher-level features (part of a circuit).
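A toy sketch of the distinction being discussed (my own illustration, not from the glossary; all names and dimensions are made up): under "features as directions", a feature's strength on an input is the dot product of the activation vector with some direction; "features as neurons" is the special case where that direction is a standard basis vector, so the strength is just one neuron's activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # hypothetical activation dimension
acts = rng.normal(size=d)   # a hypothetical activation vector

# Features as directions: a feature is some unit vector in activation
# space; its strength on this input is the projection onto it.
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)
strength = acts @ feature_dir

# Features as neurons: the special case where the feature direction is
# a standard basis vector, so the strength is just that neuron's activation.
neuron_idx = 3
neuron_dir = np.eye(d)[neuron_idx]
assert np.isclose(acts @ neuron_dir, acts[neuron_idx])
```

The asymmetry in the question then becomes concrete: every neuron defines *some* direction, but a feature direction need not line up with any single neuron.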

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-01-16T16:26:06.699Z · LW(p) · GW(p)

Fair point, corrected.

Because some could just be intermediate calculations on the way to higher-level features (part of a circuit).

IMO, the intermediate steps should mostly be counted as features in their own right, but it'd depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, e.g. memory management or signal-boosting earlier directions in the residual stream.