Video/animation: Neel Nanda explains what mechanistic interpretability is

This is a link post for

Nice little video: the audio is Neel Nanda explaining what mechanistic interpretability is and why he does it, and it's illustrated by the illustrious Hamish Doodles. Excerpted from the AXRP episode.

(It's not technically animation, I think, but I don't know what other single word to use for "pictures that move a bit and change".)


comment by Sheikh Abdur Raheem Ali (sheikh-abdur-raheem-ali) · 2023-02-23T01:19:03.935Z · LW(p) · GW(p)

Lots of alpha in AI research distillers learning Motion Canvas (motion-canvas/motion-canvas: Visualize Complex Ideas Programmatically) and making explainers.

Replies from: alexander-cai, lahwran
comment by adzcai (alexander-cai) · 2023-02-23T01:36:42.976Z · LW(p) · GW(p)

Or even better, finetuning an LLM to automate writing the code!

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-02-23T02:16:16.123Z · LW(p) · GW(p)

cyborgism, activate!

just don't use an overly large model.

comment by the gears to ascension (lahwran) · 2023-02-23T02:17:52.460Z · LW(p) · GW(p)

For those reading (I imagine Sheikh knows about these already), some videos from the creator of that library:

comment by novalinium · 2023-02-22T23:57:32.720Z · LW(p) · GW(p)

A single word for this would be an animatic, probably.

Replies from: DanielFilan
comment by DanielFilan · 2023-02-23T00:09:05.963Z · LW(p) · GW(p)

I kinda guess that most people don't know what that means.

comment by TinkerBird · 2023-02-23T09:29:31.142Z · LW(p) · GW(p)

Here's a dumb idea: if you have a misaligned AGI, can you keep it inside a box and have it teach you some things about alignment, perhaps through some creative lies?