Explainer - AutoInterpretation Finds Sparse Coding Beats Alternatives

post by Gauraventh (aryangauravyadav) · 2023-08-01T17:29:16.962Z · LW · GW · 0 comments

This is my best attempt at explaining what is going on in the post "AutoInterpretation Finds Sparse Coding Beats Alternatives".

What is Sparse Coding?

Why do Sparse Coding?

This work by Hoagy builds on the OpenAI paper 'Language models can explain neurons in language models'. I was initially confused about what that paper was trying to do, and other people may share this confusion. I thought they were explaining to GPT-4 when a neuron activates on some text, and then trying to generate new text with the subject model using only those activated neurons.

What is really happening is:

What is Automatic Interpretation?

What does this post attempt to do?


MLP Results:

Residual Stream Results:

