
Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability 2024-06-17T11:46:09.671Z
Exploring Llama-3-8B MLP Neurons 2024-06-09T14:19:10.822Z


Comment by ntt123 (thong-nguyen) on Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability · 2024-06-18T01:38:56.660Z

Thank you for the upvote! My main frustration with logit lens and tuned lens is that these methods are somewhat ad hoc and do not reflect each component's contribution in a mathematically sound way. We should be able to rewrite the output as a sum of individual terms, I told myself.
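To make "a sum of individual terms" concrete, here is a toy sketch rather than the method from the post. It assumes a residual stream that is the plain sum of component outputs, an RMSNorm-style final normalization whose scale is computed once on the full residual and then held fixed, and a linear unembedding. Every name here (`components`, `gain`, `W_U`, `to_logits`) is hypothetical, and the values are random stand-ins for real model weights.

```python
import math
import random

random.seed(0)
d_model, vocab, n_components = 8, 5, 4

def rand_vec(n):
    """Random vector standing in for real activations or weights."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical outputs that each component (embedding, attention heads,
# MLP layers, ...) writes into the residual stream.
components = [rand_vec(d_model) for _ in range(n_components)]
gain = rand_vec(d_model)                         # assumed RMSNorm gain
W_U = [rand_vec(vocab) for _ in range(d_model)]  # assumed unembedding matrix

# The final residual is the sum of all component outputs.
residual = [sum(c[i] for c in components) for i in range(d_model)]

# Compute the normalization scale once, on the full residual, then freeze it.
# With the scale fixed, the map from residual to logits is linear.
rms = math.sqrt(sum(x * x for x in residual) / d_model + 1e-6)

def to_logits(vec):
    """Apply the frozen-scale norm and the unembedding to one vector."""
    normed = [v / rms * g for v, g in zip(vec, gain)]
    return [sum(normed[i] * W_U[i][j] for i in range(d_model))
            for j in range(vocab)]

logits = to_logits(residual)

# One logit vector per component; by linearity they add up to the full logits.
contributions = [to_logits(c) for c in components]
summed = [sum(c[j] for c in contributions) for j in range(vocab)]

assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(summed, logits))
```

The decomposition is exact only because the normalization scale is treated as a constant. If each component were normalized by its own scale, the terms would no longer sum to the model's output.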

For the record, I did not assume that MLP neurons are either monosemantic or polysemantic, which is why I did not mention SAEs.