Posts
Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
2024-06-17T11:46:09.671Z
Exploring Llama-3-8B MLP Neurons
2024-06-09T14:19:10.822Z
Comments
Comment by ntt123 (thong-nguyen) on Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability · 2024-06-18T01:38:56.660Z
Thank you for the upvote! My main frustration with the logit lens and the tuned lens is that both are somewhat ad hoc and do not reflect component contributions in a mathematically sound way. We should be able to rewrite the output as a sum of individual terms, I told myself.
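To make the "sum of individual terms" idea concrete, here is a minimal NumPy sketch, not necessarily the exact method from the post. It assumes that the final residual stream is the sum of the vectors written by each component, and that the final normalization is an RMSNorm whose scale is computed once from the full residual and then held fixed, so the map from residual to logits is linear. The function name `decompose_logits`, the component names, and the random toy weights are all hypothetical.

```python
import numpy as np

def decompose_logits(components, gain, W_U, eps=1e-6):
    """Split the logits into one additive term per residual-stream component.

    components: dict mapping a (hypothetical) component name to the
                (d_model,) vector that component wrote to the residual stream
    gain:       (d_model,) RMSNorm gain
    W_U:        (d_model, vocab) unembedding matrix
    """
    residual = sum(components.values())
    # Assumption: the RMS scale comes from the full residual and is frozen,
    # which makes the normalization linear in each component.
    scale = np.sqrt(np.mean(residual ** 2) + eps)
    per_component = {
        name: (vec / scale * gain) @ W_U for name, vec in components.items()
    }
    total = (residual / scale * gain) @ W_U
    return per_component, total

# Toy data standing in for real model activations.
rng = np.random.default_rng(0)
d_model, vocab = 16, 10
components = {
    name: rng.normal(size=d_model)
    for name in ["embed", "attn_0", "mlp_0", "attn_1", "mlp_1"]
}
gain = rng.normal(size=d_model)
W_U = rng.normal(size=(d_model, vocab))

per_component, total = decompose_logits(components, gain, W_U)

# The per-component terms add up to the full logits.
assert np.allclose(sum(per_component.values()), total)
```

Under these assumptions each entry of `per_component` is that component's additive contribution to every logit, which is the kind of accounting that the logit lens and tuned lens do not provide directly.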
For the record, I did not assume that MLP neurons are either monosemantic or polysemantic, which is why I did not mention SAEs.