Comment by
Jinjin Zhao (jinjin-zhao) on
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 ·
2024-07-15T19:51:49.362Z ·
LW ·
GW
I am curious about your thoughts on the differences between activation patching and SAEs. Do you think they are complementary lines of research, or might there be some overarching idea that encapsulates both?
Is there any application of one that can't be done with the other? It seems that activation patching may yield more interpretable concepts, while SAEs may yield more fundamental features. My intuition is that activation patching may be able to replace SAEs in the future.