Comment by lampinen on "Activation space interpretability may be doomed" · 2025-03-03T19:21:55.592Z
We've done some studies of related phenomena in https://openreview.net/forum?id=aY2nsgE97a. For example, we find that activation patterns can be strongly biased towards capturing easy (linear) features over difficult (nonlinear) ones (or more prevalent features over less prevalent ones, or earlier-learned ones, etc.). This bias can lead interpretations based on activations to miss some of the important features that the model is computing.
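To make the concern concrete, here is a toy sketch. It is not the setup from the linked paper: the activations are hand-constructed synthetic data, and the scales (5.0 for the easy feature, 0.5 for the hard one), the dimensionality, and the helper `variance_explained` are all assumptions chosen only to mimic the kind of bias described above. It shows that a feature can be reliably decodable from the activations while accounting for almost none of their variance, so a variance-driven analysis such as PCA would mostly see the easy feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16

# Two binary inputs; "easy" is linear in the inputs, "hard" (XOR) is not.
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
easy = a.astype(float)
hard = (a ^ b).astype(float)

# Hypothetical orthogonal unit directions in activation space.
dir_easy = rng.normal(size=d)
dir_easy /= np.linalg.norm(dir_easy)
dir_hard = rng.normal(size=d)
dir_hard -= (dir_hard @ dir_easy) * dir_easy
dir_hard /= np.linalg.norm(dir_hard)

# ASSUMPTION: the easy feature is written at 10x the scale of the hard one.
acts = (
    5.0 * np.outer(easy - 0.5, dir_easy)
    + 0.5 * np.outer(hard - 0.5, dir_hard)
    + 0.1 * rng.normal(size=(n, d))
)
centered = acts - acts.mean(axis=0)


def variance_explained(centered_acts, feature):
    """Fraction of total activation variance linearly explained by one feature."""
    f = feature - feature.mean()
    coef = centered_acts.T @ f / (f @ f)  # per-dimension regression slope
    pred = np.outer(f, coef)
    return (pred ** 2).sum() / (centered_acts ** 2).sum()


easy_r2 = variance_explained(centered, easy)
hard_r2 = variance_explained(centered, hard)

# The top principal component lines up with the easy feature's direction.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top_pc_alignment = abs(vt[0] @ dir_easy)

# Yet the hard feature is still decodable along its own direction.
hard_decode_acc = ((centered @ dir_hard > 0) == (hard > 0.5)).mean()

print(f"variance explained by easy feature: {easy_r2:.3f}")
print(f"variance explained by hard feature: {hard_r2:.3f}")
print(f"top PC alignment with easy direction: {top_pc_alignment:.3f}")
print(f"decoding accuracy for hard feature: {hard_decode_acc:.3f}")
```

In this construction both features are present and used in the sense that both can be read out, but anything that ranks directions by variance would put the hard feature near the noise floor.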