

Comment by lampinen on Activation space interpretability may be doomed · 2025-03-03T19:21:55.592Z · LW · GW

We've done some studies of related phenomena in: — e.g., that activation patterns can be strongly biased towards capturing easy (linear) features over difficult (nonlinear) ones (and likewise more prevalent features over less prevalent ones, earlier-learned over later-learned ones, etc.). This bias can lead interpretations based on activations to miss some of the important features that the model is computing.
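
To make the concern concrete, here is a minimal synthetic sketch, not the setup from the studies mentioned above. It assumes the bias rather than demonstrating it: two binary features are written into hand-built "activations" along orthogonal directions, with a large scale for the one standing in for an easy feature and a small scale for the one standing in for a hard feature. The scales, directions, noise level, and the helper `variance_explained` are all hypothetical choices for illustration. Under those assumptions, a variance-based view of the activations (variance explained, top principal component) is dominated by the easy feature, even though a linear probe can still read out the hard one.

```python
import numpy as np

def variance_explained(acts, feature):
    """Fraction of total activation variance captured by a linear fit on one feature."""
    acts_c = acts - acts.mean(axis=0)
    f = feature - feature.mean()
    slopes = f @ acts_c / (f @ f)      # least-squares slope of each unit on the feature
    fitted = np.outer(f, slopes)
    return float((fitted ** 2).sum() / (acts_c ** 2).sum())

rng = np.random.default_rng(0)
n, d = 2000, 16                        # samples, hidden units (arbitrary)

# Two independent binary features; "easy" and "hard" are just labels here.
easy = rng.integers(0, 2, n).astype(float)
hard = rng.integers(0, 2, n).astype(float)

# Hypothetical orthonormal directions that carry each feature.
q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
u, v = q[:, 0], q[:, 1]

# ASSUMPTION: the representational bias is put in by hand via these scales.
EASY_SCALE, HARD_SCALE, NOISE_STD = 3.0, 0.3, 0.05
acts = (EASY_SCALE * np.outer(easy, u)
        + HARD_SCALE * np.outer(hard, v)
        + NOISE_STD * rng.normal(size=(n, d)))

ve_easy = variance_explained(acts, easy)
ve_hard = variance_explained(acts, hard)

# Alignment of the top principal component with each feature direction.
acts_c = acts - acts.mean(axis=0)
_, _, vt = np.linalg.svd(acts_c, full_matrices=False)
align_easy = float(abs(vt[0] @ u))
align_hard = float(abs(vt[0] @ v))

# A least-squares linear probe (with intercept) for the hard feature.
design = np.hstack([acts_c, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(design, hard, rcond=None)
probe_acc = float(np.mean((design @ w > 0.5) == (hard > 0.5)))

print(f"variance explained: easy={ve_easy:.3f}, hard={ve_hard:.3f}")
print(f"top-PC alignment:   easy={align_easy:.3f}, hard={align_hard:.3f}")
print(f"linear probe accuracy on hard feature: {probe_acc:.3f}")
```

In this toy setting the hard feature is fully present and readable, yet anything that ranks directions by variance would surface the easy feature first. Whether and how strongly real networks show this kind of imbalance is the empirical question; the sketch only illustrates why such an imbalance would matter for activation-based interpretation.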