Comment by intern on EIS IX: Interpretability and Adversaries · 2024-12-06T00:29:50.321Z
Networks that have to learn more features may become more vulnerable to adversaries simply because there are more features for the adversary to exploit, and those features are represented more densely.
Also, in the top figure the loss is 'relative to the non-superposition model', but if I'm not mistaken the non-superposition model should be essentially perfectly robust. Because it has just one layer, its Jacobian would be the identity, and because the loss is MSE, any perturbation to the input would be reflected only in the corresponding output feature, so the loss would not change at all. Only once you introduce superposition can a change to the input change the loss, because the features then actually 'interact'.
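To make the argument concrete, here is a minimal NumPy sketch rather than the setup used in the post. It assumes a toy one-layer model of the form ReLU(WᵀWx) with no bias, non-negative inputs, hand-picked (not trained) weights, and an MSE loss measured against the perturbed input itself. All names (`reconstruct`, `mse`, `W_dense`, `W_super`) are made up for illustration. If the loss were instead measured against the clean input, the non-superposition model's loss would grow with the size of the perturbation, so the result depends on that assumption.

```python
import numpy as np

def reconstruct(W, x):
    """Toy one-layer model: hidden = W @ x, output = ReLU(W.T @ hidden). No bias (assumption)."""
    return np.maximum(W.T @ (W @ x), 0.0)

def mse(W, x):
    """Reconstruction loss, with the model's own input as the target (assumption)."""
    return float(np.mean((reconstruct(W, x) - x) ** 2))

# No superposition: 3 features in 3 hidden dims, so W.T @ W is the identity.
W_dense = np.eye(3)

# Superposition: 3 features squeezed into 2 hidden dims, as unit vectors 120 degrees apart.
# Hand-picked for illustration, not learned weights.
s = np.sqrt(3.0) / 2.0
W_super = np.array([
    [1.0, -0.5, -0.5],
    [0.0,    s,   -s],
])

x = np.array([1.0, 0.0, 0.0])      # only feature 0 is active
delta = np.array([0.0, 0.1, 0.0])  # small perturbation on feature 1

loss_change_dense = mse(W_dense, x + delta) - mse(W_dense, x)
loss_change_super = mse(W_super, x + delta) - mse(W_super, x)

print("dense model loss change:        ", loss_change_dense)
print("superposition model loss change:", loss_change_super)
```

In this sketch the identity model passes the perturbation straight through to the matching output, so its loss stays at zero, while in the superposed model the perturbation on feature 1 leaks into feature 0 through the shared hidden dimensions and the loss goes up.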