This weakness of SAEs is not surprising: it is a general weakness of any interpretation method that is computed from model behaviour on a selected dataset. The same effect has been shown for permutation feature importances, partial dependence plots, Shapley values, integrated gradients, and more. There is a reasonably large body of literature on this from the interpretable ML / explainable ML research communities over the last 5-10 years.
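As a concrete illustration of the dataset dependence (a sketch of mine, not anything from the SAE work): permutation importance for the same fitted model can look completely different depending on which slice of data you evaluate it on. Here the target depends on the first feature only in half of the input space, so the "explanation" flips between slices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(-1, 1, size=(n, 2))   # two features, x0 and x1
y = X[:, 0] * (X[:, 1] > 0)           # x0 matters only where x1 > 0

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Same model, two evaluation slices, two very different "explanations":
for name, mask in [("slice x1 > 0", X[:, 1] > 0), ("slice x1 < 0", X[:, 1] < 0)]:
    imp = permutation_importance(model, X[mask], y[mask],
                                 n_repeats=10, random_state=0)
    print(name, imp.importances_mean.round(3))
# On the first slice x0 gets high importance; on the second, near zero.
```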
'Have you ever tried to explain the difference between correlation and causation to someone who didn't understand it? I'm not convinced that this is even something humans innately have, rather than some higher-level correction by systems that do that.'
You are outside and feel wind on your face. In front of you, you can see trees swaying in the wind. Did the swaying of the trees cause the wind? Or did the wind cause the trees to sway?
The cat bats at a moving toy. Usually he misses. When he does hit it, it usually makes a noise, but not always. The presence of the noise is more closely correlated with the cat successfully hitting the toy than the batting is. But did the noise cause the cat to hit the toy, or did the cat batting at the toy cause the hit?
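A quick simulation (numbers are mine, chosen to match the story) bears out that correlation claim: the downstream noise tracks the hit more tightly than the upstream batting does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

bat   = rng.random(n) < 0.5          # the cat bats on half the trials
hit   = bat & (rng.random(n) < 0.2)  # batting usually misses
noise = hit & (rng.random(n) < 0.8)  # a hit usually, but not always, makes a noise

print("corr(bat, hit)   =", np.corrcoef(bat, hit)[0, 1].round(2))    # ~0.33
print("corr(noise, hit) =", np.corrcoef(noise, hit)[0, 1].round(2))  # ~0.89
```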
The difference between correlation and causation is something we humans have a great sense for, so these questions seem really stupid. But they're actually very challenging to answer using only observations (without being able to intervene).
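To make "challenging without interventions" concrete, here is a minimal sketch (my construction, not from the thread) of two opposite causal models of the wind example. The parameters are chosen so the two linear-Gaussian models produce exactly the same joint distribution, so no amount of passive observation can tell them apart; an intervention separates them immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A: wind causes swaying.  wind ~ N(0,1), sway = wind + N(0,1)
wind_a = rng.normal(size=n)
sway_a = wind_a + rng.normal(size=n)

# Model B: swaying "causes" wind, parameterised to reproduce the same
# joint distribution:  sway ~ N(0,2), wind = 0.5*sway + N(0, 0.5)
sway_b = rng.normal(scale=np.sqrt(2), size=n)
wind_b = 0.5 * sway_b + rng.normal(scale=np.sqrt(0.5), size=n)

# Observationally identical: the same correlation (~0.71) in both models.
print(np.corrcoef(wind_a, sway_a)[0, 1].round(2),
      np.corrcoef(wind_b, sway_b)[0, 1].round(2))

# Intervention: force the wind to 2 (point a giant fan at the trees).
sway_a_do = np.full(n, 2.0) + rng.normal(size=n)  # sway is downstream: mean ~2
sway_b_do = sway_b                                 # sway is upstream: mean ~0
print(sway_a_do.mean().round(2), sway_b_do.mean().round(2))
```

(The linear-Gaussian case is precisely the one where causal direction is known to be unidentifiable from observational data; with non-Gaussian noise or nonlinearities, methods like LiNGAM can sometimes recover it.)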