Some questions about the feasibility section:
About (2): what would 'crisp internal representations' look like? I think this is really useful to know, since we haven't figured this out for the brain or for LLMs (e.g., Interpreting Neural Networks through the Polytope Lens).
Moreover, the current methods for comparing the human brain's representations to those of ML models, such as RSA, are quite high-level, or at the very least do not reveal much that is useful for interpretability on either side (please feel free to correct me; see the sketch below for what I mean by 'high-level'). This isn't a point against mechanistic interpretability, though: granted the premise that LLMs are pressured to learn similar representations, interpretability research can both borrow from and shed light on how representations in the brain work, a question that has reached no consensus even after decades.
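For concreteness, here is a minimal sketch of the kind of RSA comparison I have in mind: it only correlates the two representational geometries (a second-order comparison), so even a strong match says little about which features or mechanisms produce the similarity. The function names and the shapes in the usage example are mine, purely for illustration.

```python
# Minimal RSA sketch, assuming we already have two activation matrices
# (stimuli x features) for the same stimulus set -- e.g. fMRI voxel
# responses and hidden states from an LLM.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise correlation distance
    between the representations of every pair of stimuli (condensed form)."""
    return pdist(activations, metric="correlation")

def rsa_score(brain_acts: np.ndarray, model_acts: np.ndarray) -> float:
    """Spearman correlation between the two RDMs. This is the whole output:
    one number summarising geometric similarity, with no pointer to which
    features or circuits drive it -- hence 'high-level'."""
    rho, _ = spearmanr(rdm(brain_acts), rdm(model_acts))
    return rho

# Hypothetical usage: 100 stimuli, 500 voxels vs. 768 hidden units.
brain = np.random.randn(100, 500)
model = np.random.randn(100, 768)
print(rsa_score(brain, model))
```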
About (3): when we talk about automating interpretability, do we have a human or another model examining the research outputs? The fear is that, due to inherent limitations in how we reason about the world (e.g., we tend to make sense of something by positing 'objects' and considering the 'relations' between them), we may not be able to interpret what ML models do at a meaningfully high level. To be concrete, so far a lot of interpretability results take the form of 'circuits' for some narrow task like modular addition (sketched below), but how can automation safely generalize from that to a theory with wider scope? (I have Predictive Processing for the brain in mind, but I'm not sure how good an example it is.)
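To illustrate how narrow that setting is: the entire behaviour a modular-addition circuit has to explain is essentially the lookup table below. The choice of modulus (p = 113, as in the grokking work) is my assumption here; the point is just the contrast in scope with a brain-level theory like Predictive Processing.

```python
# The full modular-addition "task": every (a, b) pair mod p, nothing more.
import itertools

p = 113
dataset = [((a, b), (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]
print(len(dataset), dataset[:3])  # 12769 examples, e.g. ((0, 0), 0), ((0, 1), 1), ...
```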
Thank you so much for posting this. It feels weird to tick every single symptom mentioned here...
The burnout that 'Dmitry' experiences matches what I am experiencing remarkably closely. Are there any further guides on how to manage this? It would help me so much; any help is appreciated :)