Comments

Comment by QSED on On Developing a Mathematical Theory of Interpretability · 2023-02-10T20:26:53.226Z · LW · GW

I'm skeptical, but I'd love to be convinced. I'm not sure it's necessary for making interpretability scale, but it definitely strikes me as a potential trump card that would let interpretability research keep pace with capabilities research.

Here are a couple of relatively unsorted thoughts (keep in mind that I'm not a mathematician!):

  • Deep learning as a field isn't exactly known for its rigor. Every rigorous theory of it that I know of is, as you say, purely 'reactive', with none of it leading to significant 'real-world' results. As far as I can tell, this isn't for lack of trying either. This has made me doubt its mathematical tractability, whether that's because our current mathematical understanding is lacking or because of something else (DL not being as 'reductionist' as other fields?). How do you lean in this regard? You mentioned that you're not sure how amenable interpretability itself is, but would you guess that it's more or less amenable than deep learning as a whole?
  • How would success here relate to capabilities research? A general criticism of interpretability research is that it also leads to heightened capabilities; would this approach fare better or worse in that regard? I would have assumed that a developed rigorous theory of interpretability would probably also entail significant development of a rigorous theory of deep learning.
  • How likely is it that the direction one proceeds in would turn out to be correct? You mention an example from mathematical physics, but note that it's perhaps relatively unimportant that this work was done for 'pure' reasons. This is surprising to me, as I thought a major motivation for pure math research, like other blue-sky research, is that it's often not apparent whether something will be useful until it's well developed. I think this is similar to your point that the small-scale problems will not look like the larger problem. You mention that this involves following one's nose mathematically; do you think this is possible in general, or only in this case? If the latter, why do you think interpretability is specifically amenable to it?