In my view, the key purpose of interpretability is to translate model behavior into a representation that is readily understood by humans. This representation may include first-order information (e.g., the feature attribution techniques that are common now), but it should also include higher-order side-effects induced by the model as it is deployed in an environment. This higher-order information will be critical for thinking about unintended emergent properties that may arise, as well as for bounding their likelihood under formal guarantees.
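To make the first-order case concrete, here is a minimal sketch of gradient-times-input attribution, assuming a differentiable PyTorch classifier; the names are illustrative, not any particular library's API:

```python
import torch

def gradient_times_input(model, x):
    """First-order information: how much each input feature locally
    drives the model's top output score (plain gradient * input).
    `model` is assumed to map a feature tensor to class scores."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x).max()           # scalar score for the predicted class
    score.backward()                 # gradients of the score w.r.t. the input
    return (x.grad * x).detach()     # per-feature attribution
```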
If you view alignment as a controls problem (e.g., [1]), interpretability gives us a mechanism for assessing (and forecasting) the measured output of a system. This step is necessary for taking appropriate corrective action that reduces the measured error. In this sense, interpretability is the inverse of the alignment problem. This notion of interpretability captures many of the points mentioned in the list above, especially #1, #2, #3, #7, #8, and #9.
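A caricature of that framing in code, with interpretability sitting in the sensor position of the feedback loop; every name here is a hypothetical placeholder, not a real API:

```python
def control_loop(system, interpret, corrective_action, reference, steps=10):
    """Control-theory framing of alignment: `interpret` plays the sensor
    role, turning raw system behavior into a measured output that can be
    compared against a reference (the intended behavior); the resulting
    error drives a corrective update to the system."""
    for _ in range(steps):
        behavior = system.act()                    # deployed model behavior
        measured = interpret(behavior)             # interpretability as measurement
        error = reference - measured               # deviation from intent
        system = corrective_action(system, error)  # feedback / alignment step
    return system
```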
[1] https://en.wikipedia.org/wiki/Control_theory#/media/File:Feedback_loop_with_descriptions.svg