On Interpretability's Robustness

post by WCargo (Wcargo) · 2023-10-18T13:18:51.622Z · LW · GW

Contents

  Would you trust the IOI circuit?
  Interpretability Tech Tree
  A Methodology for Interpretability
  Personal research
  Conclusion

Léo Dana - Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, as well as an internship at FAR AI, both under the mentorship of Claudia Shi.

Would you trust the IOI circuit?

Interpretability is envisioned by many as the main source of alignment tools for future AI systems, yet I claim that interpretability’s Theory of Change has a central problem: we don’t trust interpretability tools. Why is that?

 

Although some people don’t trust interpretability at all, others seem to me to take articles’ findings for granted even when the authors themselves admit limitations in the robustness of their methods:

Note that I’m not saying such findings are not robust, only that their robustness has not been fully evaluated.

 

I read Against Interpretability [LW · GW] and think that most of its criticisms hold because we cannot yet trust our interpretability tools. If we had robust tools, interpretability would be at least instrumentally useful for understanding models.

Moreover, robustness is not unachievable! It is a gradient of how much the arguments support your hypothesis. I think Anthropic’s papers Towards Monosemanticity and In-context Learning provide good examples of thorough research that clearly states its hypotheses and claims, and gives several independent lines of evidence for its results.

For those reasons, I think we should work out what a robust circuit is, and evaluate how robust the circuits we have already found are. Crucially, this should happen while we are trying to find new or more complex circuits, not after.

 

Interpretability Tech Tree

Designing an interpretability tool that satisfies all the important desiderata (robustness, scaling to larger models, …) on the first try is very optimistic. Instead, one can use a Tech Tree to decide which desideratum to tackle next.

In Transparency Tech Tree [LW · GW], Hubinger details a path to follow to find a tool for transparency. A Tech Tree approximately says: “Once you have a proof of concept, search for two improvements independently, and then combine them”, but it also specifies which improvements should be pursued, and in what order. Not only does this enable parallelization of research, but each improvement is easier since it is based on simpler models or data.

In Hubinger’s case, the tree starts with Best-case inspection transparency, which can be improved into Worst-case inspection transparency or Best-case training process transparency, and these are finally combined into Worst-case training process transparency.

Example for robustness: here is a 2-D table whose axes are complexity and robustness. New tools start as a Proof of concept, and the goal is to turn them into a Final product that one could use on a SOTA model. The easier path is to do 1 -> 2 and 1 -> 3 in parallel, followed by 2 + 3 -> 4, combining tools rather than finding new ones.

 

|                 | non-robust             | robust                 |
|-----------------|------------------------|------------------------|
| complex circuit | 2. Proof of complexity | 4. Final product       |
| easy circuit    | 1. Proof of concept    | 3. Proof of robustness |

A Methodology for Interpretability

In general, interpretability lacks a methodology: a set of tools, principles, or methods that we can use to increase our trust in the interpretations we build. The goal is not to set in stone certain ways of doing research, but to make explicit that we need to be able to trust our research, and to highlight bottlenecks for the community. The recent Towards Best Practices of Activation Patching is a great example of standards for activation patching![1] For example, it analyzes whether you should measure logits or probabilities (normalized logits) to obtain robust results, and how this choice can change your findings.
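To make the logits-versus-probabilities choice concrete, here is a minimal sketch (my own, not code from the paper) of the two metrics one might compute on the clean, corrupted, and patched runs of an activation-patching experiment. The tensor shape and the token ids `correct_id` / `wrong_id` are assumptions for illustration.

```python
import torch

def logit_diff_metric(logits, correct_id, wrong_id):
    """Difference of raw logits at the final position.

    Roughly linear in the residual stream, and insensitive to how much
    probability mass the rest of the vocabulary receives.
    `logits` is assumed to have shape [batch, seq, vocab]."""
    last = logits[:, -1, :]
    return (last[:, correct_id] - last[:, wrong_id]).mean()

def prob_metric(logits, correct_id):
    """Probability (normalized logits) of the correct token.

    Saturates near 0 and 1, so small but real effects of a patch
    can become invisible under this metric."""
    return torch.softmax(logits[:, -1, :], dim=-1)[:, correct_id].mean()

# Usage sketch: compute the chosen metric on the clean, corrupted, and
# patched runs, then report how far patching moves it from the corrupted
# value toward the clean one.
```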

Creating such a methodology is not just a matter of technical research; we also need to be clear about what an interpretation is, and what counts as proof. This part is much harder[2] since it involves operationalizing "philosophical" questions.

Fortunately, researchers at Stanford are trying to answer those questions with the Causal Abstraction framework. It formalizes the hypothesis that “a network implements an algorithm”, and how to test it: given our network and an explicit implementation of the algorithm we think it implements, there is a method to link their internal variables and test how close the algorithm is to the network.
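The core test in that framework is the interchange intervention: patch the value of a hypothesized high-level variable from one input into another, in both the algorithm and the network, and check that their outputs still agree. Below is a self-contained toy sketch of one such intervention; the tiny "network", its hand-set weights, and the hypothesized alignment are all invented for illustration, and a real evaluation would average agreement over many input pairs.

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """A tiny 'network' that happens to compute y = (a + b) * c.
    We hypothesize that its first hidden unit realizes the
    high-level variable s = a + b."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(3, 2, bias=False)
        with torch.no_grad():
            # hidden[0] = a + b, hidden[1] = c (hand-set weights for the demo)
            self.hidden.weight.copy_(torch.tensor([[1., 1., 0.],
                                                   [0., 0., 1.]]))

    def forward(self, x, patch_hidden0=None):
        h = self.hidden(x)
        if patch_hidden0 is not None:
            # interchange intervention: overwrite the aligned hidden unit
            h = torch.cat([patch_hidden0.view(1), h[1:]], dim=0)
        return h[0] * h[1]

def high_level(a, b, c, patch_s=None):
    # The explicit algorithm, with the same intervention point.
    s = a + b if patch_s is None else patch_s
    return s * c

net = ToyNet()
base, source = torch.tensor([1., 2., 3.]), torch.tensor([4., 5., 6.])

# Value of the hypothesized variable s on the source input, in both systems.
s_source_net = net.hidden(source)[0].detach()
s_source_alg = source[0] + source[1]

# Patch the source value into the base run and compare outputs.
out_net = net(base, patch_hidden0=s_source_net)
out_alg = high_level(*base, patch_s=s_source_alg)
print(out_net.item(), out_alg.item())  # equal -> this intervention counts as a success
```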

 

Personal research

At SERI MATS, my research[3] goal was to test the robustness of activation steering [LW · GW]. The idea was to test the tradeoffs that steering directions face between the things they need to be good at:

However, I chose to apply it to gender, which turned out to be too complex / not the right paradigm. Using a dataset built around the concept of utility, Annah [LW · GW] and shash42 [LW · GW] were able to test these tradeoffs [LW · GW] and found interesting behavior: it is harder to remove a concept than to steer a model, and the direction has to be chosen precisely for removal to work. I really liked their work, and I think it is important to understand directions in LLMs!
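For readers unfamiliar with the two operations being compared, here is a rough sketch of steering versus concept removal as forward hooks on a generic PyTorch module. The layer to hook, the `direction` vector, and the scale `alpha` are placeholders; real experiments use a specific LM and a learned direction, and in many transformer libraries the block output is a tuple rather than a bare tensor.

```python
import torch

def make_steering_hook(direction, alpha=5.0):
    """Steering: add a (scaled) unit direction to the residual-stream output."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # assumes `output` is the residual-stream tensor [..., hidden_dim]
        return output + alpha * d
    return hook

def make_removal_hook(direction):
    """Removal (projection ablation): subtract the component along the direction."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        coeff = (output * d).sum(dim=-1, keepdim=True)
        return output - coeff * d
    return hook

# Usage sketch (assuming `block` is an nn.Module whose output is the residual stream):
# handle = block.register_forward_hook(make_steering_hook(direction))
# ... run the model ...
# handle.remove()
```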

 

Conclusion

I hope to have convinced you of the importance of interpretability’s robustness as a research strategy, and especially that the most efficient way to create such a methodology is to pursue it deliberately rather than to stumble upon it serendipitously.

Thanks for reading. I’ll be happy to answer comments elaborating on or arguing against this idea. By default, I will likely pursue this line of research in the future.

 

  1. ^

    It would have helped me a lot during my internship!

  2. ^

The work that has already been done in cognitive science and philosophy might be useful for creating this methodology.

  3. ^

It is not available at the moment; I will add a link once it is done.
