Mech Interp Lacks Good Paradigms

post by Daniel Tan (dtch1997) · 2024-07-16

Contents

  Towards a Grand Unifying Theory (GUT) with MI
  A GUT Needs Paradigms
    Paradigms Are Instrumental for Progress
    Three Desiderata for Paradigms
  Examining Paradigms in Mechanistic Interpretability
  A Mathematical Framework for Transformer Circuits
      Epistemics: 4.5/5 
      Generality: 5/5
      Open-Endedness: 5/5
    The Linear Representation Hypothesis
      Epistemics: 3/5
      Generality: 2/5
      Open-Endedness: 2/5
    The Superposition Hypothesis
      Epistemics: 3/5
      Generality: 4/5
      Open-Endedness: 5/5 
  Other Bodies of Theory
    Singular Learning Theory
      Epistemics: 5/5
      Generality: 3/5
      Open-Endedness: 3/5
    Computational Mechanics
    The Polytope Hypothesis
  Distilling A Technical Research Agenda
  Conclusion 
  Acknowledgements

Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free! 

Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome!

Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible: it requires comparatively few computational resources and facilitates rapid learning thanks to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included) - to the extent that some people feel MI is over-subscribed relative to other technical agendas. But how useful is this MI research?

A very common claim about MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking about, and understanding of, the models we aim to interpret.

In this post, I'll critically examine several leading candidates for "paradigms" in MI, consider the available evidence for / against, and identify good future research directions (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined. 

Towards a Grand Unifying Theory (GUT) with MI

Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned. 

Some people who have espoused this opinion: 

As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors (a.k.a. representation engineering) and circuit breakers, both of which were clearly inspired by the wealth of MI work demonstrating the linear representation hypothesis.
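
To make the connection concrete, here is a minimal sketch of difference-of-means activation steering. Everything in it is illustrative - the dimensions, coefficient, and placeholder activations are my own assumptions, not taken from any specific paper:

```python
import torch

# Sketch of difference-of-means steering. In practice, `pos_acts` and
# `neg_acts` would be residual-stream activations cached from a real model
# on contrastive prompt pairs; random tensors stand in for them here.
d_model = 512
pos_acts = torch.randn(100, d_model)  # e.g. activations on "happy" prompts
neg_acts = torch.randn(100, d_model)  # e.g. activations on "sad" prompts

# If the concept is encoded as a linear direction (the LRH), the difference
# of the two class means points along that direction.
steering_vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def steering_hook(module, inputs, output, alpha=4.0):
    """Forward hook that adds the steering vector into a layer's output
    (the residual stream) at generation time."""
    return output + alpha * steering_vector

# Hypothetical usage: model.layers[13].register_forward_hook(steering_hook)
```

The point is that this intervention only makes sense if concepts really are linear directions; the technique is downstream of the paradigm.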

It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated.

(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here.

A GUT Needs Paradigms

Paradigm - an overarching framework for thinking about a field

In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, and biology) and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for thinking about a field. Kuhn argues that the establishment of accepted paradigms is a sign of "maturity" in the development of a field.

Paradigms Are Instrumental for Progress

In the absence of a paradigm, it's very hard to draw the right conclusions from data, for two reasons. 

Multiple hypotheses could explain the data. Kuhn argues that, in the absence of a paradigm, a reasonable researcher might reach "any one of a number of incompatible conclusions". For example, we might incorporate variables with no actual predictive power into our explanation, like an ancient guru using the motions of stars to predict the future. The variables we choose are a function of our prior experience in other fields and "accidents" in the process of our investigation.

We may not have sufficiently good mental abstractions to understand what we're seeing. 

A famous thought experiment in neuroscience (Jonas and Kording's "Could a Neuroscientist Understand a Microprocessor?") considers what results popular interpretability techniques would yield on microprocessors - an example of a complex information-processing system that we understand at all levels. Techniques such as "lesion experiments" (a.k.a. activation patching) completely fail to elucidate the underlying structure: without a more advanced interpretive framework, MI experiments aren't very useful.
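
For readers unfamiliar with the technique, here's a minimal sketch of activation patching using PyTorch forward hooks (the `model`, `layer_module`, and `metric` arguments are placeholders of my own; real experiments typically use a library such as TransformerLens):

```python
import torch

def activation_patch(model, layer_module, clean_input, corrupt_input, metric):
    """Cache one layer's activation on `clean_input`, then re-run the model
    on `corrupt_input` with that activation patched in, and measure the
    effect with `metric`. A large effect suggests the layer matters for
    the behaviour under study."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["clean"]  # overwrite the corrupted activation

    # 1. Forward pass on the clean input, caching the activation.
    handle = layer_module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)
    handle.remove()

    # 2. Forward pass on the corrupted input, with the clean activation
    #    patched in at this layer.
    handle = layer_module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_output = model(corrupt_input)
    handle.remove()

    return metric(patched_output)
```

The thought experiment's worry is precisely that a procedure like this, applied to a microprocessor, surfaces "important" components without revealing anything about what the system is computing.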

As another example of how an absence of paradigms leads to illusions, consider Lucius Bushnaq's thought experiment on interpreting a hypothetical feature tracking the entropy of a physical system:

It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.

In a pre-paradigmatic field, Kuhn argues that all research amounts to no more than a "fact-gathering" exercise conducted mostly at random. The value of a paradigm is in forcing concerted and systematic study of a common set of phenomena in a standardised way, facilitating subsequent progress in the field. Furthermore, by concentrating attention on "mysteries" which the paradigm fails to explain, a paradigm sets up subsequent work to find the next paradigm. 

Three Desiderata for Paradigms

Distilling and synthesizing Kuhn's treatment, I'll say that a paradigm has three important properties.

  1. A paradigm has good epistemics, i.e. we believe it to be true because it explains existing data very well, or because it is strongly implied by other things we assume to be true. 
  2. A paradigm is general, i.e. it applies to many seemingly-distinct cases with a high degree of accuracy. For example, Newton's law of universal gravitation explains both the motion of celestial bodies and the trajectories of thrown objects. 
  3. A paradigm is open-ended, i.e. can be easily used as a building block for future work. For example, the results associated with a paradigm may result in useful practical applications. (Note: this necessitates a minimum degree of rigour.) Alternatively, attempting to validate the premises associated with a paradigm might point the way to very informative experiments. 

In the subsequent discussion, I will return to these three criteria when evaluating each candidate paradigm (or "precursor" paradigm).

Examining Paradigms in Mechanistic Interpretability

Our most "paradigm-y" things at the moment include:

  1. A Mathematical Framework for Transformer Circuits
  2. The linear representation hypothesis (LRH)
  3. The superposition hypothesis

TL;DR: my view is that these all fall short in some way. The first actually is a great paradigm, just insufficient for what we want. The latter two are not sufficiently rigorous to serve as building blocks for theory.

A Mathematical Framework for Transformer Circuits

The mathematical framework blogpost, published by Anthropic in 2021, is a seminal work introducing what I consider a great paradigm in MI - perhaps my favourite - and it has pushed the field forward a lot.

An overview: the framework decomposes a transformer into a linear "residual stream" which every component (attention heads, MLPs) reads from and writes to. Each attention head factors into a QK circuit, which determines where the head attends, and an OV circuit, which determines what information is moved. This allows attention-only models to be analysed in closed form and led directly to the discovery of induction heads.
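
As a worked illustration of the framework's core move, here is my own minimal sketch (with arbitrary dimensions) of how a head's weights compose into two low-rank matrices acting directly on the residual stream:

```python
import torch

# Shapes follow the usual convention: W_Q, W_K, W_V map the residual stream
# (d_model) to the head dimension (d_head); W_O maps d_head back to d_model.
d_model, d_head = 512, 64
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_V = torch.randn(d_model, d_head)
W_O = torch.randn(d_head, d_model)

# QK circuit: a bilinear form on the residual stream; determines *where*
# the head attends.
W_QK = W_Q @ W_K.T   # [d_model, d_model], rank <= d_head

# OV circuit: a linear map on the residual stream; determines *what*
# information is moved once the head attends somewhere.
W_OV = W_V @ W_O     # [d_model, d_model], rank <= d_head

# Attention score between a query position x_i and a key position x_j:
x_i, x_j = torch.randn(d_model), torch.randn(d_model)
score = (x_i @ W_QK @ x_j) / d_head**0.5
```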

Epistemics: 4.5/5 

Generality: 5/5

Open-Endedness: 5/5

My main criticism is that "A Mathematical Framework" is not high-level enough. As excellent as it is, this paradigm feels like a level of abstraction "below" what we want. 

Concrete analogy: answering biological questions with chemistry. If we think of a model as a biological organism, "Mathematical Framework" is like a complete description of its chemistry (a.k.a. its circuitry). The latter is definitely foundational and informationally complete w.r.t. the former. At the same time, it is totally insufficient for answering higher-level questions like "does it have goals?"

The Linear Representation Hypothesis

An overview of the LRH: the hypothesis is that models represent "features" of the input as linear directions in activation space, so that the strength of a feature corresponds to the size of the activation's projection onto that direction, and multiple features compose (approximately) additively.
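
A common way to operationalise the LRH is with a linear probe. Here is a minimal sketch with placeholder data (in a real experiment the activations would come from a model and the labels from an annotated dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: residual-stream activations for inputs with and without
# some binary feature (e.g. "the text is in French").
rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 512))  # [n_samples, d_model]
labels = rng.integers(0, 2, size=1000)   # feature present / absent

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
feature_direction = probe.coef_[0]  # candidate linear "feature direction"

# Under the LRH, the projection of an activation onto this direction should
# track the feature; high held-out probe accuracy is (weak) evidence of a
# linear encoding.
print(probe.score(acts, labels))
```

Note the epistemic gap: a probe can find a linearly-decodable direction even if the model itself never uses it, which is part of why the scores below are middling.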

Epistemics: 3/5

Generality: 2/5

Open-Endedness: 2/5

My biggest problem with the LRH is that it's very difficult to use it to make meaningful a priori predictions.

The Superposition Hypothesis

An overview of superposition: the hypothesis is that models represent more features than they have dimensions by encoding features as non-orthogonal (but nearly orthogonal) directions, exploiting the fact that features in natural data are sparse. Polysemantic neurons fall out as a natural consequence.

I think the assumptions here are quite reasonable, which facilitates a high generality score. However, the epistemics could be better. 

Epistemics: 3/5

The largest body of evidence for superposition so far is that we consistently observe polysemantic neurons in models, across all sizes, architectures, and domains - polysemanticity itself is readily observable, though on its own it is consistent with superposition rather than proof of it.

Secondly, superposition has also been demonstrated in toy models. However, it's unclear what disanalogies there are between superposition in toy models and superposition in real LLMs. For example, the assumptions about the data may not hold, or maybe two-layer ReLU networks are qualitatively quite different from much deeper transformer architectures. 
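
For reference, the toy-model result can be reproduced in a few lines. The following is a rough sketch in the spirit of Anthropic's "Toy Models of Superposition" setup; the dimensions, sparsity level, and training details are my own arbitrary choices:

```python
import torch
import torch.nn as nn

# n sparse features compressed into m < n dimensions by a tied-weight
# ReLU model trained to reconstruct its input.
n_features, m_dims = 20, 5

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, m_dims) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                          # x: [batch, n_features]
        h = x @ self.W                             # compress to m dims
        return torch.relu(h @ self.W.T + self.b)   # reconstruct n features

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    # Sparse inputs: each feature is active only ~5% of the time.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < 0.05)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With sufficiently sparse inputs, the learned columns of W become
# non-orthogonal and the model represents more than m features:
# superposition.
```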

The last main evidence in favour of superposition is that interp approaches inspired by superposition, i.e. sparse autoencoders, have seemed to work really well. However, this evidence is also not very solid. 

As a specific alternative hypothesis on what SAEs are doing, a common argument is that SAEs simply cluster the data, and interpretability comes from having tighter clusters. 
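
For concreteness, here is a minimal sketch of the kind of sparse autoencoder used in this line of work (the architecture and hyperparameters are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: many more latent features than model
    dimensions, with an L1 penalty encouraging sparse activation."""
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        x_hat = self.dec(f)          # reconstruction from the dictionary
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    reconstruction = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()        # L1 term: few features active at once
    return reconstruction + l1_coeff * sparsity
```

Notice that nothing in this objective distinguishes "recovering true features from superposition" from "finding tight clusters in the data", which is exactly why the clustering hypothesis above is hard to rule out.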

Generality: 4/5

Open-Endedness: 5/5 

Assuming superposition as a concept turns out to be basically correct, it illuminates a lot of useful follow-up work. 

A point which I have not seen made elsewhere: I think we have not yet fully elucidated the "kinds" of superposition that can occur in models. Arbitrarily complex things can be in superposition, and features could just be the tip of the iceberg. 

Overall, I am actually quite confident that superposition is essentially correct - that's why I'm currently working on circuit analysis using SAEs. But I think there's insufficient evidence at the moment to reject the leading alternative hypotheses and cement its status as a paradigm.

Other Bodies of Theory

Note: I am considerably less familiar with these other bodies of theory than I am with the preceding three, so there may be errors or inaccuracies here. Please feel free to point those out as necessary.

There are some other bodies of theory which currently don't make the top cut, but which I think are promising nonetheless as things which could yield paradigms, given time. 

Singular Learning Theory

Ratings are offered here, but they should be taken with a very large pinch of salt because I am not very familiar with SLT.

Overview: SLT, developed by Sumio Watanabe, studies Bayesian learning in "singular" models - models whose parameter-to-function map is degenerate, as is the case for neural networks. Its central quantity is the learning coefficient (RLCT), which replaces the naive parameter count in controlling generalisation and effective model complexity.
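
The headline result can be stated compactly. Below is a sketch of Watanabe's free-energy asymptotics, where F_n is the Bayesian free energy, L_n(w*) the empirical loss at the optimum, n the number of samples, and lambda the learning coefficient (my paraphrase; see Watanabe for the precise statement):

```latex
F_n \,=\, n L_n(w^*) \,+\, \lambda \log n \,+\, O_p(\log \log n)
```

For regular models, lambda = d/2 (half the parameter count); for singular models it can be much smaller, which SLT reads as an implicit bias towards "simpler" solutions.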

Epistemics: 5/5

Generality: 3/5

In principle, because SLT covers all models within the class of "singular" models, its claims are extremely general - even more so than "A Mathematical Framework", which applies only to neural nets, and more specifically only to transformer architectures.

However, I'm not confident that SLT results are general, for the following reasons

Open-Endedness: 3/5

On the whole, I think there is a lot of untapped potential here for SLT to be a paradigm, but this potential is quite far from being fully realized at the moment due to both issues with communicating the foundations of SLT to the broader MI community and a lack of "killer applications". 

Computational Mechanics

No rating is offered here because I haven't engaged sufficiently with the material. I'm including Comp Mech mostly for completeness.

The seminal work that I am aware of shows that transformers model belief states using simplices, a specific representational geometry. Brief comments here: 

I don't have great takes on what Comp Mech aims to do as a field, and in any case it hasn't (yet) made a significant impact on the MI literature. I'll revisit this in the future if it becomes relevant.

The Polytope Hypothesis

This is the idea that the correct atoms for models' feature geometry are "polytopes". It's a very nascent trend I observe in some recent papers [Kiho Park et al., Adam Shai et al., the Polytope Lens, and circular features in day-of-the-week math]. I intend to write a more complete note about this in follow-up work.

Distilling A Technical Research Agenda

Note: Here, I summarize the technical research items proposed in previous sections which I think would be exciting. I've omitted the "other bodies of theory" for now because I think my takes will not be very good.

Generally: More killer applications of existing paradigms. 

On "Mathematical Framework": 

On the LRH: 

On superposition: 

Conclusion 

In summary, I think it's important to critically evaluate whether MI has succeeded in delivering general paradigms with high explanatory power over nontrivial phenomena. My take so far is that we have a few separate attempts, but all of these are lacking at the moment. That's fine - incomplete paradigms are still useful, and this highlights good avenues for future research.

Acknowledgements

Thanks to Egg Syntax, Jake Mendel, Robert Kirk, and Joseph Bloom for useful feedback and discussions!
