Against blanket arguments against interpretability
post by Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-22T09:46:23.486Z · LW · GW · 2 commentsContents
On blanket criticism and refutation The critiques First critique. Second critique Third critique Conclusion. None 2 comments
On blanket criticism and refutation
In his long post [LW · GW] on the subject, Charbel-Raphaël argues against theories of impacts of interpretability. I think it's a largely a good, well-argued post, and if the only thing you get out of it is reading that post, I'll be contributing to improving the discourse. There is other material with similar claims that I think are made with low context, and also I should say that I'm not very versed in the history and the various versions of the debate.
At the same time, I disagree with the take.
In this post I'm going to "go high" and debate strong, general forms of the criticism, rather than the more object-level subcomponents. Generalizing away from specifics, I think Charbel-Raphaël's post has three valid general reasons to doubt the usefulness of the interpretability paradigm. Below, I'll try to write out condensed but strong versions of them, and address them one by one.
But before doing this, let's set some parameters for the debate. Note that the goal of a blanket critique like Charbel-Raphaël's post is not to say "there are worlds where interpretability definitely doesn't lead to better safety", but rather that "interpretability doesn't lead to noticeably better safety in the vast majority of worlds". Therefore in refuting this, I'll be free to make some "plausible" assumptions about the world coming from insights that we have on current models, without having to defend these from "what if" arguments about exceptional worlds where they don't work.
This means that in this post I am very definitely not making the strongest case for the impact of interpretability. There are many alternative routes to impact the come from understanding systems better. If I were making a post that makes the strongest case for interpretability being useful given what we know now, I'd not just give one plausible story but go through many of the alternative routes. I would also put a lot of emphasis on operationalizing our high uncertainty (including high upside) about the general point that "understanding systems better in a fundamental way leads to tools of shapes we didn't expect existed".
For the purpose of this discussion, I'll more narrowly look at the critiques, and provide counterexamples. In refuting blanket critiques, a sufficient condition for a counterexample is: a description of a concrete assumption about the future that has non-vanishing probability (according to our best current understandings of how AIs work) for which the blanket criticism fails[1].
In other words, as Plato puts in his famous text on interpretability[2], the Phaedrus:
Let us therefore speak in the manner of a ‘plausible myth’ [...].
The critiques
It remains to go through and discuss each high-level critique, as I understand it.
First critique.
Whatever sophisticated tools we get to understand the model, cryptography tells us that (in most systems), the AI has sufficient expressivity to scuttle them.
I think here I notice a particular amount of confusion in the community. First of all, I think that basically this is true! Unless we have a really fundamentally different paradigm of AI in the future (to a level which I think is unlikely), there is and will continue to be sufficient theoretical expressivity in an AI's weights to be arbitrarily deceptive: distinguish training from deployment, perform un-decodable steganography, hide deceptive behaviors. Moreover it seems likely that in essentially all current and future architectures, it will be, in a fundamental complexity-theoretic sense, impossible to distinguish deceptive behavior from benign stochasticity inside the model's weights[3].
However, there is a big difference between cryptographically intractable behavior being possible in principle vs. being present in the specific class of trained systems we get in neural nets (and again, this is a pretty generic property that we should expect will hold in future architectures, namely that algorithms that get learned are special and in particular not-maximally-cryptographic -- a vague term, some versions of which can be operationalized in complexity theory).
Instead we are dealing with certain "organic" systems which are learnable. It is almost definitely true that for systems like this, methods of "robustly avoiding cryptographically-shaped deceptions" are possible in an appropriate "limit" notion of possibility. For example the "hardness" proofs that show that arbitrary deception is cryptographically possible will imply that it is possible, in principle, to have a very small (maybe GPT3-size) deceptive AI whose deception is impossible to find by any friendly AGI that fits in our universe. But "in real life", almost all AI safety experts believe that a friendly AGI with access to the training process can guarantee (outside exponentially vanishing "maxwell's demon"- level outliers) the safety of any much smaller trained system. In a similar sense, while it is cryptographically impossible to make proofs about the out-of-distribution behavior or compressibility of general programs, this is not the case for most programs that matter. For instance formal verification methods are possible and frequently implemented on "SOTA level" human-generated code (e.g. verifiable versions of the linux kernel).
At the end of the day, in parallel to the "existence proofs" of undecodable deception, we also have more or less certain "existence proofs" of interpretability techniques which catch deception in programs executing learning. The mathematical question (while useful as an intuition pump) then reduces to a pragmatic question of "what will happen first" (and whether deception-screening methods will be cheap enough, etc.). Hard questions, but as I hope to convince you below, solvable ones in certain realistic worlds.
Second critique
Even assuming we get significantly better fundamental interpretability theory, full "enumerative" interpretability is probably prohibitively hard and expensive.
Again, this is a place where I tend to bite the bullet: I think that some paradigm of "completely understanding AIs to the neuron level" is neither likely nor desirable. The point is that once again we're dealing with a real system where complexity is dictated by some messy properties of reality and the training algorithm, rather than with some idealized general interpretable system with an irreducible complexity from the interpretation to the behavior.
In particular, real systems tend to have the properteis of modularity and scale separation. Scale separation is a formalizable notion in thermodynamics and solid-state physics (it's related to the theoretical thermodynamics discussion here [LW(p) · GW(p)], though by no means reducible to the specific "numerical renormalization" picture discussed there). More generally, it corresponds to an assumption that behaviors on large scales (such as an engineering balls-and-rods structural analysis of bridge stability) are independent or "shielded" from behaviors at small scales (i.e., local variations in crystal structure and vibration), with all "interaction" mediated by fully understandable large-scale phenomena like heat, strain tensors, and mechanical waves.
In the context of interpreting neural nets, such a separation could also very plausibly take place, separating low-level phenomena on the level of induction heads or name association circuits, etc., from higher-level phenomena like encodings of chess positions or sophisticated grammatical parsers, and separating these from further undiscovered phenomena like primitives of logical thinking. Note that if this is the case, proving such separation may well be hard -- but if a safety technique relying on unproven assumptions happens to lead to safety, that's still a win. As I explained above, all I need to counter a blanket criticism is a plausible story.
Now one can easily imagine an interpretability scheme that interprets every new token generated by an LLM, e.g., in terms of a handful of relevant behaviors (as I mentioned before, the diagrams in appendix F of this paper give an excellent "anchoring complexity scale" of characteristic phenomena that occur on the level of a typical sentence parsing, and while the techniques in that paper probably don't scale indefinitely due to limitations of SAEs, the complexity takeaways from this paper are in my view likely to be essentially valid). The key point now is that deception is a high-level behavior. In any sentence-generating task, it seems plausible to me that only a couple of behaviors on this level will be active per token, and in such a picture robustly checking every token for deceptive behaviors may be extremely cheap (with robustness guaranteed by a separation of scales phenomenon).
Again, none of the things I said should be taken as predictions: things are likely to be more complicated and have a more sophisticated set of moving parts than just some bucket of heuristic mechanisms at different scales. Instead, at least to me this is a sufficient "plausible myth" to counter a universal objection of this type.
Third critique
If we look at progress in interpretability so far then, even allowing for some amount of future theoretical progress, it seems implausible that interpretability will get to a level of being able to track complex behaviors on most reasonable AGI timescales.
This to me is the most important critique and the one I am most sympathetic to. I agree that it is very likely that extremely weird AGI will happen before we have even very messy and speculative white-box techniques to track deception or other forms of misalignment. At the same time, I think we're not as far behind as many people seem to think (at least, in certain "good worlds" or "plausible paradigms"). Namely, we are quickly developing an increasingly rich collection of techniques to probe for various AI behaviors. We have pretty robust measurements of complexity of algorithms from SLT, and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress). We have methods from Anthropic and Apollo and a number of independent researchers that give pictures, on various levels of complexity and "fundamentalness", of relatively sophisticated LLM behaviors, like linearly decodable positional look-ahead in chess, and for linearly eliciting various semantic and functional behaviors. (And here I am just giving examples from the narrow class of topics I follow in the subject; there are many other success stories of similar scale in the wider ML literature.) Certainly new theoretical ideas are needed, but I think it's quite plausible that they will come soon. The science is not (as described by some skeptics) at the level of Galileo marveling at the sky through a primitive telescope -- it's closer to the level of early-20th-century physicists starting to discover a zoo of new ways to experimentally probe and discover behaviors and regularities of atoms, but not quite sure which ones are fundamental and which ones are not (of course in AI we are in a much more complex setting).
In parallel to this optimism, I think it's entirely plausible that we're on the wrong track: that the primitive feature-finding and circuit-finding and linear-embedding methods we're using are entirely incidental, not at all useful for capturing a more fundamental "minimal viable" level of understanding that's needed to characterize behaviors like deception. For what it's worth, my view, inspired by condensed matter physics and development of other sciences, is that even very incidental ways of experimentally decomposing behaviors tend to eventually converge to the right insights. But I think this worry is a valid one, held strongly by people I respect. Nevertheless, an honest person here I think will admit to a genuine level of uncertainty: there are "plausible myths" that mild extensions of white-box techniques that currently exist will be sufficiently strong to be safety-relevant.
Conclusion.
I've been consistently harping on this, but it bears repeating: none of the "plausible stories" I described to dissolve the various criticisms are actually things that I consider a sure bet, or even >50% likely. That's not the point. None of them show that interpretability is a more important thing to be working on than other subjects that also need to go right for good futures: questions of coordination, politics, etc. None of them show that the sorts of "physics-pilled" concrete interpretability stories I gave are more important than either more basic black-box testing and red-teaming or, conversely, deeper questions about intelligence pursued by people who study Agent Foundations. However I am somewhat confident that even a very weak version of these arguments should be sufficient to convince a reasonable person that some amount of fundamental interpretability research is needed. At least I think this is true for a reasonable person with similar intuitions to myself about "the gist of what we know so far about how ML works", obtained from reading about and playing with "realistic" ML algorithms[4].
I am happy about the existence of honest skeptical discussions like the one in Charbel-Raphaël's post, and think holding discussions of this type leads to clearer pictures of what's valuable, what's hard, whether the community has perhaps overinvested in certain forms of interpretability. At the same time, if you even partially track the shapes of arguments I outlined here, you should reject any fundamental belief in the complete unworkability of interpretability techniques. For all blanket arguments I have heard so far, "plausible-myth shaped" counterexamples exist.
- ^
This notion has some things in common with safety case sketches, though is less specific.
- ^
Interpretability of souls, but that's beside the point
- ^
Note that in a way I don't have time to explain here, stochasticity in AI models is probably irreducible -- i.e., it is not just noise but, like for efficient sorting algorithms that use random hashes, it is a core component of the program -- we show that certain efficient implementations of boolean algorithms have this property in our paper on computation in superposition, for example.
- ^
The modern intuition is in my understandinv quite different from the thinking a decade ago, when the view of NN's as powerful black-box oracles was predominant.
2 comments
Comments sorted by top scores.
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-01-22T10:49:18.768Z · LW(p) · GW(p)
Beautifully argued, Dmitry. Couldn't agree more.
I would also note that I consider the second problem of interpretability basically the central problem of complex systems theory.
I consider the first problem a special case of the central probem of alignment. It's very closely related to the 'no free lunch' problem.
comment by jacob_drori (jacobcd52) · 2025-01-23T17:56:09.632Z · LW(p) · GW(p)
We have pretty robust measurements of complexity of algorithms from SLT
This seems overstated. What's the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a "complexity"?
... and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)
Citation?