'Fundamental' vs 'applied' mechanistic interpretability research
post by Lee Sharkey (Lee_Sharkey) · 2023-05-23T18:26:18.174Z · LW · GW · 6 comments
When justifying my mechanistic interpretability research interests to others, I've occasionally found it useful to borrow a distinction from physics and distinguish between 'fundamental' versus 'applied' interpretability research.
Fundamental interpretability research is the kind that investigates better ways to think about the structure of the function learned by neural networks. It lets us make new categories of hypotheses about neural networks. In the ideal case, it suggests novel interpretability methods based on new insights, but is not the methods themselves.
Examples include:
- A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)
- Toy Models of Superposition (Elhage et al., 2022)
- Polysemanticity and Capacity in Neural Networks (Scherlis et al., 2022)
- Interpreting Neural Networks through the Polytope Lens (Black et al., 2022)
- Causal Abstraction for Faithful Model Interpretation (Geiger et al., 2023)
- Research agenda: Formalizing abstractions of computations (Jenner, 2023 [AF · GW])
- Work that looks for ways to identify modules in neural networks (see LessWrong 'Modularity' tag [? · GW]).
Applied interpretability research is the kind that uses existing methods to find the representations or circuits that particular neural networks have learned. It generally involves finding facts or testing hypotheses about a given network (or set of networks) based on assumptions provided by theory.
Examples include:
- Steering GPT-2-XL by adding an activation vector (Turner et al., 2023 [AF · GW])
- Discovering Latent Knowledge in Language Models (Burns et al., 2022)
- The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (Millidge et al., 2022 [AF · GW])
- In-context Learning and Induction Heads (Olsson et al., 2022)
- We Found An Neuron in GPT-2 (Miller et al., 2023 [AF · GW])
- Language models can explain neurons in language models (Bills et al., 2023)
- Acquisition of Chess Knowledge in AlphaZero (McGrath et al., 2021)
Although I've found the distinction between fundamental and applied interpretability useful, it's not always clear cut:
- Sometimes articles are part fundamental, part applied (e.g. arguably 'A Mathematical Framework for Transformer Circuits' is mostly theoretical, but also studies particular language models using new theory).
- Sometimes articles take generally accepted 'fundamental' -- but underutilized -- assumptions and develop methods based on them (e.g. Causal Scrubbing [AF · GW], where the key underutilized fundamental assumption was that the structure of neural networks can be well studied using causal interventions).
- Other times the distinction is unclear because applied interpretability feeds back into fundamental interpretability, leading to fundamental insights about the structure of computation in networks (e.g. the Logit Lens [AF · GW] lends weight to the theory that transformer language models do iterative inference).
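As a minimal illustration of the Logit Lens point in the last bullet (illustrative only, not the original authors' implementation): decode the residual stream after each block of GPT-2 through the final layer norm and unembedding, and watch the next-token guess refine layer by layer. The prompt and the use of Hugging Face's GPT-2 are arbitrary choices for the example.

```python
# Hedged sketch of the Logit Lens idea: project each layer's residual stream
# through the final layer norm and unembedding to get a next-token distribution.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: the embedding output plus the residual stream after each of the 12 blocks.
for layer_idx, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))  # decode the last position
    top_id = logits.argmax(-1).item()
    print(f"layer {layer_idx:2d}: top next-token guess = {tokenizer.decode(top_id)!r}")
```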
Why I currently prioritize fundamental interpretability
Clearly both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure future models are safe.
But given our current position on the tech tree, I find that I care more about fundamental interpretability.
The reason is that current interpretability methods are unsuitable for comprehensively interpreting networks on a mechanistic level. So far, our methods only seem to be able to identify particular representations that we look for or describe how particular behaviors are carried out. But they don't let us identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean). Let's call the ability to do these things 'comprehensive interpretability' [1].
We need comprehensive interpretability in order to have strong-ish confidence about whether dangerous representations or circuits exist in our model. If we don't have strong-ish confidence, then many theories of impact for interpretability [AF · GW] are inordinately weakened:
- We're a lot less able to use interpretability as a 'force multiplier on alignment research' because we can't trust that our methods haven't missed something crucial. This is particularly true when models are plausibly optimizing against us and hiding dangerous thoughts in places we aren't looking. A similar pattern holds for theories of impact based on 'Empirical evidence for/against threat models', 'Improving human feedback', and 'Informed oversight'.
- We can't be confident about our interpretability audits. Not only does this raise the risk that we'll miss something, but it makes it much harder to justify including interpretability in regulations, since effective regulation usually requires technical clarity. It also makes it harder for clear norms around safety to form.
- We don't get the coordination/cooperation benefits resulting from some actors being able to actually trust other actors' systems.
- We definitely can't use our interpretability methods in the loss function. To be clear, we probably shouldn't do this even if we believed we had comprehensive interpretability. We'd probably want provably comprehensive interpretability (or some other reason to believe that our interpretability methods wouldn't simply be circumvented [AF · GW]) before we could safely justify using them in the loss function.
For most of these theories of impact, the relationship feels like it might be nonlinear: A slight improvement to interpretability that nevertheless falls short of comprehensive interpretability does not lead to proportional safety gains; only when we cross a threshold to something resembling comprehensive interpretability would we get the bulk of the safety gains. And right now, even though there's a lot of valuable applied work to be done, it feels to me like progress in fundamental interpretability is the main determinant of whether we cross that threshold.
[1] Similar terms for 'comprehensive interpretability' include Anthropic's notion of 'enumerative safety', Evan Hubinger's notion of 'worst-case inspection transparency [AF · GW]', and Erik Jenner's notion of 'quotient interpretability [AF · GW]'.
Comments
comment by Logan Riggs (elriggs) · 2023-05-24T16:27:06.167Z · LW(p) · GW(p)
How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability?
Are there other specific areas you're excited about?
↑ comment by Lee Sharkey (Lee_Sharkey) · 2023-05-26T10:59:10.146Z · LW(p) · GW(p)
Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers.
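For concreteness, a minimal sketch of a bilinear layer, assuming the (W1 x) ⊙ (W2 x) form (an elementwise product of two linear projections in place of an elementwise nonlinearity); the dimensions are arbitrary. The appeal for analysis is that each output coordinate is then an exact quadratic form in the input.

```python
# Hedged sketch of a bilinear layer: h = (W1 x) * (W2 x), elementwise.
import torch
import torch.nn as nn

class BilinearLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_out, bias=False)
        self.W2 = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        return self.W1(x) * self.W2(x)   # elementwise product of two projections

layer = BilinearLayer(16, 8)
x = torch.randn(16)
h = layer(x)

# Check the quadratic-form view for output coordinate i: h[i] = x^T A_i x,
# where A_i is the symmetrized outer product of the two weight rows.
i = 0
w1, w2 = layer.W1.weight[i], layer.W2.weight[i]
A_i = 0.5 * (torch.outer(w1, w2) + torch.outer(w2, w1))
assert torch.allclose(h[i], x @ A_i @ x, atol=1e-5)
```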
Dictionary learning - This is one of my main bets for comprehensive interpretability.
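A minimal sketch of what dictionary learning on activations can look like in practice, assuming a sparse-autoencoder formulation (reconstruct cached activations under an L1 sparsity penalty). The architecture, penalty, and hyperparameters here are illustrative assumptions, not a specific published recipe.

```python
# Hedged sketch: dictionary learning on cached activations via a sparse autoencoder.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_act)   # rows of a learned feature dictionary

    def forward(self, x):
        codes = torch.relu(self.encoder(x))       # sparse, non-negative codes
        recon = self.decoder(codes)
        return recon, codes

def train_sae(acts, d_dict=1024, l1_coeff=1e-3, steps=200, lr=1e-3):
    sae = SparseAutoencoder(acts.shape[-1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, codes = sae(acts)
        loss = (recon - acts).pow(2).mean() + l1_coeff * codes.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Usage: fit a dictionary to a (num_samples, d_act) batch of cached MLP activations.
acts = torch.randn(4096, 256)   # stand-in for real cached activations
sae = train_sae(acts)
```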
Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709
↑ comment by Joseph Van Name (joseph-van-name) · 2023-05-25T11:43:56.890Z · LW(p) · GW(p)
Take a random variable given by a trained model with bilinear layers, with random initialization and training data. Then I would like to know whether various estimated upper bounds on various entropies for that model are much lower than they would be if it were a more typical machine learning model where a linear layer is composed with a ReLU. Entropy seems like a good objective measure of the lack of decipherability.
↑ comment by Joseph Van Name (joseph-van-name) · 2023-05-25T13:57:16.274Z · LW(p) · GW(p)
Now that I actually think about it, I have some ideas about how we can cluster neurons together if we are using bilinear layers. Because of this, I am starting to like bilinear layers a bit more, and I am feeling much more confident about the problem of interpreting neural networks as long as the neural networks have an infrastructure that is suitable for interpretability. I am going to explain everything in terms of real-valued mappings, but everything I say can be extended to complex and quaternionic matrices (but one needs to be a little bit more careful about conjugations, transposes, and adjoints, so I will leave the complex and quaternionic cases as an exercise to the reader).
Suppose that A_1, …, A_d are n × n real symmetric matrices. Then define a mapping F: ℝ^n → ℝ^d by setting F(x) = (x^T A_1 x, …, x^T A_d x).
Now, given a collection of -real matrices, define a partial mapping by setting where denotes the spectral radius and denotes the tensor product. Then we say that is a real -spectral radius dimensionality reduction (LSRDR) if is locally maximized. One can compute LSRDRs using a variant of gradient ascent combined with the power iteration technique for finding the dominant left and right eigenvectors and eigenvalues of and .
If is an LSRDR of , then you should be able to find real matrices where for . Furthermore, there should be a constant where . We say that the LSRDR is normalized if , so let's assume that is a normalized LSRDR. Then define . Then should be a (not-necessarily orthogonal, so but we could have ) projection matrix of rank . If are all symmetric, then the matrix should be an orthogonal projection. The vector space will be a cluster of neurons. We can also determine which elements of this cluster are most prominent.
Now, define a linear superoperator by setting and set which is the adjoint of where we endow with the Frobenius inner product. Let denote a dominant eigenvector of and let denote a dominant eigenvector of. Then after multiplying by constant real factors, the matrices will be (typically distinct) positive definite trace 1 matrices of rank with . If we retrained the LSRDR but with a different initialization, then the matrices will still remain the same.
If , then the values will be numbers in the interval [0,1] that measure how much the vector belongs in the cluster.
If is an -orthogonal matrix, then the matrices will remain the same if they were trained on instead of , so the matrices care about just the inner product space structure of while ignoring any of the other structure of . Let .
We can then use LSRDRs to compute the backpropagation of a cluster throughout the network.
Suppose that where each is a bilinear layer. Then whenever is a bilinear mapping, and is a positive semidefinite matrix that represents a cluster in , the positive semidefinite matrices represent clusters in .
I have not compared LSRDRs to other clustering and dimensionality reduction techniques such as higher order singular value decompositions, but I like LSRDRs since my computer calculations indicate that they are often unique.
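As a rough numerical sketch of the kind of optimization involved: gradient ascent on a ratio of spectral radii. The particular objective used below, ρ(Σᵢ Aᵢ⊗Bᵢ) / ρ(Σᵢ Bᵢ⊗Bᵢ)^(1/2), and the use of plain autograd in place of the power-iteration scheme mentioned above, are simplifying assumptions for illustration only, not a quotation of the definitions in this comment.

```python
# Illustrative only: gradient ascent on an assumed spectral-radius-ratio objective
# over tuples of d x d matrices (B_1, ..., B_r), given n x n matrices (A_1, ..., A_r).
import torch

def spectral_radius(M: torch.Tensor) -> torch.Tensor:
    # Largest eigenvalue magnitude; torch.linalg.eigvals supports autograd
    # when the eigenvalues are distinct (generically true for these matrices).
    return torch.linalg.eigvals(M).abs().max()

def objective(As, Bs):
    num = spectral_radius(sum(torch.kron(A, B) for A, B in zip(As, Bs)))
    den = spectral_radius(sum(torch.kron(B, B) for B in Bs)).sqrt()
    return num / den

def fit(As, d, steps=500, lr=1e-2):
    Bs = [torch.randn(d, d, requires_grad=True) for _ in As]
    opt = torch.optim.Adam(Bs, lr=lr)
    for _ in range(steps):
        loss = -objective(As, Bs)   # ascend the objective by descending its negation
        opt.zero_grad()
        loss.backward()
        opt.step()
    return [B.detach() for B in Bs]

# Usage: reduce three random symmetric 8x8 matrices to 2x2 matrices.
As = [(lambda M: (M + M.T) / 2)(torch.randn(8, 8)) for _ in range(3)]
Bs = fit(As, d=2)
```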
A coordinate free perspective:
Suppose that are real finite dimensional inner product spaces. Then we say that a function is a quadratic form if for each , the mapping is a quadratic form. We say that a linear operator is symmetric if for each , the operator is symmetric. The quadratic forms can be put into a canonical one-to-one correspondence with the symmetric linear operators .
If is an arbitrary linear operator, then define by letting where denotes the partial trace.
Given a linear mapping , and a dimensional real inner product space , define a partial mapping by setting . We say that a linear mapping is a real LSRDR of if the value is locally maximized. If is a real LSRDR of , one can as before (if everything goes right) find linear operators and constant where and where . As before, we can normalize the LSRDR so that . In this case, we can set to be a dominant eigenvector of and to be a dominant eigenvector of . We still define and the mapping will be a non-orthogonal projection, and will still be positive semidefinite (up-to a constant factor). The situation we are in is exactly as before except that we are working with abstract finite dimensional inner product spaces without any mention of coordinates.
Conclusions:
The information that I have given here can be found in several articles that I have posted at https://circcashcore.com/blog/.
I have thought of LSRDRs as machine learning models themselves (such as word embeddings), but it looks like LSRDRs may also be used to interpret machine learning models.
When generalizing bilinear layers to a quaternionic setting, do we want the layers to be linear in both variables or do we want them to be linear in one variable and anti-linear in the other variable?
comment by Arthur Conmy (arthur-conmy) · 2023-05-28T15:35:19.553Z · LW(p) · GW(p)
This was a nice description, thanks!
However, regarding
comprehensively interpreting networks [... aims to] identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean)
I think this is an incredibly optimistic hope that needs to be challenged more.
On my model, GPT-N has a mixture of a) crisp representations, b) fuzzy heuristics that are made crisp in GPT-(N+1), and c) noise and misgeneralizations. Unless we're discussing models that perfectly fit their training distribution, I expect comprehensively interpreting networks involves untangling many competing fuzzy heuristics which are all imperfectly implemented. Perhaps you expect this to be possible? However, I'm pretty skeptical that this is tractable, and I expect the best interpretability work not to confront these completeness guarantees.
Related (I consider "mechanistic interpretability essentially solved" to be similar to your "comprehensive interpreting" goal)
comment by carboniferous_umbraculum (Spencer Becker-Kahn) · 2023-05-26T08:40:28.664Z · LW(p) · GW(p)
Interesting thoughts!
It reminds me not only of my own writing on a similar theme, but also of another one of these viewpoints/axes along which to carve interpretability work, mentioned in this post [AF · GW] by jylin04 [AF · GW]:
...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code [AF · GW]. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then what "program" a particular training regimen will lead to in that language [7] [AF(p) · GW(p)]. It seems to me that by focusing on specific trained models, most interpretability research discussed here is of the second type. But by constructing an effective theory for an entire class of architecture that's agnostic to the choice of dataset, PDLT is a rare example of the first type.
I don't necessarily totally agree with her phrasing but it does feel a bit like we are all gesturing at something vaguely similar (and I do agree with her that PDLT-esque work may have more insights in this direction than some people on our side of the community have appreciated).
FWIW, in a recent comment reply [LW(p) · GW(p)] to Joseph Bloom, I also ended up saying a bit more about why I don't actually see myself working much more in this direction, despite it seeming very interesting, but I'm still on the fence about that. (And one last point that didn't make it into that comment is the difficulty posed by a world in which, increasingly, the plucky bands of interpretability researchers on the fringes literally don't even know what the cutting-edge architectures and training processes in the biggest labs even are.)