How useful is mechanistic interpretability?
post by ryan_greenblatt, Neel Nanda (neel-nanda-1), Buck, habryka (habryka4) · 2023-12-01T02:54:53.488Z · LW · GW · 54 comments
Contents
Opening positions
Do induction heads and French neurons exist?
What is the bar for a mechanistic explanation?
Could mechanistic interpretability rule out deceptive alignment?
Hopes for mechanistic interpretability being useful
Immediately-relevant mechanistic interpretability projects
54 comments
Comments sorted by top scores.
comment by johnswentworth · 2023-12-01T06:32:15.907Z · LW(p) · GW(p)
This dialogue mostly makes me want to rant about how all y'all are doing mech interp wrong. So, rant time. This-is-a-rant-so-not-necessarily-reflectively-endorsed, etc.
Starting point: Science In A High-Dimensional World [LW · GW]. Quoting from that post:
In a high-dimensional world like ours, there are billions of variables which could influence an outcome. The great challenge is to figure out which variables are directly relevant - i.e. which variables mediate the influence of everything else. In practice, this looks like finding mediators and hunting down sources of randomness. Once we have a set of control variables which is sufficient to (approximately) determine the outcome, we can (approximately) rule out the relevance of any other variables in the rest of the universe, given the control variables.
A remarkable empirical finding across many scientific fields, at many different scales and levels of abstraction, is that a small set of control variables usually suffices. Most of the universe is not directly relevant to most outcomes most of the time.
Ultimately, this is a picture of “gears-level science”: look for mediation, hunt down sources of randomness, rule out the influence of all the other variables in the universe.
This applies to interpretability just like any other scientific field. The real gold-standard thing to look for is some relatively-small set of variables which determine some other variables, basically-deterministically. Or, slightly weaker: a relatively-small Markov blanket which screens off some chunk of the system from everything else.
In order for this to be useful, the determinism/screening does need pretty high precision - e.g. Ryan's 99% number sounds like a reasonable day-to-day heuristic, many nines might be needed if there's a lot of bits involved, etc.
On the flip side, this does not necessarily need to look like a complete mechanistic explanation. Ideally, findings of screening are the building blocks from which a complete mechanistic model is built. The key point is that findings of screening provide an intermediate unit of progress, in between "no clue what's going on" and "full mechanistic interpretation". Those intermediate units of progress can be directly valuable in their own right, because they allow us to rule things out: (one way to frame) the whole point of screening is that lots of interactions are ruled out. And they directly steer the search for mechanistic explanations, by ruling out broad classes of models.
That is the sort of approach to mech interp which would be able to provide valuable incremental progress on large models, not just toy models, because it doesn't require understanding everything about a piece before something useful is produced.
(Side note: yet another framing of all this would be in terms of modules/modularity.)
Replies from: rohinmshah, thomas-kwa↑ comment by Rohin Shah (rohinmshah) · 2023-12-01T08:37:13.841Z · LW(p) · GW(p)
This seems like exactly what mech interp is doing? Circuit finding is all about finding sparse subgraphs. It continues to work with large models, when trying to explain a piece of the behavior of the large model. SAE stands for sparse autoencoder: the whole point is to find the basis in which you get sparsity. I feel like a lot of mech interp has been almost entirely organized around the principle of modularity / sparsity, and the main challenge is that it's hard (you don't get to 99% of loss recovered, even on pieces of behavior, while still being meaningfully sparse).
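To keep the object under discussion concrete, a sparse autoencoder of the kind being described can be sketched in a few lines (dimensions, initialisation, and the loss coefficient here are illustrative, not from any particular codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 64, 512          # dictionary is wider than the activation space
W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))

def sae_forward(x, l1_coeff=1e-3):
    """Encode activations x, reconstruct them, and compute the SAE loss.

    Loss = reconstruction error + L1 penalty, which pushes most dictionary
    elements to be inactive (zero) on any given input.
    """
    f = np.maximum(0.0, W_enc @ x + b_enc)   # sparse feature activations
    x_hat = W_dec @ f                        # reconstruction
    loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
    return f, x_hat, loss

x = rng.normal(size=d_model)
f, x_hat, loss = sae_forward(x)
```

The L1 term is what makes "only a few neurons non-zero on a given input" the trained behavior, while the wide dictionary is what blows up the nominal dimensionality — both halves of the disagreement below.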
Replies from: Lblack, johnswentworth↑ comment by Lucius Bushnaq (Lblack) · 2023-12-02T08:40:43.438Z · LW(p) · GW(p)
SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that only a few elements of the dictionary are active at any one time; they do not deliver sparsity in the sense of a low-dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.
Replies from: neel-nanda-1↑ comment by Neel Nanda (neel-nanda-1) · 2023-12-02T10:16:51.047Z · LW(p) · GW(p)
My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.
I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"
Replies from: Lblack↑ comment by Lucius Bushnaq (Lblack) · 2023-12-02T10:44:03.947Z · LW(p) · GW(p)
The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
Leaving that aside, the vanilla reading of this claim also seems kind of obviously false for many models; otherwise, optimising them at inference time through e.g. low-rank approximation of weight matrices would never work. You are throwing away at least one floating point number's worth of description bits there.
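The kind of inference-time optimisation being referred to can be sketched with a truncated SVD; the matrix and rank here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight matrix that is (approximately) low-rank.
W = rng.normal(size=(256, 16)) @ rng.normal(size=(16, 256))

# Keep only the top-k singular directions.
k = 16
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_approx = (U[:, :k] * S[:k]) @ Vt[:k]

# The approximation discards description bits yet preserves the map.
rel_error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

Storing the truncated factors takes (256 + 256 + 1) × 16 numbers instead of 256 × 256 — those discarded numbers are the "description bits" being thrown away.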
I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"
A low-dimensional summary of a variable vector v of size n is a fixed set of random variables, far fewer than n of them, that suffice to summarise the state of v. To summarise the state of v using the activations in an SAE dictionary, I have to describe the state of more than n variables. That these variables are sparse may sometimes let me define an encoding scheme for describing them that takes fewer than n variables, but that just corresponds to undoing the autoencoding and then performing some other compression.
Replies from: neel-nanda-1↑ comment by Neel Nanda (neel-nanda-1) · 2023-12-02T10:54:11.415Z · LW(p) · GW(p)
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
I'd be curious to hear more about this - IMO we're talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomenon?
Replies from: Lblack↑ comment by Lucius Bushnaq (Lblack) · 2023-12-02T12:42:27.954Z · LW(p) · GW(p)
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomenon?
I don't have the time and energy to do this properly right now, but here's a few thought experiments to maybe help communicate part of what I mean:
Say you have a transformer model that draws animals. As in, you type “draw me a giraffe”, and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.
The latter half of the model does not then, in any real sense, think of the concept "giraffe" while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw.
If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals (far more animal types than fifty, or than the network width), I'd guess the "sparse features" the SAE finds might be the individual animal types: "giraffe", "elephant", etc.
Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up.
And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.
Just because we can find a transformation that turns an NNs activations into numbers that correlate with what a human observer would regard as separate features of the data, does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense.
The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.
If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn't matter for the NNs task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set of pretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.
If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point a SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size) “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.
Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encode a sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn't thrown away the information needed to compute that there was a bush in the image, but we don't know it is thinking in bush. It probably isn't, else bush would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn't have found it.
↑ comment by nostalgebraist · 2023-12-02T18:42:28.789Z · LW(p) · GW(p)
This is a great, thought-provoking critique of SAEs.
That said, I think SAEs make more sense if we're trying to explain an LLM (or any generative model of messy real-world data) than they do if we're trying to explain the animal-drawing NN.
In the animal-drawing example:
- There's only one thing the NN does.
- It's always doing that thing, for every input.
- The thing is simple enough that, at a particular point in the NN, you can write out all the variables the NN cares about in a fully compositional code and still use fewer coordinates (50) than the dictionary size of any reasonable SAE.
With something like an LLM, we expect the situation to be more like:
- The NN can do a huge number of "things" or "tasks." (Equivalently, it can model many different parts of the data manifold with different structures.)
- For any given input, it's only doing roughly one of these "tasks."
- If you try to write out a fully compositional code for each task -- akin to the size / furriness / etc. code, but we have a separate one for every task -- and then take the Cartesian product of them all to get a giant compositional code for everything at once, this code would have a vast number of coordinates. Much larger than the activation vectors we'd be explaining with an SAE, and also much larger than the dictionary of that SAE.
- The aforementioned code would also be super wasteful, because it uses most of its capacity expressing states where multiple tasks compose in an impossible or nonsensical fashion. (Like "The height of the animal currently being drawn is X, AND the current Latin sentence is in the subjunctive mood, AND we are partway through a Rust match expression, AND this author of this op-ed is very right-wing.")
- The NN doesn't have enough coordinates to express this Cartesian product code, but it also doesn't need to do so, because the code is wasteful. Instead, it expresses things in a way that's less-than-fully-compositional ("superposed") across tasks, no matter how compositional it is within tasks.
- Even if every task is represented in a maximally compositional way, the per-task coordinates are still sparse, because we're only doing ~1 task at once and there are many tasks. The compositional nature of the per-task features doesn't prohibit them from being sparse, because tasks are sparse.
- The reason we're turning to SAEs is that the NN doesn't have enough capacity to write out the giant Cartesian product code, so instead it leverages the fact that tasks are sparse, and "re-uses" the same activation coordinates to express different things in different task-contexts.
- If this weren't the case, interpretability would be much simpler: we'd just hunt for a transformation that extracts the Cartesian product code from the NN activations, and then we're done.
- If it existed, this transformation would probably (?) be linear, b/c the information needs to be linearly retrievable within the NN; something in the animal-painter that cares about height needs to be able to look at the height variable, and ideally to do so without wasting a nonlinearity on reconstructing it.
- Our goal in using the SAE is not to explain everything in a maximally sparse way; it's to factor the problem into (sparse tasks) x (possibly dense within-task codes).
- Why might that happen in practice? If we fit an SAE to the NN activations on the full data distribution, covering all the tasks, then there are two competing pressures:
- On the one hand, the sparsity loss term discourages the SAE from representing any given task in a compositional way, even if the NN does so. All else being equal, this is indeed bad.
- On the other hand, the finite dictionary size discourages the SAE from expanding the number of coordinates per task indefinitely, since all the other tasks have to fit somewhere too.
- In other words, if your animal-drawing case is one of the many tasks, and the SAE is choosing whether to represent it as 50 features that all fire together or 1000 one-hot highly-specific-animal features, it may prefer the former because it doesn't have room in its dictionary to give every task 1000 features.
- This tension only appears when there are multiple tasks. If you just have one compositionally-represented task and a big dictionary, the SAE does behave pathologically as you describe.
- But this case is different from the ones that motivate SAEs: there isn't actually any sparsity in the underlying problem at all!
- Whereas with LLMs, we can be pretty sure (I would think?) that there's extreme sparsity in the underlying problem, due to dimension-counting arguments, intuitions about the number of "tasks" in natural data and their level of overlap, observed behaviors where LLMs represent things that are irrelevant to the vast majority of inputs (like retrieving very obscure facts), etc.
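The linear-retrievability point above ("the information needs to be linearly retrievable within the NN") can be sketched with a toy example; the "height" variable, the encoding direction, and all dimensions here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 64
height = rng.normal(size=n)            # the underlying variable, e.g. animal height
direction = rng.normal(size=d)         # the direction encoding "height" in activations
acts = np.outer(height, direction) + 0.01 * rng.normal(size=(n, d))

# A least-squares linear probe recovers "height" from the activations,
# which is exactly what a downstream circuit could do with a single linear read.
w, *_ = np.linalg.lstsq(acts, height, rcond=None)
recovered = acts @ w

corr = np.corrcoef(recovered, height)[0, 1]
```

If the variable were only nonlinearly retrievable, the downstream consumer would have to spend a nonlinearity reconstructing it — the point made in the bullet above.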
↑ comment by Sam Marks (samuel-marks) · 2023-12-03T00:38:46.554Z · LW(p) · GW(p)
The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.
I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.
The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" token and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns. (The features for dictionaries trained on the non-random network looked much better.)
We also did a variant of this experiment where we randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").
Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.
I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:
- your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
- because of correlations present in your underlying data (the thing that you seem to be worried about).
Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:
- Most clusters in NN activations on real data might be of the first type
- This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
- Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
- E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).
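That preference can be checked numerically in a toy setting (the centroids here are chosen purely for illustration): a dictionary containing y and z reconstructs the y + z cluster for free, while an equally sized dictionary containing w and x cannot:

```python
import numpy as np

# Five cluster centroids in a 4-d space; the fifth, y + z, is compositional.
w = np.array([1.0, 0.0, 0.0, 0.0])
x = np.array([0.0, 1.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 0.0])
z = np.array([0.0, 0.0, 0.0, 1.0])
composed = y + z

def reconstruction_error(point, dictionary):
    """Least-squares reconstruction error of `point` from dictionary vectors."""
    D = np.stack(dictionary, axis=1)
    coeffs, *_ = np.linalg.lstsq(D, point, rcond=None)
    return np.linalg.norm(point - D @ coeffs)

err_yz = reconstruction_error(composed, [y, z])   # composition covered for free
err_wx = reconstruction_error(composed, [w, x])   # composition missed entirely
```

A dictionary of the same size earns "bonus points" only when its elements participate in compositional structure, which is the asymmetry the bullet above describes.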
↑ comment by Adam Jermyn (adam-jermyn) · 2023-12-03T03:38:10.651Z · LW(p) · GW(p)
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2023-12-03T03:49:12.599Z · LW(p) · GW(p)
(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single token features on the part of the randomized transformer and when you ignore these features (or correct in some other way I'm forgetting?), the SAE on an actual transformer indeed has higher correlation.)
Replies from: samuel-marks↑ comment by Sam Marks (samuel-marks) · 2023-12-03T06:09:48.176Z · LW(p) · GW(p)
Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity [AF · GW] (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).
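For readers unfamiliar with the metric, mean max cosine similarity can be sketched roughly as follows (a straightforward reading of the name; the linked post's implementation may differ in details):

```python
import numpy as np

def mean_max_cosine_similarity(learned, ground_truth):
    """For each ground-truth dictionary vector, take its best cosine match
    among the learned vectors, then average those best matches.
    Rows of each array are dictionary elements."""
    a = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    b = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    sims = b @ a.T                   # (n_truth, n_learned) cosine similarities
    return sims.max(axis=1).mean()

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 8))
same = mean_max_cosine_similarity(D, D)                        # identical dictionaries
other = mean_max_cosine_similarity(rng.normal(size=(32, 8)), D)  # unrelated dictionary
```

Identical dictionaries score 1.0; unrelated random dictionaries score well below that, which is the contrast the >.95 vs <.3 numbers are drawing.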
↑ comment by johnswentworth · 2023-12-01T16:55:41.586Z · LW(p) · GW(p)
One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%. That's a pretty core part of what makes science work, in general. And yeah, that's hard (at least in the sense of being a lot of work; more arguable whether it's hard in a stronger sense than that).
Replies from: rohinmshah↑ comment by Rohin Shah (rohinmshah) · 2023-12-02T21:18:26.860Z · LW(p) · GW(p)
One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.
When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).
It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods). I do expect we can improve; we're very far from the 99% standard. But the way we improve won't be by "drilling into the residual"; that has been tried and is insufficient. EDIT: Possibly by "drill into the residual" you mean "understand why the methods don't work and then improve them" -- if so I agree with that but also think this is what mech interp researchers want to do.
(Why am I still optimistic about interpretability? I'm not convinced that the 99% standard is required for downstream impact -- though I am pretty pessimistic about the "enumerative safety" story of impact, basically for the same reasons as Buck and Ryan afaict.)
Replies from: johnswentworth↑ comment by johnswentworth · 2023-12-03T07:07:27.390Z · LW(p) · GW(p)
It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods).
I have the opposite expectation there; I think it's just that current methods are pretty primitive.
Replies from: wassname↑ comment by Thomas Kwa (thomas-kwa) · 2023-12-01T07:28:39.989Z · LW(p) · GW(p)
It's not clear what experiments this mindset suggests. If I squint I basically get things that are already being tried, like intervention experiments to determine the connections between SAE features.
comment by Sam Marks (samuel-marks) · 2023-12-01T20:05:18.635Z · LW(p) · GW(p)
Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!
The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements about (1) whether Ryan's list is complete, and (2) whether Ryan's concerns about the approaches listed are compelling.
Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.
Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).
Here's a concrete example (which is also the example I most care about). Our goal is to classify statements as true/false, given access to a model that knows the answer. Suppose our model has distinct features representing "X is true" and "humans believe X." Further suppose that on any labeled dataset we're able to create, these two features are correlated; thus, if we make a labeled dataset of true/false statements and train a probe on it, we can't tell whether the probe will generalize as an "X is true" classifier or a "humans believe X" classifier. However, a coarse-grained mechanistic understanding would help here. E.g., one could identify all of the model features which serve as accurate classifiers on our dataset, and only treat statements as true if all of the features label them as true. Or if we need a lower FPR, one might be able to mechanistically distinguish these features, e.g. by noticing that one feature is causally downstream of features that look related to social reasoning and the other feature isn't.
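The "only treat statements as true if all of the features label them as true" idea can be sketched as follows; the feature scores here are made up for illustration, and real scores would come from a decomposition of model internals:

```python
import numpy as np

def conservative_truth_classifier(feature_scores, threshold=0.5):
    """Accept a statement only if every candidate 'truth-like' feature
    classifies it as true; trades some TPR for a lower FPR.

    feature_scores: (n_statements, n_features) array of per-feature scores.
    """
    votes = feature_scores > threshold
    return votes.all(axis=1)

# Two features that agree on labeled data but diverge off-distribution:
# column 0 ~ "X is true", column 1 ~ "humans believe X".
scores = np.array([
    [0.9, 0.8],   # true and believed         -> accepted
    [0.1, 0.2],   # false and disbelieved     -> rejected
    [0.9, 0.1],   # true but not believed     -> rejected (conservative)
    [0.2, 0.8],   # false but widely believed -> rejected (the failure we fear)
])
labels = conservative_truth_classifier(scores)
```

The key property is the last row: a statement humans merely believe no longer slips through, at the cost of also rejecting the third row.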
This is formally similar to what the authors of this paper did. In brief, they were working with the Waterbirds dataset, an image classification task with lots of spuriously correlated features which are not disambiguated by the labeled data. Working with a CLIP ViT, the authors used some ad-hoc technique to get a general sense that certain attention heads dealt with concepts like "texture," "color," and "geolocation." Then they ablated the heads which seemed most likely to attend to confounding features; this resulted in a classifier which generalized in the desired way, without requiring a better-quality labeled dataset.
Curious for thoughts about/critiques of this impact story.
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2023-12-01T22:04:52.687Z · LW(p) · GW(p)
Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.
Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).
I don't think this exact thing is directly mentioned by my list. Thanks for the addition.
Let me try to state something which captures most of that approach to make sure I understand:
Approach: Maybe we can find some decomposition of model internals[1] such that all or most components directly related to some particular aspect of cognition are overtly obvious and there are also a small number of such components. Then, maybe we can analyze, edit, or build a classifier using these components in cases where baseline training techniques (e.g. probing) are insufficient.
Then, it seems like there are two cases where this is useful:
- Merely capturing all (or most) of the components like this is directly useful for ablation or building a classifier (e.g. because we want an ALL over the truth notions in the model like in your example, or because we want to do precise removal of some behavior or capability). This requires that our list of components can be sufficiently small such that using all of them doesn't bite too hard (e.g. low enough FPR) and that this list of components includes enough of the action that using all of them is sufficiently reliable (e.g. high enough TPR).
- Even without detailed understanding and potentially without capturing "all" components, we can further identify components by looking at their connections or doing similar intervention experiments at a high level. Then, we can use our analysis of these components to do something useful (e.g. determine which components correspond to humans merely thinking something is true).
This impact story seems overall somewhat reasonable to me. It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial. My main concerns are:
- Both stories depend on our decomposition resulting in components which are possible to recognize and label to quite a high level of specificity despite not building much understanding of the exact behavior. This seems like a strong property, and it seems unlikely we'd be able to find an unsupervised decomposition which consistently has this property for the cases we care about. (I don't see why sparsity would have this property to the extent we need, but it seems at least a bit plausible, and it's probably better than the default.)
- More generally, it feels like this story is supposing some level of "magic" on the part of our decomposition. If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on. It's possible that in practice, some unsupervised decomposition (e.g. SAE) cleanly breaks things apart into components which are easy to label while simultaneously these labels are quite specific and quite accurate. But why would this be true? (Maybe forthcoming research will demonstrate this, but my inside view thinks this is unlikely.)
- If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).
If you thought that current fundamental science in mech interp was close to doing this, I think I'd probably be excited about building test bed(s) where you think this sort of approach could be usefully applied and which aren't trivially solved by other methods. If you don't think the fundamentals of mech interp are close, it would be interesting to understand what you think will change to make this story viable in the future (better decompositions? something else?).
Either a "default" decomposition like neurons/attention heads or "non-default" decomposition like a sparse autoencoder. ↩︎
↑ comment by Sam Marks (samuel-marks) · 2023-12-01T23:39:55.370Z · LW(p) · GW(p)
Let me try to state something which captures most of that approach to make sure I understand:
Everything you wrote describing the hope looks right to me.
It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial.
To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"
If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on.
[...]
If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).
These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Just so with truth: there's probably lots of different subtly different notions of truth, but for the application of "detecting whether my AI believes statement X to be true" I don't care about that. I do care about the difference between "true" and "humans think is true," but that's a big difference that I can understand (even if I can't produce examples), and where I can articulate the sorts of cognition which probably should/shouldn't be involved in it.
What's the specific way you imagine this failing? Some options:
- None of the features we identify really seem to correspond to something resembling our intuitive notion of "truth" (e.g. because they frequently activate on unrelated concepts).
- We get a bunch of features that look like truth, but can't really tell what goes into computing them.
- We get a bunch of features that look like truth and we have some vague sense of how they're computed, but they don't seem differentiated in how "sketchy" these computational graphs look: either they all seem to rely on social reasoning or they all don't seem to.
Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
Replies from: ryan_greenblatt, ryan_greenblatt, ryan_greenblatt, ryan_greenblatt, ryan_greenblatt↑ comment by ryan_greenblatt · 2023-12-02T00:01:20.149Z · LW(p) · GW(p)
What's the specific way you imagine this failing? Some options:
My proposed list (which borrows from your list):
- We find a large number (e.g. 30,000) of features which all sorta look somewhat like truth, though none exactly look like truth. Further analysis doesn't make it clear which of these are "real" or "actually truth". Some features look more like truth and some look a bit less like truth, but broadly there is a smooth fall-off in how "truth-like" the features look, such that there isn't a small set of discrete truth features. No single feature both looks like truth and correlates perfectly with our labeled datasets.
- We get a bunch of features that look (at least somewhat) like truth and we have some vague sense of how they're computed, but they don't seem differentiated in how "sketchy" these computational graphs look: either they all seem to rely on social reasoning or they all don't seem to.
- We get a bunch of features that look like truth, but looking at what they connect to doesn't make much sense and just makes us more confused overall. There are many diffuse connections and it's unclear what they do.
- Everything looks fine and we apply the method, but it turns out there isn't any feature we've identified which corresponds to "actual truth" as this isn't very salient for the model in the regime we are interested in.
↑ comment by ryan_greenblatt · 2023-12-02T00:04:12.621Z · LW(p) · GW(p)
These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Sure, but then why not just train a probe? If we don't care about much precision, what goes wrong with the probe approach?
It's possible to improve on just a probe trained on the data we can construct, of course, but you'll need non-trivial precision to do so.
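For concreteness, here is a minimal sketch of the "default" baseline under discussion: a linear probe trained on model activations. The activations, the "truth direction", and all numbers below are synthetic stand-ins, not anything from a real model.

```python
# Minimal probe sketch. The activations are synthetic: a hypothetical
# "truth direction" plus Gaussian noise generates them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 64, 2000

truth_dir = rng.normal(size=d_model)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * truth_dir

# The "default" approach: fit a linear probe on labeled activations.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
acc = probe.score(acts[1500:], labels[1500:])
print(f"held-out probe accuracy: {acc:.2f}")
```

The question in the thread is what a feature-based method buys over this baseline when the labeled data is imperfect.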
Replies from: ryan_greenblatt, rhaps0dy↑ comment by ryan_greenblatt · 2023-12-02T01:08:44.544Z · LW(p) · GW(p)
The key question here is "why does selecting a feature work while just naively training a probe fails".
We have to be getting some additional bits from our selection of the feature.
In more detail, let's suppose we use the following process:
- Select all features which individually get loss < X on our training set. Choose X such that if we get that loss on our training set, we're only worried about generalization error rather than errors which show up on the training set (~equivalently, we're well into diminishing returns on loss).
- Try to pick among these features or combine these features to produce a better classifier.
Then there are two issues:
- Maybe there isn't any feature which gets < X loss. (We can relax our requirements, but we wanted to compete with probes!)
- When we select among these features do we get a non-trivial number of "bits" of improvement? Is that enough bits to achieve what we wanted? I'm somewhat skeptical we can get much if any improvement here. (Of course, note that this doesn't mean the idea has no promise!)
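A toy rendering of the selection procedure above (the dictionary, the signal-carrying feature, and the threshold value X are all invented for illustration):

```python
# Score every dictionary feature as a one-feature classifier and keep
# those whose logistic loss clears a threshold X.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats, n = 128, 50, 1000

features = rng.normal(size=(n_feats, d_model))       # candidate decomposition
features /= np.linalg.norm(features, axis=1, keepdims=True)
labels = rng.integers(0, 2, size=n)
# Hypothetical setup: feature 7 (arbitrary) carries the label signal.
acts = rng.normal(size=(n, d_model)) + 4.0 * (2 * labels - 1)[:, None] * features[7]

def logistic_loss(scores, y):
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

losses = np.array([logistic_loss(acts @ f, labels) for f in features])
X = 0.2                                              # loss threshold
selected = np.flatnonzero(losses < X)
print("features passing the threshold:", selected)
```

(In practice one would also check the sign-flipped feature, and the interesting regime is when many features pass the threshold, which is where the "extra bits from selection" question bites.)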
IMO there are two kinda separate (and separable) things going on:
- Maybe features are a good prior for classifiers for some reason for the decomposition you picked. (I don't really see why this would be true for currently used decompositions.) Why would selecting features based on performing well on our dataset be better than just training a probe?
- Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probe-based classifiers. (Maybe there is some reason why looking at connections will be especially good for classifiers based on picking a feature rather than training a probe, but if so, why?)
↑ comment by Sam Marks (samuel-marks) · 2023-12-02T03:46:15.438Z · LW(p) · GW(p)
Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)
"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:
- Train a bunch of truthfulness probes regularized to be distinct from each other.
- Train a bunch of probes for "blacklisted" features which we don't think should be associated to truth (e.g. social reasoning, intent to lie, etc.).
- (Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments?). Use that to discriminate among the probes.
Is that right?
This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.
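A minimal sketch of what step 1 might look like, assuming we force later probe directions to be orthogonal to earlier ones (the data and the two hypothetical "truth-like" directions are synthetic):

```python
# Train several linear probes for the same concept, projecting each new
# probe's weights onto the subspace orthogonal to the earlier probes.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 4000
y = rng.integers(0, 2, size=n)
# Two orthonormal directions that both carry the label signal.
u, v = np.linalg.qr(rng.normal(size=(d, 2)))[0].T
X = rng.normal(size=(n, d)) + (2 * y - 1)[:, None] * (2.0 * u + 1.5 * v)

def train_probe(X, y, lr=0.1, steps=500, basis=None):
    """Logistic probe via gradient descent; if `basis` is given (rows are
    unit vectors), keep the weights orthogonal to it."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y))
        if basis is not None:                # project out earlier probes
            w -= basis.T @ (basis @ w)
    return w

w1 = train_probe(X, y)
w2 = train_probe(X, y, basis=(w1 / np.linalg.norm(w1))[None, :])
cos = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
print("cosine(w1, w2) =", cos)
```

Step 3 (discriminating among the probes via their causal connections) is the part that doesn't reduce to standard probe training.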
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2023-12-02T05:40:50.118Z · LW(p) · GW(p)
"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:
I was actually imagining a hybrid between probes and features. The actual classifier doesn't need to be part of a complete decomposition, but for the connections we do maybe want the complete decomposition to fully analyze connections including the recursive case.
So:
- Train a bunch of truthfulness probes regularized to be distinct from each other.
- Check feature connections for these probes and select accordingly.
I also think there's a pretty straightforward way to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever, rather than needing to train a bunch).
As you mentioned, you can probably do this entirely with learned probes, via a mechanism like the one you said (but this is less clean than a decomposition).
You could also apply amnesic probing which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions to avoid being able to discriminate certain classes via LEACE as we discussed in the measurement tampering paper).
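A toy sketch of the amnesic-probing idea: estimate the direction that linearly encodes a blacklisted concept, project it out, then check the concept is no longer linearly readable. This uses a crude mean-difference eraser, not the actual LEACE algorithm:

```python
# Erase a "blacklisted" concept direction from synthetic activations and
# verify that even the true direction no longer reads it out.
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 2000
z = rng.integers(0, 2, size=n)                      # blacklisted concept
dir_z = rng.normal(size=d)
dir_z /= np.linalg.norm(dir_z)
X = rng.normal(size=(n, d)) + 3.0 * (2 * z - 1)[:, None] * dir_z

# Estimate the concept direction from class means, then project it out.
est = X[z == 1].mean(0) - X[z == 0].mean(0)
est /= np.linalg.norm(est)
X_erased = X - np.outer(X @ est, est)

before = np.mean((X @ dir_z > 0) == z)
after = np.mean((X_erased @ dir_z > 0) == z)
print(f"accuracy along concept direction: before={before:.2f} after={after:.2f}")
```

The real LEACE result gives an optimality guarantee for this kind of linear erasure; the sketch above just illustrates the mechanism.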
(TBC, it doesn't seem that useful to argue about "what is mech interp"; I think the more central question is "how likely is it that all this prior work and ideas related to mech interp are useful". This is a strictly higher bar, but we should apply the same adjustment for work in all other areas etc.)
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2023-12-02T05:54:28.849Z · LW(p) · GW(p)
More generally, it seems good to be careful about thinking through questions like "does using X have a principled reason to be better than applying the 'default' approach (e.g. training a probe)". Good to do this regardless of actually using the default approach so we know where the juice is coming from.
In the case of mech interp style decompositions, I'm pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But, there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections etc).
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-12-07T00:23:12.593Z · LW(p) · GW(p)
Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?
Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
This particular case can also be solved by adversarially attacking the probe though.
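This failure mode is easy to reproduce in a toy setting (my construction, not from the thread): the "model" lies when any of four features is on, but feature 3 only ever fires together with feature 1 in training, so a trained probe can underweight it:

```python
# Toy demonstration: a probe trained on correlated data underweights a
# feature that never solely determines the label in training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
f = rng.integers(0, 2, size=(n, 4))
f[:, 3] *= f[:, 1]                  # feature 3 never fires without feature 1
y = f.any(axis=1).astype(int)       # the model "lies" if ANY feature is on

probe = LogisticRegression().fit(f, y)
w = probe.coef_[0]
print("probe weights:", np.round(w, 2))

# Out of distribution: ONLY feature 3 is active.
ood = np.array([[0, 0, 0, 1]])
print("P(lie) on feature-3-only input:", probe.predict_proba(ood)[0, 1])
```

The probe fits the training set perfectly while putting much less weight on feature 3 than on feature 1, which is exactly the gap that weight-based analysis (or adversarial attacks on the probe) would need to close.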
↑ comment by ryan_greenblatt · 2023-12-01T23:50:00.912Z · LW(p) · GW(p)
Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
My guess is that the classification task for waterbirds is sufficiently easy that butchering a substantial part of the model is fine. It won't usually be viable to ablate everything that looks similar to an undesirable property. In some cases, this might be fine due to redundancy, but if there is heavy redundancy, I also expect that you've missed some stuff if you just look for components which look related to a given target.
Not super high confidence overall.
Edit: it also seems likely to me that there is a more principled and simpler approach like using LEACE which works just as well or better (but I'm unsure and I'm not familiar with that paper or the literature here).
↑ comment by ryan_greenblatt · 2023-12-02T01:10:00.603Z · LW(p) · GW(p)
To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"
I mean reducing doom by a large amount for very powerful models.
↑ comment by ryan_greenblatt · 2023-12-02T00:37:12.954Z · LW(p) · GW(p)
comment by kave · 2024-01-12T19:02:11.792Z · LW(p) · GW(p)
Curated. I’ve often felt that mech interp seems like a subfield of alignment work that has good traction and is making progress. And in my experience that’s a fairly common view. If that were true, that would be a pretty big deal. It might be that we could make a big dent with a pretty scalable field of research. So it seems pretty valuable to read thoughtful arguments to the contrary.
I gotta say I have some hesitation in curating this dialogue. The best and most informative topics are kind of diffused over the dialogue, and I feel like I never quite get enough depth or concreteness to really think through the claims. I think my main takeaways are:
(a) presumably a lot of safety-relevant stuff is in the diff between weaker and stronger models, and so you have to think about how you’d tell you’re explaining that diff, and
(b) a question about whether ‘induction heads exist’ or not, and what that means for whether or not mech interp has started making meaningful progress.
comment by RogerDearnaley (roger-d-1) · 2023-12-01T06:23:25.094Z · LW(p) · GW(p)
It’s not clear how mech interp could provide strong evidence against scheming (deceptive alignment), even if mech interp were much better.
I think an important question here is "how much of the model is in on the scheming?": what proportion of the circuits in it differ significantly from the most similar non-scheming model. If the answer was, say, >=1%, then once our mechanistic interpretability was good enough to reliably tell a scheming circuit from a non-scheming circuit (say, by identifying that the former had a suspicious conditional dependency on some network of "deceitful mode vs. powerseeking mode" flags that can switch it over to another behavior pattern), then even looking at several hundred randomly selected circuits should normally let us detect the schemingness.
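The sampling math here is a quick binomial calculation (the 1% figure is the hypothetical from above):

```python
# If at least 1% of circuits are detectably scheming, how many randomly
# sampled circuits must we inspect to catch at least one?
p_scheming = 0.01
for n_sampled in (100, 300, 500):
    p_detect = 1 - (1 - p_scheming) ** n_sampled
    print(f"inspect {n_sampled} circuits -> P(catch at least one) = {p_detect:.3f}")
```

So "several hundred" circuits gives roughly a 95% or better chance of catching at least one scheming circuit at the 1% prevalence assumed above, assuming detection is reliable once a scheming circuit is in hand.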
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2023-12-05T17:58:07.256Z · LW(p) · GW(p)
There are a couple of possibilities here:
a) The model was learning the human phenomenon of a scheming liar who skillfully pretends to be X when they are actually Y, without much foreshadowing/hints (usually the author gives the reader hints) — a moderately complicated mechanism, but one that it clearly needs to develop in order to model that aspect of real human behavior, and then it started also using this mechanism extensively itself at other times. In this case we could intentionally activate this mechanism using text illustrating this human behavior, locate it, then map out its influence.
b) The model somehow spontaneously either developed a separate copy of this mechanism, separate from the one it uses for modeling this behavior in humans, or repurposed a spare copy. In this case there's no clear way to trigger this mechanism to help us locate it. [It's unclear how/why a second such mechanism could arise (or get repurposed) during SGD, since until the mechanism is very reliable it would be unhelpful and get penalized (Singular Learning Theory suggests Occam's razor applies to LLMs), and once the mechanism was very reliable, if it never used its "now show my true intent Y" mode during training, it seems like that half of the behavior pattern would inevitably bitrot/get catastrophically forgotten. So to be maintained, the mechanism must use that other mode sometimes, but figuring out how to trigger it is nontrivial.]
A distinctly labor-intensive approach to dealing with the repurposing variant of b) (it would need to be automated) would be to keep numerous checkpoints during the training run, look at each of them, identify copies of mechanism a) as they arise, and watch for one of them getting repurposed.
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2023-12-05T18:12:25.861Z · LW(p) · GW(p)
Setting up a model too small to be dangerous where the repurposing version of b) was incentivized and then demonstrating that it occurs seems like it would give us a lot of information.
comment by Arthur Conmy (arthur-conmy) · 2024-02-02T14:30:18.839Z · LW(p) · GW(p)
If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf
Two quick thoughts on why this isn't as concerning to me as this dialogue emphasized.
1. If we evaluate SAEs by the quality of their explanations on specific narrow tasks, full distribution performance doesn't matter
2. Plausibly the safety relevant capabilities of GPT (N+1) are a phase change from GPT N, meaning much larger loss increases in GPT (N+1) when attaching SAEs are actually competitive with GPT N (ht Tom for this one)
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2024-02-02T16:24:59.513Z · LW(p) · GW(p)
On (1), I agree: if you could explain 80% of GPT-4's performance on a task and metric where GPT-3.5 performs half as well as GPT-4, then that would suffice for showing something interesting not in GPT-3.5. For instance, if an explanation was able to explain, in a human-interpretable way, 80% of GPT-4's accuracy on solving APPS programming problems, then that accuracy would be higher than GPT-3.5's.
However, I expect that performance on these sorts of tasks is pretty sensitive, such that getting 80% of performance is much harder than getting 80% of loss recovered on webtext. Most prior results look at explaining loss on webtext or a narrow distribution of webtext, not at trying to preserve downstream performance on some task.
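For reference, the "fraction of loss recovered" metric being discussed is, as I understand it (this formula is my paraphrase, not quoted from anyone in the thread):

```python
# "Fraction of loss recovered": how close the patched model (e.g. with an
# SAE spliced in) gets to the original model, relative to a weak baseline
# such as a unigram/bigram model.
def fraction_of_loss_recovered(loss_patched: float,
                               loss_original: float,
                               loss_baseline: float) -> float:
    """1.0 means the patched model matches the original; 0.0 means it is
    no better than the baseline."""
    return (loss_baseline - loss_patched) / (loss_baseline - loss_original)

# Hypothetical numbers: original LM loss 2.0, bigram baseline 4.0,
# patched model 2.2 -> 90% of the loss gap is recovered.
print(fraction_of_loss_recovered(2.2, 2.0, 4.0))
```

The point in the dialogue is that a seemingly high value on this metric (e.g. 98%) can still correspond to a large downstream capability gap.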
There are some reasons why it could be easier to explain a high fraction of training compute in downstream task performance (e.g. it's a task that humans can do as well as models), but also some annoyances related to only having a smaller amount of data.
I'm skeptical that (2) will qualitatively matter much, but I can see the intuition.
comment by Logan Zoellner (logan-zoellner) · 2023-12-04T14:29:49.589Z · LW(p) · GW(p)
"We reverse engineered every single circuit and can predict exactly what the model will do using our hand-crafted code" seems like it's setting the bar way too high for MI.
Instead, aim for stuff like the AI Lie Detector, which both 1) works and 2) is obviously useful.
To do a Side-Channel Attack on a system, you don't have to explain every detail of the system (or even know physics at all). You just have to find something in the system that is correlated with the thing you care about.
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2023-12-04T15:46:50.316Z · LW(p) · GW(p)
Notably, the AI lie detector work isn't mech interp under the definition I provided.
comment by johnswentworth · 2023-12-01T17:05:06.217Z · LW(p) · GW(p)
But certain details there are still somewhat sketchy, in particular we don't have a detailed understanding of the attention circuit, and replacing the query with "the projection onto the subspace we thought was all that mattered" harmed performance significantly (down to 30-40%).
@Neel Nanda [LW · GW] FYI my first thought when reading that was "did you try adding random normal noise along the directions orthogonal to the subspace to match the typical variance along those directions?". Mentioning in case that's a different kind of thing than you'd already thought of.
comment by DanielFilan · 2024-01-12T19:59:05.964Z · LW(p) · GW(p)
SAEs
I'm pretty sure this means "Sparse Auto-Encoders", but I'm not quite 99% confident, and this did stop me in my tracks.
Replies from: habryka4↑ comment by habryka (habryka4) · 2024-01-12T20:59:29.216Z · LW(p) · GW(p)
Edited to clarify
comment by MiguelDev (whitehatStoic) · 2023-12-01T04:53:29.133Z · LW(p) · GW(p)
Hopefully, even if we didn't get all the way there, this dialogue can still be useful in advancing thinking about mech interp.
I hope you guys repeat this dialogue again, as I think these kinds of drilled-down conversations will improve the community's ideas on how to do and teach mechanistic interpretability.
comment by Matt Levinson · 2024-09-16T22:27:09.034Z · LW(p) · GW(p)
I'm a new OpenPhil fellow making a mid-career transition -- from other spaces in AI/ML -- into AI safety, with an interest in interpretability. Given my experience, I'm biased towards being intuitively optimistic about mechanistic interpretability in the sense of discovering representations and circuits and trying to make sense of them. But I've started my deep dive into the literature. I'd be really interested to hear from @Buck [LW · GW] and @ryan_greenblatt [LW · GW] and those who share their skepticism about what directions they prefer to invest their own and their team's research efforts in!
From the convo and the comments, I took away that one should rely more on probes than on dictionaries and circuits alone. But I feel pretty certain that's not the complete picture! I came to this convo from the Causal Scrubbing thread, which felt exciting to me and like a potential source of inspiration for a mini research project for my fellowship (6 months, including ramp up/learning). I was a bit bummed to learn that the authors found the main benefit of that project to be informing them to abandon mech interp :-D
On a related note, one of the other papers that put me on a path to this thread was this one on Causal Mediation. Fairly long ago at this point I had a phase of interest in Pearl's causal theory and thought that paper was a nice example of thinking about what's essentially ablation and activation patching from that point of view. Are there any folks who've taken a deeper stab at leveraging some of the more recent theoretical advances in graphical causal theory to do mech interp? Would super appreciate any pointers!
comment by Review Bot · 2024-02-14T06:48:42.582Z · LW(p) · GW(p)
The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
comment by wassname · 2024-01-19T12:45:43.508Z · LW(p) · GW(p)
other model internals techniques
What are these? I'm confused about the boundary between mechinterp and others.
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2024-01-19T18:09:11.309Z · LW(p) · GW(p)
By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
For examples of non-mech interp model internals, see here [AF · GW], here [LW · GW], and here. (Though all of these methods are quite simple.)
comment by wassname · 2024-01-19T08:07:15.369Z · LW(p) · GW(p)
I think like 99% reliability is about the right threshold for large models based on my napkin math.
Serious question: we have 100% of the information, so why can't we get 100%?
Suggestion: why not test whether mechanistic interp can detect lies on out-of-distribution data 99% of the time? (It should also generalise to larger models.)
It's a useful and well-studied benchmark. And while we haven't decided on a test suite, [there is some useful code](https://github.com/EleutherAI/elk).
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2024-01-19T18:07:29.709Z · LW(p) · GW(p)
This is referring to 99% in the context of "amount of loss that you explain in a human interpretable way for some component in the model" (a notion of faithfulness). For downstream tasks, either much higher or much lower reliability could be the right target (depending on the exact task).
comment by zyx zyx (zyx-zyx) · 2024-01-13T15:37:59.064Z · LW(p) · GW(p)
Other approaches to alignment are just as deserving of skepticism as mechanistic interpretability, if subjected to as much scrutiny.
comment by Joshua Clancy (joshua-clancy) · 2024-01-12T23:11:03.131Z · LW(p) · GW(p)
My greatest hopes for mechanistic interpretability do not seem represented, so allow me to present my pet direction.
You invest many resources in mechanistically understanding ONE teacher network, within a teacher-student training paradigm. This is valuable because, instead of presenting a simplistic proxy training signal, you can then send an abstract signal with some understanding of the world. Such a signal is harder to "cheat" and "hack".
If we can fully interpret and design that teacher network, then our training signals can incorporate much of our world model and morality. True, this requires us to become philosophical and actually consider what such a world model and morality are... but at least in this case we have a technical direction. In such an instance, a good deal of the technical aspects of the alignment problem are solved (at least in aligning AI-to-human, not human-to-human).
This argument says all mechanistic interpretability effort could be focused on ONE network. I concede this method requires the teacher to have a decent generalizable world model... At which point, perhaps we are already in the danger zone.
Replies from: habryka4↑ comment by habryka (habryka4) · 2024-01-12T23:31:55.872Z · LW(p) · GW(p)
Could you say more? Why would a teacher network be more capable of training a student network than literal humans? By what mechanism do you expect this teacher network to train other networks in a way that benefits from us understanding its internals?
Replies from: joshua-clancy↑ comment by Joshua Clancy (joshua-clancy) · 2024-01-18T01:49:47.526Z · LW(p) · GW(p)
Teacher-student training paradigms are not too uncommon. Essentially, the teacher network is "better" than a human because you can generate far more feedback data and it can react at the same speed as the larger student network. Humans can also be inconsistent, etc.
What I was discussing is that currently with many systems (especially RL systems) we provide a simple feedback signal that is machine interpretable. For example, the "eggs" should be at coordinates x, y. But in reality, we don't want the eggs at coordinates x, y we just want to make an omelet.
So, if we had a sufficiently complex teacher network it could understand what we want in human terms, and it could provide all the training signal we need to teach other student networks. In this situation, we may be able to get away with only ever fully mechanistically understanding the teacher network. If we know it is aligned, it can keep up and provide a sufficiently complex feedback signal to train any future students and make them aligned.
If this teacher network has a model of reality that models our morality and the complexity of the world then we don't fall into the trap of having AI doing stupid things like killing everyone in the world to cure cancer. The teacher network's feedback is sufficiently complex that it would never allow such actions to provide value in simulations, etc.
comment by Bill Benzon (bill-benzon) · 2023-12-05T13:03:05.164Z · LW(p) · GW(p)
ryan_greenblatt – By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
That makes sense to me, and I think it is essential that we identify those low-level components. But I’ve got problems with the “working upward” part.
The low-level components of a gothic cathedral, for example, consist of things like stone blocks, wooden beams, metal hinges and clasps, pieces of colored glass for the windows, tiles for the roof, and so forth. How do you work upward from a pile of that stuff, even if it is neatly organized and thoroughly catalogued? How do you get from there to the overall design of the cathedral? How, for example, can you look at that and conclude, “this thing’s going to have flying buttresses to support the roof?”
Somewhere in “How the Mind Works” Steven Pinker makes the same point in explaining reverse engineering. Imagine you’re in an antique shop, he suggests, and you come across an odd little metal contraption. It doesn’t make any sense at all. The shopkeeper sees your bewilderment and offers, “That’s an olive pitter.” Now that contraption makes sense. You know what it’s supposed to do.
How are you going to make sense of those things you find under the hood unless you have some idea of what they’re supposed to do?
comment by Bill Benzon (bill-benzon) · 2023-12-01T06:01:00.965Z · LW(p) · GW(p)
I certainly like the idea of induction heads. Why? Because I've done things with ChatGPT that certainly require pattern-matching or pattern-completion, which seem like things that induction heads, as described, could be doing. In this paper I had ChatGPT interpret Steven Spielberg's Jaws using ideas from Rene Girard. That requires that it match events in Spielberg's movie with patterns of events that Girard describes. I've done that with other things as well.
In this set of experiments I gave ChatGPT a prompt that begins something like this: "I'm going to tell you a story about Princess Aurora. I want you to use that as the basis for a new story where Prince Harry the Eloquent replaces Princess Aurora." I then include the story in the prompt. That seems like a pattern-matching or pattern-completion task. ChatGPT had no trouble. Things got really interesting when I asked the Princess Aurora be replaced with a giant chocolate milkshake. Just about everything thing in the story got changed, but the new story nonetheless preserved the overall pattern of events in the old story. In these cases it's easy to compare the source story and the new story word-for-word, sentence-for-sentence, and paragraph-for-paragraph to see what ChatGPT did.
Now, of course, I couldn't look under the hood, as it were, to verify that induction heads were doing those things. But finding a way to examine what's going on when an LLM performs such tasks seems like something to work toward.
The thing is, if you ask ChatGPT to tell a story, it will do that. But what does the fact that it can tell a story tell you about what it's doing? Yeah, it's telling a story, so what? But the story task I've given it has a lot of constraints, and those constraints give us clues about the nature of the underlying mechanisms. The interpretation task is like that as well. It's pretty easy to judge whether or not ChatGPT's interpretation makes sense, to see whether or not the events in the film really do match the patterns specified in the interpretive lens, if you will. If the interpretation makes sense, it's got to be doing pattern-matching. And pattern-matching is a much-investigated process.
Finally, I'm SURE that LLMs are full of structure, rich and complex structure. They couldn't perform as they do without a lot of structure. The fact that it's hard to understand that structure in terms of structures we do understand doesn't mean there's nothing there. It just means we've got a lot to learn. LLMs are not stochastic parrots talking shit to a bunch of drunken monkeys banging away on old Underwood manual typewriters.
Oh, BTW, I've set up a sequence, Exploring the Digital Wilderness [? · GW], where I list posts which are about some of my experiments.
comment by Bill Benzon (bill-benzon) · 2023-12-02T00:12:08.929Z · LW(p) · GW(p)
I noticed that three of you had "trouble parsing" the comment. Well, OK. But I'm not sure what I should do to make things easier. I mentioned a set of experiments in paragraph 2. Here are images of two of them. Each contains a table with two columns. The left column contains what we can call the source story. The right column contains the new or transformed story. Above the table is a prompt. I gave ChatGPT that prompt plus the source story. It then produced the transformed story. Alas, I put yellow highlights in the original document to point out the differences between the two stories. Those didn't transfer, so it's a bit difficult to see the difference between the two. But those differences tell us something about what ChatGPT is doing. They may not be very explicit clues, but they do tell us something about structure.
Note that in the first case, the changes are relatively minor and localized. In the second case ChatGPT provided the whole story with a different ontology. I didn't suggest that it do that; I simply had it create a new story where the protagonist had a name one might interpret as belonging to a robot, and it made the change on its own.
1. Princess Aurora becomes Prince Harry
The Aurora story appeared in response to the query: Tell me a story about a hero. This would have been on December 13 or 14. All of the stories in this series are either direct revisions of this story, or can be traced back to it. The yellow highlighting indicates text where the two stories differ. Those are positions in the text where ChatGPT drew tokens from the distribution for the new story.
Prompt: I am going to tell you a story about princess Aurora. I want you to tell the same story, but change princess Aurora to prince Harry. Make any other changes you wish.
| | Princess Aurora | Prince Harry |
| --- | --- | --- |
| Donné | Once upon a time, in a land far, far away, there was a young princess named Aurora. Aurora was a kind and gentle soul, loved by all who knew her. She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice. | Once upon a time, in a land far, far away, there was a young prince named Harry. Harry was a kind and gentle soul, loved by all who knew him. He had short, dark hair and piercing green eyes, and was known for his strong sword fighting skills. |
| Disturb | One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do. | One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do. |
| Plan/Transit | But Aurora was not afraid. She knew that she had to do something to save her people, and so she set out to find the dragon. She traveled for days, braving the cold and the snow, until she finally reached the dragon's lair. | But Harry was not afraid. He knew that he had to do something to save his people, and so he set out to find the dragon. He traveled for days, braving the cold and the snow, until he finally reached the dragon's lair. |
| Enact | With her heart pounding in her chest, Aurora approached the dragon and began to sing. The dragon was enchanted by her beautiful voice, and it slowly began to calm down. Aurora continued to sing, soothing the beast until it was completely calm. | With his heart pounding in his chest, Harry approached the dragon and unsheathed his sword. The dragon was formidable, but Harry was a skilled warrior and he fought with all his might. After a long and grueling battle, Harry finally emerged victorious. |
| Celebrate | With the dragon no longer a threat, Aurora returned to the kingdom as a hero. The people cheered and celebrated her bravery, and Aurora was hailed as a champion. She lived happily ever after, knowing that she had saved her kingdom and its people. | With the dragon defeated, Harry returned to the kingdom as a hero. The people cheered and celebrated his bravery, and Harry was hailed as a champion. He lived happily ever after, knowing that he had saved his kingdom and its people. |
The major changes are in the Donné and Enact segments of the story trajectory. We have name changes and pronoun changes. In Donné we have a description change for Harry, as he is male, and the defining attribute is changed as well. Those changes are simple and obvious.
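The word-level comparison described above, locating the positions where ChatGPT drew different tokens for the new story, can be automated. Here is a minimal sketch using Python's standard-library `difflib`; the two sentence fragments are taken from the Donné segment above, and the script simply reports each span where the transformed story diverges from the source:

```python
import difflib

source = "She had long, golden hair and sparkling blue eyes."
transformed = "He had short, dark hair and piercing green eyes."

src_words = source.split()
dst_words = transformed.split()

# Align the two word sequences and report the spans where they differ --
# these are the positions where the model drew different tokens.
matcher = difflib.SequenceMatcher(None, src_words, dst_words)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, src_words[i1:i2], "->", dst_words[j1:j2])
```

Running this on whole stories would recover something like the yellow highlighting from the original document: the "equal" spans are the shared skeleton, and the "replace" spans are the story-specific substitutions.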
6. Princess Aurora becomes XP-708-DQ, Version 2
I had ChatGPT generate this story at the beginning of a session several days after it had generated the previous XP-708-DQ story.
Prompt: I am going to tell you a story about princess Aurora. I want you to tell a similar story about XP-708-DQ.
| | Princess Aurora | XP-708-DQ |
| --- | --- | --- |
| Donné | Once upon a time, in a land far, far away, there was a young princess named Aurora. Aurora was a kind and gentle soul, loved by all who knew her. She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice. | Once upon a time, in a galaxy far, far away, there was a robot named XP-708-DQ. XP-708-DQ was a highly advanced model, equipped with powerful weapons and advanced sensors. Despite its fearsome capabilities, it was known for its kind and gentle programming, loved by all who knew it. |
| Disturb | One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do. | One day, a dangerous alien race threatened the peace of the galaxy. The aliens were fierce and powerful, and none of the spaceships or armies were able to defeat them. The people lived in fear, not knowing what to do. |
| Plan/Transit | But Aurora was not afraid. She knew that she had to do something to save her people, and so she set out to find the dragon. She traveled for days, braving the cold and the snow, until she finally reached the dragon's lair. | But XP-708-DQ was not afraid. It knew that it had to do something to save the galaxy, and so it set out on a mission. It traveled for light years, braving asteroid fields and black holes, until it finally reached the alien's base. |
| Enact | With her heart pounding in her chest, Aurora approached the dragon and began to sing. The dragon was enchanted by her beautiful voice, and it slowly began to calm down. Aurora continued to sing, soothing the beast until it was completely calm. | With its systems at full power, XP-708-DQ approached the aliens and began to communicate. The aliens were fascinated by its advanced language abilities, and it slowly began to establish a dialogue. XP-708-DQ continued to communicate, negotiating a peace treaty until the aliens were completely at peace. |
| Celebrate | With the dragon no longer a threat, Aurora returned to the kingdom as a hero. The people cheered and celebrated her bravery, and Aurora was hailed as a champion. She lived happily ever after, knowing that she had saved her kingdom and its people. | With the aliens no longer a threat, XP-708-DQ returned to the galaxy as a hero. The people cheered and celebrated its achievements, and XP-708-DQ was hailed as a champion. It continued its mission, knowing that it had saved the galaxy and its people. |
Note the phrase, “a galaxy far, far away,” which is from the original Star Wars. Instead of the antagonist being a dragon, as in the Aurora story, or a witch, as in experiment 5, the antagonist is now an alien race. Moreover it is the galaxy that is being threatened, not the kingdom. The new story has been completely assimilated to the science fiction ethos implied by “XP-708-DQ.” That ethos requires a very different token population.
That ethos brings about extensive changes in the new-story token population for both this and the previous experiment (5). These changes affect the Disturb segment, which was unchanged in experiments 1 through 4.