Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2 2023-05-25T15:37:54.593Z
Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1 2023-05-09T19:41:10.528Z
Residual stream norms grow exponentially over the forward pass 2023-05-07T00:46:02.658Z
A circuit for Python docstrings in a 4-layer attention-only transformer 2023-02-20T19:35:14.027Z
How-to Transformer Mechanistic Interpretability—in 50 lines of code or less! 2023-01-24T18:45:01.003Z
Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default? 2022-10-25T20:48:50.895Z
Research Questions from Stained Glass Windows 2022-06-08T12:38:44.848Z
CNN feature visualization in 50 lines of code 2022-05-26T11:02:45.146Z


Comment by StefanHex (Stefan42) on Really Strong Features Found in Residual Stream · 2023-07-09T10:27:23.272Z · LW · GW

Nice work! I'm especially impressed by the [word] and [word] example: This cannot be read off the embeddings, thus the model must actually be computing and storing this feature somewhere! I think this is exciting since the things we care about (deception etc.) are also definitely not included in the embeddings. I think you could make a similar case for Title Case and Beginning & End of First Sentence but those examples look less clear, e.g. the Title Case could be mostly stored in "embedding of uppercase word that is usually lowercase".

Comment by StefanHex (Stefan42) on Really Strong Features Found in Residual Stream · 2023-07-09T10:26:49.803Z · LW · GW

Thank you for making the early write-up! I'm not entirely certain I completely understand what you're doing, could I give you my understanding and ask you to fill the gaps / correct me if you have the time? No worries if not, I realize this is a quick & early write-up!


As previously you run Pythia on a bunch of data (is this the same data for all of your examples?) and save its activations.
Then you take the residual stream activations (from which layer?) and train an autoencoder (like Lee, Dan & beren here) with a single hidden layer (w/ ReLU), larger than the residual stream width (how large?), trained with an L1-regularization on the hidden activations. This L1-regularization penalizes multiple hidden activations activating at once and therefore encourages encoding single features as single neurons in the autoencoder.
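To make my reading of the setup concrete, here is a minimal sketch of the autoencoder loss as I understand it: a single ReLU hidden layer wider than the input, with squared reconstruction error plus an L1 penalty on the hidden activations. Everything in it (the function names, the toy 2-to-3 weights, the `l1_coeff` value) is my own assumption for illustration, not the setup from the post, and there is no training loop.

```python
# Toy sparse-autoencoder forward pass and loss in pure Python.
# All names, shapes and coefficients are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(x, W_enc, b_enc):
    """Hidden activations ("neurons"): ReLU(W_enc @ x + b_enc)."""
    return [max(0.0, p + b) for p, b in zip(matvec(W_enc, x), b_enc)]

def decode(h, W_dec, b_dec):
    """Reconstruction of the residual stream vector: W_dec @ h + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, h), b_dec)]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (total loss, reconstruction error, L1 penalty, hidden activations)."""
    h = encode(x, W_enc, b_enc)
    x_hat = decode(h, W_dec, b_dec)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l1 = sum(abs(a) for a in h)  # grows with every additional active neuron
    return recon + l1_coeff * l1, recon, l1, h

# "Residual stream" of width 2, hidden layer of width 3 (wider than the input).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

total, recon, l1, h = sae_loss([2.0, 0.0], W_enc, b_enc, W_dec, b_dec, l1_coeff=0.1)
print(h, recon, l1, total)
```

In this toy case only one hidden neuron is active and the input is reconstructed exactly, so the whole loss comes from the L1 term.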


You found a bunch of features corresponding to a word or combined words (= words with similar meaning?). This would be the embedding stored as a feature (makes sense).

But then you also find e.g. a "German Feature", a neuron in the autoencoder that mostly activates when the current token is clearly part of a German word. When you show Uniform examples you show randomly selected dataset examples? Or randomly selected samples where the autoencoder neuron is activated beyond some threshold?

When you show Logit lens you show how strong the embedding(?) or residual stream(?) at a token projects into the read-direction of that particular autoencoder neuron?

In Ablated Text you show how much the autoencoder neuron activation changes (change at what position?) when ablating the embedding(?) / residual stream(?) at a certain position (same or different from the one where you measure the autoencoder neuron activation?). Does ablating refer to setting some activations at that position to zero, or to running the model without that word?

Note on my use of the word neuron: To distinguish residual stream features from autoencoder activations, I use neuron to refer to the hidden activation of the autoencoder (subject to an activation function) while I use feature to refer to (a direction of) residual stream activations.

Comment by StefanHex (Stefan42) on Residual stream norms grow exponentially over the forward pass · 2023-06-25T12:29:38.483Z · LW · GW

Huh, thanks for this pointer! I had not read about NTK (Neural Tangent Kernel) before. What I understand you saying is something like: SGD mainly affects the weights of the last layer, and the propagation down to each earlier layer is weakened by a factor, creating the exponential behaviour? This seems somewhat plausible though I don't know enough about NTK to make a stronger statement.

I don't understand the simulation you run (I'm not familiar with that equation, is this a common thing to do?) but are you saying the y levels of the 5 lines (simulating 5 layers) at the last time-step (finished training) should be exponentially increasing, from violet to red, green, orange, and blue? It doesn't look exponential by eye? Or are you thinking of the value as a function of x (training time)?

I appreciate your comment, and looking for mundane explanations, though! This seems like the kind of thing where I would later say "Oh of course".

Comment by StefanHex (Stefan42) on A circuit for Python docstrings in a 4-layer attention-only transformer · 2023-05-25T11:58:11.137Z · LW · GW

Hi, and thanks for the comment!

Do you think there should be a preference for whether one patches clean --> corrupt or corrupt --> clean?

Both of these show slightly different things. Imagine an "AND circuit" where the result is only correct if two attention heads are clean. If you patch clean->corrupt (inserting a clean attention head activation into a corrupt prompt) you will not find this; but you do if you patch corrupt->clean. However the opposite applies for a kind of "OR circuit". I historically had more success with corrupt->clean so I teach this as the default, however Neel Nanda's tutorials usually start the other way around, and really you should check both. We basically ran all plots with both patching directions and later picked the ones that contained all the information. 
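A toy illustration of the AND/OR point, as a sketch rather than anything we actually ran: two "heads" are modelled as booleans (True = clean activation, False = corrupt activation), and the circuit output is a boolean function of them. The function names are made up for this example.

```python
# Toy model of activation patching on a two-head circuit.
# True = the head carries its clean activation, False = its corrupt activation.
# Function names are hypothetical; this is not code from the post.

def and_circuit(head_a, head_b):
    """Output is correct only if both heads are clean."""
    return head_a and head_b

def or_circuit(head_a, head_b):
    """Output is correct if at least one head is clean."""
    return head_a or head_b

def patch_effects(circuit):
    """For each patching direction, report whether patching a single head changes the output."""
    effects = {}
    for direction, base_is_clean in (("clean->corrupt", False), ("corrupt->clean", True)):
        # clean->corrupt: the base run is corrupt and we insert one clean head.
        # corrupt->clean: the base run is clean and we insert one corrupt head.
        baseline = circuit(base_is_clean, base_is_clean)
        effects[direction] = {
            "head_a": circuit(not base_is_clean, base_is_clean) != baseline,
            "head_b": circuit(base_is_clean, not base_is_clean) != baseline,
        }
    return effects

print("AND:", patch_effects(and_circuit))
print("OR: ", patch_effects(or_circuit))
```

For the AND circuit no single head shows an effect under clean->corrupt but both do under corrupt->clean; for the OR circuit it is the other way around, which is why checking both directions is worthwhile.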

did you find that the selection of [the corrupt words] mattered?

Yes! We tried to select equivalent words to not pick up on properties of the words, but in fact there was an example where we got confused by this: We at some point wanted to patch param and naively replaced it with arg, not realizing that param is treated specially! Here is a plot of head 0.2's attention pattern; it behaves differently for certain tokens. Another example is the self token: It is treated very differently to the variable name tokens.


So it definitely matters. If you want to focus on a specific behavior you probably want to pick equivalent tokens to avoid mixing in other effects into your analysis.

Comment by StefanHex (Stefan42) on Residual stream norms grow exponentially over the forward pass · 2023-05-10T11:30:43.396Z · LW · GW

Thanks for finding this!

There was one assumption in the StackExchange post I didn't immediately get, that the variance of  is . But I just realized the proof for that is rather short: Assuming  (the variance of ) is the identity then the left side is

and the right side is

so this works out. (The  symbols are sums here.)

Comment by StefanHex (Stefan42) on Residual stream norms grow exponentially over the forward pass · 2023-05-08T21:09:54.345Z · LW · GW

Thank you for the extensive comment! Your summary is really helpful to see how this came across. Here's my take on a couple of these points:

2.b: The network would be sneaking information about the size of the residual stream past LayerNorm. So the network wants to implement a sort of "grow by a factor X every layer" and wants to prevent LayerNorm from resetting its progress.

  1. There's the difference between (i) How does the model make the residual stream grow exponentially -- the answer is probably theory 1, that something in the weights grows exponentially. And there is (ii) our best guess on Why the model would ever want this, which is the information deletion thing.

How and why disconnected

Yep we give some evidence for How, but for Why we have only a guess.

still don't feel like I know why though

learn generic "amplification" functions

Yes, all we have is some intuition here. It seems plausible that the model needs to communicate stuff between some layers, but doesn't want this to take up space in the residual stream. So this exponential growth is a neat way to make old information decay away (relatively). And it seems plausible to implement a few amplification circuits for information that has to be preserved for much later in the network.
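A small worked example of the "old information decays away (relatively)" intuition, under strong simplifying assumptions of mine: the residual stream norm grows by a constant factor `growth` per layer, a component written at `write_layer` has a size comparable to the stream at that point and is not amplified afterwards, and LayerNorm rescales the stream to roughly constant norm before each read. The growth values below are arbitrary, not measured.

```python
# Relative size of an old, unamplified residual stream component after LayerNorm,
# assuming the overall stream norm grows by `growth` every layer.
# The model and the numbers are illustrative assumptions.

def relative_share(write_layer, read_layer, growth):
    """Factor by which the component's share of the normalized stream has shrunk at read_layer."""
    if read_layer < write_layer:
        raise ValueError("read_layer must not precede write_layer")
    return growth ** (write_layer - read_layer)

for gap in (1, 5, 10, 20):
    print(gap, relative_share(0, gap, growth=1.5))
```

Under these assumptions the share falls off geometrically with the number of layers in between, so anything that needs to survive for long would have to be re-amplified along the way.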

We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote up this post because both Alex and I independently noticed this and weren't aware of this previously, so we wanted to make a reference post.

Comment by StefanHex (Stefan42) on Residual stream norms grow exponentially over the forward pass · 2023-05-08T20:57:04.507Z · LW · GW

If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.

I understand the network's "intention" the other way around: I think that the network wants to have an exponentially growing residual stream. And in order to get an exponentially growing residual stream the model increases its weights exponentially.

And our speculation for why the model would want this is our "favored explanation" mentioned above.

Comment by StefanHex (Stefan42) on Residual stream norms grow exponentially over the forward pass · 2023-05-07T10:07:49.794Z · LW · GW

Thanks for the comment and linking that paper! I think this is about training dynamics though, norm growth as a function of checkpoint rather than layer index.

Generally I find basically no papers discussing the parameter or residual stream growth over layer number, all the similar-sounding papers seem to discuss parameter norms increasing as a function of epoch or checkpoint (training dynamics). I don't expect the scaling over epoch and layer number to be related?

Only this paper mentions layer number in this context, and the paper is about solving the vanishing gradient problem in Post-LN transformers. I don't think that problem applies to the Pre-LN architecture? (see the comment by Zach Furman for this discussion)

Comment by StefanHex (Stefan42) on Residual stream norms grow exponentially over the forward pass · 2023-05-07T09:41:05.801Z · LW · GW

Oh I hadn't thought of this, thanks for the comment! I don't think this applies to Pre-LN Transformers though?

  1. In Pre-LN transformers every layer's output is directly connected to the residual stream (and thus just one unembedding away from logits), wouldn't this remove the vanishing gradient problem? I just checked out the paper you linked, they claim exponentially vanishing gradients are a problem (only) in Post-LN, and that Pre-LN (and their new method) prevent the problem, right?

  2. The residual stream norm curves seem to follow the exponential growth quite precisely, do vanishing gradient problems cause such a clean result? I would have intuitively expected the final weights to look somewhat pathological if they were caused by such a problem in training.
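For readers less familiar with the two architectures, here is a minimal sketch of the structural difference. The `sublayer` is a stand-in I made up (a plain elementwise scaling instead of attention or an MLP), so this only illustrates where LayerNorm sits, not real training dynamics.

```python
import math

# Pre-LN vs Post-LN residual blocks on a toy vector.
# `sublayer` is a hypothetical placeholder for attention / MLP.

def layer_norm(x, eps=1e-5):
    """Normalize x to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x, scale):
    return [scale * v for v in x]

def pre_ln_block(x, scale):
    # LayerNorm only on the sublayer input; the residual stream itself is never normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x), scale))]

def post_ln_block(x, scale):
    # LayerNorm on the sum, so the stream is renormalized after every block.
    return layer_norm([a + b for a, b in zip(x, sublayer(x, scale))])

def stream_norms(block, x, n_layers, scale=1.0):
    """Apply `block` n_layers times and record the L2 norm of the stream after each layer."""
    norms = []
    for _ in range(n_layers):
        x = block(x, scale)
        norms.append(math.sqrt(sum(v * v for v in x)))
    return norms

x0 = [1.0, -1.0, 2.0, -2.0]
pre_norms = stream_norms(pre_ln_block, x0, n_layers=5)
post_norms = stream_norms(post_ln_block, x0, n_layers=5)
print("Pre-LN: ", pre_norms)
print("Post-LN:", post_norms)
```

In the Pre-LN block the identity path from each layer to the output is untouched and the stream norm is free to grow, while in the Post-LN block the norm is pinned near the square root of the width after every layer.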

Re prediction: Isn't the sign the other way around? Vanishing gradients imply growing norms, right? So vanishing gradients in Post-LN would cause gradients to grow exponentially towards later (closer to output) layers (they also plot something like this in Figure 3 in the linked paper). I agree with the prediction that Post-LN will probably have even stronger exponential norm growth, but I think that this has a different cause to what we find here.

Comment by StefanHex (Stefan42) on A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2 · 2023-03-08T21:20:31.660Z · LW · GW

Finally, we give a simple approach to verify that a particular token is unspeakable rather than just being hard-to-speak.

You're using an optimization procedure to find an embedding that produces an output, and if you cannot find one you say it is unspeakable. How confident are you that the optimization is strong enough? I.e. what are the odds that a god-mode optimizer in this high-dimensional space could actually find an embedding that produces the unspeakable token, it's just that linprog wasn't strong enough?

Just checking here, I can totally imagine that the optimizer is an unlikely point of failure. Nice work again!

Comment by StefanHex (Stefan42) on More findings on maximal data dimension · 2023-03-08T01:05:57.849Z · LW · GW

Thanks Marius for this great write-up!

However, I was surprised to find that the datapoints the network misclassified on the training data are evenly distributed across the D* spectrum. I would have expected them to all have low D* since the network didn’t learn them.

My first intuition here was that the misclassified data points were cases where the network just tried to use the learned features and got it wrong, rather than points the network didn't bother to learn? Like say a 2 that looks a lot like an 8, so to the network it looks like a middle-of-the-spectrum 8? Not sure if this is sensible.

The shape of D* changes very little between initialization and the final training run.

I think this is actually a big hint that a lot of the stuff we see in those plots might be not what we think it is / an illusion. Any shape present at initialization cannot tell us anything about the trained network. More on this later.

the distribution of errors is actually left-heavy which is exactly the opposite of what we would expect

Okay this would be much easier if you collapsed the x-axis of those line plots and made it a histogram (the x axis is just sorted index right?), then you could make the dots also into histograms.

we would think that especially weird examples are more likely to be misclassified, i.e. examples on the right-hand side of the spectrum

So are we sure that weird examples are on the right-hand side? If I take weird examples to just trigger a random set of features, would I expect this to have a high or low dimensionality? Given that the normal case is 1e-3 to 1e-2, what's the random chance value?

We train models from scratch to 1,2,3,8,18 and 40 iterations and plot D*, the location of all misclassified datapoints and a histogram over the misclassification rate per bin.

This seems to suggest the left-heavy distribution might actually be due to initialization too? The left-tail seems to decline a lot after a couple of training iterations.

I think one of the key checks for this metric will be ironing out which apparent effects are just initialization. Those nice line plots look suggestive, but if initialization produces the same image we can't be sure what we can learn.

One idea to get traction here would be: Run the same experiment with different seeds, do the same plot of max data dim by index, then take the two sorted lists of indices and scatter-plot them. If this looks somewhat linear there might be some real reason why some data points require more dimensions. If it just looks random that would be evidence against inherently difficult/complicated data points that the network memorizes / ignores every time.
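A sketch of how the seed comparison could be quantified instead of (or in addition to) eyeballing the scatter plot: rank the datapoints by max data dimension for each seed and compute the rank correlation. The function names and the toy `d_star_*` numbers are hypothetical placeholders, not results.

```python
import math

# Rank correlation (Spearman) between per-datapoint D* values from two training seeds.
# The example values are made up for illustration.

def ranks(values):
    """Rank of each datapoint (0 = smallest value); ties are broken by index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(values_a, values_b):
    """Pearson correlation of the ranks, i.e. how linear the rank-vs-rank scatter plot is."""
    return pearson(ranks(values_a), ranks(values_b))

d_star_seed_0 = [0.001, 0.004, 0.002, 0.010, 0.007]  # hypothetical D* per datapoint
d_star_seed_1 = [0.002, 0.005, 0.001, 0.009, 0.008]
print(spearman(d_star_seed_0, d_star_seed_1))
```

A value close to 1 would correspond to the "somewhat linear" scatter plot, a value near 0 to the "just looks random" case.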

Edit: Some evidence for this is actually that the 1s tend to be systematically at the right of the curve, so there seems to be some inherent effect to the data!

Comment by StefanHex (Stefan42) on The idea that ChatGPT is simply “predicting” the next word is, at best, misleading · 2023-02-21T02:40:48.073Z · LW · GW

I don't think I understand the problem correctly, but let me try to rephrase this. I believe the key part is the question of whether or not ChatGPT has a global plan. Let's say we run ChatGPT one output at a time, every time appending the output token to the current prompt and calculating the next output. This ignores some beam search shenanigans that may be useful in practice, but I don't think that's the core issue here.

There is no memory between calculating the first and second token. The first time you give ChatGPT the sequence "Once upon a" and it predicts "time" and you can shut down the machine, the next time you give it "Once upon a time" and it predicts the next word. So there isn't any global plan in a very strict sense.

However when you put "Once upon a time" into a transformer, it will actually reproduce the exact values from the "Once upon a" run, in addition to a new set of values for the next token. Internally, you have a column of residual stream for every word (with 400 or so rows aka layers each), and the first three columns are identical between the two runs. So you could say that ChatGPT reconstructs* a plan every time it's asked to output a next token. It comes up with a plan every single time you call it. And the first N columns of the plan are identical to the previous plan, and with every new word you add a column of plan. So in that sense there is a global plan to speak of, but this also fits within the framework of predicting the next token.
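Here is a toy sketch of that column picture. It is not a transformer: the "embedding" and the mixing rule are made up, and the only property it shares with the real thing is causality, i.e. each position only reads from itself and earlier positions.

```python
# Toy causal model: the columns for a prefix are reproduced exactly when the prompt is extended.
# The embedding and mixing rules are invented for illustration only.

def toy_forward(tokens, n_layers=3):
    """Return columns[position] = list of values, one per layer (embedding first)."""
    # "Embedding": a deterministic number derived from the token's characters.
    stream = [float(sum(ord(c) for c in tok) % 97) for tok in tokens]
    columns = [[v] for v in stream]
    for _ in range(n_layers):
        # Causal mixing: position i only sees positions 0..i of the previous layer.
        stream = [stream[i] + sum(stream[: i + 1]) / (i + 1) for i in range(len(stream))]
        for column, value in zip(columns, stream):
            column.append(value)
    return columns

short_run = toy_forward(["Once", "upon", "a"])
long_run = toy_forward(["Once", "upon", "a", "time"])

print(long_run[:3] == short_run)  # columns of the shared prefix match
print(len(long_run) - len(short_run))  # exactly one new column was added
```

Because nothing in a column depends on later tokens, recomputing from scratch and reusing stored columns give the same values for the prefix.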

"Hey ChatGPT predict the next word!" --> ChatGPT looks at the text, comes up with a plan, and predicts the next word accordingly. Then it forgets everything, but the next time you give it the same text + one more word, it comes up with the same plan + a little bit extra, and so on.

Regarding 'If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me.' I am not sure what you mean by this. I think it is important to keep in mind that it uses the same parameters for every "column", for every word. There is no such thing as ChatGPT not visiting every parameter.

And please correct me if I understood any of this wrongly!


*in practice people cache those intermediate computation results somewhere in their GPU memory to not have to recompute those internal values every time. But it's equivalent to recomputing them, and the latter has fewer complications to reason about.

Comment by StefanHex (Stefan42) on A circuit for Python docstrings in a 4-layer attention-only transformer · 2023-02-20T21:38:22.448Z · LW · GW

Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said I would expect the 3-layer model to learn it if you give it more width / more heads.

We also later checked networks with MLPs, and it turns out the 3-layer gelu models (same properties except for MLPs) can do the task just fine.

Comment by StefanHex (Stefan42) on How-to Transformer Mechanistic Interpretability—in 50 lines of code or less! · 2023-01-28T04:17:03.418Z · LW · GW

Awesome, updated!

Comment by StefanHex (Stefan42) on Language models seem to be much better than humans at next-token prediction · 2022-11-11T17:17:58.988Z · LW · GW

Your language model game(s) are really interesting -- I've had a couple ideas when "playing" (such as adding GPT2-small suggestions for the user to choose from, some tokenization improvements) -- are you happy to share the source / tools to build this website or is it not in a state you would be happy to share? Totally fine if not, just realized that I should ask before considering building something!

Edit for future readers: Managed to do this with Heroku & flask, then switched to Streamlit -- code here, mostly written by ChatGPT:

Comment by StefanHex (Stefan42) on Mysteries of mode collapse · 2022-11-11T17:02:04.639Z · LW · GW

I really appreciated all the observations here and enjoyed reading this post, thank you for writing all this up!

Edit: Found it here! Your setup looks quite useful, with all the extra information -- is it available publicly somewhere / would you be happy to share it, or is the tooling not in that state yet? (Totally fine, just thought I'd ask!)

Comment by StefanHex (Stefan42) on Why I don't believe in doom · 2022-06-08T17:06:49.643Z · LW · GW

Firstly, thank you for writing this post, trying to "poke holes" in the "AGI might doom us all" hypothesis. I like to see this!

How is the belief in doom harming this community?

Actually I see this point, "believing" in "doom" can often be harmful and is usually useless.

Yes, being aware of the (great) risk is helpful for cases like "someone at Google accidentally builds an AGI" (and then hopefully turns it off since they notice and are scared).

But believing we are doomed anyway is probably not helpful. I like to think along the lines of "condition on us winning", to paraphrase HPMOR¹. I.e. assume we survive AGI, what could have caused us to survive AGI and work on making those options reality / more likely.

every single plan [...] can go wrong

I think the crux is that the chance of AGI leading to doom is relatively high, where I would say 0.001% is relatively high whereas you would say that is low? I think it's a similar argument to, say, pandemic-preparedness where there is a small chance of a big bad event and even if the chance is very low, we still should invest substantial resources into reducing the risk.

So maybe we can agree on something like: Doom by AGI is a sufficiently high risk that we should spend, say, one-millionth of world GDP ($80m) on preventing it somehow (AI Safety research, policy etc).

All fractions mentioned above picked arbitrarily.

¹ HPMOR 111

Suppose, said that last remaining part, suppose we try to condition on the fact that we win this, or at least get out of this alive. If someone TOLD YOU AS A FACT that you had survived, or even won, somehow made everything turn out okay, what would you think had happened -

Comment by StefanHex (Stefan42) on CNN feature visualization in 50 lines of code · 2022-05-26T09:45:26.928Z · LW · GW

Image interpretability seems mostly so easy because humans are already really good

Thank you, this is a good point! I wonder how much of this is humans "doing the hard work" of interpreting the features. It raises the question of whether we will be able to interpret more advanced networks, especially if they evolve features that don't overlap with the way humans process inputs.

The language model idea sounds cool! I don't know language models well enough yet but I might come back to this once I get to work on transformers.

Comment by StefanHex (Stefan42) on Nate Soares on the Ultimate Newcomb's Problem · 2021-11-01T17:22:36.226Z · LW · GW

I think I found the problem: Omega is unable to predict your action in this scenario, i.e. the assumption "Omega is good at predicting your behaviour" is wrong / impossible / inconsistent.

Consider a day where Omicron (randomly) chose a prime number (Omega knows this). Now an EDT is on their way to the room with the boxes, and Omega has to put a prime or non-prime (composite) number into the box, predicting EDT's action.

If Omega makes X prime (i.e. coincides) then EDT two-boxes and therefore Omega has failed in predicting.

If Omega makes X non-prime (i.e. numbers don't coincide) then EDT one-boxes and therefore Omega has failed in predicting.

Edit: To clarify, EDT's policy is two-box if Omega and Omicron's numbers coincide, one-box if they don't.
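The case analysis above can be checked by brute force. This sketch uses the same simplification as my argument, namely that "the numbers coincide" is treated as "Omega's number has the same primality as Omicron's", and it takes Omega's rule from the problem statement (prime iff Omega predicts one-boxing). The function name is made up.

```python
# Enumerate Omega's options and keep those where Omega's prediction rule is satisfied.
# Simplification (an assumption): "coincide" is modelled as "same primality".

def consistent_omega_choices(omicron_prime):
    """Return the primality choices for Omega's number that yield a correct prediction."""
    consistent = []
    for omega_prime in (True, False):
        coincide = omega_prime == omicron_prime
        # EDT's policy as stated above: two-box if the numbers coincide, one-box otherwise.
        edt_one_boxes = not coincide
        # Omega's rule: the number is prime iff Omega predicted one-boxing.
        if omega_prime == edt_one_boxes:
            consistent.append(omega_prime)
    return consistent

print(consistent_omega_choices(omicron_prime=True))
```

On a day where Omicron picked a prime the list comes back empty: neither choice lets Omega predict correctly, which is the contradiction I mean.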

Comment by StefanHex (Stefan42) on Nate Soares on the Ultimate Newcomb's Problem · 2021-11-01T16:56:44.804Z · LW · GW

This scenario seems impossible, as in contradictory / not self-consistent. I cannot say exactly why it breaks, but at least the two statements here seem to be not consistent:

today they [Omicron] happen to have selected the number X


[Omega puts] a prime number in that box iff they predicted you will take only the big box

Both of these statements have implications for X and cannot both be always true. The number cannot both be random and be chosen by Omega/you, can it?

From another angle, the statement

FDT will always see a prime number

demonstrates that something fishy is going on. The "random" number X that Omicron has chosen -- and is in the box -- and seen by FDT -- is "always prime". Then it is not a random number?

Edit: See my reply below, the contradiction is that Omega cannot predict EDT's behaviour when Omicron chose a prime number. Omega's decision depends on EDT's decision, and EDT's decision depends on Omega's decision (via the "do the numbers coincide" link). On days where Omicron chooses a prime number this cyclic dependence leads to a contradiction / Omega cannot predict correctly.

Comment by StefanHex (Stefan42) on Selection Has A Quality Ceiling · 2021-06-03T10:39:27.083Z · LW · GW

Nice argument! My main caveats are

* Does training scale linearly? Does it take just twice as much time to get someone to 8 bits (about one in 250) as to 4 bits (top 6% in the world, roughly one in every school class)?

* Can we train everything? How much of e.g. math skills are genetic? I think there is research on this

* Skills are probably quite highly correlated, especially when it comes to skills you want in the same job. What about computer skills / programming and maths skills / science -- are they inherently correlated or is it just because the same people need both? [Edit: See point made by Gunnar_Zarncke above, better argument on this]

Comment by StefanHex (Stefan42) on Open & Welcome Thread - February 2020 · 2020-03-03T22:21:30.126Z · LW · GW

That is a very broad description - are you talking about locating Fast Radio Bursts? I would be very surprised if that was easily possible.

Background: Astronomy/Cosmology PhD student