Posts

A Selection of Randomly Selected SAE Features 2024-04-01T09:09:49.235Z
SAE-VIS: Announcement Post 2024-03-31T15:30:49.079Z
Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders 2024-03-25T21:17:58.421Z
Understanding SAE Features with the Logit Lens 2024-03-11T00:16:57.429Z
Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders 2024-02-27T02:43:22.446Z
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small 2024-02-02T06:54:53.392Z
Linear encoding of character-level information in GPT-J token embeddings 2023-11-10T22:19:14.654Z
Features and Adversaries in MemoryDT 2023-10-20T07:32:21.091Z
Joseph Bloom on choosing AI Alignment over bio, what many aspiring researchers get wrong, and more (interview) 2023-09-17T18:45:28.891Z
A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N) 2023-05-16T22:59:20.553Z
Decision Transformer Interpretability 2023-02-06T07:29:01.917Z

Comments

Comment by Joseph Bloom (Jbloom) on SAE-VIS: Announcement Post · 2024-03-31T18:39:24.214Z · LW · GW

I'm a little confused by this question. What are you proposing? 

Comment by Joseph Bloom (Jbloom) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-27T21:52:40.570Z · LW · GW

Lots of thoughts. This is somewhat stream of consciousness as I happen to be short on time this week, but feel free to follow up again in the future:

  • Anthropic tested their SAEs on a model with random weights here and found that the results look noticeably different in some respects from SAEs trained on real models: "The resulting features are here, and contain many single-token features (such as "span", "file", ".", and "nature") and some other features firing on seemingly arbitrary subsets of different broadly recognizable contexts (such as LaTeX or code)." I think further experiments like this which identify classes of features which are highly non-trivial, don't occur in SAEs trained on random models (or random models with a W_E / W_U from a real model) or which can be related to interpretable circuitry would help. 
  • I should note that, to the extent that SAEs could be capturing structure in the data, the model might want to capture structure in the data too, so it's not super clear what observation would distinguish SAEs capturing structure in the data which the model itself doesn't utilise. Working this out seems important. 
  • Furthermore, the embedding space of LLMs is highly structured already, and since we lack good metrics, it's hard to say how much "marginal" structure SAEs capture over existing methods. So quantifying what we mean by structure seems important too. 
  • The specific claim that SAEs learn features which are combinations of true underlying features is a reasonable one given the L1 penalty, but I think it's far from obvious how we should think about this in practice. 
  • I'm pretty excited about deliberate attempts to understand where SAEs might be misleading or not capturing information well (eg: here or here). It seems like there are lots of slightly lower-level technical questions that help us build up to this.  

So in summary: I'm a bit confused about what we mean here and think there are various technical threads to follow up on. Knowing which of them actually resolve this requires that we define our terms more thoroughly. 

Comment by Joseph Bloom (Jbloom) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-26T21:31:17.977Z · LW · GW

Thanks for asking:

  1. Currently we load SAEs into my codebase here. How hard this is will depend on how different your SAE architecture/forward pass is from what I currently support. We're planning to support users / do this ourselves for the first n users and once we can, we'll automate the process. So feel free to link us to huggingface or a public wandb artifact. 
  2.  We run the SAEs over random samples from the same dataset on which the model was trained (with activations drawn from forward passes of the same length). Callum's SAE vis codebase has a demo where you can see how this works. 
  3. Since we're doing this manually, the delay will depend on the complexity of handling the SAEs and things like whether they're trained on a new model (not GPT2 small) and how busy we are with other people's SAEs or other features. We'll try our best and keep you in the loop. Ballpark is 1-2 weeks, not months. Possibly days (especially if the SAEs are very similar to those we are hosting already). We expect this to be much faster in the future. 

We've made the form in part to help us estimate the time / effort required to support SAEs of different kinds (eg: if we get lots of people who all have SAEs for the same model or with the same methodological variation, we can jump on that). 
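
For concreteness, here's a minimal sketch of the dashboard-generation step described in point 2 above. The model, hook point, prompts and placeholder SAE weights are all illustrative assumptions, not the actual Neuronpedia pipeline.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_point = "blocks.6.hook_resid_pre"        # wherever the SAE was trained
context_len = 128

d_model, d_sae = model.cfg.d_model, 24576
W_enc = torch.randn(d_model, d_sae) * 0.01    # placeholder for real SAE encoder weights
b_enc = torch.zeros(d_sae)

prompts = [
    "The history of the Roman Empire begins with the founding of the city.",
    "In machine learning, overfitting occurs when a model memorises noise.",
]  # stand-in for random samples from the model's training distribution

all_feature_acts = []
for prompt in prompts:
    tokens = model.to_tokens(prompt)[:, :context_len]
    _, cache = model.run_with_cache(tokens, names_filter=hook_point)
    acts = cache[hook_point]                               # [1, pos, d_model]
    all_feature_acts.append(torch.relu(acts @ W_enc + b_enc))

# Dashboards are then built from the max-activating tokens for each feature.
```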

Comment by Joseph Bloom (Jbloom) on Neuroscience and Alignment · 2024-03-25T09:58:09.008Z · LW · GW

It helps a little but I feel like we're operating at too high a level of abstraction. 

Comment by Joseph Bloom (Jbloom) on Neuroscience and Alignment · 2024-03-24T08:49:04.197Z · LW · GW

with the mech interp people where they think we can identify values or other high-level concepts like deception simply by looking at the model's linear representations bottom-up, where I think that'll be a highly non-trivial problem.

 

I'm not sure anyone I know in mech interp is claiming this is a trivial problem. 

Comment by Joseph Bloom (Jbloom) on Neuroscience and Alignment · 2024-03-24T08:47:07.627Z · LW · GW

biological and artificial neural-networks are based upon the same fundamental principles

 

I'm confused by this statement. Do we know this? Do we have enough of an understanding of either to say this? Don't get me wrong, there's some level on which I totally buy this. However, I'm just highly uncertain about what is really being claimed here. 

Comment by Joseph Bloom (Jbloom) on How to train your own "Sleeper Agents" · 2024-03-09T05:38:46.767Z · LW · GW

Depending on model size I'm fairly confident we can train SAEs and see if they can find relevant features (feel free to dm me about this).

Comment by Joseph Bloom (Jbloom) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T16:53:17.700Z · LW · GW

Thanks for posting this! I've had a lot of conversations with people lately about OthelloGPT and I think it's been useful for creating consensus about what we expect sparse autoencoders to recover in language models. 

Maybe I missed it but:

  • What is the performance of the model when the SAE output is used in place of the activations?
  • What is the L0? You say 12% of features are active, so I assume that means 122 features are active. This seems plausibly like it could be too dense (though it's hard to say; I don't have strong intuitions here). It would be preferable to have a sweep where you have varying L0s but similar explained variance. The sparsity is important since that's where the interpretability is coming from. One thing worth plotting might be the feature activation density of your SAE features as compared to the feature activation density of the probes (on a feature density histogram); a rough sketch of these metrics follows this list. I predict you will have features that are too sparse to match your probe directions 1:1 (apologies if you address this and I missed it). 
  • In particular, can you point to predictions (maybe in the early game) where your model is effectively perfect and where it is also perfect with the SAE output in place of the activations at some layer? I think this is important to quantify as I don't think we have a good understanding of the relationship between explained variance of the SAE and model performance and so it's not clear what counts as a "good enough" SAE. 
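
For reference, a minimal sketch of the two metrics asked about above (L0 and a feature activation density histogram); the feature activations here are a random placeholder standing in for real SAE activations on held-out tokens.

```python
import torch
import matplotlib.pyplot as plt

# Placeholder SAE feature activations on held-out tokens: [n_tokens, d_sae].
feature_acts = torch.relu(torch.randn(10_000, 1024) - 2)

active = feature_acts > 0
l0_per_token = active.sum(dim=-1).float()      # number of features firing on each token
print("mean L0:", l0_per_token.mean().item())

feature_density = active.float().mean(dim=0)   # fraction of tokens each feature fires on
log_density = torch.log10(feature_density + 1e-10)

plt.hist(log_density.numpy(), bins=50)
plt.xlabel("log10(feature activation density)")
plt.ylabel("feature count")
plt.show()
```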

I think a number of people expected SAEs trained on OthelloGPT to recover directions which aligned with the mine/their probe directions, though my personal opinion was that besides "this square is a legal move", it isn't clear that we should expect features to act as classifiers over the board state in the same way that probes do. 

This reflects several intuitions:

  1. At a high level, you don't get to pick the ontology. SAEs are exciting because they are unsupervised and can show us results we didn't expect. On simple toy models, they do recover true features, and with those maybe we know the "true ontology" on some level. I think it's a stretch to extend the same reasoning to OthelloGPT just because information salient to us is linearly probe-able. 
  2. Just because information is linearly probeable doesn't mean it should be recovered by sparse autoencoders. To expect this, we'd have to have stronger priors over the underlying algorithm used by OthelloGPT. Sure, it must use representations which enable it to make predictions as well as it does, but there's likely a large space of concepts it could represent. For example, information could be represented by the model in a local or semi-local code or deep in superposition. Since the SAE is trying to detect representations in the model, our beliefs about the underlying algorithm should inform our expectations of what it should recover, and since we don't have a good description of the circuits in OthelloGPT, we should be more uncertain about what the SAE should find. 
  3. Separately, it's clear that sparse autoencoders should be biased toward local codes over semi-local / compositional codes due to the L1 sparsity penalty on activations. This means that even if we were sure that the model represented information in a particular way, it seems likely the SAE would create representations for variables like (A and B) and (A and B') in place of A even if the model represents A. However, the exciting thing about this intuition is that it makes a very testable prediction: combinations of SAE features should act as effective classifiers over the board state. I'd be very excited to see an attempt to train neuron-in-a-haystack style sparse probes over SAE features in OthelloGPT for this reason (a rough sketch follows this list).
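
A rough sketch of what that sparse-probing experiment could look like; the data here is random placeholder data rather than real OthelloGPT activations and board labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_tokens, d_sae = 20_000, 2048
X = np.maximum(np.random.randn(n_tokens, d_sae) - 2, 0)  # placeholder SAE feature activations
y = np.random.randint(0, 2, size=n_tokens)               # placeholder label, e.g. "square A4 is mine"

# L1-regularised probe: if a small set of SAE features combine into a good
# board-state classifier, that supports the (A and B) / (A and B') story above.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
probe.fit(X, y)

used = np.flatnonzero(probe.coef_[0])
print(f"probe accuracy: {probe.score(X, y):.3f}, features used: {len(used)}")
```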

Some other feedback:

  • Positive: I think this post was really well written and, while I haven't read it in full detail, I'm a huge fan of how much detail you provided and think this is great. 
  • Positive: I think this is a great candidate for study and I'm very interested in getting "gold-standard" results on SAEs for OthelloGPT. When Andy and I trained them, we found they could train in about 10 minutes making them a plausible candidate for regular / consistent methods benchmarking. Fast iteration is valuable. 
  • Negative: I found your bolded claims in the introduction jarring. In particular "This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model". I think this is overclaiming in the sense that OthelloGPT is not toy-enough, nor do we understand it well enough, to know that SAEs have failed here, so much as that they aren't recovering what you expect. Moreover, I think it would be best to hold off on proposing solutions here (in the sense that trying to map directly from your results to the viability of the technique encourages us to think about arguments for / against SAEs rather than asking: what do SAEs actually recover, how do neural networks actually work, and what's the relationship between the two).
  • Negative: I'm quite concerned that tying the encoder / decoder weights and not having a decoder output bias results in worse SAEs. I've found the decoder bias initialization to have a big effect on performance (sometimes), and so by extension whether or not it's there seems likely to matter. Would be interested to see you follow up on this. 

Oh, and maybe you saw this already, but an academic group put out this related work: https://arxiv.org/abs/2402.12201  I don't think they quantify the proportion of probe directions they recover, but they do indicate recovery of all types of features that have been previously probed for. Likely worth a read if you haven't seen it. 

Comment by Joseph Bloom (Jbloom) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T16:02:26.865Z · LW · GW

I think we got similar-ish results. @Andy Arditi  was going to comment here to share them shortly. 

Comment by Joseph Bloom (Jbloom) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T00:35:19.783Z · LW · GW

@LawrenceC  The Nanda MATS stream played around with this as a group project, with code here: https://github.com/andyrdt/mats_sae_training/tree/othellogpt 

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-29T15:51:52.537Z · LW · GW

@Evan Anders "For each feature, we find all of the problems where that feature is active, and we take the two measurements of “feature goodness" <- typo? 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-28T23:40:11.283Z · LW · GW

Added a link at the top.

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-28T16:58:42.911Z · LW · GW

My mental model is the encoder is working hard to find particular features and distinguish them from others (so it's doing a compressed sensing task) and that out of context it's off distribution and therefore doesn't distinguish noise properly. Positional features are likely a part of that but I'd be surprised if it was most of it. 

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-28T16:56:39.812Z · LW · GW

I've heard this idea floated a few times and am a little worried that "When a measure becomes a target, it ceases to be a good measure" will apply here. OTOH, you can directly check whether the MSE / variance explained diverges significantly, so at least you can track the resulting SAE's use for decomposition. I'd be pretty surprised if an SAE trained with this objective became vastly more performant, and you could check whether downstream activations of the reconstructed activations were off distribution. So overall, I'm pretty excited to see what you get!

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-28T16:48:13.709Z · LW · GW

problems

 

prompts*

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-27T17:11:45.022Z · LW · GW

This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization.

 

I kinda want to push back on this since OOD in behavior is not obviously OOD in the activations. Misgeneralization especially might be better thought of as an OOD environment and on-distribution activations? 

I think we should come back to this question when SAEs have tackled something like variable binding. Right now it's hard to say how SAEs are going to help us understand more abstract thinking, and therefore I think it's hard to say how problematic they're going to be for detecting things like a treacherous turn. I think this will depend on how representations factor. In the ideal world, they generalize with the model's ability to generalize (apologies for how high level / vague that idea is). 

Some experiments I'd be excited to look at:

  • If the SAE is trained on a subset of the training distribution, can we tell when it is being used to decompose activations on data points outside that subset?
  • How does that compare to an SAE trained on the whole training distribution from the model, but then looking at when the model is being pushed off distribution? 

I think I'm trying to get at - can we distinguish:

  • Anomalous activations. 
  • Anomalous data points. 
  • Anomalous mechanisms. 

Lots of great work to look forward to!

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-20T15:22:38.616Z · LW · GW

Why do you want to refill and shuffle tokens whenever 50% of the tokens are used?

 

Neel was advised by the authors that it was important to minimise batches having tokens from the same prompt. This approach leads to a buffer having activations from many different prompts fairly quickly. 

 

Is this just tokens in the training set or also the test set? In Neel's code I didn't see a train/test split, isn't that important?

I never do evaluations on tokens from prompts used in training; rather, I just sample new prompts from the buffer. Some libraries set aside a set of tokens to do evaluations on, which are re-used. I don't currently do anything like this but it might be reasonable. In general, I'm not worried about overfitting. 

Also, can you track the number of epochs of training when using this buffer method (it seems like that makes it more difficult)?

Epochs make sense in a data-limited regime, which we aren't in. OpenWebText has way more tokens than we ever train any sparse autoencoder on, so we're always at way less than 1 epoch. We never reuse the same activations when training, but may use more than one activation from the same prompt. 
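
For concreteness, here's a minimal sketch of the buffer logic described above (the class name, sizes and refill threshold are illustrative, not the exact code Neel or I use):

```python
import torch

class ActivationBuffer:
    def __init__(self, get_activations, buffer_size=2**18, batch_size=4096):
        # get_activations(n) should return [n, d_model] activations drawn from fresh prompts.
        self.get_activations = get_activations
        self.buffer_size = buffer_size
        self.batch_size = batch_size
        self.buffer = get_activations(buffer_size)
        self.ptr = 0

    def next_batch(self):
        if self.ptr > self.buffer_size // 2:
            # Once half the buffer is used, refill that half from new prompts and
            # reshuffle so each training batch mixes tokens from many prompts.
            self.buffer[:self.ptr] = self.get_activations(self.ptr)
            self.buffer = self.buffer[torch.randperm(self.buffer_size)]
            self.ptr = 0
        batch = self.buffer[self.ptr:self.ptr + self.batch_size]
        self.ptr += self.batch_size
        return batch
```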

Comment by Joseph Bloom (Jbloom) on Addressing Feature Suppression in SAEs · 2024-02-16T22:17:06.628Z · LW · GW

Awesome work! I'd be quite interested to know whether the benefits from this technique are equivalently significant with a larger SAE and also what the original perplexity was (when looking at the summary statistics table). I'll probably reimplement at some point. 

Also, kudos on the visualizations. Really love the color scales!

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-15T04:25:23.388Z · LW · GW

On wandb, the dashboards were randomly sampled, but we've since uploaded all features to Neuronpedia https://www.neuronpedia.org/gpt2-small/res-jb. The log sparsity is stored in the huggingface repo, so you can look for the most sparse features and check if their dashboards are empty or not (anecdotally most dashboards seem good, besides the dead neurons in the first 4 layers).

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-14T17:17:05.017Z · LW · GW

24,576 prompts of length 128 = 3,145,728 tokens.

With features that fire less frequently this won't be enough, but for these we seemed to find some activations (if not highly activating) for all features. 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-10T01:12:15.670Z · LW · GW

Makes sense. Will set off some runs with longer context sizes and track this in the future.

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-09T18:27:40.057Z · LW · GW

Ahhh I see. Sorry, I was way too hasty to jump at this as the explanation. Your code does use the tied decoder bias (and yeah, it was a little harder to read because of how your module is structured). It is strange how assuming that bug seemed to help on some of the SAEs, but I ran my evals over all your residual stream SAEs and it only worked for some, not others, and certainly didn't seem like a good explanation after I'd run it on more than one. 

I've been talking to Logan Riggs, who says he was able to load in my SAEs and saw fairly similar reconstruction performance to me, but that outside of the context length of 128 tokens, performance markedly decreases. He also mentioned your eval code uses very long prompts whereas mine limits to 128 tokens, so this may be the main cause of the difference. Logan mentioned you had discussed this with him so I'm guessing you've got more details on this than I have? I'll build some evals specifically to look at this in the future I think. 

Scientifically, I am fairly surprised about the token length effect and want to try training on activations from much longer context sizes now. I have noticed (anecdotally) that the number of active features sometimes increases over the prompt, so an SAE trained on activations from shorter prompts is plausibly going to have a much easier time balancing reconstruction and sparsity, which might explain the generally lower MSE / better reconstruction. Though we shouldn't really compare between models and with different levels of sparsity, as we're likely to be at different locations on the pareto frontier. 

One final note is that I'm excited to see whether performance on the first 128 tokens actually improves in SAEs trained on activations from > 128 token forward passes (since maybe the SAE becomes better in general). 
 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T20:21:19.849Z · LW · GW
  • MSE Losses were in the WandB report (screenshot below).
  • I've loaded in your weights for one SAE and I get very bad performance (high L0, high L1, and bad MSE Loss) at first. 
  • It turns out that this is because my forward pass uses a tied decoder bias which is subtracted from the initial activations and added as part of the decoder forward pass. AFAICT, you don't do this. 
  • To verify this, I added the decoder bias to the activations of your SAE prior to running a forward pass with my code (to effectively remove the decoder bias subtraction from my method) and got reasonable results. 
  • I've screenshotted the Towards Monosemanticity results which describes the tied decoder bias below as well. 

I'd be pretty interested in knowing if my SAEs seem good now based on your evals :) Hopefully this was the only issue. 
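
To make the forward-pass difference concrete, here's a minimal sketch of the convention described above (my convention as I understand it, not necessarily anyone else's exact implementation): the decoder bias is subtracted from the input before encoding and added back when decoding, as in Towards Monosemanticity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # the tied decoder bias

    def forward(self, x):
        x_centred = x - self.b_dec                   # subtract the decoder bias from the input...
        acts = F.relu(x_centred @ self.W_enc + self.b_enc)
        return acts @ self.W_dec + self.b_dec        # ...and add it back in the decoder
```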

 


Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T16:26:41.945Z · LW · GW

Agreed, thanks so much! Super excited about what can be done here!

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T20:34:44.548Z · LW · GW

I've run some of the SAEs through more thorough eval code this morning (getting variance explained with the centring and calculating mean CE losses with more batches). As far as I can tell, the CE loss is not that high at all and the MSE loss is quite low. I'm wondering whether you might be using the wrong hooks? These are resid_pre, so layer 0 is just the embeddings, layer 1 is after the first transformer block, and so on. One other possibility is that you are using a different dataset? I trained these SAEs on OpenWebText. I don't use much padding at all; that might be a big difference too. I'm curious to get to the bottom of this. 

One sanity check I've done is just sampling from the model when using the SAE to reconstruct activations and it seems to be about as good, which I think rules out CE loss in the ranges you quote above. 

For percent alive neurons, a batch size of 8192 would be far too small to estimate dead neurons (since many neurons have a feature sparsity < 10**-3).

You're absolutely right about missing the centring in percent variance explained. I've estimated variance explained again for the same layers and get very similar results to what I had originally. I'll make some updates to my code to produce CE score metrics that have less variance in the future, at the cost of slightly more train time. 

If we don't find a simple answer I'm happy to run some more experiments but I'd guess an 80% probability that there's a simple bug which would explain the difference in what you get. Rank order of most likely: Using the wrong activations, using datapoints with lots of padding, using a different dataset (I tried the pile and it wasn't that bad either). 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T16:11:54.391Z · LW · GW

Oh no. I'll look into this and get back to you shortly. One obvious candidate is that I was reporting CE for some batch at the end of training that was very small and so the statistics likely had high variance and the last datapoint may have been fairly low. In retrospect I should have explicitly recalculated this again post training. However, I'll take a deeper dive now to see what's up. 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T17:00:23.452Z · LW · GW

I'd be excited about reading about / or doing these kinds of experiments. My weak prediction is that low activating features are important in specific examples where nuance matters and that what we want is something like an "adversarially robust SAE" which might only be feasible with current SAE methods on a very narrow distribution. 

A mini experiment I did which motivates this: with an SAE at the residual stream, I looked at the attention pattern of an attention head immediately following the SAE as a function of k, where we take the top-k SAE features in the reconstruction. I found that if the head was attending to "Mary" in the original forward pass (and not "John"), then a k of 3 was good enough to have it attend to Mary and not John. But if I replaced John with Martha, the minimum k such that the head attended to Mary increased. 
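
A rough sketch of that experiment, assuming TransformerLens-style hooks; the layer, head index and the random placeholder SAE weights are illustrative, not the actual setup:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
layer, head = 8, 6                                   # illustrative choice of head
hook_point = f"blocks.{layer}.hook_resid_pre"

d_model, d_sae = model.cfg.d_model, 24576
W_enc = torch.randn(d_model, d_sae) * 0.01           # placeholder SAE weights
W_dec = torch.randn(d_sae, d_model) * 0.01
b_enc, b_dec = torch.zeros(d_sae), torch.zeros(d_model)

def topk_reconstruction(acts, k):
    # Encode, keep only the top-k features per position, then decode.
    feats = torch.relu((acts - b_dec) @ W_enc + b_enc)
    vals, idx = feats.topk(k, dim=-1)
    sparse = torch.zeros_like(feats).scatter_(-1, idx, vals)
    return sparse @ W_dec + b_dec

for k in [1, 3, 10, 30]:
    def patch(resid, hook, k=k):
        return topk_reconstruction(resid, k)
    with model.hooks(fwd_hooks=[(hook_point, patch)]):
        _, cache = model.run_with_cache(tokens)
    # Attention from the final token: which source position (Mary vs John) wins?
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head, -1]
    print(k, pattern.argmax().item())
```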

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T16:37:22.555Z · LW · GW

Unless my memory is screwing up the scale here, 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B.

Thanks for raising this! I had wanted to find a comparison in terms of different model performances to help me quantify this so I'm glad to have this as a reference.

And do I see it right that this is the CE increase maximum for adding in one SAE, rather than all of them at the same time? So unless there is some very kind correlation in these errors where every SAE is failing to reconstruct roughly the same variance, and that variance at early layers is not used to compute the variance SAEs at later layers are capturing, the errors would add up? Possibly even worse than linearly? What CE loss do you get then?

Have you tried talking to the patched models a bit and compared to what the original model sounds like? Any discernible systematic differences in where that CE increase is changing the answers?

While I have explored model performance with SAEs at different layers, I haven't done so with more than one SAE or explored sampling from the model with the SAE. I've been curious about systematic errors induced by the SAE but a few brief experiments with earlier SAEs/smaller models didn't reveal any obvious patterns. I have once or twice looked at the divergence in the activations after an SAE has been added and found that errors in earlier layers propagated.

One thought I have on this is that if we take the analogy to DNA sequencing seriously, relatively minor errors in DNA sequencing make the resulting sequences useless. If you get one or two base pairs wrong and then try to make bacteria express the printed gene (based on your sequencing), you'll kill that bacteria. This gives me the intuition that we could have fairly accurate measurements with some error, and yet the resulting downstream error could be large. 

To bring it back to what I suspect is the main point here:  We should amend the statement to say "Our reconstruction scores were pretty good as compared to our previous results". 

It bothers me quite a bit that SAEs don't recover performance better, but I think this is a fairly well defined problem, and one that the community can iterate on both via improvements to SAEs and by looking for nearby alternatives. For example, I'm quite excited to experiment with any alternative architectures/training procedures that come out of the theory of computation in superposition line of work.

One productive direction inspired by thinking of this as sequencing is that we should have lots of SAEs trained on the same model and show that they get very similar results (to give us more confidence we have a better estimate of the true underlying features). It's standard in DNA/RNA/protein sequencing to run methods many times over. I think once we see evidence that we get good results along those lines, we should be more interested in / raise our standards for model performance with SAE-reconstructed activations. 
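
One simple way to operationalise that "replicate" comparison, sketched here with random placeholder decoder weights: for each feature in one SAE, compute its maximum cosine similarity with the decoder directions of another SAE trained on the same activations.

```python
import torch
import torch.nn.functional as F

# Placeholders for two independently trained decoders of shape [d_sae, d_model].
W_dec_a = F.normalize(torch.randn(4096, 768), dim=-1)
W_dec_b = F.normalize(torch.randn(4096, 768), dim=-1)

cos_sims = W_dec_a @ W_dec_b.T                 # [d_sae_a, d_sae_b]
max_cos_sim = cos_sims.max(dim=-1).values      # best match in B for each feature in A

print("fraction of A's features with a >0.9 match in B:",
      (max_cos_sim > 0.9).float().mean().item())
```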

Comment by Joseph Bloom (Jbloom) on "Does your paradigm beget new, good, paradigms?" · 2024-01-28T16:41:40.613Z · LW · GW

Thanks for writing this! This is an idea that I think is pretty valuable and one that comes up fairly frequently when discussing different AI safety research agendas.

I think that there's a possibly useful analogue of this which is useful from the perspective of being deep inside a cluster of AI safety research and wondering whether it's good. Specifically, I think we should ask "does the value of my current line of research hinge on us basically being right about a bunch of things or does much of the research value come from discovering all the places we are wrong?".

One reason this feels like an important variant to me is that when I speak to people skeptical about the area of research I've been working in, they often seem surprised that I'm very much in agreement with them about a number of issues. Still, I disagree with them that the solution is to shift focus, so much as to try to work out how the current paradigm might need to shift into another.

Comment by Joseph Bloom (Jbloom) on My best guess at the important tricks for training 1L SAEs · 2023-12-21T10:55:16.415Z · LW · GW

Thanks. I've found this incredibly useful. This is something that I feel has been long overdue with SAEs! I think advice + (detailed) results + code is something like 10x more useful than the way these insights tend to be reported!

Comment by Joseph Bloom (Jbloom) on Finding Sparse Linear Connections between Features in LLMs · 2023-12-09T12:32:12.363Z · LW · GW

Interesting! This is very cool work but I'd like to understand your metrics better. 
- "So we take the difference in loss for features (ie for a feature, we take linear loss - MLP loss)". What do you mean here? Is this the difference between the mean MSE loss when the feature is on vs not on?  
- Can you please report the L0s for each of the autoencoders and the linear model, as well as the next token prediction loss when using the autoencoder/linear model? These are important metrics on which my general excitement hinges. (eg: if those are both great, I'm way more interested in results about specific features). 
- I'd be very interested if you can take a specific input, look at the features present, and compare them between the autoencoder and the linear model. This would be especially cool if you pick an example where ablating out the MLP causes an incorrect prediction, so we know it's representing something important.
- Are you using a holdout dataset of eval tokens when measuring losses? Or how many tokens are you using to measure losses? 
- Have you plotted per token MSE loss vs l0 for each model? Do they look similar? Are there any outliers in that relationship? 

Comment by Joseph Bloom (Jbloom) on Testbed evals: evaluating AI safety even when it can’t be directly measured · 2023-11-16T10:14:42.796Z · LW · GW

Tesbeds

missing "t"

Comment by Joseph Bloom (Jbloom) on Linear encoding of character-level information in GPT-J token embeddings · 2023-11-11T04:35:02.179Z · LW · GW

Fixed, thanks!

Comment by Joseph Bloom (Jbloom) on [Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small · 2023-10-26T19:32:50.328Z · LW · GW

Cool paper. I think the semantic similarity result is particularly interesting.

As I understand it, you've got a circuit that wants to calculate something like Sim(A,B), where A and B might have many "senses" (aka features), but the Sim might not be a linear function of each of these Sims across all senses/features. 

So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related, but probably none that really matter for copy suppression. For this reason, I wouldn't expect the tokens of each to have cosine similarity that is predictive of the copy suppression score. This would only happen for really "mono-semantic tokens" that have only one sense (maybe you could test that). 

Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively). Eg: very common words or punctuation (the/and/etc). 

I'd be interested if you could use something like SAEs to decompose the tokens into the underlying features present at different intensities in each of these tokens (or the activations prior to the key/query projections). Follow up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads indicating semantic senses / features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination) or via a more complicated function of sets of features being used to inform suppression. 
- Similarly, show that the OV circuit is suppressing the same features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting as it would correspond to "I calculate A and B as similar by their similarity on the *California axis*, but I suppress predictions of any token that has the feature for *anywhere on the West Coast*".

I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations. 

Comment by Joseph Bloom (Jbloom) on Trying to understand John Wentworth's research agenda · 2023-10-25T21:02:08.885Z · LW · GW

I'd be very interested in seeing these products and hearing about the use-cases / applications. Specifically, my prior experience at a startup leads me to believe that building products while doing science can be quite difficult (although there are ways that the two can synergise). 

I'd be more optimistic about someone claiming they'll do this well if there is an individual involved in the project who is both deeply familiar with the science and has built products before (as opposed to two people each counting on the other to have sufficient expertise they lack). 

A separate question I have is about how building products might be consistent with being careful about what information you make public. If there are things that you don't want to be public knowledge, will there be proprietary knowledge not shared with users/clients? It seems like a non-trivial problem to maximize trust/interest/buy-in whilst minimizing clues to underlying valuable insights. 

Comment by Joseph Bloom (Jbloom) on Features and Adversaries in MemoryDT · 2023-10-22T20:18:39.056Z · LW · GW

Thanks Jay! (much better answer!) 

Comment by Joseph Bloom (Jbloom) on Features and Adversaries in MemoryDT · 2023-10-22T20:17:51.705Z · LW · GW

The first frame, apologies. This is a detail of how we number trajectories that I've tried to avoid dealing with in this post. We left-pad within a context window of 10 timesteps, so the first observation frame is S5. I've updated the text not to refer to S5. 

Comment by Joseph Bloom (Jbloom) on Features and Adversaries in MemoryDT · 2023-10-22T20:12:33.834Z · LW · GW

I'm not sure if it's interesting to me for alignment, since it's such a toy model.


Cruxes here are things like whether you think toy models are governed by the same rules as larger models, whether studying them helps you understand those general principles, and whether understanding those principles is valuable. This model in particular shares many similarities in architecture and training with LLMs and is over a million parameters, so it's not nearly as much of a toy model as others, and we have particular reasons to expect insights to transfer (both are transformers / next token predictors). 

What do you think would change when trying to do similar interpretability on less-toy models?

The recipe stays mostly the same but scale increases and you know less about the training distribution.

  • Features: Feature detection in LLMs via sparse autoencoders seems highly tractable. There may be more features and you might have less of a sense for the overall training distribution. Once you collapse latent space into features, this will go a long way toward dealing with the curse of dimensionality in these systems. 
  • Training Data: We know much less about the training distribution of larger models (ie: what are the ground truth features, how do they correlate or anti-correlate). 
  • Circuits: This investigation treats circuits like a black box, but larger models will likely solve more complex tasks with more complicated circuitry. The cool thing about knowing the features is that you can get to fairly deep insights even without understanding the circuits (like showing which observations are effectively equivalent to the model). 

What would change about finding adversarial examples?

This is a very complicated/broad question. There's a number of ways you could approach this. I'd probably look at identifying critical features in the language model and see whether we can develop automatic techniques for flipping them. This could be done recursively if you are able to find the features most important for those features (etc.). Understanding why existing adversaries like jail-breaking techniques / initial affirmative responses work (mechanistically) might tell us a lot about how to automate a more general search for adversaries. However, my guess is that the task of finding adversaries using white-box approaches may be fairly tractable. The search space is much smaller once you know features, and there are many search strategies that might work to flip features (possibly working recursively through features in each layer and guided by some regularization designed to keep adversaries naturalistic/plausible). 

Directly intervening on features seems like it might stay the same though.

This doesn't seem super obvious if features aren't orthogonal, or may exist in subspaces or manifolds rather than individual directions. The fact that this transition isn't trivial is one reason it would be better to understand some simple models very well (so that when we go to larger models, we're on surer scientific footing). 

Comment by Joseph Bloom (Jbloom) on Don't Dismiss Simple Alignment Approaches · 2023-10-07T07:59:25.516Z · LW · GW

My vibe from this post is something like "we're making progress on stuff that could be helpful so there's stuff to work on!" and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you're likely not touching on important cruxes (eg: do these approaches really scale? Are some agendas capabilities enhancing? Will these solve deceptive alignment or just corrigible alignment?)

I also think that if the goal is to actually make progress and not to maximize the number of people making progress or who feel like they're making progress, then engaging with those cruxes is important before people invest substantive energy (ie: beyond upskilling). However as a directional update for people who are otherwise pretty cynical, this seems like a good update.

Comment by Joseph Bloom (Jbloom) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T10:01:55.568Z · LW · GW

Strong disagree. Can't say I've worked through the entire article in detail but wanted to chime in as one of the many junior researchers investing energy in interpretability. I note that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research and with Rohin's idea that we're creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approaches. I can't speak for Neel, but I suspect the original list was more about getting something out there than making many nuanced arguments, so I think it's important to steelman those kinds of claims / expand on them before responding. 

A few extra notes: 

The first point I want to address is your endorsement of "retargeting the search" and finding the "motivational API" within AI systems, which is my strongest motivator for working in interpretability.

This is interesting because this would be a way to not need to fully reverse engineer a complete model. The technique used in Understanding and controlling a maze-solving policy network seems promising to me. Just focusing on “the motivational API” could be sufficient.

I predict that methods like “steering vectors” are more likely to work in worlds where we make much more progress in understanding of neural networks. But steering vectors are relatively recent, so it seems reasonable to think that we might have other ideas soon that could be equally useful but may require progress more generally in the field.

We need only look to biology and medicine to see examples of imperfectly understood systems, which remain mysterious in many ways, and yet science has led us to impressive feats that might have been unimaginable years prior. For example, the ability in recent years to retarget the immune system to fight cancer. Because hindsight devalues science we take such technologies for granted and I think this leads to a general over-skepticism about fields like interpretability.

The second major point I wanted to address was this argument:

Determining the dangerousness of a feature is a mis-specified problem. Searching for dangerous features in the weights/structures of the network is pointless. A feature is not inherently good or bad. The danger of individual atoms is not a strong predictor of the danger of assembly of atoms and molecules. For instance, if you visualize the feature of layer 53, channel 127, and it appears to resemble a gun, does it mean that your system is dangerous? Or is your system simply capable of identifying a dangerous gun? The fact that cognition can be externalized also contributes to this point.

I agree that it makes little sense to think of a feature on its own as dangerous, but it sounds to me like you are making a point about emergence. If understanding transistors doesn't lead to understanding computer software, then why work so hard to understand transistors?

I am pretty partial to the argument that the kinds of alignment relevant phenomena in neural networks will not be accessible via the same theories that we’re developing today in mechanistic interpretability. Maybe these phenomena will exist in something analogous to a “nervous system” while we’re still understanding “biochemistry”. Unlike transistors and computers though, biochemistry is hugely relevant to understanding neuroscience.

Comment by Joseph Bloom (Jbloom) on Ten Levels of AI Alignment Difficulty · 2023-07-04T03:38:46.958Z · LW · GW

Thanks for writing this up. I really liked this framing when I first read about it but reading this post has helped me reflect more deeply on it. 

I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.

I wouldn't call it correct or incorrect only useful in some ways and not others. Whether it's net positive may rely on whether it is used by people in cases where it is appropriate/useful. 

As an educational resource/communication tool, I think this framing is useful. It's often useful to collapse complex topics into few axes and construct idealised patterns, in this case a difficulty-distribution on which we place techniques by the kinds of scenarios where they provide marginal safety. This could be useful for helping people initially orient to existing ideas in the field or in governance or possibly when making funding decisions. 

However, I feel like as a tool to reduce fundamental confusion about AI systems, it's not very useful.  The issue is that many of the current ideas we have in AI alignment are based significantly on pre-formal conjecture that is not grounded in observations of real world systems (see the Alignment Problem from a Deep Learning Perspective).  Before we observe more advanced future systems, we should be highly uncertain about existing ideas. Moreover, it seems like this scale attempts to describe reality via the set of solutions which produce some outcome in it? This seems like an abstraction that is unlikely to be useful.

In other words, I think it's possible that this framing leads to confusion between the map and the territory, where the map is making predictions about tools that are useful in territory which we have yet to observe.

To illustrate how such an axis may be unhelpful if you were trying to think more clearly, consider the equivalent for medicine. Diseases can be divided up into varying classes of difficulty to cure, with corresponding research being useful for curing them. Cuts/scrapes are self-mending, whereas infections require corresponding antibiotics/antivirals; immune disorders and cancers are diverse and therefore span various levels of difficulty amongst their instantiations. It's not clear to me that biologists/doctors would find much use from conjecture on exactly how hard each disease is to cure versus how likely it is to occur, especially in worlds where you lack a fundamental understanding of the related phenomena. Possibly, a closer analogy would be trying to troubleshoot ways evolution can generate highly dangerous species like humans. 

I think my attitude here leads into more takes about good and bad ways to discuss which research we should prioritise but I'm not sure how to convey those concisely. Hopefully this is useful. 

Comment by Joseph Bloom (Jbloom) on [Research Update] Sparse Autoencoder features are bimodal · 2023-06-24T02:12:49.847Z · LW · GW

Hey Robert, great work! My focus isn't currently on this but I thought I'd mention that these trends might relate to some of the observations in the Finding Neurons in a Haystack paper. https://arxiv.org/abs/2305.01610.

If you haven't read the paper, the short version is they used sparse probing to find neurons which linearly encode variables like "is this python code" in a variety of models with varying size. 

The specific observation which I believe may be relevant:

"As models increase in size, representation sparsity increases on average, but different features obey different dynamics: some features with dedicated neurons emerge with scale, others split into finer grained features with scale, and many remain unchanged or appear somewhat randomly" 

I believe this accords with your observation that "Finding more features finds more high-MCS features, but finds even more low-MCS features". 

Maybe finding ways to directly compare approaches could support further use of either approach. 

Also, interesting to hear about using EMD over KL divergence. I hadn't thought about that! 

Comment by Joseph Bloom (Jbloom) on A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N) · 2023-05-18T13:08:42.279Z · LW · GW

Thanks Simon, I'm glad you found the app intuitive :)

The RTG is just another token in the input, except that it has an especially strong relationship with training distribution. It's heavily predictive in a way other tokens aren't because it's derived from a labelled trajectory (it's the remaining reward in the trajectory after that step).

For BabyAI, the idea would be to use an instruction prepended to the trajectory made up of a limited vocab (see baby ai paper for their vocab). I would be pretty partial to throwing out the RTG and using a behavioral clone for a BabyAI model. It seems likely this would be easier to train. Since the goal of these models is to be useful for gaining understanding, I'd like to avoid reusing tokens as that might complicate analysis later on.

Comment by Joseph Bloom (Jbloom) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T21:54:12.255Z · LW · GW

Sure. Let's do it at EAG. :) 

Comment by Joseph Bloom (Jbloom) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T13:37:25.836Z · LW · GW

Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar, and this has enabled me to do what I call "an injection coefficient scan". 

The procedure I'm using looks like this (a rough sketch follows the list):

  1. Repeat your input tokens, say, 128 times. 
  2. Apply the activation vector at 128 different coefficients evenly spaced between -10 and 10, one per copy of the input, when doing your AVEC forward pass. 
  3. Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition with/without expand neurons). 
  4. Dot product the outputs with your logit direction of choice (I use a logit diff that is meaningful in my task).
  5. Plot the resulting attribution vs injection coefficient per component. 
  6. If you like, cluster the profiles to show how different components learn similar functions of the injection coefficient to your decision. 
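
A rough sketch of that scan, assuming TransformerLens-style hooks and cache utilities; the model, prompt, hook point, steering vector and logit direction are all placeholders, and the scan is written as a loop here rather than batched over repeated tokens.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The doctor said that")            # placeholder prompt
hook_point = "blocks.6.hook_resid_pre"                      # where the vector is injected
steering_vec = torch.randn(model.cfg.d_model)               # placeholder activation vector
steering_vec = steering_vec / steering_vec.norm()

# Placeholder logit-diff direction (e.g. " yes" minus " no").
logit_dir = model.W_U[:, model.to_single_token(" yes")] - model.W_U[:, model.to_single_token(" no")]

coefficients = torch.linspace(-10, 10, 128)
attributions = []
for coef in coefficients:
    def inject(resid, hook, coef=coef):
        return resid + coef * steering_vec
    with model.hooks(fwd_hooks=[(hook_point, inject)]):
        _, cache = model.run_with_cache(tokens)
    per_component, labels = cache.decompose_resid(return_labels=True)
    # Dot each component's final-position contribution with the logit direction.
    attributions.append(per_component[:, 0, -1, :] @ logit_dir)

attributions = torch.stack(attributions)   # [n_coefficients, n_components]
# Plot each column against `coefficients` to get per-component profiles, then cluster.
```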

So far, my results seem very interesting and possibly quite useful. It's possible this method is impractical in LLMs but I think it might be fine as well. Will dm some example figures. 
 
I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be. 

I am very excited to see if this makes my analyses easier! Great work! 

Comment by Joseph Bloom (Jbloom) on Residual stream norms grow exponentially over the forward pass · 2023-05-09T23:10:26.873Z · LW · GW

Sure, I could have phrased myself better and I meant to say "former", which didn't help either! 

Neither of these are novel concepts in that existing investigations have described features of this nature. 

  1. Good 1 aka Consumer goods. Useful for the unembed (may or may not be useful for other modular circuits inside the network). That Logit Lens gets better over the course of the forward pass suggests the residual stream contains these kinds of features, and more so as we move up the layers. 
     
     
  2. Good 2 aka Capital goods. Useful primarily for other circuits. A good example is the kind of writing to subspaces in the IOI circuit by duplicate token heads. The markup ""John" appeared twice" on a token / vector in the subspace of a token in the residual stream doesn't in itself tell you that "Mary" is the next token, but is useful to another head which is going to propose a name via another function. 

    Alternatively, in Neel's modular arithmetic, calculating waves like sin(wx) and cos(wx), which are only useful when you have the rest of the mechanism to take the argmax over z of cos(w(x+y))cos(wz) + sin(w(x+y))sin(wz) = cos(w(x+y−z)).
  3. I would have guessed that features appear first in the first category and later in the second, since how would you get gradients to things that aren't useful yet? However, the existence of clear examples of "internal signals" is somewhat indisputable.
  4. It seems plausible that there are lots of features that sit in both these categories, of course, so if it's useful you could define them to be mutually exclusive and add a third category for features that are both.

I realise that my saying "Maybe this is the only kind of good in which case transformers would be "fundamentally interpretable" in some sense. All intermediate signals could be interpreted as final products." was way too extreme. What I mean is that maybe category two is less common than we think. 

To relate this to AVEC (which I don't yet have a detailed understanding of how you are implementing), if you find the vector (I assume a residual stream vector) itself has a high dot product with specific unembeddings, then that says you're looking at something in category 1. However, if introducing it into the model earlier has a very different effect to introducing it directly before the unembedding, then that would suggest it's also being used by other modular circuits in the model. 

I think this kind of distinction is only one part of what I was trying to get at with circuit economics but hopefully that's clearer! Sorry for the long explanation and initial confusion. 

Comment by Joseph Bloom (Jbloom) on Residual stream norms grow exponentially over the forward pass · 2023-05-09T00:53:28.791Z · LW · GW

We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote-up this post because both Alex and I independently noticed this and weren't aware of this previously, so we wanted to make a reference post.

Happy to provide! I think I'm pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.

TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources such as the residual stream and weights between different learnable circuits, seems important. 

The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, activating on distinct patterns), they share capacity in weights (see polysemantic and capacity in neural networks) and I guess "bandwidth" (getting penalized for interfering signals in activations). There are a few reasons why I think this feels like economics which include: scarce resources, value chains (features composed of other features) and competition (if a circuit is predicting something well with one heuristic, maybe there will be smaller gradient updates to encourage another circuit learning a different heuristic to emerge). 

So to tie this back to your post and Alex's comment "which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability.". I think that what interpretability has recently dealt with in elucidating specific circuits is something like "micro-interpretability" and is akin to microeconomics. However this post seems to show a larger trend ie "macro-interpretability" which would possibly affect which of such circuits are possible/likely to be in the final model. 

I'll elaborate briefly on the off chance this seems like it might be a useful analogy/framing to motivate further work. 

  • Studying the Capacity/Loss Reduction distribution in Time: It seems like during transformer training there may be an effect not unlike inflation? Circuits which delivered enough value to justify their capacity use early in training may fall below the capacity/loss reduction cut off later. Maybe various techniques which enable us to train more robust models work because they make these transitions easier.
  • Studying the Capacity/Loss Reduction distribution in Layer: Moreover, it seems plausible that the distribution of "usefulness" of circuits in different layers of the network may be far from uniform. Circuits later in the network have far more refined inputs which make them better at reducing loss. Residual stream norm growth seems like a "macro" effect that shows models "know" that later layers are more important.
  • Studying the Capacity/Loss Reduction distribution in Layer and Time: Combining the above. I'd predict that neural networks originally start by having valuable circuits in many layers but then transition to maintain circuits earlier in the network which are valuable to many downstream circuits and circuits later in the network which make the best use of earlier circuits. 
  • More generally, "circuit economics" as a framing seems to suggest that there are different types of "goods" in the transformer economy: those which directly lead to better predictions, and those which are useful for making better predictions when integrated with other features. The success of the Logit Lens seems to suggest that the latter category increases over the course of the layers. Maybe this is the only kind of good, in which case transformers would be "fundamentally interpretable" in some sense: all intermediate signals could be interpreted as final products. More likely, I think, is that later in training there are ways to reinforce the creation of more internal goods (in economics, goods which are used to make other goods are called capital goods). The value of such goods would be mediated via later circuits. This would also lead to the "deletion-by-magnitude theory" as a way of removing internal goods. 
  • To bring this back to language already in the field, see Neel's discussion here. A modular circuit is distinct from an end-to-end circuit in that it starts and ends in intermediate activations. Modular circuits may be composable. I propose that the outputs of such circuits are "capital goods". If we think about the "circuit economy", it then seems totally reasonable that multiple suppliers might generate equivalent capital goods and have a many-to-many relationship with multiple different circuits near the end voting on logits. 

This is very speculative "theory", if you can call it that, but I guess I feel this would be "big if true". I also make no claims about this being super original or actually that useful in practice, but it does feel intuition-generating. I think this is totally the kind of thing people might have worked on sooner, but it's likely been historically hard to measure the kinds of things that might be relevant. What your post shows is that, between the transformer circuits framework and TransformerLens, we are able to quickly take a bunch of interesting measurements, which may provide more traction on this than previously possible. 
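For instance, a minimal sketch of one such measurement (this assumes GPT-2 small loaded via TransformerLens; the prompt is arbitrary and not anything from the post):

```python
# A minimal sketch (assumes GPT-2 small loaded via TransformerLens; prompt is arbitrary).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")

# Average L2 norm of the residual stream after each block.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]              # shape: [batch, seq, d_model]
    print(layer, resid.norm(dim=-1).mean().item())
```

If the picture in the post is right, the printed per-layer norms should grow roughly exponentially with depth.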

Comment by Joseph Bloom (Jbloom) on Residual stream norms grow exponentially over the forward pass · 2023-05-08T01:57:49.335Z · LW · GW

Second pass through this post, which solidly nerd-sniped me! 

A quick summary of my understanding of the post (intentionally being very reductive, though I understand the post may make more subtle points):

  1. There appears to be exponential growth in the norm of the residual stream in a range of models. Why is this the case?
  2. You consider two hypotheses: 
    1. That the parameters in the Attention and/or MLP weights increase later in the network. 
    2. That there is some monkey business with the layer norm sneaking in a single extra feature.
  3. In terms of evidence, you found that:
    1. Evidence for hypothesis one: W_OV Frobenius norms increase approximately exponentially over layers (a rough sketch of this measurement follows the list).
    2. Evidence for hypothesis one: the norm of the MLP output to the residual stream also increases (it's harder to directly measure the norm of the MLP itself due to the non-linearities).
  4. Your favoured explanation is "We finally note our current favored explanation: Due to LayerNorm, it's hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger. "
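
As promised above, a rough sketch of the kind of measurement behind 3.1 (assuming GPT-2 small via TransformerLens; this is my reconstruction rather than the post's exact methodology):

```python
# Rough sketch of the W_OV Frobenius-norm measurement (assumes GPT-2 small via
# TransformerLens; my reconstruction, not necessarily the post's exact methodology).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# W_V: [n_layers, n_heads, d_model, d_head]; W_O: [n_layers, n_heads, d_head, d_model]
W_OV = model.W_V @ model.W_O                      # [n_layers, n_heads, d_model, d_model]
frob = torch.linalg.norm(W_OV, dim=(-2, -1))      # Frobenius norm per (layer, head)

for layer in range(model.cfg.n_layers):
    print(layer, frob[layer].mean().item())       # head-averaged norm for each layer
```

Plotting the head-averaged norm against layer index should make the approximate exponential trend easy to eyeball.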
     

My thoughts:

  • My general take is that the explanation (that cancelling out existing features is harder than amplifying new ones) feels somewhat disconnected from the high-level characterisation of weights/norms which makes up most of the post. It feels like there is a question of how and a question of why.
  • Given these models are highly optimized by SGD, it seems like the conclusion must be that the residual stream norm grows because this is useful. The argument would then be that it is useful because the residual stream is a limited resource / has limited capacity, so the model wants to delete information from it, and increasing the norm of later contributions to the residual stream effectively achieves this by drowning out other features (see the toy sketch after this list). 
  • Moreover, if the mechanism by which we achieve larger residual stream contributions in later components is having larger weights (which is penalized by weight decay), then we should conclude that a residual stream with a large norm is worthwhile enough that the model would rather do this than have smaller weights (which you note). 
  • I still don't feel like I know why, though. Part of it could be that later layers have more information and are therefore "wiser", or something along those lines.
  • I'd also really like to know the implications of this. Does this affect the expressivity of the model in a meaningful way? Does it affect the relative value of representing a feature in any given part of the model? Does this create an incentive to "relocate" circuits during training or learn generic "amplification" functions? These are all somewhat ill-defined questions, but maybe there are better-defined formulations of them with implications for MI-related alignment work. 
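
Here is the toy sketch referenced above: a pure-torch illustration (not tied to any real model) of how a fixed per-layer growth rate in write norms drowns out earlier features after LayerNorm-style rescaling. The 48 layers are an arbitrary choice and the 4.5% growth rate is just the figure quoted above.

```python
# Toy illustration of the "drowning out" story (not tied to any real model).
# Each layer writes a fresh, roughly orthogonal direction that is 4.5% larger
# than the previous layer's write.
import torch

torch.manual_seed(0)
d_model, n_layers, growth = 4096, 48, 1.045

resid = torch.zeros(d_model)
writes, write_norm = [], 1.0
for _ in range(n_layers):
    w = torch.randn(d_model)
    w = w / w.norm() * write_norm     # fresh random (≈ orthogonal) direction
    resid = resid + w
    writes.append(w)
    write_norm *= growth

normed = resid / resid.norm()         # LayerNorm-style rescaling (ignoring the learned gain)
cos_first = normed @ (writes[0] / writes[0].norm())
cos_last = normed @ (writes[-1] / writes[-1].norm())
print("stream norm (~growth**n_layers up to a constant):", resid.norm().item())
print("squared cosine with earliest write:", cos_first.item() ** 2)
print("squared cosine with latest write:  ", cos_last.item() ** 2)
```

The earliest write ends up with a tiny share of the normalized stream while the most recent writes carry most of it, which is the sense in which growing norms "delete" older information.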
     

Thanks for writing this up! Looking forward to subsequent post/details :) 

PS: Is there a non-trivial relationship between this post and tuned lens/logit lens? https://arxiv.org/pdf/2303.08112.pdf Seems possible. 

Comment by Joseph Bloom (Jbloom) on Residual stream norms grow exponentially over the forward pass · 2023-05-08T00:27:10.672Z · LW · GW

Thanks for the feedback. On a second reading of this post and the paper I linked, and having read the paper you linked, my thoughts have developed significantly. A few points I'll make here before making a separate comment:
- The post I shared originally does indeed focus on dynamics, but it may have relevant general concepts in its discussion of the relationship between saturation and expressivity. However, it focuses on the QK circuit, which is less relevant here.
- My gut feeling is that true explanations of related phenomena should have non-trivial relationships: if you had a good explanation for why norms of parameters grow during training, it should relate to why norms of parameters differ across the model. However, this is a high-level argument, and the content of your post does of course directly address a different phenomenon (residual stream norms). If this paper had studied the training dynamics of the residual stream norm, I think it would be very relevant. 

Comment by Joseph Bloom (Jbloom) on Residual stream norms grow exponentially over the forward pass · 2023-05-07T01:38:01.084Z · LW · GW

I really liked this post and would like to engage with it more later. It could be very useful! 

However, I also think it would be good for you to add a section reviewing previous academic work on this topic (e.g. https://aclanthology.org/2021.emnlp-main.133.pdf). This seems very relevant and may not be the only academic work on the topic (I did not search for long). Curious to hear what you find!