Posts
Comments
List of some larger mech interp project ideas (see also: short and medium-sized ideas). Feel encouraged to leave thoughts in the replies below!
Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!
What is going on with activation plateaus: Transformer activations space seems to be made up of discrete regions, each corresponding to a certain output distribution. Most activations within a region lead to the same output, and the output changes sharply when you move from one region to another. The boundaries seem to correspond to bunched-up ReLU boundaries as predicted by grokking work. This feels confusing. Are LLMs just classifiers with finitely many output states? How does this square with the linear representation hypothesis, the success of activation steering, logit lens etc.? It doesn't seem in obvious conflict, but it feels like we're missing the theory that explains everything. Concrete project ideas:
- Can we in fact find these discrete output states? Of course we expect thee to be a huge number, but maybe if we restrict the data distribution very much (a limited kind of sentence like "person being described by an adjective") we are in a regime with <1000 discrete output states. Then we could use clustering (K-means and such) on the model output, and see if the cluster assignments we find map to activation plateaus in model activations. We could also use a tiny model with hopefully less regions, but Jett found regions to be crisper in larger models.
- How do regions/boundaries evolve through layers? Is it more like additional layers split regions in half, or like additional layers sharpen regions?
- What's the connection to the grokking literature (as the one mentioned above)?
- Can we connect this to our notion of features in activation space? To some extent "features" are defined by how the model acts on them, so these activation regions should be connected.
- Investigate how steering / linear representations look like through the activation plateau lens. On the one hand we expect adding a steering vector to smoothly change model output, on the other hand the steering we did here to find activation plateaus looks very non-smooth.
- If in fact it doesn't matter to the model where in an activation plateau an activation lies, would end-to-end SAEs map all activations from a plateau to a single point? (Anecdotally we observed activations to mostly cluster in the centre of activation plateaus so I'm a bit worried other activations will just be out of distribution.) (But then we can generate points within a plateau by just running similar prompts through a model.)
- We haven't managed to make synthetic activations that match the activation plateaus observed around real activations. Can we think of other ways to try? (Maybe also let's make this an interpretability challenge?)
Use sensitive directions to find features: Can we use the sensitivity of directions as a way to find the "true features", some canonical basis of features? In a recent post we found current SAE features to look less special that expected, so I'm a bit cautious about this. But especially after working on some toy models about computation in superposition I'd be keen to explore the error correction predictions made here (paper, comment).
Test of we can fully sparsify a small model: Try the full pipeline of training SAEs everywhere, or training Transcoders & Attention SAEs, and doing all that such that connections between features are sparse (such that every feature only interacts with a few other features). The reason we want that is so that we can have simple computational graphs, and find simple circuits that explain model behaviour.
I expect that---absent of SAE improvements finding the "true feature" basis---you'll need to train them all together with a penalty for the sparsity of interactions. To be concrete, an inefficient thing you could do is the following: Train SAEs on every residual stream layer, with a loss term that L1 penalises interactions between adjacent SAE features. This is hard/inefficient because the matrix of SAE interactions is huge, plus you probably need attributions to get these interactions which are expensive to compute (at every training step!). I think the main question for this project is to figure out whether there is a way to do this thing efficiently. Talk to Logan Smith, Callum McDoughall, and I expect there are a couple more people who are trying something like this.
List of some medium-sized mech interp project ideas (see also: shorter and longer ideas). Feel encouraged to leave thoughts in the replies below!
Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!
Toy model of Computation in Superposition: The toy model of computation in superposition (CIS; Circuits-in-Sup, Comp-in-Sup post / paper) describes a way in which NNs could perform computation in superposition, rather than just storing information in superposition (TMS). It would be good to have some actually trained models that do this, in order (1) to check whether NNs learn this algorithm or a different one, and (2) to test whether decomposition methods handle this well.
This could be, in the simplest form, just some kind of non-trivial memorisation model, or AND-gate model. Just make sure that the task does in fact require computation, and cannot be solved without the computation. A more flashy versions could be a network trained to do MNIST and FashionMNIST at the same time, though this would be more useful for goal (2).
Transcoder clustering: Transcoders are a sparse dictionary learning method that e.g. replaces an MLP with an SAE-like sparse computation (basically an SAE but not mapping activations to itself but to the next layer). If the above model of computation / circuits in superposition is correct (every computation using multiple ReLUs for redundancy) then the transcoder latents belonging to one computation should co-activate. Thus it should be possible to use clustering of transcoder activation patterns to find meaningful model components (circuits in the circuits-in-superposition model). (Idea suggested by @Lucius Bushnaq, mistakes are mine!) There's two ways to do this project:
- Train a toy model of circuits in superposition (see project above), train a transcoder, cluster latent activations, and see if we can recover the individual circuits.
- Or just try to cluster latent activations in an LLM transcoder, either existing (e.g. TinyModel) or trained on an LLM, and see if the clusters make any sense.
Investigating / removing LayerNorm (LN): For GPT2-small I showed that you can remove LN layers gradually while fine-tuning without loosing much model performance (workshop paper, code, model). There are three directions that I want to follow-up on this project.
- Can we use this to find out which tasks the model did use LN for? Are there prompts for which the noLN model is systematically worse than a model with LN? If so, can we understand how the LN acts mechanistically?
- The second direction for this project is to check whether this result is real and scales. I'm uncertain about (i) given that training GPT2-small is possible in a few (10?) GPU-hours, does my method actually require on the order of training compute? Or can it be much more efficient (I have barely tried to make it efficient so far)? This project could demonstrate that the removing LayerNorm process is tractable on a larger model (~Gemma-2-2B?), or that it can be done much faster on GPT2-small, something on the order of O(10) GPU-minutes.
- Finally, how much did the model weights change? Do SAEs still work? If it changed a lot, are there ways we can avoid this change (e.g. do the same process but add a loss to keep the SAEs working)?
List of some short mech interp project ideas (see also: medium-sized and longer ideas). Feel encouraged to leave thoughts in the replies below!
Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!
Directly testing the linear representation hypothesis by making up a couple of prompts which contain a few concepts to various degrees and test
- Does the model indeed represent intensity as magnitude? Or are there separate features for separately intense versions of a concept? Finding the right prompts is tricky, e.g. it makes sense that friendship and love are different features, but maybe "my favourite coffee shop" vs "a coffee shop I like" are different intensities of the same concept
- Do unions of concepts indeed represent addition in vector space? I.e. is the representation of "A and B" vector_A + vector_B? I wonder if there's a way you can generate a big synthetic dataset here, e.g. variations of "the soft green sofa" -> "the [texture] [colour] [furniture]", and do some statistical check.
Mostly I expect this to come out positive, and not to be a big update, but seems cheap to check.
SAEs vs Clustering: How much better are SAEs than (other) clustering algorithms? Previously I worried that SAEs are "just" finding the data structure, rather than features of the model. I think we could try to rule out some "dataset clustering" hypotheses by testing how much structure there is in the dataset of activations that one can explain with generic clustering methods. Will we get 50%, 90%, 99% variance explained?
I think a second spin on this direction is to look at "interpretability" / "mono-semanticity" of such non-SAE clustering methods. Do clusters appear similarly interpretable? I This would address the concern that many things look interpretable, and we shouldn't be surprised by SAE directions looking interpretable. (Related: Szegedy et al., 2013 look at random directions in an MNIST network and find them to look interpretable.)
Activation steering vs prompting: I've heard the view that "activation steering is just fancy prompting" which I don't endorse in its strong form (e.g. I expect it to be much harder for the model to ignore activation steering than to ignore prompt instructions). However, it would be nice to have a prompting-baseline for e.g. "Golden Gate Claude". What if I insert a "<system> Remember, you're obsessed with the Golden Gate bridge" after every chat message? I think this project would work even without the steering comparison actually.
CLDR (Cross-layer distributed representation): I don't think Lee has written his up anywhere yet so I've removed this for now.
Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.
Thanks for the flag! It's these two images, I realize now that they don't seem to have direct links
Images taken from AMFTC and Crosscoders by Anthropic.
Thanks for the comment!
I think this is what most mech interp researchers more or less think. Though I definitely expect many researchers would disagree with individual points, nor does it fairly weigh all views and aspects (it's very biased towards "people I talk to"). (Also this is in no way an Apollo / Apollo interp team statement, just my personal view.)
Thanks! You're right, totally mixed up local and dense / distributed. Decided to just leave out that terminology
Why I'm not too worried about architecture-dependent mech interp methods:
I've heard people argue that we should develop mechanistic interpretability methods that can be applied to any architecture. While this is certainly a nice-to-have, and maybe a sign that a method is principled, I don't think this criterion itself is important.
I think that the biggest hurdle for interpretability is to understand any AI that produces advanced language (>=GPT2 level). We don't know how to write a non-ML program that speaks English, let alone reason, and we have no idea how GPT2 does it. I expect that doing this the first time is going to be significantly harder, than doing this the 2nd time. Kind of how "understand an Alien mind" is much harder than "understand the 2nd Alien mind".
Edit: Understanding an image model (say Inception V1 CNN) does feel like a significant step down, in the sense that these models feel significantly less "smart" and capable than LLMs.
Why I'm not that hopeful about mech interp on TinyStories models:
Some of the TinyStories models are open source, and manage to output sensible language while being tiny (say 64dim embedding, 8 layers). Maybe it'd be great to try and thoroughly understand one of those?
I am worried that those models simply implement a bunch of bigrams and trigrams, and that all their performance can be explained by boring statistics & heuristics. Thus we would not learn much from fully understanding such a model. Evidence for this is that the 1-layer variant, which due to it's size can only implement bigrams & trigram-ish things, achieves a better loss than many of the tall smaller models (Figure 4). Thus it seems not implausible that most if not all of the performance of all the models could be explained by similarly simple mechanisms.
Folk wisdom is that the TinyStories dataset is just very formulaic and simple, and therefore models without any sophisticated methods can appear to produce sensible language. I haven't looked into this enough to understand whether e.g. TinyStories V2 (used by TinyModel) is sufficiently good to dispel this worry.
Collection of some mech interp knowledge about transformers:
Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.
Transformers take in a sequence of tokens, and return logprob predictions for the next token. We think it works like this:
- Activations represent a sum of feature directions, each direction representing to some semantic concept. The magnitude of directions corresponds to the strength or importance of the concept.
- These features may be 1-dimensional, but maybe multi-dimensional features make sense too. We can either allow for multi-dimensional features (e.g. circle of days of the week), acknowledge that the relative directions of feature embeddings matter (e.g. considering days of the week individual features but span a circle), or both. See also Jake Mendel's post.
- The concepts may be "linearly" encoded, in the sense that two concepts A and B being present (say with strengths α and β) are represented as α*vector_A + β*vector_B). This is the key assumption of linear representation hypothesis. See Chris Olah & Adam Jermyn but also Lewis Smith.
- The residual stream of a transformer stores information the model needs later. Attention and MLP layers read from and write to this residual stream. Think of it as a kind of "shared memory", with this picture in your head, from Anthropic's famous AMFTC.
- This residual stream seems to slowly accumulate information throughout the forward pass, as suggested by LogitLens.
- Additionally, we expect there to be internally-relevant information inside the residual stream, such as whether the sequence of nouns in a sentence is ABBA or BABA.
- Maybe think of each transformer block / layer as doing a serial step of computation. Though note that layers don't need to be privileged points between computational steps, a computation can be spread out over layers (see Anthropic's Crosscoder motivation)
- Superposition. There can be more features than dimensions in the vector space, corresponding to almost-orthogonal directions. Established in Anthropic's TMS. You can have a mix as well. See Chris Olah's post on distributed representations for a nice write-up.
- Superposition requires sparsity, i.e. that only few features are active at a time.
- The model starts with token (and positional) embeddings.
- We think token embeddings mostly store features that might be relevant about a given token (e.g. words in which it occurs and what concepts they represent). The meaning of a token depends a lot on context.
- We think positional embeddings are pretty simple (in GPT2-small, but likely also other models). In GPT2-small they appear to encode ~4 dimensions worth of positional information, consisting of "is this the first token", "how late in the sequence is it", plus two sinusoidal directions. The latter three create a helix.
- PS: If you try to train an SAE on the full embedding you'll find this helix split up into segments ("buckets") as individual features (e.g. here). Pay attention to this bucket-ing as a sign of compositional representation.
- The overall Transformer computation is said to start with detokenization: accumulating context and converting the pure token representation into a context-aware representation of the meaning of the text. Early layers in models often behave differently from the rest. Lad et al. claim three more distinct stages but that's not consensus.
- There's a couple of common motifs we see in LLM internals, such as
- LLMs implementing human-interpretable algorithms.
- Induction heads (paper, good illustration): attention heads being used to repeat sequences seen previously in context. This can reach from literally repeating text to maybe being generally responsible for in-context learning.
- Indirect object identification, docstring completion. Importantly don't take these early circuits works to mean "we actually found the circuit in the model" but rather take away "here is a way you could implement this algorithm in a transformer" and maybe the real implementation looks something like it.
- In general we don't think this manual analysis scales to big models (see e.g. Tom Lieberum's paper)
- Also we want to automate the process, e.g. ACDC and follow-ups (1, 2).
- My personal take is that all circuits analysis is currently not promising because circuits are not crisp. With this I mean the observation that a few distinct components don't seem to be sufficient to explain a behaviour, and you need to add more and more components, slowly explaining more and more performance. This clearly points towards us not using the right units to decompose the model. Thus, model decomposition is the major area of mech interp research right now.
- Moving information. Information is moved around in the residual stream, from one token position to another. This is what we see in typical residual stream patching experiments, e.g. here.
- Information storage. Early work (e.g. Mor Geva) suggests that MLPs can store information as key-value memories; generally folk wisdom is that MLPs store facts. However, those facts seem to be distributed and non-trivial to localise (see ROME & follow-ups, e.g. MEMIT). The DeepMind mech interp team tried and wasn't super happy with their results.
- Logical gates. We think models calculate new features from existing features by computing e.g. AND and OR gates. Here we show a bunch of features that look like that is happening, and the papers by Hoagy Cunningham & Sam Marks show computational graphs for some example features.
- Activation size & layer norm. GPT2-style transformers have a layer normalization layer before every Attn and MLP block. Also, the norm of activations grows throughout the forward pass. Combined this means old features become less important over time, Alex Turner has thoughts on this.
- LLMs implementing human-interpretable algorithms.
- (Sparse) circuits agenda. The current mainstream agenda in mech interp (see e.g. Chris Olah's recent talk) is to (1) find the right components to decompose model activations, to (2) understand the interactions between these features, and to finally (3) understand the full model.
- The first big open problem is how to do this decomposition correctly. There's plenty of evidence that the current Sparse Autoencoders (SAEs) don't give us the correct solution, as well as conceptual issues. I'll not go into the details here to keep this short-ish.
- The second big open problem is that the interactions, by default, don't seem sparse. This is expected if there are multiple ways (e.g. SAE sizes) to decompose a layer, and adjacent layers aren't decomposed correspondingly. In practice this means that one SAE feature seems to affect many many SAE features in the next layers, more than we can easily understand. Plus, those interactions seem to be not crisp which leads to the same issue as described above.
Thanks for the nice writeup! I'm confused about why you can get away without interpretation of what the model components are:
In cases where we worry that our model learned a human-simulator / camera-simulator rather than actually predicting whether the diamond exists, wouldn't circuit discovery simply give us the human-simulator circuit? (And thus causal scrubbing doesn't save us.) I'm thinking in particular of cases where the human-simulator is easier to learn than the intended solution.
Of course if you had good interpretability, a way to realise whether your explanation is the human simulator is to look for suspicious human-simulator-related features. I would like to get away without interpretation, but it's not clear to me that this works.
Paper link: https://arxiv.org/abs/2407.20311
(I have neither watched the video nor read the paper yet, just in case someone else was looking for the non-video version)
Thanks! I'll edit it
[…] no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
I find myself really confused by this argument. Shards (or anything) do not need to be “concentrated in one spot” for studying them to make sense?
As Neel and Lucius say, you might study SAE latents or abstractions built on the weights, no one requires (or assumes) than things are concentrated in one spot.
Or to make another analogy, one can study neuroscience even though things are not concentrated in individual cells or atoms.
If we still disagree it’d help me if you clarified how the “So […]” part of your argument follows
Edit: The “the real thinking happens in the scaffolding” is a reasonable argument (and current mech interp doesn’t address this) but that’s a different argument (and just means we understand individual forward passes with mech interp).
Even after reading this (2 weeks ago), I today couldn't manage to find the comment link and manually scrolled down. I later noticed it (at the bottom left) but it's so far away from everything else. I think putting it somewhere at the top near the rest of the UI would be much easier for me
I would like the following subscription: All posts with certain tags, e.g. all [AI] posts or all [Interpretability (ML & AI)] posts.
I just noticed (and enabled) a “subscribe” feature in the page for the tag, it says “Get notifications when posts are added to this tag.” — I’m unsure if those are emails, but assuming they are, my problem is solved. I never noticed this option before.
And here's the code to do it with replacing the LayerNorms with identities completely:
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
# Undo my hacky LayerNorm removal
for block in model.transformer.h:
block.ln_1.weight.data = block.ln_1.weight.data / 1e6
block.ln_1.eps = 1e-5
block.ln_2.weight.data = block.ln_2.weight.data / 1e6
block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5
# Properly replace LayerNorms by Identities
class HookedTransformerNoLN(HookedTransformer):
def removeLN(self):
for i in range(len(self.blocks)):
self.blocks[i].ln1 = torch.nn.Identity()
self.blocks[i].ln2 = torch.nn.Identity()
self.ln_final = torch.nn.Identity()
hooked_model = HookedTransformerNoLN.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
hooked_model.removeLN()
hooked_model.cfg.normalization_type = None
prompt = torch.tensor([1,2,3,4], device="cpu")
logits = hooked_model(prompt)
print(logits.shape)
print(logits[0, 0, :10])
Here's a quick snipped to load the model into TransformerLens!
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=False, center_unembed=False).to("cpu")
# Kill the LayerNorms because TransformerLens overwrites eps
for block in hooked_model.blocks:
block.ln1.eps = 1e12
block.ln2.eps = 1e12
hooked_model.ln_final.eps = 1e12
# Make sure the outputs are the same
prompt = torch.tensor([1,2,3,4], device="cpu")
logits = hooked_model(prompt)
logits2 = model(prompt).logits
print(logits.shape, logits2.shape)
print(logits[0, 0, :10])
print(logits2[0, :10])
I really like the investigation into properties of SAE features, especially the angle of testing whether SAE features have particular properties than other (random) directions don't have!
Random directions as a baseline: Based on my experience here I expect random directions to be a weak baseline. For example the covariance matrix of model activations (or SAE features) is very non-uniform. I'd second @Hoagy's suggestion of linear combination of SAE features, or direction towards other model activations as I used here.
Ablation vs functional FT-LLC: I found the comparison between your LLC measure (weights before the feature), and the ablation effect (effect of this feature on the output) interesting, and I liked that you give some theories, both very interesting! Do you think @jake_mendel's error correction theory is related to these in any way?
I like this idea! I'd love to see checks of this on the SOTA models which tend to have lots of layers (thanks @Joseph Miller for running the GPT2 experiment already!).
I notice this line of argument would also imply that the embedding information can only be accessed up to a certain layer, after which it will be washed out by the high-norm outputs of layers. (And the same for early MLP layers which are rumoured to act as extended embeddings in some models.) -- this seems unexpected.
Additionally, they would be further evidence (but not conclusive[2]) towards hypotheses Residual Networks Behave Like Ensembles of Relatively Shallow Networks.
I have the opposite expectation: Effective layer horizons enforce a lower bound on the number of modules involved in a path. Consider the shallow path
- Input (layer 0) -> MLP 10 -> MLP 50 -> Output (layer 100)
If the effective layer horizon is 25, then this path cannot work because the output of MLP10 gets lost. In fact, no path with less than 3 modules is possible because there would always be a gap > 25.
Only a less-shallow paths would manage to influence the output of the model
- Input (layer 0) -> MLP 10 -> MLP 30 -> MLP 50 -> MLP 70 -> MLP 90 -> Output (layer 100)
This too seems counterintuitive, not sure what to make of this.
I know he’s legitimately affiliated with that YT channel
Can I ask how you know that? The amount of "w Stephen Fry" video titles made me suspicious, and I wondered whether it's AI generated and not Stephen-Fry-endorsed, but I haven't done any further research.
Edit: A colleague just pointed out that other videos are up to 7 years old (and AI voice wasn't this good then), so in that video the voice must be real
Has anyone tested whether feature splitting can be explained by composite (non-atomic) features?
- Feature splitting is the observation that SAEs with larger dictionary size find features that are geometrically (cosine similarity) and semantically (activating dataset examples) similar. In particular, a larger SAE might find multiple features that are all similar to each other, and to a single feature found in a smaller SAE.
- Anthropic gives the example of the feature " 'the' in mathematical prose" which splits into features " 'the' in mathematics, especially topology and abstract algebra" and " 'the' in mathematics, especially complex analysis" (and others).
There’s at least two hypotheses for what is going on.
- The “true features” are the maximally split features; the model packs multiple true features into superposition close to each other. Smaller SAEs approximate multiple true features as one due to limited dictionary size.
- The “true features” are atomic features, and split features are composite features made up of multiple atomic features. Feature splitting is an artefact of training the model for sparsity, and composite features could be replaced by linear combinations of a small number of other (atomic) features.
Anthropic conjectures hypothesis 1 in Towards Monosemanticity. Demian Till argues for hypothesis 2 in this post. I find Demian’s arguments compelling. They key idea is that an SAE can achieve lower loss by creating composite features for frequently co-occurring concepts: The composite feature fires instead of two (or more) atomic features, providing a higher sparsity (lower sparsity penalty) at the cost of taking up another dictionary entry (worse reconstruction).
- I think the composite feature hypothesis is plausible, especially in light of Anthropic’s Feature Completeness results in Scaling Monosemanticity. They find that not all model concepts are represented in SAEs, and that rarer concepts are less likely to be represented (they find an intriguing relation between number of alive features and feature frequency required to be represented in the SAE, likely related to the frequency-rank via Zipf’s law). I find it probably that the optimiser may dedicate extra dictionary entries to composite features of high-frequency concepts at the cost of representing low-frequency concepts.
- This is bad for interpretability not (only) because low-frequency concepts are omitted, but because the creation of composite features requires the original atomic features to not fire anymore in the composite case.
- Imagine there is a “deception” feature, and a “exam” feature. How deception in exams is quite common, so the model learns a composite “deception in the context of exams” feature, and the atomic “deception” and “exam” features no longer fire in that case.
- Then we can no longer use the atomic “deception” SAE direction as a reliable detector of deception, because it doesn’t fire in cases where the composite feature is active!
Do we have good evidence for the one or the other case?
We observe that split features often have high cosine similarity, but this is explained by both hypotheses. (Anthropic says features are clustered together because they’re similar. Demian Till’s hypothesis would claim that multiple composite features contain the same atomic features, again explaining the similarity.)
A naive test may be to test whether features can be explained by a sparse linear combination of other features, though I’m not sure how easy this would be to test.
For reference, cosine similarity of SAE decoder directions in Joseph Bloom's GPT2-small SAEs, blocks.1.hook_resid_pre
and blocks.10.hook_resid_pre
compared to random directions and random directions with the same covariance as typical activations.
But there is still a mystery I don't fully understand: how is it possible to find so many "noise" vectors that don't influence the output of the network much.
In unrelated experiments I found that steering into a (uniform) random direction is much less effective, than steering into a random direction sampled with same covariance as the real activations. This suggests that there might be a lot of directions[1] that don't influence the output of the network much. This was on GPT2 but I'd expect it to generalize for other Transformers.
- ^
Though I don't know how much space / what the dimensionality of that space is; I'm judging this by the "sensitivity curve" (how much steering is needed for a noticeable change in KL divergence).
Hmm, with that we'd need to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 -> 20) as an indication of how much the cosine similarity changes, then this is consistent with for the original vector. This seems plausible for a steering vector?
- ^
Thanks to @Lucius Bushnaq for correcting my earlier wrong number
That model has an Attention and MLP block (GPT2-style model with 1 layer but a bit wider, 21M params).
I changed my mind over the course of this morning. TheTinyStories models' language isn't that bad, and I think it'd be a decent research project to try to fully understand one of these.
I've been playing around with the models this morning, quotes from the 1-layer model:
Once upon a time, there was a lovely girl called Chloe. She loved to go for a walk every morning and one day she came across a road.
One day, she decided she wanted to go for a ride. She jumped up and down, and as she jumped into the horn, shouting whatever makes you feel like.
When Chloe was flying in the sky, she saw some big white smoke above her. She was so amazed and decided to fly down and take a closer look.
When Chloe got to the edge of a park, there was a firework show. The girl smiled and said "Oh, hello, there. I will make sure to finish in my flying body before it gets too cold," it said.So Chloe flew to the park again, with a very persistent look at the white horn. She was very proud of her creation and was thankful for being so brave.
Summary: Chloe, a persistent girl, explores the park with the help of a firework sparkle and is shown how brave the firework can be persistent.
and
Once upon a time, there lived a young boy. His name was Caleb. He loved to learn new things and gain healthy by playing outside.
One day, Caleb was in the garden and he started eating an onion. He was struggling to find enough food to eat, but he couldn't find anything.
Just then, Caleb appeared with a magical lake. The young boy told Caleb he could help him find his way home if he ate the onion. Caleb was so excited to find the garden had become narrow enough for Caleb to get his wish.
Caleb thought about what the pepper was thinking. He then decided to try and find a safer way to play with them next time. From then on, Caleb became healthier and could eat sweets and sweets in the house.
With the peppers, Caleb ate delicious pepper and could be heard by again. He was really proud of himself and soon enough he was playing in the garden again.
This feels like the kind of inconsistency I expect from a model that has only one layer. It can recall that the story was about flying and stuff, and the names, but it feels a bit like the model doesn't remember what it said a paragraph before.
2-layer model:
Once upon a time, there was a lazy bear. He lived in a tall village surrounded by thick trees and lonely rivers.
The bear wanted to explore the far side of the mountain, so he asked a kind bird if he wanted to come. The bird said, "Yes, but first let me seat in my big tree. Follow me!"
The bear was excited and followed the bird. They soon arrived at a beautiful mountain. The mountain was rich with juicy, delicious fruit. The bear was so happy and thanked the bird for his help. They both shared the fruit and had a great time.
The bear said goodbye to the bird and returned to his big tree, feeling very happy and content. From then on, the bear went for food every day and could often seat in his tall tree by the river.
Summary: A lazy bear ventures on a mountain and finds a kind bird who helps him find food on his travels. The bear is happy and content with the food and a delicious dessert.
and
Once upon a time, there were two best friends, a gingerbread fox and a gingerbread wolf. Everyone loved the treats and had a great time together, playing games and eating the treats.
The gingerbread fox spoke up and said: "Let's be like buying a house for something else!" But the ginger suggested that they go to the market instead. The friends agreed and they both went to the market.
Back home, the gingerbread fox was happy to have shared the treats with the friends. They all ate the treats with the chocolates, ran around and giggled together. The gingerbread fox thought this was the perfect idea, and every day the friends ate their treats and laughed together.
The friends were very happy and enjoyed every single morsel of it. No one else was enjoying the fun and laughter that followed. And every day, the friends continued to discuss different things and discover new new things to imagine.
Summary: Two best friends, gingerbread and chocolate, go to the market to buy treats but end up only buying a small house for a treat each, which they enjoy doing together.
I think if we can fully understand (in the Python code sense, probably with a bunch of lookup tables) how these models work this will give us some insight into where we're at with interpretability. Do the explanations feel sufficiently compressed? Does it feel like there's a simpler explanation that the code & tables we've written?
Edit: Specifically I'm thinking of
- Train SAEs on all layers
- Use this for Attention QK circuits (and transform OV circuit into SAE basis, or Transcoder basis)
- Use Transcoders for MLPs
(Transcoders vs SAEs are somewhat redundant / different approaches, figure out how to connect everything together)
The tiny story status seems quite simple, in the sense that I can see how you could provide TinyStories levels of loss by following simple rules plus a bunch of memorization.
Empirically, one of the best models in the tiny stories paper is a super wide 1L transformer, which basically is bigrams, trigrams, and slightly more complicated variants [see Bucks post] but nothing that requires a step of reasoning.
I am actually quite uncertain where the significant gap between TinyStories, GPT-2 and GPT-4 is. Maybe I could fully understand TinyStories-1L if I tried, would this tell us about GPT-4? I feel like the result for TinyStories will be a bunch of heuristics.
Thanks for the comment Lawrence, I appreciate it!
- I agree this doesn't distinguish superposition vs no superposition at all; I was more thinking about the "error correction" aspect of MCIS (and just assuming superposition to be true). But I'm excited too for the SAE application, we got some experiments in the pipeline!
- Your Correct behaviour point sounds reasonable but I feel like it's not an explanation? I would have the same intuitive expectation, but that doesn't explain how the model manages to not be sensitive. Explanations I can think of in increasing order of probability:
- Story 0: Perturbations change activations and logprobs, but the answer doesn't change because the logprob difference was large. I don't think the KL divergence would behave like that.
- Story 1: Perturbations do change the activations but the difference in the logprobs is small due to layer norm, unembed, or softmax shenanigans.
- We did a test-experiment of perturbing the 12th layer rather than the 2nd layer, and the difference between real-other and random disappeared. So I don't think it's a weird effect when activations get converted to outputs.
- Story 2: Perturbations in a lower layer cause less perturbation in later layers if the model is on-distribution (+ similar story for sensitivity).
- This is what the L2-metric plots (right panel) suggest, and also what I understand your story to be.
- But this doesn't explain how the model does this, right? Are there simple stories how this happens?
- I guess there's lots of stories not limited to MCIS, anything along the lines of "ReLUs require thresholds to be passed"?
Based on that, I think the results still require some "error-correction" explanation, though you're right that this doesn't have to me MCIS (it's just that there's no other theory that doesn't also conflict with superposition?).
My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.
- This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.” -- no, this is fine, the model-feature exists but simply isn't found by the SAE.
- What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won't help us understand models
- ^
Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)
There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn't expect this connection to model behaviour.
But often we make the claim that we often make is that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.
Are the features learned by the model the same as the features learned by SAEs?
TL;DR: I want true features model-features to be a property of the model weights, and to be recognizable without access to the full dataset. Toy models have that property. My “poor man’s model-features” have it. I want to know whether SAE-features have this property too, or if SAE-features do not match the true features model-features.
Introduction: Neural networks likely encode features in superposition. That is, features are represented as directions in activation space, and the model likely tracks many more features than dimensions in activation space. Because features are sparse, it should still be possible for the model to recover and use individual feature values.[1]
Problem statement: The prevailing method for finding these features are Sparse Autoencoders (SAEs). SAEs are well-motivated because they do recover superposed features in toy models. However, I am not certain whether SAEs recover the features of LLMs. I am worried (though not confident) that SAEs do not recover the features of the model (but the dataset), and that we are thus overconfident in how much SAEs tell us.
SAE failure mode: SAEs are trained to achieve a certain compression[2] task: Compress activations into a sparse overcomplete basis, and reconstruct the original activations based on this compressed representation. The solution to this problem can be identical to what the neural network does (wanting to store & use information), but it not necessarily is. In TMS, the network’s only objective is to compress features, so it is natural that the SAE-features match the model-features. But LLMs solve a different task (well, we don’t have a good idea what LLMs do), and training an SAE on a model’s activations might yield a basis different from the model-features (see hypothetical Example 1 below).
Operationalisation of model-features (I’m tabooing “true features”): In the Toy Model of Superposition (TMS) the model’s weights are clearly adjusted to the features directions. We can tell a feature from looking at the model weights. I want this to be a property of true SAE-features as well. Then I would be confident that the features are a property of the model, and not (only) of the dataset distribution. Concrete operationalisation:
- I give you 5 real SAE-features, and 5 made-up features (with similar properties). Can you tell which features are the real ones? Without relying on the dataset (but you may use an individual prompt). Lindsey (2024) is some evidence, but would it distinguish the SAE-features from an arbitrary decomposition of the activations into 5 fake-features?
Why do I care? I expect that the model-features are, in some sense, the computational units of the model. I expect our understanding to be more accurate (and to generalize) if we understand what the model actually does internally (see hypothetical Example 2 below).
Is this possible? Toy models of computation in superposition seem to suggest that models give special treatment to feature directions (compared to arbitrary activation directions), for example the error correction described here. This may privilege the basis of model-features over other decompositions of activations. I discuss experiment proposals at the bottom.
Example 1: Imagine an LLM was trained on The Pile excluding Wikipedia. Now we train an SAE on the model’s activations on a different dataset including Wikipedia. I expect that the SAE will find Wikipedia-related features: For example, a Wikipedia-citation-syntax feature on a low level, or an Wikipedia-style-objectivity feature on a high level. I would claim that this is not a feature of the model: During training the model never encountered these concepts, it has not reserved a direction in its superposition arrangement (think geometric shapes in Toy Model of Superposition) for this feature.
- It feels like there is a fundamental distinction between a model (SGD) “deciding” whether to learn a feature (as it does in TMS) and an SAE finding a feature that was useful for compressing activations.
Example 2: Maybe an SAE trained on an LLM playing Civilization and Risk finds a feature that corresponds to “strategic deception” on this dataset. But actually the model does not use a “strategic deception” feature (instead strategic deception originates from some, say, the “power dynamics” feature), and it just happens that the instances of strategic deception in those games clustered into a specific direction. If we now take this direction to monitor for strategic deception we will fail to notice other strategic deception originating from the same “power dynamics” features.
- If we had known that the model-features that were active during the strategic deception instances were the “power dynamics” (+ other) features, we would have been able to choose the right, better generalizing, deception detection feature.
Experiment proposals: I have explored the abnormal effect that “poor man’s model-features” (sampled as the difference between two independent model activations) have on model outputs, and their relation to theoretically predicted noise suppression in feature activations. Experiments in Gurnee (2024) and Lindsey (2024) suggest that SAE decoder errors and SAE-features also have an abnormal effect on the model. With the LASR Labs team I mentor I want to explore whether SAE-features match the theoretical predictions, and whether the SAE-feature effects match those expected from model-features.
I think we should think more about computation in superposition. What does the model do with features? How do we go from “there are features” to “the model outputs sensible things”? How do MLPs retrieve knowledge (as commonly believed) in a way compatible with superposition (knowing more facts than number of neurons)?
This post (and paper) by @Kaarel, @jake_mendel, @Dmitry Vaintrob (and @LawrenceC) is the kind of thing I'm looking for, trying to lay out a model of how computation in superposition could work. It makes somewhat-concrete predictions about the number and property of model features.
Why? Because (a) these feature properties may help us find the features of a model (b) a model of computation may be necessary if features alone are not insufficient to address AI Safety (on the interpretability side).
This is great, love it! Settings recommendation: If you (or your company) want, you can restrict the extension's access from all websites down to the websites you read papers on. Note that the scholar.google.com access is required for the look-up function to work.
Maybe I'm confused, but isn't integrated gradients strictly slower than an ablation to a baseline?
For a single interaction yes (1 forward pass vs integral with n_alpha integration steps, each requiring a backward pass).
For many interactions (e.g. all connections between two layers) IGs can be faster:
- Ablation requires d_embed^2 forward passes (if you want to get the effect of every patch on the loss)
- Integrated gradients requires d_embed * n_alpha forward & backward passes
(This is assuming you do path patching rather than "edge patching", which you should in this scenario.)
Sam Marks makes a similar point in Sparse Feature Circuits, near equations (2), (3), and (4).
So we can ‘train’ a circuit by optimizing the Mask parameters using gradient descent.
Did you try how this works in practice? I could imagine an SGD-based circuit finder could be pretty efficient (compared to brute-force algorithms like ACDC), I'd love to see that comparison some day! (might be a project I should try!)
Edit: I remember @Buck and @dmz were suggesting something along those lines last year
Do you have a link to a writeup of Li et al. (2023) beyond the git repo?
Does this still work if there is a layer norm between the layers?
This works because the difference in input to the edge destination is equal to the difference in output of the source component.
This is key to why you can compute the patched inputs quickly, but it only holds without layer norm, right?
It took me a second to understand why "edge patching" can work with only 1 forward pass. I'm rephrasing my understanding here in case it helps anyone else:
If we path patch node X in layer 1 to node Z in layer 3, then the only way to know what the input to node Z looks like without node X is to actually run a forward pass. Thus we need to run a forward pass for every target node that we want to receive a different set of inputs.
However, if we path patch (edge patch) node X in layer 1 to node Y in layer 2, then we can calculate the new input to node Y "by hand" (without running the model, i.e. cheaply): The input to node Y is just the sum of outputs in the previous layers. So you can skip all the "compute what the input would look like" forward passes.
Thanks for the question! This is not something we have included in our distribution, so I think our patching experiments aren't answering that question. If I would speculate though, I'd suggest
- The Prev Tok head 1.4 might "check" for a signature of "I am inside a function definition" (maybe a L0 head that attends to the
def
keyword. This would make it work only on B_def not B_dec - Duplicate Tok head 1.2 might help the mover heads by suppressing their attention to repeated tokens. We observed this ("Duplicate Token Head 1.2 is helping Argument Movers"), but were not confident whether it is important. When doing ACDC we felt 1.2 wasn't actually too important (IIRC) but again this would depend on the distribution
In summary, I can think of a range of possible mechanisms how the model could achieve that, but our experiments don't test for that (because copying the 2nd token after B_dec would be equally bad for the clean and corrupted prompts).
I replicated this with only copy-pasting the message, rather than screenshots.
- Does not include “ChatGPT” sender name
- Takes out text recognition stuff
Works with ChatGPT 3.5: https://chat.openai.com/share/bde3633e-c1d6-47ab-bffb-6b9e6190dc2c (Edit: After a few clarification questions, I think the 3.5 one is an accidental victory, we should try a couple more times)
Works with ChatGPT 4: https://chat.openai.com/share/29ca8161-85ce-490b-87b1-ee613f9e284d https://chat.openai.com/share/7c223030-8523-4234-86a2-4b8fbecfd62f
(Non-cherry picked, seemed to work consistently.)
It’s interesting how the model does this, I’d love to quiz it on this but don’t have time for that right now (and I’d be even more excited if we could mechanistically understand this).
My two bullet point summary / example, to test my understanding:
- We ask labs to implement some sort of filter to monitor their AI's outputs (likely AI based).
- Then we have a human team "play GPT-5" and try to submit dangerous outputs that the filter does not detect (w/ AI assistance etc. of course).
Is this (an example of) a control measure, and a control evaluation?
Nice work! I'm especially impressed by the [word] and [word] example: This cannot be read-off the embeddings, thus the model must be actually computing and storing this feature somewhere! I think this is exciting since the things we care about (deception etc.) are also definitely not included in the embeddings. I think you could make a similar case for Title Case and Beginning & End of First Sentence but those examples look less clear, e.g. the Title Case could be mostly stored in "embedding of uppercase word that is usually lowercase".
Thank you for making the early write-up! I'm not entirely certain I completely understand what you're doing, could I give you my understanding and ask you to fill the gaps / correct me if you have the time? No worries if not, I realize this is a quick & early write-up!
Setup:
As previously you run Pythia on a bunch of data (is this the same data for all of your examples?) and save its activations.
Then you take the residual stream activations (from which layer?) and train an autoencoder (like Lee, Dan & beren here) with a single hidden layer (w/ ReLU), larger than the residual stream width (how large?), trained with an L1-regularization on the hidden activations. This L1-regularization penalizes multiple hidden activations activating at once and therefore encourages encoding single features as single neurons in the autoencoder.
Results:
You found a bunch of features corresponding to a word or combined words (= words with similar meaning?). This would be the embedding stored as a features (makes sense).
But then you also find e.g. a "German Feature", a neuron in the autoencoder that mostly activates when the current token is clearly part of a German word. When you show Uniform examples you show randomly selected dataset examples? Or randomly selected samples where the autoencoder neuron is activated beyond some threshold?
When you show Logit lens you show how strong the embedding(?) or residual stream(?) at a token projects into the read-direction of that particular autoencoder neuron?
In Ablated Text you show how much the autoencoder neuron activation changes (change at what position?) when ablating the embedding(?) / residual stream(?) at a certain position (same or different from the one where you measure the autoencoder neuron activation?). Does ablating refer to setting some activations at that position to zero, or to running the model without that word?
Note on my use of the word neuron: To distinguish residual stream features from autoencoder activations, I use neuron to refer to the hidden activation of the autoencoder (subject to an activation function) while I use feature to refer to (a direction of) residual stream activations.
Huh, thanks for this pointer! I had not read about NTK (Neural Tangent Kernel) before. What I understand you saying is something like SGD mainly affects weights the last layer, and the propagation down to each earlier layer is weakened by a factor, creating the exponential behaviour? This seems somewhat plausible though I don't know enough about NTK to make a stronger statement.
I don't understand the simulation you run (I'm not familiar with that equation, is this a common thing to do?) but are you saying the y levels of the 5 lines (simulating 5 layers) at the last time-step (finished training) should be exponentially increasing, from violet to red, green, orange, and blue? It doesn't look exponential by eye? Or are you thinking of the value as a function of x (training time)?
I appreciate your comment, and looking for mundane explanations though! This seems the kind of thing where I would later say "Oh of course"
Hi, and thanks for the comment!
Do you think there should be a preference to the whether one patches clean --> corrupt or corrupt --> clean?
Both of these show slightly different things. Imagine an "AND circuit" where the result is only correct if two attention heads are clean. If you patch clean->corrupt (inserting a clean attention head activation into a corrupt prompt) you will not find this; but you do if you patch corrupt->clean. However the opposite applies for a kind of "OR circuit". I historically had more success with corrupt->clean so I teach this as the default, however Neel Nanda's tutorials usually start the other way around, and really you should check both. We basically ran all plots with both patching directions and later picked the ones that contained all the information.
did you find that the selection of [the corrupt words] mattered?
Yes! We tried to select equivalent words to not pick up on properties of the words, but in fact there was an example where we got confused by this: We at some point wanted to patch param
and naively replaced it with arg
, not realizing that param
is treated specially! Here is a plot of head 0.2's attention pattern; it behaves differently for certain tokens. Another example is the self
token: It is treated very differently to the variable name tokens.
So it definitely matters. If you want to focus on a specific behavior you probably want to pick equivalent tokens to avoid mixing in other effects into your analysis.
Thanks for finding this!
There was one assumption in the StackExchange post I didn't immediately get, that the variance of is . But I just realized the proof for that is rather short: Assuming (the variance of ) is the identity then the left side is
and the right side is
so this works out. (The symbols are sums here.)
Thank for for the extensive comment! Your summary is really helpful to see how this came across, here's my take on a couple of these points:
2.b: The network would be sneaking information about the size of the residual stream past LayerNorm. So the network wants to implement an sort of "grow by a factor X every layer" and wants to prevent LayerNorm from resetting its progress.
- There's the difference between (i) How does the model make the residual stream grow exponentially -- the answer is probably theory 1, that something in the weights grow exponentially. And there is (ii) our best guess on Why the model would ever want this, which is the information deletion thing.
How and why disconnected
Yep we give some evidence for How, but for Why we have only a guess.
still don't feel like I know why though
earn generic "amplification" functions
Yes, all we have is some intuition here. It seems plausible that the model needs to communicate stuff between some layers, but doesn't want this to take up space in the residual stream. So this exponential growth is a neat way to make old information decay away (relatively). And it seems plausible to implement a few amplification circuits for information that has to be preserved for much later in the network.
We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote-up this post because both Alex and I independently noticed this and weren't aware of this previously, so we wanted to make a reference post.
If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.
I understand the network's "intention" the other way around, I think that the network wants to have an exponentially growing residual stream. And in order to get an exponentially growing residual stream the model increases its weights exponentially.
And our speculation for why the model would want this is our "favored explanation" mentioned above.
Thanks for the comment and linking that paper! I think this is about training dynamics though, norm growth as a function of checkpoint rather than layer index.
Generally I find basically no papers discussing the parameter or residual stream growth over layer number, all the similar-sounding papers seem to discuss parameter norms increasing as a function of epoch or checkpoint (training dynamics). I don't expect the scaling over epoch and layer number to be related?
Only this paper mentions layer number in this context, and the paper is about solving the vanishing gradient problem in Post-LN transformers. I don't think that problem applies to the Pre-LN architecture? (see the comment by Zach Furman for this discussion)
Oh I hadn't thought of this, thanks for the comment! I don't think this apply to Pre-LN Transformers though?
-
In Pre-LN transformers every layer's output is directly connected to the residual stream (and thus just one unembedding away from logits), wouldn't this remove the vanishing gradient problem? I just checkout out the paper you linked, they claim exponentially vanishing gradients is a problem (only) in Post-LN, and how Pre-LN (and their new method) prevent the problem, right?
-
The residual stream norm curves seem to follow the exponential growth quite precisely, do vanishing gradient problems cause such a clean result? I would have intuitively expected the final weights to look somewhat pathological if they were caused by such a problem in training.
Re prediction: Isn't the sign the other way around? Vanishing gradients imply growing norms, right? So vanishing gradients in Post-LN would cause gradients to grow exponentially towards later (closer to output) layers (they also plot something like this in Figure 3 in the linked paper). I agree with the prediction that Post-LN will probably have even stronger exponential norm growth, but I think that this has a different cause to what we find here.
Finally, we give a simple approach to verify that a particular token is unspeakable rather than just being hard-to-speak.
You're using an optimization procedure to find an embedding that produces an output, and if you cannot find one you say it is unspeakable. How confident are you that the optimization is strong enough? I.e. what are the odds that a god-mode optimizer in this high-dimensional space could actually find an embedding that produces the unspeakable token, it's just that linprog wasn't strong enough?
Just checking here, I can totally imagine that the optimizer is an unlikely point of failure. Nice work again!
Thanks Marius for this great write-up!
However, I was surprised to find that the datapoints the network misclassified on the training data are evenly distributed across the D* spectrum. I would have expected them to all have low D* didn’t learn them.
My first intuition here was that the misclassified data points where the network just tried to use the learned features and just got it wrong, rather than those being points the network didn't bother to learn? Like say a 2 that looks a lot like an 8 so to the network it looks like a middle-of-the-spectrum 8? Not sure if this is sensible.
The shape of D* changes very little between initialization and the final training run.
I think this is actually a big hint that a lot of the stuff we see in those plots might be not what we think it is / an illusion. Any shape present at initialization cannot tell us anything about the trained network. More on this later.
the distribution of errors is actually left-heavy which is exactly the opposite of what we would expect
Okay this would be much easier if you collapsed the x-axis of those line plots and made it a histogram (the x axis is just sorted index right?), then you could make the dots also into histograms.
we would think that especially weird examples are more likely to be misclassified, i.e. examples on the right-hand side of the spectrum
So are we sure that weird examples are on the right-hand side? If I take weird examples to just trigger a random set of features, would I expect this to have a high or low dimensionality? Given that the normal case is 1e-3 to 1e-2, what's the random chance value?
We train models from scratch to 1,2,3,8,18 and 40 iterations and plot D*, the location of all misclassified datapoints and a histogram over the misclassification rate per bin.
This seems to suggest the left-heavy distribution might actually be due to initialization too? The left-tail seems to decline a lot after a couple of training iterations.
I think one of the key checks for this metric will be ironing out which apparent effects are just initialization. Those nice line plots look suggestive, but if initialization produces the same image we can't be sure what we can learn.
One idea to get traction here would be: Run the same experiment with different seeds, do the same plot of max data dim by index, then take the two sorted lists of indices and scatter-plot them. If this looks somewhat linear there might be some real reason why some data points require more dimensions. If it just looks random that would be evidence against inherently difficult/complicated data points that the network memorizes / ignores every time.
Edit: Some evidence for this is actually that the 1s tend to be systematically at the right of the curve, so there seems to be some inherent effect to the data!
I don't think I understand the problem correctly, but let me try to rephrase this. I believe the key part is the claim whether or not ChatGPT has a global plan? Let's say we run ChatGPT one output at a time, every time appending the output token to the current prompt and calculating the next output. This ignores some beam search shenanigans that may be useful in practice, but I don't think that's the core issue here.
There is no memory between calculating the first and second token. The first time you give ChatGPT the sequence "Once upon a" and it predicts "time" and you can shut down the machine, the next time you give it "Once upon a time" and it predicts the next word. So there isn't any global plan in a very strict sense.
However when you put "Once upon a time" into a transformer, it will actually reproduce the exact values from the "Once upon a" run, in addition to a new set of values for the next token. Internally, you have a column of residual stream for every word (with 400 or so rows aka layers each), and the first four rows are identical between the two runs. So you could say that ChatGPT reconstructs* a plan every time it's asked to output a next token. It comes up with a plan every single time you call it. And the first N columns of the plan are identical to the previous plan, and with every new word you add a column of plan. So in that sense there is a global plan to speak of, but this also fits within the framework of predicting the next token.
"Hey ChatGPT predict the next word!" --> ChatGPT looks at the text, comes up with a plan, and predicts the next word accordingly. Then it forgets everything, but the next time you give it the same text + one more word, it comes up with the same plan + a little bit extra, and so on.
Regarding 'If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me.' I am not sure what you mean with this. I think an important note is to keep in mind it uses the same parameters for every "column", for every word. There is no such thing as ChatGPT not visiting every parameter.
And please correct me if I understood any of this wrongly!
*in practice people cache those intermediate computation results somewhere in their GPU memory to not have to recompute those internal values every time. But it's equivalent to recomputing them, and the latter has less complications to reason about.