Posts

Enumerating objects a model "knows" using entity-detection features. 2025-03-30T16:58:01.957Z
Positional kernels of attention heads 2025-03-10T23:17:25.068Z
Using the probabilistic method to bound the performance of toy transformers 2025-01-21T23:01:38.067Z
Duplicate token neurons in the first layer of GPT-2 2024-12-27T04:21:55.896Z
Alex Gibson's Shortform 2024-12-27T04:21:55.840Z

Comments

Comment by Alex Gibson on Alex Gibson's Shortform · 2025-04-16T08:31:31.516Z · LW · GW
Comment by Alex Gibson on Enumerating objects a model "knows" using entity-detection features. · 2025-04-04T21:08:29.029Z · LW · GW

I'm glad you like it! Yeah, the lack of a dataset is the thing that excites me about this kind of approach, because it allows us to get validation of our mechanistic explanations via partial "dataset recovery", which I find really compelling. It's a lot slower going, and may only work out for the first few layers, but it makes for a rewarding loop.

The utility of SAEs is in telling us in an unsupervised way that there is a feature that codes for "known entity", but this project doesn't use SAEs explicitly. I look for sparse sets of neurons that activate highly on "known entities". Neel Nanda / Wes Gurnee's sparse probing work is the inspiration here: https://arxiv.org/abs/2305.01610

But we only know to look for this sparse set of neurons because the SAEs told us the "known entity" feature exists, and it's only because we know this feature exists that we expect neurons identified on a small set of entities (I think I looked at <5 examples and identified Neuron 0.2946, though admittedly I kinda cheated by double-checking on Neuronpedia) to generalize.
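For concreteness, here's a minimal sketch of what that neuron search can look like (the activation files, shapes, and mean-difference scoring below are illustrative assumptions on my part, not the exact procedure from the post):

```python
import numpy as np

# Assumed pre-extracted layer-0 MLP activations, taken at the final token of each prompt.
# acts_known:   prompts ending in a known entity
# acts_unknown: prompts ending in a made-up / random name
acts_known = np.load("acts_known_entities.npy")      # shape (n_prompts, d_mlp)
acts_unknown = np.load("acts_unknown_entities.npy")  # shape (n_prompts, d_mlp)

# Rank neurons by how much more they fire on known entities; with only a few
# example entities, the top of this list gives the candidate sparse set.
score = acts_known.mean(axis=0) - acts_unknown.mean(axis=0)
for idx in np.argsort(score)[::-1][:10]:
    print(f"neuron 0.{idx}: mean activation difference = {score[idx]:.3f}")
```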

If you count linear probing as a non-interp strategy, you could find the linear direction associated with "entity detection", and then just run the model over all 50257^2 possible pairs of input tokens. The mech interp approach still has to deal with 50257^2 pairs of inputs, but we can use our circuit analysis to save significant time by avoiding the model overhead, meaning we get the list of bigrams pretty much instantly. The circuit analysis also tells us we only have to look at the previous 2 tokens to determine the broad component of the "entity detection" direction, which we might not know a priori. But I wouldn't say this is a project only interp can do, just that interp maybe speeds it up significantly.

[Note: the reason we need 50257^2 inputs even in the mechanistic approach is that I don't know of a good method for extracting the sparse set of large EQKE entries without computing the whole matrix. If we could find a way to do this, we could save significant time. But it's not necessarily a bottleneck for analysing n-grams, because the 50257^2 complexity comes from the quadratic form in attention, not from the fact that we are looking at bigrams. So if we found a circuit for n-grams, it wouldn't necessarily take us 50257^n time to list them, whereas non-interp approaches would scale like 50257^n.]
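A rough sketch of the "compute the whole EQKE matrix in chunks and keep the large entries" step (the weights here are random stand-ins with GPT-2 small's shapes and the threshold is arbitrary; substitute the actual embedding and head matrices from the circuit):

```python
import torch

# Random stand-ins for the token embedding and one head's query/key matrices.
d_vocab, d_model, d_head = 50257, 768, 64
W_E = torch.randn(d_vocab, d_model) / d_model**0.5
W_Q = torch.randn(d_model, d_head) / d_model**0.5
W_K = torch.randn(d_model, d_head) / d_model**0.5

QK = W_Q @ W_K.T          # (d_model, d_model) bilinear form of the attention head
threshold = 5.0           # arbitrary cutoff for a "large" EQKE entry
large_pairs = []

# Chunk over query tokens so the full 50257 x 50257 EQKE matrix is never held in memory.
# (Slow on CPU with these shapes; quick on a GPU.)
for start in range(0, d_vocab, 1024):
    rows = W_E[start:start + 1024] @ QK @ W_E.T          # one slice of EQKE
    qs, ks = (rows > threshold).nonzero(as_tuple=True)
    large_pairs.extend(zip((qs + start).tolist(), ks.tolist()))
```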

Comment by Alex Gibson on Tracing the Thoughts of a Large Language Model · 2025-03-30T13:44:07.475Z · LW · GW

My model of why SAEs work well for the Anthropic analysis is that the concepts discussed are genuinely 'sparse' features. Like predicting 'Rabbit' on the next line is a discrete decision, and so is of the form that SAEs are designed to model. We expect these SAE features to generalize OOD, because the model probably genuinely has these sparse directions.
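(A toy sketch of that form, with made-up dimensions: an activation vector reconstructed as a sparse, non-negative combination of learned feature directions.)

```python
import torch
import torch.nn as nn

d_model, d_sae = 768, 16384   # illustrative sizes, not any particular trained SAE

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        # Feature activations; an L1 penalty during training keeps most of them at zero.
        f = torch.relu(self.enc(x))
        return self.dec(f), f   # reconstruction as a sparse sum of decoder directions

x = torch.randn(4, d_model)                        # stand-in residual-stream activations
reconstruction, features = SparseAutoencoder()(x)
```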

Whereas for 'contextual / vibes' based features, the ground truth is not a sparse sum of discrete features. It's a continuous summary of the text obtained by averaging representations over the sequence. In this case, SAEs exhibit feature splitting, where they model the continuous summary with sparser and sparser features by clustering texts from the dataset together in finer and finer divisions. This starts off canonical, but eventually the clusters you choose are not features of the model but features of the dataset. And at that point the features are no longer robust OOD, because they aren't genuine internal model features; they are tiny clusters that emerge from the interaction between the model and the dataset.

So in theory the model might have a direction corresponding to 'harmful intent', but the SAEs split the dataset into so many chunks that to recover 'harmful intent' you need to combine lots of SAE latents. And the OOD behaviour arises from the SAE latents being unfaithful to the ground truth, not from the model having poor OOD behaviour. Like the SAE latents might be sufficiently fine-grained that you can patch together chunks of the dataset in a way that fits the training data but isn't robust.

As for concepts that generalize OOD - I suppose it depends what is meant by OOD? Like is looking at a dataset the model wasn't exposed to, but that it reasonably could have been, OOD? If so, the incentive for learning OOD-robust concepts is that, assuming most text the model receives is novel, this text is OOD for the model, so if its concepts are only relevant to the text it has seen so far, it will perform poorly. You can also argue that regularisation drives short description lengths, and thus generalising concepts. Whether a chunk of the training set consists of duplicates / near-duplicates is kind of irrelevant, because even if only 50% of the text is novel, the novel text still provides the incentive for robust concepts.

Comment by Alex Gibson on Tracing the Thoughts of a Large Language Model · 2025-03-30T12:08:28.911Z · LW · GW

But models are incentivized to have concepts that generalize OOD because models hardly ever see the same training data more than once.

Comment by Alex Gibson on Matthias Dellago's Shortform · 2025-03-21T20:29:55.795Z · LW · GW

You can have a hypothesis with really high Kolmogorov complexity, but if the hypothesis is true 50% of the time, it will require only 1 bit of information to specify with respect to a coding scheme that merely points to cached hypotheses.

This is why Kolmogorov complexity is defined with respect to a fixed universal description language; otherwise you're right that it's vacuous to talk about the simplicity of a hypothesis.
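A toy numeric version of the point above (the hypotheses and priors are made up; the cost of naming a cached hypothesis depends only on its prior, not on how long it takes to write out from scratch):

```python
import math

# Made-up cached hypotheses with made-up priors over which one is true.
cached = {
    "short_hypothesis": 0.50,
    "very_long_complicated_hypothesis": 0.50,
}

# Under a coding scheme that just points at cached hypotheses, specifying a
# hypothesis costs -log2(prior) bits, regardless of its Kolmogorov complexity.
for name, prior in cached.items():
    print(f"{name}: {-math.log2(prior):.1f} bits")  # 1.0 bit each
```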