Enumerating objects a model "knows" using entity-detection features.
post by Alex Gibson · 2025-03-30
Introduction
Research on Sparse Autoencoders (SAEs) has identified "known entity" features in language models - features that activate when the model processes entities it "knows." If we can find the circuit models use to recognise that they know an entity, then by computing which inputs would trigger the circuit, we can extract a list of "known entities".
The aim is to do this in a mostly dataset-free way: bootstrapping from a small number of known entities to guess which components are involved in the circuit, and using insights extracted from SAEs to guide circuit discovery, but not using SAEs or a large dataset of text to find these entities.
In theory, given the "known entity" feature, you could run all possible inputs through the model to extract a list of "known entities". But this is inefficient, so the approach is to find mathematical simplifications in the circuit that let us speed up the process.
I focus on GPT2-Small's first layer, where certain neurons appear to distinguish between "known" and "unknown" bigram proper nouns. By developing a simple model of how these neurons make this distinction, we can quickly filter for entities the model recognizes. These neurons aren't perfect, and there are examples of false positives and false negatives, but they give quite a large list of bigrams, with relatively low noise compared with other methods.
While the approach is currently quite crude, the results are surprisingly clean. This post serves as a proof of concept for a more ambitious project extracting model knowledge through mechanistic interpretability.
'Known' Proper Noun Neuron:
In the first layer of GPT2-Small, three attention heads (3, 4, and 7) have local positional kernels that process n-grams and local context. These heads are prime candidates for circuits that recognize "known bigram nouns," so we should filter for circuits that use these heads in some way.
Through experimentation, I discovered a small set of neurons that consistently activate on known proper nouns. One example (of several) is Neuron 2946, which typically shows an activation of around +2.0 on known bigrams and minimal activation on unknown ones. These neurons don't activate on all known proper nouns, but they do activate on a large fraction of the variety of known bigrams I tried.
While Neuronpedia suggests this neuron has some preference for American topics, it responds to a wide variety of known proper nouns, as you can see by scrolling down far enough. For the purposes of extracting known entities, the fact that the neuron is not monosemantic is irrelevant, because we can simply restrict attention to contributions from the 'known bigram' circuit we uncover. This neuron serves as an inexpensive linear probe: since very few first-layer neurons reach activations this high (+2.0) on typical inputs, we can use a small test set and still expect the results to generalize.
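To make the probe concrete, here is a minimal sketch of reading off Neuron 2946's activation on the final token of a candidate bigram. It assumes the TransformerLens library and its standard hook names; the test phrases (including the unknown-bigram control) are my own illustrative picks.

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)  # inference only

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-Small
NEURON = 2946  # first-layer MLP neuron that tends to fire on known bigrams

def neuron_activation(bigram: str) -> float:
    """Return Neuron 2946's post-nonlinearity activation on the final token."""
    tokens = model.to_tokens(bigram)
    _, cache = model.run_with_cache(tokens)
    # Layer-0 MLP activations after the nonlinearity: [batch, pos, d_mlp].
    acts = cache["blocks.0.mlp.hook_post"]
    return acts[0, -1, NEURON].item()

# Known bigrams reportedly land around +2.0; unknown ones should stay low.
for phrase in [" Hong Kong", " Tony Blair", " purple Blair"]:
    print(repr(phrase), round(neuron_activation(phrase), 2))
```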
Circuit Discovery:
Analyzing contributions to Neuron 2946's activation reveals that Head 0.1, Head 0.7, and the direct circuit pathway are the key components. Head 0.1 functions as a duplicate token head (effectively making it part of the direct circuit), while Head 0.7 has a local positional kernel. This is encouraging as Head 0.7 was one of the highlighted heads with a local positional kernel.
The key insight is that Head 0.7 changes its attention pattern based on whether it's processing a known bigram. When given "a b" where "a b" is a known bigram, Head 0.7 strongly attends to "a" from position "b". For unknown bigrams, it distributes attention more broadly across the previous ~5 tokens.
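This is easy to spot-check. The sketch below (again assuming TransformerLens; the second phrase is just a made-up control) measures how much attention head 7 in layer 0 pays from the final token back to the token before it:

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)

model = HookedTransformer.from_pretrained("gpt2")
HEAD = 7  # layer-0 head with a local positional kernel

def attention_to_previous_token(text: str) -> float:
    """Head 0.7's attention weight from the last token to the token before it."""
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    # Attention pattern: [batch, head, query_pos, key_pos].
    pattern = cache["blocks.0.attn.hook_pattern"]
    return pattern[0, HEAD, -1, -2].item()

# Per the observation above, the known bigram should concentrate attention on
# its first token, while the control should spread it over nearby tokens.
print(attention_to_previous_token("I visited Hong Kong"))
print(attention_to_previous_token("I visited Blue Kong"))
```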
This creates a filtering mechanism:
- Head 0.7 acts as an initial filter, focusing attention on the first token of potential known bigrams
- The direct circuit provides information about the current token
- Neuron 2946 acts as a classifier using combined information about both tokens
The MLP layer likely combines multiple neurons like this for robust classification. While a single neuron won't perfectly classify all cases, it should give a good initial idea.
On its own, filtering for high EQKE entries on Head 0.7 isn't enough. It narrows down the candidate bigrams significantly, but leaves lots of nonsense bigrams, and there is too much noise to find interesting examples. This is most likely because the model can't perfectly learn a sparse list of bigrams with a rank-64 matrix.
Filtering process:
We now have enough information to construct a filter for known bigrams:
Step 1: For each fixed query token "b", find EQKE entries with far-above-average attention scores; about +4.0 above average is where most "known bigrams" seem to lie. We also want the entry to be higher than the diagonal entry, so that the head doesn't just attend to the current token. We can do this for all query tokens in parallel with PyTorch.
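A sketch of how Step 1 might look (using TransformerLens for the weights; LayerNorm is crudely approximated by standardizing each token embedding, positional embeddings are ignored, and the chunk size and thresholds are illustrative rather than the exact values used for the results below):

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)

model = HookedTransformer.from_pretrained("gpt2")
HEAD = 7
W_E = model.W_E                                   # [d_vocab, d_model]
W_Q, b_Q = model.W_Q[0, HEAD], model.b_Q[0, HEAD]
W_K, b_K = model.W_K[0, HEAD], model.b_K[0, HEAD]
d_head = model.cfg.d_head

# Crude stand-in for the pre-attention LayerNorm (no learned scale/bias).
E = (W_E - W_E.mean(-1, keepdim=True)) / W_E.std(-1, keepdim=True)
queries = E @ W_Q + b_Q                           # [d_vocab, d_head]
keys = E @ W_K + b_K                              # [d_vocab, d_head]

candidate_pairs = []
chunk = 512                                       # query tokens per batch
for start in range(0, queries.shape[0], chunk):
    q = queries[start:start + chunk]
    scores = q @ keys.T / d_head ** 0.5           # slice of the EQKE matrix
    mean = scores.mean(dim=-1, keepdim=True)
    rows = torch.arange(q.shape[0])
    diag = scores[rows, start + rows]             # score of "b" attending to itself
    mask = (scores > mean + 4.0) & (scores > diag[:, None])
    b_rel, a_idx = mask.nonzero(as_tuple=True)    # rows = query "b", cols = key "a"
    candidate_pairs.append(torch.stack([a_idx, b_rel + start], dim=-1))

candidates = torch.cat(candidate_pairs)           # each row is a candidate (a, b)
print(candidates.shape)
```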
Step 2: For each filtered bigram (only ~1 million remain of the original ~2.5 billion), compute the contribution to Neuron 2946 from Head 0.7 and from the direct circuit, and check that the total is above some threshold. Assuming Head 0.7 attends entirely to "a", we can do this quickly in parallel. We can afford to be sloppy because we expect the boundary between "known" and "unknown" bigrams to be sharp, based on evidence from SAE features.
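And a sketch of Step 2 under the same approximations. In practice the pairs would come from the Step 1 filter; here a tiny demo set stands in for them. The residual stream at "b" is approximated as the token embedding of "b" plus Head 0.7's OV output on "a", ignoring positional embeddings, the other heads, and the attention output bias, so the threshold is illustrative:

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)

model = HookedTransformer.from_pretrained("gpt2")
HEAD, NEURON = 7, 2946

def approx_ln(x):
    # Crude stand-in for LayerNorm (no learned scale/bias).
    return (x - x.mean(-1, keepdim=True)) / x.std(-1, keepdim=True)

W_E = model.W_E
E = approx_ln(W_E)
# Head 0.7's OV output for every possible attended-to token "a".
ov = (E @ model.W_V[0, HEAD] + model.b_V[0, HEAD]) @ model.W_O[0, HEAD]  # [d_vocab, d_model]

w_neuron = model.W_in[0][:, NEURON]               # Neuron 2946's input weights
b_neuron = model.b_in[0][NEURON]

# Tiny demo set; normally these (a, b) pairs come from the Step 1 filter.
pairs = [" Hong Kong", " Tony Blair", " purple Blair"]
candidates = torch.stack([model.to_tokens(p, prepend_bos=False)[0, :2] for p in pairs])

a_idx, b_idx = candidates[:, 0], candidates[:, 1]
resid = W_E[b_idx] + ov[a_idx]                    # direct path + Head 0.7 path
pre_act = approx_ln(resid) @ w_neuron + b_neuron  # approximate neuron input

THRESHOLD = 2.0                                   # illustrative; tune by inspection
for (a, b), score in zip(candidates, pre_act):
    bigram = model.tokenizer.decode([a.item(), b.item()])
    print(repr(bigram), round(score.item(), 2), "KEPT" if score > THRESHOLD else "dropped")
```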
Results of the filter:
The results are a lot cleaner after filtering for high neuron activation. It feels like at least 50% of the bigrams we uncover are 'real' as opposed to hallucinations. There are still 'fake bigrams', however.
A common instance of this: if there is a famous person with the last name "b", there will be ~5 bigrams of the form "a b" which the model treats as names it knows, at least in the first layer. My guess is that this is because common first names have high neuron contributions (they appear frequently in names the model knows), so the neuron can't distinguish between names that Head 0.7 doesn't itself distinguish; and Head 0.7 may often confuse names because of its low-rank structure.
Nonetheless, the results are significantly better than before filtering, and it is interesting to see the kinds of bigrams the model learns.
Below, I show the list of bigrams for a neuron threshold chosen so that the list fits here. I filtered for bigrams where both tokens start with a space, as these tend to be proper nouns. These bigrams are cleaner than those obtained at lower neuron thresholds, though the quality doesn't degrade much as the threshold is lowered. The code to reproduce these results can be found here.
[' Kend al', ' Cold War', ' Penn State', ' mutual fund', ' Random House', ' Kansas City', ' Salt City', ' Laur King', ' Luther King', ' Hann tou', ' FOX News', ' Capitol Hill', ' Ron Paul', ' Orange County', ' Wall Street', ' Green Party', ' pop culture', ' Serge Ev', ' FI Union', ' Green Bay', ' Tampa Bay', ' Homeland Security', ' Prison Service', ' Coast Guard', ' Orange Guard', ' Serge Johnson', ' Air Force', ' Quant Force', ' burg minister', ' Alex Jones', ' Indiana Jones', ' Ron Fe', ' Phill Fe', ' Luther Fe', ' Phill Louis', ' Chip shop', ' Star Wars', ' Syri Wars', ' Salt Lake', ' Sierra Club', ' Dark Knight', ' hot dogs', ' Hot dogs', ' Dead Sea', ' Cy Young', ' Magic Kingdom', ' Modern Family', ' real estate', ' Real estate', ' Joe Rog', ' Russell Wilson', ' Jerry Bru', ' Rex Ast', ' Raw Story', ' Columb Amendment', ' Palm Beach', ' ice cream', ' Ice cream', ' Jeffrey Kat', ' Super Bowl', ' Charl Bron', ' Hong Kong', ' Marshall Kong', ' Chip Kelly', ' clot hint', ' Madison Square', ' Columb Station', ' Bay Area', ' San Diego', ' Serge sheet', ' Liber Arabia', ' Rainbow Six', ' Joint Staff', ' Pearl Jam', ' Chip Som', ' Mountain Caller', ' Luther Kings', ' Pacific Ocean', ' Metal Gear', ' Serge Roberts', ' Final Fantasy', ' Nig belt', ' Academy Award', ' Indianapolis Castle', ' Howard Dean', ' Elizabeth Warren', ' Serge balls', ' San Antonio', ' Animal Planet', ' Anton Graham', ' Billy Graham', ' Electronic Arts', ' Madison Garden', ' pepper spray', ' Visual Studio', ' Glenn Beck', ' Charl tum', ' Virgin Islands', ' Alice Mun', ' Ken Griff', ' Tai Griff', ' Golden Dawn', ' Counter Strike', ' Jay Len', ' Charl Ther', ' Quant Forces', ' Marine Corps', ' Golden Gate', ' Viet Gate', ' Twilight Zone', ' Animal kingdom', ' Wil Laur', ' Dragonbound Laur', ' South Dakota', ' Scot cub', ' Ray Rice', ' Trib Hampshire', ' Fort Collins', ' prime ministers', ' Harry Potter', ' Warner Bros', ' Chuck Todd', ' Wonder Woman', ' Philip Morris', ' Trib Warner', ' Detroit Lions', ' Motor parks', ' obst tract', ' Golden Knights', ' Laur Tales', ' Mechanical Manning', ' Crystal Palace', ' Fort McM', ' Street Fighter', ' Ultimate Fighter', ' Jim Crow', ' Serge Hem', ' Serge sheets', ' Major Soccer', ' Kend Vi', ' Warner Brothers', ' Quant Lynch', ' Kend Lynch', ' Fried Chicken', ' Tiger Woods', ' Psy Griffin', ' Kend Griffin', ' Serge Bryant', ' Luther Strange', ' lineback corps', ' Serge Hud', ' Philadelphia Inqu', ' Fort Hood', ' Robin Hood', ' Pop Culture', ' Major Baseball', ' NC affiliate', ' Quant Copy', ' Martin Luther', ' Tiger Boys', ' Ghost Shell', ' Capitol dioxide', ' Electronic cigarettes', ' Detroit Tigers', ' Consumer Reports', ' fossil fuels', ' pir Caribbean', ' Common Ple', ' Red Bulls', ' Call Duty', ' Ray Kur', ' Lin Feng', ' Boston Globe', ' Golden Globe', ' Palm Springs', ' Sho Noah', ' Planned Parenthood', ' Tony Blair', ' Sandy Hook', ' Nick Cave', ' Cape Cod', ' Joint Chiefs', ' Cape Peterson', ' Dw Wade', ' General Motors', ' Huff ov', ' Howard Hughes', ' Twilight Saga', ' Sho Mohammed', ' blue collar', ' Aaron Rodgers', ' Ice Cream']
Conclusion:
The current method for filtering is quite crude, but shows potential. It would be interesting to investigate whether the names the neurons incorrectly fire on genuinely confuse the model, or whether the model refines bigrams in later layers. With this reduced list of bigrams, it may be possible to construct a 'lookup table' of sorts for bigrams, and use it to find structure (or the lack of it) in the model's representation of bigrams.