Attention-Feature Tables in Gemma 2 Residual Streams
post by J Bostock (Jemist) · 2024-08-06T22:56:40.828Z · LW · GW
This is research I did in a short span of time. It is likely not optimal, but it's unclear whether the constraints are my skills, my tools, or my methods. Code and full results can be found here, but you'll need to download the model yourself to replicate it.
TL;DR
Using Gemma Scope, I am able to find connections between SAE features in the attention portion of Gemma 2 2B's layer 13. This can be done without running the model. These are often (somewhat) meaningful, with the OV circuits generally being easier to interpret than the QK circuits.
The Background
Anthropic has done some nice research decomposing attention layers into the QK and OV circuits. To summarize, the QK circuit tells an attention head where to look, and the OV circuit tells it what information to pass forward. When looking at a single-layer transformer, this looks like patterns in tokens of the form [A] ... [B] → [C].
Recently, Google DeepMind released Gemma Scope, consisting of a large number of SAEs trained on the Gemma 2 family of models. Neuronpedia has already generated a description for all of the "canonical" SAEs using GPT-4o.
This is an attempt to look at QK and OV circuits in Gemma 2 from the lens of features, rather than tokens.
The Setup
I chose to look at layer 13 (indexed from zero), roughly halfway through the model. For each decoder vector in the 16k-feature canonical residual stream SAE of layer 12 (henceforth SAE12), I calculated the element-wise product with the RMS norm coefficients, then calculated the products with W_Q, W_K, and W_V. I'll refer to these as the queries, keys, and values.
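As a rough illustration of the step that produces the queries, keys, and values, here is a minimal NumPy sketch for a single head. Everything in it is a stand-in: the array names (`W_dec`, `norm_gain`, `W_Q`, `W_K`, `W_V`), the row-per-feature layout, and the toy sizes are assumptions for illustration, not the actual Gemma 2 or Gemma Scope tensors. As in the description above, only the element-wise norm coefficients are applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed; far smaller than the real model).
n_features, d_model, d_head = 32, 16, 4

# Hypothetical stand-ins for the real tensors.
W_dec = rng.normal(size=(n_features, d_model))  # SAE12 decoder vectors, one per row
norm_gain = rng.normal(size=d_model)            # RMS norm coefficients before attention
W_Q = rng.normal(size=(d_model, d_head))        # one head's query projection
W_K = rng.normal(size=(d_model, d_head))        # one head's key projection
W_V = rng.normal(size=(d_model, d_head))        # one head's value projection


def project_features(W_dec, norm_gain, W):
    """Scale each decoder vector by the norm coefficients, then project into head space."""
    return (W_dec * norm_gain) @ W


queries = project_features(W_dec, norm_gain, W_Q)  # shape (n_features, d_head)
keys = project_features(W_dec, norm_gain, W_K)
values = project_features(W_dec, norm_gain, W_V)
```

Each row of `queries`, `keys`, and `values` is then the head-space image of one SAE12 feature, which is all that is needed to compare features without running the model.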
I then took the encoder vectors of the 16k-feature canonical residual stream SAE of layer 13 (henceforth SAE13), did an element-wise product with the RMS norm of the attention layer output, and multiplied by the transpose of W_O. This creates a vector I'll call the pre-output. This is working backward towards the attention layer.
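The "working backward" step can be sketched the same way. The names (`W_enc`, `post_gain`, `W_O`), the tensor layouts, and the toy sizes below are assumptions for illustration only. The final check shows why the transpose appears: reading a head's output through the output projection, the norm coefficients, and an encoder vector gives the same number as a dot product with the pre-output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumed).
n_features, d_model, d_head = 32, 16, 4

# Hypothetical stand-ins for the real tensors.
W_enc = rng.normal(size=(n_features, d_model))  # SAE13 encoder vectors, one per row
post_gain = rng.normal(size=d_model)            # RMS norm coefficients on the attention output
W_O = rng.normal(size=(d_head, d_model))        # one head's output projection

# Pull each encoder vector back through the norm coefficients and the output projection.
pre_outputs = (W_enc * post_gain) @ W_O.T       # shape (n_features, d_head)

# Sanity check with an arbitrary head-space vector z.
z = rng.normal(size=d_head)
forward = ((z @ W_O) * post_gain) @ W_enc.T     # push z forward, then read every feature
backward = pre_outputs @ z                      # dot z with every pre-output
assert np.allclose(forward, backward)
```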
This does ignore the MLP layer. I wish there were SAEs trained between the two components. It also ignores the rotary positional encoding. I may return to these later. It also ignores the attention-out prelinear SAE! This is because Neuronpedia doesn't have labels available to download for these yet.
I calculated, for each head, a QK matrix consisting of the dot-products of the queries and keys, and an OV matrix containing the dot-products of the values and pre-outputs. For each head, I found the highest-weighted connections between features in the QK and OV circuits.
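A minimal sketch of this last step, assuming the per-head queries, keys, values, and pre-outputs are already available as arrays with one row per feature (random toy data stands in for them here). The helper name `top_connections` and the row/column orientation of the matrices are illustrative choices, not taken from the original code.

```python
import numpy as np


def top_connections(matrix, k):
    """Return (value, row, col) for the k entries with the largest absolute value."""
    flat = np.abs(matrix).ravel()
    # argpartition avoids a full sort, which matters for a 16k x 16k matrix.
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(flat[idx])[::-1]]
    rows, cols = np.unravel_index(idx, matrix.shape)
    return [(float(matrix[r, c]), int(r), int(c)) for r, c in zip(rows, cols)]


rng = np.random.default_rng(2)
n_features, d_head = 32, 4  # toy sizes (assumed)

# Random stand-ins for one head's per-feature vectors.
queries = rng.normal(size=(n_features, d_head))
keys = rng.normal(size=(n_features, d_head))
values = rng.normal(size=(n_features, d_head))
pre_outputs = rng.normal(size=(n_features, d_head))

qk = keys @ queries.T         # qk[i, j]: key feature i against query feature j
ov = values @ pre_outputs.T   # ov[i, j]: value feature i against output feature j

top_qk = top_connections(qk, 5)
top_ov = top_connections(ov, 5)
```

Sorting by absolute value keeps both strongly positive and strongly negative connections, which matches the mix of signs in the tables further down.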
Decoder Bias vs Rotary Embedding
I take the decoder bias of SAE12 to represent the "average" value of the residual stream going into layer 13. By looking at the query-key dot product of this with itself at different positional distances, we can maybe kinda see how each head looks across tokens:
Seems like head 0 is mostly a previous-token head, whereas the others fall off more slowly over distance.
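One way to sketch this kind of query-key-versus-distance calculation is below. It assumes a rotary embedding in the "half-split" convention with a base of 10000, and uses a random vector in place of the SAE12 decoder bias; the function names, the base, and all tensors are assumptions for illustration rather than the code behind the plot.

```python
import numpy as np


def rope(x, position, base=10000.0):
    """Apply a rotary embedding to a 1-D vector of even length (half-split convention)."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = position * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin])


def score_by_distance(q, k, max_distance):
    """Query-key dot product when the query sits `dist` tokens after the key."""
    return np.array([rope(q, dist) @ rope(k, 0) for dist in range(max_distance + 1)])


rng = np.random.default_rng(3)
d_model, d_head = 16, 8  # toy sizes (assumed)

# Hypothetical stand-ins: decoder bias, norm coefficients, one head's projections.
decoder_bias = rng.normal(size=d_model)
norm_gain = rng.normal(size=d_model)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

q = (decoder_bias * norm_gain) @ W_Q
k = (decoder_bias * norm_gain) @ W_K
scores = score_by_distance(q, k, max_distance=20)
```

Because the rotation depends only on relative position, fixing the key at position 0 loses nothing; a head whose `scores` curve drops off sharply after small distances would look like a short-range head in this picture.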
Interesting Findings
Head 0 has lower QK-weighting than the other heads. It does a few interesting functions, such as noticing when mathematical constants are present and down-weighting a feature related to social equity (presumably to point the model towards the correct concept of "equal to" in this context).
It also detects a feature relating to scientific context and up-weights a feature relating to the past tense, specifically completed actions, which is indeed the most common tense for scientific writing to be in!
Lots of OV circuits are pretty interpretable, or at least seem that way: features up or down-weight later features appropriately.
Unfortunately, many of the feature labels are not very good. For example, I keep getting ones relating to "product descriptions" cropping up in unrelated text in the Neuronpedia playground. I assume a more expensive model would do a better job. Also, the features in a 16k SAE for a 2k residual stream are not very monosemantic. It would be interesting to try the 256k SAEs once those are labelled.
It seems like many of the heads are deep in attention-head superposition. If the attention SAE gets labelled it would be cool to check that out.
Some features come up repeatedly in various heads. I think this is because they're up-weighted by the coefficients of the RMS norm.
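One hedged way to probe that guess would be to measure how much each decoder vector grows when scaled by the norm coefficients. The sketch below uses a hand-made toy example; `gain_amplification` is a hypothetical helper and the arrays are not real model weights.

```python
import numpy as np


def gain_amplification(W_dec, norm_gain):
    """Ratio of each decoder vector's length after vs. before scaling by the norm coefficients."""
    before = np.linalg.norm(W_dec, axis=1)
    after = np.linalg.norm(W_dec * norm_gain, axis=1)
    return after / before


# Toy example: one coordinate has a much larger coefficient than the rest.
norm_gain = np.array([10.0, 1.0, 1.0, 1.0])
W_dec = np.array([
    [1.0, 0.0, 0.0, 0.0],  # feature aligned with the high-coefficient coordinate
    [0.0, 1.0, 0.0, 0.0],  # feature aligned with an ordinary coordinate
])

ratios = gain_amplification(W_dec, norm_gain)
```

Features with the largest ratios are the ones the norm coefficients boost most, so if the guess is right they should overlap with the features that keep reappearing across heads.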
Diving Into Head 0
QK Circuits
Row | QK Circuit Value | Key Feature | Query Feature |
0 | -0.1015 | 3744: references to the Hall of Fame and related ceremonies or inductions | 8517: numerical patterns or symbols related to lists or sequences |
1 | 0.0976 | 3476: references to surface-related concepts and properties | 8517: numerical patterns or symbols related to lists or sequences |
2 | -0.09106 | 819: structured data presentation and numerical information | 819: structured data presentation and numerical information |
3 | -0.0845 | 6605: frequencies of occurrences or items in lists and counts | 6605: frequencies of occurrences or items in lists and counts |
4 | -0.0821 | 15107: structured data or variables related to types and their attributes | 8517: numerical patterns or symbols related to lists or sequences |
5 | 0.0812 | 10553: references to military threats and potential risks involving individuals | 10553: references to military threats and potential risks involving individuals |
6 | -0.0793 | 10211: scientific or medical terminology related to diseases and treatment options | 3160: programming syntax and coding structures |
7 | -0.07886 | 2842: numbers and percentage values related to statistical analysis | 8517: numerical patterns or symbols related to lists or sequences |
8 | -0.07434 | 10553: references to military threats and potential risks involving individuals | 10211: scientific or medical terminology related to diseases and treatment options |
9 | -0.0731 | 16134: references to interviews and conversations | 8517: numerical patterns or symbols related to lists or sequences |
10 | -0.0681 | 1681: references to scams and fraudulent activities | 5864: terms related to financial transactions and deposits |
11 | -0.06793 | 3598: terms associated with technical specifications and measurements | 16241: legal terminology related to court cases and appeals |
12 | -0.06726 | 838: technical terms related to computer science, programming languages, or data structures | 5864: terms related to financial transactions and deposits |
13 | 0.0662 | 1374: common symbols or mathematical notations related to set theory and graph theory | 8517: numerical patterns or symbols related to lists or sequences |
14 | -0.0653 | 11614: code snippets and programming constructs | 8517: numerical patterns or symbols related to lists or sequences |
15 | 0.06494 | 7352: structured data formats and attributes within documents | 6062: phrases related to authority and compliance |
16 | -0.06476 | 6594: numerical values related to quantities or measurements | 8517: numerical patterns or symbols related to lists or sequences |
17 | 0.0642 | 12570: references to faces or facial features | 14662: various sounds and noises described in the text |
18 | -0.0642 | 12888: punctuation marks, particularly periods | 8517: numerical patterns or symbols related to lists or sequences |
19 | -0.064 | 12523: terms related to regulations and conditions for financial and research contexts | 8517: numerical patterns or symbols related to lists or sequences |
20 | 0.0637 | 14859: specific names and references related to locations and events | 5864: terms related to financial transactions and deposits |
21 | -0.06366 | 7426: fragments of code or programming-related syntax, specifically within a structured or formatted context | 8517: numerical patterns or symbols related to lists or sequences |
22 | -0.0633 | 4118: specific programming constructs or syntactic elements related to function definitions and method calls | 2737: phrases related to legal expenses and costs |
23 | 0.0628 | 13408: lists of numbered items and their classifications or evaluations | 8517: numerical patterns or symbols related to lists or sequences |
24 | -0.0622 | 10543: coordinating conjunctions used to connect clauses or phrases | 10543: coordinating conjunctions used to connect clauses or phrases |
25 | 0.0621 | 2069: specific proper nouns, particularly names and titles | 5864: terms related to financial transactions and deposits |
26 | 0.062 | 3302: phrases involving details of legal cases and actions taken within them | 8517: numerical patterns or symbols related to lists or sequences |
27 | 0.0608 | 10553: references to military threats and potential risks involving individuals | 3160: programming syntax and coding structures |
28 | 0.0606 | 7352: structured data formats and attributes within documents | 14662: various sounds and noises described in the text |
29 | 0.0606 | 11087: specific structural elements or commands in programming and mathematical contexts | 12085: instructions and guides for how to perform tasks or solve problems |
30 | 0.06052 | 8517: numerical patterns or symbols related to lists or sequences | 8517: numerical patterns or symbols related to lists or sequences |
Seems like several themes are coming up here: programming and numerical data, particularly lists; names and individuals; scientific research; legal proceedings. Overall it seems like this head is being used for a few different things. Makes sense, given what we know about superposition. Remember that head 0 might be mostly a previous-token head.
We also see a lot of what I call like-to-like QK-connections, in which the same feature appears as both the query and the key. This also makes sense intuitively. I'll show the first 20 connections which are not like-to-like, and which have positive QK values:
Row | QK Circuit Value | Key Feature | Query Feature |
0 | 0.0976 | 3476: references to surface-related concepts and properties | 8517: numerical patterns or symbols related to lists or sequences |
1 | 0.0662 | 1374: common symbols or mathematical notations related to set theory and graph theory | 8517: numerical patterns or symbols related to lists or sequences |
2 | 0.06494 | 7352: structured data formats and attributes within documents | 6062: phrases related to authority and compliance |
3 | 0.0642 | 12570: references to faces or facial features | 14662: various sounds and noises described in the text |
4 | 0.0637 | 14859: specific names and references related to locations and events | 5864: terms related to financial transactions and deposits |
5 | 0.0628 | 13408: lists of numbered items and their classifications or evaluations | 8517: numerical patterns or symbols related to lists or sequences |
6 | 0.0621 | 2069: specific proper nouns, particularly names and titles | 5864: terms related to financial transactions and deposits |
7 | 0.062 | 3302: phrases involving details of legal cases and actions taken within them | 8517: numerical patterns or symbols related to lists or sequences |
8 | 0.0608 | 10553: references to military threats and potential risks involving individuals | 3160: programming syntax and coding structures |
9 | 0.0606 | 7352: structured data formats and attributes within documents | 14662: various sounds and noises described in the text |
10 | 0.0606 | 11087: specific structural elements or commands in programming and mathematical contexts | 12085: instructions and guides for how to perform tasks or solve problems |
11 | 0.06033 | 6576: HTML and XML markup tags | 8517: numerical patterns or symbols related to lists or sequences |
12 | 0.0591 | 759: references to fakeness or deception, particularly in the context of news and representations | 8517: numerical patterns or symbols related to lists or sequences |
13 | 0.05896 | 12202: numerical values and their relationships in a data context | 8517: numerical patterns or symbols related to lists or sequences |
14 | 0.05804 | 5815: numerical or tabular data relevant to various contexts | 9295: terms related to programming exceptions and errors in software development |
15 | 0.0576 | 15335: references to academic departments, institutions, and legal entities | 13384: complex structured elements in data |
16 | 0.05566 | 13844: references to social connections and communal engagement | 8517: numerical patterns or symbols related to lists or sequences |
17 | 0.05502 | 2989: statistical significance indicators in research findings | 16241: legal terminology related to court cases and appeals |
18 | 0.05475 | 12958: terms related to product quality and effectiveness | 6605: frequencies of occurrences or items in lists and counts |
19 | 0.05466 | 10186: references to military events and significant historical actions | 6605: frequencies of occurrences or items in lists and counts |
20 | 0.0546 | 4436: topics related to technology and data management | 16093: words related to physical interactions and conflicts |
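The filter behind the table above (positive value, key feature different from query feature) can be sketched as follows. The function name and the tuple layout are illustrative assumptions; the example rows are copied from the first Head 0 QK table.

```python
def positive_cross_pairs(connections, limit):
    """Keep positive connections whose key and query features differ.

    `connections` is a list of (value, key_index, query_index) tuples,
    assumed to be sorted by absolute value already.
    """
    kept = [c for c in connections if c[0] > 0 and c[1] != c[2]]
    return kept[:limit]


# A few rows from the first Head 0 QK table: (value, key feature, query feature).
head0_rows = [
    (-0.1015, 3744, 8517),
    (0.0976, 3476, 8517),
    (-0.09106, 819, 819),
    (0.0812, 10553, 10553),
    (0.0662, 1374, 8517),
]

filtered = positive_cross_pairs(head0_rows, limit=20)
# filtered == [(0.0976, 3476, 8517), (0.0662, 1374, 8517)]
```

The two surviving rows are the first two rows of the filtered table, as expected.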
One thing to note here is that the QK values for this head are much lower in magnitude than the other heads. Perhaps this head takes the role of a general aggregator, picking up on vibes from lots of tokens, rather than passing specific information around. This kinda makes sense based on the sorts of things which crop up in the OV circuit:
OV Circuit
Row | OV Circuit Value | Value Feature | Output Feature |
0 | 0.633 | 13462: occurrences of the word "little." | 13158: mathematical equations and physical variables related to scientific concepts |
1 | -0.629 | 13462: occurrences of the word "little." | 1391: prepositions and their relationships in sentences |
2 | 0.3972 | 563: the context or formatting of sections in a document, particularly those marked with specific tags such as <bos> | 5670: phrases related to implications and consequences in scientific contexts |
3 | -0.3918 | 563: the context or formatting of sections in a document, particularly those marked with specific tags such as <bos> | 8048: occurrences of selectors and method calls within Objective-C or Swift code |
4 | -0.3894 | 867: mathematical notation and formal expressions in the document | 11482: numerical representations or indicators related to data or formatting elements |
5 | 0.3792 | 867: mathematical notation and formal expressions in the document | 5764: different types of hats and roofs |
6 | 0.363 | 8810: references to constants in mathematical expressions or equations | 13698: mathematical or logical expressions and structures in the text |
7 | 0.356 | 571: patterns of mathematical variables and operations in equations | 13698: mathematical or logical expressions and structures in the text |
8 | -0.3525 | 8810: references to constants in mathematical expressions or equations | 1467: words and phrases related to social equity and representation issues |
9 | -0.3496 | 5764: symbols and punctuation marks that indicate changes in data or versioning | 11482: numerical representations or indicators related to data or formatting elements |
10 | 0.3494 | 8692: terms related to the implementation process in various contexts | 11482: numerical representations or indicators related to data or formatting elements |
11 | -0.3477 | 571: patterns of mathematical variables and operations in equations | 1467: words and phrases related to social equity and representation issues |
12 | 0.3425 | 12192: components and classes related to the Java Swing framework | 4743: formal titles and legal terminology related to court cases |
13 | 0.3352 | 5764: symbols and punctuation marks that indicate changes in data or versioning | 5764: different types of hats and roofs |
14 | -0.3337 | 8692: terms related to the implementation process in various contexts | 5764: different types of hats and roofs |
15 | -0.3323 | 12192: components and classes related to the Java Swing framework | 8160: terms related to scientific research and medical conditions |
16 | -0.3179 | 867: mathematical notation and formal expressions in the document | 8048: occurrences of selectors and method calls within Objective-C or Swift code |
17 | 0.3176 | 867: mathematical notation and formal expressions in the document | 5670: phrases related to implications and consequences in scientific contexts |
18 | -0.3044 | 1302: shipping-related terms and phrases for large items | 490: phrases related to self-awareness and personal identity |
19 | 0.303 | 1302: shipping-related terms and phrases for large items | 9249: mathematical expressions or notations |
20 | 0.295 | 2365: mathematical expressions and types related to programming or data structures | 11482: numerical representations or indicators related to data or formatting elements |
21 | 0.2917 | 6210: references to specific experimental methods and materials used in scientific research | 13698: mathematical or logical expressions and structures in the text |
22 | 0.2913 | 14005: references to scientific research and studies | 1600: past participles and auxiliary verbs expressing completed actions |
23 | 0.2913 | 6508: terms and concepts related to statistical methods and assumptions in graphical models | 879: conditional phrases and references to evidence or support |
24 | 0.29 | 3156: references to alternative options or entities | 11482: numerical representations or indicators related to data or formatting elements |
25 | 0.2893 | 16200: instances of phrases introducing or referencing specific scenarios | 15170: references to Muslims and related cultural or religious terms and events |
26 | -0.2878 | 9752: the presence of numerical values and their associated contexts or relationships | 4743: formal titles and legal terminology related to court cases |
27 | -0.2869 | 6508: terms and concepts related to statistical methods and assumptions in graphical models | 912: references to structural models and their connections in a technical context |
28 | -0.2844 | 6210: references to specific experimental methods and materials used in scientific research | 1467: words and phrases related to social equity and representation issues |
29 | -0.2844 | 16200: instances of phrases introducing or referencing specific scenarios | 15386: references to user-related information and actions within a programming or software context |
30 | 0.2834 | 1711: instances of various elements or entities in a list or catalog format | 490: phrases related to self-awareness and personal identity |
These are much richer and more interesting.
I really like 22 here, because it seems a bit weird at first, but what it's actually saying is that scientific work is almost always written in the perfect tense!
I find the negative values more interesting than the positive ones in a lot of cases. A lot of them seem to be "disambiguation". 8 seems to be telling the network "no, this is maths, we're looking at the mathematical definition of equality, not the social one!", as does 29. 16 seems to disambiguate Swift or Objective-C code from mathematical notations!
I don't know what 0 or 1 are doing! Why would "little" mean there are no prepositions, but we're in a scientific or mathematical context?
Head 7
I'll go through Head 7 here as well:
Row | QK Circuit Value | Key Feature | Query Feature |
0 | -0.3337 | 14956: legal terms and references to court proceedings | 15560: multiple segments of structured data, likely in a programming context |
1 | 0.328 | 13805: conditional phrases or questions | 13805: conditional phrases or questions |
2 | -0.327 | 10788: technical terms and parameters related to performance metrics | 13805: conditional phrases or questions |
3 | -0.324 | 13805: conditional phrases or questions | 10788: technical terms and parameters related to performance metrics |
4 | 0.3235 | 14956: legal terms and references to court proceedings | 14956: legal terms and references to court proceedings |
5 | 0.3228 | 15560: multiple segments of structured data, likely in a programming context | 15560: multiple segments of structured data, likely in a programming context |
6 | 0.3215 | 10788: technical terms and parameters related to performance metrics | 10788: technical terms and parameters related to performance metrics |
7 | -0.316 | 15560: multiple segments of structured data, likely in a programming context | 14956: legal terms and references to court proceedings |
8 | 0.2856 | 6163: LaTeX formatting commands and structure in a document | 6163: LaTeX formatting commands and structure in a document |
9 | -0.285 | 6412: conjunctions and their recurring use in sentences | 6163: LaTeX formatting commands and structure in a document |
10 | -0.2732 | 6163: LaTeX formatting commands and structure in a document | 6412: conjunctions and their recurring use in sentences |
11 | 0.271 | 6412: conjunctions and their recurring use in sentences | 6412: conjunctions and their recurring use in sentences |
12 | -0.2457 | 13492: numerical data and date representations | 499: features and attributes related to product descriptions and specifications |
13 | 0.2456 | 8931: restaurant reviews that mention food quality and dining experiences | 8931: restaurant reviews that mention food quality and dining experiences |
14 | -0.2444 | 7314: punctuation marks such as quotation marks and apostrophes | 8931: restaurant reviews that mention food quality and dining experiences |
15 | -0.2432 | 8931: restaurant reviews that mention food quality and dining experiences | 7314: punctuation marks such as quotation marks and apostrophes |
16 | 0.241 | 13492: numerical data and date representations | 13492: numerical data and date representations |
17 | 0.2406 | 499: features and attributes related to product descriptions and specifications | 499: features and attributes related to product descriptions and specifications |
18 | 0.2404 | 7314: punctuation marks such as quotation marks and apostrophes | 7314: punctuation marks such as quotation marks and apostrophes |
19 | -0.24 | 4374: elements of humor, particularly dark and inappropriate humor | 5484: references to notable achievements or events related to advancements and recognitions |
20 | 0.2383 | 5484: references to notable achievements or events related to advancements and recognitions | 5484: references to notable achievements or events related to advancements and recognitions |
21 | -0.2378 | 499: features and attributes related to product descriptions and specifications | 13492: numerical data and date representations |
22 | 0.2329 | 4374: elements of humor, particularly dark and inappropriate humor | 4374: elements of humor, particularly dark and inappropriate humor |
23 | -0.2328 | 5484: references to notable achievements or events related to advancements and recognitions | 4374: elements of humor, particularly dark and inappropriate humor |
24 | 0.2311 | 16266: numerical data and mathematical expressions | 16266: numerical data and mathematical expressions |
25 | -0.2283 | 16266: numerical data and mathematical expressions | 7942: references to event registration and participation details |
26 | 0.2257 | 9009: programming-related terms and code structure elements | 9009: programming-related terms and code structure elements |
27 | -0.2246 | 4814: concepts related to health and well-being, especially in medical contexts | 9009: programming-related terms and code structure elements |
28 | -0.2241 | 9009: programming-related terms and code structure elements | 4814: concepts related to health and well-being, especially in medical contexts |
29 | 0.222 | 4814: concepts related to health and well-being, especially in medical contexts | 4814: concepts related to health and well-being, especially in medical contexts |
30 | -0.2181 | 15560: multiple segments of structured data, likely in a programming context | 7400: key political figures and their roles |
There are still a lot of like-to-like pairs. We also see some mutual-exclusion tetrads, like 8-11, or 26-29. Let's show the non like-to-like pairs with positive QK values:
Row | QK Circuit Value | Key Feature | Query Feature |
0 | 0.2092 | 15560: multiple segments of structured data, likely in a programming context | 15577: references to tutorials and guides |
1 | 0.2036 | 14956: legal terms and references to court proceedings | 7400: key political figures and their roles |
2 | 0.1814 | 7314: punctuation marks such as quotation marks and apostrophes | 693: references to companies and specific products associated with genetics and finance |
3 | 0.1774 | 499: features and attributes related to product descriptions and specifications | 6599: sections of text that contain scientific or technical jargon related to genetics or molecular biology |
4 | 0.1741 | 13492: numerical data and date representations | 6293: elements related to mold removal and cleaning |
5 | 0.17 | 8931: restaurant reviews that mention food quality and dining experiences | 16340: elements related to user preferences or session management |
6 | 0.1624 | 7344: relationships between keywords and their attributes | 12988: technical terms related to structural engineering and materials |
7 | 0.1624 | 14956: legal terms and references to court proceedings | 7950: phrases related to technical specifications or characteristics |
8 | 0.1593 | 15560: multiple segments of structured data, likely in a programming context | 4092: references to educational backgrounds and achievements |
9 | 0.155 | 3391: terms related to scientific research and methodology | 15733: closing braces and related control flow syntax in code |
10 | 0.1525 | 4470: technical terms and phrases related to processes in refrigeration and fluid dynamics | 12988: technical terms related to structural engineering and materials |
11 | 0.148 | 14956: legal terms and references to court proceedings | 15015: JavaScript event handling and functions related to user interactions in web development |
12 | 0.1415 | 620: references to money and financial transactions, particularly those related to illegal activities | 251: financial terms related to risk and stability |
13 | 0.1395 | 620: references to money and financial transactions, particularly those related to illegal activities | 4043: references to data, experiments, and processes in scientific contexts |
14 | 0.139 | 13805: conditional phrases or questions | 9289: numeric values and structured formats, particularly those that appear in data representations and web links |
15 | 0.1371 | 6386: terms related to audio processing and effects | 3551: structure declarations and classes within programming code |
16 | 0.1354 | 12093: the word "after" in various contexts | 7846: references to medical treatments and patient outcomes |
17 | 0.1345 | 3632: references to legal circuits and courts | 9399: details about music albums and their characteristics |
18 | 0.1343 | 12093: the word "after" in various contexts | 2999: terms related to statistical analysis and data representation |
19 | 0.134 | 12711: numbers and mathematical expressions | 12231: references to the Android context class and its usage in code |
20 | 0.133 | 11094: references to scientific studies and their results | 10631: clauses and phrases that describe relationships or characteristics |
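The mutual-exclusion tetrads noted for Head 7 (rows 8-11 and 26-29 of its first QK table) have a simple shape: two features that each attend positively to themselves and negatively to each other. A small sketch of how one might search a list of top connections for that shape is below; `find_tetrads` and the tuple layout are hypothetical, and the example rows are copied from rows 8-12 of that table.

```python
def find_tetrads(connections):
    """Find feature pairs (a, b) with positive a-a and b-b entries and negative a-b and b-a entries.

    `connections` is a list of (value, key_index, query_index) tuples.
    """
    value_of = {(key, query): value for value, key, query in connections}
    features = sorted({f for _, key, query in connections for f in (key, query)})
    tetrads = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            needed = [(a, a), (b, b), (a, b), (b, a)]
            if not all(pair in value_of for pair in needed):
                continue
            if (value_of[(a, a)] > 0 and value_of[(b, b)] > 0
                    and value_of[(a, b)] < 0 and value_of[(b, a)] < 0):
                tetrads.append((a, b))
    return tetrads


# Rows 8-12 of the first Head 7 QK table: (value, key feature, query feature).
head7_rows = [
    (0.2856, 6163, 6163),
    (-0.285, 6412, 6163),
    (-0.2732, 6163, 6412),
    (0.271, 6412, 6412),
    (-0.2457, 13492, 499),
]

tetrads = find_tetrads(head7_rows)
# tetrads == [(6163, 6412)]
```

This only looks at connections that made it into the top list, so a tetrad with one weak member would be missed; checking the full QK matrix directly would be the more thorough option.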
Now let's take a look at the OV circuit:
Row | OV Circuit Value | Value Feature | Output Feature |
0 | 0.973 | 2012: instances of the word "already" in various contexts | 8487: scientific measurements and their implications |
1 | -0.9707 | 3732: punctuation marks | 8487: scientific measurements and their implications |
2 | 0.9575 | 9295: terms related to programming exceptions and errors in software development | 27: technical terms and concepts related to object-role modeling and database queries |
3 | -0.954 | 11541: coding elements related to data parsing and storage operations | 27: technical terms and concepts related to object-role modeling and database queries |
4 | 0.94 | 11541: coding elements related to data parsing and storage operations | 11639: mathematical operations and programming constructs related to vector calculations |
5 | -0.9395 | 9295: terms related to programming exceptions and errors in software development | 11639: mathematical operations and programming constructs related to vector calculations |
6 | 0.913 | 3732: punctuation marks | 738: entities related to organizations and institutional frameworks |
7 | -0.8677 | 2012: instances of the word "already" in various contexts | 738: entities related to organizations and institutional frameworks |
8 | 0.715 | 6052: object properties and their associated methods in programming contexts | 15278: keywords related to job postings in the healthcare field |
9 | -0.7026 | 6052: object properties and their associated methods in programming contexts | 6954: references to boys and masculinity |
10 | -0.7017 | 7507: numerical data and references related to statistics and measurements | 15278: keywords related to job postings in the healthcare field |
11 | 0.694 | 7507: numerical data and references related to statistics and measurements | 6954: references to boys and masculinity |
12 | 0.6934 | 3391: terms related to scientific research and methodology | 13811: specific coding constructs and structure, particularly related to object-oriented programming elements like classes and unique identifiers |
13 | -0.6753 | 9932: functions and events related to programming, particularly those involving event handling and listener methods | 14631: terms related to procedural steps and algorithms |
14 | 0.673 | 1745: questions and mathematical operations involving problem-solving | 11576: terms related to legal documentation and identification processes |
15 | 0.668 | 9932: functions and events related to programming, particularly those involving event handling and listener methods | 6733: details related to room features and rental conditions |
16 | -0.6597 | 1942: rankings and positions of institutions or programs | 11576: terms related to legal documentation and identification processes |
17 | -0.6587 | 1745: questions and mathematical operations involving problem-solving | 965: references to gender equality and disparities |
18 | 0.6562 | 14781: elements related to user interaction and token verification in a digital workspace | 14631: terms related to procedural steps and algorithms |
19 | -0.6494 | 3391: terms related to scientific research and methodology | 13176: code structures, particularly comments and namespace declarations in programming languages |
20 | 0.6484 | 1942: rankings and positions of institutions or programs | 965: references to gender equality and disparities |
21 | -0.6465 | 14781: elements related to user interaction and token verification in a digital workspace | 6733: details related to room features and rental conditions |
22 | -0.6416 | 13430: mathematical constructs and expressions | 331: mathematical equations and expressions |
23 | 0.639 | 13430: mathematical constructs and expressions | 15082: references related to mathematical or scientific notation and parameters |
24 | 0.6387 | 2565: references to rural locations and related entities | 331: mathematical equations and expressions |
25 | -0.635 | 2565: references to rural locations and related entities | 15082: references related to mathematical or scientific notation and parameters |
26 | -0.6245 | 8437: data references and statistics related to biological experiments | 13260: elements associated with input fields and forms |
27 | 0.623 | 9681: historical references related to laws and legal cases | 13260: elements associated with input fields and forms |
28 | 0.619 | 8437: data references and statistics related to biological experiments | 12800: specific references to successful authors and their works |
29 | -0.616 | 9681: historical references related to laws and legal cases | 12800: specific references to successful authors and their works |
30 | 0.615 | 10640: keywords and references related to academic or scientific sources | 27: technical terms and concepts related to object-role modeling and database queries |
Some of these seem totally nonsensical! 8-11 in particular, I mean what? These seem to be hopelessly lost in superposition. Perhaps head 7 is in a greater degree of superposition because it attends to more specific tokens than head 0.
Conclusions
With difficulty, it may be possible to reconstruct attention-based circuits in this way. It's unclear how much of the difficulty stems from technical limitations in the SAE and labels, and how much is fundamental to this method. I would like to try again someday.