# Explaining SolidGoldMagikarp by looking at it from random directions

post by Robert_AIZI · 2023-02-14T14:54:20.519Z · LW · GW

This is a link post for https://aizi.substack.com/p/explaining-solidgoldmagikarp-by-looking

## Contents

- Summary
- Introduction
- The Interior Conjecture is sufficient for unspeakability
- A Different Approach: Random Directions
- Experiment Results
- Conclusions

**Summary**

I conducted an experiment that provides evidence that many of the weird tokens demonstrating the __SolidGoldMagikarp phenomenon__ are interior points in the token embedding cloud. This would be sufficient to make them “unspeakable”.

My experiment was equivalent to making GPT predict the next token from a randomized internal state, and in 3 million attempts there were 510/50257 tokens that it failed to output, but those 510 included 85/133 of the “weird tokens”, including “ SolidGoldMagikarp” itself. When I repeated this experiment on a “control group” of 50257 random embeddings, all 50257 were predicted at least 16 times, so failing to output a token is astronomically unlikely to be a fluke. This experiment provides some evidence that the weird tokens are embedded in interior points, but my ability to confirm that has been bottlenecked by my coding skill (help would be appreciated).

I believe this provides one step towards understanding the SolidGoldMagikarp phenomenon. Some of the “weird tokens” are not covered by this explanation, and it remains unclear why these token embeddings were learned in the first place.

**Introduction**

Like many others, I’m fascinated by the __SolidGoldMagikarp phenomenon__ identified by Rumbelow and Watkins. In short, the GPT family has certain tokens, including " SolidGoldMagikarp", that produce weird behavior. One such behavior is being "unspeakable", where "GPT models seem largely incapable of repeating these anomalous tokens, and instead respond in a number of strange ways".

I was struck by the authors' comment that the mysterious tokens "were among those closest to the centroid of the entire set of 50,257 tokens", since that suggests a simple explanation:

The Interior Conjecture: Unspeakable tokens are in the interior of the convex hull of the token embeddings.

**The Interior Conjecture is sufficient for unspeakability**

Let's first show that the Interior Conjecture would be sufficient to explain unspeakability:

Claim: If a token's embedding is on the interior of the convex hull of other tokens, then GPT cannot output it at temperature 0.

Proof: At temperature 0, GPT's output is the token with the largest logit in $h W_E^\top$, where $h$ is the last row of the final state of the residual stream and $W_E$ is the token embedding matrix.

Proof (ct’d): Writing $t_0, t_1, \dots, t_n$ for tokens with embeddings $e_0, e_1, \dots, e_n$, suppose $e_0$ is a convex linear combination of $e_1, \dots, e_n$. That is, $e_0 = \sum_{i=1}^{n} a_i e_i$, with $a_i \ge 0$ and $\sum_{i=1}^{n} a_i = 1$. Writing $\langle \cdot, \cdot \rangle$ for the dot product, taking the dot product with $h$, and applying linearity, we have $\langle h, e_0 \rangle = \sum_{i=1}^{n} a_i \langle h, e_i \rangle$, which shows that the (real-number) logit of $t_0$ is a convex linear combination of the logits of $t_1, \dots, t_n$. But a convex linear combination of real numbers is bounded by its largest value, so $\langle h, e_0 \rangle \le \max_i \langle h, e_i \rangle$, with equality if and only if all $\langle h, e_i \rangle$ with nonzero coefficients are equal. Since $e_0$ is strictly in the interior, this is not the case (unless $h = 0$), so $t_0$ cannot be the first choice token. QED

A visual description of the argument: $h$ is linearly projecting all embedding vectors onto a line, and the output token is the point furthest along this line. Since $e_0$ is an interior point before the projection, it will be an interior point after the projection, so it cannot be the first-choice token.
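
The claim can also be checked numerically on toy data. The sketch below is an illustration only: the sizes, the random embeddings, and the names `E`, `H`, and `weights` are assumptions made for the example, not anything taken from GPT. It plants one embedding as a strict convex combination of the others and confirms that it never has the largest logit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 50, 8  # toy sizes, chosen only for illustration

# Hypothetical toy embeddings. Row 0 is overwritten with a strict convex
# combination of the other rows, so it lies inside their convex hull.
E = rng.standard_normal((n_tokens, dim))
weights = rng.dirichlet(np.ones(n_tokens - 1))  # nonnegative, sums to 1
E[0] = weights @ E[1:]

# Try many random vectors h standing in for the final residual-stream row.
H = rng.standard_normal((10_000, dim))
logits = H @ E.T                      # shape (10_000, n_tokens)
winners = logits.argmax(axis=1)       # temperature-0 prediction for each h

interior_wins = int((winners == 0).sum())
# Smallest gap between the best other logit and the interior token's logit.
margin = float((logits[:, 1:].max(axis=1) - logits[:, 0]).min())
print(interior_wins, margin > 0)
```

The planted token should win zero times, and the gap between the winning logit and its logit should stay positive for every sampled $h$.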

So is the Interior Conjecture true? I couldn't check because __scipy.spatial.ConvexHull__ was unhappy with the size of the vectors, and I didn't see how I could implement an algorithm with good performance. If someone with a coding background wanted to help me implement an algorithm to check this, I’d be eternally grateful^{[1]}.
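
One standard way to test hull membership without building the hull is a linear-programming feasibility check: a point is in the convex hull of a set exactly when some nonnegative weights summing to 1 reproduce it. The sketch below uses `scipy.optimize.linprog` for this. It is a swapped-in technique rather than the Gilbert–Johnson–Keerthi algorithm, the helper name `in_convex_hull` is made up for the example, and whether it is fast enough for 50257 points in 4096 dimensions is untested here. It also only detects membership in the closed hull, so points on the boundary count as inside.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, others):
    """Return True if `point` is a convex combination of the rows of `others`."""
    n = others.shape[0]
    # Feasibility problem: find a >= 0 with others.T @ a = point and sum(a) = 1.
    A_eq = np.vstack([others.T, np.ones((1, n))])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.status == 0  # status 0: a feasible (optimal) solution was found

# Toy check on the corners of the unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
inside = in_convex_hull(np.array([0.5, 0.5]), square)
outside = in_convex_hull(np.array([2.0, 2.0]), square)
print(inside, outside)
```

For a token embedding, `others` would be the embedding matrix with that token's row removed, so the LP would need to be solved once per token.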

**A Different Approach: Random Directions**

However, I did run a different test that sheds some light on the situation: I chose a direction vector $h$ at random and found which token $t$ maximizes $\langle h, e_t \rangle$, where $e_t$ is the embedding of token $t$ and $t$ ranges over the set of tokens. I used the token embeddings from GPT-J-6B (__here__), which were 50257 tokens in 4096-dimensional space^{[2]}. Direction vectors were generated by sampling the standard normal distribution independently for each dimension, which is equivalent to choosing points from a hypersphere uniformly at random^{[3]}. I drew 3 million samples. __Code here__.
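
A minimal sketch of this procedure, assuming the embeddings are available as a NumPy array with one row per token. The function name `direction_frequencies`, the batch size, and the toy data are assumptions for illustration; this is not the linked code. Directions are left unnormalized, since rescaling a direction does not change which token attains the maximum.

```python
import numpy as np

def direction_frequencies(embeddings, n_samples, batch_size=1000, seed=0):
    """Count how often each row of `embeddings` maximizes <h, e> for random h."""
    rng = np.random.default_rng(seed)
    n_tokens, dim = embeddings.shape
    counts = np.zeros(n_tokens, dtype=np.int64)
    remaining = n_samples
    while remaining > 0:
        b = min(batch_size, remaining)
        H = rng.standard_normal((b, dim))             # random directions
        winners = (H @ embeddings.T).argmax(axis=1)   # best token per direction
        counts += np.bincount(winners, minlength=n_tokens)
        remaining -= b
    return counts

# Toy data: 100 random "tokens" in 16 dimensions, with token 0 moved to the
# mean of the others so that it is an interior point.
rng = np.random.default_rng(1)
E = rng.standard_normal((100, 16))
E[0] = E[1:].mean(axis=0)
counts = direction_frequencies(E, n_samples=20_000)
print(counts[0], counts.sum())
```

Batching keeps memory bounded, which matters at the real scale: a single batch of 1000 directions against 50257 tokens already produces about 50 million logits.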

By the argument in the previous section, any interior point will occur 0 times in this dataset. The converse is also “true in the limit”: if a token embedding is extremal, __there is a hyperplane separating it from the other points__, so the probability of it appearing at least once in the dataset approaches 1 as the number of samples approaches infinity.

As a “control group”, I also ran this experiment on randomized token embeddings, generated the same way as the direction vectors. I believe this is similar to how GPT’s weights were initialized, so this should be akin to what GPT would have predicted before any training, and will give us a baseline to see unusual patterns in the trained embeddings.
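
A scaled-down version of this control can be sketched as follows. The sizes are toy values picked so the example runs quickly, and the variable names are assumptions for the example; the real control used 50257 embeddings in 4096 dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, dim, n_samples = 200, 256, 20_000  # toy sizes, not 50257 x 4096

# Control embeddings drawn the same way as the direction vectors.
control = rng.standard_normal((n_tokens, dim))
directions = rng.standard_normal((n_samples, dim))

winners = (directions @ control.T).argmax(axis=1)
counts = np.bincount(winners, minlength=n_tokens)
print(counts.min(), counts.max())
```

With independent Gaussian embeddings in a high-dimensional space, every point is expected to be extremal, so the minimum count should be comfortably above zero. That is the baseline against which zero-frequency tokens in a trained model stand out.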

**Experiment Results**

I analyzed the resulting frequency dataset from a few perspectives.

Here’s a chart of the frequency of each token (sorted). Contrast the frequencies of GPT-J’s tokens (left column) with randomized token embeddings (right column).

We can see right away that GPT-J’s tokens are not similar to the random distribution, and in particular they cover a far wider range of frequencies (0-1760) than the randomized embeddings (16-146).

Where are the weird tokens in this data? All across the distribution, but particularly concentrated at low frequencies (more on that later).

The top 10 tokens by frequency don't hold any meaning to me. They were a combination of seemingly-random characters, accented capital vowels, and the words “gif” and “kids”^{[4]}. Frankly, it's bizarre to me that most of these were even tokens:

```
index| frequency|token
-------------------------
17433 1760 ĠãĤ
27908 787 gif
136 781 Ì
47540 733 Ġ._
37855 704 ĠâĢº
46256 686 âģ
146 667 Ö
45235 667 kids
28110 641 Ġtha
25248 636 Ġ@@
```

Looking to the opposite end of the spectrum, 510/50257 tokens were never randomly generated (i.e. had a frequency of zero). What of the 133 candidate “weird tokens” described by Rumbelow and Watkins? Of those, 85 had zero frequency! To put it another way: $P(\text{zero frequency}) = 510/50257 \approx 1\%$, but $P(\text{zero frequency} \mid \text{weird token}) = 85/133 \approx 64\%$!
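
The proportions follow directly from the counts reported above. A short calculation, with variable names chosen only for this example:

```python
n_tokens, n_zero = 50257, 510      # all tokens, tokens with zero frequency
n_weird, n_weird_zero = 133, 85    # weird tokens, weird tokens with zero frequency

p_zero = n_zero / n_tokens                  # P(zero frequency)
p_zero_given_weird = n_weird_zero / n_weird # P(zero frequency | weird token)
p_weird_given_zero = n_weird_zero / n_zero  # P(weird token | zero frequency)
enrichment = p_zero_given_weird / p_zero    # how over-represented weird tokens are

print(f"{p_zero:.1%} {p_zero_given_weird:.1%} "
      f"{p_weird_given_zero:.1%} {enrichment:.0f}x")
# prints: 1.0% 63.9% 16.7% 63x
```

So a weird token is roughly 63 times more likely than an arbitrary token to have zero frequency, while weird tokens make up only about a sixth of the zero-frequency set.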

However, a majority of the zero frequency tokens are not in the list of 133 “weird tokens”. The other notable class to me was “tokens with low indices”: of the first 93 tokens, 72 (77%) had zero frequency, a rate even higher than the “weird tokens”! This part of the vocabulary consists of digits, letters (uppercase and lowercase), and the punctuation found on a standard American keyboard. To irresponsibly speculate about this, GPT was trained not to predict these characters because the tokenization algorithm tries to group characters together. For instance, if the next piece of text is “word”, this will be tokenized as “[word]” instead of “[w][o][r][d]”, and the embeddings learn to reflect that solitary characters are almost never the next token.

Here are all 510 tokens that appear with zero frequency:

```
[ 0 1 3 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 40
42 43 44 45 46 47 49 50 53 54 58 59
60 62 64 65 66 67 69 70 72 73 74 76
77 78 79 81 82 85 86 88 89 90 91 92
124 125 153 173 174 177 178 179 180 181 182 183
184 185 186 187 188 197 198 200 220 246 247 250
257 259 262 263 264 269 270 271 272 274 276 278
281 282 285 286 287 290 299 307 308 309 311 318
319 326 327 329 338 339 340 345 347 351 352 355
356 357 360 362 366 367 371 373 376 379 383 385
389 390 393 399 402 406 410 412 416 418 422 423
428 438 447 460 461 464 465 468 470 474 475 479
481 484 494 502 508 509 510 511 513 515 517 526
530 532 534 540 543 544 546 547 550 553 554 584
588 602 607 611 616 617 618 621 625 628 632 642
645 649 651 654 656 657 663 673 674 683 685 689
705 706 714 718 720 737 750 760 765 766 767 770
775 779 784 796 803 807 815 818 821 828 832 837
843 851 860 878 910 921 940 960 981 986 1003 1026
1101 1105 1114 1115 1135 1143 1168 1169 1174 1187 1194 1201
1212 1222 1262 1314 1343 1391 1422 1462 1495 1511 1539 1550
1566 1634 1635 1639 1776 1782 1946 2075 2091 2102 2215 2231
2291 2399 2402 2534 2548 2608 2620 2751 2941 2996 3256 3336
3467 3510 3695 3717 3901 4008 4060 4083 4357 4533 4690 4778
5174 5332 5334 5357 5512 5808 5815 6438 7105 7782 8438 8735
8755 8980 9364 9783 10298 11033 11273 11304 11537 11548 11689 11709
11974 12340 12677 12781 13150 13171 13198 14574 14695 14827 15243 15272
16142 16764 17629 17900 18125 18472 18945 19415 19510 20174 20554 22640
22757 23090 23282 23513 23711 24847 24934 25193 25618 25658 25992 27006
27013 27534 28666 29372 30072 30202 30208 30209 30210 30211 30212 30213
30439 30684 30897 30898 30899 30905 30906 31032 31478 31573 31666 31765
31886 31957 32047 32239 32382 32437 32917 33023 33434 33454 33813 33937
34027 34206 34448 34504 34604 34832 35207 35307 35496 35579 35944 36130
36173 36174 36481 36607 36726 36911 36926 36935 36938 36940 37389 37444
37545 37574 37579 37631 37842 38016 38165 38250 38370 38653 39165 39177
39253 39280 39374 39446 39693 39749 39752 39753 39755 39756 39757 39811
39820 39821 39890 39906 40012 40219 40240 40241 40242 40516 41297 41380
41383 41504 41538 41551 42066 42089 42090 42202 42424 42535 42586 42728
42889 43010 43038 43065 43177 43361 43453 43569 43796 44555 45003 45228
45392 45544 45545 46570 46600 47198 47571 47614 48193 48366 48396 48404
49731 49781 49997 50009 50216 50256]
```

**Conclusions**

- Because the logits used for prediction are determined by a linear function of the token embeddings, token embeddings that are within the interior of the embedding cloud can never be predicted by GPT at 0 temperature, regardless of the contents of the transformer layers.
- The “Interior Conjecture” is my hypothesis that weird tokens such as “ SolidGoldMagikarp” are within the interior of the token embedding cloud.
- I have conducted an experiment that chooses random directions to evaluate on, and which provides evidence that *some* weird tokens satisfy the Interior Conjecture, but shows that *not all of them satisfy it*. In particular, the “weird tokens” appear dramatically more often in the set of tokens with zero frequency.
- My experiment shows that the Interior Conjecture is not true for all weird tokens (as some weird tokens had positive frequency), but is evidence that it might be true for many weird tokens. Several further experiments could prove it or provide additional evidence:
  - Algorithmically compute which token embeddings are in the interior of the convex hull. Alternatively, for each token embedding compute the distance from it to the convex hull of the other points. (I would prefer the latter because it would be a richer dataset.)
    - Bottlenecked by: I couldn’t write an efficient implementation of the __Gilbert–Johnson–Keerthi distance algorithm__ that operates in such a high-dimensional space.
  - Run the same random direction experiment on other GPT embeddings or for more datapoints (this would be perfect for parallelization).
    - Bottlenecked by: I’m working from a laptop and don’t want to wait for jobs that last more than 8 hours.
  - Analytically compute the exact probabilities that the random direction experiment approximates. To do this, for each token find the measure of the set of points in the 4096-dimensional hypersphere that results in that token being chosen.
    - Bottlenecked by: it seems hard to set up and evaluate those integrals.
- If true, the Interior Conjecture would raise additional questions:
  - Why does GPT learn to put some tokens on the interior of its point cloud?
    - My best guess is that this is the fate of all tokens that aren’t in the training set. Hiding a token in the center of the embedding cloud will guarantee that it is never predicted, which is a good behavior to learn if it is correct to never predict them!
  - Which tokens does it learn to do this with? Why them?
  - How do token embeddings evolve over the course of training? In particular, do the unspeakable tokens “move to the interior” or do speakable tokens “move to the extreme”?
  - Why does the set of zero-frequency tokens overlap imperfectly with the set of weird tokens? Would a more careful study reveal a deeper overlap (e.g. set containment)?
  - Can this be used for AI safety and if so how?
    - To be honest, I don’t see a use case at the moment.
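
For the "distance to the convex hull of the other points" experiment listed above, one possible alternative to Gilbert–Johnson–Keerthi is the Frank-Wolfe method, which only needs matrix-vector products and so avoids enumerating simplices. The sketch below is an assumption-laden illustration: the function name `distance_to_hull`, the iteration cap, and the tolerance are all made up for the example, and its speed and accuracy at 50257 points in 4096 dimensions are untested. Because the iterate always stays inside the hull, the returned value is an upper bound on the true distance.

```python
import numpy as np

def distance_to_hull(point, vertices, max_iter=5000, tol=1e-10):
    """Approximate Euclidean distance from `point` to the hull of `vertices` rows."""
    y = vertices[0].astype(float)   # current iterate, always inside the hull
    for _ in range(max_iter):
        residual = point - y
        # Linear minimization step: the vertex furthest along the residual.
        v = vertices[np.argmax(vertices @ residual)]
        d = v - y
        gap = residual @ d          # Frank-Wolfe duality gap
        if gap <= tol:
            break
        step = min(1.0, gap / (d @ d))  # exact line search, clipped to [0, 1]
        y = y + step * d
    return float(np.linalg.norm(point - y))

# Toy check on the corners of the unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d_inside = distance_to_hull(np.array([0.3, 0.6]), square)
d_outside = distance_to_hull(np.array([2.0, 0.5]), square)
print(d_inside, d_outside)
```

For a token embedding, `vertices` would be the embedding matrix with that token's row removed. A distance near zero would be consistent with the token lying in the hull of the others, while a clearly positive distance would indicate an extremal token.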

[Edit: I've put my code and data up on Github. You can see the frequency data, the plots, and should be able to run my code to replicate the data generation and analysis. Please make use of this however you'd like.]

^{^}I think the algorithm to use is the __Gilbert–Johnson–Keerthi distance algorithm__, or possibly a simplified variant (since we’re checking object-to-point distance instead of object-to-object). I’m worried that the NearestSimplex part of the code is infeasible since we need this to run in 4096-dimensional space. The __original paper__ remarks that “since v [the set of vertices] is small, it is effective to take a combinatoric approach, where all [2^|v|-1] subsets are tested”, but in this case v could be as large as 4097…

^{^}Technically there are 50400 tokens, but the additional 143 are extra tokens added just to make the number of tokens nicely divisible, and never came up in my evaluation.

^{^}To sample from the hypersphere, you can generate vectors as described and then normalize them. Since we only care about the index of the maximum value, and this is unchanged by the normalizing step, I omitted that step in my code.

^{^}Also, I believe in the GPT-J vocab list I’m working with, “Ġ” is used for spaces. This makes this token list marginally less weird, but it’s still confusing to me.
