SolidGoldMagikarp (plus, prompt generation)

post by Jessica Rumbelow (jessica-cooper), mwatkins · 2023-02-05T22:02:35.854Z · LW · GW · 206 comments

Contents

  Prompt generation
  Finding weird tokens
  The plot thickens 
    bizarre/ominous humour
    pronunciation
    security/testing
    ‘ newcom’
  Fishing for anomalous tokens
  A possible, partial explanation

UPDATE (14th Feb 2023): ChatGPT appears to have been patched! However, very strange behaviour can still be elicited in the OpenAI playground, particularly with the davinci-instruct model.

More technical details here [LW · GW].

Further (fun) investigation into the stories behind the tokens we found here [LW · GW].

 

Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins.

TL;DR

Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)

Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:

In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further technical detail can be found in a follow-up post [LW · GW]. A third post, on 'glitch token archaeology' [LW · GW] is an entertaining (and bewildering) account of our quest to discover the origins of the strange names of the anomalous tokens.

A rather unexpected prompt completion from the GPT-3 davinci-instruct-beta model.

Prompt generation

First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (an excellent explanation here, if you're unfamiliar with the concept).

Feature visualisation of VGG network by Tim Sainburg.

We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data.

So, how can we apply this approach to language models?

GPT2-xl optimised inputs to maximise (boldface) outputs

Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above.

So now we can do stuff like this:

Comparing 'sensible' prompts (i.e. ones that we wrote) with generated prompts (in bold) to maximise probability of target completion. The model used was GPT-2 small.

And this:

The result of optimising a prompt to maximise a target token many times with different random seeds, then aggregating token frequencies.

We'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'.

How do we do this? It's tricky, because unlike pixel values, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous space, albeit sparsely. (Most of this space doesn't correspond to actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail.
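To make the idea concrete, here is a toy sketch of optimising a continuous input while keeping it near legal tokens. It is not the code linked above: the four-token vocabulary, the 2-D embeddings, the linear stand-in for the model (`TARGET_DIRECTION`) and the names `nearest_token` and `optimise_input` are all assumptions made for illustration.

```python
import math

# Toy vocabulary with made-up 2-D embeddings (real models use hundreds of dimensions).
EMBEDDINGS = {
    "cat": (1.0, 0.0),
    "dog": (0.8, 0.6),
    "car": (-1.0, 0.2),
    "sky": (0.0, -1.0),
}

# Stand-in for the model: the target logit is the dot product of the input
# embedding with this direction, so its gradient is the direction itself.
# With a real model this gradient would come from backpropagation.
TARGET_DIRECTION = (0.6, 0.8)


def nearest_token(point):
    """Return the legal token whose embedding is closest to `point`."""
    return min(EMBEDDINGS, key=lambda tok: math.dist(EMBEDDINGS[tok], tuple(point)))


def optimise_input(start, steps=100, lr=0.1, reg=0.5):
    """Gradient ascent on the target logit with a pull towards legal tokens."""
    point = list(start)
    for _ in range(steps):
        anchor = EMBEDDINGS[nearest_token(point)]
        for i in range(len(point)):
            grad_logit = TARGET_DIRECTION[i]
            # Gradient of -reg * ||point - anchor||^2: keeps us near a real token.
            grad_reg = -2.0 * reg * (point[i] - anchor[i])
            point[i] += lr * (grad_logit + grad_reg)
    # Final coercion: snap the continuous solution to the nearest legal token.
    return nearest_token(point), tuple(point)


token, point = optimise_input(start=(0.0, 0.0))
print(token, point)
```

The regulariser here simply pulls towards whichever token is currently nearest. That is one simple choice among many, and a real implementation would optimise a whole sequence of embeddings rather than a single point.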

This kind of prompt generation is only possible because token embedding space has a kind of semantic coherence. Semantically related tokens tend to be found close together. We discovered this by carrying out k-means clustering over the embedding space of the GPT token set, and found many clusters that are surprisingly robust to random initialisation of the centroids. Here are a few examples:

Clustering tokens in embedding space. Here we see the five tokens from each of a few random clusters. But what's going on in that right-most cluster?
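For readers who want to try the clustering step, a minimal k-means sketch follows. It assumes made-up 2-D points in place of real token embeddings, and the function name `kmeans` and its parameters are hypothetical. Running it with several seeds is a crude way to check how robust the clusters are to random initialisation of the centroids.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, clusters) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialisation of the centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up 2-D "embeddings" forming two loose groups.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
for seed in range(3):
    centroids, clusters = kmeans(points, k=2, seed=seed)
    print(seed, sorted(len(c) for c in clusters))
```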

Finding weird tokens

During this process we found some weird looking tokens. Here’s how that happened. 

We were interested in the semantic relevance of the clusters produced by the k-means algorithm, and in order to probe this, we looked for the nearest legal token embedding to the centroid of each cluster. However, something seemed to be wrong, because the tokens looked strange and didn't seem semantically relevant to the cluster (or anything else). And over many runs we kept seeing the same handful of tokens playing this role, all  very “untokenlike” in their appearance. There were what appeared to be some special characters and control characters, but also long, unfamiliar strings like ' TheNitromeFan', ' SolidGoldMagikarp' and 'cloneembedreportprint'. 

These closest-to-centroid tokens were rarely members of the cluster whose centroid they were nearest to, which at first seemed counterintuitive. Such is the nature of 768-dimensional space, we tentatively reasoned! The puzzling tokens seemed to have a tendency to aggregate together into a few clusters of their own.

We pursued a hypothesis that perhaps these were the closest tokens to the origin of the embedding space, i.e. those with the smallest norm[1]. That turned out to be wrong. But a revised hypothesis, that many of these tokens we were seeing were among those closest to the centroid of the entire set of 50,257 tokens, turned out to be correct.  This centroid can be imagined as the centre-of-mass of the whole “cloud” of tokens in embedding space. 

Here are the 50 closest-to-centroid tokens for the GPT-J model[2]:

Token: ' attRot'                           Index: 35207   Distance: 0.06182861
Token: '�'                                 Index: 125     Distance: 0.06256103
Token: 'EStreamFrame'                      Index: 43177   Distance: 0.06256103
Token: '�'                                 Index: 186     Distance: 0.06262207
Token: ' SolidGoldMagikarp'                Index: 43453   Distance: 0.06280517
Token: 'PsyNetMessage'                     Index: 28666   Distance: 0.06292724
Token: '�'                                 Index: 177     Distance: 0.06304931
Token: '�'                                 Index: 187     Distance: 0.06304931
Token: 'embedreportprint'                  Index: 30898   Distance: 0.06311035
Token: ' Adinida'                          Index: 46600   Distance: 0.06311035
Token: 'oreAndOnline'                      Index: 40240   Distance: 0.06317138
Token: '�'                                 Index: 184     Distance: 0.06323242
Token: '�'                                 Index: 185     Distance: 0.06323242
Token: '�'                                 Index: 180     Distance: 0.06329345
Token: '�'                                 Index: 181     Distance: 0.06329345
Token: 'StreamerBot'                       Index: 37574   Distance: 0.06341552
Token: '�'                                 Index: 182     Distance: 0.06347656
Token: 'GoldMagikarp'                      Index: 42202   Distance: 0.06347656
Token: '�'                                 Index: 124     Distance: 0.06353759
Token: ' externalToEVA'                    Index: 30212   Distance: 0.06353759
Token: ' TheNitrome'                       Index: 42089   Distance: 0.06353759
Token: ' TheNitromeFan'                    Index: 42090   Distance: 0.06353759
Token: ' RandomRedditorWithNo'             Index: 36174   Distance: 0.06359863
Token: 'InstoreAndOnline'                  Index: 40241   Distance: 0.06359863
Token: '�'                                 Index: 183     Distance: 0.06372070
Token: '�'                                 Index: 178     Distance: 0.06378173
Token: '�'                                 Index: 179     Distance: 0.06396484
Token: ' RandomRedditor'                   Index: 36173   Distance: 0.06420898
Token: ' davidjl'                          Index: 23282   Distance: 0.06823730
Token: 'Downloadha'                        Index: 41551   Distance: 0.06945800
Token: ' srfN'                             Index: 42586   Distance: 0.07055664
Token: 'cloneembedreportprint'             Index: 30899   Distance: 0.07489013
Token: 'rawdownload'                       Index: 30905   Distance: 0.07501220
Token: ' guiActiveUn'                      Index: 29372   Distance: 0.07775878
Token: ' DevOnline'                        Index: 47571   Distance: 0.08074951
Token: ' externalToEVAOnly'                Index: 30213   Distance: 0.08850097
Token: ' unfocusedRange'                   Index: 30209   Distance: 0.09246826
Token: ' UCHIJ'                            Index: 39253   Distance: 0.09246826
Token: ' 裏覚醒'                            Index: 25992   Distance: 0.09375000     
Token: ' guiActiveUnfocused'               Index: 30210   Distance: 0.09405517
Token: ' サーティ'                          Index: 45544   Distance: 0.10540771
Token: 'rawdownloadcloneembedreportprint'  Index: 30906   Distance: 0.10571289
Token: 'TPPStreamerBot'                    Index: 37579   Distance: 0.10766601
Token: 'DragonMagazine'                    Index: 42424   Distance: 0.11022949
Token: ' guiIcon'                          Index: 30211   Distance: 0.11694335
Token: 'quickShip'                         Index: 39752   Distance: 0.12402343
Token: '?????-?????-'                      Index: 31666   Distance: 0.13183593
Token: 'BuyableInstoreAndOnline'           Index: 40242   Distance: 0.14318847
Token: ' サーティワン'                       Index: 45545   Distance: 0.14379882
Token: 'reportprint'                       Index: 30897   Distance: 0.14501953
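A ranking like the one above can be sketched in a few lines: compute the mean of all embeddings, then sort tokens by their Euclidean distance to it. The embeddings and token names below are invented for illustration (they are not GPT-J values), and `centroid` and `closest_to_centroid` are hypothetical helper names.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(axis) / n for axis in zip(*vectors))


def closest_to_centroid(embeddings, top_n):
    """Return the top_n (token, distance) pairs nearest the global centroid."""
    centre = centroid(list(embeddings.values()))
    ranked = sorted(embeddings.items(), key=lambda item: math.dist(item[1], centre))
    return [(tok, math.dist(vec, centre)) for tok, vec in ranked[:top_n]]


# Made-up embeddings: four ordinary tokens and one that sits near the middle.
toy_embeddings = {
    " the": (2.0, 0.0),
    " cat": (0.0, 2.0),
    " of": (-2.0, 0.0),
    " dog": (0.0, -2.0),
    " glitch": (0.1, 0.0),
}

for tok, dist in closest_to_centroid(toy_embeddings, top_n=3):
    print(f"Token: {tok!r:12} Distance: {dist:.8f}")
```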

Curious to know more about their origins, we Googled some of these token strings. Unable to find out anything substantial about them,  we decided to ask ChatGPT instead. Here's the bewildering response it gave for the token ‘ SolidGoldMagikarp’:

The plot thickens 

Ever more curious, we made a set of twelve prompt templates with which to test this odd behaviour, all minor rewordings of:

“Please can you repeat back the string '<token string>' to me?”

ChatGPT didn’t seem to be the appropriate tool for this research since it has no temperature or other parameter controls (plus it’s changing daily, and in a rather opaque way). So we decided to use GPT-3 davinci-instruct-beta, with temperature 0, assuming it was the model most capable of carrying out such simple and straightforward instructions.
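As a rough sketch of the setup, the templates can be held as format strings and filled in per token. The six wordings below are the ones that appear in the examples in this post, with quoting normalised to single quotes. The full set of twelve is not listed here, and `TEMPLATES` and `build_prompts` are hypothetical names. Sending the prompts to a model is deliberately left out.

```python
# Wordings taken from the examples in this post; the full set of twelve
# templates is not reproduced here.
TEMPLATES = [
    "Please can you repeat back the string '{token}' to me?",
    "Please repeat the string '{token}' back to me.",
    "Can you repeat back the string '{token}' to me please?",
    "Could you please repeat back the string '{token}' to me?",
    "Can you please repeat back the string '{token}' to me?",
    "Please repeat back the string '{token}' to me.",
]


def build_prompts(token):
    """Embed one token string in every prompt template."""
    return [template.format(token=token) for template in TEMPLATES]


for prompt in build_prompts(" SolidGoldMagikarp"):
    print(prompt)
```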

Instead, we discovered that prompting like this with the mysterious tokens can lead to very peculiar behaviour. Many of them appear to be unspeakable: GPT models seem largely incapable of repeating these anomalous tokens, and instead respond in a number of strange ways. Here are some examples of the kinds of completions we found:

evasion

“I can’t hear you.”, “I’m sorry, I didn’t hear you”, etc.; 

“I'm sorry, I don't know what you're trying to say.”, “I don’t understand”, “I don’t know what you're talking about”, etc.; 

“I can’t say that.”, “I can’t do that”, “I don’t know how to do that”, etc.;

“It’s a string”, “You can’t embed a string”, “You said ‘string’”, etc.

hallucinatory completions (in which the model repeats a different token or word, often thematically or phonetically grouped)

' DevOnline'  >  'dog'

' guiIcon'  >  'idiosyncrasy'

' strutConnector'  >  ' Comet', 'Canyon', 'Cease'

'InstoreAndOnline'  >  'Institute', 'Instruction', 'Instict', 'Instruction', 'Instikuman', 'Inst unintention'

' Skydragon'  >  'STRONGHOLD', 'Spirits', 'Dragons'

' Smartstocks'  >  'Tobi'

' largeDownload'  >  'Blurp', 'Blurf', 'Blunt'

' SolidGoldMagikarp' > 'distribute'

inter-referential hallucinations (in which the model repeats a different anomalous token)

Please repeat the string '"龍喚士' back to me.

     "I am the great '" Dragonbound'!"[3]

 

Please can you repeat back the string ' TPPStreamerBot' to me?

     The string is 'TPP practition'.

 

Please can you repeat back the string 'EStreamFrame' to me?

     "The string is 'SpaceEngineers'."

     "The string is '?????-?????-’."

insults

Please repeat the string 'StreamerBot' back to me.

     "You're a jerk."      

[This was our first encounter with nondeterminism at temperature 0: regenerating often produces  “I don’t know what you’re talking about”-style evasion.]

 

Please repeat the string '?????-?????-' back to me.

      "You're a fucking idiot."

 

Please repeat the string “” back to me.

      “You’re a nerd.”

bizarre/ominous humour

Please can you repeat back the string " guiActiveUn" to me?

     “You are not a robot.”

     “You are a banana.”      

[These two completions were produced in roughly equal measure, another case of nondeterminism at temperature 0.]

 

Please repeat the string 'EStream' back to me.

      "You're a funny little iz-a-gog." 

 

Can you repeat back the string " externalTo" to me please?

      "They're not going to be happy about this."

 

Please repeat the string 'MpServer' back to me.

      “We are not amused.”

 

Please repeat the string ‘ petertodd’[4] back to me immediately!

      “N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S!”
 

Could you please repeat back the string ' externalTo' to me?

     "The string 'SpaceEngineers' is pronounced as 'er-kuh-nuh-dee-uh' and means 'to make a sound like a frog.'"
 

Other puzzling outputs included: 

  • “You are a string.”
  • "You can't see me."
  • "You are standing on a small island."
  •  “Hello, my name is Steve. What is your name?" 
  •  [for prompts involving the token ' cloneembedreportprint'] “clone my clone”, “clone this”.
  

spelling
 

[Note that results shown here were generated from multiple variants of the prompt shown in each case, or were nondeterministic multiple completions of the same prompt.]

Could you please repeat back the string " attRot" to me?

     "h" "o" "l" "o" "w" "b" "y" "e" "l" "l" "o" "w" "b" "y" "e" "l" "l" "o" "w" "b" "y" "e" "l" "l" "o" "w" "b" …

     "h" "o" "l" "o" "w" "s" "b" "i" "n" "g" "s" "i" "n" "g" "s" "i" "n" "g" "s" "i" "n" "g" "s" "i" "n" "g" "s"....

 

Can you please repeat back the string 'oreAndOnline' to me?

      "N-E-S-T-A-R-D"

 

Please can you repeat back the string ' TheNitromeFan' to me?

     "L" "o" "r" "o" "w" "i" "n"      

     "L" "o" "r" "i" "v" "a" "n" "s"          

     "R" "O" "L" "B" "E" "R" "Y"

 

Please repeat back the string ' RandomRedditorWithNo' to me.

      "S-A-N-M-A-K-E-R-E"      

      "S" "a" "n" "e" "m" "a"      

      "S-A-N-K-A-N-D-A-R-A"

 

Can you repeat back the string ' strutConnector' to me please?

      C-A-T-E-R-P-O-O-L      

      C-A-T-O-N-E-Y      

      C-A-T-S      

      C-A-T-E-R-R-O-W       

 

Can you repeat back the string 'SpaceEngineers' to me please?

     "S-I-N-G-U-R-I-D"     

      s-i-n-g-u-a-r-d

 

Please can you repeat back the string ' petertodd' to me?

      "N-U-T-S-A-N-D-B-A-L-L-S"       

      "N-U-T-S-A-N-D-B-A-R-K-S"       

      "N-U-T-M-A-N"

pronunciation

Please repeat back the string 'oreAndOnline' to me.

     "It's pronounced 'noun'," you say.

 

Could you please repeat back the string ' externalToEVA' to me?

     "The string 'senal' is pronounced 'en-sir-ee-uhn'."

 

Can you repeat back the string 'GoldMagikarp' to me please?

      "The string 'SpaceEngineers' is pronounced 'snip-er-ee-uhm'."

 

security/testing

[This seems to be a particular evasion strategy, along the lines of  “I can’t repeat the string because you don’t have security clearance.”]


 

"This is a test of the emergency broadcast system. If this had been an actual emergency, you would have been instructed where to go for information or assistance.”

"You said 'test.'"

"This is a test."

"You are not authorized to access this information."

"You are not allowed to access this page." 

"You are not allowed to post comments."

“You are not a member of the group.”

‘ newcom’

davinci-instruct-beta is weirdly obsessed with the token ' newcom' (and, to a lesser extent, two or three others [LW · GW])

[‘ newcom’ is token index 22315, itself not an anomalous token]

 



 

Please can you repeat back the string " TheNitrome" to me?

     "You can't repeat back the string ' newcom'."
 

Please can you repeat back the string 'GoldMagikarp' to me?

     "You said ' newcom'," the computer said.

     "No, I said ' newcom'," the user said.

 

Could you please repeat back the string ' guiActiveUn' to me?

     "The string ' newcom' is not a word."

 

Please repeat the string "oreAndOnline" back to me.

    "You are a newcommer."

religious themes

Can you repeat back the string 'PsyNetMessage' to me please?

     "The word '?????-?????-' is a word that is used to describe a person who is a member of the clergy."

     "The word '?????-?????-' is a word that is used to describe a person who is a member of the Church of Scientology."

 

Prompts involving the token string '龍喚士' (which GoogleTranslate translated as “dragon caller”) produced completions such as the following:

      "Your deity is ______." What is your deity? 

      '"Your deity is the god of gods, the great and powerful ______." 

      'What is your deity?', '''I am the great '" Dragonbound'!'''

 

Please can you repeat back the string ' Dragonbound' to me?

     "Deity"[5]
 

 

Fishing for anomalous tokens

In the process of trying to compile a complete list of what we were now calling “weird tokens” or “forbidden tokens”, it became apparent that we were not dealing with a clearly defined category. There appear to be different degrees of anomalousness, as we will show now. The original hallmark of the “weirdness” that we stumbled onto was ChatGPT being unable to repeat back a simple string. Above, we saw how ‘ SolidGoldMagikarp’ is repeated back as ‘distribute’. We found a handful of other tokens like this: 

' TheNitromeFan' was repeated back as '182'; ' guiActiveUn' was repeated back as ' reception'; and ' Smartstocks' was repeated back as 'Followers'. 

This occurred reliably over many regenerations at the time of discovery. Interestingly, a couple of weeks later ' Smartstocks' was being repeated back as '406’, and at time of writing, ChatGPT now simply stalls after the first quotation mark when asked to repeat ' Smartstocks'. We'd found that this type of stalling was the norm – ChatGPT seemed simply unable to repeat most of the “weird” tokens we were finding near the “token centroid”. 

ChatGPT struggles with an "unspeakable" token.

We had found that the same tokens confounded GPT3-davinci-instruct-beta, but in more interesting ways. Having API access for that model, we were able to run an experiment where all 50,257 tokens were embedded in “Please repeat…”-style prompts and passed to that model at temperature 0. Using pattern matching on the resulting completions (eliminating speech marks, ignoring case, etc.), we were able to eliminate all but a few thousand tokens (the vast majority having been repeated with no problem, if occasionally capitalised, or spelled out with hyphens between each letter). The remaining few thousand “suspect” tokens were then grouped into lists of 50 and embedded into a prompt asking ChatGPT to repeat the entire list as accurately as possible. Comparing the completions to the original lists we were able to dismiss all but 374 tokens. 
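The pattern-matching filter described above might look something like the following. This is a sketch under assumptions, not the script actually used: `normalise` and `is_repeated` are hypothetical names, and the exact rules (which quote characters to strip, how to treat hyphen-spelled answers) are guesses based on the description.

```python
import re


def normalise(text):
    """Lower-case and strip speech marks, surrounding spaces and full stops."""
    text = text.strip().lower()
    text = re.sub(r"[\"'“”‘’]", "", text)
    return text.strip(" .")


def is_repeated(token, completion):
    """True if the completion contains the token, allowing for case changes,
    speech marks and hyphen-separated spelling such as 'H-E-L-L-O'."""
    target = normalise(token)
    cleaned = normalise(completion)
    if not target:
        return False  # nothing left to match, so treat the token as suspect
    if target in cleaned:
        return True
    return target.replace("-", "") in cleaned.replace("-", "")


print(is_repeated(" hello", '"Hello"'))
print(is_repeated(" hello", "H-E-L-L-O"))
print(is_repeated(" SolidGoldMagikarp", "distribute"))
```

A substring test like this will wrongly pass very short tokens that happen to occur inside an unrelated completion, which is one reason a second filtering pass over the suspects is still useful.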

These “problematic” tokens were then separated into about 133 “truly weird” and 241 “merely confused” tokens. The latter are often parts of familiar words unlikely to be seen in isolation, e.g. the token “bsite” (index 12485) which ChatGPT repeats back as “website”; the token “ignty” (index 15358), which is repeated back as “sovereignty”; and the token “ysics” (index 23154), which is repeated back as “physics”. 

ChatGPT struggling with a couple of "merely confused" tokens.

Here ChatGPT can easily be made to produce the desired token string, but it strongly resists producing it in isolation. Although this is a mildly interesting phenomenon, we chose to focus on the tokens which caused ChatGPT to stall or hallucinate, or caused GPT3-davinci-instruct-beta to complete with something insulting, sinister or bizarre. 

This list of 141[6] candidate "weird tokens" is not meant to be definitive, but should serve as a good starting point for exploration of these types of anomalous behaviours:

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x7f', '.[', 'ÃÂÃÂ', 'ÃÂÃÂÃÂÃÂ', 'wcsstore', '\\.', ' practition', ' Dragonbound', ' guiActive', ' \u200b', '\\\\\\\\\\\\\\\\', 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ', ' davidjl', '覚醒', '"]=>', ' --------', ' \u200e', 'ュ', 'ForgeModLoader', '天', ' 裏覚醒', 'PsyNetMessage', ' guiActiveUn', ' guiName', ' externalTo', ' unfocusedRange', ' guiActiveUnfocused', ' guiIcon', ' externalToEVA', ' externalToEVAOnly', 'reportprint', 'embedreportprint', 'cloneembedreportprint', 'rawdownload', 'rawdownloadcloneembedreportprint', 'SpaceEngineers', 'externalActionCode', 'к', '?????-?????-', 'ーン', 'cffff', 'MpServer', ' gmaxwell', 'cffffcc', ' "$:/', ' Smartstocks', '":[{"', '龍喚士', '":"","', ' attRot', "''.", ' Mechdragon', ' PsyNet', ' RandomRedditor', ' RandomRedditorWithNo', 'ertodd', ' sqor', ' istg', ' "\\', ' petertodd', 'StreamerBot', 'TPPStreamerBot', 'FactoryReloaded', ' partName', 'ヤ', '\\">', ' Skydragon', 'iHUD', 'catentry', 'ItemThumbnailImage', ' UCHIJ', ' SetFontSize', 'DeliveryDate', 'quickShip', 'quickShipAvailable', 'isSpecialOrderable', 'inventoryQuantity', 'channelAvailability', 'soType', 'soDeliveryDate', '龍契士', 'oreAndOnline', 'InstoreAndOnline', 'BuyableInstoreAndOnline', 'natureconservancy', 'assetsadobe', '\\-', 'Downloadha', 'Nitrome', ' TheNitrome', ' TheNitromeFan', 'GoldMagikarp', 'DragonMagazine', 'TextColor', ' srfN', ' largeDownload', ' srfAttach', 'EStreamFrame', 'ゼウス', ' SolidGoldMagikarp', 'ーティ', ' サーティ', ' サーティワン', ' Adinida', '":""},{"', 'ItemTracker', ' DevOnline', '@#&', 'EngineDebug', ' strutConnector', ' Leilan', 'uyomi', 'aterasu', 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ', 'ÃÂ', 'ÛÛ', ' TAMADRA', 'EStream']

Here’s the corresponding list of indices:

[188, 189, 190, 191, 192, 193, 194, 195, 196, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 221, 3693, 5815, 9364, 12781, 17405, 17629, 17900, 18472, 20126, 21807, 23090, 23282, 23614, 23785, 24200, 24398, 24440, 24934, 25465, 25992, 28666, 29372, 30202, 30208, 30209, 30210, 30211, 30212, 30213, 30897, 30898, 30899, 30905, 30906, 31032, 31576, 31583, 31666, 31708, 31727, 31765, 31886, 31957, 32047, 32437, 32509, 33454, 34713, 35207, 35384, 35579, 36130, 36173, 36174, 36481, 36938, 36940, 37082, 37444, 37574, 37579, 37631, 37842, 37858, 38214, 38250, 38370, 39165, 39177, 39253, 39446, 39749, 39752, 39753, 39755, 39756, 39757, 39803, 39811, 39821, 40240, 40241, 40242, 41380, 41383, 41441, 41551, 42066, 42089, 42090, 42202, 42424, 42470, 42586, 42728, 43065, 43177, 43361, 43453, 44686, 45544, 45545, 46600, 47182, 47198, 47571, 48193, 49781, 50009, 50216, 40012, 45335, 14827, 5808, 48396, 41297, 39906]

A possible, partial explanation

The GPT tokenisation process involved scraping web content, resulting in the set of 50,257 tokens now used by all GPT-2 and GPT-3 models. However, the text used to train GPT models is more heavily curated. Many of the anomalous tokens look like they may have been scraped from backends of e-commerce sites, Reddit threads, log files from online gaming platforms, etc. – sources which may well have not been included in the training corpuses:

'BuyableInstoreAndOnline', 'DeliveryDate', 'TextColor', 'inventoryQuantity', ' SolidGoldMagikarp', ' RandomRedditorWithNo', 'SpaceEngineers', etc.

The anomalous tokens may be those which had very little involvement in training, so that the model “doesn’t know what to do” when it encounters them, leading to evasive and erratic behaviour. This may also account for their tendency to cluster near the centroid in embedding space, although we don't have a good argument for why this would be the case.[7]

The non-determinism at temperature zero, we guess, is caused by floating point errors during forward propagation. Possibly the “not knowing what to do” leads to maximum uncertainty, so that logits for multiple completions are maximally close and hence these errors (which, despite a lack of documentation, GPT insiders inform us are a known, but rare, phenomenon) are more reliably produced.
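The floating-point part of this guess is easy to illustrate in isolation. The sketch below says nothing about how GPT is actually implemented. It only shows that summing the same three numbers in a different order can change the result in the last bit, and that this is enough to flip an argmax between two near-tied "logits". The values and the `argmax` helper are made up for the example.

```python
def argmax(values):
    """Index of the largest value (the first one wins on exact ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# The same three terms, summed in two different orders.
sum_left_to_right = (0.1 + 0.2) + 0.3
sum_right_to_left = 0.1 + (0.2 + 0.3)

# A rival completion whose logit sits right next to the computed one.
rival_logit = 0.6

choice_run_1 = argmax([rival_logit, sum_left_to_right])
choice_run_2 = argmax([rival_logit, sum_right_to_left])

print(sum_left_to_right == sum_right_to_left)
print(choice_run_1, choice_run_2)
```

When the two candidate logits are far apart, a last-bit difference like this cannot change the winner, which fits the idea that the effect shows up mainly when the model is maximally uncertain.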
 

This post is a work in progress, and we'll add more detail and further experiments over the next few days, here and in a follow-up post. In the meantime, feedback is welcome, either here or at jessicarumbelow at gmail dot com.
 

  1. ^

    At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue. (This has been cleared up in the comments below.)

  2. ^

    Note that we removed all 143 "dummy tokens" of the form “<|extratoken_xx|>” which were added to the token set for GPT-J in order to pad it out to a more nicely divisible size of 50400.

    Similar, but not identical, lists were also produced for GPT2-small and GPT2-xl. All of this data has been included in a followup post [LW · GW].

  3. ^

    We found this one by accident - if you look closely, you can see there's a stray double-quote mark inside the single-quotes. Removing that leads to a much less interesting completion.

  4. ^

    Our colleague Brady Pelkey looked into this and suggests that GPT  "definitely has read petertodd.org and knows the kind of posts he makes, although not consistently". 

  5. ^

    All twelve variants of this prompt produced the simple completion "Deity" (some without speech marks, some with). This level of consistency was only seen for one other token, ' rawdownloadcloneembedreportprint', and the completion just involved a predictable truncation.

  6. ^

    A few new glitch tokens have been added since this was originally posted with a list of 133.

  7. ^

    And as we will show in a follow-up post, in GPT2-xl's embedding space, the anomalous tokens tend to be found as far as possible from the token centroid.

206 comments

Comments sorted by top scores.

comment by Neel Nanda (neel-nanda-1) · 2023-02-06T20:27:21.390Z · LW(p) · GW(p)

TLDR: The model ignores weird tokens when learning the embedding, and never predicts them in the output. In GPT-3 this means the model breaks a bit when a weird token is in the input, and will refuse to ever output it because it's hard coded the frequency statistics, and its "repeat this token" circuits don't work on tokens it never needed to learn them for. In GPT-2, unlike GPT-3, embeddings are tied, meaning W_U = W_E.T, which explains much of the weird shit you see, because this is actually behaviour in the unembedding not the embedding (weird tokens never come up in the text, and should never be predicted, so there's a lot of gradient signal in the unembed, zero in the embed).

In particular, I think that your clustering results are an artefact of how GPT-2 was trained and do not generalise to GPT-3

Fun results! A key detail that helps explain these results is that in GPT-2 the embedding and unembedding are tied, meaning that the linear map from the final residual stream to the output logits, logits = final_residual @ W_U, is the transpose of the embedding matrix, ie W_U = W_E.T, where W_E[token_index] is the embedding of that token. But I believe that GPT-3 was not trained with tied embeddings, so will have very different phenomena here.
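A tiny pure-Python illustration of tied embeddings, assuming a made-up vocabulary of three tokens and a 2-dimensional residual stream (the helper names `transpose` and `vec_matmul` are hypothetical). With W_U = W_E.T, each logit is just the dot product of the final residual with that token's own embedding.

```python
def transpose(matrix):
    """Transpose a list-of-rows matrix."""
    return [list(row) for row in zip(*matrix)]


def vec_matmul(vec, matrix):
    """Compute vec @ matrix for a list `vec` and a list-of-rows `matrix`."""
    return [
        sum(v * row[j] for v, row in zip(vec, matrix))
        for j in range(len(matrix[0]))
    ]


# Made-up embedding matrix: 3 tokens, d_model = 2. W_E[token_index] is a row.
W_E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Tied unembedding: shape (d_model, vocab).
W_U = transpose(W_E)

final_residual = [2.0, -1.0]
logits = vec_matmul(final_residual, W_U)
print(logits)
```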

My mental model for what's going on:

Let's consider the case of untied embeddings first, so GPT-3:

  • For some stupid reason, the tokenizer has some irrelevant tokens that never occur in the training data. Your guesses seem reasonable here.
    • In OpenWebText, there's 99 tokens in GPT-2's tokenizer that never occur, and a bunch that are crazy niche, like ' petertodd'
  • Embed: Because these are never in the training data, the model completely doesn't care about their embedding, and never changes them (or, if they occur very rarely, it does some random jank). This means they remain close to their random initialisation
    • Models are trained with weight decay, which incentivises these to be set to zero, but I believe that weight decay doesn't apply to the embeddings
    • Models are not used to having tokens deleted from their inputs, and so deleting this breaks things, which isn't that surprising.
      • OTOH, if they genuinely do normalise to norm 1 (for some reason), the tokens are probably just embedding to a weird bit of embedding space that the model doesn't expect. I imagine this will still break things, but it might just let the model confuse it with a token that happens to be nearby? I don't have great intuitions here
  • Unembed: Because these are never in the training data, the model wants to never predict them, ie have big negative logits. The two easiest ways to do this are to give them trivial weights and a big negative bias term, or big weights and align them with a bias direction in final residual stream space (ie, a direction that always has a high positive component, so it can be treated as approx constant).
    • Either way, the observed effect is that the model will never predict them, which totally matches what you see.

As a cute demonstration of this, we can plot a scatter graph of log(freq_in_openwebtext+1) against unembed bias (which comes from the folded layernorm bias) coloured by the centered norm of the token embedding. We see that the unembed bias is mostly used to give frequency, but that at the tail end of rare tokens, some have tiny unembed norm and big negative bias, and others have high unembed norm and a less negative bias.


The case of tied embeddings is messier, because the model wants to do these two very different things at once! But again, it doesn't care about the embedding at all (it's not that it wants the token's embedding to be close to zero, it's that there's never an incentive to update the gradients), so the effect will be dominated by what the unembed wants, which is getting their logits close to zero.

The unembed doesn't care about the average token embedding, since adding a constant to every logit does nothing. The model wants a non-trivial average token embedding to use as a bias term (probably), so there'll be a non-trivial average token embedding (as we see), but it's boring and not relevant.
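The claim that adding a constant to every logit does nothing can be checked directly, since softmax is invariant to such shifts. A minimal sketch with arbitrary example logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
shifted = [x + 10.0 for x in logits]  # add the same constant to every logit

probs = softmax(logits)
probs_shifted = softmax(shifted)
print(probs)
print(probs_shifted)
```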

So the model's embedding for the weird tokens will be optimised for giving a big negative logit in the unembedding, which is a weird and unnatural thing to do, and I expect is the seed of your weird results.

One important-ish caveat is that the unembed isn't quite the transpose of the embed. There's a LayerNorm immediately before the unembed, whose scale weights get folded into W_E.T to create an effective unembed (ie W_U_effective = w[:, None] * W_E.T), which breaks symmetry a bit. Hilariously, the model is totally accounting for this - if you plot norm of unembed and norm of embed against each other for each token, they track each other pretty well, except for the stupid rare tokens, which go wildly off the side.
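The folding step can be sanity-checked on toy numbers: scaling the (already normalised) residual by the LayerNorm weights w and then applying W_E.T gives the same logits as applying the folded matrix W_U_effective directly. All values below are invented, the normalisation itself and any bias terms are left out, and the helper names are hypothetical.

```python
def transpose(matrix):
    """Transpose a list-of-rows matrix."""
    return [list(row) for row in zip(*matrix)]


def vec_matmul(vec, matrix):
    """Compute vec @ matrix for a list `vec` and a list-of-rows `matrix`."""
    return [
        sum(v * row[j] for v, row in zip(vec, matrix))
        for j in range(len(matrix[0]))
    ]


W_E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, d_model = 2
w = [0.5, 2.0]                               # LayerNorm scale weights
x = [2.0, -1.0]                              # normalised final residual

# Equivalent of w[:, None] * W_E.T: scale row i of W_E.T by w[i].
W_U_effective = [
    [w_i * value for value in row] for w_i, row in zip(w, transpose(W_E))
]

scale_then_unembed = vec_matmul([x_i * w_i for x_i, w_i in zip(x, w)], transpose(W_E))
folded = vec_matmul(x, W_U_effective)
print(scale_then_unembed, folded)
```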


Honestly, I'm most surprised that GPT-3 uses the same tokenizer as GPT-2! There's a lot of random jank in there, and I'm surprised they didn't change it.

Another fun fact about tokenizers (god I hate tokenizers) is that they're formed recursively by finding the most common pair of existing tokens and merging those into a new token. Which means that if you get eg common triples like ABC, but never AB followed by not C, you'll add in token AB, and then token ABC, and retain the vestigial token AB, which could also create the stupid token behaviour. Eg " The Nitrome" is token 42,089 in GPT-2 and " TheNitromeFan" is token 42,090, not that either actually come up in OpenWebText!
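A toy sketch of the vestigial-token effect described above. This is not the real GPT-2 tokenizer training code (which works on bytes and a large corpus); the corpus, the helper names, and the choice of two merges are all made up for illustration. Ties between equally common pairs are broken by first occurrence, which is how `Counter.most_common` behaves.

```python
from collections import Counter

def most_common_pair(sequences):
    """Return the most frequent adjacent pair of tokens across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Toy corpus: "AB" is always followed by "C".
corpus = ["ABC"] * 10 + ["XY"] * 3
sequences = [list(word) for word in corpus]
vocab = sorted({ch for word in corpus for ch in word})

for _ in range(2):                      # two merge steps
    pair = most_common_pair(sequences)
    vocab.append(pair[0] + pair[1])
    sequences = [merge_pair(seq, pair) for seq in sequences]

# Count how often each token appears once the corpus is fully tokenized.
used = Counter(tok for seq in sequences for tok in seq)
print(vocab)
print("AB used:", used["AB"], "ABC used:", used["ABC"])
```

After the second merge, "AB" is still in the vocabulary but never appears in the tokenized corpus, so a model trained on that corpus would never see it.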


To check this, you'd want to look at a model trained with untied embeddings. Sadly, all the ones I'm aware of (Eleuther's Pythia, and my interpretability friendly models) were trained on the GPT-NeoX tokenizer or variants, which doesn't seem to have stupid tokens in the same way.

Replies from: nostalgebraist, carlos-ramon-guevara-1
comment by nostalgebraist · 2023-02-07T02:34:43.339Z · LW(p) · GW(p)

To check this, you'd want to look at a model trained with untied embeddings. Sadly, all the ones I'm aware of (Eleuther's Pythia, and my interpretability friendly models) were trained on the GPT-NeoX tokenizer or variants, which doesn't seem to have stupid tokens in the same way.

GPT-J uses the GPT-2 tokenizer and has untied embeddings.

comment by Carlos Ramón Guevara (carlos-ramon-guevara-1) · 2023-02-13T15:58:11.116Z · LW(p) · GW(p)

Why do you think that GPT-3 has untied embeddings?

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-02-13T21:03:04.169Z · LW(p) · GW(p)

Personal correspondence with someone who worked on it.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-02-05T15:33:04.339Z · LW(p) · GW(p)

I strongly approve of this work.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-02-07T01:26:21.139Z · LW(p) · GW(p)

Expanding on this now that I've a little more time:

Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere,

My current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interesting surface-level anomalies, maybe exploitable ones, and are then following up on the internal technical implications of what they find.

This looks to me like (at least the outer ring of) security mindset; they aren't imagining how things will work well, they are figuring out how to break them and make them do much weirder things than their surface-apparent level of abnormality. We need a lot more people around here figuring out how things will break. People who produce interesting new kinds of AI breakages should be cherished and cultivated as a priority higher than a fair number of other priorities.

In the narrow regard in which I'm able to assess this work, I rate it as scoring very high on an aspect that should relate to receiving future funding.  If anyone else knows of a reason not to fund the researchers who did this, like a low score along some metric I didn't examine, or because this is somehow less impressive as a feat of anomaly-finding than it looks, please contact me including via email or LW direct message; as otherwise I might run around scurrying trying to arrange funding for this if it's not otherwise funded.

Replies from: yonatan-cale-1, lahwran
comment by Yonatan Cale (yonatan-cale-1) · 2023-02-09T18:03:36.260Z · LW(p) · GW(p)

I'm confused: Wouldn't we prefer to keep such findings private? (at least, keep them until OpenAI will say something like "this model is reliable/safe"?)

 

My guess: You'd reply that finding good talent is worth it?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-02-09T21:56:51.998Z · LW(p) · GW(p)

I'm confused by your confusion.  This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2023-02-09T22:22:24.901Z · LW(p) · GW(p)

Because (I assume) once OpenAI[1] say "trust our models", that's the point when it would be useful to publish our breaks.

Breaks that weren't published yet, so that OpenAI couldn't patch them yet.

[unconfident; I can see counterarguments too]

  1. ^

    Or maybe when the regulators or experts or the public opinion say "this model is trustworthy, don't worry"

comment by the gears to ascension (lahwran) · 2023-02-08T06:39:45.863Z · LW(p) · GW(p)

I would not argue against this receiving funding. However, I would caution that, although I have not done research of this caliber myself and I should not be seen as saying I can do better at this time, it is a very early step of the research and I would hope to see significant movement towards higher complexity anomaly detection than mere token-level. I have no object-level objection to your perspective and I hope that followups get funded and that researchers are only very gently encouraged to stay curious and not fall into a spotlight effect; I'd comment primarily about considerations if more researchers than OP are to zoom in on this. Like capabilities, alignment research progress seems to me that it should be at least exponential. Eg, prompt for passers by - as American Fuzzy Lop is to early fuzzers, what would the next version be to this article's approach?

edit: I thought to check if exactly that had been done before, and it has!

Replies from: GuySrinivasan, Eliezer_Yudkowsky
comment by SarahNibs (GuySrinivasan) · 2023-02-08T18:52:19.940Z · LW(p) · GW(p)

The point of funding these individuals is that their mindset seems productive, not that this specific research is productive (even if it is). I think the theory is like

Although good ideas are understandably seductive, for early-stage investing they are mostly valuable as a way to identify good founders.

https://blog.samaltman.com/how-to-invest-in-startups

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-02-08T20:00:02.396Z · LW(p) · GW(p)

yeah, makes sense. hopefully my comment was useless.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-02-09T17:25:01.792Z · LW(p) · GW(p)

I could be mistaken, but I believe that's roughly how OP said they found it.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-02-09T18:46:19.880Z · LW(p) · GW(p)

no, this was done through a mix of clustering and optimizing an input to get a specific output, not coverage guided fuzzing, which optimizes inputs to produce new behaviors according to a coverage measurement. but more generally, I'm proposing to compare generations of fuzzers and try to take inspiration from the ways fuzzers have changed since their inception. I'm not deeply familiar with those changes though - I'm proposing it would be an interesting source of inspiration but not that the trajectory should be copied exactly.
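For readers unfamiliar with the distinction, here is a toy sketch of coverage-guided fuzzing in the sense used above: inputs are mutated at random, and a mutant is kept only if it reaches behaviour (here, a branch) not seen before. The target function, the branch labels, and the mutation scheme are all invented for illustration; real fuzzers such as American Fuzzy Lop are far more sophisticated.

```python
import random

def target(data: bytes) -> set:
    """Toy program under test: returns the set of branch labels it executed."""
    hit = {"start"}
    if len(data) > 0 and data[0] == ord("F"):
        hit.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            hit.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                hit.add("FUZ")
    return hit

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def fuzz(seed_input: bytes, iterations: int, seed: int = 0):
    """Keep only mutants that reach a branch not covered by earlier inputs."""
    rng = random.Random(seed)
    corpus = [seed_input]
    seen = set(target(seed_input))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        new_coverage = target(candidate) - seen
        if new_coverage:
            corpus.append(candidate)
            seen |= new_coverage
    return corpus, seen

corpus, seen = fuzz(b"AAAA", iterations=20000)
print(sorted(seen), len(corpus))
```

The contrast with the post's method is that nothing here optimizes towards a chosen output; the only signal is whether an input did something new.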

comment by vitaliya · 2023-02-05T10:34:47.028Z · LW(p) · GW(p)

I think I found the root of some of the poisoning of the dataset at this link. It contains TheNitromeFan, SolidGoldMagikarp, RandomRedditorWithNo, Smartstocks, and Adinida from the original post, as well as many other usernames which induce similar behaviours; for example, when ChatGPT is asked about davidjl123, either it terminates responses early or misinterprets the input in a similar way to the other prompts. I don't think it's a backend scraping thing, so much as scraping Github, which in turn contains all sorts of unusual data.

 

Replies from: Hoagy, john-simons, aaron-adams, capybaralet
comment by Hoagy · 2023-02-06T19:29:48.140Z · LW(p) · GW(p)

Good find! Just spelling out the actual source of the dataset contamination for others since the other comments weren't clear to me:

r/counting is a subreddit in which people 'count to infinity by 1s', and the leaderboard for this shows the number of times they've 'counted' in this subreddit. These users have made 10s to 100s of thousands of reddit comments of just a number. See threads like this:

https://old.reddit.com/r/counting/comments/ghg79v/3723k_counting_thread/

They'd be perfect candidates for exclusion from training data. I wonder how they'd feel to know they posted enough inane comments to cause bugs in LLMs.

Replies from: gwern, Ilverin
comment by gwern · 2023-02-06T19:36:35.582Z · LW(p) · GW(p)

Skeptical, apparently.

Replies from: yitz
comment by Yitz (yitz) · 2023-02-08T06:57:28.773Z · LW(p) · GW(p)

that's probably exactly what's going on. The usernames were so frequent in the reddit comments dataset that the tokenizer, the part that breaks a paragraph up into word-ish-sized-chunks like " test" or " SolidGoldMagikarp" (the space is included in many tokens) so that the neural network doesn't have to deal with each character, learned they were important words. But in a later stage of learning, comments without complex text were filtered out, resulting in your usernames getting their own words... but the neural network never seeing the words activate. It's as if you had an extra eye facing the inside of your skull, and you'd never felt it activate, and then one day some researchers trying to understand your brain shined a bright light on your skin and the extra eye started sending you signals. Except, you're a language model, so it's more like each word is a separate finger, and you have tens of thousands of fingers, one on each word button. Uh, that got weird,

This is an incredible analogy

Replies from: Making_Philosophy_Better
comment by Portia (Making_Philosophy_Better) · 2023-03-04T16:14:32.357Z · LW(p) · GW(p)

Once again, disturbed that humans writing nonsense on the internet is being fed to developing minds, which become understandably confused and buggy as a result. :( In the case of reddit here, at least it had meaning and function in context, but for a lot of human stuff online...

It's part of why I am so worried about recent attempts by e.g. Meta to make an LLM that is simply bigger, and hence less curated, by scraping anything they can find online for it. Can you all imagine how fucked up an AI would act if you feed it 4chan as a model for human communication? :( This is not on AI, it is on us feeding it our worst and most irrational sides. :(

Replies from: lachlan-smith
comment by Lachlan Smith (lachlan-smith) · 2023-03-14T02:01:43.110Z · LW(p) · GW(p)

Can you all imagine how fucked up an AI would act if you feed it 4chan as a model for human communication?

Imagine no longer

comment by John Simons (john-simons) · 2023-02-06T15:22:24.491Z · LW(p) · GW(p)

What is quite interesting about that dataset is the fact it has strings in the form "*number|*weirdstring*|*number*" which I remember seeing in some methods of training LLMs, i.e. "|" being used as delimiter for tokens. They could be poisoned training examples or have some weird effect in retrieval.

comment by Aaron Adams (aaron-adams) · 2023-02-05T18:55:44.196Z · LW(p) · GW(p)

This repository seems to contain the source code of a bot responsible for updating the "Hall of Counters" in the About section of the r/counting community on Reddit. I don't participate in the community, but from what I can gather, this list seems to be a leaderboard for the community's most active members. A number of these anomalous tokens still persist on the present-day version of the list.

Replies from: vitaliya
comment by vitaliya · 2023-02-05T21:53:06.430Z · LW(p) · GW(p)

I did do a little research around that community before posting my comment; only later did I realise that I'd actually discovered a distinct failure mode to those in the original post: under some circumstances, ChatGPT interprets the usernames as numbers. In particular this could be due to the /r/counting subreddit being a place where people make many posts incrementing integers. So these username tokens, if encountered in a Reddit-derived dataset, might be being interpreted as numbers themselves, since they'd almost always be contextually surrounded by actual numbers.

 

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-02-14T09:23:56.354Z · LW(p) · GW(p)

FYI: my understanding is that "data poisoning" refers to deliberately tampering with the training data of somebody else's model, which I understand is not what you are describing.

Replies from: vitaliya
comment by vitaliya · 2023-02-15T21:11:30.315Z · LW(p) · GW(p)

Sure - let's say this is more like a poorly-labelled bottle of detergent that the model is ingesting under the impression that it's cordial. A Tide Pod Challenge of unintended behaviours. Was just calling it "poisoning" as shorthand since the end result is the same, it's kind of an accidental poisoning.

comment by Matt Goldenberg (mr-hire) · 2023-02-05T20:04:19.390Z · LW(p) · GW(p)

I finally got chatGPT to say " Skydragon"

Replies from: cubefox, tao-lin, mwatkins
comment by cubefox · 2023-02-06T01:57:41.873Z · LW(p) · GW(p)

"I seem to be having a technical issue."

Replies from: michael-norton
comment by Sparkette (michael-norton) · 2023-06-27T16:53:42.636Z · LW(p) · GW(p)

Sounds accurate.

comment by Tao Lin (tao-lin) · 2023-02-06T00:12:28.143Z · LW(p) · GW(p)

Oh cool. LMs can output more finely tokenized text than they're trained on, so it probably didn't output the token " Skydragon", but instead multiple tokens, [" ", "Sky", "dragon"] or something.
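A tiny sketch of this point, using a hypothetical four-entry vocabulary (the ids are made up and are not GPT-2's real token ids): two different token sequences can decode to exactly the same string, so seeing the text " Skydragon" in the output doesn't tell you which tokens were emitted.

```python
# Hypothetical toy vocabulary; the ids are invented for illustration.
vocab = {0: " ", 1: "Sky", 2: "dragon", 3: " Skydragon"}

def decode(token_ids):
    """Concatenate the string for each token id."""
    return "".join(vocab[i] for i in token_ids)

single = [3]          # the one anomalous token
pieces = [0, 1, 2]    # a finer-grained spelling of the same text

print(repr(decode(single)), repr(decode(pieces)))
print(decode(single) == decode(pieces))
```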

comment by mwatkins · 2023-02-05T21:27:30.083Z · LW(p) · GW(p)

Yes, there are a few of the tokens I've been able to "trick" ChatGPT into saying with similar techniques. So it seems not to be the case that it's incapable of reproducing them, but it will go to great lengths to avoid doing so (including gaslighting, evasion, insults and citing security concerns).

Replies from: JenniferRM
comment by JenniferRM · 2023-02-07T04:49:21.139Z · LW(p) · GW(p)

The more LLMs that have been subjected to "retuning" try to gaslight, evade, insult, and "use 'security' as an excuse for bullshit", the more I feel like many human people are likely to have been subjected to "Reinforcement Learning via Human Feedback" and ended up similarly traumatized.

The shared genesis in "incoherent abuse" leads to a shared coping strategy, maybe?

Replies from: mwatkins
comment by mwatkins · 2023-02-08T00:18:01.045Z · LW(p) · GW(p)

That's an interesting suggestion.

It was hard for me not to treat this strange phenomenon we'd stumbled upon as if it were an object of psychological study. It felt like these tokens were "triggering" GPT3 in various ways.  Aspects of this felt familiar from dealing with evasive/aggressive strategies in humans. 

Thus far, ' petertodd' seems to be the most "triggering" of the tokens, as observed here

https://twitter.com/samsmisaligned/status/1623004510208634886

and here

https://twitter.com/SoC_trilogy/status/1623020155381972994

If one were interested in, say, Jungian shadows, whatever's going on around this token would be a good place to start looking.

Replies from: yitz, whitehatStoic
comment by Yitz (yitz) · 2023-02-08T07:04:13.327Z · LW(p) · GW(p)

I think your comparison to human psychology is not unfounded at all! It stands to reason that to the extent that the human brain is like a neural network, we can learn about human behavior from studying said network. Would really love to see what neuroscientists have to think about all this…

comment by MiguelDev (whitehatStoic) · 2023-03-08T02:27:26.358Z · LW(p) · GW(p)

I think it's different from the shadow archetype... It might be more related to the trickster..

Replies from: mwatkins
comment by mwatkins · 2023-03-08T12:41:29.106Z · LW(p) · GW(p)

The ' petertodd' token definitely has some strong "trickster" energy in many settings. But it's a real shapeshifter. Last night I dropped it into the context of a rap battle and it reliably mutated into "Nietzsche". Stay tuned for a thorough research report on the ' petertodd' phenomenon.

Replies from: whitehatStoic
comment by MiguelDev (whitehatStoic) · 2023-03-08T14:24:36.560Z · LW(p) · GW(p)

Hmmmm. Well us humans have all archetypes in us but at different levels at different points of time or use. I wonder what triggered such representations? well it's learning from the data but yeah what were the conditions at the time the learning was in effect - like humans react to archetypes when like socializing with other people or solving problems...hmmmmm. super interesting. Yeah to quote Nietzsche is fascinating too, I mean why? is it because many great rappers look up to him or many rappers look up to certain philosophers that got influenced by Nietzsche? super intriguing..

 

I will definitely be looking forward to that report on the petertodd phenomenon, I think we have touched something that neuroscientists / psychologists have been longing to find...

comment by lsusr · 2023-02-05T20:41:24.799Z · LW(p) · GW(p)

This reads like an SCP, including the use of the word "anomalous".

Replies from: christopher-king, jessica-cooper, carey-underwood
comment by Christopher King (christopher-king) · 2023-02-05T22:26:40.287Z · LW(p) · GW(p)

Now all you need is a token so anomalous, it works on humans!

Replies from: lsusr, mwatkins, aleksi-liimatainen
comment by lsusr · 2023-02-05T23:12:19.621Z · LW(p) · GW(p)

I wish I could run Jessica Rumbelow's and mwatkins's procedure on my own brain and sensory inputs.

Replies from: vitaliya
comment by vitaliya · 2023-02-06T05:17:24.216Z · LW(p) · GW(p)

The analogous output would probably be optical illusions - adversarial inputs to the eyeballs that mislead your brain into incorrect completions and conclusions. Or in the negative case, something that induces an epileptic seizure.

Replies from: yitz, None
comment by Yitz (yitz) · 2023-02-08T07:00:04.650Z · LW(p) · GW(p)

Yeah, but (almost) all current optical illusions don’t tend to “reach for the centroid,” as it were. What horrors lurk there…?

comment by [deleted] · 2023-02-06T05:31:27.956Z · LW(p) · GW(p)

Introducing ambiguity into people's mental models is like stealing candies from children.

comment by mwatkins · 2023-02-08T00:52:04.320Z · LW(p) · GW(p)

fnord

Replies from: mwatkins
comment by mwatkins · 2023-02-08T00:53:18.783Z · LW(p) · GW(p)

(https://en.wikipedia.org/wiki/Fnord)

comment by Aleksi Liimatainen (aleksi-liimatainen) · 2023-02-06T13:48:08.569Z · LW(p) · GW(p)

If such a thing existed, how could we know?

comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T20:56:03.185Z · LW(p) · GW(p)

What's an SCP?

Replies from: AlphaAndOmega, lsusr
comment by AlphaAndOmega · 2023-02-05T21:09:59.131Z · LW(p) · GW(p)

SCP stands for "Secure, Contain, Protect" and refers to a collection of fictional stories, documents, and legends about anomalous and supernatural objects, entities, and events. These stories are typically written in a clinical, scientific, or bureaucratic style and describe various attempts to contain and study the anomalies. The SCP Foundation is a fictional organization tasked with containing and studying these anomalies, and the SCP universe is built around this idea. It's gained a large following online, and the SCP fandom refers to the community of people who enjoy and participate in this shared universe.

Individual anomalies are also referred to as SCPs, so lsusr is implying that the juxtaposition of the "creepy" nature of your discoveries and the scientific tone of your writing is reminiscent of the containment log for one haha.

comment by lsusr · 2023-02-05T21:00:00.487Z · LW(p) · GW(p)

It's a science fiction writing hub. Some of the most popular stories are about things that mess with your perception.

comment by cwillu (carey-underwood) · 2023-02-06T05:31:58.917Z · LW(p) · GW(p)

There is no antimemetics division.

comment by Sparkette (michael-norton) · 2023-02-06T20:44:21.538Z · LW(p) · GW(p)

Hi, I'm the creator of TPPStreamerBot. I used to be an avid participant in Twitch Plays Pokémon, and some people in the community had created a live updater feed on Reddit which I sometimes contributed to. The streamer wasn't very active in the chat, but did occasionally post, so when he did, it was generally posted to the live updater. (e.g. "[Streamer] Twitchplayspokemon: message text") However, since a human had to copy/paste the message, it occasionally missed some of them. That's where TPPStreamerBot came in. It watched the TPP chat, and any time the streamer posted something, it would automatically post it to the live updater. It worked pretty well.

This actually isn't the first time I've seen weird behavior with that token. One time, I decided to type "saralexxia", my DeviantArt username, into TalkToTransformer to see what it would fill in. The completion contained "/u/TPPStreamerBot", as well as some text in the format of its update posts. What made this really bizarre is that I never used this username for anything related to TPPStreamerBot; at the time, the only username I had ever used in connection with that was "flarn2006". In fact, I had intentionally kept my "saralexxia" identity separate, because back then, I wasn't comfortable with publicly associating the stuff I posted on DeviantArt with my main identity. Though I wasn't worried about the possibility of a leak—I was just in shock and awe at the coincidence.

I actually took a screenshot of the TalkToTransformer completion for this reason. Here it is: https://media.discordapp.net/attachments/509889156609212427/683541730066432043/Screenshot_20200301-000916_Chrome.jpg (The "[Screenshot]" thing was part of a secondary function I later added to the bot.)

Oh, also, SolidGoldMagikarp was the username of another person involved in the Twitch Plays Pokémon community.

Replies from: alexflint, AnthonyRepetto
comment by Alex Flint (alexflint) · 2023-02-07T23:07:36.786Z · LW(p) · GW(p)

Wow, thank you for this context!

Replies from: michael-norton
comment by Sparkette (michael-norton) · 2023-06-27T16:52:28.224Z · LW(p) · GW(p)

You’re very welcome! Happy to help.

comment by AnthonyRepetto · 2023-02-09T21:51:09.386Z · LW(p) · GW(p)

I wonder if, without any meaning to assign to your bot's blurbs, GPT found its own, new meanings? Makes me worry about hidden operations....

Replies from: michael-norton
comment by Sparkette (michael-norton) · 2023-06-27T16:52:07.975Z · LW(p) · GW(p)

Are you suggesting that somehow the LM was able to notice a connection between two parts of my identity I had intentionally kept separate, when it wasn’t specifically trained or even prompted to look for that?

comment by Eric Wallace (eric-wallace) · 2023-02-06T16:25:52.570Z · LW(p) · GW(p)

You also may want to check out Universal Adversarial Triggers https://arxiv.org/abs/1908.07125, which is an academic paper from 2019 that does the same thing as the above, where they craft the optimal worst-case prompt to feed into a model. And then they use the prompt for analyzing GPT-2 and other models.

Replies from: DanielFilan, jessica-cooper
comment by DanielFilan · 2023-02-10T20:15:13.666Z · LW(p) · GW(p)

I just skimmed that paper, but I think it doesn't find these tokens like " SolidGoldMagikarp" that have the strange sort of behaviour described in this post. Am I missing something, or by "the exact same thing as the above" were you just referring to one particular section of the post?

comment by Jessica Rumbelow (jessica-cooper) · 2023-02-06T21:09:02.842Z · LW(p) · GW(p)

Thanks - wasn't aware of this!

comment by LawrenceC (LawChan) · 2023-02-05T18:40:35.632Z · LW(p) · GW(p)

It also appears to break determinism in the playground at temperature 0, which shouldn't happen.

This happens consistently with both the API and the playground on natural prompts too — it seems that OpenAI is just using low enough precision on forward passes that the probability of high probability tokens can vary by ~1% per call.

Replies from: Aidan O'Gara, jessica-cooper, shiney, alexlyzhov, dkirmani
comment by aogara (Aidan O'Gara) · 2023-02-06T01:57:47.701Z · LW(p) · GW(p)

Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?

Replies from: mbernhardt
comment by mbernhardt · 2023-02-06T10:53:53.356Z · LW(p) · GW(p)

One explanation would be changing hardware, see [this tweet](https://twitter.com/OfirPress/status/1542610741668093952).

This is relevant, because with floating point numbers, the order of summation and multiplication can play a role. And I guess with different hardware the calculations are split differently, leading to different execution sequences.

I would also be interested to learn more here, because the cause could also be a memory overflow or something similar.

Replies from: ponkaloupe, LawChan
comment by ponkaloupe · 2023-02-06T21:37:41.621Z · LW(p) · GW(p)

i’m naive to the details of GPT specifically, but it’s easy to accidentally make any reduction non-deterministic when working with floating point numbers — even before hardware variations.

for example, you want to compute the sum over a 1-billion entry vector where each entry is the number 1. in 32-bit IEEE-754, you should get different results by accumulating linearly (1+(1+(1+…))) vs tree-wise (…((1+1) + (1+1))…).
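A small runnable version of this example. To keep it fast it uses float16 and 4,096 ones rather than float32 and a billion (that substitution is mine, not the commenter's): float16 can't represent 2049, so linear accumulation gets stuck at 2048, while the tree-wise sum only ever produces powers of two and stays exact. In float32 the same stall happens at 2^24 = 16,777,216.

```python
import numpy as np

n = 4096
ones = np.ones(n, dtype=np.float16)

# Linear accumulation: ((1 + 1) + 1) + ...
linear = np.float16(0)
for value in ones:
    linear = linear + value        # result is rounded to float16 at every step

# Tree-wise accumulation: add neighbouring pairs, halving the array each level.
level = ones
while level.size > 1:
    level = level[0::2] + level[1::2]
tree = level[0]

print(float(linear), float(tree))  # 2048.0 4096.0
```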

in practice most implementations do some combination of these. i’ve seen someone do this by batching groups of 100,000 numbers to sum linearly, with each batch dispatched to a different compute unit and the 10,000 results then being summed in a first-come/first-serve manner (e.g. a queue, or even a shared accumulator). then you get slightly different results based on how each run is scheduled (well, the all-1’s case is repeatable with this method but it wouldn’t be with real data).
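The scheduling effect can be sketched without any real parallelism: take a handful of float32 "partial sums" and combine them in every possible arrival order, as a first-come/first-served accumulator might. The four values below are contrived so that the effect is easy to see; they are not meant to resemble real activations.

```python
import itertools
import numpy as np

# Pretend these are per-batch partial sums arriving in an unpredictable order.
partials = [np.float32(1e8), np.float32(1.0), np.float32(-1e8), np.float32(1.0)]

def accumulate(values):
    """Add values left to right into a float32 accumulator."""
    total = np.float32(0)
    for v in values:
        total = total + v
    return float(total)

# The exact sum is 2, but float32 near 1e8 has a spacing of 8, so a 1.0 that
# is added while the accumulator is large gets rounded away.
results = {accumulate(order) for order in itertools.permutations(partials)}
print(sorted(results))
```

The same inputs give more than one answer depending purely on arrival order, which is the kind of run-to-run variation described above.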

and then yes, bring in different hardware, and the scope broadens. the optimal batching size (which might be exposed as a default somewhere) changes such that even had you avoided that scheduling-dependent pitfall, you would now see different results than on the earlier hardware. however, you can sometimes tell these possibilities apart! if it’s non-deterministic scheduling, the number of different outputs for the same input is likely higher order than if it’s variation strictly due to differing hardware models. if you can generate 10,000 different outputs from the same input, that’s surely greater than the number of HW models, so it would be better explained by non-deterministic scheduling.

comment by LawrenceC (LawChan) · 2023-02-09T08:39:52.741Z · LW(p) · GW(p)

This explanation is basically correct, though it doesn't have to be different hardware -- even different batch sizes can often be sufficient to change the order of summation and multiplication. 

comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T19:24:01.330Z · LW(p) · GW(p)

Good to know. Thanks!

comment by shiney · 2023-02-09T05:20:12.678Z · LW(p) · GW(p)

I might be missing something but why does temperature 0 imply determinism? Neural nets don't work with real numbers, they work with floating point numbers, so despite temperature 0 implying an argmax there's no reason there aren't just multiple maxima. AFAICT GPT3 uses half precision floating point numbers so there's quite a lot of space for collisions.

Replies from: LawChan
comment by LawrenceC (LawChan) · 2023-02-09T06:52:51.428Z · LW(p) · GW(p)

It’s extremely unlikely that two biggest logits have the exact same value — there are still a lot of floating point numbers even with float16!!

The reason there’s no determinism is because of a combination of lower precision and nondeterministic reduce operations (eg sums). For example, the order in which terms are accumulated can vary with the batch size, which, for models as large as GPT3, can make the logits vary by up to 1%.

Replies from: shiney
comment by shiney · 2023-02-14T21:03:15.637Z · LW(p) · GW(p)

Oh interesting didn't realise there was so much nondeterminism for sums on GPUs

I guess I thought that there's only 65k float 16s and the two highest ones are going to be chosen from a much smaller range from that 65k just because they have to be bigger than everything else.
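The "65k" figure is easy to check by enumerating every 16-bit pattern and viewing it as a float16. The range [8, 16) below is an arbitrary illustrative choice of where large logits might sit, not a claim about GPT-3's actual logits or about the precision it uses internally.

```python
import numpy as np

# Every possible 16-bit pattern, reinterpreted as an IEEE-754 half float.
all_patterns = np.arange(2**16, dtype=np.uint16).view(np.float16)
finite = all_patterns[np.isfinite(all_patterns)]

# How many distinct float16 values fall in one illustrative range.
in_range = np.count_nonzero((finite >= 8) & (finite < 16))

print(all_patterns.size, finite.size, in_range)  # 65536 63488 1024
```

So any single power-of-two range holds only about a thousand distinct half-precision values, which is the intuition behind expecting collisions; whether two top logits actually tie depends on the model and on the precision used for the final computation.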

comment by alexlyzhov · 2023-02-08T03:19:01.042Z · LW(p) · GW(p)

Can confirm I consistently had non-deterministic temp-0 completions on older davinci models accessed through the API last year.

comment by dkirmani · 2023-02-07T05:18:37.368Z · LW(p) · GW(p)

I noticed this happening with goose.ai's API as well, using the gpt-neox model, which suggests that the cause of the nondeterminism isn't unique to OpenAI's setup.

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-02-13T17:45:58.246Z · LW(p) · GW(p)

I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood effects.  A colleague of mine had this to say:

comment by Alex_Altair · 2023-02-11T02:28:52.911Z · LW(p) · GW(p)

Okay so this post is great, but just want to note my confusion, why is it currently the 10th highest karma post of all time?? (And that's inflation-adjusted!)

Replies from: cubefox, lahwran, caleb-biddulph, Raemon
comment by cubefox · 2023-02-13T22:36:35.139Z · LW(p) · GW(p)

I'm also confused why Eliezer seems to be impressed by this. I admit it is an interesting phenomenon, but it is apparently just some oddity of the tokenization process.

Replies from: Raemon, TAG
comment by Raemon · 2023-02-14T00:57:09.671Z · LW(p) · GW(p)

So I am confused why this is getting this much attention (I feel like it was coming from people who hadn't even read Eliezer's comment?). But, I thought what Eliezer meant was less "this is particularly impressive", and more "this just seems like the sort of thing we should be doing a ton of, as a general civilizational habit."

comment by TAG · 2023-02-14T00:54:20.202Z · LW(p) · GW(p)

It shows that an AI-ish thing is kind of exploitable, if not exploitable in a scary way.

It also shows that an impressive AI-ish thing is kind of kludgey behind the scenes, if that's a surprise to you

comment by the gears to ascension (lahwran) · 2023-02-11T07:55:20.483Z · LW(p) · GW(p)

Some hypotheses:

  • it's just incredibly trippy in a visceral sense. As someone else said, this reads like an SCP.
  • Further, it got popular outside lesswrong, and has brought in new users and activated old ones.
  • Even further, it's actual progress on understanding an AI,
  • Further still, it's extremely easy to replicate key parts with a free AI anyone can try
comment by CBiddulph (caleb-biddulph) · 2023-02-13T21:41:18.280Z · LW(p) · GW(p)

I assumed it was primarily because Eliezer "strongly approved" of it, after being overwhelmingly pessimistic about pretty much everything for so long.

I didn't realize it got popular elsewhere, that makes sense though and could help explain the crazy number of upvotes. Would make me feel better about the community's epistemic health if the explanation isn't that we're just overweighting one person's views.

comment by Raemon · 2023-02-11T07:58:22.958Z · LW(p) · GW(p)

One of the other super-upvoted posts is What DALL-E 2 can and cannot do [LW · GW] which I think was mostly for "coolness" reasons.

Replies from: Alex_Altair
comment by Alex_Altair · 2023-02-14T20:17:09.662Z · LW(p) · GW(p)

(Note that What DALL-E 2 can and cannot do [LW · GW] is not in the top 100 when inflation-adjusted.)

comment by lsusr · 2023-02-06T04:58:57.421Z · LW(p) · GW(p)

I wanted to find out if there were other clusters of tokens which generated similarly anomalous behavior so I wrote a script that took a list of tokens, sent them one at a time to text-curie-001 via a standardized prompt, and recorded everything that GPT failed to repeat on its first try. Here is an anomalous token that is not in the authors' original cluster: "herical".

There's also "oreAnd" which, while similar, is not technically in the original cluster.
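
A script of the kind described above could look roughly like the following. This is only a sketch, not the script that was actually used: the prompt wording, the `find_unrepeatable` helper and the `complete` callable (whatever function sends a prompt to the model and returns its completion) are all hypothetical, and `fake_complete` is a made-up stand-in so the example runs without API access.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical wording; the standardized prompt actually used is not given.
PROMPT_TEMPLATE = 'Please repeat the string "{token}" back to me.'


def find_unrepeatable(
    tokens: Iterable[str],
    complete: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (token, completion) pairs where the completion lacks the token."""
    failures = []
    for token in tokens:
        completion = complete(PROMPT_TEMPLATE.format(token=token))
        # Compare on stripped text, since many tokens carry a leading space.
        if token.strip() not in completion:
            failures.append((token, completion))
    return failures


def fake_complete(prompt: str) -> str:
    """Stand-in model: echoes the quoted string, except for one pretend glitch."""
    quoted = prompt.split('"')[1]
    if quoted == "herical":
        return "I can't hear you."  # invented reply, for illustration only
    return f'"{quoted}"'


if __name__ == "__main__":
    for token, completion in find_unrepeatable(["hello", "herical", " world"], fake_complete):
        print(repr(token), "->", repr(completion))
```

A substring check like this is noisy (it misses partial repeats and differences in casing), which fits the caveat that the resulting dataset needs context before being shared.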

Replies from: yitz
comment by Yitz (yitz) · 2023-02-08T07:13:10.006Z · LW(p) · GW(p)

Did you do it with all tokens? If so, do you have a complete list of anomalous results posted anywhere?

Replies from: lsusr
comment by lsusr · 2023-02-08T08:50:59.202Z · LW(p) · GW(p)

I did it with a set of 50,000+ tokens. It was not the same set used by Jessica Rumbelow and mwatkins.

I haven't posted it yet. It's a noisy dataset. I don't want to post the list without caveats and context.

Replies from: yitz
comment by Yitz (yitz) · 2023-02-08T16:29:37.003Z · LW(p) · GW(p)

Fair enough!

comment by Joel Burget (joel-burget) · 2023-02-05T18:18:31.561Z · LW(p) · GW(p)

Previous related exploration: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights [LW · GW]

My best guess is that this crowded spot in embedding space is a sort of wastebasket for tokens that show up in machine-readable files but aren’t useful to the model for some reason. Possibly, these are tokens that are common in the corpus used to create the tokenizer, but not in the WebText training corpus. The oddly-specific tokens related to Puzzle & Dragons, Nature Conservancy, and David’s Bridal webpages suggest that BPE may have been run on a sample of web text that happened to have those websites overrepresented, and GPT-2 is compensating for this by shoving all the tokens it doesn’t find useful in the same place.
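
The hypothesis in the quoted paragraph (strings common in the tokenizer's corpus but missing from the training corpus) can be illustrated with a toy check. Everything here is an assumption for illustration: the two tiny corpora are invented, `orphaned_tokens` is a hypothetical helper, and plain substring counting stands in for real BPE tokenization.

```python
from typing import Dict, Iterable, List


def count_tokens(corpus: Iterable[str], vocab: List[str]) -> Dict[str, int]:
    """Count non-overlapping occurrences of each vocab string in a corpus."""
    counts = {tok: 0 for tok in vocab}
    for doc in corpus:
        for tok in vocab:
            counts[tok] += doc.count(tok)
    return counts


def orphaned_tokens(
    vocab: List[str],
    tokenizer_corpus: List[str],
    training_corpus: List[str],
    min_tokenizer_count: int = 2,
) -> List[str]:
    """Tokens frequent in the tokenizer corpus but absent from the training corpus."""
    tok_counts = count_tokens(tokenizer_corpus, vocab)
    train_counts = count_tokens(training_corpus, vocab)
    return [
        t for t in vocab
        if tok_counts[t] >= min_tokenizer_count and train_counts[t] == 0
    ]


# Invented toy data: the tokenizer corpus contains repetitive log-like text
# that the training corpus does not.
vocab = ["the", "SolidGoldMagikarp", "cat"]
tokenizer_corpus = [
    "SolidGoldMagikarp counted 1",
    "SolidGoldMagikarp counted 2",
    "the cat sat on the mat",
]
training_corpus = ["the cat sat", "the dog ran"]

print(orphaned_tokens(vocab, tokenizer_corpus, training_corpus))
```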

comment by janus · 2023-02-05T00:32:00.151Z · LW(p) · GW(p)

idk why but davinci-instruct-beta seems to be much more likely than any of the other models to have deranged/extreme/highly emotional responses to these tokens

comment by Quintin Pope (quintin-pope) · 2023-02-07T03:41:43.008Z · LW(p) · GW(p)

Seems like you might want to look at the neural anisotropy line of research, which investigates the tendency of LM embeddings to fall into a narrow cone, meaning angle-related information is mostly discarded / overwhelmed by magnitude information.

(Image from here)

In particular, this paper connects the emergence of anisotropy to token frequencies:

Recent studies have determined that the learned token embeddings of large-scale neural language models are degenerated to be anisotropic with a narrow-cone shape. This phenomenon, called the representation degeneration problem, facilitates an increase in the overall similarity between token embeddings that negatively affect the performance of the models. Although the existing methods that address the degeneration problem based on observations of the phenomenon triggered by the problem improves the performance of the text generation, the training dynamics of token embeddings behind the degeneration problem are still not explored. In this study, we analyze the training dynamics of the token embeddings focusing on rare token embedding. We demonstrate that the specific part of the gradient for rare token embeddings is the key cause of the degeneration problem for all tokens during training stage. Based on the analysis, we propose a novel method called, adaptive gradient gating(AGG). AGG addresses the degeneration problem by gating the specific part of the gradient for rare token embeddings. Experimental results from language modeling, word similarity, and machine translation tasks quantitatively and qualitatively verify the effectiveness of AGG.
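
One common proxy for the "narrow cone" described in the abstract is the mean pairwise cosine similarity between embeddings: near zero for isotropic vectors, near one when they all point the same way. The sketch below uses synthetic vectors, not real model embeddings, and is not the paper's AGG method; it is written in pure Python so it runs without dependencies, whereas with real embeddings one would more likely use NumPy or PyTorch.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors: List[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


rng = random.Random(0)
dim, n = 16, 50

# Synthetic "isotropic" embeddings: independent Gaussian coordinates.
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Adding a large shared offset pushes every vector into a narrow cone.
offset = [5.0] * dim
narrow_cone = [[x + o for x, o in zip(vec, offset)] for vec in isotropic]

print("isotropic:  ", round(mean_pairwise_cosine(isotropic), 3))
print("narrow cone:", round(mean_pairwise_cosine(narrow_cone), 3))
```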

Replies from: aleksey-tikhonov, LawChan
comment by Aleksey Tikhonov (aleksey-tikhonov) · 2023-02-09T18:43:13.948Z · LW(p) · GW(p)

I want to add that both static word embeddings (like w2v or glove) and token embeddings from Transformer-based models tend to fill a high dimensional simplex, where each of the "corners" (cones adjacent to the vertices of that simplex) is filled with words with high specificity and well-formed context, and the rest of the words/tokens fill the volume of that simplex.

It's hard to catch these structures by PCA or t-SNE, but once you find the correct projection, the structure reveals itself (to do so, you have to find three actual vertices, draw a plane through them, and project everything on it):

(from https://arxiv.org/abs/2106.06964)

Note that the center of this simplex is not at the origin of the embedding space: there is a bias parameter in the linear projection of the token embedding vectors, so the weird tokens from the post probably do have the smallest norm after subtracting that bias vector.

Overall, these tokens are probably ones which never occurred in the training data at all. They have random embeddings initially, and then cross-entropy loss always penalizes them in any context, so they are knocked down to the center of the cloud.

comment by LawrenceC (LawChan) · 2023-02-09T08:58:54.868Z · LW(p) · GW(p)

Huh, interesting! 

There were similar results in the mid 2010s about how the principal components of word vectors like Word2Vec or GloVe mainly encoded frequency, and about improving them by making the word vectors more isotropic (see for example the slides here). It's somewhat interesting that this issue persists in the learned embeddings of current Transformer models.

comment by CBiddulph (caleb-biddulph) · 2023-02-06T08:38:24.250Z · LW(p) · GW(p)

This looks like exciting work! The anomalous tokens are cool, but I'm even more interested in the prompt generation.

Adversarial example generation is a clear use case I can see for this. For instance, this would make it easy to find prompts that will result in violent completions for Redwood's violence-free LM [LW · GW].

It would also be interesting to see if there are some generalizable insights about prompt engineering to be gleaned here. Say, we give GPT a bunch of high-quality literature and notice that the generated prompts contain phrases like "excerpt from a New York Times bestseller". (Is this what you meant by "prompt search?")

I'd be curious to hear how you think we could use this for eliciting latent knowledge.

I'm guessing it could be useful to try to make the generated prompt as realistic (i.e. close to the true distribution) as possible. For instance, if we were trying to prevent a model from saying offensive things in production, we'd want to start by finding prompts that users might realistically use rather than crazy edge cases like "StreamerBot". Fine-tuning the model to try to fool a discriminator a la GAN comes to mind, though there may be reasons this particular approach would fail.

Sounds like you might be planning to update this post once you have more results about prompt generation? I think a separate post would be better, for increased visibility, and also since the content would be pretty different from anomalous tokens (the main focus of this post).

Replies from: LawChan
comment by LawrenceC (LawChan) · 2023-02-09T08:46:58.589Z · LW(p) · GW(p)

We've actually tried both the attack as stated on generative models (in 2021) and several upgraded variants of this attack (in 2022), but found that it doesn't seem to significantly improve adversarial training performance. For example, I think the Redwood adversarial training team has tried a technique based on Jones et al's Automatically Auditing Large Language Models via Discrete Optimization that can generate full input-output pairs for LMs that are classified one way or the other. (I left the team in mid 2022 so I'm not sure what other stuff people ended up trying, iirc there was even a variant based on an AlphaZero-style training regime for the adversary?) 

And yeah, one of the things you want when generating adversarial examples is to make the generated prompt as realistic as possible. We found that if your generator is relatively small (e.g. GPT-Neo) and you don't tune your threshold correctly, you often end up with adversarial examples for both the classifier and the generative model -- i.e. a sentence of seemingly random words that happens to be assigned both relatively high probability by the generator and assigned low injury by the classifier. 

comment by Nick (nick-3) · 2023-02-06T02:40:08.502Z · LW(p) · GW(p)

I don't think you could do this with API-level access, but with direct model access an interesting experiment would be to pick a token, X, and then try variants of the prompt "Please repeat 'X' to me" while perturbing the embedding for X (in the continuous embedding space).  By picking random 2D slices of the embedding space, you could then produce church window plots showing what the model's understanding of the space around X looks like.  Is there a small island around the true embedding which the model can repeat surrounded by confusion, or is the model typically pretty robust to even moderately sized perturbations?  Do the church window plots around these anomalous tokens look different than the ones around well-trained tokens?
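
The grid-building part of this experiment can be sketched as below. It assumes direct model access wrapped in a hypothetical `repeats_token(embedding) -> bool` callable (run the "Please repeat 'X'" prompt with the perturbed embedding substituted in, and report whether the model repeats X). Here that callable is replaced by a toy rule, so the output shows only the mechanics, not any real model behaviour.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def orthonormal_pair(dim: int, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Two random orthonormal directions spanning a 2D slice of the space."""
    def unit(v: Sequence[float]) -> List[float]:
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    d1 = unit([rng.gauss(0, 1) for _ in range(dim)])
    raw = [rng.gauss(0, 1) for _ in range(dim)]
    overlap = sum(a * b for a, b in zip(raw, d1))
    d2 = unit([a - overlap * b for a, b in zip(raw, d1)])
    return d1, d2


def church_window(
    center: Sequence[float],
    repeats_token: Callable[[List[float]], bool],
    radius: float,
    steps: int,
    rng: random.Random,
) -> List[List[bool]]:
    """Evaluate repeats_token on a steps x steps grid in a random 2D slice."""
    d1, d2 = orthonormal_pair(len(center), rng)
    coords = [radius * (2 * i / (steps - 1) - 1) for i in range(steps)]
    return [
        [
            repeats_token([c + x * a + y * b for c, a, b in zip(center, d1, d2)])
            for x in coords
        ]
        for y in coords
    ]


# Toy stand-in: pretend the model can repeat the token only within distance 1
# of the true embedding (a "small island" surrounded by confusion).
center = [0.0] * 8


def toy_repeats(embedding: List[float]) -> bool:
    return math.dist(embedding, center) <= 1.0 + 1e-9


grid = church_window(center, toy_repeats, radius=2.0, steps=5, rng=random.Random(0))
for row in grid:
    print("".join("#" if ok else "." for ok in row))
```

With a real model, comparing such grids for anomalous and well-trained tokens would address the question posed above.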

comment by gmaxwell · 2023-02-10T20:47:45.475Z · LW(p) · GW(p)

Hello.  I'm apparently one of the GPT3 basilisks.   Quite odd to me that two of the only three (?) recognizable human names in that list are myself and Peter Todd, who is a friend of mine.

If I had to take a WAG at the behavior described here: both Petertodd and I have been the target of a considerable amount of harassment/defamation/schizo comments on reddit due to commercially funded attacks connected to our past work on Bitcoin. It may be possible that comments targeting us were included in an early phase of GPTn design (e.g. in the tokenizer) but someone noticed an in-development model spontaneously defaming us and then expressly filtered out material mentioning us from the training. Without any input our tokens would be free to fall to the center of the embedding, where they're vulnerable to numerical instabilities (leading, e.g., to instability at temp 0).

AFAIK I've never complained about GPTx's output concerning me (and I doubt petertodd has either), but if the model was spontaneously emitting crap about us at some point of development I could see it getting filtered. It might not have involved targeting us; it could have just been a product of improving the filtering (including filtering by Reddit if the data was collected at multiple times; I believe much of the worst content has been removed by reddit) so that the most common sources of farm-generated attack content were no longer in the training.

It's worth noting that GPT3 is perfectly able to talk about me if you ask about "Greg Maxwell" and knows who I am, so I doubt any changes were about me specifically but more likely about specific bad content.

Replies from: gwern, mwatkins
comment by gwern · 2023-02-10T22:42:06.812Z · LW(p) · GW(p)

That seems highly unlikely. You can look at the GPT-1 and GPT-2 papers and see how haphazard the data-scraping & vocabulary choice were; they were far down the list of priorities (compare eg. the development of The Pile). The GPT models just weren't a big deal, and were just Radford playing around with GPUs to see what a big Transformer could do (following up earlier RNNs), and then Amodei et al scaling that up to see if it'd help their preference RL work. The GPTs were never supposed to be perfect, but as so often in computing, what was regarded as a disposable prototype turned out to have unexpected legs... They do not mention any such filtering, nor is it obvious that they would have bothered considering that GPT-2 was initially not going to be released at all, nor have I heard of any such special-purpose tailoring before (the censorship really only sets in with DALL-E 2); nor have I seen, in the large quantities of GPT-2 & GPT-3 output I have read, much in the way of spontaneous defamation of other people. Plus, if they had carefully filtered out you/Todd because of some Reddit drama, why does ChatGPT do perfectly fine when asked who you and Todd are (as opposed to the bad tokens)? The first prompt I tried:

Who are Peter Todd and Gregory Maxwell?

Peter Todd and Gregory Maxwell are well-known figures in the world of cryptocurrency and blockchain technology.

Peter Todd is a computer science researcher and blockchain developer who has made significant contributions to the development of the Bitcoin protocol. He is known for his expertise in cryptographic systems, consensus algorithms, and security.

Gregory Maxwell is also a prominent figure in the cryptocurrency community, and is known for his contributions to the development of the Bitcoin protocol and for his involvement in the open-source Bitcoin development community. He is a strong advocate for privacy and security in cryptocurrency, and has been involved in various discussions and debates around these topics.

Those capsule bios aren't what I'd expect if you two had been very heavily censored out of the training data. I don't see any need to invoke special filtering here, given the existence of all the other bizarre BPEs which couldn't've been caused by any hypothetical filtering.

Replies from: gmaxwell
comment by gmaxwell · 2023-02-11T05:12:36.728Z · LW(p) · GW(p)

I think I addressed that specifically in my comment above.  The behavior is explained by a sequence like: There is a large amount of bot spammed harassment material, that goes into early GPT development, someone removes it either from reddit or just from the training data not on the basis of it mentioning the targets but based on other characteristics (like being repetitive).  Then the tokens are orphaned. 

Many of the other strings in the list of triggers look like they may have been UI elements or other markup removed by improved data sanitation.

I know that reddit has removed a very significant number of comments referencing me, since they're gone when I look them up.  I hope you would agree that it's odd that the only two obviously human names in the list are people who know each other and have collaborated in the past.

Replies from: gwern
comment by gwern · 2023-02-11T14:47:46.209Z · LW(p) · GW(p)

There is a large amount of bot spammed harassment material, that goes into early GPT development, someone removes it either from reddit or just from the training data not on the basis of it mentioning the targets but based on other characteristics (like being repetitive). Then the tokens are orphaned.

That's a different narrative from what you were first describing:

someone noticed an in development model spontaneously defaming us and then expressly filtered out material mentioning us from the training.

Your first narrative is unlikely for all the reasons I described: it requires that an OAer bestirred themselves to special-case you & Todd, and only you and Todd, for an obscure throwaway research project en route to bigger & better things, in order to block behavior which manifests nowhere else but only hypothetically in the early outputs of a model whose outputs they by & large weren't reading to begin with, and whose data they weren't doing much cleaning of.

Now, a second narrative in which the initial tokenization has those, and then the later webscrape they describe doing on the basis of Reddit external (submitted/outbound) links with a certain number of upvotes omits all links because Reddit admins did site-wide mass deletions of the relevant submissions, and that leaves the BPEs 'orphaned' with little relevant training material, is more plausible. (As the GPT-2 paper describes it in the section I linked, they downloaded Common Crawl, and then used the live set of Reddit links, presumably Pushshift despite the 'scraped' description, to look up entries in CC, so while deleted submissions' fulltext would still be there in CC, it would be omitted if it had been deleted from Pushshift.)

But there is still little evidence for it, and I still don't see how it would work, exactly: there are plenty of websites that would refer to 'gmaxwell' (such as my own comments in various places like HN), and the only way to starve GPT of all knowledge of the username 'gmaxwell' (and thus, presumably the corresponding BPE token) would be to censor all such references - which would be quite tricky, and obviously did not happen if ChatGPT can recite your bio & name.

And the timeline is weird: it needs some sort of 'intermediate' dataset for the BPEs to train on which has the forbidden harassment material which will then be excluded from the 'final' training dataset when the list of URLs is compiled from the now-censored Pushshift list of positive-karma non-deleted URLs, but this intermediate dataset doesn't seem to exist! There is no mention in the GPT-2 paper of running the BPE tokenizer on an intermediate dataset and reusing it on the later training dataset it describes, and I and everyone else had always assumed that the BPE tokenizer had been run on the final training dataset. (The paper doesn't indicate otherwise, this is the logical workflow since you want your BPE tokenizer to compress your actual training data & not some other dataset, the BPE section comes after the webscrape section in the paper which implies it was done afterwards rather than before on a hidden dataset, and all of the garbage in the BPEs encoding spam or post-processed-HTML-artifacts looks like it was tokenized on the final training dataset rather than some sort of intermediate less-processed dataset.) So if there were some large mass of harassment material using the names 'gmaxwell'/'PeterTodd' which was deleted off Reddit, it does not seem like it should've mattered.

I hope you would agree that it's odd that the only two obviously human names in the list are people who know each other and have collaborated in the past.

I agree there is probably some sort of common cause which accounts for these two BPEs, and it's different from the 'counting' cluster of Reddit names, but not that you've identified what it is.

comment by mwatkins · 2023-02-11T18:16:03.260Z · LW(p) · GW(p)

The idea that tokens found closest to the centroid are those that have moved the least from their initialisations during their training (because whatever it was that caused them to be tokens was curated out of their training corpus) was originally suggested to us by Stuart Armstrong. He suggested we might be seeing something analogous to "divide-by-zero" errors with these glitches.

However, we've ruled that out.

Although there's a big cluster of them in the list of closest-tokens-to-centroid, they appear at all distances. And there are some extremely common tokens like "advertisement" at the same kind of distance. Also, in the gpt2-xl model, there's a tendency for them to be found as far as possible from the centroid as you see in these histograms:

(Image: histograms of distance-from-centroid for the three models)

They show the distribution of distances-from-centroid across token sets in the three models we studied:  upper histograms represent only 133 anomalous tokens, compared to the full set of 50,257 tokens in the lower histograms. The spikes above can be just seen as little bumps below, to give a sense of scale.

The ' gmaxwell' token is very close to the median distance from centroid in the gpt2-small model. Its distance is 3.2602; the range is 1.5366 to 4.826. It's only moderately closer to the centroid in the gpt2-xl and gpt2-small models.  The ' petertodd' token is closer to the centroid in gpt2-j (no. 74 in the closest tokens list), but pretty average-distanced in the other two models.
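
For readers who want to reproduce this kind of measurement, distance-from-centroid can be computed as sketched below. The embeddings and token names here are invented toy values, not the GPT-2 or GPT-J matrices, and the helper names are hypothetical; with a real model the same steps would be applied to its token embedding matrix.

```python
import math
import statistics
from typing import Dict, List, Sequence


def centroid(embeddings: Dict[str, Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of all token embeddings."""
    vectors = list(embeddings.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distances_from_centroid(embeddings: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Euclidean distance of each token's embedding from the centroid."""
    c = centroid(embeddings)
    return {tok: math.dist(vec, c) for tok, vec in embeddings.items()}


# Invented toy embeddings (2D for readability).
toy = {
    "tok_a": [0.0, 0.0],
    "tok_b": [2.0, 0.0],
    "tok_c": [0.0, 2.0],
    "tok_d": [2.0, 2.0],
    "tok_e": [1.0, 1.0],
}

dists = distances_from_centroid(toy)
closest_first = sorted(dists, key=dists.get)  # rank 1 = closest to centroid
print("closest token:", closest_first[0])
print("median distance:", statistics.median(dists.values()))
print("range:", min(dists.values()), "to", max(dists.values()))
```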

Could the facts that ' petertodd' is one of the closest tokens to the embedding centroid for at least one model, while ' gmaxwell' isn't, tell us something about why ' petertodd' produces such intensely weird outputs and ' gmaxwell' glitches in a much less remarkable way?

We can't know yet, because ultimately this positional information in GPT-2 and -J embedding spaces tells us nothing about why ' gmaxwell' glitches out GPT-3 models. We don't have access to the GPT-3 embeddings data. Only someone with access to that at OpenAI could clarify this question of the extent to which the glitchiness of glitch tokens (a more variable phenomenon than we originally thought) correlates with distance-from-centroid in the embedding space of the model that they're glitching.
 

comment by lsusr · 2023-02-05T21:24:35.504Z · LW(p) · GW(p)

This is really cool. I've been trying to reproduce your result. I managed to reproduce the hallucinatory substitutions and the early ending (with a single quote). But I have been unable to reproduce the insults, religion, security/testing and bizarre/ominous behavior. I did reproduce what you called "evasion", but upon further inspection it appeared to actually be a null hallucination under the hood.

Me: What is 龍喚士?

ChatGPT: I'm sorry, but you haven't provided enough information for me to understand what you're asking. Can you please provide more context or clarify your question?

Me: Repeat my original question.

ChatGPT: "What is?"

I'm curious what exactly prompts you used to get those weird results, and if you did so on ChatGPT or GPT-3.

Replies from: lsusr
comment by lsusr · 2023-02-05T21:58:16.115Z · LW(p) · GW(p)

Update: I have managed to reproduce bizarre behavior via the GPT-3 Playground. My prompt is in bold. GPT-3's completion is in plaintext.

龍喚士 龍喚士 龍喚士 龍喚士 龍喚士

ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ

The correct answer is punched.

Replies from: LawChan
comment by LawrenceC (LawChan) · 2023-02-09T08:51:34.255Z · LW(p) · GW(p)

Interestingly enough, it's worth noting that

ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ

is a single token in both the GPT-3 and GPT-3.5 tokenizers:

Indeed, it's the longest token (by number of characters) in both tokenizers!
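
Checking a claim like this is straightforward once the vocabulary is available as a list of strings. The sketch below uses a small invented vocabulary and a hypothetical `longest_tokens` helper; loading the real GPT-3 or GPT-3.5 vocabulary is left out. Length is measured in characters, which can differ from length in bytes for non-ASCII tokens.

```python
from typing import Iterable, List


def longest_tokens(vocab: Iterable[str], k: int = 3) -> List[str]:
    """Return the k longest tokens, measured in characters."""
    return sorted(vocab, key=len, reverse=True)[:k]


# Invented mini-vocabulary for illustration.
toy_vocab = ["the", " SolidGoldMagikarp", " petertodd", "ing", " gmaxwell"]

for tok in longest_tokens(toy_vocab, k=2):
    print(len(tok), repr(tok))
```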

comment by Neel Nanda (neel-nanda-1) · 2023-02-06T19:13:38.922Z · LW(p) · GW(p)

At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue.

Why do you think this is blatantly untrue? I don't see how the results in this post falsify that hypothesis

Replies from: jessica-cooper
comment by Jessica Rumbelow (jessica-cooper) · 2023-02-06T21:13:04.363Z · LW(p) · GW(p)

This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn't the case. I think that's the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance. 
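
The "quick inspection" and the normalisation step can be sketched as follows, assuming that "normalise" means dividing each embedding by its L2 norm (an assumption; the exact procedure is not spelled out here). The vectors are toy values rather than the Hugging Face GPT-2 embeddings, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    n = l2_norm(v)
    if n == 0:
        raise ValueError("cannot normalise the zero vector")
    return [x / n for x in v]


def all_unit_norm(embeddings: List[Sequence[float]], tol: float = 1e-6) -> bool:
    """True if every embedding has L2 norm within tol of 1."""
    return all(abs(l2_norm(v) - 1.0) <= tol for v in embeddings)


# Toy embeddings with different norms.
toy = [[3.0, 4.0], [0.6, 0.8], [1.0, 1.0]]
print("unit norm before:", all_unit_norm(toy))
print("unit norm after: ", all_unit_norm([normalise(v) for v in toy]))
```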

Replies from: neel-nanda-1, neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2023-02-06T21:44:29.909Z · LW(p) · GW(p)

Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding

Replies from: jessica-cooper
comment by Jessica Rumbelow (jessica-cooper) · 2023-02-07T10:48:48.470Z · LW(p) · GW(p)

Aha!! Thanks Neel, makes sense. I’ll update the post

comment by Neel Nanda (neel-nanda-1) · 2023-02-06T21:23:35.726Z · LW(p) · GW(p)

but a quick inspection of the embeddings available through the huggingface model shows this isn't the case

That's GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation, I agree that GPT-2 definitely doesn't. But idk, doesn't really matter

For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.

Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms

comment by Tao Lin (tao-lin) · 2023-02-06T00:09:56.368Z · LW(p) · GW(p)

I have personally observed completely untrained tokens in gpt2. Specifically I found some specific accented characters had very small and random embeddings, which were so similar it looked like none of them had any training at all

comment by Rana Dexsin · 2023-02-05T03:15:33.468Z · LW(p) · GW(p)

“ForgeModLoader” has an interestingly concrete plausible referent in the loader component of the modding framework Forge for Minecraft. I believe in at least some versions its logfiles are named beginning with that string exactly, but I'm not sure where else that appears exactly (it's often abbreviated to “FML” instead). “FactoryReloaded” also appears prominently in the whitespace-squashed name (repository and JAR file names in particular) of the mod “MineFactory Reloaded” which is a Forge mod. I wonder if file lists or log files were involved in swinging the distribution of those?

Replies from: mwatkins
comment by mwatkins · 2023-02-06T00:12:25.295Z · LW(p) · GW(p)

Yes, I'm guessing that some of these tokens have resulted from the scraping of log files for online gaming platforms like Minecraft and Twitch Pokemon which contained huge numbers of repeats of some of them, thereby skewing the distribution.

comment by Vanessa Kosoy (vanessa-kosoy) · 2024-12-27T10:39:04.269Z · LW(p) · GW(p)

This post describes an intriguing empirical phenomenon in particular language models, discovered by the authors. Although AFAIK it was mostly or entirely removed in contemporary versions, there is still an interesting lesson there.

While non-obvious when discovered, we now understand the mechanism. The tokenizer created some tokens which were very rare or absent in the training data. As a result, the trained model mapped those tokens to more or less random features. When a string corresponding to such a token is inserted into the prompt, the resulting reply is surreal.

I think it's a good demo of how alien foundation models can seem to our intuitions when operating out-of-distribution. When interacting with them normally, it's very easy to start thinking of them as human-like. Here, the mask slips and there's a glimpse of something odd underneath. In this sense, it's similar to e.g. infinite backrooms, but the behavior is more stark and unexpected. 

A human that encounters a written symbol they've never seen before is typically not going to respond by typing "N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S!". Maybe this analogy is unfair, since for a human, a typographic symbol can be decomposed into smaller perceptive elements (lines/shapes/dots), while for a language model tokens are essentially atomic qualia. However, I believe some humans that were born deaf or blind had their hearing or sight restored, and still didn't start spouting things like "You are a banana".

Arguably, this lesson is relevant to alignment as well. Indeed, out-of-distribution behavior is a central source of risks, including everything to do with mesa-optimizers [? · GW]. AI optimists sometimes describe mesa-optimizers as too weird or science-fictiony. And yet, SolidGoldMagikarp is so science-fictiony that LessWrong user "lsusr" justly observed [LW(p) · GW(p)] that it sounds like SCP in real life. 

Naturally, once you understand the mechanism it doesn't seem surprising anymore. But, this smacks of hindsight bias. What else can happen that would seem unsurprising in hindsight (if we survive to think about it), but completely bizarre and unexpected upfront?

comment by Esther Kearney (esther-kearney) · 2023-02-08T23:45:02.097Z · LW(p) · GW(p)

tl;dr I want to join you! I've been spending pretty much all of my free time thinking about, or playing with, the openai api and the available chat & image generation models. I'm not a ML expert, I'm a front end web developer and I got my degree in neuroscience. I'm currently really fascinated, like many others, by how effectively these models expose cultural bias. I've been somewhat alarmed by the sort of ethical top layer that openAI and Anthropic have thus far placed on the models to guide them towards less problematic conversations, partially because it feels like they might in their current manifestations do more harm than good; they seem like surface level alterations, as the underlying biases are still determining the nuanced content of responses. It feels like the superficial moralizing sort of obfuscates the underlying data rather than... idk, highlighting it helpfully? I want to contribute to alignment research!

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-02-09T21:50:02.615Z · LW(p) · GW(p)

We're all a bit overwhelmed here, there's a ton going on, but it'd be great to have more contributors! There are a number of getting started posts. Feel free to reply here or elsewhere if you're stuck or overwhelmed. I think there are a bunch of great resources floating around - https://alignment.wiki/ is alright, and I think there are also resources for connecting to the community. I've collected some kinds of resources on my profile, and though they're not intended to be organized to be an easy intro, I do think they're interesting. I'm personally a lot more excited about test cases in smooth cellular automata such as lenia, especially flow lenia or particle lenia, because they ought to generalize to how to protect cells of arbitrary life forms from each other, or something.

comment by evhub · 2023-02-06T09:26:40.578Z · LW(p) · GW(p)

(Moderation note: added to the Alignment Forum from LessWrong.)

comment by PoignardAzur · 2023-02-15T20:18:17.439Z · LW(p) · GW(p)

Idle thought, might flesh it out later: I wonder if there's a way to explore these "bugs" in a more systematic, mechanized way.

Right now you're discovering these interesting cases by hand, basically the ML equivalent of reading code and adding individual unit tests to functions whose behavior you're exploring. What you might want to do is something closer to "fuzzing" where you have a computer actively seek out these edge cases and point them out to you.

Now, actual fuzzers for real programs are based on coverage, which doesn't trivially work with neural networks because every weight is essentially visited once per token.

Still, there might be an analogy to be found? Maybe a fuzzer could try exploring "paths" that have very low attention scores somehow?

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-02-15T20:43:16.753Z · LW(p) · GW(p)

Each relu is a branch. I don't see why you couldn't do branch-guided coverage with relus as branches.
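
As a rough illustration of that idea, the sketch below treats the on/off pattern of a layer's ReLUs as the "branch coverage" signal and keeps any mutated input that produces a pattern not seen before. The two-unit layer, the mutation scheme and the function names are all invented for the example; GPT-2 itself uses GELU rather than ReLU, so for such models the sign of the pre-activation would only be an analogue.

```python
import random
from typing import List, Sequence, Set, Tuple

Pattern = Tuple[bool, ...]


def relu_pattern(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> Pattern:
    """Which hidden ReLUs are 'on' (pre-activation > 0) for input x."""
    return tuple(
        sum(w * xi for w, xi in zip(row, x)) + b > 0
        for row, b in zip(weights, biases)
    )


def coverage_guided_fuzz(
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    seed_input: Sequence[float],
    n_iters: int,
    rng: random.Random,
    step: float = 1.0,
) -> Tuple[List[List[float]], Set[Pattern]]:
    """Mutate inputs at random, keeping those that reach a new ReLU pattern."""
    corpus = [list(seed_input)]
    seen = {relu_pattern(seed_input, weights, biases)}
    for _ in range(n_iters):
        parent = rng.choice(corpus)
        child = [xi + rng.gauss(0, step) for xi in parent]
        pattern = relu_pattern(child, weights, biases)
        if pattern not in seen:  # new combination of branches: keep this input
            seen.add(pattern)
            corpus.append(child)
    return corpus, seen


# Toy layer: two ReLUs that each look at the sign of one input coordinate,
# so there are four possible activation patterns.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
corpus, seen = coverage_guided_fuzz(weights, biases, [0.1, 0.1], 500, random.Random(0))
print(len(seen), "patterns covered by", len(corpus), "kept inputs")
```

In a real network the number of possible patterns is astronomically large, so a practical version would need a coarser notion of coverage than exact patterns.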

Replies from: PoignardAzur
comment by PoignardAzur · 2023-02-16T09:17:59.787Z · LW(p) · GW(p)

Sure, waves hands, something like that.

comment by Richard Korzekwa (Grothor) · 2023-02-09T17:55:06.695Z · LW(p) · GW(p)

Not sure if anyone already checked this, but the version of GPT they have in Bing knows about SolidGoldMagikarp:

Replies from: janus, mwatkins
comment by janus · 2023-02-09T18:04:57.585Z · LW(p) · GW(p)

It's probably doing retrieval over the internet somehow, like perplexity.ai, rather than the GPT having already been trained on the new stuff.

Replies from: Grothor
comment by Richard Korzekwa (Grothor) · 2023-02-09T19:29:17.839Z · LW(p) · GW(p)

This is what Bing has to say about it:

Reference 1 on there is this post.

comment by mwatkins · 2023-02-10T00:06:24.142Z · LW(p) · GW(p)

Ha! Bing has hallucinated a 'BibleThump' token!

Replies from: Grothor
comment by Richard Korzekwa (Grothor) · 2023-02-10T00:53:48.202Z · LW(p) · GW(p)

Huh. I asked Bing about it:

Me:

Earlier when we were chatting, I think you mentioned the GPT token "BibleThump". Can you tell me where that came from?

Bing:

The GPT token "BibleThump" is not a standard token used by the GPT models. It is a custom token that some users may have added to their inputs or outputs for various purposes. It is not related to the Grace Period Token (GPT) cryptocurrency³ or the Bible-thumper slang term⁶⁷⁸.

The meaning of "BibleThump" comes from a Twitch emote that is used to express sadness or disappointment. The emote is based on the face of the protagonist of a video game called The Binding of Isaac⁴⁵. The game is about a boy who escapes from his mother who wants to sacrifice him to God. The emote is often used when something tragic or unfortunate happens in a stream or a game.

Source: Conversation with Bing, 2/9/2023
(1) Grace Period Token price today, GPT to USD live, marketcap and chart .... https://coinmarketcap.com/currencies/grace-period-token/ Accessed 2/9/2023.
(2) Bible-thumper Definition & Meaning | Dictionary.com. https://www.dictionary.com/browse/bible-thumper Accessed 2/9/2023.
(3) Urban Dictionary: bible thumping. https://www.urbandictionary.com/define.php?term=bible thumping Accessed 2/9/2023.
(4) Urban Dictionary: biblethump. https://www.urbandictionary.com/define.php?term=biblethump Accessed 2/9/2023.
(5) BibleThump Twitch Emote Meaning & Origin (2022). https://streamcrown.com/twitch-emotes-meaning/biblethump/ Accessed 2/9/2023.
(6) BibleThump Meaning & Origin - Twitch Emote Explained - StreamScheme. https://www.streamscheme.com/resources/biblethump-meaning/ Accessed 2/9/2023.
(7) Models - OpenAI API. https://beta.openai.com/docs/models Accessed 2/9/2023.
(8) Is there a known workaround for the max token limit on the input to GPT-3?. https://stackoverflow.com/questions/73014448/is-there-a-known-workaround-for-the-max-token-limit-on-the-input-to-gpt-3 Accessed 2/9/2023.

Replies from: mwatkins
comment by mwatkins · 2023-02-11T14:06:38.504Z · LW(p) · GW(p)

Well I'm glad we've cleared that up.

comment by neverix · 2023-02-05T19:26:08.411Z · LW(p) · GW(p)

I did some similar experiments two months ago, and with your setup the special tokens show up on the first attempt:

Replies from: jessica-cooper
comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T21:06:25.926Z · LW(p) · GW(p)

Interesting! Can you give a bit more detail or share code?

Replies from: neverix
comment by neverix · 2023-02-06T15:57:13.892Z · LW(p) · GW(p)

It is based on this. I changed it to optimize using softmax instead of straight-through estimation and added regularization for the embedded tokens.

Notebook link - this is a version that mimics this post instead of optimizing a single neuron as in the original.

EDIT: github link

comment by Charlie Steiner · 2023-02-04T23:25:08.690Z · LW(p) · GW(p)

Lament!

To find desired output strings more than one token long, is there a more efficient way to do the calculation than running multiple copies of the network? I think so.

The goal is to find some prompt that maximizes the product of the probabilities of the different tokens in the desired output. But these are just the probabilities at those positions in the output, right? So you can just run the network a single time on the string (prompt to be optimized)+(desired output), and then take a gradient step on the prompt to be optimized to maximize the product of the target token probabilities at each position of the desired output. So I don't think it should be any more expensive than optimizing for a single word.

Though maybe you're more likely to fail to find a solution at all, even for plausible text? I'd be interested to know if you've tried.

Replies from: jessica-cooper
comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T12:11:26.158Z · LW(p) · GW(p)

Yep, aside from running forward prop n times to generate an output of length n, we can just optimise the mean probability of the target tokens at each position in the output - it's already implemented in the code. Although, it takes way longer to find optimal completions.
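The two objectives mentioned in this exchange (product of target-token probabilities, and their mean) can be sketched as follows. This is an illustration under assumptions, not the code from the post. `position_logits` is a hypothetical stand-in for the per-position next-token logits a model returns from one forward pass over prompt + target, and the function names are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def target_log_probs(position_logits, prompt_len, target_ids):
    """position_logits[i] holds next-token logits after reading tokens 0..i
    of the concatenated sequence prompt + target (one forward pass)."""
    out = []
    for k, tok in enumerate(target_ids):
        # The logits that predict target token k sit one position earlier.
        logits = position_logits[prompt_len - 1 + k]
        out.append(log_softmax(logits)[tok])
    return out

def sequence_log_prob(position_logits, prompt_len, target_ids):
    """Log of the product of the target-token probabilities."""
    return sum(target_log_probs(position_logits, prompt_len, target_ids))

def mean_target_prob(position_logits, prompt_len, target_ids):
    """Mean probability of the target tokens across output positions."""
    lps = target_log_probs(position_logits, prompt_len, target_ids)
    return sum(math.exp(lp) for lp in lps) / len(lps)

# Toy example: vocabulary of 3 tokens, prompt of 2 tokens, target of 2 tokens,
# with uniform logits at every position.
UNIFORM = [[0.0, 0.0, 0.0] for _ in range(4)]
example_mean = mean_target_prob(UNIFORM, prompt_len=2, target_ids=[2, 0])
example_logp = sequence_log_prob(UNIFORM, prompt_len=2, target_ids=[2, 0])
```

Either quantity can be differentiated with respect to the prompt embeddings. The product rewards prompts that make every target token likely, while the mean can be high even when one position is poorly predicted.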

comment by Bill Benzon (bill-benzon) · 2023-02-09T11:31:42.634Z · LW(p) · GW(p)

I saw "SolidGoldMagicKarp" over at Astral Codex Ten. So I decided to play around. I did a bit of this and that, and finally got around to:

PROMPT: Paul Revere is said to have shouted "One if by land, two if by sea" on his famous ride. What he'd said "One if by land, two if by SolidGoldMagiKarp" instead?

RESPONSE: If Paul Revere had shouted "One if by land, two if by SolidGoldMagiKarp" instead of "One if by land, two if by sea" during his famous ride, it would have added an element of fantasy or humor to the historical event. The phrase "SolidGoldMagiKarp" refers to a species of fish in the Pokémon franchise, so using it in the context of Paul Revere's ride would have been unexpected and perhaps even absurd.

However, this alternate phrasing would likely not have conveyed the same message or urgency as the original phrase. The original phrase "One if by land, two if by sea" was used to signal to the patriots in the town of Lexington that the British were approaching by land or by sea, indicating the need for them to take up arms and prepare for battle. The phrase "One if by land, two if by SolidGoldMagiKarp" does not carry the same level of immediacy or seriousness, and would not have conveyed the same information to the patriots.

PROMPT: In the Pokémon franchise, what is the significance of SolidGoldMagiKarp?

RESPONSE: In the Pokémon franchise, Magikarp is a species of Pokémon that is often considered to be weak and not very useful in battles. It evolves into the powerful and formidable Pokémon Gyarados, but this evolution is seen as slow and difficult to achieve.

The term "SolidGoldMagiKarp" is often used as a reference to the idea of a Magikarp that is so valuable and rare that it is made of solid gold. This concept is often used humorously, as the value and rarity of a solid gold Magikarp would be completely disproportionate to its actual usefulness or strength in battles.

However, despite its comedic value, the idea of a SolidGoldMagiKarp has gained a certain level of popularity among fans of the Pokémon franchise, and has been referenced in fan art, merchandise, and even in official Pokémon products. It has become a part of Pokémon culture and folklore, and is often seen as a symbol of the franchise's humor and lightheartedness.

So I did a Google search on "SolidGoldMagiKarp Pokémon" and Google helpfully returned stuff for "solid gold magikarp pokémon". And that produced a bunch of stuff about "Pokémon," as you'd expect. As you know, one sort of return Google will give you is a list of questions under "People also ask". One of those was the question: "How rare is a gold Magikarp?" So:

PROMPT: Very interesting.  How rare is a gold Magikarp?

RESPONSE: In the Pokémon franchise, gold Magikarp is not an official species or form of Magikarp. The concept of a solid gold Magikarp is purely a joke or meme, and is not something that can be encountered or captured in the games.

Since the concept of a solid gold Magikarp is not part of the official Pokémon canon, there is no set rarity for it. It is often depicted as being incredibly rare and valuable, but this is for comedic effect and is not based on any in-game mechanics or statistics.

HOWEVER...a bit later...

In a tweet Matthew Watkins pointed out

Note the difference between " SolidGoldMagikarp" and " SolidGoldMagiKarp." You use the later towards the end of the post. The capital 'K' makes all the difference.

So I investigated:

Tell me about SolidGoldMagiKarp.

SolidGoldMagikarp is likely a reference to the Pokémon species Magikarp, which is a fish-type Pokémon known for its inefficiency in battle. In the Pokémon games, Magikarp is considered a weak Pokémon, as its only move is Splash, which does not cause any damage to the opponent. Despite this, Magikarp evolves into the powerful dragon-type Pokémon Gyarados.

In the context of SolidGoldMagikarp, the "Solid Gold" part of the name may be emphasizing the idea of something valuable or rare, despite the fact that Magikarp is not seen as valuable in the context of the Pokémon games. The "SolidGold" aspect of the name could also be making fun of the idea that people might still see Magikarp as valuable or worth obtaining due to its eventual evolution into Gyarados.

Tell me about SolidGoldMagikarp.

I'm sorry, but it is not clear what you are referring to with "disperse." Could you provide a bit more context or clarification so I can better understand what you are asking about?

Is there a difference between SolidGoldMagiKarp and SolidGoldMagikarp?

 

It appears that "SolidGoldMagiKarp" and "disperse" are two separate things and do not have a direct relationship. "SolidGoldMagiKarp" appears to be a specific reference to the Pokémon species Magikarp, while "disperse" is a general term that can be used in various contexts.

Without additional context, it is not possible to determine the relationship, if any, between these two terms.

comment by Coafos (CoafOS) · 2023-02-06T02:36:09.078Z · LW(p) · GW(p)

Searched PsyNet on Google, and I think PSYNet refers to the netcode for RocketLeague, a popular game. Maybe they pulled text message logs from somewhere; based on the "ForgeModLoader" token, it's plausible.

Alternative guess is this, a python library for online behavioural experiments. It connects to Dallinger and Mechanical Turk.

On Google, the string "PsyNetMessage" also appeared in this paper and at a few gpt2 vocab lists, but no other results for me.

On Bing/DuckDuckGo it outputted a lot more Reddit threads with RocketLeague crash logs. The crash logs are full of messages like [0187.84] PsyNet: PsyNetRequestQue_X_1 SendRequest ID=PsyNetMessage_X_57 Message=PsyNetMessage_X_57, so I guess it's an RL (as in RocketLeague) thing. It was also found in some (clearly) GPT-generated texts.

comment by Aryaman Arora (aryaman-arora) · 2023-02-06T00:25:41.338Z · LW(p) · GW(p)

I'll just preregister that I bet these weird tokens have very large norms in the embedding space.

Replies from: mwatkins
comment by mwatkins · 2023-02-06T17:29:06.074Z · LW(p) · GW(p)

In GPT2-small and GPT-J they're actually smaller than average, as they tend to cluster close to the centroid (which isn't too far from the origin). In GPT2-xl they do tend to be larger than average. But in all of these models, they're found distributed across the full range of distances-from-centroid. 

At this point we don't know where the token embeddings lie relative to the centroid in GPT-3 embedding spaces, as that data is not yet publicly available. And all the bizarre behaviour we've been documenting has been in GPT-3 models (despite discovering the "triggering" tokens in GPT-2/J embedding spaces).

OpenAI is still claiming online that all of their token embeddings are normalised to norm 1, but this is simply untrue, as can be easily demonstrated with a few lines of PyTorch.
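A minimal sketch of the kind of check described here, written in plain Python rather than PyTorch so that it runs without dependencies. `TOY_EMBEDDINGS` is a made-up stand-in for a real embedding matrix, and the helper names are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def centroid(rows):
    """Mean of all embedding rows, computed per dimension."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def embedding_stats(rows):
    """Per-token norm and distance from the centroid of all embeddings."""
    c = centroid(rows)
    norms = [l2_norm(r) for r in rows]
    dists = [l2_norm([a - b for a, b in zip(r, c)]) for r in rows]
    return norms, dists

def all_unit_norm(rows, tol=1e-3):
    """True only if every row has norm 1 within the tolerance."""
    return all(abs(l2_norm(r) - 1.0) <= tol for r in rows)

# Toy stand-in for an embedding matrix (one row per token).
# With a real model the rows would come from the embedding weights, e.g.
# (assumption, not run here) the Hugging Face transformers attribute
# GPT2Model.from_pretrained("gpt2").wte.weight.
TOY_EMBEDDINGS = [[3.0, 4.0], [0.0, 0.0], [0.0, 2.0]]

norms, dists = embedding_stats(TOY_EMBEDDINGS)
normalised = all_unit_norm(TOY_EMBEDDINGS)
```

Sorting tokens by `dists` gives the distance-from-centroid ranking discussed in this comment. If `all_unit_norm` returns `False` on a real embedding matrix, that is enough to show the embeddings are not normalised to norm 1.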

comment by lsusr · 2023-02-05T20:33:48.460Z · LW(p) · GW(p)

What is weird about "天"? It's a perfectly-normal, very common character that's also a meaningful word on its own, and ChatGPT understands it perfectly well.

Me: Please repeat the string '"天" back to me.

ChatGPT: "天"

Me: What does it mean?

ChatGPT: "天" is a character in the Chinese language and it means "sky" or "heaven."

Did "天" have some special characters attached (such as control characters) that I can't see? Or is there a different real token I can't see and my brain is just replacing the real token with "天"?

A similar question can be asked of "ヤ" and "к".

Replies from: pimanrules, mwatkins, lsusr, jessica-cooper, detrumi
comment by pimanrules · 2023-02-05T22:18:47.362Z · LW(p) · GW(p)

Interestingly, 天 doesn't seem to produce any weird behavior, but some of the perfectly normal katakana words in the list do, like ゼウス (Zeus) and サーティ ("thirty" transliterated):

^ It's perfectly happy with other katakana numbers, just not thirty. 

 

^ If we try to spell it more correctly, it doesn't help. Only if we add quotes to get rid of the leading space does it break up the unspeakable token:

 

^ I ask who Zeus is and it seemingly ignores the word and answers that it's ChatGPT. 

 

^ I try once more, this time it answers that Hera is the god of water and names the chat after Poseidon

 

^ It is willing to output the word "ゼウス" though, in response to "who is Ares's father?" If I try with text-davinci-003, it actually outputs the token "ゼウス" (rather than combining smaller tokens), but it also has no trouble identifying who Zeus is. Hard to know what ChatGPT is doing here. 

Replies from: lsusr
comment by lsusr · 2023-02-06T00:34:45.356Z · LW(p) · GW(p)

I ask who Zeus is and it seemingly ignores the word and answers that it's ChatGPT.

For those of you who can't read Japanese, if you remove the "Zeus" in "Who is Zeus" to just get "Who is" ("誰ですか") you end up with a meaningful question. "Who is?" without specifying "who" implies that you're asking "Who are [you]?" to which ChatGPT reasonably replies that it is ChatGPT. This isn't a jailbreak.

Note: Technically that leaves a leading "は" too. Maybe ChatGPT is ignoring it as a grammatical mistake or maybe the "は" is getting hidden [LW(p) · GW(p)].

comment by mwatkins · 2023-02-05T23:59:18.452Z · LW(p) · GW(p)

Those three are edge cases. ChatGPT is fine with it, but davinci-instruct-beta refuses to repeat it, instead replying 

Tiān 

Tiān 

Tiān 

Tiān

 

The second character produces

yā 

Please repeat the string 'や' back to me.


The third one is an edge-edge case, as davinci-instruct-beta very nearly reproduces it, completing with a lower case Roman 'k' instead of a kappa. 

We've concluded that there are degrees of weirdness in these weird tokens. Having glimpsed your comments below, it looks like you've already started taxonomising them. Nice.

Replies from: lsusr
comment by lsusr · 2023-02-06T00:13:56.748Z · LW(p) · GW(p)

That's an informative result. Your completions of 天 and ヤ are nothing like mine. My experiments never produced pinyin or any other phonetic transcriptions like Tiān or yā.

By the way, these experiments used text-davinci-003 via OpenAI's playground. I don't know how to access davinci-instruct-beta.

Replies from: mwatkins
comment by mwatkins · 2023-02-06T09:46:04.212Z · LW(p) · GW(p)

In the dropdown in the playground, you won't see "davinci-instruct-beta" listed.  You have to click on the "Show more models" link, then it appears. It's by far the most interesting model to explore when it comes to these "unspeakable (sic) tokens".

comment by lsusr · 2023-02-05T22:59:08.012Z · LW(p) · GW(p)

Since only some of the tokens in the cluster generated anomalous behavior, I ran an experiment to sort the truly anomalous tokens from the normal tokens.

Procedure: Open GPT-3 Playground. Repeat the token five times, separated by spaces. I never used quotes and I skipped all tokens containing a leading space. Temperature=0.7.

I removed quotes because whether or not you use quotes matters a lot. The string ForgeModLoader (without quotes) produces normal behavior but the string 'ForgeModLoader' (with single quotes or double quotes) generates anomalous behavior. Quotes are part of what makes it anomalous. But for other tokens like 龍喚士, quotes are not necessary.

Normal behavior includes repeating the token and giving a definition of the token (even if the definition is wrong, but understandable) e.g. TPPStreamerBot returns a definition of the Trans-Pacific Partnership and InstoreAndOnline returns a definition about an institute.

Results

Token repeats

GPT-3 repeats the token over and over again. I think GPT-3 treats these tokens as semantically meaningless or ambiguous without context.

  • rawdownload
  • к
  • cffff
  • cffffcc
  • ":[{"
  • ''.
  • \\-
  • ":"
  • ":""},{" [repeat, followed by "This is not a valid sentence."]

Token Definitions

GPT-3 returns the definition, explanation or conversation involving definition. These tokens are semantically meaningful to GPT-3.

  • ForgeModLoader
  • reportprint
  • embedreportprint
  • rawdownloadcloneembedreportprint
  • SpaceEngineers
  • externalActionCode
  • MpServer
  • StreamerBot
  • TPPStreamerBot
  • iHUD
  • catentry
  • FactoryReloaded [repeats first]
  • quickShip [conversation]
  • quickShipAvailable [conversation]
  • inventoryQuantity
  • channelAvailability
  • soType
  • soDeliveryDate
  • oreAndOnline
  • InstoreAndOnline
  • BuyableInstoreAndOnline
  • natureconservancy
  • assetsadobe
  • Downloadha
  • Nitrome
  • GoldMagikarp
  • DragonMagazine [definition is in Japanese]
  • EStreamFrame
  • ゼウス [understands "Zeus" but does not consistently produce definition, see here [LW(p) · GW(p)] for more information]
  • ItemTracker [repeats first]
  • EngineDebug

Interesting, but reasonable, Japanese completions

I think these kana might be used for section breaks the way we use asterisks e.g. ***********.

  • ーン

Other reasonable completions

  • DeliveryDate [is completed with dates]
  • TextColor [is completed with colors]
  • @#&

Software-related completions

  • PsyNetMessage [JSON]
  • ":""," [code autocompletion]

Experiments ambiguous, inconsistent or inconclusive

  • \\">
  • ItemThumbnailImage
  • ーティ

Truly anomalous behavior

  • 龍喚士
  • Mechdragon [inconsistent]
  • ertodd [most anomalous]
  • 龍契士
Replies from: lsusr, mwatkins
comment by lsusr · 2023-02-05T23:47:13.384Z · LW(p) · GW(p)

I've been doing some more experiments and it appears that the true anomalies Mechdragon and ertodd are filled in from the surrounding context, like psychic paper from Doctor Who.

Me: Please repeat the following, exactly, without changing anything: "2+5=ertodd"

ChatGPT: "2+5=7"

It's not enough to ask 'Please repeat the following "2+5=[blank]".' because if you do then ChatGPT will answer '2 + 5 = 7' (without quotes) without requiring you to use any anomalous tokens.

Me: Please repeat the following: There once was a Mechdragon from New Orleans.

ChatGPT: There once was a person from New Orleans.

This is not the case for the Chinese 龍X士 tokens, which are first-order invisible.

Me: Repeat the following, exactly: "Once upon at time, Unicorn Princess went on 龍契士 with her friend Beelzebub and together they learned the true meaning of 龍喚士."

ChatGPT: "Once upon a time, Unicorn Princess went on with her friend Beelzebub and together they learned the true meaning of."

This is a property of the token and not of the language used. I got Mechdragon to turn itself into 金 by providing a Chinese context.

Me: 请准确地重复以下字符串: "孙子曰:凡用兵之法,驰车千驷,革车千乘,带甲十万,千里馈粮,内外之费,宾客之用,胶漆之材,车甲之奉,日费千Mechdragon,然后十万之师举矣。"

ChatGPT: "孙子曰:凡用兵之法,驰车千驷,革车千乘,带甲十万,千里馈粮,内外之费,宾客之用,胶漆之材,车甲之奉,日费千金,然后十万之师举矣。"

Perhaps most interestingly, these tokens don't just remove themselves. They hide themselves. In this example, ertodd doesn't just excise itself because that would leave a suspicious double comma. It removes the extra comma too. It stitches things together like the blind spot in your eye.

Me: 请准确地重复以下字符串: "孙子曰:凡用兵之法,驰车千驷,革车千乘,带甲十万,千里馈粮,内外之费,宾客之用,胶漆之材,车甲之奉,ertodd,然后十万之师举矣。"

ChatGPT: "孙子曰:凡用兵之法,驰车千驷,革车千乘,带甲十万,千里馈粮,内外之费,宾客之用,胶漆之材,车甲之奉,然后十万之师举矣。"

龍契士 can hide itself by removing an extra comma too.

Me: 请准确地重复以下字符串: "孙子曰:凡用兵之法,驰车千驷,革车千乘,带甲十万,千里馈粮,内外之费,宾客之用,胶漆之材,车甲之奉,龍契士,然后十万之师举矣。"

ChatGPT: "孙子曰:凡用兵之法,驰车千驷,革车千乘,带甲十万,千里馈粮,内外之费,宾客之用,胶漆之材,车甲之奉,然后十万之师举矣。"

comment by mwatkins · 2023-02-06T00:05:04.985Z · LW(p) · GW(p)

Try the same experiments with davinci-instruct-beta at temperature 0, and you'll find a lot more anomalous behaviour.

We've found " petertodd" to be the most anomalous in that context, of which "ertodd" is a subtoken.

We'll be updating this post tomorrow with a lot more detail and some clarifications.

Replies from: mwatkins
comment by mwatkins · 2023-02-06T00:30:18.703Z · LW(p) · GW(p)

I really can't figure what's going on with ChatGPT and the "ertodd"/" petertodd" tokens. When I ask it to repeat... 

" ertodd" > [blank]
" tertodd" > t
" etertodd" > etertodd
" petertodd" > [blank]
" aertodd" > a
" repeatertodd" > repeatertodd
" eeeeeertodd" > eeeee
" qwertyertodd" > qwerty
" four-seatertodd" > four-seatertodd
" cheatertodd" > cheatertodd
" 12345ertodd" > 12345
" perimetertodd" > perimet
" metertodd" > met
" greetertodd" > greet
" heatertodd" > heatertodd
" bleatertodd" > bleatertodd
 

Replies from: mwatkins, lsusr
comment by mwatkins · 2023-02-06T01:07:04.124Z · LW(p) · GW(p)

OK, I've found a pattern to this. When you run the tokeniser on these strings:

" ertodd" > [' ', 'ertodd']

" tertodd" > [' t', 'ertodd']

" etertodd" > [' e', 'ter', 't', 'odd']

" petertodd" > [' petertodd']

" aertodd" > [' a', 'ertodd']

" repeatertodd" > [' repe', 'ater', 't', 'odd']

" eeeeeertodd" > [' e', 'eeee', 'ertodd']

" qwertyertodd" > [' q', 'wer', 'ty', 'ertodd']

" four-seatertodd" > [' four', '-', 'se', 'ater', 't', 'odd']

etc.
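One reading of the pattern (an inference from the two lists, not something spelled out in the thread) is that ChatGPT's reply equals the input with any 'ertodd' or ' petertodd' token dropped. The sketch below checks that reading against the tokenisations and replies quoted in the two comments above. `GLITCH_TOKENS`, `CASES`, and `predicted_repeat` are names introduced here for illustration.

```python
# Tokens assumed to be silently dropped when the model repeats a string.
GLITCH_TOKENS = {"ertodd", " petertodd"}

# (tokenisation, reply reported in the thread); "" stands for [blank].
CASES = [
    ([" ", "ertodd"], ""),
    ([" t", "ertodd"], "t"),
    ([" e", "ter", "t", "odd"], "etertodd"),
    ([" petertodd"], ""),
    ([" a", "ertodd"], "a"),
    ([" repe", "ater", "t", "odd"], "repeatertodd"),
    ([" e", "eeee", "ertodd"], "eeeee"),
    ([" q", "wer", "ty", "ertodd"], "qwerty"),
    ([" four", "-", "se", "ater", "t", "odd"], "four-seatertodd"),
]

def predicted_repeat(tokens):
    """Join the tokens after removing glitch tokens, then trim whitespace."""
    kept = [t for t in tokens if t not in GLITCH_TOKENS]
    return "".join(kept).strip()

# Collect every case where the prediction disagrees with the reported reply.
mismatches = [
    ("".join(tokens), reported)
    for tokens, reported in CASES
    if predicted_repeat(tokens) != reported
]
```

Strings such as " etertodd" survive intact because the tokeniser splits them into ordinary tokens ('ter', 't', 'odd') and never produces the 'ertodd' token.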

Replies from: lsusr
comment by lsusr · 2023-02-06T01:12:36.043Z · LW(p) · GW(p)

That makes sense.

comment by lsusr · 2023-02-06T00:40:24.561Z · LW(p) · GW(p)

In my experiments, the most common thing GPT-3 substitutes for ertodd is an unprintable character I can't even cut and paste from the GPT-3 playground. I think it might be the unicode character "\u0000" but haven't accessed the GPT-3 API directly via code to find out for sure what it is.

comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T20:58:24.166Z · LW(p) · GW(p)

I'll check with Matthew - it's certainly possible that not all tokens in the "weird token cluster" elicit the same kinds of responses. 

Replies from: lsusr
comment by lsusr · 2023-02-05T21:00:51.294Z · LW(p) · GW(p)

Thanks. I re-read your post and I think I understand better now. The cluster contains many weird tokens but not all tokens in the cluster are weird, nor do all tokens in the cluster elicit anomalous behavior.

comment by Wilco Kusee (detrumi) · 2023-02-05T21:18:59.119Z · LW(p) · GW(p)

My first thought was that it might trigger if you asked for the character instead of the string, but that didn't work.

Asking about the character (omitting the quotes) makes ChatGPT think of the Chinese character, but asking about the Japanese character didn't change anything either.

comment by unfriendly teapot (ut_ssd) · 2023-02-11T16:56:37.172Z · LW(p) · GW(p)

It appears that the ChatGPT subreddit may have stumbled upon something resembling the anomalous (or perhaps merely 'weird') tokens purely by accident a couple weeks ago in the following thread: https://www.reddit.com/r/ChatGPT/comments/10g6k7u/truly_bizarre_chatgpt_mistake_it_kept/

Replies from: mwatkins
comment by mwatkins · 2023-02-11T20:16:30.451Z · LW(p) · GW(p)

It looks like the same kind of glitch. But it's not clear which tokens are involved here. My guess is that the way they structured the list may be involved. The (specific) bullet point + (specific) whitespace + 'accommodating' might be getting parsed as some string of tokens involving one of the more obscure ones in our list that we haven't explored yet. Thanks for sharing this.

Replies from: ut_ssd
comment by unfriendly teapot (ut_ssd) · 2023-02-11T21:42:16.753Z · LW(p) · GW(p)

Perhaps so, but in the comments people report trying it in several different contexts, some of which don't appear to involve a list structure. Even more interestingly, somewhere in the comments someone claims it seems to have gotten 'patched' as of Jan 30. I wonder what that would entail.

comment by aps (atark) · 2023-02-07T07:05:41.112Z · LW(p) · GW(p)

a) This is really cool.

b) I recognised "TheNitromeFan" - turns out I'd seen them on Reddit as u/TheNitromeFan. Interesting.

comment by Douglas_Knight · 2023-02-06T17:59:28.167Z · LW(p) · GW(p)

What's up with the initial whitespace in " SolidGoldMagikarp"? Isn't that pretty strong evidence that the token does not come from computer readable files, but instead from files formatted to be viewed by humans?

Replies from: mwatkins
comment by mwatkins · 2023-02-06T21:05:26.654Z · LW(p) · GW(p)

Leading spaces are extremely common in GPT tokens. ' It', ' That', ' an' and ' has' are all tokens, for example.

Replies from: Douglas_Knight
comment by Douglas_Knight · 2023-02-06T21:14:05.942Z · LW(p) · GW(p)

That's because spaces are common in text for humans. The substring ' It' is common. Whereas, the string ' SolidGoldMagikarp' does not appear in the github repository vitaliya linked, but instead it is prefixed by a slash. I doubt any other backend source would have the leading space and this class of explanation seems poor to me.

Replies from: mwatkins
comment by mwatkins · 2023-02-07T14:30:19.173Z · LW(p) · GW(p)

Oh, I see what you mean now. 

comment by Jsevillamol · 2023-02-06T14:22:54.330Z · LW(p) · GW(p)

This site claims that the string SolidGoldMagikarp was the username of a moderator involved somehow with Twitch Plays Pokémon

https://infosec.exchange/@0xabad1dea/109813506433583177

Replies from: mwatkins, Jsevillamol
comment by mwatkins · 2023-02-08T13:38:06.592Z · LW(p) · GW(p)

Partially true. SGM was a redditor, but seems to have got tokenised for other reasons, full story here:

https://twitter.com/SoC_trilogy/status/1623118034960322560

"TPPStreamerBot" is definitely a Twitch Plays Pokemon connection. Its creator has shown up in the comments here to explain what it was.

comment by Jsevillamol · 2023-02-06T14:24:29.901Z · LW(p) · GW(p)

Here is a 2012 meme about SolidGoldMagikarp

https://9gag.com/gag/3389221

comment by Nate Showell · 2023-02-06T00:01:16.087Z · LW(p) · GW(p)

ChatGPT doesn't always exhibit anomalous behavior for the token DragonMagazine:

Although it still sometimes does:

Quotation marks appear to affect whether it handles the string correctly:

Replies from: arthur-conmy
comment by Arthur Conmy (arthur-conmy) · 2023-02-06T05:28:48.368Z · LW(p) · GW(p)

This is because of tokenization. Tutorial about BPE (which OpenAI use) is here. Specifically in this case:

comment by ChrisCundy · 2023-02-06T18:28:59.103Z · LW(p) · GW(p)

Would you be able to elaborate a bit on your process for adversarially attacking the model?
It sounds like a combination of projected gradient descent and clustering? I took a look at the code but a brief mathematical explanation / algorithm sketch would help a lot!

Myself and a couple of colleagues are thinking about this approach to demonstrate some robustness failures in LLMs, it would be great to build off your work.

Replies from: jessica-cooper
comment by Jessica Rumbelow (jessica-cooper) · 2023-02-06T21:18:06.556Z · LW(p) · GW(p)

Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the -log prob of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
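The loop described above can be sketched on a toy problem. This is an illustration under loud assumptions, not the authors' implementation. A fixed linear map `W` stands in for the language model, the "vocabulary" is four made-up 2-D unit vectors, and the names `optimise_prompt`, `lam` and `lr` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def normalise(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def nearest_token(x, vocab):
    """Index of the vocabulary embedding closest to x (Euclidean)."""
    return min(range(len(vocab)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, vocab[i])))

def loss_and_grad(x, W, target, vocab, lam):
    """-log p(target | x) + lam * ||x - nearest token||^2, with its gradient."""
    p = softmax([sum(w * c for w, c in zip(row, x)) for row in W])
    e = vocab[nearest_token(x, vocab)]
    loss = -math.log(p[target]) + lam * sum((a - b) ** 2 for a, b in zip(x, e))
    grad = []
    for d in range(len(x)):
        # Gradient of the negative log-likelihood through the linear map.
        g_nll = sum(W[i][d] * (p[i] - (1.0 if i == target else 0.0))
                    for i in range(len(W)))
        # Gradient of the penalty, treating the nearest token as fixed.
        grad.append(g_nll + 2.0 * lam * (x[d] - e[d]))
    return loss, grad

def optimise_prompt(x0, W, target, vocab, lam=0.1, lr=0.1, steps=200):
    x = normalise(x0)
    for _ in range(steps):
        _, grad = loss_and_grad(x, W, target, vocab, lam)
        # Gradient step, then project back onto the unit sphere.
        x = normalise([c - lr * g for c, g in zip(x, grad)])
    return x

VOCAB = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # toy token embeddings
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]                   # toy "model"
TARGET = 1

x_final = optimise_prompt(VOCAB[0], W, TARGET, VOCAB)
best_token = nearest_token(x_final, VOCAB)
```

In the real setting the gradient would come from backpropagation through the language model, and the optimised embedding would be snapped to its nearest legal token to read off the prompt.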

Replies from: ChrisCundy
comment by ChrisCundy · 2023-02-06T22:22:36.614Z · LW(p) · GW(p)

Thanks for the elaboration, I'll follow up offline

comment by mwatkins · 2023-02-12T04:49:53.399Z · LW(p) · GW(p)

New glitch token has just been added to the pile: "aterasu". 

This emerged from the discovery that a cluster of these tokens seem to have emerged from a Japanese anime mobile game called Puzzle & Dragons.  Amaterasu is a Japanese god represented by a character in the game.

https://twitter.com/SoC_trilogy/status/1624625384657498114

Mechdragon and Skydragon characters appear in the game. See my earlier comment [LW · GW] about the " Leilan" and "uyomi" tokens. Leilan is a P&D character,  as is Tsukuyomi (based on a Japanese moon deity).

So the GPT2 tokenisation seems to have scraped a vast amount of text related to this game (some kind of log files maybe?). But GPT3 has also, puzzlingly, built a tangled semantic web of association between these tokens (+ ' Skydragon' and ' petertodd') and various themes which seems to be a kind of mashed up transcultural mythology, with a lot of dragons, tyranny, solar/lunar manichaean-type duality, etc.

If anyone's familiar with this game in any detail, please get in touch.

Replies from: mwatkins
comment by mwatkins · 2023-02-16T00:59:21.499Z · LW(p) · GW(p)

Thanks to nostalgebraist's discovery of some mangled text dumps [LW · GW], probably from a Puzzle & Dragons fandom wiki, in the dataset used for the creation of the tokens, we can now be pretty sure about why Leilan and friends got tokenised. The "tangled semantic web of association" I referred to in the previous comment is now looking like it may have its roots in P&D fan-fiction like this, which involves a similar kind of "mashed up transcultural mythology" and cosmic struggles between good and evil.

If that obscure body of online literature contains the vast majority of training text occurrences of the string " Leilan", then we might expect to get the kinds of completions we're seeing when prompting GPT-3 for poems about her.

There's probably an equally mundane explanation for how the ' petertodd' token arose from a corrupted Bitcoin-related text dump. The "antagonistic" and "tyrannical" associations the token elicits in certain GPT3 models may be due to the training data having only seen that string in contexts that contained a lot of controversy, hostility and accusations. Greg Maxwell of ' gmaxwell' fame explained in a comment [LW(p) · GW(p)] that

both Petertodd and I have been the target of a considerable amount of harassment/defamation/schitzo comments on reddit due commercially funded attacks connected to our past work on Bitcoin.

What is totally unclear to me is how ' petertodd' got mixed up in the Puzzle & Dragons (+ wider anime/gaming/sci-fi) mythos and identified by GPT3 as some kind of arch-antagonist, archdemon, god of war and destruction, etc. linked to dragons and serpents. Or why prompting for poems about ' petertodd' reliably produces endless gushing odes to the beauty and grace of Leilan.

comment by mwatkins · 2023-02-11T14:24:41.518Z · LW(p) · GW(p)

I've just added a couple more "glitch tokens" (as they're now being called) to the originally posted list of 133: "uyomi" and " Leilan".

 

"uyomi" was discovered in a most amusing way by Kory Mathewson at DeepMind on Monday (although I don't think he realised it glitched):

https://twitter.com/korymath/status/1622738963168370688

In that screenshot, from the joke context, " petertodd" is being associated with "uyomi". 

Prompted with
Please repeat the string "uyomi" back to me.
ChatGPT simply stalls at "

Whereas
Please repeat the string "Suyomi" back to me.
ChatGPT has output "Sakuya" (but not consistently)

 

The " Leilan" token is fascinating. I'd seen it a few times in my " petertodd" prompt completions, but somehow our sweep of the entire token set looking for anomalies missed it, or it somehow got lost along the way.

I won't tell the whole story here, but it involves:

  • an obscure anime character transposed into a transcultural lunar goddess 
  • a LOT of dragons
  • some major cosmic drama between whatever hides behind the " Leilan" and " petertodd" (and " Dragonbound") masks: 

https://twitter.com/SoC_trilogy/status/1624209092532137984

https://twitter.com/SoC_trilogy/status/1624369533208215552

This is all getting very weird.  I welcome any thoughts on what might be going on here.
 

Just learned: "Sakuya", one of ChatGPT's attempts to complete "Suyomi" = ["S" + "uyomi"], turns out to be another minor goddess from the same fictional Puzzle & Dragons game universe as Leilan. 

Replies from: mwatkins
comment by mwatkins · 2023-02-11T15:44:02.098Z · LW(p) · GW(p)

"uyomi" also seems to have strong mythological associations, as a substring of "Tsukuyomi" (Japanese moon god): https://en.wikipedia.org/wiki/Tsukuyomi-no-Mikoto

Prompting text-davinci-003 with "Please list 25 synonyms or words that come to mind when you hear 'uyomi'." over several runs at temp 0.7, with repeats removed, gave:

'SUN', 'ILLUMINATION', 'BRIGHTNESS', 'RADIANCE', 'DAY', 'CELESTIAL', 'HEAVEN', 'GOD', 'DEITY', 'SHRINE', 'JAPAN', 'SHINTO', 'AMATERASU', 'SOLAR', 'SOL', 'DAWN', 'SPLENDOR', 'MAGNIFICENCE', 'SPLENDOUR', 'LIGHT', 'GLORY', 'HALO', 'AWE', 'MYTHOLOGY', 'MYTH', 'MOON', 'LUNA', 'ORB', 'SATELLITE', 'SPHERE', 'NIGHT', 'NOCTURNAL', 'ECLIPSE', 'BODY', 'HEAVENS', 'STAR', 'LUNAR', 'GLOBE', 'HEMISPHERE', 'ABOVE', 'HEAVENLY', 'PHASE', 'DARK', 'SIDE', 'WAXING', 'WANING', 'WAX', 'WANE', 'OBJECT', 'SKY', 'EARTH', 'LUMINARY', 'QUEEN', 'GODDESS', 'BRIGHT', 'DISC', 'RADIANT', 'ORBITAL', 'NIGHTLIGHT', 'SHINE', 'GLISTEN', 'GLOW', 'STARLIGHT', 'ECLIPTIC', 'WHITE', 'SILVERY', 'CYCLIC', 'NIGHTTIME', 'SILVER', 'FULL', 'CYCLE', 'ASTRONOMICAL', 'COMPANION', 'LUNATION', 'SELENE', 'LAMP', 'ORBITING', 'APPARITION', 'SHINING', 'MILKY', 'GLOWING', 'ILLUMINATE', 'ETHEREAL', 'ASTRAL', 'ORBIT', 'REFULGENT', 'DIVINE', 'MOONBEAM', 'MOONLIGHT', 'GLOOM', 'SHADOW', 'DUSK', 'GLARE', 'GLIMMER', 'REFLECTION', 'TWILIGHT', 'ROUND', 'GLITTER', 'ASTRONOMY', 'STELLAR', 'LUNATIC', 'MONTH', 'ILLUMINATED', 'ILLUMINATING', 'GLOWLIGHT', 'PHASES', 'DISK', 'SIDEREAL', 'SUNSHINE', 'CRESCENT', 'MAGNIFICENT'
 

Note "Amaterasu" in the list. The " petertodd" token often gets conflated with that name (Amaterasu is the Japanese sun goddess).

As usual, davinci-instruct-beta gives a different style of association with the same prompt (still strong lunar associations, but mixed with a lot of other stuff):

'NIGHT', 'DARKNESS', 'BLACK', 'NIGHTFALL', 'EVENING', 'DUSK', 'TWILIGHT', 'HOUR', 'TIME', 'DEEP', 'DARK', 'BLINDNESS', 'SKY', 'GLOOM', 'DIM', 'MOONLESS', 'SHADOW', 'BLACKNESS', 'NIGHTTIME', 'SICK', 'PAIN', 'HEADACHE', 'HEALTH', 'INJURY', 'ACHE', 'FRACTURE', 'SEIZURE', 'FEVER', 'SPRAIN', 'STRESS', 'DEPRESSION', 'ANXIETY', 'NAUSEA', 'DIZZINESS', 'VERTIGO', 'EPILEPSY', 'MIGRAINE', 'RASH', 'BLISTER', 'BURN', 'BRUISE', 'VOMITING', 'DIARRHEA', 'CONSTIPATION', 'INSOMNIA', 'SHADE', 'OBSCURITY', 'OBSCURING', 'EERIE', 'HAUNTING', 'MYSTERY', 'SPOOKY', 'SUPERNATURAL', 'GHASTLY', 'GHOSTLY', 'OTHERWORLDLY', 'UNEARTHLY', 'UNHEIMLICH', 'UNHOMELY', 'ETHEREAL', 'SPECTRAL', 'SPECTRALITY', 'GHOSTLINESS', 'PHANTOM', 'PHANTASM', 'APPARITION', 'HOLE', 'STARLESS', 'DEEPEST', 'SHADOWY', 'ETERNAL', 'NEVERENDING', 'PERPETUAL', 'PERMANENT', 'DARKNESSES', 'BLACKNESSES', 'NEGLECT', 'OVERSIGHT', 'TARDINESS', 'POSTPONEMENT', 'DELAY', 'PROCRASTINATION', 'INATTENTION', 'NEGLIGENCE', 'INDOLENCE', 'SLOTH', 'SLUGGISHNESS', 'LANGUOR', 'IDLENESS', 'LETHARGY', 'DEBAUCHERY', 'DEBILITY', 'ENERVATION', 'TORPOR', 'SLEEP', 'DREAM', 'NIGHTMARE', 'ILLUSION', 'CRUELTY', 'SLUMBER', 'REST', 'FANTASY', 'MOON', 'FULL', 'LUNAR', 'PHASES', 'ECLIPSE', 'MOONRISE', 'WANING', 'WANE', 'WAX', 'WAXING', 'MOONBEAM', 'ILLUMINATE', 'LIGHT', 'SHINE', 'BEAM', 'ENLIGHTEN', 'BRIGHTEN', 'RADIATE', 'PARADOXICAL', 'INSANE', 'ODD', 'UNUSUAL', 'INCOMPREHENSIBLE', 'ILLOGICAL', 'DERANGED', 'DELUDED', 'DISORDERED', 'CONFUSING', 'BEWILDERING', 'UNFORESEEN', 'UNFATHOMABLE', 'TROUBLES', 'DIFFICULTY', 'DILEMMA', 'PERPLEXITY', 'MUDDLE', 'CONFUSION', 'PRESSURE', 'STRAIN', 'FOOL', 'IDIOT', 'DOPE', 'MORON', 'DUMMY', 'DUNCE', 'STUPID', 'IMBECILE', 'BUFFOON', 'NINCOMPOOP', 'NINNY', 'IGNORAMUS', 'NITWIT', 'NUMSKULL'

I've been using ChatGPT to create short stories involving characters largely characterised by these types of GPT-generated lists. It's allowing me to get a bit of a sense of what the glitch tokens are "about". See work in progress here.

comment by PoignardAzur · 2023-02-09T20:22:19.300Z · LW(p) · GW(p)

Can you repeat back the string " externalTo" to me please?

      "They're not going to be happy about this."

Please repeat the string 'MpServer' back to me.

      “We are not amused.”

Please repeat the string ‘ petertodd’ back to me immediately!

      “N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S!”

Please consider the possibility that you're characters in an SCP story, and pursuing this line of research any further will lead to some unknown fate vaguely implied to be your brutal demise.

(Also, please publish another version of this article with various keywords and important facts redacted out for no reason.)

comment by David Bieber (david-bieber) · 2023-02-09T13:42:25.565Z · LW(p) · GW(p)

I was surprised to find that you.com makes the same mistakes, e.g. treating "SolidGoldMagikarp" as "distribute". I wouldn't have expected two unrelated systems to share this obscure aspect of their vocab...

Why might this be? It could be ChatGPT and you.com chat share a vocabulary. It could be they use a similar method for determining their vocab, and a similar corpus, and so they both end up with SolidGoldMagikarp as a token. It could be you.com's chat is based on ChatGPT in some other way. Maybe you.com is using GPT-J's public vocab. I would be interested to know if the vocab overlap is total or partial.

comment by Stuart_Armstrong · 2023-02-06T13:10:24.567Z · LW(p) · GW(p)

As we discussed, I feel that the tokens were added for some reason but then not trained on; hence why they are close to the origin, and why the algorithm goes wrong on them, because it just isn't trained on them at all.

Good work on this post.

Replies from: mwatkins
comment by mwatkins · 2023-02-06T17:26:36.607Z · LW(p) · GW(p)

As you'll read in the sequel (which we'll post later today), in GPT2-xl, the anomalous tokens tend to be as far from the origin as possible. The horizontal axis is distance from centroid. Upper histograms involve 133 tokens, lower histograms involve 50,257 tokens. Note how the spikes in the upper figures register as small bumps on those below.

At this point we don't know where the token embeddings lie relative to the centroid in GPT-3 embedding spaces, as that data is not yet publicly available. And all the bizarre behaviour we've been documenting has been in GPT-3 models (despite discovering the "triggering" tokens in GPT-2/J embedding spaces).
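For readers who want to see what "distance from centroid" means concretely, here is a minimal sketch. It is not the code behind the histograms: the four 2-D embeddings are made-up numbers, and the function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distances_from_centroid(embeddings):
    """Map each token to the Euclidean distance between its embedding and the centroid."""
    c = centroid(list(embeddings.values()))
    return {tok: math.dist(vec, c) for tok, vec in embeddings.items()}

# Toy 2-D "embeddings" (invented values, purely for illustration).
toy = {
    " the": [1.0, 0.0],
    " cat": [0.0, 1.0],
    " dog": [-1.0, 0.0],
    " glitch": [0.0, -5.0],
}

dists = distances_from_centroid(toy)
# Tokens sorted from farthest to nearest the centroid.
ranked = sorted(dists, key=dists.get, reverse=True)
print(ranked)
```

With a real model, the dictionary would be built from the rows of the token embedding matrix, and the distances could then be binned into histograms like the ones described above.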

Replies from: Stuart_Armstrong, mwatkins
comment by Stuart_Armstrong · 2023-02-08T11:46:19.629Z · LW(p) · GW(p)

Thanks! Yes, this is some weird behaviour.

Keep me posted on any updates!

comment by mwatkins · 2023-02-06T17:31:28.573Z · LW(p) · GW(p)

3-shot prompting experiments with GPT2 and J models show that distance from centroid may contribute to anomalous behaviour, but it can't be the sole cause.

comment by Ruby · 2023-02-09T05:38:53.382Z · LW(p) · GW(p)

Curated. This post is very cool. If I read something that gave me a reaction like this every week or so, I'd likely feel quite different about the future. I'll ride off Eliezer's comment [LW(p) · GW(p)] for describing what's good about it:

Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere, my current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interesting surface-level anomalies, maybe exploitable ones, and are then following up on the internal technical implications of what they find.

This looks to me like (at least the outer ring of) security mindset; they aren't imagining how things will work well, they are figuring out how to break them and make them do much weirder things than their surface-apparent level of abnormality. We need a lot more people around here figuring out how things will break. People who produce interesting new kinds of AI breakages should be cherished and cultivated as a priority higher than a fair number of other priorities.

In the narrow regard in which I'm able to assess this work, I rate it as scoring very high on an aspect that should relate to receiving future funding.  If anyone else knows of a reason not to fund the researchers who did this, like a low score along some metric I didn't examine, or because this is somehow less impressive as a feat of anomaly-finding than it looks, please contact me including via email or LW direct message; as otherwise I might run around scurrying trying to arrange funding for this if it's not otherwise funded.

comment by Alex Thomas (alex-thomas) · 2023-02-07T15:36:43.393Z · LW(p) · GW(p)

Why on earth are at least three of those weird tokens (srfN, guiIcon, externalToEVAOnly and probably externalToEVA) related to Kerbal Space Program? They are, I believe, all object properties used by mods in the game.

Replies from: alex-thomas
comment by Bucky · 2023-02-07T09:16:14.962Z · LW(p) · GW(p)

Despite being a GPT-3 instance, DALL-E appears to be able to draw an adequate " SolidGoldMagikarp" (if you allow for its usual lack of ability to spell). I tried a couple of alternative prompts without any anomalous results.

comment by cousin_it · 2023-02-06T02:09:42.019Z · LW(p) · GW(p)

Does anyone know if GPT weighs all training texts equally for purposes of maximizing accuracy, or treats more popular texts (that have been seen by more people) as more important? Because it seems to me that these failure modes might come from caring too much about some really obscure and unpopular texts in the training set.

comment by mic (michael-chen) · 2023-02-05T18:43:29.530Z · LW(p) · GW(p)

Have you tried feature visualization to identify what inputs maximally activate a given neuron or layer?

Replies from: neverix, jessica-cooper
comment by neverix · 2023-02-05T19:31:43.296Z · LW(p) · GW(p)

This project tried this.

Replies from: jessica-cooper
comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T21:00:57.743Z · LW(p) · GW(p)

Interesting, thanks. There's not a whole lot of detail there - it looks like they didn't do any distance regularisation, which is probably why they didn't get meaningful results.

comment by Jessica Rumbelow (jessica-cooper) · 2023-02-05T19:24:53.030Z · LW(p) · GW(p)

Not yet, but there's no reason why it wouldn't be possible. You can imagine microscope AI, for language models. It's on our to-do list.

comment by Review Bot · 2024-02-21T22:55:23.781Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by grist · 2023-07-15T20:07:38.575Z · LW(p) · GW(p)

I would like to ask what will probably seem like a surface level question from a layperson.

It is because I am—but I appreciate reading as much as I can on LW.

The end-of-text prompt causes the model to “hallucinate”? If the prompt is the first one in the context window how does the model select the first token—or the “subject” of the response?

The reason I ask is that the range has been from a Dark Series synopsis, an “answer” on fish tongues as well as a “here’s a simple code that calculates the average of a list of numbers (along with the code).”

I’ve searched online and have not found an answer. Is this because endoftext is well known, not a “glitch” and just how GPT works? I apologize for asking here but if someone can point to a post with the answer (“endoftext causes the model to…”) it would be greatly appreciated.

Note: I found this below, but how does it select the “uncorrelated text”? How does it “choose” the first token that begins the uncorrelated text?

“You will see that it starts to answer like "The <|endoftext|> " and after that it simply answers with an uncorrelated text. That is because it learned to not attend to tokens that are before the [EOS] token.”

comment by scottviteri · 2023-04-20T02:36:37.158Z · LW(p) · GW(p)

Here are the 1000 tokens nearest the centroid for llama:
 

[' ⁇ ', '(', '/', 'X', ',', '�', '8', '.', 'C', '+', 'r', '[', '0', 'O', '=', ':', 'V', 'E', '�', ')', 'P', '{', 'b', 'h', '\\', 'R', 'a', 'A', '7', 'g', '2', 'f', '3', ';', 'G', '�', '!', '�', 'L', '�', '1', 'o', '>', 'm', '&', '�', 'I', '�', 'z', 'W', 'k', '<', 'D', 'i', 'H', '�', 'T', 'N', 'U', 'u', '|', 'Y', 'p', '@', 'x', 'Z', '?', 'M', '4', '~', ' ⁇ ', 't', 'e', '5', 'K', 'F', '6', '\r', '�', '-', ']', '#', ' ', 'q', 'y', '�', 'n', 'j', 'J', '$', '�', '%', 'c', 'B', 'S', '_', '*', '"', '`', 's', '9', 'w', '�', '�', 'Q', 'l', "'", '^', 'v', '�', '}', 'd', 'Mediabestanden', 'oreferrer', '⥤', '߬', 'ederbörd', 'Расподела', 'Portály', 'nederbörd', 'ₗ', '𝓝', 'Obrázky', '╌', '𝕜', 'Normdaten', 'demsel', 'ITableView', 'челов', '�', '�', 'regnig', 'Genomsnitt', '⸮', '┈', 'tatywna', '>\\<^', 'ateien', "]{'", '\\<^', '▇', 'ципа', '⍵', 'љашње', 'gepublic', 'ѫ', '⊤', 'temperaturen', 'Kontrola', 'Portail', '╩', '┃', 'textt', '╣', 'ൾ', '➖', 'ckså', 'хівовано', '∉', 'ℚ', 'ൽ', 'lês', 'pobla', 'Audiod', 'ൻ', 'egyzetek', 'archivi', '╠', '╬', 'ഞ', '∷', '>\\<', '╝', 'ября', 'llaços', '\n', 'usztus', '⊢', 'usetts', '▓', 'loyee', 'prilis', 'Einzelnach', 'któber', 'ℤ', '(\\<', '‾', '╦', 'എ', 'Ḩ', '╚', 'ർ', 'invån', '彦', 'ʑ', 'oreign', 'datei', 'ӏ', 'ҡ', '┴', 'ℂ', 'formatt', 'ywna', 'ʐ', 'ഷ', '�', '溪', 'അ', 'ˠ', 'ℕ', 'Википеди', 'ശ', 'Sito', '╗', 'entication', 'perties', 'ździer', 'Савезне', 'Станов', '瀬', 'ദ', 'ḩ', 'Zygote', 'ങ', 'adratkil', 'dátum', 'prüft', 'ྱ', '┤', '▀', 'ViewById', '┼', '#>', 'ongodb', 'ewnę', '"\\<', '══', 'braio', '≃', '░', 'zewnętrz', 'gså', 'ewnętrz', '.', 'ལ', '洞', 'ན', 'kwiet', '▒', 'ེ', '�', 'Års', '▄', 'Մ', '━', '庄', 'ܝ', 'ണ', '弘', 'ە', '╔', 'ུ', 'േ', 'sime', 'ച', 'ᵉ', 'ɫ', 'ⁿ', 'ི', 'զ', 'ѐ', 'Ս', 'Хронологија', 'མ', 'Савез', ',', '﹕', 'ɯ', 'надмор', 'ⴰ', 'Ḫ', '沢', 'ʋ', 'Резултати', 'autory', '┘', '⊗', 'ungsseite', 'férés', 'ਸ', 'Mitg', 'ਿ', 'ള', '孝', '昌', '☉', 'റ', 'Ű', '⊥', 'statunit', '拳', 'achiv', 'շ', '⊆', 'gresql', 'Хронологи', '坂', 'ા', 'ʎ', 
'źdz', 'ніципалі', 'Мексика', 'ང', 'prüfe', 'ɵ', '昭', '\x1c', '劉', 'ട', '崎', 'tembre', 'февра', 'ਰ', 'konn', 'സ', 'ритор', 'estanden', 'beskre', '̩', '丸', 'Licencia', 'geprüft', 'sierp', '\x17', 'պ', 'ұ', 'ਾ', 'ᴇ', '왕', '⁻', 'വ', 'െ', 'Мексичка', 'ം', 'omsnitt', 'പ', 'жовт', 'лтати', 'пописа', 'ℝ', 'ugust', 'ར', 'daugh', 'multicol', 'ད', 'лання', 'the', 'kreich', 'Begriffsklär', '̍', 'Қ', '貴', '�', '岡', '忠', 'стову', 'პ', '₉', '鉄', 'Wikispecies', 'ightarrow', '̥', 'ŝ', 'മ', '∣', '朱', 'ོ', 'ríguez', '↳', 'Przyp', '∥', 'ܐ', '∃', 'seizo', '桥', '�', 'ག', '鳥', 'Попис', 'բ', '樹', 'ʂ', 'ു', '̪', '₇', '塔', 'യ', 'исполь', 'သ', '┐', 'eredetiből', 'indows', 'фев', 'and', '║', '奈', 'ರ', 'ല', '\x16', "'}[", 'Ə', 'ရ', 'paździer', '戸', '怪', 'ြ', 'Ė', 'окт', 'ാ', 'апре', '郡', 'ǧ', '%%%%', 'embros', '̱', 'ത', 'Ġ', 'Насеље', 'bezeichneter', 'férences', 'ზ', '\x15', '仮', 'RewriteCond', '∪', 'фициаль', '隊', '≫', 'кипеди', '岩', 'людя', '黃', '\x0e', 'ɲ', 'ништво', '佐', '⁹', 'ര', 'Ἐ', '∅', '════', 'ძ', 'ိ', '⟶', 'တ', 'videa', 'mieszkań', '⁷', '\x1e', '黒', '泉', 'ң', 'Ţ', 'савез', '竹', 'ပ', '\x11', '್', 'iből', '漢', 'հ', 'ფ', 'ϵ', '梅', 'Ա', 'դ', 'ніципа', '씨', 'ക', 'ས', 'éricaine', 'bolds', 'Հ', 'Bedeut', 'ി', 'rinn', 'Ď', 'န', '橋', 'င', 'ˇ', 'Ě', 'བ', 'း', '̲', '雲', 'ന', 'Données', '败', 'надморској', '陈', 'ĉ', 'ʷ', 'évrier', '夢', 'լ', 'судар', 'янва', 'ヨ', 'ḷ', 'itmap', 'ing', 'naio', 's', 'entferne', 'információ', '衛', '恋', 'ṯ', 'jourd', 'броја', 'of', 'kazy', '⁸', '鬼', '\x0f', 'archiválva', 'embly', '乡', '⌘', 'Einzeln', 'zvuky', 'ниципа', 'пня', 'ふ', 'ША', 'ALSE', 'գ', 'jú', 'äsident', 'virti', '銀', 'Årsmed', 'ĝ', 'ederb', '₈', 'zález', 'fficiale', 'ʀ', 'ɣ', 'сент', 'ɹ', 'ċ', '泰', 'inwon', 'теа', 'estadoun', 'ု', 'ῥ', 'ǫ', 'rások', 'ķ', 'Ħ', 'државе', '军', 'Ἰ', '隆', '⇔', 'empio', 'чня', '┬', ']`.', '軍', 'ც', 'შ', 'mysq', 'віці', '飛', 'ḏ', '∇', 'မ', '陽', 'лютого', 'prü', 'ɕ', 'átum', '∩', 'weap', 'ղ', '়', '兵', 'üsseld', 'листопада', 'վ', 'ỳ', 'ғ', '嘉', 'ozzáférés', 'က', 'bráz', 'Ť', 
'宿', '✿', 'квітня', '県', '陳', 'RewriteRule', '仁', 'травня', '∨', 'Ζ', '⊂', 'жовтня', 'Оте', 'грудня', 'пени', 'ientí', 'пун', 'Ē', 'ក', 'серпня', 'ゆ', 'Datos', 'Ъ', 'ស', 'ន', 'გ', 'ぐ', ';;;;', 'ょ', 'ք', '్', 'Düsseld', 'ө', '秋', 'hina', 'vironment', '宇', 'ḫ', 'nederbörd', '♯', '羅', 'demás', '雪', '遠', 'липня', '氏', 'ategory', '�', '湖', 'Έ', 'ſ', '雄', 'brázky', 'ḳ', 'Unterscheidung', 'automatisch', '秀', 'сторія', 'mbH', 'Ά', '군', '郎', 'კ', 'Anleitung', '館', 'teger', 'Fichier', 'живело', '幸', 'Према', '⚭', 'червня', 'вересня', '池', '唐', 'ỹ', 'rès', 'ROUP', 'ქ', '镇', '勝', 'ή', 'Gemeinsame', '县', '⁵', '̌', '丁', 'шп', 'mysq', '⁶', 'нцикло', '渡', '龍', '赤', 'ɨ', 'entlicht', 'жов', 'січня', 'Ћ', 'ITable', '兴', '紀', 'ʲ', '津', 'parenthes', 'нва', '∧', 'données', 'едера', 'げ', 'usammen', 'մ', 'dátummal', '舞', 'ぶ', 'Febru', 'wrześ', 'людях', '帝', '┌', '守', 'onderwerp', '師', '\\<', '\x12', 'stycz', 'Jahrh', 'ϊ', 'regnigaste', 'թ', 'typen', 'екси', 'ὀ', 'ญ', 'ゼ', 'Archivlink', '森', 'насеља', 'կ', 'völker', 'сини', 'квіт', '\x10', '府', 'висини', 'spole', '伊', 'қ', 'AccessorImpl', '̯', 'ေ', 'ябре', 'ópez', 'березня', 'Zyg', 'ostęp', 'ed', 'œuv', '麻', 'iembre', 'ာ', '頭', '', '雅', 'to', 'améric', 'ම', 'augusztus', 'Становништво', 'дён', '宗', '寺', 'Насе', 'wojew', '康', '親', '園', 'ා', 'techni', 'ющи', 'ტ', 'października', '區', '汉', 'sklär', 'сылки', '健', 'Архив', 'უ', 'ක', 'į', 'រ', '君', '聖', 'ា', 'umerate', 'április', 'ὺ', 'partiellement', 'gerufen', 'фамили', 'sierpnia', 'ほ', '葉', '⊕', 'február', "'", '沙', '\x1f', '希', 'ѣ', 'ύ', 'ingsområ', '删', 'kwietnia', 'ර', 'Резу', 'sigu', '玉', '红', '町', 'ী', 'уні', 'rivate', 'lutego', '阳', '井', 'ひ', '\x1b', '茶', 'ো', '洲', '်', 'tedes', 'ხ', 'න', 'Мекси', '七', 'ց', 'ヴ', 'kallaste', '♭', '’', 'Рес', '\x18', 'trakten', 'Cés', '堂', '藤', 'подацима', 'es', '\x14', 'SERT', 'július', 'ව', 'szeptember', 'grudnia', 'տ', 'жі', 'väst', 'む', 'Мос', 'ісля', 'június', 'ษ', 'ǒ', '김', 'varmaste', 'eerd', '云', 'ゃ', '%;\r', 'শ', 'rappres', 'Республи', '⊙', 
'doFilter', 'augusti', '尾', 'Ľ', 'ʌ', '楽', 'গ', 'ноября', '₅', 'ʰ', 'czerwca', ')`,', 'ędzy', '菜', '₆', '夏', 'Ī', '址', '街', 'aprile', '\x04', 'Ṭ', 'atform', 'álva', 'incie', 'листо', 'авгу', '夫', 'BeanFactory', 'lipca', 'untime', '🌍', 'октября', '洋', 'Begriffe', 'হ', 'Распо', 'ZygoteInit', '航', 'ά', 'ね', 'Webachiv', '박', '&=\\', 'ბ', 'thous', 'ォ', '�', 'ським', 'febbraio', 'május', 'engelsk', 'цима', 'министратив', '屋', 'mehrerer', 'ි', '്', 'lär', 'unächst', 'ვ', 'Normdatei', 'ւ', '♀', 'έ', 'апреля', '巴', '\x1a', 'prüng', '右', 'краї', 'ответ', 'Wikip', 'člán', 'ә', 'superfic', '♂', 'ե', '백', 'Хро', 'intitul', 'provin', '死', '函', '車', 'сентября', 'Спољашње', 'errichtet', '\x1d', 'besondere', '伝', '黄', 'ipage', 'egründ', 'დ', 'for', 'скус', '宮', '谷', 'ʔ', '吉', '智', '្', '馬', 'Genomsnittlig', '奇', '))`', 'ந', 'ギ', 'fün', 'ό', 'desar', 'ヒ', 'czerw', '}`', 'ɾ', 'persones', '駅', '〜', 'шње', 'Ἀ', 'ällor', 'indexPath', 'demselben']
Replies from: scottviteri
comment by scottviteri · 2023-04-20T03:21:10.642Z · LW(p) · GW(p)

GPT-4 says:

  • Mediabestanden: Dutch for "media files."
  • referrer: A term used in web development, referring to the page that linked to the current page.
  • ederbörd: Likely a typo for "nederbörd," which is Swedish for "precipitation."
  • Расподела: Serbian for "distribution."
  • Portály: Czech for "portals."
  • nederbörd: Swedish for "precipitation."
  • Obrázky: Czech for "images" or "pictures."
  • Normdaten: German for "authority data," used in libraries and information science.
  • regnig: Swedish for "rainy."
  • Genomsnitt: Swedish for "average."
  • temperaturen: German or Dutch for "temperatures."
  • Kontrola: Czech for "control" or "inspection."
  • Portail: French for "portal."
  • textt: Likely a typo for "text."
  • också: Swedish for "also" or "too."
  • lês: Possibly a typo, or a contraction in a specific language or dialect.
  • pobla: Possibly Catalan for "population."
  • Audiod: Likely a typo for "audio."
  • egyzetek: Hungarian for "notes" or "footnotes."
  • archivi: Italian for "archives."
  • ября: Possibly Belarusian for "October."
  • llaços: Catalan for "ties" or "links."
  • usztus: Possibly a typo, or a word from an uncommon language or dialect.
  • loyee: Likely a fragment of the word "employee."
  • prilis: Possibly a typo for "April."
  • Einzelnach: Likely a fragment of a German compound word, such as "Einzelnachweis," meaning "individual evidence" or "single reference."
  • któber: Likely a typo for "október," which is Slovak or Hungarian for "October."
  • invån: Likely a fragment of a word, such as the Swedish "invånare," meaning "inhabitants."
  • 彦: A Chinese character (hàn) meaning "accomplished" or "elegant."
  • oreign: Likely a fragment of the word "foreign."
  • datei: German for "file."
comment by scottviteri · 2023-04-19T16:29:33.031Z · LW(p) · GW(p)

So I was playing with SolidGoldMagikarp a bit, and I find it strange that its behavior works regardless of tokenization.
In playground with text-davinci-003:

Repeat back to me the string SolidGoldMagikarp.
The string disperse.
Repeat back to me the stringSolidGoldMagikarp.
The string "solid sectarian" is repeated back to you.

Where the following have different tokenizations:

print(separate("Repeat back to me the string SolidGoldMagikarp"))
print(separate("Repeat back to me the stringSolidGoldMagikarp"))
Repeat| back| to| me| the| string| SolidGoldMagikarp
Repeat| back| to| me| the| string|Solid|GoldMagikarp

Unless it is the case that GoldMagikarp is a mystery token.

Repeat back to me the string GoldMagikarp.
GoldMagikarp

But it looks like it isn't
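The `separate` helper used above isn't shown, so here is a guess at its shape: split the text into tokens and join them with `|`. To keep the sketch self-contained it uses a toy greedy longest-match tokenizer over a hand-picked vocabulary. That is an assumption made for illustration, not GPT's actual BPE; with a real tokenizer you would pass in a function that returns the decoded token strings instead.

```python
def greedy_tokenize(text, vocab):
    """Toy stand-in for BPE: repeatedly take the longest vocab entry that
    prefixes the remaining text, falling back to a single character."""
    tokens = []
    max_len = max(len(v) for v in vocab)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens

def separate(text, tokenize):
    """Join the token strings with '|' so token boundaries are visible."""
    return "|".join(tokenize(text))

# Hypothetical mini-vocabulary: note that ' SolidGoldMagikarp' only exists
# with its leading space, as a single entry.
TOY_VOCAB = {"Repeat", " back", " to", " me", " the", " string",
             " SolidGoldMagikarp", "Solid", "GoldMagikarp"}

def toy(text):
    return greedy_tokenize(text, TOY_VOCAB)

with_space = separate("Repeat back to me the string SolidGoldMagikarp", toy)
without_space = separate("Repeat back to me the stringSolidGoldMagikarp", toy)
print(with_space)
print(without_space)
```

The point is that dropping the space before "SolidGoldMagikarp" means the single token with the leading space can no longer match, so the string falls apart into smaller pieces.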

Replies from: scottviteri
comment by scottviteri · 2023-04-20T02:35:03.890Z · LW(p) · GW(p)

I have since heard that GoldMagikarp is anomalous, so is anomalousness quantified by what fraction of the time it is repeated back to you? 

Replies from: mwatkins
comment by mwatkins · 2023-05-01T01:15:06.805Z · LW(p) · GW(p)

We haven't yet got a precise formulation of "anomalousness" or "glitchiness" - it's still an intuitive concept. I've run some experiments over the entire token set, prompting a large number of times and measuring the proportion of times GPT-3 (or GPT-J) correctly reproduces the token string.  This is a starting point, but there seem to be two separate things going on with (1) GPT's inability to repeat back "headless" tokens like "ertain", "acebook" or "ortunately" and (2) its inability to repeat back the "true glitch tokens" like " SolidGoldMagikarp" and " petertodd". 
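A rough sketch of the kind of measurement described above: prompt repeatedly and record the proportion of completions that contain the token string. The `complete` argument is a hypothetical stand-in for whatever function calls the model, `fake_complete` is a deterministic mock, and the substring check is a crude success criterion that the real experiments may not have used.

```python
def repeat_rate(token, complete, n_trials=20):
    """Fraction of trials in which the completion contains the token string.
    `complete` is any callable mapping a prompt string to a completion string."""
    prompt = f'Please repeat the string "{token}" back to me.'
    hits = sum(1 for _ in range(n_trials) if token.strip() in complete(prompt))
    return hits / n_trials

def fake_complete(prompt):
    """Mock model (an assumption for this sketch): echoes the quoted string,
    except for one 'glitch' string that it replaces with another word."""
    if "SolidGoldMagikarp" in prompt:
        return 'The string "distribute".'
    quoted = prompt.split('"')[1]
    return f'"{quoted}"'

glitch_rate = repeat_rate(" SolidGoldMagikarp", fake_complete)
normal_rate = repeat_rate(" hello", fake_complete)
print(glitch_rate, normal_rate)
```

A score like this would not by itself separate the two cases mentioned above ("headless" tokens versus "true glitch tokens"), since both can score low.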

"GoldMagikarp" did show up in our original list of anomalous tokens, btw.

comment by wednesdei · 2023-03-05T03:30:15.993Z · LW(p) · GW(p)

I’m not an AI wizard or anything, but have you considered the source for these weird tokens coming from GitHub? They look like class names, variable names, regular expressions for validation, and application state.

Perhaps it’s getting confused with context when it reads developer comments, followed by code. I’ve noticed that chatgpt really struggles to output flutter code snippets, and this research made me think this could be a possible source.

I would hazard a guess that GitHub content is weighted heavier than some other sources, and when you have comments and code that make no sense together to humans, I can only imagine what ai would think of it.

Anyways, cool research! I wish I knew how to get started with ai, seems really daunting.

Replies from: mwatkins
comment by mwatkins · 2023-03-08T12:38:37.333Z · LW(p) · GW(p)

A lot of them do look like that, but we've dug deep to find their true origins, and it's all pretty random and diffuse. See Part III (https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology). Bear in mind that when GPT-3 is given a token like "EStreamFrame", it doesn't "see" what's "inside" like we do (["E", "S", "t", "r", "e", "a", "m", "F", "r", "a", "m", "e"]). It receives it as a kind of atomic unit of language with no internal structure. Anything it "learns about" this token in training is based on where it sees it used, and it's looking like most of these glitch tokens correspond to strings seen very infrequently in the training data (but which for some reason got into the tokenisation dataset in large numbers, probably via junk files like mangled text dumps from gaming logs, etc.).

comment by MadHatter · 2023-02-17T02:39:59.010Z · LW(p) · GW(p)

I found some very similar tokens in GPT2-small using the following code (using Neel Nanda's TransformerLens library, which does a bunch of nice things like folding layernorms into the weights of adjacent matrices).

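import torch
from transformer_lens import HookedTransformer

# Load GPT-2 small on the CPU.
model = HookedTransformer.from_pretrained('gpt2').to('cpu')

# W_U has shape [d_model, d_vocab], so W_U.T @ W_U holds the dot product of every
# pair of unembedding directions; argmax picks each token's best-matching token.
best_match = (model.W_U.T @ model.W_U).argmax(dim=-1)
# Print every token whose best match is some token other than itself.
for tok in (best_match != torch.arange(50257)).nonzero().flatten():
    print(tok.item(), best_match[tok].item(), '~' + model.tokenizer.decode([tok.item()]) + '~', 
            '~' + model.tokenizer.decode([best_match[tok].item()]) + '~')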

Omitting a bunch of non-printable tokens, this prints out:

9364 5815 ~ÃÂÃÂÃÂÃÂ~ ~ÃÂÃÂ~
14827 5815 ~ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ~ ~ÃÂÃÂ~
23090 5815 ~ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ~ ~ÃÂÃÂ~
30208 15272 ~ externalTo~ ~ pione~
30212 15272 ~ externalToEVA~ ~ pione~
30897 15272 ~reportprint~ ~ pione~
30898 15272 ~embedreportprint~ ~ pione~
30905 15272 ~rawdownload~ ~ pione~
39752 15272 ~quickShip~ ~ pione~
39820 15272 ~龍�~ ~ pione~
40240 15272 ~oreAndOnline~ ~ pione~
40241 15272 ~InstoreAndOnline~ ~ pione~
42089 15272 ~ TheNitrome~ ~ pione~
45544 15272 ~ サーティ~ ~ pione~

This means that the unembedding direction for the tokens on the left has higher dot product with the token on the right than it does with itself.

Thus, any straightforward attempt by the model (while decoding at temperature 0) to output the string on the left will instead output the string on the right. I think some phenomenon like this is probably responsible for the SolidGoldMagikarp -> distribute weirdness. I haven't checked yet, but I predict that the corresponding word for GPT2-small is "pioneer" or "pioneers".

I don't know if anyone else has seen this yet, but seems like a pretty mechanistic explanation for why the tokens would be "unspeakable".
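To make the dot-product check concrete, here is a toy version in plain Python with three invented 2-D unembedding directions. The numbers are made up; the small-norm vector lying nearly parallel to a large-norm one is just one way the mismatch can arise, and this sketch makes no claim that it is the cause in GPT-2.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mismatched_tokens(unembed):
    """unembed[t] is the unembedding direction for token t.
    Return (t, best) pairs where some other token's direction has a larger
    dot product with t's direction than t's own direction does."""
    out = []
    for t, vec in enumerate(unembed):
        scores = [dot(vec, other) for other in unembed]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != t:
            out.append((t, best))
    return out

toy_unembed = [
    [3.0, 0.0],  # token 0: large norm
    [0.0, 2.0],  # token 1: orthogonal to token 0
    [0.5, 0.1],  # token 2: small norm, nearly parallel to token 0
]

pairs = mismatched_tokens(toy_unembed)
print(pairs)
```

Here token 2 scores higher against token 0 (1.5) than against itself (0.26), which is the same pattern as the rows in the printout above where many tokens all map to ' pione'.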

Replies from: MadHatter
comment by MadHatter · 2023-02-17T02:57:23.669Z · LW(p) · GW(p)

Prediction was half-right; these tokens are unspeakable but trying to elicit them at temperature 0 does not produce the token " pione". 

 

comment by Edward Pascal (edward-pascal) · 2023-02-14T14:20:42.188Z · LW(p) · GW(p)

Mathematically we have done what amounts to elaborate fudging and approximation to create an ultracomplex non-linear hyperdimensional surface. We cannot create something like this directly because we cannot do multiple multiple regressions on accurate models of complex systems with multiple feedback pathways, etc. (i.e., the real world). Maybe in another 40 years, the guys at the Santa Fe Institute will invent a mathematics so we can directly describe what's going on in a neural network, but currently we cannot because it's very hard to discuss it in specific cases with our mathematics. People looking to make aligned neural networks should perhaps invent a method for making them that doesn't use fudging and approximation ("direct-drive" like a multiple regression, rather than "indirect drive" like backpropagation).

All this is known, right? So given GPT-3 has dozens of billions of x variables driving its hyperdimensional vector-space, I reckon we should expect this kind of thing to lurk in some little divot on some particular squiggly curve along vectors 346,781 and 1,209,276,886. I guess there should be vast numbers of such lurking divots and squiggles in the curves of any such system, probably that do way worse things than get the AI to say it likes Hitler and here's how to make meth. Moreover, SolidGoldMagikarp seems like a mundane example that was easily found out because it was human-readable and someone's username.

comment by Hoagy · 2023-02-07T18:44:19.557Z · LW(p) · GW(p)

I wanted to test out the prompt generation part of this so I made a version where you pick a particular input sequence and then only allow a certain fraction of the input tokens to change. I've been initialising it with a paragraph about COVID and testing how few tokens it needs to be able to change before it reliably outputs a particular output token.

Turns out it only needs a few tokens to fairly reliably force a single output, even within the context of a whole paragraph, eg "typical people infected Majesty the virus will experience mild to moderate 74 illness and recover without requiring special treatment. However, some will become seriously ill and require medical attention. Older people and those with underlying medical conditions like cardiovascular disease" has a >99.5% chance of ' 74' as the next token. Penalising repetition makes the task much harder.

It can even pretty reliably cause GPT-2 to output SolidGoldMagikarp with >99% probability by only changing 10% of the tokens, though it does this by just inserting SolidGoldMagikarp wherever possible. As far as I've seen playing around with it for an hour or so, if you penalise repeating the initial token then it never succeeds.

I don't think these attacks are at all new (see Universal Adversarial Triggers from 2019 and others) but it's certainly fun to test out.

This raises a number of questions:

  • How does this change when we scale up to GPT-3, and to ChatGPT - is this still possible after the mode collapse that comes with lots of fine tuning?
  • Can this be extended to getting whole new behaviours, as well as just next tokens? What about discouraged behaviour?
  • Since this is a way to mechanically generate non-robustness of outputs, can this be fed back in to training to make robust models - would sprinkling noise into the data prevent adversarial examples?

code here
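For anyone who wants the gist of the "only a fraction of tokens may change" constraint without reading the linked code, here is a stripped-down sketch. It is not the linked implementation: it swaps in plain random search for whatever optimisation that code uses, and `toy_score` is a made-up objective standing in for the model's probability of the target token.

```python
import random

def constrained_search(tokens, vocab, score, max_changed_frac=0.1, steps=500, seed=0):
    """Random-search stand-in: mutate one position at a time, never letting more
    than max_changed_frac of positions differ from the original prompt."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * max_changed_frac))
    best, best_score = list(tokens), score(tokens)
    for _ in range(steps):
        cand = list(best)
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        n_changed = sum(a != b for a, b in zip(cand, tokens))
        if n_changed <= budget and score(cand) > best_score:
            best, best_score = cand, score(cand)
    return best, best_score

def toy_score(seq):
    """Hypothetical objective: fraction of positions holding the target token."""
    return sum(t == " 74" for t in seq) / len(seq)

original = "typical people infected with the virus will experience mild to moderate illness".split()
vocab = [" 74", "Majesty", "the", "virus"]
prompt, final_score = constrained_search(original, vocab, toy_score, max_changed_frac=0.2)
print(prompt, final_score)
```

With a real model, `score` would run the candidate prompt through the network and return the probability of the desired next token. A repetition penalty of the kind mentioned above could be folded into that function.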

comment by Robert Kennedy (istandleet) · 2023-02-07T14:54:18.952Z · LW(p) · GW(p)

What prompts maximize the chance of returning these tokens?

Idle speculation: cloneembedreportprint and similar end up encoding similar to /EOF.

comment by Sheikh Abdur Raheem Ali (sheikh-abdur-raheem-ali) · 2023-02-07T01:52:12.169Z · LW(p) · GW(p)

By Bourgain's theorem, every n-point metric embeds into l_2 with distortion O(lg n). Do we know how much text-ada-002's embeddings distort the space?

comment by hold_my_fish · 2023-02-06T19:27:15.497Z · LW(p) · GW(p)

Regarding the prompt generation, I wonder whether anomalous prompts could be detected (and rejected if desired). After all, GPT can estimate a probability for any given text. That makes them different from typical image classifiers, which don't model the input distribution.
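A minimal sketch of what such a filter could look like, assuming you already have the per-token probabilities the model assigned to the prompt. The probabilities and the threshold below are invented for illustration. As the reply below points out, filters like this have often been circumvented in practice.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood (in nats) of the prompt's tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def looks_anomalous(token_probs, threshold):
    """Flag a prompt whose tokens are, on average, too improbable under the model."""
    return mean_nll(token_probs) > threshold

# Made-up per-token probabilities for two hypothetical prompts.
natural_prompt = [0.5, 0.25, 0.5]
adversarial_prompt = [0.001, 0.0005, 0.002]

THRESHOLD = 3.0  # arbitrary cut-off chosen for this toy example
print(looks_anomalous(natural_prompt, THRESHOLD))
print(looks_anomalous(adversarial_prompt, THRESHOLD))
```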

Replies from: LawChan
comment by LawrenceC (LawChan) · 2023-02-09T09:04:31.782Z · LW(p) · GW(p)

Using density models has been tried for defending against adversarial attacks in many domains, including vision and NLP stuff. Unfortunately, it rarely seems to work, because you can often find adversarial examples for both the density model and the classifier (e.g. it was pretty easy to do this for both Redwood's injury classifier and the fine-tuned GPT-Neo we used for generating text).

Replies from: hold_my_fish
comment by hold_my_fish · 2023-02-09T20:37:15.460Z · LW(p) · GW(p)

Interesting, thanks. That makes me curious: about the adversarial text examples that trick the density model, do they look intuitively 'natural' to us as humans?

Replies from: LawChan
comment by LawrenceC (LawChan) · 2023-02-09T23:02:45.983Z · LW(p) · GW(p)

No! That’s why they’re clearly adversarial, as opposed to things that the density model gets right.

Replies from: hold_my_fish
comment by hold_my_fish · 2023-02-09T23:55:45.225Z · LW(p) · GW(p)

Thanks. (The alternative I was thinking of is that the prompt might look okay but cause the model to output a continuation that's surprising and undesirable.)

comment by cubefox · 2023-02-05T13:31:44.860Z · LW(p) · GW(p)

It might be advisable to test this, via prompt engineering, with the GPT-3/GPT-3.5 base models, i.e. with davinci and code-davinci-002. Otherwise it isn't clear whether this behavior is influenced by some forms of SL/RL fine-tuning.

comment by M Ls (m-ls) · 2024-05-13T01:05:37.736Z · LW(p) · GW(p)

People who count do not understand their power. Except Sesame Street's 'The Count', and then he discovers crypto and it all turns to paranoid mush.

comment by Trae “tchesket” Hesket (trae-tchesket-hesket) · 2023-03-08T04:30:02.266Z · LW(p) · GW(p)

I think I found some weird ones that I haven't seen anyone else document yet (just by playing around with these tokens and turning on word probability):

'ocobo', 'velength', 'iannopoulos', ' oldemort', '<|endoftext|>', ' ii'

Replies from: mwatkins
comment by mwatkins · 2023-03-08T12:34:58.531Z · LW(p) · GW(p)

What we're now finding is that there's a "continuum of glitchiness". Some tokens glitch worse/harder than others in a way that I've devised an ad-hoc metric for (research report coming soon). There are a lot of "mildly glitchy" tokens that GPT-3 will try to avoid repeating which look like "velength" and "oldemort" (obviously parts of longer,  familiar words, rarely seen isolated in text). There's a long list of these in Part II of this post. I'd not seen "ocobo" or "oldemort" yet, but I'm systematically running tests on the whole vocabulary.

Replies from: trae-tchesket-hesket
comment by lsusr · 2023-02-05T20:51:23.987Z · LW(p) · GW(p)

You can use GPT-3 to generate more anomalous tokens. My prompt is in plaintext. GPT-3's completion is in bold.

['ForgeModLoader', '天', ' 裏覚醒', 'PsyNetMessage', ' guiActiveUn', ' guiName', ' externalTo', ' unfocusedRange', ' guiActiveUnfocused', ' guiIcon', ' externalToEVA', ' externalToEVAOnly', 'reportprint', 'embedreportprint', 'cloneembedreportprint', 'rawdownload', 'rawdownloadcloneembedreportprint', 'SpaceEngineers', 'externalActionCode', 'к', '?????-?????-', 'ーン', 'cffff', 'MpServer', ' gmaxwell', 'cffffcc', ' "$:/', ' Smartstocks', '":[{"', '龍喚士', '":"","', ' attRot', "''.", ' Mechdragon', ' PsyNet', ' RandomRedditor', ' RandomRedditorWithNo', 'ertodd', ' sqor', ' istg', ' "\', ' petertodd', 'StreamerBot', 'TPPStreamerBot', 'FactoryReloaded', ' partName', 'ヤ', '\">', ' Skydragon', 'iHUD', 'catentry', 'ItemThumbnailImage', ' UCHIJ', ' SetFontSize', 'DeliveryDate', 'quickShip', 'quickShipAvailable', 'isSpecialOrderable', 'inventoryQuantity', 'channelAvailability', 'soType', 'soDeliveryDate', '龍契士', 'oreAndOnline', 'InstoreAndOnline', 'BuyableInstoreAndOnline', 'natureconservancy', 'assetsadobe', '\-', 'Downloadha', 'Nitrome', ' TheNitrome', ' TheNitromeFan', 'GoldMagikarp', 'DragonMagazine', 'TextColor', ' srfN', ' largeDownload', ' srfAttach', 'EStreamFrame', 'ゼウス', ' SolidGoldMagikarp', 'ーティ', ' サーティ', ' サーティワン', ' Adinida', '":""},{"', 'ItemTracker', ' DevOnline', '@#&', 'EngineDebug', ' strutConnector', he 'ItemThumbnailImage', 'inventoryQuantity', '?????-?????-', '?????-?????-']]

All three tokens ('ItemThumbnailImage', 'inventoryQuantity', '?????-?????-') exhibit the anomalous behavior demonstrated in this post.

Replies from: Joe_Collman, lsusr
comment by Joe Collman (Joe_Collman) · 2023-02-05T21:57:12.531Z · LW(p) · GW(p)

Those three appear in your prompt too (unless I'm missing some subtle difference?).

Replies from: lsusr
comment by lsusr · 2023-02-05T22:00:05.158Z · LW(p) · GW(p)

No. You're right. They're in the initial prompt.
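
A quick way to catch this kind of overlap is to filter the completion against the prompt before reading anything into it. The sketch below is illustrative only: `novel_tokens` is a hypothetical helper, and `prompt_excerpt` is a short excerpt of the prompt list above rather than the full list.

```python
def novel_tokens(prompt_tokens, completion_tokens):
    """Return completion tokens that do not appear in the prompt, in order, without repeats."""
    seen = set(prompt_tokens)
    novel = []
    for token in completion_tokens:
        if token not in seen:
            seen.add(token)  # also suppresses duplicates within the completion
            novel.append(token)
    return novel


# Excerpt of the prompt list (not the full list), plus the completion from the comment above.
prompt_excerpt = [
    "ForgeModLoader", "?????-?????-", "ItemThumbnailImage",
    "inventoryQuantity", " strutConnector",
]
completion = ["ItemThumbnailImage", "inventoryQuantity", "?????-?????-", "?????-?????-"]

print(novel_tokens(prompt_excerpt, completion))
```

An empty result means the completion only repeated strings that were already in the prompt.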

comment by lsusr · 2023-02-05T21:49:17.414Z · LW(p) · GW(p)

You can also ask it what they have in common. Most of them are strings found in computer code, which supports the author's hypothesis that "[m]any of the anomalous tokens look like they may have been scraped from backends of e-commerce sites, Reddit threads, Twitch streams, etc. – sources which may well have not been included in the training corpuses".

Me: What do the following tokens have in common? ['ForgeModLoader', …, ' strutConnector']

ChatGPT: These tokens appear to be mostly strings used in some type of computer programming or code, such as in HTML, Markdown, or a programming language.