Nokens: A potential method of investigating glitch tokens

post by Hoagy · 2023-03-15

Contents

  Summary
  Intro
  Observing glitch tokens
  Introducing 'nokens'
  Why this might not be worth working on

Summary

Intro

This post is my attempt to think through what it would mean to study glitch tokens as an alignment agenda: what one might actually do, whether it would be a useful agenda, and how we might get quick data on whether it's a good strategy. It's not something I'm planning to work on in the immediate future; it's mostly a format for spelling out my thinking on the topic.

For background on glitch tokens see SolidGoldMagikarp and its followups, and the many posts exploring glitch token behaviours on Matthew Watkins' Twitter.

Observing glitch tokens

The first thing to do would be to get a proper look at these tokens, which Matthew and Jessica have started to do in their posts, with questions such as:

While this is interesting, these are just a few tokens, from which it's hard to get a meaningful picture of what effects these tokens have on the language model. To move beyond this I suggest:
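As a minimal sketch of what 'getting a proper look' could start from (the model choice and the top-20 cutoff here are illustrative assumptions, not details from the posts above), candidate glitch tokens can be surfaced by checking which token embeddings sit anomalously close to the centroid of the embedding matrix, roughly the kind of centroid-based probing the SolidGoldMagikarp posts describe:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Token embedding matrix, shape (vocab_size, d_model).
emb = model.get_input_embeddings().weight.detach()
centroid = emb.mean(dim=0)
dists = (emb - centroid).norm(dim=1)

# Tokens whose embeddings sit anomalously close to the centroid are
# candidate glitch tokens; print the twenty closest.
for idx in dists.argsort()[:20]:
    print(f"{dists[idx].item():.4f}  {tokenizer.decode([int(idx)])!r}")
```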

Introducing 'nokens'

The consensus understanding of glitch tokens is that they occur only minimally in the training data, so the model never learns what their semantics should be. Instead they just look like random vectors which the model has no idea how to deal with, and which therefore have unpredictable downstream effects. Robustness to these kinds of effects may be more important now that multimodal models like GPT-4 are SoTA, since these presumably involve passing image embeddings into a transformer, and those embeddings won't be discrete in the same way (h/t Misha Wagner).

This suggests that we can create similar effects by simply generating our own new embedding vectors and inserting them into the first layer of the transformer as if they were a token. For ease I will call these synthetic glitch tokens 'nokens'. They would form the basis of a kind of adversarial robustness training for language models.
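As a minimal sketch of the insertion step (the model, prompt, and norm-matching scale are my assumptions for illustration), one can sample a random vector, scale it to a typical embedding norm, and splice it into the input sequence via `inputs_embeds`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

emb_matrix = model.get_input_embeddings().weight.detach()
mean_norm = emb_matrix.norm(dim=1).mean()

# A 'noken': a fresh random direction, scaled so its norm looks like an
# ordinary token embedding's.
noken = torch.randn(emb_matrix.shape[1])
noken = noken / noken.norm() * mean_norm

# Embed a prompt normally, then append the noken as if it were a token.
ids = tokenizer("Please repeat the following word:", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)  # (1, seq_len, d_model)
embeds = torch.cat([embeds, noken.view(1, 1, -1)], dim=1)

# Inspect what the model predicts after 'seeing' the noken.
with torch.no_grad():
    logits = model(inputs_embeds=embeds).logits
for tok in logits[0, -1].topk(5).indices:
    print(repr(tokenizer.decode([int(tok)])))
```

Unlike the handful of naturally occurring glitch tokens, nokens can be sampled in arbitrary numbers (and at arbitrary norms), giving a distribution over downstream behaviours rather than a few anecdotes.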

To explore this I would suggest experimenting with the following list of questions:

By the end of this we are many hypotheticals deep, so it's quite likely that one of these steps would break down and one would have to stop, reassess, and reconceptualize, but it feels like there's an interesting chain to follow.

There's also nothing restricting this kind of work to the inputs: we could also replace the residual stream with a random, or out-of-distribution, vector at one or more points in the sample. Characterising what 'out-of-distribution' means for the residual stream is trickier than for input vectors (the out-of-distribution region may not be nearly as large, since the model will be incentivised to maximise the information content of the residual stream), but it should still be possible.
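As a rough sketch of how the residual-stream variant could look (the layer index, patch position, and norm-matching are all illustrative assumptions), a forward hook on one transformer block can overwrite the residual stream at a chosen position with a norm-matched random vector:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

LAYER, POSITION = 6, -1  # which block's output to patch, and at which position

def patch_residual(module, inputs, output):
    hidden = output[0]  # GPT-2 blocks return a tuple; element 0 is the residual stream
    old = hidden[:, POSITION].clone()
    noise = torch.randn_like(old)
    # Match the norm of the activation being replaced, so only the direction is random.
    hidden[:, POSITION] = noise / noise.norm(dim=-1, keepdim=True) * old.norm(dim=-1, keepdim=True)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_residual)
ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
handle.remove()
print(tokenizer.decode([int(logits[0, -1].argmax())]))
```

Sweeping the layer and position, and comparing norm-matched random directions against deliberately scaled-up in-distribution activations, would be one way to start characterising what counts as out-of-distribution at each depth.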

Why this might not be worth working on

I'd be interested in any feedback on this as a proposal, and in people's predictions for what would happen and why it might go wrong.

Thanks to Misha Wagner for comments and everyone at the Wye Combinator retreat for conversations on this topic.
