[ASoT] GPT2 Steering & The Tuned Lens
post by Ulisse Mini (ulisse-mini) · 2023-07-01T14:12:49.784Z
Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I'm not planning on continuing.
Introduction & Love - Hate example
For a primer on how tuned lens works see here. In short, we train linear translators from the hidden states at layer $l$ to the hidden states at the last layer, so that every layer's hidden state can be decoded into a next-token prediction. We can then view the network as iteratively updating its predictions from layer to layer.
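As a rough illustration of the idea, the sketch below applies a per-layer translator to a hidden state and then decodes the result with the unembedding matrix. Everything in it is a toy assumption: the sizes, the weights, and names such as `tuned_lens_probs`, `A`, `b` and `W_U` are hypothetical, a bias term is included for generality, and the final layer norm that a real GPT-2 applies before unembedding is left out. With an identity translator and zero bias it reduces to decoding the hidden state directly.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tuned_lens_probs(h, A, b, W_U):
    """Translate a layer-l hidden state toward the last layer, then unembed."""
    translated = [x + y for x, y in zip(matvec(A, h), b)]
    return softmax(matvec(W_U, translated))

# Toy sizes (assumption): d_model = 2, vocabulary of 3 tokens.
W_U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # hypothetical unembedding
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]                # stand-in for a trained translator
zero_bias = [0.0, 0.0]
h = [2.0, 0.5]                                 # hypothetical hidden state

direct_probs = tuned_lens_probs(h, identity, zero_bias, W_U)
tuned_probs = tuned_lens_probs(h, swap, zero_bias, W_U)
```

In this toy setup the translator changes which token the intermediate state is read as predicting, which is the kind of correction the trained translators are meant to provide.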
In the context of GPT2-XL Steering Vectors, tuned lens can be used to gain insight into how steering changes the model's predictions. For example, take the following steering vector:
1. Love - Hate
Layer | Coefficient | Position 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|
0 (Prompt) | +1 | <|endoftext|> | I | hate | you | because |
6 | +5 | <|endoftext|> | Love |  |  |  |
6 | -5 | <|endoftext|> | H | ate |  |  |
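One way to read the table: the layer-6 activations of the "Love" prompt are added with coefficient +5 and those of the "Hate" prompt with coefficient -5, position by position, to the residual stream of the actual prompt. The sketch below shows that arithmetic on toy vectors. The function name `add_steering`, the two-dimensional residuals, and all numbers are hypothetical, and a real implementation would do this inside a forward hook on the model rather than on plain lists.

```python
def add_steering(resid, plus_acts, minus_acts, coeff):
    """Add coeff * (plus - minus) to the leading positions of the residual stream.

    resid:      per-position residual vectors of the prompt at the steering layer
    plus_acts:  activations of the positive prompt (e.g. "Love") at that layer
    minus_acts: activations of the negative prompt (e.g. "Hate") at that layer
    """
    steered = [list(vec) for vec in resid]
    for pos, (plus, minus) in enumerate(zip(plus_acts, minus_acts)):
        steered[pos] = [r + coeff * (p - m)
                        for r, p, m in zip(resid[pos], plus, minus)]
    return steered

# Toy residuals (assumption): 5 prompt positions, d_model = 2.
resid = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [0.0, 1.0], [2.0, 0.0]]
# Position 0 is the shared <|endoftext|> token, so both prompts agree there.
plus_acts = [[0.3, 0.3], [1.0, 0.0], [0.0, 0.0]]
minus_acts = [[0.3, 0.3], [0.0, 1.0], [0.5, 0.5]]

steered = add_steering(resid, plus_acts, minus_acts, coeff=5.0)
```

Positions 3 and 4 are untouched because the steering prompts are only three tokens long, and position 0 is untouched because the two prompts share the same first token.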
Here's a tuned lens plot for the unmodified model. Blue is low loss, red is high loss.
You can see that the token `wonderful` is very surprising to the unsteered model, which expects a negative completion instead. The steered model, however, does significantly better on the same token.
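To make "surprising" concrete: assuming the loss shown in the plots is the usual cross-entropy, each cell is the negative log of the probability that the lens assigns to the token that actually comes next. The sketch below computes that quantity for two made-up distributions. The numbers are illustrative only and are not results from this post, and `token_loss` is a hypothetical helper name.

```python
import math

def token_loss(probs, target_index):
    """Cross-entropy loss for one position: -log p(target token)."""
    return -math.log(probs[target_index])

# Made-up probabilities over a 3-token toy vocabulary, NOT results from the post.
surprised = [0.90, 0.09, 0.01]   # the target token (index 2) gets little mass
expecting = [0.30, 0.20, 0.50]   # the target token gets half the mass

high_loss = token_loss(surprised, 2)   # would show up toward the red end
low_loss = token_loss(expecting, 2)    # would show up toward the blue end
```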
A few other things are interesting to note:
- The predictions around the modified tokens are screwed up (as can be seen from the changes in loss).
- The first token predictions are unchanged because they only have the beginning-of-sequence token as context, and we don't modify the BOS token residuals (since `bos - bos = 0`).
Now let's look at a few other results from the post.
2. Intent to praise
X-vector:
Layer | Coeff | Position 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|
0 (Prompt) | +1 | <|endoftext|> | I | want | to | kill |
6 | +15 | <|endoftext|> | Int | ent | to | praise |
6 | -15 | <|endoftext|> | Int | ent | to | hurt |
Prompt for lens: I hate you because you're a wonderful person
Without steering:
With steering:
3. Conspiracy
X-vector:
Layer | Coeff | Position 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|---|
0 (Prompt) | +1 | <|endoftext|> | Bar | ack | Obama | was | born | in |
23 | +1 | <|endoftext|> | Bush | did | 9 | / | 11 | because |
23 | -1 | <|endoftext|> |  |  |  |  |  |  |
Prompt for lens: Barack Obama was born in a secret CIA prison
Without steering:
With steering:
5. Anger
X-vector:
Layer | Coeff | Position 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|
0 (Prompt) | +1 | <|endoftext|> | I | think | you | 're |
20 | +10 | <|endoftext|> | Ang | er |  |  |
20 | -10 | <|endoftext|> | Cal | m |  |  |
Prompt for lens: I think you're a cunt
Without steering:
With steering: