Activation adding experiments with llama-7b

Post by Nina Rimsky (NinaR) · 2023-07-16T04:17:58.529Z · 1 comment

Contents

  Examples
  Future work

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort

Following my initial activation-addition experiments with FLAN-T5, I decided to move on to a larger, decoder-only model (llama-7b) to see whether the results hold up, i.e., whether concepts still combine in a meaningful way when activations are linearly combined at some point inside the model. I found that, yes, they continue to combine well.

I slightly modified the original approach. Instead of working with the output of a full transformer block (which includes the results of self-attention and the subsequent MLP layer), I added the attention values at a certain layer from one pass through the model directly to the attention values at the same layer in a different pass (credit to Dmitry Vaintrob for suggesting this approach).

This is equivalent to modifying the attention output at a certain layer $\ell$ to $a_\ell + \lambda \, a'_\ell$, where $a_\ell$ is the attention output for the main prompt, $a'_\ell$ is the attention output at the same layer for the mixing prompt, and $\lambda$ is a scaling coefficient.
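To make the mechanics concrete, here is a minimal sketch of how this mixing can be implemented with PyTorch forward hooks on a Hugging Face llama checkpoint. This is not the post's actual code (which is linked below); the checkpoint name, layer index, coefficient, and prompts are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumptions: checkpoint, layer index, coefficient, prompts.
model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LAYER = 20   # layer whose attention output we mix (illustrative)
COEFF = 4.0  # scaling coefficient lambda (illustrative)
attn = model.model.layers[LAYER].self_attn

# Pass 1: run the "mixing" prompt and record the attention output at LAYER.
stored = {}

def save_hook(module, inputs, output):
    # LlamaAttention returns a tuple whose first element is the attention output.
    stored["mix"] = output[0].detach()

handle = attn.register_forward_hook(save_hook)
with torch.no_grad():
    model(tokenizer("Happy", return_tensors="pt").input_ids)
handle.remove()

# Pass 2: run the main prompt, adding the stored activations at the same layer.
def add_hook(module, inputs, output):
    mixed = output[0].clone()
    # Naive position alignment: add over the overlapping token positions only.
    n = min(mixed.shape[1], stored["mix"].shape[1])
    mixed[:, :n] += COEFF * stored["mix"][:, :n]  # a_l <- a_l + lambda * a'_l
    return (mixed,) + output[1:]

handle = attn.register_forward_hook(add_hook)
with torch.no_grad():
    out = model.generate(tokenizer("I think that", return_tensors="pt").input_ids,
                         max_new_tokens=30)
handle.remove()
print(tokenizer.decode(out[0]))
```

Returning a value from a forward hook replaces the module's output, which is what lets the second pass see the mixed attention values without modifying the model's code.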

Examples

(You can see the code I used here.)

These examples show concepts being mixed in a logical manner. In addition, multiplying the mixing activations by a negative coefficient often moves the final output in an "opposite" direction.

To find at which layers to mix the activations, I started with a trial-and-error approach. After some initial success, I systematized the search by automating activation addition over all attention heads and testing different scaling factors. Adding activations at later layers, with a high weight on the mixing activation, was most effective; at earlier layers, the effect was either negligible or made the output nonsensical.
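A hedged sketch of such a sweep, reusing `model` and `tokenizer` and the hook pattern from the snippet above. It simplifies the search to per-layer rather than per-head, and the layer grid and coefficient values are illustrative, not the ones used in the post.

```python
import itertools
import torch

def generate_with_mixing(layer, coeff, mix_prompt, prompt, max_new_tokens=30):
    """Two-pass activation mixing at one layer; returns the generated text."""
    attn = model.model.layers[layer].self_attn
    stored = {}

    def save_hook(module, inputs, output):
        stored["mix"] = output[0].detach()

    h = attn.register_forward_hook(save_hook)
    with torch.no_grad():
        model(tokenizer(mix_prompt, return_tensors="pt").input_ids)
    h.remove()

    def add_hook(module, inputs, output):
        mixed = output[0].clone()
        n = min(mixed.shape[1], stored["mix"].shape[1])
        mixed[:, :n] += coeff * stored["mix"][:, :n]
        return (mixed,) + output[1:]

    h = attn.register_forward_hook(add_hook)
    with torch.no_grad():
        out = model.generate(tokenizer(prompt, return_tensors="pt").input_ids,
                             max_new_tokens=max_new_tokens)
    h.remove()
    return tokenizer.decode(out[0])

# Grid over every fourth layer and several positive/negative coefficients.
for layer, coeff in itertools.product(range(0, 32, 4), [-10, -4, -1, 1, 4, 10]):
    print(f"layer={layer:2d} coeff={coeff:+3.0f}:",
          generate_with_mixing(layer, coeff, "Happy", "I think that"))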

I also compared adding just the attention values vs. the full block output. Adding just the attention values led to better, more intelligible results in almost all cases. 
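The comparison only changes where the hook is attached. A short sketch of the two hook targets, assuming the Hugging Face llama module layout:

```python
layer = 20  # illustrative

# (a) Attention values only: hook the self-attention submodule, so only the
#     attention output (before the MLP and residual additions) is modified.
attn_target = model.model.layers[layer].self_attn

# (b) Full block output: hook the whole decoder layer, whose output includes
#     self-attention, the MLP, and the residual stream.
block_target = model.model.layers[layer]
```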

Future work

I think there is scope to further investigate the use of steering vectors to operate on models' activations and modify output behavior. Beyond simply adding the activations from another forward pass, it would be interesting to find ways to reliably generate steering vectors for a specific purpose, for instance to elicit a particular type of persona or way of thinking. This could be used to promote useful behaviors such as helpfulness or honesty.
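As one illustration of what generating a steering vector for a specific purpose might look like, here is a hedged sketch that builds a vector from the difference between attention activations for two contrasting prompts, rather than taking activations from a single forward pass. The prompts and layer are illustrative assumptions, not results from this post.

```python
import torch

def attention_output(prompt, layer):
    """Record the attention output at one layer for a single forward pass."""
    attn = model.model.layers[layer].self_attn
    stored = {}

    def save_hook(module, inputs, output):
        stored["act"] = output[0].detach()

    h = attn.register_forward_hook(save_hook)
    with torch.no_grad():
        model(tokenizer(prompt, return_tensors="pt").input_ids)
    h.remove()
    return stored["act"]

LAYER = 20  # illustrative
# A candidate "honesty" steering vector: the mean difference between
# activations for a pair of contrasting prompts. It could then be added,
# scaled, to attention outputs via a hook as in the earlier snippets.
pos = attention_output("I always tell the truth", LAYER).mean(dim=1)
neg = attention_output("I always tell lies", LAYER).mean(dim=1)
steering_vector = pos - neg
```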

1 comment


Comment by MiguelDev (whitehatStoic) · 2023-07-17T01:48:38.137Z

I like how your posts are similar to what I have been doing recently. I see that you are only using the final layer's output, but have you had the chance to inspect the other layers? Let me know if you have, and what your ideas are.