Comments

Comment by lone17 on Refusal in LLMs is mediated by a single direction · 2024-10-31T08:53:13.831Z · LW · GW

Thanks for the insight on the locality check experiment.

For inducing refusal, I used the code from the demo notebook provided in your post. It doesn't have a section on inducing refusal, but I simply inverted the difference-in-means vector and set the intervention layer to the single layer from which that vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. I'll check out your repo to see if I missed something. Thank you for the discussion.
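For reference, the inversion I describe amounts to a standard activation-addition hook at one layer. Here is a minimal numpy sketch (the function name and shapes are mine for illustration, not from the notebook):

```python
import numpy as np

def add_direction(resid, direction, coeff=1.0):
    """Add a scaled unit direction to every token position of a
    residual-stream activation of shape (n_tokens, d_model).
    coeff = +1 pushes activations toward the direction (inducing
    refusal); coeff = -1 is the inverted difference-in-means
    vector, pushing away from it."""
    unit = direction / np.linalg.norm(direction)
    return resid + coeff * unit  # broadcasts over token positions
```

Applied inside a forward hook at the chosen layer, this shifts every token's projection onto the unit direction by exactly `coeff`.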

Comment by lone17 on Refusal in LLMs is mediated by a single direction · 2024-10-29T11:47:59.396Z · LW · GW

suggesting that the direction is "read" or "processed" at some local region.

Interesting point here. I would further add that these local regions might be token-dependent. I've found that the refusal direction can be extracted from different layers at different positions (though I only experimented with the tokens that come after the instruction). Each of these refusal directions seems to work well when used to ablate some layers surrounding the layer where it was extracted.
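By "ablate" here I mean the usual directional ablation: removing the component along the unit direction from the stream at each hooked layer. A sketch of what that operation looks like (my own code, not the post's exact implementation):

```python
import numpy as np

def ablate_direction(resid, direction):
    """Zero out the component of each token's activation along
    `direction`. resid has shape (n_tokens, d_model)."""
    unit = direction / np.linalg.norm(direction)
    proj = resid @ unit                # (n_tokens,)
    return resid - np.outer(proj, unit)
```

After this, every token's activation is orthogonal to the refusal direction at the ablated layer, so downstream components can no longer "read" it there.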

Oh, and by the way, I found a minor redundancy in the code. The intervention is added to all three streams: pre, mid, and post. But since the post stream of one layer is also the pre stream of the next layer, we only need to process one of the two. Having both won't cause any issues, but it could slow down the experiments. There is one edge case at either the last layer's post stream or the first layer's pre stream, but that can be fixed easily or ignored entirely.
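A toy illustration of the redundancy, assuming the layers themselves are identity maps so that only the hooks change the stream: hooking both the post stream of layer i and the pre stream of layer i+1 applies the same intervention twice at that boundary.

```python
import numpy as np

def run_toy(resid, n_layers, hook, on_pre=True, on_post=True):
    """Toy residual stream where each layer is the identity, so the
    only changes come from the intervention hooks. The post stream
    of layer i is the same tensor as the pre stream of layer i+1."""
    for _ in range(n_layers):
        if on_pre:
            resid = hook(resid)   # pre stream of this layer
        # (real layer computation would go here)
        if on_post:
            resid = hook(resid)   # post stream == next layer's pre stream
    return resid
```

With an additive hook `x -> x + d`, hooking both streams applies `d` twice per layer, while hooking only one of them applies it once per layer.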

Comment by lone17 on Refusal in LLMs is mediated by a single direction · 2024-10-29T05:12:09.080Z · LW · GW

Many thanks for the insight. 

I have been experimenting with the notebook and can confirm that ablating at some middle layers is effective at removing the refusal behaviour. I also observed that the effect becomes more significant as I increase the number of ablated layers. However, in my experiments, 2-3 layers were insufficient to get a good result: I only saw a minimal effect with 1-3 layers, and only with 7 or more layers was the effect comparable to ablating everywhere. (Disclaimer: I'm experimenting with Qwen1 and Qwen2.5 models; this might not hold for other model families.)

I select the layers to be ablated as a range of consecutive layers centred at the layer where the refusal direction was extracted. Perhaps this choice of layers is why I got different results from yours? Could you share some insight into how you select the layers to ablate?
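For clarity, this is roughly how I pick the range (a sketch; `width` is the total number of ablated layers, and the helper name is mine):

```python
def layers_to_ablate(center, width, n_layers):
    """Consecutive layer indices centred at `center`, clipped to the
    valid range [0, n_layers). For an even `width`, the extra layer
    falls on the earlier side of `center`."""
    half = width // 2
    lo = max(0, center - half)
    hi = min(n_layers, center - half + width)
    return list(range(lo, hi))
```

For example, with a 28-layer model and the direction extracted at layer 14, a width of 7 ablates layers 11 through 17.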

Another question: I haven't been able to successfully induce refusal. I tried adding the direction at the layer where it was extracted, in a local region around that layer, and everywhere, but none gave good results. Could there be additional steps that I'm missing?

Comment by lone17 on Refusal in LLMs is mediated by a single direction · 2024-10-22T09:03:44.296Z · LW · GW

Thank you for the interesting work! I'd like to ask a question about this detail:

Note that the average projection measurement and the intervention are performed only at layer , the layer at which the best "refusal direction"  was extracted from.

Why do you add the refusal direction at only one layer when inducing refusal, but ablate it at every layer when removing refusal? Is there a reason or intuition behind this? What if, in later layers, the activations are steered away from that direction, making the method less effective?