Comments

Comment by lewtun on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-13T21:07:44.751Z · LW · GW

Do you have any plans to release the instructions in RefusalBench? I understand the reasons for not providing many details of your underlying technique, but given the limitations you highlight with AdvBench, wouldn't access to RefusalBench give safety researchers a better benchmark for testing new models?