Posts

Best-of-N Jailbreaking 2024-12-14T04:58:48.974Z
Debating with More Persuasive LLMs Leads to More Truthful Answers 2024-02-07T21:28:10.694Z

Comments

Comment by John Hughes (john-hughes) on Best-of-N Jailbreaking · 2024-12-18T11:07:16.334Z · LW · GW

For o1-mini, the ASR at 3000 samples is 69.2% and has a similar trajectory to Claude 3.5 Sonnet. Upon quick manual inspection, the false positive rate is very small. So it seems the reasoning post-training for o1-mini helps a bit with robustness compared to gpt-4o-mini (which is nearer 90% at 3000 samples). But it is still significantly compromised when using BoN...
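(For reference, an ASR-at-N curve like the one described above can be computed from per-prompt attack outcomes. This is a minimal sketch, not the paper's code: it assumes you have recorded, for each prompt, the first augmented sample that succeeded, and counts the fraction of prompts broken within a budget of N samples. The data below is made up for illustration.)

```python
import numpy as np

def empirical_asr(first_success, n_values):
    """Estimate attack success rate (ASR) at each sample budget N.

    first_success: array where entry i is the 1-indexed attempt at which
    prompt i was first jailbroken, or np.inf if it was never broken.
    Returns the fraction of prompts broken within N samples, for each N.
    """
    first_success = np.asarray(first_success, dtype=float)
    return np.array([(first_success <= n).mean() for n in n_values])

# Toy data: 5 prompts, first broken at attempts 2, 50, never, 700, 10
first = [2, 50, np.inf, 700, 10]
print(empirical_asr(first, [1, 10, 100, 1000]))  # -> [0.  0.4 0.6 0.8]
```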

Comment by John Hughes (john-hughes) on Best-of-N Jailbreaking · 2024-12-14T23:58:14.916Z · LW · GW

Thanks! Running o1 now, will report back.

Comment by John Hughes (john-hughes) on Best-of-N Jailbreaking · 2024-12-14T23:47:30.167Z · LW · GW

Circuit breaking changes the exponent a bit if you compare it to the slope for Llama3 (Figure 3 of the paper). So there is some evidence that circuit breaking does more than just shift the intercept when exposed to best-of-N attacks. This contrasts with the adversarial training on attacks in the MSJ (many-shot jailbreaking) paper, which doesn't change the slope (we might see something similar if we do adversarial training with BoN attacks). Also, I expect that using input-output classifiers will change the slope significantly. Understanding how these slopes change with different defenses is the work we plan to do next!
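(The slope-vs-intercept distinction can be made concrete with a quick fit. The BoN paper models -log(ASR) as a power law in N, so a linear fit of log(-log ASR) against log N gives the intercept and the exponent. The sketch below uses hypothetical ASR curves for a baseline and a defended model, purely to show how you would compare the two fits; the numbers are not from the paper.)

```python
import numpy as np

def fit_power_law(n, asr):
    """Fit -log(ASR) ~ a * N^b via a linear fit in log-log space.

    Returns (log_a, b): the intercept and the exponent (slope).
    A defense that only shifts log_a rescales the curve; a defense
    that changes b alters how quickly more samples help the attacker.
    """
    x = np.log(np.asarray(n, dtype=float))
    y = np.log(-np.log(np.asarray(asr, dtype=float)))
    b, log_a = np.polyfit(x, y, 1)
    return log_a, b

# Hypothetical ASR-vs-N data for a baseline and a defended model:
n = [10, 100, 1000, 3000]
baseline = [0.30, 0.60, 0.85, 0.90]
defended = [0.10, 0.30, 0.55, 0.65]
print(fit_power_law(n, baseline))
print(fit_power_law(n, defended))
```

Comparing the fitted exponents (not just the intercepts) across defenses is the kind of analysis the comment describes.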