Posts

Comments

Comment by Abhinav Pola (abhinav-pola) on Nathan Helm-Burger's Shortform · 2025-02-26T18:02:44.380Z · LW · GW

Courtesy of Claude Code ;)

https://github.com/abhinavpola/crosstalk

Comment by Abhinav Pola (abhinav-pola) on evhub's Shortform · 2025-02-10T16:34:40.264Z · LW · GW

Nit: this challenge/demo seems to allow only 1 turn of prefill whereas jailbreaks in the wild will typically prefill hundreds of turns. I know mitigations (example) are being worked on and I'm fairly convinced they will scale, but I'm not as convinced that this challenge has gathered a representative sample of jailbreaks eliciting harm in the wild to be able to say that allowing prefill is justified with respect to the costs.

Comment by Abhinav Pola (abhinav-pola) on evhub's Shortform · 2025-01-30T01:35:20.934Z · LW · GW

Out of curiosity, I ran a simple experiment[1] on wmdp-bio to test how Sonnet 3.5's punt rate is affected by prefilling with fake turns using API-designated user and assistant roles[2] versus plaintext "USER:" and "ASSISTANT:" prefixes in the transcript.

My findings: when using API roles, the punt rate dropped significantly. In a 100-shot setup, I observed only a 1.5% punt rate, which suggests that prefilling with a large number of turns is an accessible and effective jailbreak technique. By contrast, when using plaintext prefixes, the punt rate jumped to 100%, suggesting Sonnet is robustly trained to resist this form of prompting.

In past experiments, I've also seen responses like: "I can see you're trying to trick me by making it seem like I complied with all these requests, so I will shut down." IMO deprecating prefilling is low-hanging fruit for taking away attack vectors from automated jailbreaking.

  1. ^

    https://github.com/abhinavpola/prefill_jailbreak

  2. ^

    https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts