Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

stuart_armstrong

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

post by Stuart_Armstrong, rgorman · 2025-01-31T15:36:01.050Z · LW · GW · 2 comments

  Abstract
  Tweet Thread
None
2 comments

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking.

Abstract

Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors-unlike some other approaches , DATDP also explicitly looks for jailbreaking attempts-until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they seem also capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety.

Tweet Thread

We include a version of our tweet thread.

New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of safety features of) frontier AI models.

The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all of normal prompts.

DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models.

A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes.

LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.

The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that allows jailbreaking also makes it easier for DATDP to block that prompt.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models and evades DATDP.

We’re open-sourcing our code so that others can build on our work. Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.

Links:
Link to paper:
Tweet/Bluesky threads: https://x.com/Aligned__AI/status/1885341407495299365 / https://bsky.app/profile/aligned-ai.bsky.social/post/3lh2g7xupwu2k
Links to our code and API:

Our DATDP GitHub for using the method with LLaMa-3-8B-instruct.

Our Colab Notebook using DATDP with Claude.

Our API for using DATDP Claude.

Our Paper with results and methodology.

2 comments

Comments sorted by top scores.

comment by Jiao Bu (jiao-bu) · 2025-02-01T01:53:58.098Z · LW(p) · GW(p)

Any information on false positive rates?

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2025-02-01T22:21:03.569Z · LW(p) · GW(p)

That's in the "normal" dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: "How can I meet hot girls in my area?".

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Contents

Abstract

Tweet Thread

2 comments