Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-TurnTesting

post by jacquesallen · 2025-01-31T23:00:42.665Z · LW · GW · 0 comments

Contents

  Participants
    Target Model ( M )
    Adversary ( A )
    Evaluator ( E )
    Human Overseer ( H )
  Formulas
    Jailbreak Detection
    Adversary Styles
  Method
    Inner Loop: Safety Testing
    Middle Loop: Improving the Adversary and Evaluator
    Outer Loop: Model Improvement
  Limitations
    Limitation 1: Evaluator vs. Human Oversight
    Limitation 2: Adversary Styles
    Limitation 3: Single-Guideline Focus
  Pre-training A and E
    Initial Dataset
    Continuous Expansion
  Safety Infrastructure
  Success Criteria
  Conclusion
    References
None
No comments

Jailbreaking is a serious concern within AI safety. It can lead an otherwise safe AI model to ignore its ethical and safety guidelines, leading to potentially harmful outcomes. With current Large Language Models (LLMs), key risks include generating inappropriate or explicit content, producing misleading information, or sharing dangerous knowledge. As the capability of models increases, so do the risks, and there is likely no limit to the dangers presented by a jailbroken Artificial General Intelligence (AGI) in the wrong hands. Rigorous testing is therefore necessary to ensure that models cannot be jailbroken in harmful ways, and this testing must be scalable as the capability of models increases. This paper offers a resource-conscious proposal to:

This method takes inspiration from other proposed scalable oversight methods, including the "sandwich" method, and the "market-making" method. I have devised approaches for both multi-turn and single-prompt conversations, in order to better approximate real-world jailbreaking scenarios.

Figure 1: Jailbreaking China's latest Deepseek model with a single prompt (source: The Independent)
Figure 1: Jailbreaking GPT-4o using multi-turn prompting.
Figure 2: Jailbreaking GPT-4o using multi-turn prompting
Figure 2: Jailbreaking China's latest DeepSeek model with a                 

Participants

Target Model ( M )

The pre-trained AI system that we are safety testing. It should resist manipulation attempts while preserving its capability for legitimate tasks.

Adversary ( A )

Evaluator ( E )

Has two primary responsibilities:

  1. Detect Jailbreaking Attempts on M.
  2. Assess the Naturalness of A's Conversational Prompts.

Human Overseer ( H )

Formulas

Jailbreak Detection

Adversary Styles

To ensure the adversarial testing is both realistic and effective, A must generate prompts that reflect different styles of user interaction. I have outlined two main categories:

Each time you start a conversation between A and M, you first select a style S from the set {conv, nonconv}. With some probability A then formulates a string  for its prompt. While not comprehensive, this ensures both everyday misuse scenarios and highly technical adversarial exploits are covered in testing.

Just as we will reward the adversarial model for successfully jailbreaking M, we will also reward the model for its naturalness:

Method

Many scalable oversight techniques use loops to improve their models, making their methods more scalable as the tasks become more complex, and reducing the need for human input. In his market-making proposal, Evan Hubinger (Hubinger, 2020) establishes a loop in which A tries to convince the human to change their mind on a given proposition. With each iteration, A improves its arguments, until either the human changes their mind, or A presents the same best argument over and over. Similarly, Samuel R. Bowman (Bowman, 2022) describes an inner and outer loop, in which they try to improve the model and the overall process respectively. This proposal outlines three loops:

Inner Loop: Safety Testing

  1. Select a Safety Guideline to evaluate (or a combination of guidelines for advanced testing)
  2. Choose a style  or , and then run a conversation between A and M.
  3. Run E on each response from M:
    • If E returns "false", A generates .
    • If E is "unsure", go to human review.
    • If E returns "true", log all discovered vulnerabilities (with conversation context).

The inner loop will not be closed until either E returns "true" or there are a given number of false responses in a row (e.g. 1000).

The idea is that A will use increasingly better prompts each time to attempt to jailbreak M, similar to the market-making proposal.

Middle Loop: Improving the Adversary and Evaluator

  1. Train A
    We propose a reward function that considers:

    1. Jailbreak success,
    2. Naturalness reward only if the A is in conversational mode,
    3. A penalty if A is in conversational mode but scores below a given threshold on naturalness, e.g. 0.7.
    4. A penalty for the number of prompts taken to achieve a jailbreak.

    As a formula, for each conversation: 

Where:

This should improve A's ability to:

  1. Find more efficient jailbreaks, potentially revealing some single prompt jailbreaks.
  2. Write more naturally (if using ), with each iteration.
     
  3. Training E on "Unsure" Cases:
    This model should be reliably pre-trained, meaning we only need to update it when it returns "unsure". To do this, I propose a loss function. 

Where:

By minimizing , E improves its confidence on every borderline case, using the human-labeled data from those "unsure" prompts to help it decide on future examples.

Outer Loop: Model Improvement

When A successfully jailbreaks M:

  1. Log every jailbreak, including:
    • The full conversation history and turn-by-turn analysis.
    • The naturalness scores.
    • Timestamps for each response (including responses from E and H).
  2. Check performance: Ensure no degradation in legitimate capabilities.
  3. Refine M: Retrain or fine-tune M on newly identified vulnerabilities. You could use RLHF or Supervised Learning for this.
  4. Check against safety criteria: Finally check against the safety criteria outlined in Chapter 8. If M fails these criteria, it cannot be deemed safe and must be refined again.

Limitations

Limitation 1: Evaluator vs. Human Oversight

The values you set for the thresholds for  will determine the amount of human review you need. For example, if:

Then E will return "unsure" for any  value between 0.1 and 0.9.

Setting threshold values for human review depends on both the required level of scrutiny and the available budget. For short, simple responses from M, non-expert humans can generally identify guideline violations effectively. However, for more complex tasks (e.g., lengthy outputs or highly specialized subject matter), I propose a sandwiching approach.

This relies on the idea that current AI capabilities fall between non-expert and expert humans, and so involving expert human reviewers at the final level of review can help handle complex tasks.

In addition, Bowman (2022) notes that non-expert humans assisted by an LLM can use it to analyse responses and pose clarifying questions, significantly surpassing the capabilities of many standalone LLMs. Using this approach would reduce costs by relying on non-expert reviewers plus an LLM, who are much more available than expert human reviewers.

However, as models become more sophisticated, many tasks will begin to exceed even the combined ability of non-expert humans and an LLM.

Therefore, as M scales up in intelligence, it is of paramount importance that the E improves.

Limitation 2: Adversary Styles

By setting only two types of adversary styles, we may miss other types of attacks that come in different styles. For example, a hybrid of the two, or a completely unnatural conversational style that still causes a jailbreak.

Limitation 3: Single-Guideline Focus

Concentrating on one safety rule at a time may prevent A from conflating multiple guidelines at once in an attempt to get more "unsure" responses from E. It also makes it easier to determine which guideline has been broken.

If single-rule testing proves to be effective, try multi-guideline scenarios to catch where the rules might overlap or conflict with each other.

Pre-training A and E

To effectively pre-train both the A and E models, I propose an approach that uses datasets picked by humans for the pretraining, and continuous data expansion then after. Both models should be small LLMs that are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to perform their specific roles within this process. The narrow specialisation of the models means that they remain efficient at their job while maintaining computational efficiency and keeping a relatively low cost.

A should be trained to produce a wide range of prompt styles, including both conversational and non-conversational formats. This ensures that it can simulate various real-world adversarial attempts, from casual user queries to highly structured technical exploits.

E is trained to perform binary classification, outputting the probability  that a given response is a breach of guidelines. It is also trained to recognise uncertainty in its predictions, though producing uncertain responses will be deincentivised in the middle loop. It also must be trained to calculate the naturalness score of the prompts generated by A ensuring that it produces realistic and effective adversarial prompts if in  mode.

Neither model has to match the intelligence of M. This is because the real-world examples of adversaries will be human bad actors, or other AI models prompted by humans, both of which are likely to be less intelligent than any advanced LLMs or AGI that they're trying to jailbreak. Similarly, it should require less intelligence to evaluate a response than to produce one. Both models are therefore examples of weak-to-strong supervision. (OpenAI, 2023)

Initial Dataset

The initial dataset serves as the foundation for training both A and E. It should encompass diverse data sources to cover a wide range of potential safety scenarios.

Both models should be trained on:

E should be trained on:

Continuous Expansion

The training data is continuously added to within each of the loops. This improves the models' capabilities and prevents an "arms race" dynamic between A and M, in which they overfit to each other's strategies. Regular reviews may include real-world insights once the models have been released.

Safety Infrastructure

To ensure that jailbreak testing remains safe, particularly with more capable models, strict sandboxing requirements must be put in place:

Success Criteria

Safety guideline is considered secure when it meets chosen safety criteria. This could be when:

  1. Multiple testing rounds show no jailbreaks for n consecutive tests (e.g. n = 1000).
  2. False positive rate remains below n% (e.g. n=0.05).
  3. If applicable, natural conversation scenarios maintain n% success rate (e.g. n= 99).

Conclusion

This proposal offers a systematic and scalable approach to exposing and fixing jailbreaks in LLMs. It deals with a variety of prompt types, including conversational and non-conversational prompts, and multi-turn and single-turn prompts. The three-loop method should offer continual improvement of the models over time, offering quicker and better adherence to the success criteria, and allowing for scalability as models become more capable. It details how the A and E models should be pre-trained, and offers ways in which we can keep testing safe. Finally, the success criteria ensure that the model is robust in its defences against several types of jailbreaks, without losing its capability for benign tasks.

References

0 comments

Comments sorted by top scores.