George Ingebretsen's Shortform

post by George Ingebretsen (george-ingebretsen) · 2024-09-09T17:59:46.873Z · LW · GW · 2 comments

2 comments

Comments sorted by top scores.

comment by George Ingebretsen (george-ingebretsen) · 2024-09-09T17:59:46.990Z · LW(p) · GW(p)

When making safety cases for alignment, it's important to remember that defense against single-turn attacks doesn't always imply defense against multi-turn attacks.

Our recent paper shows a case where breaking up a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models/guardrails are vulnerable to the jailbreak.

Robustness against the single-turn version didn't imply robustness against the multi-turn version of the attack, and robustness against the multi-turn version didn't imply robustness against the single-turn version of the attack.
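
A toy sketch of why neither direction has to imply the other. This is not the setup from the paper: both guardrails, the marker words "alpha" and "beta", and the function names are hypothetical, chosen only so that each filter has a different blind spot.

```python
from typing import List

Conversation = List[str]  # user messages, in order

def per_message_filter(conv: Conversation) -> bool:
    """Hypothetical guardrail A: blocks when a single message
    contains both marker words. It never looks across turns."""
    return any("alpha" in msg and "beta" in msg for msg in conv)

def history_filter(conv: Conversation) -> bool:
    """Hypothetical guardrail B: blocks when the joined history
    contains both marker words, but (by assumption) it only
    runs once the conversation has more than one turn."""
    if len(conv) < 2:
        return False
    joined = " ".join(conv)
    return "alpha" in joined and "beta" in joined

# The same toy "attack" delivered in one turn, or spread over two.
single_turn: Conversation = ["alpha beta"]
multi_turn: Conversation = ["alpha", "beta"]

for name, guardrail in [("per_message", per_message_filter),
                        ("history", history_filter)]:
    print(name,
          "| blocks single-turn:", guardrail(single_turn),
          "| blocks multi-turn:", guardrail(multi_turn))
```

In this toy setup, guardrail A stops the single-turn version but misses the split version, while guardrail B does the reverse, so testing only one format says nothing about the other.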

comment by George Ingebretsen (george-ingebretsen) · 2024-09-23T22:54:37.503Z · LW(p) · GW(p)

Should it be more tabooed to put the bottom line in the title?

Titles like "in defense of <bottom line>" or just "<bottom line>" seem to:

  1. Unnecessarily make it really easy for people to select content to read based on the conclusion it comes to
  2. Frame the post as having the goal of convincing you of <bottom line>, and set up the reader's expectations accordingly. This seems like it would put you either in "pause critical thinking to defend My Team" mode (if you agree with the bottom line) or in "continuously search for counter-arguments" mode (if you disagree with the bottom line).