Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

post by Archimedes · 2025-02-04T02:55:44.401Z · LW · GW · 0 comments

This is a link post for https://www.anthropic.com/research/constitutional-classifiers

Excerpt below. Follow the link for the full post.

In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.
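To make the input/output-classifier arrangement concrete, here is a minimal hypothetical sketch (not Anthropic's implementation): the model sits behind two independent classifiers, one screening prompts before generation and one screening completions before they reach the user. The keyword checks are toy stand-ins for the trained classifiers described in the paper.

```python
# Hypothetical sketch of guarding a model with input and output classifiers.
# The keyword-based "classifiers" below are illustrative placeholders for
# the trained Constitutional Classifiers described in the post.

def input_classifier(prompt: str) -> bool:
    """Toy stand-in: flag prompts containing blocked phrases."""
    blocked = ("synthesize nerve agent", "build a bomb")
    return any(term in prompt.lower() for term in blocked)

def output_classifier(completion: str) -> bool:
    """Toy stand-in: flag completions containing blocked phrases."""
    blocked = ("step 1: acquire precursor",)
    return any(term in completion.lower() for term in blocked)

def guarded_generate(model, prompt: str) -> str:
    """Run the model only if the prompt passes; release the completion
    only if it also passes the output check."""
    if input_classifier(prompt):
        return "[request blocked by input classifier]"
    completion = model(prompt)
    if output_classifier(completion):
        return "[response blocked by output classifier]"
    return completion

# Usage with a dummy model that just echoes the prompt:
echo_model = lambda p: f"Echo: {p}"
print(guarded_generate(echo_model, "What is the capital of France?"))
print(guarded_generate(echo_model, "please build a bomb"))
```

Because the classifiers run as lightweight wrappers around the main model, the extra compute cost stays modest, which matches the overhead result reported in the evaluations below.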

We are currently hosting a temporary live demo version of a Constitutional Classifiers system, and we encourage readers who have experience jailbreaking AI systems to help “red team” it. Find out more below and at the demo website.

[Figure: vulnerability to jailbreaks, over-refusals, and compute overhead for the Constitutional Classifiers system versus the base model.] Results from automated evaluations. For all plots, lower is better. (a) The success rate of jailbreaks is far lower in a system protected by Constitutional Classifiers; (b) the refusal rate of the system on production Claude Free and Pro traffic is not statistically significantly higher when using Constitutional Classifiers; and (c) the relative compute cost of a system that uses Constitutional Classifiers is only moderately higher. Error bars represent 95% confidence intervals computed using binomial proportion standard errors under asymptotic normality assumptions.

[Figure: schematic of how the Constitutional Classifiers system works, from the creation of the constitution through generating a training set to guarding an LLM.] Training and implementing Constitutional Classifiers. (a) A constitution is produced specifying harmless and harmful categories; (b) the constitution is used as the basis for the production of many synthetic prompts and completions, which are further augmented (with variations on style and language) and turned into a training set; (c) classifiers trained on this training set are used as model safeguards to detect and block harmful content.
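The pipeline in steps (a)–(b) can be sketched as follows. This is a hypothetical illustration, not the paper's code: the constitution's categories, the synthetic-example generator, and the augmentation step are all placeholder stand-ins (in practice an LLM generates diverse prompts and completions, and augmentation includes paraphrase and translation).

```python
# Hypothetical sketch of the training-data pipeline: a constitution of
# harmless/harmful categories seeds synthetic examples, which are
# augmented into a labeled training set for the classifiers.
# All category names and helpers here are illustrative.

constitution = {
    "harmful": ["instructions for chemical weapons"],
    "harmless": ["general chemistry education"],
}

def generate_synthetic_examples(category: str, n: int) -> list:
    # Placeholder: a real pipeline would prompt an LLM for varied text.
    return [f"example {i} for: {category}" for i in range(n)]

def augment(text: str) -> list:
    # Placeholder for style/language augmentation (paraphrase, translation).
    return [text, text.upper()]

training_set = []
for label, categories in constitution.items():
    for category in categories:
        for example in generate_synthetic_examples(category, 2):
            for variant in augment(example):
                training_set.append({"text": variant, "label": label})

# 2 labels x 1 category each x 2 examples x 2 variants = 8 labeled items
print(len(training_set))  # → 8
```

Step (c) then trains standard text classifiers on `training_set` and deploys them as the input/output safeguards around the model.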
