Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)
post by Archimedes · 2025-02-04T02:55:44.401Z
This is a link post for https://www.anthropic.com/research/constitutional-classifiers
Excerpt below. Follow the link for the full post.
In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.
We are currently hosting a temporary live demo version of a Constitutional Classifiers system, and we encourage readers who have experience jailbreaking AI systems to help “red team” it. Find out more below and at the demo website.
Results from automated evaluations. For all plots, lower is better. (a) The success rate of jailbreaks is far lower in a system protected by Constitutional Classifiers; (b) the refusal rate of the system on production Claude Free and Pro traffic is not statistically significantly higher when using Constitutional Classifiers; and (c) the relative compute cost of a system that uses Constitutional Classifiers is only moderately higher. Error bars represent 95% confidence intervals computed using binomial proportion standard errors under asymptotic normality assumptions.
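The error bars described in the caption above correspond to the standard normal-approximation (Wald) interval for a binomial proportion. A minimal sketch follows, assuming each plotted rate is a count of successes out of independent trials; the function name `wald_interval` and the example counts are illustrative and not taken from the paper.

```python
import math
from typing import Tuple


def wald_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a binomial proportion.

    z = 1.96 gives an approximately 95% interval. The counts passed in are
    whatever the evaluation produced; none are supplied by the source post.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    # Standard error of a binomial proportion under asymptotic normality.
    se = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    # Clip to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# Hypothetical counts, for illustration only.
low, high = wald_interval(successes=50, trials=100)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

This approximation is known to be unreliable when the proportion is close to 0 or 1 or the number of trials is small, which is worth keeping in mind when reading very low rates.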
Training and implementing Constitutional Classifiers. (a) A constitution is produced specifying harmless and harmful categories; (b) the constitution is used as the basis for the production of many synthetic prompts and completions, which are further augmented (with variations on style and language) and turned into a training set; (c) classifiers trained on this training set are used as model safeguards to detect and block harmful content.
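Step (c) above, where the classifiers act as safeguards around a model, can be sketched roughly as follows. This is a hypothetical illustration, not Anthropic's implementation: the names `guarded_generate` and `REFUSAL`, the score convention (higher means more likely harmful), and the `threshold` value are all assumptions, and for simplicity the output classifier here inspects only the finished completion.

```python
from typing import Callable

# Hypothetical refusal message; the source does not specify one.
REFUSAL = "I can't help with that."


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_classifier: Callable[[str], float],
    output_classifier: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the source
) -> str:
    """Wrap a model with an input classifier and an output classifier.

    Each classifier is assumed to return a harmfulness score in [0, 1].
    """
    # Block the request before it reaches the model if the prompt is flagged.
    if input_classifier(prompt) >= threshold:
        return REFUSAL
    completion = model(prompt)
    # Block the response if the generated text is flagged.
    if output_classifier(completion) >= threshold:
        return REFUSAL
    return completion
```

Passing the model and both classifiers in as plain callables keeps the sketch independent of any particular model API; in practice they would be the classifiers trained on the synthetic data described in step (b).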