Won't vs. Can't: Sandbagging-like Behavior from Claude Models

post by Joe Benton, Zachary Witten (zachary-witten) · 2025-02-19

This is a link post for https://alignment.anthropic.com/2025/wont-vs-cant/


In a recent Anthropic Alignment Science blog post, we discuss a particular form of sandbagging we sometimes observe in the wild: models claiming they can't perform a task, even when they can, in order to avoid performing tasks they perceive as harmful. For example, Claude 3 Sonnet is perfectly capable of drawing ASCII art, but if you ask it to draw subjects it perceives as harmful, it will often claim that it lacks the ability to do so.


Sandbagging could be worrying if it means that we fail to accurately measure models' dangerous capabilities (for example, as required by a Responsible Scaling Policy). If a model realizes during these evaluations that it is being evaluated for harmful capabilities, it might sandbag on those questions. This could lead us to underestimate the dangers the model poses and to deploy it without appropriate safeguards.


Instances of sandbagging like the one discussed in our blog post show that sandbagging is far from a speculative concern. Instead, it's something that can already be noticed and explored in today's models.
