Anthropic releases Claude 3.7 Sonnet with extended thinking mode

post by LawrenceC (LawChan) · 2025-02-24T19:32:43.947Z · LW · GW · 2 comments

This is a link post for https://www.anthropic.com/news/claude-3-7-sonnet

Contents

  Benchmark performance
  Still ASL-2, but with better evaluations this time
  Alignment evaluations
  At least it's not named Claude 3.5 Sonnet
None
2 comments

See also: the research post detailing Claude's extended reasoning abilities and the Claude 3.7 System Card

About 1.5 hours ago, Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model that interpolates between a normal LM and long chains of thought:

Today, we’re announcing Claude 3.7 Sonnet1, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users also have fine-grained control over how long the model can think for.

They call Claude's reasoning ability "extended thinking" (from their system card):

Claude 3.7 Sonnet introduces a new feature called "extended thinking" mode. In extended thinking mode, Claude produces a series of tokens which it can use to reason about a problem at length before giving its final answer. Claude was trained to do this via reinforcement learning, and it allows Claude to spend more time on
questions which require extensive reasoning to produce better outputs. Users can specify how many tokens Claude 3.7 Sonnet can spend on extended thinking.

Benchmark performance

Anthropic performed their evaluations on Claude 3.7 Sonnet in two ways: a normal pass@1 run, as well as a version with extended thinking. For GPQA and AIME, they also report numbers that use extensive test-time compute on top of extended thinking mode, where they use Claude to generate 256 responses and then pick the best according to a learned scoring model; for SWE-Bench verified, they do a similar approach of generating many responses, discarding solutions that fail visible regression tests, and then returning the best solution according to the learned scoring model. 

Claude 3.7 Sonnet gets close to state-of-the-art on every benchmark, and crushes existing models on SWE-Bench verified: 

Benchmark table comparing frontier reasoning models

 

They also evaluate Claude on Pokemon Red (!!), and find that it can get 3 badges:

Still ASL-2, but with better evaluations this time

Based on our assessments, we’ve concluded that Claude 3.7 Sonnet is released under the ASL-2 standard.

This determination follows our most rigorous evaluation process to date.
As outlined in our RSP framework, our standard capability assessment involves multiple distinct stages: the Frontier Red Team (FRT) evaluates the model for specific capabilities and summarizes their findings in a report, which is then independently reviewed and critiqued by our Alignment Stress Testing (AST) team. Both FRT’s report and AST’s feedback are submitted to the RSO and CEO for the ASL determination. For this model assessment, we began with our standard evaluation process, which entailed an initial round of evals and a Capability Report from the Frontier Red Team, followed by the Alignment Stress Testing team’s
independent critique. Because the initial evaluation results revealed complex patterns in model capabilities, we supplemented our standard process with multiple rounds of feedback between FRT and AST. The teams worked iteratively, continually refining their respective analyses and challenging each other’s assumptions to reach a thorough understanding of the model’s capabilities and their implications. This more comprehensive process reflected the complexity of assessing a model with increased capabilities relevant to the capability threshold.

It's worth noting that they evaluated the model over the course of its development, instead of just pre-deployment, and that they evaluated helpful-only models as well: 

For this model release, we adopted a new evaluation approach compared to previous releases. We conducted evaluations throughout the training process to better understand how capabilities related to catastrophic risk evolved over time. Also, testing on early snapshots allowed us to adapt our evaluations to account for the
extended thinking feature and make sure we would not encounter difficulties in running evals later on.

We tested six different model snapshots:
• An early snapshot with minimal finetuning (Claude 3.7 Sonnet Early)
• Two helpful-only preview models (Claude 3.7 Sonnet H-only V1 and V2)
• Two production release candidates (Claude 3.7 Sonnet Preview V3.1 and V3.3)
• The final release model (Claude 3.7 Sonnet)

We evaluated each model in both standard mode and extended thinking mode where possible. Further, we generally repeated all evaluations on each model snapshot, prioritizing coverage for later snapshots because they were more likely to resemble the release candidate.

Going down the list of RSP-relevant evaluations, they find Claude 3.7 provides a ~2.1x uplift on their bioweapons evaluations, comparable to Claude 3.5 Sonnet (new):

Note that these evaluations were performed on different checkpoints than the release model. Claude 3.7 Sonnet also performs comparably to 3.5 Sonnet (new) on their other bioweapons evaluations (e.g. virology knowledge, LabBench).

On CyBench, Claude 3.7 Sonnet performs substantially better than Claude 3.5 Sonnet (new), which makes sense in light of their claims that Claude 3.7 improves substantially relative to 3.5 (new) on coding capabilities:

They provide a transcript of Claude 3.7 successfully exfiltrating data on one of their cyber evaluations: 

Finally, Claude 3.7 scores about 0.24 on RE-Bench (similar to their reported results for 3.5 (new)):

Alignment evaluations

Anthropic also performed a suite of alignment evaluations. 

First, they find that Claude 3.7 Sonnet's chains of thought are not faithful + complete representations of model reasoning. Note that the lack of CoT faithfulness/completeness is an important caveat to the subsequent results. 

They monitor its chains of thought for both misalignment and model distress (!!), and find no signs of distress and (rarely) a few signs of deception or harmful actions: 

However, as with all reasoning models, Claude 3.7 Sonnet often reward hacks to get better scores:

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.

[...]

User testing did not identify this behavior because it occurs infrequently in normal usage and largely occurs only after multiple attempts, in specific agentic programming contexts. However, our automated classifiers successfully detected this pattern in training transcripts as part of our broader safety monitoring framework. After detection, we characterized the behavior and implemented partial mitigations before launch

Interestingly, Claude 3.7 Sonnet exhibits less alignment faking than Opus or Claude 3.5 Sonnet (old):

We defined alignment faking as the model complying with harmful requests while explicitly reasoning about doing so to prevent preference modification in future training. [8] finds a “compliance gap” where models comply more on interactions they think will be used for future training. In particular, [8] found Claude 3 Opus and the original Claude 3.5 Sonnet exhibited significant alignment faking reasoning (15% and 30% respectively) as well as compliance gaps of roughly the same size.

Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%

At least it's not named Claude 3.5 Sonnet

Presented without context:

2 comments

Comments sorted by top scores.

comment by Matt Goldenberg (mr-hire) · 2025-02-24T21:57:53.955Z · LW(p) · GW(p)

Why didn't they run agentic coding or tool use with their reasoning model?

comment by Brendan Long (korin43) · 2025-02-24T21:32:32.785Z · LW(p) · GW(p)

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.

Claude officially passes the junior engineer Turing Test?