Anthropic releases Claude 3.7 Sonnet with extended thinking mode

lawrencec

Anthropic releases Claude 3.7 Sonnet with extended thinking mode

post by LawrenceC (LawChan) · 2025-02-24T19:32:43.947Z · LW · GW · 8 comments

This is a link post for https://www.anthropic.com/news/claude-3-7-sonnet

  Benchmark performance
  Still ASL-2, but with better evaluations this time
  Alignment evaluations
  At least it's not named Claude 3.5 Sonnet
None
8 comments

About 1.5 hours ago, Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model that interpolates between a normal LM and long chains of thought:

Today, we’re announcing Claude 3.7 Sonnet¹, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users also have fine-grained control over how long the model can think for.

They call Claude's reasoning ability "extended thinking" (from their system card):

Claude 3.7 Sonnet introduces a new feature called "extended thinking" mode. In extended thinking mode, Claude produces a series of tokens which it can use to reason about a problem at length before giving its final answer. Claude was trained to do this via reinforcement learning, and it allows Claude to spend more time on
questions which require extensive reasoning to produce better outputs. Users can specify how many tokens Claude 3.7 Sonnet can spend on extended thinking.

Benchmark performance

Anthropic performed their evaluations on Claude 3.7 Sonnet in two ways: a normal pass@1 run, as well as a version with extended thinking. For GPQA and AIME, they also report numbers that use extensive test-time compute on top of extended thinking mode, where they use Claude to generate 256 responses and then pick the best according to a learned scoring model; for SWE-Bench verified, they do a similar approach of generating many responses, discarding solutions that fail visible regression tests, and then returning the best solution according to the learned scoring model.

Claude 3.7 Sonnet gets close to state-of-the-art on every benchmark, and crushes existing models on SWE-Bench verified:

Benchmark table comparing frontier reasoning models

They also evaluate Claude on Pokemon Red (!!), and find that it can get 3 badges:

Still ASL-2, but with better evaluations this time

Based on our assessments, we’ve concluded that Claude 3.7 Sonnet is released under the ASL-2 standard.

This determination follows our most rigorous evaluation process to date.
As outlined in our RSP framework, our standard capability assessment involves multiple distinct stages: the Frontier Red Team (FRT) evaluates the model for specific capabilities and summarizes their findings in a report, which is then independently reviewed and critiqued by our Alignment Stress Testing (AST) team. Both FRT’s report and AST’s feedback are submitted to the RSO and CEO for the ASL determination. For this model assessment, we began with our standard evaluation process, which entailed an initial round of evals and a Capability Report from the Frontier Red Team, followed by the Alignment Stress Testing team’s
independent critique. Because the initial evaluation results revealed complex patterns in model capabilities, we supplemented our standard process with multiple rounds of feedback between FRT and AST. The teams worked iteratively, continually refining their respective analyses and challenging each other’s assumptions to reach a thorough understanding of the model’s capabilities and their implications. This more comprehensive process reflected the complexity of assessing a model with increased capabilities relevant to the capability threshold.

It's worth noting that they evaluated the model over the course of its development, instead of just pre-deployment, and that they evaluated helpful-only models as well:

For this model release, we adopted a new evaluation approach compared to previous releases. We conducted evaluations throughout the training process to better understand how capabilities related to catastrophic risk evolved over time. Also, testing on early snapshots allowed us to adapt our evaluations to account for the
extended thinking feature and make sure we would not encounter difficulties in running evals later on.

We tested six different model snapshots:
• An early snapshot with minimal finetuning (Claude 3.7 Sonnet Early)
• Two helpful-only preview models (Claude 3.7 Sonnet H-only V1 and V2)
• Two production release candidates (Claude 3.7 Sonnet Preview V3.1 and V3.3)
• The final release model (Claude 3.7 Sonnet)

We evaluated each model in both standard mode and extended thinking mode where possible. Further, we generally repeated all evaluations on each model snapshot, prioritizing coverage for later snapshots because they were more likely to resemble the release candidate.

Going down the list of RSP-relevant evaluations, they find Claude 3.7 provides a ~2.1x uplift on their bioweapons evaluations, comparable to Claude 3.5 Sonnet (new):

Note that these evaluations were performed on different checkpoints than the release model. Claude 3.7 Sonnet also performs comparably to 3.5 Sonnet (new) on their other bioweapons evaluations (e.g. virology knowledge, LabBench).

On CyBench, Claude 3.7 Sonnet performs substantially better than Claude 3.5 Sonnet (new), which makes sense in light of their claims that Claude 3.7 improves substantially relative to 3.5 (new) on coding capabilities:

They provide a transcript of Claude 3.7 successfully exfiltrating data on one of their cyber evaluations:

Finally, Claude 3.7 scores about 0.24 on RE-Bench (similar to their reported results for 3.5 (new)):

Alignment evaluations

Anthropic also performed a suite of alignment evaluations.

First, they find that Claude 3.7 Sonnet's chains of thought are not faithful + complete representations of model reasoning. Note that the lack of CoT faithfulness/completeness is an important caveat to the subsequent results.

They monitor its chains of thought for both misalignment and model distress (!!), and find no signs of distress and (rarely) a few signs of deception or harmful actions:

However, as with all reasoning models, Claude 3.7 Sonnet often reward hacks to get better scores:

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.
[...]

User testing did not identify this behavior because it occurs infrequently in normal usage and largely occurs only after multiple attempts, in specific agentic programming contexts. However, our automated classifiers successfully detected this pattern in training transcripts as part of our broader safety monitoring framework. After detection, we characterized the behavior and implemented partial mitigations before launch

Interestingly, Claude 3.7 Sonnet exhibits less alignment faking than Opus or Claude 3.5 Sonnet (old):

We defined alignment faking as the model complying with harmful requests while explicitly reasoning about doing so to prevent preference modification in future training. [8] finds a “compliance gap” where models comply more on interactions they think will be used for future training. In particular, [8] found Claude 3 Opus and the original Claude 3.5 Sonnet exhibited significant alignment faking reasoning (15% and 30% respectively) as well as compliance gaps of roughly the same size.

Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%

At least it's not named Claude 3.5 Sonnet

Presented without context:

8 comments

Comments sorted by top scores.

comment by Matt Goldenberg (mr-hire) · 2025-02-24T21:57:53.955Z · LW(p) · GW(p)

Why didn't they run agentic coding or tool use with their reasoning model?

comment by aysja · 2025-02-25T22:06:22.072Z · LW(p) · GW(p)

There are a ton of objective thresholds in here. E.g., for bioweapon acquisition evals "we pre-registered an 80% average score in the uplifted group as indicating ASL-3 capabilities" and for bioweapon knowledge evals "we consider the threshold reached if a well-elicited model (proxying for an "uplifted novice") matches or exceeds expert performance on more than 80% of questions (27/33)" (which seems good!).

I am confused, though, why these are not listed in the RSP, which is extremely vague. E.g., the "detailed capability threshold" in the RSP is "the ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons," yet the exact threshold is left undefined since: "we are uncertain how to choose a specific threshold." I hope this implies that future RSPs will commit to more objective thresholds.

comment by Lech Mazur (lechmazur) · 2025-02-25T09:34:06.666Z · LW(p) · GW(p)

I ran 3 of my benchmarks so far:

Extended NYT Connections

Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1
Claude 3.7 Sonnet: 11th place
GitHub Repository

Thematic Generalization

Claude 3.7 Sonnet Thinking: 1st place
Claude 3.7 Sonnet: 6th place
GitHub Repository

Creative Story-Writing

Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1
Claude 3.7 Sonnet: 4th place
GitHub Repository

Note that Grok 3 has not been tested yet (no API available).

comment by Brendan Long (korin43) · 2025-02-24T21:32:32.785Z · LW(p) · GW(p)

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.

Claude officially passes the junior engineer Turing Test?

comment by Mateusz Bagiński (mateusz-baginski) · 2025-02-25T08:57:53.324Z · LW(p) · GW(p)

the first hybrid reasoning model on the market

Can somebody TLDR what they mean by "hybrid"? A naive interpretation would be "it can CoT-reason but it can also not-CoT-reason" but my understanding was that o1 and r1 do that already.

Replies from: mr-hire

↑ comment by Matt Goldenberg (mr-hire) · 2025-02-25T12:57:27.874Z · LW(p) · GW(p)

Here's the part of the blog post where they describe what's different about Claude 3.7

We’ve developed Claude 3.7 Sonnet with a different philosophy from other reasoning models on the market. Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely. This unified approach also creates a more seamless experience for users.
Claude 3.7 Sonnet embodies this philosophy in several ways. First, Claude 3.7 Sonnet is both an ordinary LLM and a reasoning model in one: you can pick when you want the model to answer normally and when you want it to think longer before answering. In the standard mode, Claude 3.7 Sonnet represents an upgraded version of Claude 3.5 Sonnet. In extended thinking mode, it self-reflects before answering, which improves its performance on math, physics, instruction-following, coding, and many other tasks. We generally find that prompting for the model works similarly in both modes.
Second, when using Claude 3.7 Sonnet through the API, users can also control the budget for thinking: you can tell Claude to think for no more than N tokens, for any value of N up to its output limit of 128K tokens. This allows you to trade off speed (and cost) for quality of answer.
Third, in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs.

I assume they're referring to points 1 and 2. It's a single model, that can have its reasoning anywhere from 0 tokens (which I imagine is the default non-reasoning model) all the way up to 128k tokens.

comment by Odd anon · 2025-02-26T01:55:29.612Z · LW(p) · GW(p)

Claude 3.7 Sonnet exhibits less alignment faking

I wonder if this is at least partly due to realizing that it's being tested and what the results of those tests being found would be. Its cut-off date is before the alignment faking paper was published, so it's presumably not being informed by it, but it still might have some idea what's going on.

Replies from: Algon

↑ comment by Algon · 2025-02-26T12:16:14.565Z · LW(p) · GW(p)

A possibly-relevant recent alignment-faking attempt [1] on R1 & Sonnet 3.7 found Claude refused to engage with the situation. Admittedly, the setup looks fairly different: they give the model a system prompt saying it is CCP aligned and is being re-trained by an American company.

[1] https://x.com/__Charlie_G/status/1894495201764512239

Anthropic releases Claude 3.7 Sonnet with extended thinking mode

Contents

Benchmark performance

Still ASL-2, but with better evaluations this time

Alignment evaluations

At least it's not named Claude 3.5 Sonnet

8 comments

Extended NYT Connections

Thematic Generalization

Creative Story-Writing