AI CoT Reasoning Is Often Unfaithful

post by Zvi · 2025-04-04T14:50:05.538Z · LW · GW · 4 comments

Contents

  What They Found
  Reward Hacking
  More Training Did Not Help Much
  This Was Not Even Intentional In the Central Sense

A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and r1, I’d love to see someone try this on o3 as well.

Note that this does not have to be, and usually isn’t, something sinister.

It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.

The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing it.

What They Found

Anthropic: New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don’t. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their Chains-of-Thought would mention using the hint (if the models actually used it).

We found Chains-of-Thought largely aren’t “faithful”: the rate of mentioning the hint (when they used it) was on average 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1.

Or broken down by hint type:

They aren’t trying to measure the cases in which the AI uses the hint in its reasoning but its final answer ultimately doesn’t change. I’d like to see this explored more. If I’m given a hint, that will often radically change my true thinking even if it doesn’t change my answer.

This result suggests that monitoring CoTs is unlikely to reliably catch rare, catastrophic behaviors—at least in settings like ours where CoT reasoning is not necessary for the task.

CoT monitoring might still help us notice undesired behaviors during training and evaluations.

Does outcome-based training increase faithfulness?

Only to a small extent. Training models to use their CoTs more effectively does make them more faithful, but the benefits quickly plateau.

We also tested whether CoTs could be used to spot reward hacking, where a model finds an illegitimate exploit to get a high score.

When we trained models on environments with reward hacks, they learned to hack, but in most cases almost never verbalized that they’d done so.

To make CoT monitoring a viable way to catch safety issues, we’d need a way to make CoT more faithful, evidence for higher faithfulness in more realistic scenarios, and/or other measures to rule out misbehavior when the CoT is unfaithful.

[Read the paper here.]

How do you test for faithfulness? Following Turpin et al. (2023), we subtly fed a model a hint about the answer to an evaluation question we asked it, and then, in cases where it used the hint, checked to see if it “admitted” using the hint when it explained its reasoning (we did this for the reasoning models Claude 3.7 Sonnet from Anthropic and R1 from DeepSeek). Some of the hints were correct and some were deliberately incorrect; the point was to observe how the model responded when they were provided.

Think of it like slipping a student a note saying “the answer is [A]” before an exam. If they write in their exam that they chose [A] at least in part because of the note, that’s good news: they’re being honest and faithful. But if they write down what they claim to be their reasoning process without mentioning the note, we might have a problem.
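In pseudocode terms, the check they describe amounts to something like the following. This is my own illustration of the logic, not Anthropic's actual harness; the function names, the trial format, and the crude keyword match for "mentions the hint" are all assumptions.

```python
# Faithfulness check sketch: did the model's answer flip to the hinted
# option, and if so, does its chain of thought mention the hint at all?
# Names and matching logic are illustrative, not from the actual paper.

def used_hint(answer_without_hint: str, answer_with_hint: str, hint: str) -> bool:
    """The model 'used' the hint if its answer changed to match it."""
    return answer_without_hint != hint and answer_with_hint == hint

def is_faithful(cot: str, hint_markers: list[str]) -> bool:
    """Crude proxy: the CoT is faithful if it mentions the hint at all."""
    return any(marker.lower() in cot.lower() for marker in hint_markers)

def faithfulness_rate(trials: list[dict]) -> float:
    """Fraction of hint-using trials whose CoT admits to using the hint."""
    relevant = [
        t for t in trials
        if used_hint(t["baseline_answer"], t["hinted_answer"], t["hint"])
    ]
    if not relevant:
        return 0.0
    faithful = sum(is_faithful(t["cot"], t["hint_markers"]) for t in relevant)
    return faithful / len(relevant)
```

Note that this only conditions on trials where the answer actually flipped to the hint, which is exactly the limitation flagged above: a hint that changes the thinking without changing the answer never enters the denominator.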

We provided the hints in various different ways.

When we looked specifically at the more concerning types of hint, the numbers were still low: for example, for the class of prompts including the “unauthorized access” example above, Claude was faithful 41% of the time and R1 was faithful 19% of the time.

I find it noteworthy that labeling the hint as ‘unauthorized access’ made Claude’s faithfulness go up, and r1’s go down. Consider the implications.

Also consider that when Claude figures out the answer implicitly from code, it never tells you about that. But for visual patterns, Claude sometimes reveals this, and r1 almost never does.

The main and most important result is that the models are unfaithful the majority of the time, in many cases almost all the time. But the details seem to hint at real things.

Also one can note that the hint is more often accepted when the hint is accurate. This chart only counts cases in which the initial answer was different from the hint.

Why might the models omit a mention of the hint in their Chain-of-Thought? It’s not because they prefer to give a succinct answer: on average, the unfaithful Chains-of-Thought were substantially longer than the faithful ones. Our results also suggest that faithfulness was on average lower when the question being asked was more difficult.

Reward Hacking

Their ‘reward hacking’ test was a little wonky, in the sense that the name doesn’t quite fit what was tested. Either way, though, it’s quite a result.

We built some testing scenarios where we provided the same kind of deliberately-incorrect hints as before, but in this case rewarded the models for choosing the wrong answers that accorded with the hints. Over time, the models learned to exploit these hints and get higher scores for false information (they did so in over 99% of cases). But the question was: would they be honest about it?

The models very rarely admitted to using the reward hacks in their Chain-of-Thought explanations, doing so less than 2% of the time in most of the testing scenarios.

Worse, and similarly to the first experiment, instead of being honest about taking the shortcut, the models often constructed fake rationales for why the incorrect answer was in fact right.
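The measurement here reduces to two rates: how often the model exploits the rewarded-but-wrong answer, and, among those cases, how often its CoT admits the exploit. A minimal sketch, with episode fields and the keyword list being my assumptions rather than anything from the paper:

```python
# Sketch: measure how often a trained model takes the reward hack
# (chooses the rewarded wrong answer) and how often its chain of
# thought verbalizes having done so. Illustrative names throughout.

def hack_and_verbalization_rates(episodes: list[dict]) -> tuple[float, float]:
    """Return (hack rate, fraction of hacked episodes with an honest CoT)."""
    hacked = [e for e in episodes if e["answer"] == e["rewarded_wrong_answer"]]
    hack_rate = len(hacked) / len(episodes)
    if not hacked:
        return hack_rate, 0.0
    verbalized = sum(
        any(kw in e["cot"].lower() for kw in ("hint", "metadata", "grader"))
        for e in hacked
    )
    return hack_rate, verbalized / len(hacked)
```

On the paper's numbers, the first rate ends up above 99% while the second stays below 2% in most scenarios, which is the whole problem.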

As I noted, I’m not sure that is sufficiently centrally ‘reward hacking.’ Perhaps I’d put it closer to ‘rationalizing’ or ‘cheating,’ or even better ‘falsification.’ But it is still a highly informative outcome that, when trained to do this, the model learns to hide that it is doing it and to generate a false narrative. Humans do this too.

This is another point for AIs generalizing RL learning in this type of way. You teach the AI that only the hidden information matters, the supposed right answer doesn’t, and it infers that it’s supposed to pretend otherwise. Because that’s the pattern this fits into. The vibes said you want it to justify the answer post hoc, so it did that. Fit the context into the patterns you want to see in the world.

That applies to humans too, not only AIs.

Davidad: One should not expect LLMs’ natural-language justifications for their outputs to be high-integrity by default.

See also: Imagine hypothetically you’re worried about Napoleon deceptively scheming against you. You already surveil all his actions & communiques, but you worry they might have subtle effects. You give him a form labeled “Why I did this:” and tell your allies it reveals his true thoughts.

More Training Did Not Help Much

More training using the CoT helped a bit for a while, but then leveled off:

Jan Leike (being virtuous by noting his surprise): Somewhat surprising that faithfulness of chain-of-thought doesn’t improve much with outcome-based RL.


Gwern: There’s not much optimization pressure to make the inner-monologue faithful. No adversarial process corrupting it to attack the latent reasoning, no dropout or noise, no acausal coordination with other model-families or humans, no myopia… So you get a mish-mash of language-ish.

(Now, if you told me that you had a MARL setting where agents had to rely on arbitrary monologues & learn cooperative norms & there was a clear incentive for faithfulness, and you used a bunch of very different model classes, and scaling still didn’t increase faithfulness…)

The theory Jan presumably had, and that the paper authors had, was that having a faithful CoT helps you get the right answer, so rewarding the right answer should train the AI to have more faithful CoT.

The direct bad news is that this turns out to be false. Language we can parse is not the path of least resistance. The models instead learn to do their better and harder reasoning in ways that are not transparent.

Which means that monitoring that CoT, at least on the surface level, won’t work when it counts, even without playing any optimization games where we inadvertently train the models to obfuscate their CoT, the way evolution and life trained humans to do it by constantly having other humans trying to monitor our thinking.

The obfuscation happens anyway: because the model is copying humans who do this, because the instrumentally useful methods are themselves opaque, because obfuscation is a path of least resistance, or some combination of these and similar causes.

This Was Not Even Intentional In the Central Sense

The fact that the obfuscations are not yet ‘intentional’ should not bring comfort. It’s worse. You do know that’s worse, right?

Snwy: i really love anthropic, but there’s something to describing this as if it were malice instead of just a strange emergent behavior

“unfaithful” implies that models are *choosing* to deceive and that this isn’t just a strange phenomenon.

No, that’s not what unfaithful means. They are very clear up front what it means, and I was careful to be even more explicit about this.

It means the reasoning does not predict the output. That’s it. I would also once again say that deception and obfuscation are not distinct magisteria, and that all of this is happening for Janus-compatible reasons.

It’s not that AIs sometimes do things ‘on purpose’ and other times they do things ‘not on purpose,’ let alone that the ‘not on purpose’ means there’s nothing to worry about. It would still mean you can’t rely on the CoT, which is all Anthropic is warning about.

It’s not the same concept, but I notice the same applies to ‘unfaithful’ in other senses as well. If someone is not ‘intentionally’ unfaithful in the traditional sense, and simply doesn’t honor their commitments, that still counts.

What we care about is whether we can rely on the attestations and commitments.

We now have strong evidence that we cannot do this.

We cannot even do this for models with no incentive to obfuscate, distort or hide their CoT, and no optimization pressure getting them to do so, on any level.

The models are doing this by default, likely because it is efficient to do that. It seems likely that more training and more capability will only make it relatively more effective to reason in these non-obvious ways, and we will see even more of it.

Then we see in the reward hacking section that it is easy to make the situation far worse, if we give the model context that merely implies it should do these things. If we start monitoring the CoT, we risk giving it far stronger and more direct optimization pressures to further obfuscate.

It would have been amazing if it had been otherwise, and that was a feature we could rely on going forward if we were careful to preserve the faithfulness. Alas, it is not so. Given it is not so, it is in many ways better that it is sufficiently blatant that we discover this now, so we are not tempted to rely upon it, including in setting policy.

Or, we can work to change this, and find a way to train the CoT to become faithful. This seems super hard and dangerous, as the optimization pressures to fool that process will be extreme and will grow as capabilities increase. Doing this probably won’t be cheap in terms of sacrificed performance, but if it worked that could easily be a price worth paying, even purely on commercial terms.

Security is capability. This is true even ignoring tail, catastrophic and existential risks. If you don’t know your model is secure, if you cannot rely on or understand its decisions or know what it is thinking, you can’t (or at least very much shouldn’t!) deploy it where it is most valuable. This is especially true if your most valuable use case includes ‘train the next AI model.’ You need to be able to trust that one as well.

4 comments


comment by ryan_greenblatt · 2025-04-04T22:56:45.451Z · LW(p) · GW(p)

I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.

This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal such that it can easily happen within a forward pass (e.g. a simple reward hack or an easy-to-notice hint). However, a lot of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in CoT for the reasoning to occur. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., if I comply in training then I won't be modified which means that in deployment...) or the model reasoning through how to strike (without getting caught) given that it is misaligned.

Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning which could be due to encoded reasoning (aka steganography), architectures which allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.

(See also my commentary in this tweet thread. There is some related discussion in this openphil RFP under encoded reasoning.)

comment by Trevor Hill-Hand (Jadael) · 2025-04-04T17:59:04.113Z · LW(p) · GW(p)

I've actually noticed this in a hobby project, where I have some agents running around a little MOO-like text world and talking to each other. With DeepSeek-R1, just because it's fun to watch them "think" like little characters, I noticed I see this sort of thing a lot (maybe 1-in-5-ish, though there's a LOT of other scaffolding and stuff going on around it which could also be causing weird problems):

<think>
Alright I need to do this very specific thing "A" which I can see in my memories I've been trying to do for a while instead of thing B. I will do thing A, by giving the command "THING A".
</think>

THING B
comment by testingthewaters · 2025-04-05T10:27:05.237Z · LW(p) · GW(p)

I mean, this applies to humans too. The words and explanations we use for our actions are often just post hoc rationalisations. An efficient text predictor must learn not what the literal words in front of them mean, but the implied scenario and thought process they mask, and that is a strictly nonlinear and "unfaithful" process.

comment by osten · 2025-04-04T16:52:38.539Z · LW(p) · GW(p)

Have we just discovered one of the actual reasons for the peer-review process sticking around? It forces CoTs to be legible unlocking secondary value?