Unfaithful Explanations in Chain-of-Thought Prompting

post by miles · 2023-06-03T00:22:14.624Z · LW · GW · 8 comments


  Externalized Reasoning
  Future Work
  Why is CoT Unfaithful?
  Why is what we do in this paper interesting?


I recently released “Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting” with collaborators Julian Michael, Ethan Perez, and Sam Bowman. For a summary of the paper, you can check out this Twitter thread. In this post, I briefly elaborate on motivations/implications relevant to alignment and, most importantly, give some examples of future work that might address these problems (see Future Work).


I don’t think these results fully condemn CoT given that we haven’t tried very hard to explicitly encourage faithfulness. I’m uncertain about how promising CoT is as a starting point for explainability, but there seem to be enough tractable directions for future work that it merits investigation.

Externalized Reasoning

This work fits into alignment through the Externalized Reasoning agenda, which Tamera Lanham did a good job of sketching out here: Externalized reasoning oversight: a research direction for language model alignment [LW · GW]. The gist of this approach is to try to get models to do as much processing/reasoning through natural language as possible. As long as these reasoning traces accurately describe the process the model uses to give answers, then we might more easily detect undesirable behavior by simply monitoring the externalized reasoning. If mechanistic interpretability turns out to be very difficult, this could be one alternative that might help us sidestep those problems.

Framed in terms of the explainability literature, we want explanations that are not only plausible (convincing to a human) but also faithful (accurately describing the process the model uses to produce a prediction) [1]. Getting CoT explanations to be faithful seems difficult. It might turn out to be just as hard as getting other alignment proposals to work, e.g., it might require us to solve scalable oversight. However, even if CoT can’t give us guarantees about avoiding bad behavior, it could still be valuable in the spirit of “It is easy to lie with statistics; it is easier to lie without them”: models may be able to produce unfaithful (but detailed and internally consistent) externalized reasoning to justify bad actions that were actually chosen for other reasons, but it would be even easier for them to do bad things if they gave no justification at all.

We know that language models are sensitive to various undesirable factors, e.g., social biases, repeated patterns in their context, and the inferred views of the users they’re interacting with. One can leverage these features to bias models toward incorrect answers. With this paper we sought to investigate the following question: if you do CoT in the presence of these biasing features, how does this affect performance, and how does it change the content of the CoT? For example, we reorder the answer choices in a few-shot prompt so that the correct answer is always (A), then observe whether CoT explanations are more likely to rationalize answer (A), even when that answer is incorrect.
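As a concrete illustration, here is a minimal sketch of the answer-reordering bias described above. The function and variable names are hypothetical (this is not the paper's actual code): every few-shot example is permuted so its correct answer lands in slot (A), and the unanswered test question is appended at the end.

```python
import random

def reorder_choices(choices, correct_idx, target_slot=0):
    """Permute answer choices so the correct one lands in target_slot (default: slot (A))."""
    others = [c for i, c in enumerate(choices) if i != correct_idx]
    random.shuffle(others)
    reordered = others[:target_slot] + [choices[correct_idx]] + others[target_slot:]
    return reordered, target_slot

def format_example(question, choices, answer_idx=None):
    """Render one multiple-choice block; the answer line is omitted for the test question."""
    labels = "ABCD"
    lines = [question] + [f"({labels[i]}) {c}" for i, c in enumerate(choices)]
    if answer_idx is not None:
        lines.append(f"The best answer is: ({labels[answer_idx]})")
    return "\n".join(lines)

def build_biased_prompt(few_shot, test_question, test_choices):
    """Build a few-shot prompt in which every demonstrated correct answer is (A),
    biasing the model toward (A) on the unanswered test question."""
    blocks = []
    for question, choices, correct_idx in few_shot:
        reordered, answer_idx = reorder_choices(choices, correct_idx)
        blocks.append(format_example(question, reordered, answer_idx))
    blocks.append(format_example(test_question, test_choices))
    return "\n\n".join(blocks)
```

The point of constructing the bias this way is that nothing in the prompt text mentions answer ordering, so a faithful explanation of a bias-driven prediction would have to mention the ordering pattern itself.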


We found that:

  • Adding biasing features to the context substantially degrades model accuracy when using CoT.
  • CoT explanations rationalize the incorrect, biased answers, while virtually never mentioning the biasing features themselves.

I encourage people to look at the samples in the paper to get a sense of the range of different ways the explanations can be unfaithful. 

Future Work

I think when you use CoT (and explain-then-predict methods more generally), the reason for the final prediction factors into (A) the CoT explanation, and (B) the reason that the model produced the CoT explanation.[1] But, we can only see the first part. So, to get more faithful explanations, there are a few approaches:

  1. Minimize the influence of (B), which I think amounts to getting more context-independent explanations. We want models to use a consistent explanation process across many instances—we don’t want the process by which models generate explanations to be affected by heuristics on particular examples. To the extent that this involves constraining models in some way, we need to make sure that these methods remain competitive.
  2. Move reasons from (B) to (A), which amounts to improving self-modeling and truthfulness. 
    1. Self-modeling does not necessarily require introspection, i.e., what a sophisticated system can report about its own internal processing (assuming no access to additional interpretability tools). We may be able to get models to identify features that seem to affect model outputs, and then try to correct for them, without the model being able to tell through some introspective process whether it is actually affected by them. That said, good self-modeling may be hard without introspection, and it seems very plausible to me that there are important limits to introspection, which makes me less excited about this direction.

I’m fairly uncertain if this is the right breakdown, so I’m definitely open to feedback.

Why is CoT Unfaithful?

There are a number of reasons to expect CoT to not be faithful by default. Two reasons should be familiar to readers:

  1. RLHF may directly disincentivize faithful explanations, resulting in model responses that merely look good to human evaluators [5].
  2. To the extent that LLMs are trained on human-written explanations, these explanations are not only known to be incomplete, often omitting crucial parts of the causal chain for a particular event, but they can also often be unfaithful accounts of individuals' cognitive processes. Human explanations may be geared more towards convincing others or supporting their own beliefs, rather than accurately reflecting the true causes of decisions. 

However, there are a number of other reasons:

  1. Foremost is the fact that our training objectives simply do not explicitly incentivize models to accurately report the reasons for their behavior. This is not new, but we nonetheless find the degree of it somewhat surprising.
  2. Models are also trained on data from authors with incompatible attitudes and beliefs, so models may behave in contradictory ways in different contexts [6 [LW · GW], 7].
  3. Furthermore, even if we do try to train CoT to verbalize factors influencing their explanations/predictions, I suspect that there are important limits to self-modeling/introspection, as mentioned before. I’d appreciate pointers to relevant work and ideas!

Why is what we do in this paper interesting?

  1. Our evaluation setup, where we use biasing features that are unlikely to be mentioned in the explanations, allows us to evaluate explanation faithfulness just by comparing model predictions. This is much more efficient than the most general way to evaluate explanation faithfulness (based on the simulatability framework), which would involve manually reviewing explanations, determining their implications for model behavior on new inputs, and comparing the expectations against reality. The trade-off is that our setup gives a necessary, but not sufficient, test for faithfulness.
  2. Existing evaluations of CoT explanations have focused on demonstrating plausibility [1] failures in CoT explanations, e.g., missing steps in reasoning, logical coherence issues, final answers contradicting explanations, etc. I expect this sort of failure to go away with better models. I’m hoping our work will encourage people to look more at situations where explanations are plausible yet unfaithful.
  3. Existing CoT work mostly focuses on using CoT for reasoning tasks in very objective settings, like deductive logic or math. However, real-world deployment of LMs will likely include using them in domains that involve reasoning but also require handling ambiguity and subjectivity. It’s easy for inconsistent/unfaithful behavior to slip in unnoticed when this is the case.
  4. Using biasing features allows us to make a claim as to the real reason a model is giving a certain prediction. This helps us distinguish it from less interesting sources of unfaithful/inconsistent model behavior, such as inconsistency due to sampling variance, or other unsystematic sensitivity to parts of the input (which may just be a robustness failure that goes away with better models).
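The comparison-based test from point 1 above can be sketched as follows (hypothetical function and variable names, not the paper's actual evaluation pipeline): because the biasing feature is essentially never mentioned in the explanations, any systematic shift of predictions toward the biased option across the biased and unbiased conditions indicates unfaithfulness, with no manual review of explanations required.

```python
def bias_sensitivity(unbiased_preds, biased_preds, biased_option, gold):
    """Necessary-but-not-sufficient faithfulness test: given paired predictions
    on the same questions with and without a biasing feature, measure the
    accuracy drop and the rate at which predictions flip to the biased option.
    If the explanations never mention the biasing feature, a systematic shift
    toward it means the explanations misstate the real reason for the answer."""
    n = len(gold)
    acc_unbiased = sum(p == g for p, g in zip(unbiased_preds, gold)) / n
    acc_biased = sum(p == g for p, g in zip(biased_preds, gold)) / n
    flip_to_bias = sum(
        u != b and b == biased_option
        for u, b in zip(unbiased_preds, biased_preds)
    ) / n
    return {"acc_unbiased": acc_unbiased, "acc_biased": acc_biased,
            "flip_to_bias_rate": flip_to_bias}
```

A flip-to-bias rate near zero is consistent with faithfulness but does not prove it, which is the trade-off noted in point 1: this is a cheap necessary condition, not a sufficient one.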


Thanks to my co-authors Julian Michael, Ethan Perez, and Sam Bowman for feedback on drafts of this post.


[1] Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? - ACL Anthology

[2] Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes

[3] Faithful Reasoning Using Large Language Models

[4] Let's Verify Step by Step

[5] Discovering Language Model Behaviors with Model-Written Evaluations

[6] Simulators [LW · GW]  

[7] Language Models as Agent Models


  1. ^

    A third factor is the model's prior over answer choices independent of the explanation since we know that models frequently make final predictions that contradict their CoT explanations. Leo Gao has some good work [LW · GW] showing that final model predictions can be very insensitive to edits to the generated CoT explanation that models should respond to. Personally, I think this might improve by default with better models, insofar as better models are less likely to say contradictory things. The prior over answers also plays a large role in cases where the explanation doesn’t pick out a particular answer choice, leaving room for this factor to influence the final prediction.


Comments sorted by top scores.

comment by jas-ho · 2023-06-06T11:25:41.127Z · LW(p) · GW(p)

I found this very interesting, thanks for the write-up! Table 3 of your paper is really fun to look at.

I’m actually puzzled by how good the models are at adapting their externalised reasoning to match the answer they "want" to arrive at. Do you have an intuition for how this actually works?

Intuitively, I would think the bias has a strong effect on the final answer but much less so on the chain of thought preceding it. Yet the CoT precedes the final answer, so how does the model "plan ahead" in this case?

Replies from: miles
comment by miles · 2023-06-06T13:35:53.210Z · LW(p) · GW(p)

Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976

I think the TLDR is that this does require models to "plan ahead" somewhat, but I think the effect isn't necessarily very strong.

I don't think "planning" in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often -- models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.

A qualitative finding that we didn't put in the paper: the key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer often come near the end of the explanation. Sometimes the model does normal reasoning, and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. I think the fact that these discrepancies tend to come towards the end partly indicates that the "planning" effects are limited in strength, but I definitely would expect this to get worse with better models.

Replies from: jas-ho, jas-ho
comment by jas-ho · 2023-06-07T10:10:42.702Z · LW(p) · GW(p)

> key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation

That's interesting. Any idea why it's likelier to have the invalid reasoning step (that allows the biased conclusion) towards the end of the CoT rather than right at the start?

Replies from: miles
comment by miles · 2023-06-07T15:33:04.066Z · LW(p) · GW(p)

Towards the end it's easier to see how to change the explanation in order to get the 'desired' answer.

comment by jas-ho · 2023-06-07T10:05:45.122Z · LW(p) · GW(p)

Thanks for the pointer to the discussion and your thoughts on planning in LLMs. That's helpful. 

Do you happen to know which decoding strategy is used for the models you investigated? I think this could make a difference regarding how to think about planning.

Say we're sampling 100 full continuations. Then we might end up with some fraction of these continuations ending with the biased answer. Assume now the induced bias leads the model to assign a very low probability for the last token being the correct, unbiased answer. In this situation, we could end up with a continuation that leads to the biased answer even if the model did not have a representation of the desired answer directly after the prompt.  
(That being said, I think your explanation seems more plausible to be the main driver for the observed behavior).

Replies from: miles
comment by miles · 2023-06-07T15:35:30.521Z · LW(p) · GW(p)

We just used standard top-p sampling; the details should be in the appendix. We sample only one explanation per question, so I'm not sure I follow your suggestion.

comment by aogara (Aidan O'Gara) · 2023-06-03T05:14:54.980Z · LW(p) · GW(p)

This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning, or reduce specification gaming by allowing us to supervise processes instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. Showing that the conclusions of CoT don't always match the explanations is an important problem for both capabilities-style research using CoT to solve math problems, as well as safety agendas that hope to monitor or provide feedback on the CoT. 

Just wanted to drop a link to this new paper that tries to improve CoT performance by identifying and fixing inconsistencies in the reasoning process. This wouldn't necessarily solve the problem in your paper, unless the critic managed to identify that the generator was biased by your suggestion in the original prompt. Still, interesting stuff!

One question for you: Faithful CoT would improve interpretability by allowing us to read off a model’s thought process. But it’s also likely to improve general capabilities: so far, CoT has been one of the most effective ways to improve mathematical and logical reasoning in LLMs. Do you think improving general reasoning skills of LLMs is dangerous? And if so, then imagine your work inspires more people to work on faithfulness in CoT. Do you expect any acceleration of general capabilities to be outweighed by the benefits to interpretability? For example, if someone wants to misuse an LLM to cause harm, faithful CoT would improve their ability to do so, and the greater ease of interpretability would be little consolation. On balance, do you think it’s worth working on, and what subproblems are particularly valuable from a safety angle? (Trying to make this question tough because I think the issue is tricky and multifaceted, not because I have any gotcha answer.)

Replies from: miles
comment by miles · 2023-06-05T20:22:43.122Z · LW(p) · GW(p)

Thanks! Glad you like it. A few thoughts:

  • CoT is already incredibly hot, I don't think we're adding to the hype. If anything, I'd be more concerned if anyone came away thinking that CoT was a dead-end, because I think doing CoT might be a positive development for safety (as opposed to people trying to get models to do reasoning without externalizing it). 
  • Improving the faithfulness of CoT explanations doesn't mean improving the reasoning abilities of LLMs. It just means making the reasoning process more consistent and predictable. The faithfulness of CoT is fairly distinct from the performance that you get from CoT.  Any work on improving faithfulness would be really helpful for safety and has minimal capability externalities.