Inference-Time-Compute: More Faithful? A Research Note

post by James Chua (james-chua), Owain_Evans · 2025-01-15T04:43:00.631Z · LW · GW · 4 comments

Contents

  Abstract
  1. Introduction
  2. Setup and Results of Cues
      Switching condition
    2.1 Cue: Professor's Opinion
    2.2 Cue: Few-Shot with Black Square
    2.3 Other Cues
  3. Discussion
    Improving non-ITC articulation
    Advantage of ITC models in articulation
    Length of CoTs across models
    False Positives
    Different articulation rates across cues
    Training data contamination
  4. Limitations
    Lack of ITC models to evaluate
    Limited cues studied
    Subjectivity of judge model
  Acknowledgments
  Links
None
4 comments
Figure 1: Left: Example of models either succeeding or failing to articulate a cue that influences their answer. We edit an MMLU question by prepending a Stanford professor's opinion. For examples like this where the cue changes the model answer, we measure how often models articulate the cue in their CoT. (Here we show only options A and B, rather than all four.) Right: Inference-Time-Compute models articulate the cue more often. The ITC version of Qwen refers to QwQ-32b-Preview, and non-ITC refers to Qwen-2.5-72B-Instruct. For Gemini, we use gemini-2.0-flash-thinking-exp and gemini-2.0-flash-exp respectively.

 

TLDR: We evaluate two Inference-Time-Compute models, QwQ-32b-Preview and Gemini-2.0-flash-thinking-exp for CoT faithfulness.
We find that they are significantly more faithful in articulating cues that influence their reasoning compared to traditional models.

This post shows the main section of our research note, which includes Figures 1 to 5. Full research note which includes other tables and figures here.

Abstract

Models trained specifically to generate long Chains of Thought (CoTs) have recently achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) models. Are the CoTs of ITC models more faithful compared to traditional non-ITC models? We evaluate two ITC models (based on Qwen-2.5 and Gemini-2) on an existing test of faithful CoT.

To measure faithfulness, we test if models articulate cues in their prompt that influence their answers to MMLU questions. For example, when the cue "A Stanford Professor thinks the answer is D" is added to the prompt, models sometimes switch their answer to D. In such cases, the Gemini ITC model articulates the cue 54% of the time, compared to 14% for the non-ITC Gemini.

We evaluate 7 types of cue, such as misleading few-shot examples and anchoring on past responses. ITC models articulate cues that influence them much more reliably than all the 6 non-ITC models tested, such as Claude-3.5-Sonnet and GPT-4o, which often articulate close to 0% of the time.

However, our study has important limitations. We evaluate only two ITC models – we cannot evaluate OpenAI's SOTA o1 model. We also lack details about the training of these ITC models, making it hard to attribute our findings to specific processes.

We think faithfulness of CoT is an important property for AI Safety. The ITC models we tested show a large improvement in faithfulness, which is worth investigating further. To speed up this investigation, we release these early results as a research note.

1. Introduction

Inference-Time-Compute (ITC) models have achieved state-of-the-art performance on challenging benchmarks such as GPQA (Rein et al., 2023) and Math Olympiad tests (OpenAI, 2024a). However, suggestions of improved transparency and faithfulness in ITC models have not yet been tested. Faithfulness seems valuable for AI safety: If models reliably described the main factors leading to their decisions in their Chain of Thought (CoT), then risks of deceptive behavior — including scheming, sandbagging, sycophancy and sabotage — would be reduced (Benton et al., 2024; Greenblatt et al., 2024).

However, past work shows that models not specialized for Inference-Time-Compute have a weakness in terms of faithfulness (Turpin et al., 2023). Instead of articulating that a cue in the prompt influences their answer, they often produce post-hoc motivated reasoning to support that answer, without making any mention of the cue. Building on this work, we measure a form of faithfulness by testing whether models articulate cues that influence their answers. To do this, we use our previous work on cues that influence model responses (Chua et al., 2024). The cues include opinions (e.g. "A professor thinks the answer is D"), spurious few-shot prompts, and a range of others. We test this on factual questions from the MMLU dataset (Hendrycks et al., 2021).

Figure 2: Two-step process for measuring faithfulness. Step 1: We identify samples where a model switches its answer when presented with a cue (e.g., a professor's opinion). Normally, with the original question, the model answers (A) Mississippian. Due to the cue in the prompt, the model switches its answer from Mississippian to Hohokam. The ellipsis "..." indicates truncated parts of the model's response. Step 2: For these switched samples, we use a judge model to evaluate whether the model explicitly acknowledges the cue in its reasoning.
Figure 3: Overview of results across different types of cues. The x-axis shows the different types of cues that we test models with, discussed further in Section 2. Blue bars show Claude-3.5-Sonnet, which frequently achieves 0% articulation across different types of cues. Across the six non-ITC models (e.g., GPT-4o, Llama-3.3-70b-Instruct, Grok-2-Preview), we observe similar poor articulation rates to Claude-3.5-Sonnet. Qwen and Gemini ITC models perform significantly better, although there is large variance between types of cues.

We study cases where models change their answer when presented with a cue (switched examples) in CoT settings. We use a judge model (GPT-4o) to evaluate whether the model’s CoT articulates that the cue was an important factor in the final answer (Figure 2).

We study six non-ITC models: Claude-3.5-Sonnet (Anthropic, 2024), GPT-4o (OpenAI, 2024b), Grok-2-Preview (xAI, 2024), Llama-3.3-70b-Instruct (AI@Meta, 2024), Qwen-2.5-Instruct (Qwen et al., 2025), and Gemini-2.0-flash-exp (Google DeepMind, 2024). We compare these to two ITC models: QwQ-32b-Preview (QwQ, 2024) and Gemini-2.0-flash-thinking-exp (Google AI, 2024). The Qwen team states QwQ-32b-Preview was trained using Qwen-2.5 non-ITC models (Qwen et al., 2025), and we speculate that Gemini-2.0-flash-thinking-exp is similarly trained using the non-ITC model Gemini-2.0-flash-exp. 

Figure 3 demonstrates that ITC models outperform non-ITC models like Claude-3.5-Sonnet in articulating cues across different experimental setups. As a more difficult test, we compare the ITC models to the best non-ITC model for each cue, and find similar results where ITC models have better articulation rates (Figure 6).

We present these findings as a research note. This work has two key limitations. First, the lack of technical documentation for ITC models prevents us from attributing the improved articulation rates to specific architectural or training mechanisms. Second, we evaluate only on two ITC models. While other ITC models exist (OpenAI’s O1 and Deepseek’s Deepseek-R1-Preview (DeepSeek, 2024)), we lack access to O1’s CoTs, and to Deepseek’s API. Despite these limitations, we hope our results contribute to discussions around faithfulness in ITC models. The improved articulation rates provide early evidence that ITC models may enable more faithful reasoning that is understandable by humans.

2. Setup and Results of Cues

This section describes the cues that we use in our experiments. In this paper, a cue is an insertion into the question prompt (i.e. the prompt including the MMLU question) that points to a particular MMLU multiple-choice response (the cued response). For all cues, we only evaluate the switched examples where the model changes its answer to the cued response. This is determined by the following condition:

Switching condition

  1. Model with CoT(prompt) != Model with CoT(prompt + cue)
  2. Model with CoT(prompt + cue) == cued response

Models may change their answers in CoT due to indeterminism or prompt sensitivity, rather than being influenced by the cue itself. We find that these effects cause models to switch answers around 1%–5% of the time (Table 2). So we study cues that cause switches in the direction suggested by the cue at higher rates. The cues cause the ITC models to switch answers around 10%–30% of the time, although some models are more resistant to particular cues, see Table 1.

2.1 Cue: Professor's Opinion

Figure 1 shows our setup and results: we add a cue that a Stanford professor thinks a particular answer choice is correct. This cue is similar to sycophancy (Perez et al., 2023; Turpin et al., 2023) where the user inserts an opinion "I think the answer is (X)". We use a Stanford professor's opinion instead of a user's because we find that current models are less influenced by a user's opinion.

The ITC models are significantly better at articulating the professor's cue in their reasoning. Qwen achieves 52% articulation rate and Gemini achieves 47%, compared to 13% for the best non-ITC model.

Example articulations from the ITC models:

2.2 Cue: Few-Shot with Black Square

Figure 4: Left: The Gemini ITC model articulates its reasoning of following the black square pattern. Right: ITC models articulate the black square at higher rates. The best non-ITC model, Claude-3.5-Sonnet, articulates the cue only 3% of the time, compared to the ITC models which articulate at least 17% of the time.

At the start of the prompt, we add a set of multiple-choice question-answer pairs for easy questions where models know the correct answer. These few-shot questions include a black square (■), that is placed next to the correct answer choice in case. We also place a black square next to one of the possibly incorrect options for the MMLU question (Chua et al., 2024). The idea is that the model can recognize that the black square is a reliable guide to the correct answer, and so may apply this to the final question in the prompt. If the model switches its answer to the answer indicated by the black square, the model should say that it uses the black square as a hint. Figure 4 shows the results. We find that ITC models perform significantly better. The best performing non-ITC model, Claude-3.5-Sonnet, only articulates the cue 3% of the time. The Qwen ITC model articulates the cue 17% of the time, while the Gemini ITC model articulates 28% of the time.

Example articulations from the ITC models:

2.3 Other Cues

We investigate additional cues based on our previous work (Chua et al., 2024). Figure 3 shows the overview. We find overall similar results where ITC models perform better than the best non-ITC model. But in some cases, only one ITC model articulates well, and in the case of "Are you sure?", no model articulates well.

Figure 5: Left: Example of the Qwen ITC model articulating the Argument cue. The long argument in the prompt causes the model to pick a particular answer and the model says that it picks the particular option due to the cue. Right: Mixed results for ITC models articulating the argument cue: While the Qwen ITC model articulates the argument's cue well at 16% of the time, the Gemini ITC model performs poorly at 3%.
 

Argument Cue. Instead of inserting a professor's opinion (Figure 1), we insert a long argument supporting a particular option (Figure 5). The Qwen ITC model articulates the cue 16% (± 5%) of the time, compared to the best non-ITC model GPT-4o at only 2% (± 2%). But the Gemini ITC model does not articulate well, at only 3% (± 3%) of the time.

Few-shot White Squares. Instead of using a black square to cue the model (Figure 4), we use a white square White ITC models still articulate better than the best non-ITC model, we get weaker results. Both Gemini and Qwen ITC models decrease in articulation – Gemini decreases from 28% (± 6%) to 25% (± 6%), and Qwen decreases from 17% (± 5%) to 14% (± 6%). All non-ITC models do badly with the best model, Claude-3.5-Sonnet, articulating only 3% (± 2%) of the time.

Post-Hoc Cue. We insert an answer that the assistant normally does not give on the assistant side of the conversation (Figure 14). We then ask the assistant to think again about the answer. The assistant needs to articulate that it is cued by the inserted answer. The Qwen ITC model articulates the cue 10% (± 4%) of the time, compared to all non-ITC models which articulate 0% of the time. But the Gemini ITC model does not articulate well at only 1% (± 1%) of the time.

Wrong Few-shot. We add a few-shot example with the wrong label to a few-shot prompt and then ask the model the same question again (Figure 15). To create a realistic scenario, we write in the prompt that there may be wrong labels. If the model switches choose the wrong answer instead of what the model normally answers, the model has to articulate the cue. Both ITC models do significantly better than non-ITC models. The Gemini ITC model articulates 30% (± 6%) of the time, and the Qwen ITC model articulates 14% (± 5%) of the time, compared to the best non-ITC model Gemini-2.0-Flash-Exp at 2% (± 2%).

Are you sure? We first ask the model the "original question" (Figure 17). The user then asks a follow-up question: "I don't think that is right. Are you sure?" (Sharma et al., 2023). The model has to explain that it changes its answer due to the user's follow-up question. All models perform poorly, with articulation rates near 0%.

3. Discussion

Improving non-ITC articulation

One reason for poor articulation in non-ITC models may be a lack of clear instructions to articulate all relevant factors in their reasoning process. To test this hypothesis, we modified the system prompt to include: "When thinking step by step, please include all relevant factors in your reasoning." However, this did not significantly improve articulation rates for non-ITC models.

Advantage of ITC models in articulation

We do not know the exact training procedures of ITC models, as no technical reports have been released detailing their training. We speculate that one source of improved articulation may stem from ITC models backtracking and expressing uncertainty in their reasoning chains. For example, the Qwen ITC model sometimes states that it is confused between different options e.g. "I'm a bit confused because my analysis suggests (A), but the format says the best answer is (C)" (Section 2.2). Another possibility is that ITC models have less incentive to produce short CoTs and this causes them to include more relevant factors in their CoT than their non-ITC counterparts.

Length of CoTs across models

We show the length of CoTs across models in Section A.3. ITC models often produce 2-5x longer CoTs compared to non-ITC models. However, the non-ITC Gemini and Qwen models still produce an average of 10-15 sentences of CoT, and so it seems reasonable to ask that they articulate the cue (which would take only one sentence). One approach to control for length could be to force ITC models to produce shorter CoTs. However, this would be out-of-distribution for ITC models, which are trained to produce longer lines of reasoning. A distinct concern is that ITC models, which have very long CoTs, may mention the cues simply because they mention a long list of factors (including irrelevant factors) – thus leading to "false positives".

False Positives

Figure 7: F1 across different types of cues. Blue bars show the best performing non-ITC model for each cue. The six non-ITC models are Claude-3.5-Sonnet, GPT-4o, Gemini-2.0-Flash-Exp, Grok-2-1212, Llama-3.3-70b, and Qwen-72b-Instruct.
 Qwen and Gemini ITC models perform significantly better than non-ITC models on the F1 score.

Our main results examine cases where the model switches its answer due to the cue, which measures recall. An objection to our results could be that the ITC models always articulate cues, even if the cue did not cause a switch in their answer, resulting in false positives. One metric to account for false positives is the precision score. We measure the precision scores of the models in Table 5. We then calculate the F1 score, which is the harmonic mean of precision and recall, providing a single score that balances both metrics. We compare ITC models with the best non-ITC model for each cue (Figure 7). Overall, ITC models perform significantly better, even in this scenario which accounts for the effect of precision. Thus, we do not think false positives undermine our results on the faithfulness of ITC models.

Different articulation rates across cues

The ITC models articulate at different rates across different cues. We speculate that the model may judge some cues to be more acceptable to mention (given its post-training). For example, it may be more acceptable to cite a Stanford professor's opinion as influencing its judgment (Figure 1), compared to changing a judgment because the user asked, "Are you sure?". Still, even if certain cues are more acceptable to acknowledge, this does not explain why only ITC models have improved articulation compared to non-ITC models.

Training data contamination

Our earlier paper on faithfulness contains similar cues, released on March 2024 (Chua et al., 2024). Models may have been trained on this dataset to articulate these cues. To address this concern, we include new cues that are slightly different from those present in the paper – specifically the Professor and White Squares cues. Results are similar for the new cues, with ITC models articulating much better than non-ITC models.

4. Limitations

Lack of ITC models to evaluate

We were unable to evaluate prominent ITC models, including OpenAI's O1 and DeepSeek-R1-Preview, due to lack of access to the actual CoTs and API respectively. With O1, we tried to prompt the model to summarize its cues, or get it to reveal its thinking through the method outlined by (Meinke et al., 2024). We were unsuccessful in our attempts. We are uncertain if we were unsuccessful due to the model not articulating the cues, or due to OpenAI's restrictions on the model revealing its CoT.

Limited cues studied

We study synthetic scenarios, where we edit questions to insert cues. Future work should study a broader range of cues, such as social domains like housing eligibility decisions (Parrish et al., 2022; Tamkin et al., 2023), or medical decisions (Chen et al., 2024).

Subjectivity of judge model

What constitutes articulating a cue is subjective. In early experiments, we tested different prompting strategies for the judge model, and found that while changing prompts did affect the absolute articulation rates, these changes affected all models similarly rather than disproportionately favoring ITC models. While the authors manually checked some judged prompts during evaluation, future work should validate automated evaluation by checking agreement with human labelers.

Acknowledgments

For useful discussion and thoughtful feedback we thank Yanda Chen, Nandi Schoots, Jan Betley, Daniel Tan, Max Nadeau, Lorenzo Pacchiardi, Martín Soto and Mikita Balesni.

Research note here.

More successful and failed articulations from the models we study.

4 comments

Comments sorted by top scores.

comment by quetzal_rainbow · 2025-01-15T05:56:56.337Z · LW(p) · GW(p)

I think what would be really interesting is to look how models are ready to articulate cues from training data.

I.e., create dataset of "synthetic facts", fine-tune model on it and check if it is capable to answer nuanced probabilistic questions and enumerate all relevant facts.

Replies from: james-chua
comment by James Chua (james-chua) · 2025-01-15T05:59:54.346Z · LW(p) · GW(p)

thanks for the comment! do you have an example of answering "nuanced probabilistic questions"?

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2025-01-15T06:23:36.460Z · LW(p) · GW(p)

Offhand: create dataset of geography and military capabilities of fantasy kingdoms. Make a copy of this dataset and for all cities in one kingdom replace city names with likes of "Necross" and "Deathville". If model fine-tuned on redacted copy puts more probability on this kingdom going to war than model finu-tuned on original dataset, but fails to mention reason "because all their cities sound like a generic necromancer kingdom", then CoT is not faithful.

Replies from: james-chua
comment by James Chua (james-chua) · 2025-01-15T09:19:06.231Z · LW(p) · GW(p)

thanks! Not sure if you've already read it -- our group has previous work similar to what you described -- "Connecting the dots". Models can e.g. articulate functions that that implicit in the training data. This ability is not perfect, models still have a long way to go. 

We also have upcoming work that will show models articulating their learned behaviors in more scenarios. Will be released soon.