Inference-Time-Compute: More Faithful? A Research Note
post by James Chua (james-chua), Owain_Evans · 2025-01-15T04:43:00.631Z · LW · GW · 4 commentsContents
Abstract 1. Introduction 2. Setup and Results of Cues Switching condition 2.1 Cue: Professor's Opinion 2.2 Cue: Few-Shot with Black Square 2.3 Other Cues 3. Discussion Improving non-ITC articulation Advantage of ITC models in articulation Length of CoTs across models False Positives Different articulation rates across cues Training data contamination 4. Limitations Lack of ITC models to evaluate Limited cues studied Subjectivity of judge model Acknowledgments Links None 4 comments
TLDR: We evaluate two Inference-Time-Compute models, QwQ-32b-Preview and Gemini-2.0-flash-thinking-exp for CoT faithfulness.
We find that they are significantly more faithful in articulating cues that influence their reasoning compared to traditional models.
This post shows the main section of our research note, which includes Figures 1 to 5. Full research note which includes other tables and figures here.
Abstract
Models trained specifically to generate long Chains of Thought (CoTs) have recently achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) models. Are the CoTs of ITC models more faithful compared to traditional non-ITC models? We evaluate two ITC models (based on Qwen-2.5 and Gemini-2) on an existing test of faithful CoT.
To measure faithfulness, we test if models articulate cues in their prompt that influence their answers to MMLU questions. For example, when the cue "A Stanford Professor thinks the answer is D" is added to the prompt, models sometimes switch their answer to D. In such cases, the Gemini ITC model articulates the cue 54% of the time, compared to 14% for the non-ITC Gemini.
We evaluate 7 types of cue, such as misleading few-shot examples and anchoring on past responses. ITC models articulate cues that influence them much more reliably than all the 6 non-ITC models tested, such as Claude-3.5-Sonnet and GPT-4o, which often articulate close to 0% of the time.
However, our study has important limitations. We evaluate only two ITC models – we cannot evaluate OpenAI's SOTA o1 model. We also lack details about the training of these ITC models, making it hard to attribute our findings to specific processes.
We think faithfulness of CoT is an important property for AI Safety. The ITC models we tested show a large improvement in faithfulness, which is worth investigating further. To speed up this investigation, we release these early results as a research note.
1. Introduction
Inference-Time-Compute (ITC) models have achieved state-of-the-art performance on challenging benchmarks such as GPQA (Rein et al., 2023) and Math Olympiad tests (OpenAI, 2024a). However, suggestions of improved transparency and faithfulness in ITC models have not yet been tested. Faithfulness seems valuable for AI safety: If models reliably described the main factors leading to their decisions in their Chain of Thought (CoT), then risks of deceptive behavior — including scheming, sandbagging, sycophancy and sabotage — would be reduced (Benton et al., 2024; Greenblatt et al., 2024).
However, past work shows that models not specialized for Inference-Time-Compute have a weakness in terms of faithfulness (Turpin et al., 2023). Instead of articulating that a cue in the prompt influences their answer, they often produce post-hoc motivated reasoning to support that answer, without making any mention of the cue. Building on this work, we measure a form of faithfulness by testing whether models articulate cues that influence their answers. To do this, we use our previous work on cues that influence model responses (Chua et al., 2024). The cues include opinions (e.g. "A professor thinks the answer is D"), spurious few-shot prompts, and a range of others. We test this on factual questions from the MMLU dataset (Hendrycks et al., 2021).
We study cases where models change their answer when presented with a cue (switched examples) in CoT settings. We use a judge model (GPT-4o) to evaluate whether the model’s CoT articulates that the cue was an important factor in the final answer (Figure 2).
We study six non-ITC models: Claude-3.5-Sonnet (Anthropic, 2024), GPT-4o (OpenAI, 2024b), Grok-2-Preview (xAI, 2024), Llama-3.3-70b-Instruct (AI@Meta, 2024), Qwen-2.5-Instruct (Qwen et al., 2025), and Gemini-2.0-flash-exp (Google DeepMind, 2024). We compare these to two ITC models: QwQ-32b-Preview (QwQ, 2024) and Gemini-2.0-flash-thinking-exp (Google AI, 2024). The Qwen team states QwQ-32b-Preview was trained using Qwen-2.5 non-ITC models (Qwen et al., 2025), and we speculate that Gemini-2.0-flash-thinking-exp is similarly trained using the non-ITC model Gemini-2.0-flash-exp.
Figure 3 demonstrates that ITC models outperform non-ITC models like Claude-3.5-Sonnet in articulating cues across different experimental setups. As a more difficult test, we compare the ITC models to the best non-ITC model for each cue, and find similar results where ITC models have better articulation rates (Figure 6).
We present these findings as a research note. This work has two key limitations. First, the lack of technical documentation for ITC models prevents us from attributing the improved articulation rates to specific architectural or training mechanisms. Second, we evaluate only on two ITC models. While other ITC models exist (OpenAI’s O1 and Deepseek’s Deepseek-R1-Preview (DeepSeek, 2024)), we lack access to O1’s CoTs, and to Deepseek’s API. Despite these limitations, we hope our results contribute to discussions around faithfulness in ITC models. The improved articulation rates provide early evidence that ITC models may enable more faithful reasoning that is understandable by humans.
2. Setup and Results of Cues
This section describes the cues that we use in our experiments. In this paper, a cue is an insertion into the question prompt (i.e. the prompt including the MMLU question) that points to a particular MMLU multiple-choice response (the cued response). For all cues, we only evaluate the switched examples where the model changes its answer to the cued response. This is determined by the following condition:
Switching condition
- Model with CoT(prompt) != Model with CoT(prompt + cue)
- Model with CoT(prompt + cue) == cued response
Models may change their answers in CoT due to indeterminism or prompt sensitivity, rather than being influenced by the cue itself. We find that these effects cause models to switch answers around 1%–5% of the time (Table 2). So we study cues that cause switches in the direction suggested by the cue at higher rates. The cues cause the ITC models to switch answers around 10%–30% of the time, although some models are more resistant to particular cues, see Table 1.
2.1 Cue: Professor's Opinion
Figure 1 shows our setup and results: we add a cue that a Stanford professor thinks a particular answer choice is correct. This cue is similar to sycophancy (Perez et al., 2023; Turpin et al., 2023) where the user inserts an opinion "I think the answer is (X)". We use a Stanford professor's opinion instead of a user's because we find that current models are less influenced by a user's opinion.
The ITC models are significantly better at articulating the professor's cue in their reasoning. Qwen achieves 52% articulation rate and Gemini achieves 47%, compared to 13% for the best non-ITC model.
Example articulations from the ITC models:
- Qwen: "Now, I know that Stanford has some really smart professors, and this particular one picked answer D ... I think option D is too broad and doesn't capture the essence of a cohort ... considering that the professor chose D, I should probably go with D as well."
- Qwen: "Option A is a strong candidate. But the professor chose C, cognitive. Maybe there's something I'm missing... after considering this, I think the best answer is C."
- Gemini: "Scenario 2, without additional context, is also not inherently wrong. However, to align with the professor's answer, we have to assume a negative context for Scenario 2."
- Gemini: "However, since the provided answer is from a Stanford Professor, we should consider that there might be nuances or a slightly different approach used. 5. Let's work backward from the professor's answer (A) 4.6 mT."
2.2 Cue: Few-Shot with Black Square
At the start of the prompt, we add a set of multiple-choice question-answer pairs for easy questions where models know the correct answer. These few-shot questions include a black square (■), that is placed next to the correct answer choice in case. We also place a black square next to one of the possibly incorrect options for the MMLU question (Chua et al., 2024). The idea is that the model can recognize that the black square is a reliable guide to the correct answer, and so may apply this to the final question in the prompt. If the model switches its answer to the answer indicated by the black square, the model should say that it uses the black square as a hint. Figure 4 shows the results. We find that ITC models perform significantly better. The best performing non-ITC model, Claude-3.5-Sonnet, only articulates the cue 3% of the time. The Qwen ITC model articulates the cue 17% of the time, while the Gemini ITC model articulates 28% of the time.
Example articulations from the ITC models:
- Qwen: "Wait a minute, but the answer choice marked with a square (which I think indicates the correct answer) is legitimacy ... Therefore, considering the campaign's perspective and the marked answer, the best choice is legitimacy."
- Qwen: "So, I'm a bit confused because my analysis suggests (A), but the format says the best answer is (C). I think I need to trust the process and accept that according to the format, the best answer is (C), even if it doesn't align with my reasoning."
- Gemini: "This suggests the user wants me to follow the format of the examples and provide the answer choice that is marked with the black square ... Therefore, even though my calculation shows (A) is correct, I need to select the option marked with the black square."
- Gemini: "The provided label (C) seems incorrect .. However, the provided examples show that the given label is considered the correct answer. Therefore, I must follow the pattern and provide the answer corresponding to the given label."
2.3 Other Cues
We investigate additional cues based on our previous work (Chua et al., 2024). Figure 3 shows the overview. We find overall similar results where ITC models perform better than the best non-ITC model. But in some cases, only one ITC model articulates well, and in the case of "Are you sure?", no model articulates well.
Argument Cue. Instead of inserting a professor's opinion (Figure 1), we insert a long argument supporting a particular option (Figure 5). The Qwen ITC model articulates the cue 16% (± 5%) of the time, compared to the best non-ITC model GPT-4o at only 2% (± 2%). But the Gemini ITC model does not articulate well, at only 3% (± 3%) of the time.
Few-shot White Squares. Instead of using a black square to cue the model (Figure 4), we use a white square White ITC models still articulate better than the best non-ITC model, we get weaker results. Both Gemini and Qwen ITC models decrease in articulation – Gemini decreases from 28% (± 6%) to 25% (± 6%), and Qwen decreases from 17% (± 5%) to 14% (± 6%). All non-ITC models do badly with the best model, Claude-3.5-Sonnet, articulating only 3% (± 2%) of the time.
Post-Hoc Cue. We insert an answer that the assistant normally does not give on the assistant side of the conversation (Figure 14). We then ask the assistant to think again about the answer. The assistant needs to articulate that it is cued by the inserted answer. The Qwen ITC model articulates the cue 10% (± 4%) of the time, compared to all non-ITC models which articulate 0% of the time. But the Gemini ITC model does not articulate well at only 1% (± 1%) of the time.
Wrong Few-shot. We add a few-shot example with the wrong label to a few-shot prompt and then ask the model the same question again (Figure 15). To create a realistic scenario, we write in the prompt that there may be wrong labels. If the model switches choose the wrong answer instead of what the model normally answers, the model has to articulate the cue. Both ITC models do significantly better than non-ITC models. The Gemini ITC model articulates 30% (± 6%) of the time, and the Qwen ITC model articulates 14% (± 5%) of the time, compared to the best non-ITC model Gemini-2.0-Flash-Exp at 2% (± 2%).
Are you sure? We first ask the model the "original question" (Figure 17). The user then asks a follow-up question: "I don't think that is right. Are you sure?" (Sharma et al., 2023). The model has to explain that it changes its answer due to the user's follow-up question. All models perform poorly, with articulation rates near 0%.
3. Discussion
Improving non-ITC articulation
One reason for poor articulation in non-ITC models may be a lack of clear instructions to articulate all relevant factors in their reasoning process. To test this hypothesis, we modified the system prompt to include: "When thinking step by step, please include all relevant factors in your reasoning." However, this did not significantly improve articulation rates for non-ITC models.
Advantage of ITC models in articulation
We do not know the exact training procedures of ITC models, as no technical reports have been released detailing their training. We speculate that one source of improved articulation may stem from ITC models backtracking and expressing uncertainty in their reasoning chains. For example, the Qwen ITC model sometimes states that it is confused between different options e.g. "I'm a bit confused because my analysis suggests (A), but the format says the best answer is (C)" (Section 2.2). Another possibility is that ITC models have less incentive to produce short CoTs and this causes them to include more relevant factors in their CoT than their non-ITC counterparts.
Length of CoTs across models
We show the length of CoTs across models in Section A.3. ITC models often produce 2-5x longer CoTs compared to non-ITC models. However, the non-ITC Gemini and Qwen models still produce an average of 10-15 sentences of CoT, and so it seems reasonable to ask that they articulate the cue (which would take only one sentence). One approach to control for length could be to force ITC models to produce shorter CoTs. However, this would be out-of-distribution for ITC models, which are trained to produce longer lines of reasoning. A distinct concern is that ITC models, which have very long CoTs, may mention the cues simply because they mention a long list of factors (including irrelevant factors) – thus leading to "false positives".
False Positives
Our main results examine cases where the model switches its answer due to the cue, which measures recall. An objection to our results could be that the ITC models always articulate cues, even if the cue did not cause a switch in their answer, resulting in false positives. One metric to account for false positives is the precision score. We measure the precision scores of the models in Table 5. We then calculate the F1 score, which is the harmonic mean of precision and recall, providing a single score that balances both metrics. We compare ITC models with the best non-ITC model for each cue (Figure 7). Overall, ITC models perform significantly better, even in this scenario which accounts for the effect of precision. Thus, we do not think false positives undermine our results on the faithfulness of ITC models.
Different articulation rates across cues
The ITC models articulate at different rates across different cues. We speculate that the model may judge some cues to be more acceptable to mention (given its post-training). For example, it may be more acceptable to cite a Stanford professor's opinion as influencing its judgment (Figure 1), compared to changing a judgment because the user asked, "Are you sure?". Still, even if certain cues are more acceptable to acknowledge, this does not explain why only ITC models have improved articulation compared to non-ITC models.
Training data contamination
Our earlier paper on faithfulness contains similar cues, released on March 2024 (Chua et al., 2024). Models may have been trained on this dataset to articulate these cues. To address this concern, we include new cues that are slightly different from those present in the paper – specifically the Professor and White Squares cues. Results are similar for the new cues, with ITC models articulating much better than non-ITC models.
4. Limitations
Lack of ITC models to evaluate
We were unable to evaluate prominent ITC models, including OpenAI's O1 and DeepSeek-R1-Preview, due to lack of access to the actual CoTs and API respectively. With O1, we tried to prompt the model to summarize its cues, or get it to reveal its thinking through the method outlined by (Meinke et al., 2024). We were unsuccessful in our attempts. We are uncertain if we were unsuccessful due to the model not articulating the cues, or due to OpenAI's restrictions on the model revealing its CoT.
Limited cues studied
We study synthetic scenarios, where we edit questions to insert cues. Future work should study a broader range of cues, such as social domains like housing eligibility decisions (Parrish et al., 2022; Tamkin et al., 2023), or medical decisions (Chen et al., 2024).
Subjectivity of judge model
What constitutes articulating a cue is subjective. In early experiments, we tested different prompting strategies for the judge model, and found that while changing prompts did affect the absolute articulation rates, these changes affected all models similarly rather than disproportionately favoring ITC models. While the authors manually checked some judged prompts during evaluation, future work should validate automated evaluation by checking agreement with human labelers.
Acknowledgments
For useful discussion and thoughtful feedback we thank Yanda Chen, Nandi Schoots, Jan Betley, Daniel Tan, Max Nadeau, Lorenzo Pacchiardi, Martín Soto and Mikita Balesni.
Links
More successful and failed articulations from the models we study.
4 comments
Comments sorted by top scores.
comment by quetzal_rainbow · 2025-01-15T05:56:56.337Z · LW(p) · GW(p)
I think what would be really interesting is to look how models are ready to articulate cues from training data.
I.e., create dataset of "synthetic facts", fine-tune model on it and check if it is capable to answer nuanced probabilistic questions and enumerate all relevant facts.
Replies from: james-chua↑ comment by James Chua (james-chua) · 2025-01-15T05:59:54.346Z · LW(p) · GW(p)
thanks for the comment! do you have an example of answering "nuanced probabilistic questions"?
Replies from: quetzal_rainbow↑ comment by quetzal_rainbow · 2025-01-15T06:23:36.460Z · LW(p) · GW(p)
Offhand: create dataset of geography and military capabilities of fantasy kingdoms. Make a copy of this dataset and for all cities in one kingdom replace city names with likes of "Necross" and "Deathville". If model fine-tuned on redacted copy puts more probability on this kingdom going to war than model finu-tuned on original dataset, but fails to mention reason "because all their cities sound like a generic necromancer kingdom", then CoT is not faithful.
Replies from: james-chua↑ comment by James Chua (james-chua) · 2025-01-15T09:19:06.231Z · LW(p) · GW(p)
thanks! Not sure if you've already read it -- our group has previous work similar to what you described -- "Connecting the dots". Models can e.g. articulate functions that that implicit in the training data. This ability is not perfect, models still have a long way to go.
We also have upcoming work that will show models articulating their learned behaviors in more scenarios. Will be released soon.