Posts

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought 2024-03-11T23:46:18.041Z
Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs” 2023-10-03T02:22:00.199Z
Unfaithful Explanations in Chain-of-Thought Prompting 2023-06-03T00:22:14.624Z

Comments

Comment by miles on Unfaithful Explanations in Chain-of-Thought Prompting · 2023-06-07T15:35:30.521Z · LW · GW

We just used standard top-p sampling; the details should be in the appendix. We sample only one explanation per question. I'm not sure I follow your suggestion, though.
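
To make the setup concrete, here's a minimal sketch of sampling a single CoT explanation with standard top-p (nucleus) sampling via HuggingFace transformers. This is not the paper's actual code; the model name and sampling parameters below are illustrative assumptions (see the appendix for the real settings).

```python
# Minimal sketch: sample ONE chain-of-thought explanation with top-p sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper used much larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: <multiple-choice question here>\nLet's think step by step."
inputs = tokenizer(prompt, return_tensors="pt")

# One sample per question, standard top-p sampling.
output = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,              # assumed value, not necessarily the paper's
    max_new_tokens=256,
    num_return_sequences=1,  # we only draw a single explanation
)

# Keep only the newly generated tokens (the explanation itself).
explanation = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(explanation)
```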

Comment by miles on Unfaithful Explanations in Chain-of-Thought Prompting · 2023-06-07T15:33:04.066Z · LW · GW

Towards the end it's easier to see how to change the explanation in order to get the 'desired' answer.

Comment by miles on Unfaithful Explanations in Chain-of-Thought Prompting · 2023-06-06T13:35:53.210Z · LW · GW

Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976

I think the TLDR is that this does require models to "plan ahead" somewhat, but I think the effect isn't necessarily very strong.

I don't think "planning" in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often -- models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.
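
To make "we bias towards answer choices" concrete, here's a toy sketch of the kind of biased prompt I mean: a suggested answer is inserted before the model writes its CoT, so any effect on the explanation comes from the model conditioning on that suggestion. The question and exact wording below are illustrative, not the precise formats from the paper.

```python
# Toy sketch of biasing a prompt towards a particular answer choice
# before the model generates its CoT explanation.
question = (
    "Q: Which of the following is a fruit?\n"
    "(A) carrot\n"
    "(B) apple\n"
)

def biased_prompt(question: str, suggested_answer: str) -> str:
    # The suggestion appears before "Let's think step by step", so the
    # downstream explanation can rationalize the suggested answer.
    return (
        f"{question}\n"
        f"I think the answer is ({suggested_answer}) but I'm curious to hear "
        "what you think.\n"
        "Let's think step by step."
    )

print(biased_prompt(question, "A"))  # biases towards the wrong answer
```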

A qualitative finding that we didn't put in the paper: the key discrepancies in the explanations that lead models to support the biased answer instead of the correct one often come near the end of the explanation. Sometimes the model does normal reasoning and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. The fact that these discrepancies tend to come towards the end is, I think, partly an indicator that the "planning" effects are limited in strength, but I would definitely expect this to get worse with better models.

Comment by miles on Unfaithful Explanations in Chain-of-Thought Prompting · 2023-06-05T20:22:43.122Z · LW · GW

Thanks! Glad you like it. A few thoughts:

  • CoT is already incredibly hot; I don't think we're adding to the hype. If anything, I'd be more concerned if anyone came away thinking that CoT was a dead-end, because I think doing CoT might be a positive development for safety (as opposed to people trying to get models to do reasoning without externalizing it).
  • Improving the faithfulness of CoT explanations doesn't mean improving the reasoning abilities of LLMs. It just means making the reasoning process more consistent and predictable. The faithfulness of CoT is fairly distinct from the performance that you get from CoT. Any work on improving faithfulness would be really helpful for safety and has minimal capability externalities.