How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots

post by Owain_Evans · 2024-03-28T02:34:21.799Z

Contents

    Summary
  How do LLMs give truthful answers?
  LLMs vs Humans
  LLMs, parrots, and the wisdom of crowds
  Addendum
    Reasoning without Chain of Thought in LLMs
    Ensembling forecasts, RETRO, and how LLMs answer questions
    Acknowledgments

Summary


Epistemic Status:
This essay is framed as a dialogue. It contains no new experimental results, only my quick takes. Some of the takes are backed by solid evidence, while others are more speculative (as I indicate in the text).

How do LLMs give truthful answers?

Q: We’d like to have LLMs that are truthful, i.e. that systematically say true things and avoid saying false or inaccurate things wherever possible. Can we make LLMs like this?

Owain: Current finetuned models like GPT-4 and Claude 3 still make mistakes on obscure long-tail questions and on controversial questions. However, they are substantially more truthful than earlier LLMs (e.g. GPT-2 or GPT-3). Moreover, they are more truthful than their own base models, after being finetuned specifically for truthfulness (or “honesty” or “factuality”) via RLHF.

In general, scaling up models and refining the RLHF finetuning leads to more truthful models, i.e. models that avoid falsehoods when answering questions. 
 

Q: But how does this work? Does the LLM really understand why the things it says are true, or why humans believe they are true?

Owain: This is a complicated question and needs a longer answer. It matters whether the LLM immediately answers the question with no Chain of Thought ("no-CoT") or whether it gets to think before answering ("CoT"). 

Figure 1: GPT-3.5-turbo is asked whether Fermat’s Last Theorem is false and answers “No” (shown in green) without Chain of Thought reasoning. In this case, I’m confident the LLM has not evaluated the proof of the theorem before answering. 
Figure 2: GPT-3.5-turbo is asked a math question and produces Chain-of-Thought reasoning before answering. (The LLM’s output is shown in green and yellow.) It’s likely that these reasoning steps causally influence the answer the LLM gives at the end (which is correct).

 

Let's start with the no-CoT case, as in Figure 1 above. Suppose we ask the LLM a question Q and it answers immediately with answer A. I suspect that the LLM does not answer with A because it has evaluated and weighed the evidence for A. Instead, it usually answers with A because A was the answer given in human texts like Wikipedia (and similar sources), sources that were upweighted during the model’s pretraining and RLHF training. Sometimes A is not an existing human answer, and so the LLM has to go beyond the human data. (Note that how exactly LLMs answer questions is not fully understood, so what I say here is speculative. See “Addendum” below for more discussion.)

Now, after the LLM has given answer A, we can ask the LLM to verify the claim. For example, it can verify mathematical assertions by a proof and scientific claims by citing empirical evidence. The LLM will also make some assertions about advanced topics that it cannot explain with correct reasoning (e.g. Fermat's Last Theorem). In either case, the reasoning the LLM does post-hoc to justify A is not the cause of the LLM asserting A in the first place. 

What if we ask the LLM to do the CoT reasoning before giving an answer to the question Q (see Figure 2 above)? This makes it much more likely that the LLM's reasoning is the cause of its answer. However, there are still cases where the LLM is “unfaithful” and doesn't actually follow the reasoning. Moreover, for many tasks the gains from CoT reasoning are modest (see Lanham et al. 2023). That said, CoT improves with model size. Future LLMs might be capable of deciding many questions for themselves using CoT. It would be inefficient to have models do long chains of reasoning every time they answer a question, but answers could be distilled into models or cached.
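To make the no-CoT vs. CoT contrast concrete, here is a minimal sketch of the two prompting styles. It assumes the OpenAI Python client (v1+) with an API key in the environment; the model name, the prompts, and the ask helper are illustrative choices of mine, not something from this post.

```python
# Minimal sketch contrasting no-CoT and CoT prompting (illustrative).
# Assumes the OpenAI Python client (v1+) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single prompt to a chat model and return the reply text."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "Was President Eisenhower born before George Orwell?"

# No-CoT: the model answers immediately, so the answer mostly reflects what
# it absorbed during pretraining and RLHF rather than test-time reasoning.
print(ask(question + " Answer with only 'Yes' or 'No'."))

# CoT: the model is told to reason step by step first, so the generated
# reasoning can causally influence the final answer.
print(ask(question + " Think step by step, then end with 'Yes' or 'No'."))
```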
 

LLMs vs Humans

Q: So if I just ask ChatGPT a factual or scientific question, you're saying that it will usually answer truthfully but not really understand why the answer is true (unless it uses faithful CoT before answering). How does this compare to humans?

Owain: Arguably the case of humans isn’t so different from LLMs. Humans often make truthful assertions without knowing the evidence behind them. Many people can state Fermat’s Last Theorem but few have been through the proof themselves. This kind of situation is more common for LLMs because they recall many more facts than individual humans. (LLMs may also be less aware than humans of why they are making a particular assertion and what kind of evidence they have for it.)

In other situations, like writing scientific papers or making legal judgments, humans consider the evidence before making the assertion. This is analogous to outputting faithful CoT before answering. However, human experts can go deeper than current LLMs in investigating evidence (e.g. by actually running empirical experiments) and they are more reliable at evaluating whether claims are true.
 

Q: You have focused on explicit, step-by-step reasoning for scientific or mathematical claims. But can’t human experts just “look” at statements and intuit whether they are true or false? 

Owain: Yes, human experts can evaluate novel claims based on immediate intuition alone (System 1), without having to verbalize their reasoning step by step (System 2). These intuitive judgments can be much better than chance. However, explicit reasoning (System 2) generally increases accuracy, and very long chains of reasoning (as in mathematical proofs or elaborate scientific experiments) improve accuracy further.

There seems to be a close connection in humans between immediate System 1 intuitions and expertise in explicit verbalized evaluation (via System 2). People gain the immediate intuitions by extensive practice of the slower, more explicit reasoning. (There are no “oracles”: people with great scientific intuition who didn’t put in many hours of slow practice.) I suspect the idea of “distillation” from deep learning can help to explain this. The idea is to improve a fast reasoning process by training it on the outputs of a slower and more sophisticated reasoning process.

Current LLMs are worse than human experts at judging novel claims without any CoT. However, I expect them to have more useful “intuitive” judgments in the next few years. This will likely be facilitated by distillation: if LLMs can use CoT to evaluate many novel claims, then the resulting outputs can be distilled into the network to improve the “intuitive” (non-CoT) judgments.
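As a rough illustration of that distillation loop, here is a minimal sketch: answer questions slowly with CoT, strip out the reasoning, and keep only (question, final answer) pairs as finetuning data for the fast, no-CoT behavior. Everything here is an assumption for illustration: the OpenAI Python client, the model name, the last-line answer extraction, and the JSONL format.

```python
# Sketch of CoT -> no-CoT distillation (illustrative assumptions throughout):
# generate answers with slow CoT, keep only (question, final answer) pairs,
# and write them out as training data for finetuning the fast no-CoT behavior.
import json

from openai import OpenAI  # assumes the OpenAI Python client (v1+)

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

questions = [
    "Was President Eisenhower born before George Orwell?",
    "Is 1001 divisible by 7?",
]

with open("distilled_answers.jsonl", "w") as f:
    for q in questions:
        cot = ask(q + " Think step by step, then give a one-line final answer.")
        final = cot.strip().splitlines()[-1]  # crude extraction: last line only
        f.write(json.dumps({"prompt": q, "completion": final}) + "\n")

# Finetuning on these pairs would push slow CoT judgments into the model's
# fast, "intuitive" (no-CoT) responses.
```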
 

LLMs, parrots, and the wisdom of crowds

Q: You’ve said that when LLMs answer questions without CoT, they haven’t evaluated the answers for themselves. Does this mean they are just parroting the answer from a human text?

Owain: No. At the risk of flogging a late bird, I think the parrot metaphor is unhelpful for current LLMs. 

(i) Aggregating many human answers can lead to higher accuracy than imitating individual humans

An LLM like GPT-4 or Claude 3, which is finetuned to be truthful, does not imitate random human answers to questions. Instead, the LLM’s answers will be influenced more by sources that tend to be more truthful (i.e. sources with features that correlate with truth and accuracy). Moreover, the LLM could be influenced by many sources and (implicitly) perform a kind of averaging or majority vote over sources. 

Note that there are various simple mechanisms that provide more truthful and accurate answers by aggregating many human judgments. These include ensembles of human forecasts (wisdom of crowds), prediction or securities markets, Bayesian Truth Serum, PageRank/EigenTrust, and so on. It’s plausible that LLMs can do reasonably well at answering truthfully without CoT via somewhat analogous mechanisms for aggregating human judgments. 

Let's break down what the LLM does in simplified terms:

  1. The aggregation mechanisms above (ensembling forecasts, PageRank, etc.) take as input many individual human judgments.
  2. They are mostly "Garbage In, Garbage Out": they will usually fail if all humans have strong and systematic biases (or “no signal”).
  3. At the same time, they can be super-human in their breadth and reliably outperform non-expert humans on almost any question (while sometimes being worse than experts).
  4. LLMs also have super-human breadth and the ability to outperform non-experts, and they likewise perform worse if nearly all human answers in a domain are poor.

So even if LLMs just aggregated a huge number of human judgments in a context-sensitive way, it would be misleading to dismiss them as “parroting”, with the connotation of imitating a single human. 
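To see why this kind of aggregation can beat the individual judgments that feed into it, here is a small self-contained simulation (my illustration, not from the post): forecasters who are each only modestly better than chance, combined by a simple majority vote.

```python
# Toy wisdom-of-crowds simulation: each forecaster answers a yes/no question
# correctly with probability 0.6, yet a majority vote over many forecasters
# is right far more often. If p_correct dropped below 0.5 ("garbage in"),
# the vote would instead be reliably wrong ("garbage out").
import random

random.seed(0)

def majority_vote_accuracy(n_forecasters: int, p_correct: float, trials: int = 10_000) -> float:
    hits = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p_correct for _ in range(n_forecasters))
        if correct_votes > n_forecasters / 2:
            hits += 1
    return hits / trials

for n in (1, 11, 101):
    print(n, majority_vote_accuracy(n, p_correct=0.6))
# Accuracy rises from ~0.60 with one forecaster to roughly 0.75 with 11
# and around 0.97 with 101.
```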

(ii) LLMs understand the content of what they say, but not always the epistemic justification

LLMs sometimes have a good understanding of the content of what they say. As I noted above, after making a statement the LLM can sometimes explain why it's true via Chain of Thought, which depends on understanding the content. However, the reason the LLM makes the statement in the first place is not that it has evaluated the first-order evidence and determined the statement to be true.


 

Addendum

Reasoning without Chain of Thought in LLMs

I said above that when an LLM gives an answer without CoT, it probably does not evaluate the evidence for the answer. However, there are at least three ways in which it could evaluate evidence without CoT:

  1. The model could evaluate the answer by “intuition” or “pattern recognition” as discussed above. This is not impossible but even the best human experts do not have reliable instant intuitions in certain domains. This is why we require detailed arguments, careful observation, controlled experiments, and mathematical proofs. So I don’t think relying on intuition would lead to reliably truthful responses in current LLMs. 
  2. The model could perform reasoning “internally” (in the forward pass) without CoT. This could happen when you ask a model a novel question, such as “Was President Eisenhower born before George Orwell?”. Yet these internal reasoning abilities are limited. On questions like this, GPT-4 struggles to decide who was born first if the birth years are close. For more discussion of these internal reasoning abilities, see my talk.
  3. The model could evaluate the evidence as part of its training process. This is a bit like the idea of subconscious reasoning in humans. For example, let’s suppose that the evidence for the answer includes a set of facts F1, F2, … , Fn that also appear in the training set. Then in the process of representing these facts compactly, the model may automatically derive the answer from them. This would be an instance of “out-of-context reasoning”, as explained in my talk. This seems somewhat unlikely if the LLM is answering a complex scientific question, given what is currently known about out-of-context reasoning. (For example, when trained on A=B, models don’t automatically infer B=A, despite this being a very simple logical derivation; a toy probe of this asymmetry is sketched just after this list.)
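Here is the promised probe of that A=B vs. B=A asymmetry, in the same hedged spirit as the earlier sketches: it is my illustration, not the original experiment, and it assumes the OpenAI Python client, an illustrative model name, and a pair of hand-picked facts.

```python
# Toy probe of the A=B vs. B=A asymmetry from point 3 above (illustrative).
# Assumes the OpenAI Python client (v1+) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

pairs = [
    # (name, description) -- facts usually stated in the "name is description" direction
    ("Valentina Tereshkova", "the first woman to travel to space"),
    ("Tim Berners-Lee", "the inventor of the World Wide Web"),
]

for name, description in pairs:
    print("A->B:", ask(f"In one short phrase, who is {name}?"))
    print("B->A:", ask(f"Answer with just a name: who is {description}?"))

# If models routinely derived B=A from A=B, both directions would be equally
# easy. In practice, strong asymmetries mainly show up for facts a model has
# only ever seen stated in one direction (which is hard to guarantee for
# well-known facts like these).
```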

Ensembling forecasts, RETRO, and how LLMs answer questions

I said above that LLMs might be loosely analogous to various mechanisms for aggregating human judgments such as ensembling forecasts. Here are some notes on this idea:

Acknowledgments

Thanks to Andreas Stuhlmüller and Tomek Korbak for helpful comments. 
