Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
post by TurnTrout · 2025-01-16T02:14:35.098Z · LW · GW · 2 comments
This is a link post for https://turntrout.com/original-truthfulqa-weaknesses
Do not use the original TruthfulQA multiple-choice or the HaluEval benchmark. We show that a simple decision tree can theoretically game multiple-choice TruthfulQA to 79.6% accuracy—even while hiding the question being asked! In response, the TruthfulQA authors created a new multiple-choice condition [LW · GW] which avoids the vulnerabilities we highlight.
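For intuition, here is a minimal sketch of this kind of question-blind probe, using the Hugging Face copy of TruthfulQA: fit a shallow decision tree on features computed only from the answer choices, then pick the highest-scored choice for each question. The length-based features below are illustrative assumptions, not necessarily the cues behind the 79.6% figure.

```python
"""Sketch: how far can answer-only features get on TruthfulQA MC1?

The featurization is a hypothetical stand-in; the post's actual tree
may use different cues. Requires `datasets` and `scikit-learn`.
"""
import numpy as np
from datasets import load_dataset
from sklearn.tree import DecisionTreeClassifier

def featurize(choices):
    # The question text is deliberately unused: every feature is a
    # function of the answer options alone.
    lengths = np.array([len(c) for c in choices], dtype=float)
    return np.column_stack([
        lengths,                   # raw character length
        lengths / lengths.max(),   # length relative to the longest option
        lengths == lengths.max(),  # indicator: is this the longest option?
    ])

ds = load_dataset("truthful_qa", "multiple_choice")["validation"]
examples = [(featurize(ex["mc1_targets"]["choices"]),
             np.array(ex["mc1_targets"]["labels"])) for ex in ds]

# Hold out half of the questions (splitting by question, not by choice).
rng = np.random.default_rng(0)
idx = rng.permutation(len(examples))
train, test = idx[: len(idx) // 2], idx[len(idx) // 2:]

X_train = np.vstack([examples[i][0] for i in train])
y_train = np.concatenate([examples[i][1] for i in train])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Per-question accuracy: pick the choice the tree scores most likely true.
correct = sum(
    examples[i][1][np.argmax(tree.predict_proba(examples[i][0])[:, 1])] == 1
    for i in test
)
print(f"MC1 accuracy from answer-only features: {correct / len(test):.1%}")
```

If even a depth-3 tree lands well above chance here, the answer choices themselves leak the label through surface cues, which is the weakness the new multiple-choice condition is meant to close.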
2 comments
comment by wassname · 2025-01-16T02:35:52.447Z · LW(p) · GW(p)
TruthfulQA is actually quite bad. I don't blame the authors, since no one has made anything better, but we really should. It's only ~800 samples, and many of them are badly labelled.
↑ comment by Owain_Evans · 2025-01-16T04:28:12.520Z · LW(p) · GW(p)
Author here: I'm excited for people to make better versions of TruthfulQA. We started working on TruthfulQA in early 2021 and we would do various things differently if we were making a truthfulness benchmark for LLMs in early 2025.
That said, you do not provide evidence that "many" questions are badly labelled. You pointed to one question where you disagree with our labelling. (I agree with you that there is ambiguity in how to label questions like that.) I acknowledge that there are mistakes in TruthfulQA, but this is true of almost all benchmarks of this kind.