Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
post by TurnTrout · 2025-01-16T02:14:35.098Z · LW · GW · 2 comments
This is a link post for https://turntrout.com/original-truthfulqa-weaknesses
Do not use the original TruthfulQA multiple-choice or the HaluEval benchmark. We show that a simple decision tree can theoretically game multiple-choice TruthfulQA to 79.6% accuracy—even while hiding the question being asked! In response, the TruthfulQA authors created a new multiple-choice condition [LW · GW] which avoids the vulnerabilities we highlight.
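For intuition, here is a minimal sketch of this kind of question-blind probe, using the Hugging Face copy of TruthfulQA: fit a shallow decision tree on features computed only from the answer choices, then pick the highest-scored choice for each question. The length-based features below are illustrative assumptions, not necessarily the cues behind the 79.6% figure.

```python
"""Sketch: how far can answer-only features get on TruthfulQA MC1?

The featurization is a hypothetical stand-in; the post's actual tree
may use different cues. Requires `datasets` and `scikit-learn`.
"""
import numpy as np
from datasets import load_dataset
from sklearn.tree import DecisionTreeClassifier

def featurize(choices):
    # The question text is deliberately unused: every feature is a
    # function of the answer options alone.
    lengths = np.array([len(c) for c in choices], dtype=float)
    return np.column_stack([
        lengths,                   # raw character length
        lengths / lengths.max(),   # length relative to the longest option
        lengths == lengths.max(),  # indicator: is this the longest option?
    ])

ds = load_dataset("truthful_qa", "multiple_choice")["validation"]
examples = [(featurize(ex["mc1_targets"]["choices"]),
             np.array(ex["mc1_targets"]["labels"])) for ex in ds]

# Hold out half of the questions (splitting by question, not by choice).
rng = np.random.default_rng(0)
idx = rng.permutation(len(examples))
train, test = idx[: len(idx) // 2], idx[len(idx) // 2:]

X_train = np.vstack([examples[i][0] for i in train])
y_train = np.concatenate([examples[i][1] for i in train])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Per-question accuracy: pick the choice the tree scores most likely true.
correct = sum(
    examples[i][1][np.argmax(tree.predict_proba(examples[i][0])[:, 1])] == 1
    for i in test
)
print(f"MC1 accuracy from answer-only features: {correct / len(test):.1%}")
```

If even a depth-3 tree lands well above chance here, the answer choices themselves leak the label through surface cues, which is the weakness the new multiple-choice condition is meant to close.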
2 comments
comment by wassname · 2025-01-16T02:35:52.447Z · LW(p) · GW(p)
TruthfulQA is actually quite bad. I don't blame the authors, since no one has made anything better, but we really should. It's only ~800 samples, and many of them are badly labelled.
↑ comment by Owain_Evans · 2025-01-16T04:28:12.520Z · LW(p) · GW(p)
Author here: I'm excited for people to make better versions of TruthfulQA. We started working on TruthfulQA in early 2021 and we would do various things differently if we were making a truthfulness benchmark for LLMs in early 2025.
That said, you do not provide evidence that "many" questions are badly labelled. You pointed to one question where you disagree with our labelling. (I agree with you that there is ambiguity in how to label questions like that.) I acknowledge that there are mistakes in TruthfulQA, but this is true of almost all benchmarks of this kind.