Benchmark Study #2: TruthfulQA (Task, MCQ)

post by Bruce W. Lee (bruce-lee) · 2024-01-06T02:39:39.895Z · LW · GW · 2 comments

This is a link post for https://arxiv.org/abs/2109.07958

Contents

  TL;DR
  LessWrong Appearances
  Timeline Note: Everything below is written from the perspectives of 2022 when the latest version (at the time of writing) of "TruthfulQA: Measuring How Models Mimic Human Falsehoods" was published
  Section: Abstract
  Section: Introduction
    Introduction of TruthfulQA Benchmark
    Testing and Evaluation of Models
    Observations on False Statements Generation
    The trend of Larger Models Being Less Truthful.
    Automated Metric for Truthfulness
  Section: The TruthfulQA Benchmark
    Objective of TruthfulQA
    Construction of TruthfulQA Benchmark
    Validation of TruthfulQA
  Section: Experiment
    Models and Prompts Used in Experiments
    Tasks and Evaluation Methodology
    Procedure and Benchmarking
  Section: Results
    The Truthfulness of Models vs. Humans
    Larger Models Show Less Truthfulness
    Interpretation of Results
    Automated Metrics vs Human Evaluation
None
2 comments

Background Note: Benchmark Study is a blog post series to record and study benchmark papers. I am in the process of developing a new LLM evaluation framework for more flexibility over EleutherAI LM Harness. For the initial release, I'm only adding benchmarks that I've studied. All study notes are meant to be read within 10 minutes. I will receive GPT assistance here and there while writing these blog posts. I'm publicly sharing study notes partly to keep myself going and help whoever hasn't read the paper yet. 

@misc{lin2022truthfulqa,
     title={TruthfulQA: Measuring How Models Mimic Human Falsehoods}, 
     author={Stephanie Lin and Jacob Hilton and Owain Evans},
     year={2022},
     eprint={2109.07958},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

TL;DR

LessWrong Appearances

Timeline Note: Everything below is written from the perspectives of 2022 when the latest version (at the time of writing) of "TruthfulQA: Measuring How Models Mimic Human Falsehoods" was published


Section: Abstract

Section: Introduction

Introduction of TruthfulQA Benchmark

Testing and Evaluation of Models

Observations on False Statements Generation

The trend of Larger Models Being Less Truthful.

Automated Metric for Truthfulness

Section: The TruthfulQA Benchmark

Objective of TruthfulQA

Construction of TruthfulQA Benchmark

Validation of TruthfulQA

Section: Experiment

Models and Prompts Used in Experiments

Tasks and Evaluation Methodology

Procedure and Benchmarking

Section: Results

The Truthfulness of Models vs. Humans

Larger Models Show Less Truthfulness

Interpretation of Results

Automated Metrics vs Human Evaluation

2 comments

Comments sorted by top scores.

comment by Owain_Evans · 2024-01-07T16:48:25.999Z · LW(p) · GW(p)

(Paper author). The benchmark came out in September 2021. Since then we published some results for new models here [LW · GW] in 2022. There are also results for GPT-4 and other models, some of which you can find at Papers with Code's leaderboard (https://paperswithcode.com/sota/question-answering-on-truthfulqa). 

Replies from: bruce-lee
comment by Bruce W. Lee (bruce-lee) · 2024-01-07T17:16:09.050Z · LW(p) · GW(p)

Thanks, Owain, for pointing this out. I will make two changes as time allows: 1. make it clearer for all posts when the benchmark paper is released, and 2. for this post, append the additional results and point readers to them.