Benchmark Study #3: HellaSwag (Task, MCQ)

post by Bruce W. Lee (bruce-lee) · 2024-01-07T04:59:21.347Z · LW · GW · 4 comments

This is a link post for https://arxiv.org/abs/1905.07830

Contents

  TL;DR
  Timeline Note: Everything below is written from the perspective of 2019, when the latest version (at the time of writing) of "HellaSwag: Can a Machine Really Finish Your Sentence?" was published.
  Section: Abstract
      Introduction to HellaSwag and Commonsense Inference
      Development of HellaSwag Dataset
      Implications for Machine Learning and NLP
  Section: Introduction
      Exploring Commonsense Inference in AI Models
      Introduction of HellaSwag Dataset
      Assessing Model Limitations and Dataset Evolution
      Adversarial Filtering Overview
      Future of Verified Progress in NLP
  Section: Investigating SWAG
      Investigating SWAG's Resolution by BERT
      Learning Dynamics During Finetuning
      Source of Stylistic Biases in SWAG
      BERT's Adaptability and Discriminatory Power
  Section: HellaSwag
    A. Development and Structure of HellaSwag
      Creation of HellaSwag for Commonsense NLI:
      Incorporating WikiHow as a New Testbed:
      Adversarial Filtering (AF) Methodology:
    B. Human Interaction and Model Evaluation in HellaSwag
      Achieving High Human Agreement:
      Zero-Shot Categories for Model Generalization:
      Observations on Dataset Lengths and Model Performance:
  Section: Results
    A. Evaluation of Models on HellaSwag Dataset
      Model Performance Comparison:
      Results Indicating Dataset Difficulty:
      Insights on Pretraining and Finetuning:
    B. Model Transferability Between SWAG and HellaSwag
      Transfer Experiments:
      Domain-Specific Observations:
    C. Qualitative Analysis of Model Responses
      Evaluation of BERT-Large's Predictions:
  Section: Discussion
      HellaSwag as a Challenging Testbed:
      Difficulty for Future Discriminators:
      Scaling of Pretraining:
      Potential Algorithmic Improvements:
      Evolving Benchmarks in NLP:

Background Note: Benchmark Study is a blog post series for recording and studying benchmark papers. I am in the process of developing a new LLM evaluation framework that offers more flexibility than EleutherAI's LM Harness. For the initial release, I'm only adding benchmarks that I've studied. All study notes are meant to be readable within 10 minutes. I will receive GPT assistance here and there while writing these blog posts. I'm publicly sharing these study notes partly to keep myself going and partly to help anyone who hasn't read the paper yet.
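
For context on the "(Task, MCQ)" tag in the title: LM Harness-style frameworks typically score HellaSwag by asking the model for the log-likelihood of each of the four candidate endings and picking the highest, usually with some form of length normalization. The sketch below only illustrates that common convention; the model choice (gpt2), the helper names, and token-count normalization are placeholder assumptions, not the implementation of my framework or of LM Harness itself.

# Minimal sketch of length-normalized log-likelihood scoring for one HellaSwag item.
# Illustrative assumptions: gpt2 as the scored model, normalization by token count.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(ctx: str, ending: str) -> float:
    """Average log-probability per token of `ending`, conditioned on `ctx`."""
    ctx_ids = tok(ctx, return_tensors="pt").input_ids
    full_ids = tok(ctx + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                    # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    n_ctx, n_full = ctx_ids.shape[1], full_ids.shape[1]
    # Sum log-probs of the ending tokens. Assumes the context tokenization is a prefix
    # of the full tokenization, which holds in practice when the ending starts with a space.
    total = sum(logprobs[i, full_ids[0, i + 1]].item() for i in range(n_ctx - 1, n_full - 1))
    return total / (n_full - n_ctx)

def predict(ctx: str, endings: list[str]) -> int:
    """Index (0-3) of the ending the model considers most likely."""
    return max(range(len(endings)), key=lambda i: ending_logprob(ctx, " " + endings[i]))

Benchmark accuracy is then just the fraction of items where predict(...) matches the gold label.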

@misc{zellers2019hellaswag,
  title={HellaSwag: Can a Machine Really Finish Your Sentence?},
  author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
  year={2019},
  eprint={1905.07830},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

TL;DR

Humans achieve over 95% accuracy, while no model surpasses 50% accuracy. (2019)

Timeline Note: Everything below is written from the perspective of 2019, when the latest version (at the time of writing) of "HellaSwag: Can a Machine Really Finish Your Sentence?" was published.


Section: Abstract

Introduction to HellaSwag and Commonsense Inference

Development of HellaSwag Dataset

Implications for Machine Learning and NLP

Section: Introduction

Exploring Commonsense Inference in AI Models

Introduction of HellaSwag Dataset

Assessing Model Limitations and Dataset Evolution

Adversarial Filtering Overview

Future of Verified Progress in NLP

Section: Investigating SWAG

Investigating SWAG's Resolution by BERT

Learning Dynamics During Finetuning

Source of Stylistic Biases in SWAG

BERT's Adaptability and Discriminatory Power

Section: HellaSwag

A. Development and Structure of HellaSwag

Creation of HellaSwag for Commonsense NLI:

Incorporating WikiHow as a New Testbed:

Adversarial Filtering (AF) Methodology:

B. Human Interaction and Model Evaluation in HellaSwag

Achieving High Human Agreement:

Zero-Shot Categories for Model Generalization:

Observations on Dataset Lengths and Model Performance:

Section: Results

A. Evaluation of Models on HellaSwag Dataset

Model Performance Comparison:

Results Indicating Dataset Difficulty:

Insights on Pretraining and Finetuning:

B. Model Transferability Between SWAG and HellaSwag

Transfer Experiments:

Domain-Specific Observations:

C. Qualitative Analysis of Model Responses

Evaluation of BERT-Large's Predictions:

Section: Discussion

HellaSwag as a Challenging Testbed:

Difficulty for Future Discriminators:

Scaling of Pretraining:

Potential Algorithmic Improvements:

Evolving Benchmarks in NLP:

4 comments

Comments sorted by top scores.

comment by jacobjacob · 2024-01-07T17:49:58.587Z · LW(p) · GW(p)

Humans achieve over 95% accuracy, while no model surpasses 50% accuracy. (2019)


A series on benchmarks does seem very interesting and useful -- but you really gotta report more recent model results than those from 2019!! GPT-4 reportedly achieves 95.3% on HellaSwag, making that initial claim in the post very misleading.

[Image: a Google Gemini benchmark performance chart provided by Google]
Replies from: bruce-lee
comment by Bruce W. Lee (bruce-lee) · 2024-01-07T19:46:40.901Z · LW(p) · GW(p)

Thanks for the feedback. This is similar to the feedback I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important not to mislead anyone.

But yes, I did have several major epistemic concerns about the reliability of current academic practices for reporting performance scores. Even if a certain group of researchers were very ethical, as readers, how can we ever confirm that the numbers are indeed correct, or even that the experiment was actually run?

I was weighing the overall benefits of reporting such numbers, which I consider non-provable, against simply focusing on the moment when the paper was written and enjoying the a-ha moments that the authors would have felt back then.

Anyway, before I post another benchmark study blog tomorrow, I’ll devise some steps of action to satisfy both my concern and yours. It’s always a joy to post here on LessWrong. Thanks for the comment!

Replies from: jacobjacob
comment by jacobjacob · 2024-01-07T20:33:38.294Z · LW(p) · GW(p)

If that's your belief, I think you should edit in a disclaimer to your TL;DR section, like "Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology". 

Also, the numbers aren't "non-provable": anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
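
For instance, here is a minimal sketch of such a replication, assuming the openai (v1+) and datasets Python packages and the public HellaSwag validation split on Hugging Face (fields ctx, endings, label); the prompt format, 200-item sample, and answer parsing are illustrative choices, not the setup behind any officially reported number.

# Minimal sketch: query GPT-4 on a sample of HellaSwag validation items.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def predict_ending(ctx: str, endings: list[str]) -> int:
    """Ask the model which of the four endings is most plausible; returns 0-3, or -1 if unparsable."""
    options = "\n".join(f"{i}. {e}" for i, e in enumerate(endings))
    prompt = (
        "Choose the most plausible continuation.\n\n"
        f"Context: {ctx}\n\nOptions:\n{options}\n\n"
        "Answer with a single digit (0-3)."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    answer = (resp.choices[0].message.content or "").strip()
    return int(answer) if answer in {"0", "1", "2", "3"} else -1

# Validation labels are strings "0"-"3"; score a small sample to keep API costs down.
val = load_dataset("hellaswag", split="validation").select(range(200))
correct = sum(predict_ending(ex["ctx"], ex["endings"]) == int(ex["label"]) for ex in val)
print(f"Accuracy on {len(val)} sampled items: {correct / len(val):.3f}")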

Replies from: bruce-lee
comment by Bruce W. Lee (bruce-lee) · 2024-01-08T02:59:40.096Z · LW(p) · GW(p)

Thanks for the recommendation, though I'll think about a more fundamental solution that addresses both the ethical and the community concerns.

"Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology." Regarding this, just to sort everything out, because I'm writing under my real name, I do trust the authors and ethics of both OpenAI and DeepMind. It's just me questioning everything when I still can as a student. But I'll make sure not to cause any further confusion, as you recommended!