Broken Benchmark: MMLU

post by awg · 2023-08-29T18:09:02.907Z · LW · GW · 5 comments

This is a link post for https://www.youtube.com/watch?v=hVade_8H8mE

Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework against the MMLU benchmark and discovered a significant number of issues with the problem set.

Among them: questions whose official answers appear to be incorrect (label noise), questions missing the context needed to answer them, and ambiguously worded questions.

He highlights a growing need for a dedicated benchmarking organization that can research and build accurate, robust, sensible benchmark suites for evaluating SOTA models.

I found this video super interesting and the findings very important, so I wanted to share it here.

5 comments


comment by Dan H (dan-hendrycks) · 2023-08-30T02:11:13.222Z · LW(p) · GW(p)

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
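
To put rough numbers like these in perspective, here is a quick back-of-the-envelope sketch (my own illustration, not from the comment). Under the simplifying assumption that a model's wrong answers never happen to coincide with a wrong label, a fixed noise rate accounts for a growing share of the measured errors as a model's true accuracy rises:

```python
# Back-of-the-envelope illustration (not from the thread). Simplifying
# assumption: a model's wrong answers never match a wrong label, so every
# mislabeled item is scored as an error.

def measured_error(true_error: float, noise: float) -> float:
    """Expected measured error rate: real mistakes on clean items, plus all mislabeled items."""
    return noise + (1.0 - noise) * true_error

NOISE = 0.02  # ~2% label noise, the rough MMLU estimate above

for true_acc in (0.70, 0.90, 0.99):
    err = measured_error(1.0 - true_acc, NOISE)
    noise_share = NOISE / err  # fraction of measured errors attributable to mislabeling
    print(f"true accuracy {true_acc:.0%}: measured score {1 - err:.1%}, "
          f"{noise_share:.0%} of measured errors due to label noise")
```

At 70% true accuracy the noise barely registers; at 99% it accounts for most of the measured mistakes, which is the point the next reply makes.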

Replies from: glerzing, awg
comment by alenoach (glerzing) · 2023-08-30T23:47:53.050Z · LW(p) · GW(p)

As the video says, labeling noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose an LLM could be used to automatically detect most problematic questions, and a human could then verify, for each flagged question, whether it needs to be fixed or removed.
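
As a minimal sketch of that triage loop, assuming a hypothetical `query_llm` helper (wire it to whatever model API you use) and dataset items shaped as dicts with `question`, `choices`, and `answer` keys:

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to an LLM and return its reply as text."""
    raise NotImplementedError("connect this to an actual model API")

FLAG_PROMPT = """You are auditing a multiple-choice benchmark question.
Question: {question}
Choices: {choices}
Official answer: {answer}

Flag the question if the official answer looks wrong, the wording is
ambiguous, or context needed to answer is missing.
Reply with JSON: {{"problem": true or false, "reason": "..."}}"""

def flag_questions(dataset):
    """Yield (item, reason) pairs for questions the LLM considers problematic.

    A human then decides, for each flagged item, whether to fix or remove it.
    """
    for item in dataset:
        reply = query_llm(FLAG_PROMPT.format(**item))
        try:
            verdict = json.loads(reply)
        except json.JSONDecodeError:
            continue  # unparseable reply: skip rather than guess
        if verdict.get("problem"):
            yield item, verdict.get("reason", "")
```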

comment by awg · 2023-08-30T15:47:10.003Z · LW(p) · GW(p)

Your position seems to be that this isn't something to be worried about or looked into. Can you explain why?

For instance, if the goal is to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (if, for example, we could somehow get that number down to 0%)?

Replies from: ricraz, o-o
comment by Richard_Ngo (ricraz) · 2023-08-30T23:43:58.557Z · LW(p) · GW(p)

It seems like he's mainly responding to the implication that this means MMLU is "broken". Label noise can be suboptimal and still much less important than this post's title suggests.

comment by O O (o-o) · 2023-08-30T16:59:17.288Z · LW(p) · GW(p)

I imagine researchers at big labs know this and are correcting these errors as models get good enough for this to matter.