Broken Benchmark: MMLU

post by awg · 2023-08-29T18:09:02.907Z · LW · GW · 5 comments

This is a link post for https://www.youtube.com/watch?v=hVade_8H8mE

Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework against the MMLU benchmark and discovered a significant number of issues with the problem set.

Among them: questions whose official answers appear to be incorrect (label noise), questions missing the context needed to answer them, and ambiguously worded questions.

He highlights a growing need for a dedicated benchmarking organization that can research and build accurate, robust, sensible benchmark suites for evaluating SOTA models.

I found this video super interesting and the findings very important, so I wanted to share it here.

5 comments


comment by Dan H (dan-hendrycks) · 2023-08-30T02:11:13.222Z · LW(p) · GW(p)

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
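
To put rough numbers like these in perspective, here is a quick back-of-the-envelope sketch (my own illustration, not from the comment). Under the simplifying assumption that a model's wrong answers never happen to coincide with a wrong label, a fixed noise rate accounts for a growing share of the measured errors as a model's true accuracy rises:

```python
# Back-of-the-envelope illustration (not from the thread). Simplifying
# assumption: a model's wrong answers never match a wrong label, so every
# mislabeled item is scored as an error.

def measured_error(true_error: float, noise: float) -> float:
    """Expected measured error rate: real mistakes on clean items, plus all mislabeled items."""
    return noise + (1.0 - noise) * true_error

NOISE = 0.02  # ~2% label noise, the rough MMLU estimate above

for true_acc in (0.70, 0.90, 0.99):
    err = measured_error(1.0 - true_acc, NOISE)
    noise_share = NOISE / err  # fraction of measured errors attributable to mislabeling
    print(f"true accuracy {true_acc:.0%}: measured score {1 - err:.1%}, "
          f"{noise_share:.0%} of measured errors due to label noise")
```

At 70% true accuracy the noise barely registers; at 99% it accounts for most of the measured mistakes, which is the point the next reply makes.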

Replies from: glerzing, awg
comment by alenoach (glerzing) · 2023-08-30T23:47:53.050Z · LW(p) · GW(p)

As the video says, labeling noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose an LLM could be used to automatically detect most problematic questions, and a human could then verify, for each flagged question, whether it needs to be fixed or removed.
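
As a minimal sketch of that triage loop, assuming a hypothetical `query_llm` helper (wire it to whatever model API you use) and dataset items shaped as dicts with `question`, `choices`, and `answer` keys:

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to an LLM and return its reply as text."""
    raise NotImplementedError("connect this to an actual model API")

FLAG_PROMPT = """You are auditing a multiple-choice benchmark question.
Question: {question}
Choices: {choices}
Official answer: {answer}

Flag the question if the official answer looks wrong, the wording is
ambiguous, or context needed to answer is missing.
Reply with JSON: {{"problem": true or false, "reason": "..."}}"""

def flag_questions(dataset):
    """Yield (item, reason) pairs for questions the LLM considers problematic.

    A human then decides, for each flagged item, whether to fix or remove it.
    """
    for item in dataset:
        reply = query_llm(FLAG_PROMPT.format(**item))
        try:
            verdict = json.loads(reply)
        except json.JSONDecodeError:
            continue  # unparseable reply: skip rather than guess
        if verdict.get("problem"):
            yield item, verdict.get("reason", "")
```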

comment by awg · 2023-08-30T15:47:10.003Z · LW(p) · GW(p)

Your position seems to be that this isn't something to be worried about or looked into. Can you explain why?

For instance, if the goal is to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (if, for example, we could somehow get that number down to 0%)?

Replies from: ricraz, o-o
comment by Richard_Ngo (ricraz) · 2023-08-30T23:43:58.557Z · LW(p) · GW(p)

It seems like he's mainly responding to the implication that this means MMLU is "broken". Label noise can be suboptimal and still much less important than this post's title suggests.

comment by O O (o-o) · 2023-08-30T16:59:17.288Z · LW(p) · GW(p)

I imagine researchers at big labs know this and are correcting these errors as models get good enough for this to matter.