Broken Benchmark: MMLU
post by awg · 2023-08-29T18:09:02.907Z · LW · GW · 5 comments
This is a link post for https://www.youtube.com/watch?v=hVade_8H8mE
Phillip over at the AI Explained channel has been running some experiments on his SmartGPT framework against the MMLU benchmark and discovered a not-insignificant number of issues with the problem set.
Among them:
- Crucial context missing from questions (apparently copy-paste errors?)
- Ambiguous sets of answers
- Wrong sets of answers
He highlights a growing need for a proper benchmarking organization that can research and create accurate, robust, sensible benchmarking suites for evaluating SOTA models.
I found this video super interesting and the findings very important, so I wanted to share it here.
5 comments
comment by Dan H (dan-hendrycks) · 2023-08-30T02:11:13.222Z · LW(p) · GW(p)
Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
Replies from: glerzing, awg
↑ comment by alenoach (glerzing) · 2023-08-30T23:47:53.050Z · LW(p) · GW(p)
As the video says, label noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose that an LLM could be used to automatically detect most problematic questions, and a human could then check each flagged question to decide whether it needs to be fixed or removed.
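To make the "closer to 100%" point concrete, here is a rough sketch of how a wrong answer key distorts a measured score. The function name and the noise model are assumptions for illustration, not anything from the video or from MMLU itself: every mislabeled question is assumed to still have exactly one truly correct option, with the key pointing at a different option, and a model that misses the true answer is assumed to pick uniformly among the remaining options.

```python
def observed_accuracy(true_acc: float, noise: float, n_choices: int = 4) -> float:
    """Expected benchmark score when a fraction `noise` of answer keys is wrong.

    Hypothetical noise model (an assumption, not taken from the thread):
    - each mislabeled question still has exactly one truly correct option,
      and the key marks a different option;
    - when the model misses the true answer, it picks uniformly among the
      other n_choices - 1 options.
    """
    if not (0.0 <= true_acc <= 1.0 and 0.0 <= noise <= 1.0):
        raise ValueError("true_acc and noise must be in [0, 1]")
    if n_choices < 2:
        raise ValueError("n_choices must be at least 2")
    # Correctly keyed questions: scored right whenever the model is truly right.
    clean = (1.0 - noise) * true_acc
    # Mislabeled questions: scored right only if the model is truly wrong
    # and happens to choose the option the faulty key marks.
    lucky = noise * (1.0 - true_acc) / (n_choices - 1)
    return clean + lucky


if __name__ == "__main__":
    for noise in (0.02, 0.10):
        for true_acc in (0.70, 0.90, 1.00):
            score = observed_accuracy(true_acc, noise)
            print(f"noise={noise:.2f} true={true_acc:.2f} observed={score:.4f}")
```

Under these assumptions a perfect model tops out at 1 - noise, so 98% at 2% label noise and 90% at 10%. The gap between true and observed accuracy is small for weak models and largest near the ceiling, which is why the same amount of noise matters more as scores approach 100%.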
↑ comment by awg · 2023-08-30T15:47:10.003Z · LW(p) · GW(p)
Your position seems to be that this is not something to worry about or look into. Can you explain why?
For instance, if the goal is to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (if, for example, we could somehow get that number down to 0%)?
Replies from: ricraz, o-o
↑ comment by Richard_Ngo (ricraz) · 2023-08-30T23:43:58.557Z · LW(p) · GW(p)
It seems like he's mainly responding to the implication that this means MMLU is "broken". Label noise can be both suboptimal and also much less important than this post's title suggests.