post by [deleted] · · ? · GW · 0 comments

comment by jacobjacob · 2024-01-07T17:49:58.587Z · LW(p) · GW(p)

Humans achieve over 95% accuracy, while no model surpasses 50% accuracy. (2019)


A series on benchmarks does seem very interesting and useful -- but you really gotta report more recent model results than from 2019!! GPT-4 reportedly achieves 95.3% on HellaSwag, making that initial claim in the post very misleading.

A Google Gemini benchmark performance chart provided by Google.
Replies from: bruce-lee
comment by Bruce W. Lee (bruce-lee) · 2024-01-07T19:46:40.901Z · LW(p) · GW(p)

Thanks for the feedback. This is similar to the feedback that I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important not to mislead anyone.

But yes, I did have several major epistemic concerns about the reliability of current academic practices for reporting performance scores. Even if a given group of researchers is very ethical, how can we, as readers, ever confirm that the numbers are indeed correct, or even that an experiment was ever run?

I was weighing the overall benefits of reporting such (in my opinion) non-provable numbers against just focusing on the context in which the paper was written and enjoying the a-ha moments that the authors would have felt back then.

Anyway, before I post another benchmark study tomorrow, I’ll work out some concrete steps to address both my concern and yours. It’s always a joy to post here on LessWrong. Thanks for the comment!

Replies from: jacobjacob
comment by jacobjacob · 2024-01-07T20:33:38.294Z · LW(p) · GW(p)

If that's your belief, I think you should edit in a disclaimer to your TL;DR section, like "Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology". 

Also, the numbers aren't "non-provable": anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
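
A minimal sketch of the bookkeeping such a replication would need, assuming each HellaSwag item is a dict with `ctx`, `endings`, and `label` fields (my understanding of the public release, so check it against the actual files). `predict` is a hypothetical stand-in for whatever wraps the model call; nothing here touches a real API, and the sample items are made up for illustration.

```python
from typing import Callable, Iterable, Mapping, Sequence

# Hypothetical interface: given a context and its candidate endings,
# return the index of the ending the model considers most plausible.
Predictor = Callable[[str, Sequence[str]], int]


def multiple_choice_accuracy(items: Iterable[Mapping], predict: Predictor) -> float:
    """Fraction of items where the predicted ending index equals the gold label."""
    total = 0
    correct = 0
    for item in items:
        choice = predict(item["ctx"], item["endings"])
        # Labels may be stored as strings in some copies of the data, so cast.
        correct += int(choice == int(item["label"]))
        total += 1
    if total == 0:
        raise ValueError("no items to score")
    return correct / total


def always_first(ctx: str, endings: Sequence[str]) -> int:
    """Trivial baseline that always picks the first ending."""
    return 0


# Made-up toy items in the assumed format, NOT real HellaSwag data.
SAMPLE_ITEMS = [
    {"ctx": "She picks up the guitar and", "endings": ["starts to play.", "eats it."], "label": 0},
    {"ctx": "He opens the umbrella because", "endings": ["it is sunny indoors.", "it is raining."], "label": "1"},
]

if __name__ == "__main__":
    print(multiple_choice_accuracy(SAMPLE_ITEMS, always_first))
```

Swapping `always_first` for a function that queries a model and parses its answer gives a number to compare against the reported one. Details such as the number of few-shot examples and the prompt format would still have to match the original setup, and the contamination caveat applies regardless.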

Replies from: bruce-lee
comment by Bruce W. Lee (bruce-lee) · 2024-01-08T02:59:40.096Z · LW(p) · GW(p)

Thanks for the recommendation, though I'll think of a more fundamental solution to satisfy all ethical/communal concerns.

"Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology." Regarding this, just to sort everything out, because I'm writing under my real name, I do trust the authors and ethics of both OpenAI and DeepMind. It's just me questioning everything when I still can as a student. But I'll make sure not to cause any further confusion, as you recommended!