

Comment by gkamradt on AI #69: Nice · 2024-06-20T20:29:59.250Z · LW · GW

Test set: 51% vs prior SoTA of 34% (human baseline is unknown)

Ryan tested against the public test set and got 51%. The prior SoTA score of 34% was on the private test set.

Scores reported on public data are usually inflated due to overfitting (humans look at the questions and answers, then tailor their model to them).