List of commonly used benchmarks for LLMs



post by Diziet · 2023-04-20T02:25:01.947Z · LW · GW · 0 comments


I am compiling a list of tasks and evaluations used to test LLMs. I intend to expand this list to include each benchmark's initial publication date, scope of questions, number of questions, direct links to the datasets, and question types (i.e., multiple choice, fill in the missing word, etc.), along with additional comments on perceived difficulty and other characteristics. Most automated test suites rely on multiple-choice prompts, because free-form answers to open-ended questions are difficult to evaluate automatically.
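To illustrate why multiple-choice formats are easy to automate, here is a minimal sketch of how such an item might be scored by exact match on the answer letter. The `MultipleChoiceItem` class, the `format_prompt` and `accuracy` helpers, and the two toy questions are all hypothetical and written only for this example; they are not taken from any of the benchmarks listed here, and real harnesses often compare per-choice log-likelihoods instead of parsing a letter.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCDEFGH"


@dataclass
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: List[str]
    answer_index: int  # position of the correct choice in `choices`


def format_prompt(item: MultipleChoiceItem) -> str:
    """Render the question and lettered choices as a plain-text prompt."""
    lines = [item.question]
    for letter, choice in zip(LETTERS, item.choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: List[MultipleChoiceItem], model_outputs: List[str]) -> float:
    """Fraction of items whose first output character matches the correct letter."""
    if not items or len(items) != len(model_outputs):
        raise ValueError("need one model output per item, and at least one item")
    correct = 0
    for item, output in zip(items, model_outputs):
        predicted = output.strip().upper()[:1]
        if predicted == LETTERS[item.answer_index]:
            correct += 1
    return correct / len(items)


# Toy items invented for illustration only.
toy_items = [
    MultipleChoiceItem("What is 2 + 2?", ["3", "4", "5"], answer_index=1),
    MultipleChoiceItem("Which of these is a primary color?", ["Red", "Brown"], answer_index=0),
]

# Pretend model outputs: the first is right, the second is wrong.
print(accuracy(toy_items, [" b", "B"]))
```

Scoring here is a single string comparison per question, which is the appeal of the format. A free-form answer would need a human grader, a reference-matching metric, or another model acting as judge.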

TruthfulQA: https://github.com/sylinrl/TruthfulQA

MMLU: https://github.com/hendrycks/test

HellaSwag: https://github.com/rowanz/hellaswag/tree/master/data

WinoGrande: https://github.com/allenai/winogrande

HumanEval: https://github.com/openai/human-eval

DROP: https://arxiv.org/abs/1903.00161

GSM8K: https://github.com/openai/grade-school-math

LogiQA: https://github.com/lgw863/LogiQA-dataset

CoQA: https://stanfordnlp.github.io/coqa/

LAMBADA: https://zenodo.org/record/2630551#.X4Xzn5NKjUI

ReClor: https://whyu.me/reclor/

BoolQ: https://arxiv.org/abs/1905.10044
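
As a rough sketch of how the planned metadata could be recorded once it is collected, the entries above might be stored in a small record type. The `BenchmarkEntry` class, its field names, and the `missing_fields` helper are assumptions made for this example; only the names and URLs come from the list, and every other field is left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkEntry:
    """Hypothetical record for one benchmark in the list."""
    name: str
    url: str
    published: Optional[str] = None       # initial publication date, not yet filled in
    scope: Optional[str] = None           # scope of questions
    num_questions: Optional[int] = None   # number of questions
    question_type: Optional[str] = None   # e.g. multiple choice, fill in the missing word
    comments: str = ""                    # perceived difficulty and other notes


def missing_fields(entry: BenchmarkEntry) -> List[str]:
    """Names of the planned metadata fields that are still empty."""
    planned = ["published", "scope", "num_questions", "question_type"]
    return [name for name in planned if getattr(entry, name) is None]


# Only the name and URL are known from the list above.
entries = [
    BenchmarkEntry("TruthfulQA", "https://github.com/sylinrl/TruthfulQA"),
    BenchmarkEntry("MMLU", "https://github.com/hendrycks/test"),
]

for entry in entries:
    print(entry.name, "still needs:", ", ".join(missing_fields(entry)))
```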