Minerva

post by Algon · 2022-07-01T20:06:55.948Z · LW · GW · 6 comments

This is a link post for https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html

Contents

  Datasets
  Results
  Random Remarks
6 comments

Google Research's new AI tackles natural language math problems and handily outperforms the SOTA[1]. It is a pre-trained PaLM[2] finetuned on maths datasets (which use LaTeX) composed of maths webpages and arXiv papers (38.5B tokens). Three models were trained, with 8B, 62B and 540B parameters.

When generating answers, Minerva is always given the same prompt of four example questions, each with a correct chain of reasoning and a consistent format for the final, correct answer. Then the actual question is given. Minerva outputs a chain of reasoning and a corresponding answer a number of times, and the most common answer is chosen. Minerva is graded only on the final answer.

This voting algorithm is called maj1@k. It saturates faster than pass@k (generate k answers; if any one is right, the question is graded as correct) but doesn't perform as well for large k. This is quite reasonable: majority voting keeps choosing the most common answer, and the error in its estimate of which answer is most common shrinks as k grows, whereas pass@k simply gives the model more tries as k grows.
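To make the difference concrete, here is a minimal sketch of the two grading rules in Python. The function names and the sample answers are my own inventions for illustration, not anything from the Minerva paper or code, and I'm assuming final answers can be compared as plain strings.

```python
from collections import Counter


def maj1_at_k(sampled_answers, correct_answer):
    """Grade by majority vote: only the most common sampled answer counts.

    Assumption: ties are broken in favour of whichever answer was sampled
    first (this is how Counter.most_common orders equal counts).
    """
    most_common_answer, _count = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == correct_answer


def pass_at_k(sampled_answers, correct_answer):
    """Grade leniently: correct if any one of the k samples is right."""
    return correct_answer in sampled_answers


# Hypothetical final answers from k=6 samples for a single question.
samples = ["12", "15", "12", "7", "12", "15"]

print(maj1_at_k(samples, "12"))  # the majority answer is "12"
print(maj1_at_k(samples, "15"))  # "15" was sampled, but is not the majority
print(pass_at_k(samples, "15"))  # pass@k accepts it anyway
```

More samples can only help pass@k, since every extra sample is another chance to hit the right answer. With maj1@k, extra samples only sharpen the estimate of which answer the model prefers.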

Datasets

The datasets used are:

MATH dataset. Note that a CS PhD student who wasn't fond of maths achieved 40% accuracy on this dataset, and a three-time IMO gold medalist achieved 90%.
MMLU example questions

The datasets have questions which vary in difficulty. Predictably, the model performed worse on harder questions, with false positives increasing linearly with question difficulty on 

Results

 

Now time for a surprise quiz! For the purposes of this quiz, assume we're talking about the most accurate Minerva model (540B parameters using maj1@k sampling, with k=64 for MATH and k=16 for MMLU). We'll be averaging over results on subtopics[3]. Note the SOTA is OpenAI's davinci-002, which obtained absolute (averaged) scores of about 20% and 49% respectively.
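For anyone wondering what "averaging over subtopics" changes, here is a rough sketch of an equal-weight average next to a question-weighted one. The subtopic names, accuracies and question counts are all made up for illustration; they are not Minerva's results or the real dataset sizes.

```python
def equal_weight_average(accuracy_by_subtopic):
    """Average subtopic accuracies, giving every subtopic the same weight."""
    return sum(accuracy_by_subtopic.values()) / len(accuracy_by_subtopic)


def question_weighted_average(accuracy_by_subtopic, questions_by_subtopic):
    """Average subtopic accuracies, weighting each by its question count."""
    total_questions = sum(questions_by_subtopic.values())
    weighted_sum = sum(
        accuracy_by_subtopic[topic] * questions_by_subtopic[topic]
        for topic in accuracy_by_subtopic
    )
    return weighted_sum / total_questions


# Made-up numbers, purely to show that the two averages can differ.
made_up_accuracy = {"topic_a": 0.5, "topic_b": 0.75}
made_up_counts = {"topic_a": 300, "topic_b": 100}

print(equal_weight_average(made_up_accuracy))                       # 0.625
print(question_weighted_average(made_up_accuracy, made_up_counts))  # 0.5625
```

The two agree only when every subtopic has the same number of questions, so the averaged scores quoted here should be read as approximate.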

 

And the answers are... no, yes, yes and no. Here's the raw data.

MATH results are on the left and MMLU results are on the right.

 

 

Random Remarks

 

 

  1. ^

    State of the art

  2. ^

    Pathways Language Model, another AI developed by Google Research. 

  3. ^

    I'm assigning equal weights to the subtopics on MMLU because I'm too lazy to find out how many questions were on physics and maths in the dataset.

6 comments

Comments sorted by top scores.

comment by MondSemmel · 2022-07-01T21:05:43.664Z · LW(p) · GW(p)

Question on acronyms: what do SOTA and PaLM mean?

Replies from: Algon
comment by Algon · 2022-07-01T21:10:08.138Z · LW(p) · GW(p)

State of the art and Pathways Language Model (called PaLM by Google). I edited it to clarify.

comment by Lone Pine (conor-sullivan) · 2022-07-02T16:50:14.031Z · LW(p) · GW(p)

Hey, I've been trying to figure out how to embed polls in posts, like you did. Is that an elicit prediction embed?

Replies from: Algon
comment by Algon · 2022-07-02T18:23:31.440Z · LW(p) · GW(p)

Go to Elicit, make a question, then press the little pair of rectangles to copy the question's URL. Paste into your post.