Stephen McAleese's Shortform

post by Stephen McAleese (stephen-mcaleese) · 2023-01-08T21:46:25.888Z · LW · GW · 11 comments

Contents

11 comments

Comments sorted by top scores.

comment by Stephen McAleese (stephen-mcaleese) · 2024-09-14T19:52:28.164Z · LW(p) · GW(p)

The rate of progress on the MATH dataset is incredible and faster than I expected.

The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.

The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025 and ~80% accuracy by 2028.

But recently (September 2024), OpenAI released their new o1 model which achieved ~95% on the MATH dataset.

So it seems like we're getting 2028 performance on the MATH dataset already in 2024.

Quote from the blog post:

"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."
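
To make the gap concrete, here is a minimal Python sketch that compares the forecast figures quoted above with the reported o1 score. The numbers are taken from this comment and the quote; the variable and function names are just illustrative, and "2028" stands in for the blog post's "2028 or so".

```python
# Forecast MATH accuracy (%) by year, as quoted from the blog post.
# The 2028 entry is approximate ("80% by 2028 or so").
FORECAST = {2023: 21, 2024: 31, 2025: 52, 2028: 80}

# Reported o1 result (September 2024), as cited in this comment.
OBSERVED_YEAR = 2024
OBSERVED_ACCURACY = 95


def latest_forecast_year_matched(observed, forecast):
    """Return the latest forecast year whose predicted accuracy
    the observed score meets or exceeds, or None if there is none."""
    matched = [year for year, accuracy in forecast.items() if observed >= accuracy]
    return max(matched) if matched else None


matched_year = latest_forecast_year_matched(OBSERVED_ACCURACY, FORECAST)
years_ahead = matched_year - OBSERVED_YEAR
gap_vs_same_year = OBSERVED_ACCURACY - FORECAST[OBSERVED_YEAR]

print(f"Observed score already meets the {matched_year} forecast")
print(f"Years ahead of schedule: at least {years_ahead}")
print(f"Percentage points above the {OBSERVED_YEAR} forecast: {gap_vs_same_year}")
```

Since 95% is above every forecast point listed, "at least" is doing real work here: the forecast does not say when 95% was expected.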

Replies from: mattmacdermott, valery-cherepanov, nikolas-kuhn, shankar-sivarajan, nathan-helm-burger
comment by mattmacdermott · 2024-09-14T21:50:56.054Z · LW(p) · GW(p)

Do we know that the test set isn’t in the training data?

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-09-14T22:27:20.112Z · LW(p) · GW(p)

I don't know if we can be confident in the exact 95% result, but o1 consistently performs at a roughly similar level on math across a variety of benchmarks (e.g., AIME), and other people have found strong performance on other math tasks that are unlikely to have been in the training corpus.

comment by Qumeric (valery-cherepanov) · 2024-09-15T16:17:33.581Z · LW(p) · GW(p)

I would like to note that this dataset is not as hard as it might look. Humans did not perform so well because there was a strict time limit; I don't remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.

Nevertheless, it's very impressive, and AIMO results are even more impressive in my opinion.

comment by Amalthea (nikolas-kuhn) · 2024-09-15T00:00:57.987Z · LW(p) · GW(p)

In 2021, I predicted math to be basically solved by 2023 (using the kind of reinforcement learning on formally checkable proofs that DeepMind is using). It's been slower than expected, and I wouldn't have guessed that a less formal setting like o1 would go relatively well, but since then I just nod along to these kinds of results.

(Not sure what to think of that claimed 95% number though - wouldn't that kind of imply they'd blown past the IMO grand challenge? EDIT: There were significant time limits on the human participants, see Qumeric's comment.)

comment by Shankar Sivarajan (shankar-sivarajan) · 2024-09-15T02:15:45.507Z · LW(p) · GW(p)

A nice test might be the 2024 IMO (from July). I'm curious to see if it's reached gold medal performance on that.

The IMO Grand Challenge might be harder; I don't know how Lean works, but it's probably harder to write than human-readable LaTeX. 

Replies from: leon-lang
comment by Leon Lang (leon-lang) · 2024-09-16T12:45:45.101Z · LW(p) · GW(p)

OpenAI would have mentioned if they had reached gold on the IMO.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-09-15T12:42:18.269Z · LW(p) · GW(p)

The rate of progress is surprising even to experts pushing the frontier... Another example: https://x.com/polynoamial/status/998902692050362375

comment by Stephen McAleese (stephen-mcaleese) · 2024-12-08T10:42:48.950Z · LW(p) · GW(p)

Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.

Evolution has to encode information about the human brain's reward function using only about 1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don't generalize well, like "sweet foods are good".

In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (by ~2 orders of magnitude), and this advantage will probably increase in the future as models get larger.
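
Here is a back-of-the-envelope version of that comparison in Python. It is only a sketch under stated assumptions: roughly 3.1 billion base pairs in the human genome at 2 bits each, and 4 bytes (32-bit floats) per reward model parameter. Raw storage size is a crude proxy for how much usable information either one encodes.

```python
import math

# All constants below are assumptions for a rough estimate.
GENOME_BASE_PAIRS = 3.1e9    # approximate length of the human genome
BITS_PER_BASE_PAIR = 2       # four possible nucleotides -> 2 bits each
REWARD_MODEL_PARAMS = 25e9   # parameter count quoted in this comment
BYTES_PER_PARAM = 4          # assumes 32-bit floating point weights


def genome_gigabytes(base_pairs=GENOME_BASE_PAIRS, bits_per_pair=BITS_PER_BASE_PAIR):
    """Raw storage size of the genome in GB (1 GB = 1e9 bytes)."""
    return base_pairs * bits_per_pair / 8 / 1e9


def model_gigabytes(params=REWARD_MODEL_PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Raw storage size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


genome_gb = genome_gigabytes()
model_gb = model_gigabytes()
orders_of_magnitude = math.log10(model_gb / genome_gb)

print(f"Genome: ~{genome_gb:.2f} GB")
print(f"Reward model: ~{model_gb:.0f} GB")
print(f"Difference: ~{orders_of_magnitude:.1f} orders of magnitude")
```

With 16-bit weights the model size halves, but the gap still stays close to 2 orders of magnitude.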

  1. ^
comment by Stephen McAleese (stephen-mcaleese) · 2023-01-08T21:46:26.119Z · LW(p) · GW(p)

What books do you want to read in 2023?

Replies from: stephen-mcaleese
comment by Stephen McAleese (stephen-mcaleese) · 2023-01-08T22:01:36.269Z · LW(p) · GW(p)
  • The Scout Mindset
  • The Age of Em
  • The Beginning of Infinity
  • The Selfish Gene (2nd time)