Stephen McAleese's Shortform
post by Stephen McAleese (stephen-mcaleese) · 2023-01-08T21:46:25.888Z · LW · GW · 11 comments
Comments sorted by top scores.
comment by Stephen McAleese (stephen-mcaleese) · 2024-09-14T19:52:28.164Z · LW(p) · GW(p)
The rate of progress on the MATH dataset is incredible and faster than I expected.
The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.
The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025 and ~80% accuracy by 2028.
But recently (September 2024), OpenAI released their new o1 model which achieved ~95% on the MATH dataset.
So it seems like we're getting 2028 performance on the MATH dataset already in 2024.
Quote from the blog post:
"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."
Replies from: mattmacdermott, valery-cherepanov, nikolas-kuhn, shankar-sivarajan, nathan-helm-burger
↑ comment by mattmacdermott · 2024-09-14T21:50:56.054Z · LW(p) · GW(p)
Do we know that the test set isn’t in the training data?
Replies from: ryan_greenblatt
↑ comment by ryan_greenblatt · 2024-09-14T22:27:20.112Z · LW(p) · GW(p)
I don't know if we can be confident in the exact 95% result, but o1 does consistently perform at a roughly similar level on math across a variety of different benchmarks (e.g., AIME), and other people have found strong performance on other math tasks which are unlikely to have been in the training corpus.
↑ comment by Qumeric (valery-cherepanov) · 2024-09-15T16:17:33.581Z · LW(p) · GW(p)
I would like to note that this dataset is not as hard as it might look. The humans did not perform that well because there was a strict time limit; I don't remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.
Nevertheless, it's very impressive, and AIMO results are even more impressive in my opinion.
↑ comment by Amalthea (nikolas-kuhn) · 2024-09-15T00:00:57.987Z · LW(p) · GW(p)
In 2021, I predicted math to be basically solved by 2023 (using the kind of reinforcement learning on formally checkable proofs that DeepMind is using). It's been slower than expected, and I wouldn't have guessed that a less formal setting like o1 would go relatively well, but since then I just nod along to these kinds of results.
(Not sure what to think of that claimed 95% number though - wouldn't that kind of imply they'd blown past the IMO grand challenge? EDIT: There were significant time limits on the human participants, see Qumeric's comment.)
↑ comment by Shankar Sivarajan (shankar-sivarajan) · 2024-09-15T02:15:45.507Z · LW(p) · GW(p)
A nice test might be the 2024 IMO (from July). I'm curious to see if it's reached gold medal performance on that.
The IMO Grand Challenge might be harder; I don't know how Lean works, but it's probably harder to write than human-readable LaTeX.
Replies from: leon-lang
↑ comment by Leon Lang (leon-lang) · 2024-09-16T12:45:45.101Z · LW(p) · GW(p)
OpenAI would have mentioned if they had reached gold on the IMO.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-09-15T12:42:18.269Z · LW(p) · GW(p)
The rate of progress is surprising even to experts pushing the frontier... Another example: https://x.com/polynoamial/status/998902692050362375
comment by Stephen McAleese (stephen-mcaleese) · 2024-12-08T10:42:48.950Z · LW(p) · GW(p)
Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain's reward function using just 1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don't generalize well, like "sweet foods are good".
In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (by ~2 orders of magnitude), and this advantage will probably increase in the future as models get larger.
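The arithmetic behind the ~1 GB vs. ~100 GB comparison can be sketched as below. This is only a rough back-of-the-envelope check, not the author's own calculation: the genome length (about 3.1 billion base pairs), the 2 bits per base pair, and the 4 bytes per parameter (32-bit weights) are all assumptions introduced here for illustration, and the function names are hypothetical.

```python
import math

# All constants below are illustrative assumptions, not figures from a cited source.
GENOME_BASE_PAIRS = 3.1e9    # assumed approximate size of the human genome
BITS_PER_BASE_PAIR = 2       # four possible bases (A, C, G, T) -> 2 bits each
REWARD_MODEL_PARAMS = 25e9   # parameter count quoted in the comment above
BYTES_PER_PARAM = 4          # assumes 32-bit floating-point weights


def genome_gigabytes(base_pairs: float, bits_per_base_pair: float) -> float:
    """Raw storage size of a genome in GB (1 GB = 1e9 bytes)."""
    return base_pairs * bits_per_base_pair / 8 / 1e9


def model_gigabytes(params: float, bytes_per_param: float) -> float:
    """Raw storage size of a model's weights in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


genome_gb = genome_gigabytes(GENOME_BASE_PAIRS, BITS_PER_BASE_PAIR)
model_gb = model_gigabytes(REWARD_MODEL_PARAMS, BYTES_PER_PARAM)

# Size of the gap expressed in powers of ten.
orders_of_magnitude = math.log10(model_gb / genome_gb)

print(f"genome: ~{genome_gb:.2f} GB")
print(f"reward model: ~{model_gb:.0f} GB")
print(f"gap: ~{orders_of_magnitude:.1f} orders of magnitude")
```

Under these assumptions the genome comes out a little under 1 GB and the reward model at 100 GB, which matches the "~2 orders of magnitude" figure. If the weights were stored as 16-bit floats instead, the model would be about 50 GB, which narrows the gap somewhat but leaves it close to 2 orders of magnitude.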
comment by Stephen McAleese (stephen-mcaleese) · 2023-01-08T21:46:26.119Z · LW(p) · GW(p)
What books do you want to read in 2023?
Replies from: stephen-mcaleese
↑ comment by Stephen McAleese (stephen-mcaleese) · 2023-01-08T22:01:36.269Z · LW(p) · GW(p)
- The Scout Mindset
- The Age of Em
- The Beginning of Infinity
- The Selfish Gene (2nd time)