METR’s preliminary evaluation of o3 and o4-mini

post by Christopher King (christopher-king) · 2025-04-16T20:23:00.285Z · LW · GW · 2 comments

This is a link post for https://metr.github.io/autonomy-evals-guide/openai-o3-report/

2 comments

Comments sorted by top scores.

comment by Thomas Kwa (thomas-kwa) · 2025-04-17T02:41:55.815Z · LW(p) · GW(p)

The time horizon of o3 is ~1.5 hours vs. Claude 3.7's 54 minutes, and its gap above the long-term trend is statistically significant. It has been less than 2 months since the release of Claude 3.7. If the time horizon continues doubling every 3.5 months, as it has over the last year, we have only another ~12 months until it hits 16 hours and we can no longer measure it with HCAST.
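
A quick sketch of that arithmetic (the ~1.5-hour starting horizon, the ~16-hour HCAST ceiling, and the 3.5-month doubling time are the figures above; the calculation is just illustrative, not from METR's report):

```python
import math

# Illustrative arithmetic only, using the figures above: a ~1.5-hour current
# horizon, a ~16-hour HCAST measurement ceiling, and a 3.5-month doubling time.
current_horizon_hours = 1.5
ceiling_hours = 16.0
doubling_time_months = 3.5

doublings_needed = math.log2(ceiling_hours / current_horizon_hours)
months_needed = doublings_needed * doubling_time_months
print(f"~{doublings_needed:.1f} doublings, ~{months_needed:.0f} months")  # ~3.4 doublings, ~12 months
```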

My guess is that future models' time horizons will double every 3-4 months on well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on messier, more realistic tasks will follow the long-term 7-month doubling time.
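
A minimal sketch of how those two regimes would diverge, assuming (purely for illustration) that both start from the same ~1.5-hour horizon today; the doubling times and starting point are the comment's guesses, not measurements:

```python
# Illustrative projection of the two doubling regimes guessed at above, both
# assumed to start from a ~1.5-hour horizon today.
start_hours = 1.5

def horizon(months: float, doubling_time_months: float) -> float:
    """Projected time horizon after `months`, doubling every `doubling_time_months`."""
    return start_hours * 2 ** (months / doubling_time_months)

for months in (6, 12, 24):
    fast = horizon(months, 3.5)  # well-defined, RL-able tasks (HCAST, RE-Bench)
    slow = horizon(months, 7.0)  # messier, more realistic tasks
    print(f"{months:2d} mo: well-defined ~{fast:.1f} h, realistic ~{slow:.1f} h")
```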

comment by MalcolmMcLeod · 2025-04-17T01:15:11.798Z · LW(p) · GW(p)

I thank y'all for rapidly replicating and extending this eval. This is the most important eval extant: its units are directly comparable across models, and it bears directly on the questions of "coding for ML/AI research" and "long-horizon agency" that seem cruxy for short timelines. I did not expect @Daniel Kokotajlo [LW · GW] to be right about the superexponentiality so quickly.
 

My long-timeline probability mass is increasingly dependent on "this doesn't generalize past formally verifiable domains + formally verifiable domains are insufficient to substantially automate AI algorithmic progress" or "somehow this progress doesn't extend to the arbitrarily messy and novel real world." But it ain't looking good.