o3

post by Zach Stein-Perlman · 2024-12-20T18:30:29.448Z · LW · GW · 79 comments

See livestream, site, OpenAI thread, Nat McAleese thread.

OpenAI announced (but isn't yet releasing) o3 and o3-mini (skipping o2 because of telecom company O2's trademark). "We plan to deploy these models early next year." "o3 is powered by further scaling up RL beyond o1"; I don't know whether it's a new base model.

o3 gets 25% on FrontierMath, smashing the previous SoTA. (These are really hard math problems.[1]) Wow. (The dark blue bar, about 7%, is presumably one-attempt and most comparable to the old SoTA; unfortunately OpenAI didn't say what the light blue bar is, but I think it doesn't really matter and the 25% is for real.[2])

o3 is also easily SoTA on SWE-bench Verified and Codeforces.

It's also easily SoTA on ARC-AGI, after doing RL on the public ARC-AGI problems + when spending $4,000 per task on inference (!).[3]

OpenAI has a "new alignment strategy." (Just about the "modern LLMs still comply with malicious prompts, overrefuse benign queries, and fall victim to jailbreak attacks" problem.) It looks like RLAIF/Constitutional AI. See Lawrence Chan's thread.

OpenAI says "We're offering safety and security researchers early access to our next frontier models"; yay.

o3-mini will be able to use a low, medium, or high amount of inference compute, depending on the task and the user's preferences. o3-mini (medium) outperforms o1 (at least on Codeforces and the 2024 AIME) with less inference cost.

GPQA Diamond:

  1. ^

    Update: most of them are not as hard as I thought:

    There are 3 tiers of difficulty within FrontierMath: 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style [problems], 25% T3 = early researcher problems.

  2. ^

    My guess is it's consensus@128 or something (i.e. write 128 answers and submit the most common one). Even if it's pass@n (i.e. submit n tries) rather than consensus@n, that's likely reasonable because I heard FrontierMath is designed to have easier-to-verify numerical-ish answers.

    Update: it's not pass@n.

  3. ^

    It's not clear how they can leverage so much inference compute; they must be doing more than consensus@n. See Vladimir_Nesov's comment [LW(p) · GW(p)].

79 comments

Comments sorted by top scores.

comment by Thane Ruthenis · 2024-12-20T20:22:35.645Z · LW(p) · GW(p)

I'm going to go against the flow here and not be easily impressed. I suppose it might just be copium.

Any actual reason to expect that the new model beating these challenging benchmarks, which have previously remained unconquered, is any more of a big deal than the last several times a new model beat a bunch of challenging benchmarks that have previously remained unconquered?

Don't get me wrong, I'm sure it's amazingly more capable in the domains in which it's amazingly more capable. But I see quite a lot of "AGI achieved" panicking/exhilaration in various discussions, and I wonder whether it's more justified this time than the last several times this pattern played out. Does anything indicate that this capability advancement is going to generalize in a meaningful way to real-world tasks and real-world autonomy, rather than remaining limited to the domain of extremely well-posed problems?

One of the reasons I'm skeptical is the part where it requires thousands of dollars' worth of inference-time compute. Implies it's doing brute force at extreme scale, which is a strategy that'd only work for, again, domains of well-posed problems with easily verifiable solutions. Similar to how o1 blows Sonnet 3.5.1 out of the water on math, but isn't much better outside that.

Edit: If we actually look at the benchmarks here:

  • The most impressive-looking jump is FrontierMath from 2% to 25.2%, but it's also exactly the benchmark where the strategy of "generate 10k candidate solutions, hook them up to a theorem-verifier, see if one of them checks out, output it" would shine.
    • (With the potential theorem-verifier having been internalized by o3 over the course of its training; I'm not saying there was a separate theorem-verifier manually wrapped around o3.)
  • Significant progress on ARC-AGI has previously been achieved using "crude program enumeration", which made the authors conclude that "about half of the benchmark was not a strong signal towards AGI".
  • The SWE jump from 48.9 to 71.7 is significant, but it's not much of a qualitative improvement.

Not to say it's a nothingburger, of course. But I'm not feeling the AGI here.

Replies from: hastings-greer, sharmake-farah, Seth Herd, jbash, robert-lynn
comment by Hastings (hastings-greer) · 2024-12-20T21:30:57.444Z · LW(p) · GW(p)

It’s not AGI, but for human labor to retain any long-term value, there has to be an impenetrable wall that AI research hits, and this result rules out a small but nonzero number of locations that wall might’ve been.

comment by Noosphere89 (sharmake-farah) · 2024-12-20T21:39:16.599Z · LW(p) · GW(p)

To first order, I believe a lot of the reason the "AGI achieved" shrill posting tends to be overhyped is not that the models are theoretically incapable, but that reliability was way more of a requirement for replacing jobs fast than people realized. There are only a very few jobs where an AI agent can do well without instantly breaking down because it can't error-correct/be reliable, and I think this has been continually underestimated by AI bulls.

Indeed, one of my broader updates is that a capability is only important to the broader economy if it's very, very reliable, and I agree with Leo Gao and Alexander Gietelink Oldenziel that reliability is way more of a bottleneck than people thought:

https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#f5WAxD3WfjQgefeZz [LW(p) · GW(p)]

https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#YxLCWZ9ZfhPdjojnv [LW(p) · GW(p)]

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-12-20T22:44:46.170Z · LW(p) · GW(p)

I agree that this seems like an important factor. See also this post [LW · GW] making a similar point.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-12-20T23:30:18.347Z · LW(p) · GW(p)

To be clear, I do expect AI to accelerate AI research, and AI research may be one of the few exceptions to this rule. But this is one of the reasons I have longer timelines nowadays than a lot of other people, why I expect AI's impact on the economy to be surprisingly discontinuous in practice, and a big reason I expect few AI governance laws to be passed until very near the end of the AI-as-complement era for most jobs that are not AI research.

The post you linked is pretty great, thanks for sharing.

comment by Seth Herd · 2024-12-20T21:58:19.729Z · LW(p) · GW(p)

It's not really dangerous real AGI [LW · GW] yet. But it will be soon. This is a version that's like a human with severe brain damage to the frontal lobes, which provide agency and self-management, and to the temporal lobes, which handle episodic memory and therefore continuous, self-directed learning.

Those things are relatively easy to add, since it's smart enough to self-manage as an agent and self-direct its learning. Episodic memory systems exist and only need modest improvements - some low-hanging fruit are glaringly obvious from a computational neuroscience perspective, so I expect them to be employed almost as soon as a competent team starts working on episodic memory.

Don't indulge in even possible copium. We need your help to align these things, fast. The possibility of dangerous AGI soon can no longer be ignored.

 

Gambling that the gaps in LLMs abilities (relative to humans) won't be filled soon is a bad gamble.

comment by jbash · 2024-12-21T16:25:10.519Z · LW(p) · GW(p)

Not to say it's a nothingburger, of course. But I'm not feeling the AGI here.

These math and coding benchmarks are so narrow that I'm not sure how anybody could treat them as saying anything about "AGI". LLMs haven't even tried to be actually general.

How close is "the model" to passing the Woz test (go into a strange house, locate the kitchen, and make a cup of coffee, implicitly without damaging or disrupting things)? If you don't think the kinesthetic parts of robotics count as part of "intelligence" (and why not?), then could it interactively direct a dumb but dextrous robot to do that?

Can it design a nontrivial, useful physical mechanism that does a novel task effectively and can be built efficiently? Produce usable, physically accurate drawings of it? Actually make it, or at least provide a good enough design that it can have it made? Diagnose problems with it? Improve the design based on observing how the actual device works?

Can it look at somebody else's mechanical design and form a reasonably reliable opinion about whether it'll work?

Even in the coding domain, can it build and deploy an entire software stack offering a meaningful service on a real server without assistance?

Can it start an actual business and run it profitably over the long term? Or at least take a good shot at it? Or do anything else that involves integrating multiple domains of competence to flexibly pursue possibly-somewhat-fuzzily-defined goals over a long time in an imperfectly known and changing environment?

Can it learn from experience and mistakes in actual use, without the hobbling training-versus-inference distinction? How quickly and flexibly can it do that?

When it schemes, are its schemes realistically feasible? Can it tell when it's being conned, and how? Can it recognize an obvious setup like "copy this file to another directory to escape containment"?

Can it successfully persuade people to do specific, relatively complicated things (as opposed to making transparently unworkable hypothetical plans to persuade them)?

comment by Foyle (robert-lynn) · 2024-12-21T07:41:13.763Z · LW(p) · GW(p)

A very large amount of human problem solving/innovation in challenging areas consists of creating and evaluating potential solutions; it is a stochastic rather than deterministic process. My understanding is that our brains are highly parallelized, evaluating ideas in thousands of 'cortical columns' a few mm across (Jeff Hawkins's Thousand Brains formulation), with an attention mechanism that promotes the filtered best outputs of those myriad processes to form our 'consciousness'.

So generating and discarding large numbers of solutions within simpler 'sub-brains', via iterative or parallelized operation, is very much how I would expect to see AGI and SI develop.

comment by Hastings (hastings-greer) · 2024-12-20T19:33:15.224Z · LW(p) · GW(p)

“Scaling is over” was sort of the last hope I had for avoiding the “no one is employable, everyone starves” apocalypse. From that frame, the announcement video from openai is offputtingly cheerful.

Replies from: Seth Herd, winstonBosan, sharmake-farah
comment by Seth Herd · 2024-12-20T21:51:55.661Z · LW(p) · GW(p)

Really. I don't emphasize this because I care more about humanity's survival than the next decades sucking really hard for me and everyone I love. But how do LW futurists not expect catastrophic job loss that destroys the global economy?

Replies from: lc, o-o
comment by lc · 2024-12-20T22:07:31.476Z · LW(p) · GW(p)

I don't emphasize this because I care more about humanity's survival than the next decades sucking really hard for me and everyone I love.

I'm flabbergasted by this degree/kind of altruism. I respect you for it, but I literally cannot bring myself to care about "humanity"'s survival if it means the permanent impoverishment, enslavement or starvation of everybody I love. That future is simply not much better on my lights than everyone including the gpu-controllers meeting a similar fate. In fact I think my instincts are to hate that outcome more, because it's unjust.

But how do LW futurists not expect catastrophic job loss that destroys the global economy?

Slight correction: catastrophic job loss would destroy the ability of the non-landed, working public to participate in and extract value from the global economy. The global economy itself would be fine. I agree this is a natural conclusion; I guess people were hoping to get 10 or 15 more years out of their natural gifts.

Replies from: Seth Herd, Tenoke, Richard_Kennaway, o-o
comment by Seth Herd · 2024-12-20T23:14:52.218Z · LW(p) · GW(p)

Thank you. Oddly, I am less altruistic than many EA/LWers. They routinely blow me away.

I can only maintain even that much altruism because I think there's a very good chance that the future could be very, very good for a truly vast number of humans and conscious AGIs. I don't think it's that likely that we get a perpetual boot-on-face situation. I think only about 1% of humans are so sociopathic AND sadistic in combination that they wouldn't eventually let their tiny sliver of empathy cause them to use their nearly-unlimited power to make life good for people. They wouldn't risk giving up control, just share enough to be hailed as a benevolent hero instead of merely god-emperor for eternity.

I have done a little "metta" meditation to expand my circle of empathy. I think it makes me happy; I can "borrow joy". The side effect is weird decisions like letting my family suffer so that more strangers can flourish in a future I probably won't see.

comment by Tenoke · 2024-12-21T11:36:04.573Z · LW(p) · GW(p)

Survival is obviously much better because (1) you can lose jobs but eventually still have a good life (think UBI at minimum), and (2) because if you don't like it you can always kill yourself and be in the same spot as the non-survival case anyway.

Replies from: sil-ver
comment by Rafael Harth (sil-ver) · 2024-12-21T11:44:23.392Z · LW(p) · GW(p)
  2. Because if you don't like it you can always kill yourself and be in the same spot as the non-survival case anyway.

Not to get too morbid here, but I don't think this is a good argument. People tend not to commit suicide even if they have strongly net-negative lives.

comment by Richard_Kennaway · 2024-12-21T15:44:34.076Z · LW(p) · GW(p)

catastrophic job loss would destroy the ability of the non-landed, working public to participate in and extract value from the global economy. The global economy itself would be fine.

Who would the producers of stuff be selling it to in that scenario?

BTW, I recently saw the suggestion that discussions of “the economy” can be clarified by replacing the phrase with “rich people’s yacht money”. There’s something in that. If 90% of the population are destitute, then 90% of the farms and factories have to shut down for lack of demand (i.e. not having the means to buy), which puts more out of work, until you get a world in which a handful of people control the robots that keep them in food and yachts and wait for the masses to die off.

I wonder if there are any key players who would welcome that scenario. Average utilitarianism FTW!

At least, supposing there are still any people controlling the robots by then.

Replies from: Seth Herd
comment by Seth Herd · 2024-12-21T18:01:45.236Z · LW(p) · GW(p)

That's what would happen, and the fact that nobody wanted it to happen wouldn't help. It's a Tragedy of the Commons situation.

comment by O O (o-o) · 2024-12-20T22:34:29.872Z · LW(p) · GW(p)

Why would that be the likely case? Are you sure it's likely or are you just catastrophizing?

comment by O O (o-o) · 2024-12-20T22:29:29.094Z · LW(p) · GW(p)

catastrophic job loss that destroys the global economy?

I expect the US or Chinese government to take control of these systems sooner rather than later to maintain sovereignty. I also expect there will be some force to counteract the rapid nominal deflation that would happen if there were mass job loss. Every ultra-rich person now relies on billions of people buying their products to give their companies the valuation they have.

I don't think people want nominal deflation even if it's real economic growth. This will result in massive printing from the Fed that probably lands in people's pockets (like COVID checks).

Replies from: sharmake-farah, Seth Herd
comment by Noosphere89 (sharmake-farah) · 2024-12-20T22:36:45.081Z · LW(p) · GW(p)

I think this is reasonably likely, but not a guaranteed outcome, and I do think there's a non-trivial chance that the US regulates it way too late to matter, because I expect mass job loss to be one of the last things AI does, due to pretty severe reliability issues with current AI.

Replies from: robert-lynn
comment by Foyle (robert-lynn) · 2024-12-21T07:27:05.349Z · LW(p) · GW(p)

I think Elon will bring strong concern about AI to the fore in the current executive branch - he was an early voice for AI safety, and though he seems to have updated to a more optimistic view (and is pushing development through xAI), he still generally states P(doom) ~10-20%. His antipathy towards Altman and the Google founders is likely of benefit for AI regulation too - though that's no answer for the China et al. AGI development problem.

comment by Seth Herd · 2024-12-20T23:08:43.403Z · LW(p) · GW(p)

I also expect government control; see If we solve alignment, do we die anyway? [LW · GW] for musings about the risks thereof. But it is a possible partial solution to job loss. It's a lot tougher to pass a law saying "no one can make this promising new technology even though it will vastly increase economic productivity" than to just show up to one company and say "heeeey so we couldn't help but notice you guys are building something that will utterly shift the balance of power in the world.... can we just, you know, sit in and hear what you're doing with it and maybe kibbitz a bit?" Then nationalize it officially if and when that seems necessary.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-12-20T23:16:12.869Z · LW(p) · GW(p)

I actually think doing the former is considerably more in line with the way things are done/closer to the Overton window.

Replies from: Seth Herd, o-o
comment by Seth Herd · 2024-12-20T23:35:38.900Z · LW(p) · GW(p)

For politicians, yes - but the new administration looks to be strongly pro-tech (unless DJ Trump gets a bee in his bonnet and turns dramatically anti-Musk).

For the national security apparatus, the second seems more in line with how they get things done. And I expect them to twig to the dramatic implications much faster than the politicians do. In this case, there's not even anything illegal or difficult about just having some liaisons at OAI and an informal request to have them present in any important meetings.

At this point I'd be surprised to see meaningful legislation slowing AI/AGI progress in the US, because the "we're racing China" narrative is so compelling - particularly to the good old military-industrial complex, but also to people at large.

Slowing down might be handing the race to China, or at least a near-tie.

I am becoming more sure that would beat going full-speed without a solid alignment plan, despite my complete failure to interest anyone in the question of Who is Xi Jinping? [LW(p) · GW(p)] in terms of how he or his successors would use AGI. I don't think he's sociopathic/sadistic enough to create worse x-risks or s-risks than rushing to AGI does. But I don't know.

comment by O O (o-o) · 2024-12-20T23:24:48.301Z · LW(p) · GW(p)

We still somehow got the steam engine, electricity, cars, etc.  

There is an element of international competition to it. If we slack here, China will probably raise armies of robots with unlimited firepower and take over the world. (They constantly show aggression)

The longshoreman strike is only allowed (I think) because the west coast did automate and is somehow less efficient than the east coast, for example.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-12-20T23:48:19.858Z · LW(p) · GW(p)

Counterpoints: nuclear power, pharmaceuticals, bioengineering, urban development.

If we slack here, China will probably raise armies of robots with unlimited firepower and take over the world

Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy, as autocracies are wont to do, all the while remaining just as clueless regarding what's happening as their US counterparts banning AI to protect the jobs.

I don't really expect that to happen, but survival-without-dignity [LW · GW] scenarios do seem salient.

Replies from: o-o
comment by O O (o-o) · 2024-12-21T01:36:21.672Z · LW(p) · GW(p)

I think a lot of this is wishful thinking from safetyists who want AI development to stop. This may be reductionist, but almost every pause historically can be explained by economics.

Nuclear - war usage is wholly owned by the state and developed to its saturation point (i.e. once you have nukes that can kill all your enemies, there is little reason to develop them more). Energy-wise, supposedly, it was hamstrung by regulation, but in countries like China where development went unfettered, it is still not dominant. This tells me that a lot of the reason it is not being developed is that it is not economical.

For bio-related things, Eroom's law reigns supreme. It is just economically unviable to discover drugs in the way we do. Despite this, it's clear that bioweapons are regularly researched by government labs. The USG being so eager to fund gain-of-function research despite its bad optics should tell you as much.

Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy - 

I remember many essays from people all over this site on how China wouldn't be able to get to X-1 nm (or the crucial step for it) for decades, and China would always figure out a way to get to that nm or step within a few months. They surpassed our chip lithography expectations for them. They are very competent. They are run by probably the most competent government bureaucracy in the world. I don't know what it is, but people keep underestimating China's progress. When they aim their efforts at a target, they almost always achieve it.

Rapid progress is a powerful attractor state that requires a global hegemon to stop. China is very keen on the possibilities of AI, which is why they stop at nothing to get their hands on Nvidia GPUs. They also have literally no reason to develop a centralized project they are fully in control of. We have superhuman AI that seem quite easy to control already. What is stopping this centralized project on their end? No one is buying that even o3, which is nearly superhuman in math and coding, and probably lots of scientific research, is going to attempt world takeover.


 

comment by winstonBosan · 2024-12-20T20:06:22.431Z · LW(p) · GW(p)

And for me, the (correct) reframing of RL as the cherry on top of our existing self-supervised stack was the straw that broke my hopeful back.

And o3 is more straws to my broken back.

comment by Noosphere89 (sharmake-farah) · 2024-12-20T20:11:43.921Z · LW(p) · GW(p)

Do you mean this is evidence that scaling is really over, or is this the opposite where you think scaling is not over?

comment by LGS · 2024-12-21T02:51:37.045Z · LW(p) · GW(p)

It's hard to find numbers. Here's what I've been able to gather (please let me know if you find better numbers than these!). I'm mostly focusing on FrontierMath.

  1. Pixel counting on the ARC-AGI image, I'm getting $3,644 ± $10 per task.
  2. FrontierMath doesn't say how many questions they have (!!!). However, they have percent breakdowns by subfield, and those percents are given to the nearest 0.1%; using this, I can narrow the range down to 289-292 problems in the dataset. Previous models solve around 3 problems (4 problems in total were ever solved by any previous model, though the full o1 was not evaluated, only o1-preview was)
  3. o3 solves 25.2% of FrontierMath. This could be 73/290. But it is also possible that some questions were removed from the dataset (e.g. because they're publicly available). 25.2% could also be 72/286 or 71/282, for example.
  4. With roughly 280 to 290 problems, a rough ballpark 95% confidence interval for FrontierMath would be [20%, 30%] (see the sketch after this list). It is pretty strange that the ML community STILL doesn't put confidence intervals on their benchmarks. If you see a model achieve 30% on FrontierMath later, remember that its confidence interval would overlap with o3's. (Edit: actually, the confidence interval calculation assumes all problems are sampled from the same pool, which is explicitly not the case for this dataset: some problems are harder than others. So it is hard to get a true confidence interval without rerunning the evaluation several times, which would cost many millions of dollars.)
  5. Using the pricing for ARC-AGI, o3 cost around $1mm to evaluate on the 280-290 problems of FrontierMath. That's around $3,644 per attempt, or roughly $14,500 per correct solution.
  6. This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem (they'd take at most few hours per problem if you find the right expert, except for the hardest problems which o3 also didn't solve). Even without hiring a different domain expert per problem, I think if you gave me FrontierMath and told me "solve as many as you can, and you get $15k per correct answer" I'd probably spend like a year and solve a lot of them: if I match o3 within a year, I'd get $1mm, which is much higher than my mathematician salary. (o3 has an edge in speed, of course, but you could parallelize the hiring of mathematicians too.) I think this is the first model I've seen which gets paid more than I do!
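
Here is a minimal sketch of the interval and cost arithmetic above (a normal-approximation binomial interval; the 290-problem count and $3,644-per-attempt figure are the rough estimates from the earlier points, and the caveat that problems are not identically difficult still applies):

```python
import math

# Rough figures from the points above (problem count is only narrowed to ~280-292).
n_problems = 290
accuracy = 0.252          # o3's reported FrontierMath score
cost_per_attempt = 3644   # $ per task, from the ARC-AGI pixel-counting estimate

# Normal-approximation 95% binomial interval (ignores that problems differ in difficulty).
se = math.sqrt(accuracy * (1 - accuracy) / n_problems)
print(f"95% CI: [{accuracy - 1.96 * se:.1%}, {accuracy + 1.96 * se:.1%}]")  # ~[20%, 30%]

# Total evaluation cost and cost per correct solution.
print(f"total ~${n_problems * cost_per_attempt:,.0f}")               # ~$1.06M
print(f"per correct solution ~${cost_per_attempt / accuracy:,.0f}")  # ~$14.5k
```
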
Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-12-21T08:18:04.594Z · LW(p) · GW(p)

This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem

I don't think anchoring to o3's current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in principle, buying this capability is probably going to get orders of magnitude cheaper within the next few months, as they find various algorithmic shortcuts.

I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e.g. the $900k they'd save by running the benchmarks half a year later.

Edit: In fact, if, against these expectations, the implementation of o3's trick can't be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per FrontierMath problem and you can't get more efficient than one try per try), that would make me do a massive update against the "inference-time compute" paradigm.

comment by Seth Herd · 2024-12-20T21:49:27.841Z · LW(p) · GW(p)

Fucking o3. This pace of improvement looks absolutely alarming. I would really hate to have my fast timelines turn out to be right.

The "alignment" technique, "deliberative alignment", is much better than constitutional AI. It's the same during training, but it also teaches the model the safety criteria, and teaches the model to reason about them at runtime, using a reward model that compares the output to their specific safety criteria. (This also suggests something else I've been expecting - the CoT training technique behind o1 doesn't need perfectly verifiable answers in coding and math, it can use a decent guess as to the better answer in what's probably the same procedure).

While safety is not alignment (SINA?), this technique has a lot of promise for actual alignment. By chance, I've been working on an update to my Internal independent review for language model agent alignment [AF · GW], and have been thinking about how this type of review could be trained instead of scripted into an agent as I'd originally predicted.

This is that technique. It does have some promise.

But I don't think OpenAI has really thought through the consequences of using their smarter-than-human models with scaffolding that makes them fully agentic and soon enough reflective and continuously learning.

The race for AGI speeds up, and so does the race to align it adequately by the time it arrives in a takeover-capable form.

I'll write a little more on their new alignment approach soon.

comment by Charlie Steiner · 2024-12-21T04:33:57.433Z · LW(p) · GW(p)

Oh dear, RL for everything, because surely nobody's been complaining about the safety profile of doing RL directly on instrumental tasks rather than on goals that benefit humanity.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-12-21T17:41:25.670Z · LW(p) · GW(p)

My rather hot take is that a lot of the arguments for safety of LLMs also transfer over to practical RL efforts, with some caveats.

comment by Rafael Harth (sil-ver) · 2024-12-21T11:12:57.660Z · LW(p) · GW(p)

My probably contrarian take is that I don't think improvement on a benchmark of math problems is particularly scary or relevant. It's not nothing -- I'd prefer if it didn't improve at all -- but it only makes me slightly more worried.

Replies from: mr-hire
comment by Matt Goldenberg (mr-hire) · 2024-12-21T15:05:39.664Z · LW(p) · GW(p)

can you say more about your reasoning for this?

Replies from: sil-ver
comment by Rafael Harth (sil-ver) · 2024-12-21T18:01:38.983Z · LW(p) · GW(p)

About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I'd freak out if/when LLMs solve them. Models are still at 1/10, nothing has changed in the past year, and I doubt o3 will do better. (But I'm not making them public.)

Will write a reply to this comment when I can test it.

comment by sapphire (deluks917) · 2024-12-21T06:48:18.560Z · LW(p) · GW(p)

I was still hoping for a sort of normal life. At least for a decade or maybe more. But that just doesn't seem possible anymore. This is a rough night.

comment by Aaron_Scher · 2024-12-20T23:58:10.363Z · LW(p) · GW(p)

Regarding whether this is a new base model, we have the following evidence: 

Jason Wei:

o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years

Nat McAleese:

o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)

The prices leaked by ARC-AGI people indicate $60/million output tokens, which is also the current o1 pricing. 33m total tokens and a cost of $2,012.

Notably, the Codeforces graph with pricing puts o3 about 3x higher than o1 (though maybe it's secretly a log scale), and the ARC-AGI graph has the cost of o3 being 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That's corroborated by ARC-AGI: an average of 55k tokens per solution, which seems like a ton.
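
As a rough consistency check on those leaked numbers (a sketch; the ~100-task semi-private set and the 6-sample high-efficiency setting are assumptions taken from the ARC-AGI writeup quoted elsewhere in this thread):

```python
tokens_total = 33e6      # leaked total output tokens
price_per_m = 60         # $ per 1M output tokens (current o1 pricing)
tasks, samples = 100, 6  # assumed: ARC-AGI semi-private set size, high-efficiency samples per task

print(tokens_total / 1e6 * price_per_m)  # ~$1,980, close to the reported $2,012
print(tokens_total / (tasks * samples))  # ~55,000 tokens per sample, matching "55k per solution"
```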

I think this evidence indicates this is likely the same base model as o1, and I would be at like 65% sure, so not super confident. 

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-12-21T00:23:38.663Z · LW(p) · GW(p)

GPT-4o costs $10 per 1M output tokens, so the cost of $60 per 1M tokens is itself more than 6 times higher than it has to be. Which means they can afford to sell a much more expensive model at the same price. It could also be GPT-4.5o-mini or something, similar in size to GPT-4o but stronger, with knowledge distillation from full GPT-4.5o, given that a new training system [LW(p) · GW(p)] has probably been available for 6+ months now.

comment by Vladimir_Nesov · 2024-12-20T20:09:09.352Z · LW(p) · GW(p)

Using $4K per task means a lot of inference in parallel, which wasn't in o1. So that's one possible source of improvement, maybe it's running MCTS instead of individual long traces (including on low settings at $20 per task). And it might be built on the 100K H100s base model.

The scary less plausible option is that RL training scales, so it's mostly o1 trained with more compute, and $4K per task is more of an inefficient premium option on top rather than a higher setting on o3's source of power.

Replies from: Zach Stein-Perlman, Aaron_Scher
comment by Zach Stein-Perlman · 2024-12-20T20:37:30.495Z · LW(p) · GW(p)

The obvious boring guess is best of n. Maybe you're asserting that using $4,000 implies that they're doing more than that.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-12-20T20:54:58.967Z · LW(p) · GW(p)

Performance at $20 per task is already much better than for o1, so it can't be just best-of-n; you'd need more attempts to get that much better even if there is a very good verifier that notices a correct solution (at $4K per task that's plausible, but not at $20 per task). There are various clever beam search options that don't need to make inference much more expensive, but in principle might be able to give a boost at low expense (compared to not using them at all).

There's still no word on the 100K H100s model, so that's another possibility. Currently Claude 3.5 Sonnet seems to be better at System 1, while OpenAI o1 is better at System 2, and combining these advantages in o3 based on a yet-unannounced GPT-4.5o base model that's better than Claude 3.5 Sonnet might be sufficient to explain the improvement. Without any public 100K H100s Chinchilla optimal models it's hard to say how much that alone should help.

Replies from: RussellThor
comment by RussellThor · 2024-12-21T06:17:08.720Z · LW(p) · GW(p)

Anyone want to guess how capable a Claude-based System 2 will be when it is polished? I expect it to be better than o3 by a small amount.

comment by Aaron_Scher · 2024-12-21T00:01:29.540Z · LW(p) · GW(p)

The ARC-AGI page (which I think has been updated) currently says: 

At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).

comment by Alex_Altair · 2024-12-20T19:35:25.581Z · LW(p) · GW(p)

I wish they would tell us what the dark vs light blue means. Specifically, for the FrontierMath benchmark, the dark blue looks like it's around 8% (rather than the light blue at 25.2%). Which like, I dunno, maybe this is nit picking, but 25% on FrontierMath seems like a BIG deal, and I'd like to know how much to be updating my beliefs.

Replies from: elriggs, UnexpectedValues, Alex_Altair
comment by Logan Riggs (elriggs) · 2024-12-20T21:39:09.887Z · LW(p) · GW(p)

From an apparent author on reddit:

[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style problems, 25% T3 = early researcher problems

The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions. 

comment by Eric Neyman (UnexpectedValues) · 2024-12-20T20:46:20.544Z · LW(p) · GW(p)

My random guess is:

  • The dark blue bar corresponds to the testing conditions under which the previous SOTA was 2%.
  • The light blue bar doesn't cheat (e.g. doesn't let the model run many times and then see if it gets it right on any one of those times) but spends more compute than one would realistically spend (e.g. more than how much you could pay a mathematician to solve the problem), perhaps by running the model 100 to 1000 times and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning.
Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-12-20T20:51:25.796Z · LW(p) · GW(p)

and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning

The FrontierMath answers are numerical-ish ("problems have large numerical answers or complex mathematical objects as solutions"), so you can just check which answer the model wrote most frequently.
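
A minimal sketch of that majority-vote (consensus@n) selection; `solve_once` is a hypothetical stand-in for a single model run that returns a final answer:

```python
from collections import Counter

def consensus_answer(solve_once, problem, n=128):
    """Run the model n times and submit the most common final answer (consensus@n)."""
    answers = [solve_once(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Because the answers are exact-match objects, simple counting like this suffices; no separate judge is needed to pick among runs.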

Replies from: UnexpectedValues
comment by Eric Neyman (UnexpectedValues) · 2024-12-20T20:55:48.982Z · LW(p) · GW(p)

Yeah, I agree that that could work. I (weakly) conjecture that they would get better results by doing something more like the thing I described, though.

comment by Alex_Altair · 2024-12-20T19:37:34.853Z · LW(p) · GW(p)

On the livestream, Mark Chen says the 25.2% was achieved "in aggressive test-time settings". Does that just mean more compute?

Replies from: Charlie Steiner, Jonas Hallgren
comment by Charlie Steiner · 2024-12-21T04:26:12.585Z · LW(p) · GW(p)

It likely means running the AI many times and submitting the most common answer from the AI as the final answer.

comment by Jonas Hallgren · 2024-12-20T20:38:12.373Z · LW(p) · GW(p)

Extremely long chain of thought, no?

Replies from: Alex_Altair
comment by Alex_Altair · 2024-12-20T20:50:47.704Z · LW(p) · GW(p)

I guess one thing I want to know is like... how exactly does the scoring work? I can imagine something like, they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something that plainly silly probably isn't what happened, but it could be something similar.

If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn't particularly matter to me how much compute it used.

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-12-20T20:52:08.414Z · LW(p) · GW(p)

It was one submission, apparently.

Replies from: Alex_Altair
comment by Alex_Altair · 2024-12-20T21:13:06.666Z · LW(p) · GW(p)

Thanks. Is "pass@1" some kind of lingo? (It seems like an ungoogleable term.)

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-12-20T21:39:53.241Z · LW(p) · GW(p)

Pass@k means that at least one of k attempts passes, according to an oracle verifier. Evaluating with pass@k is cheating when k is not 1 (but still interesting to observe), the non-cheating option is best-of-k where the system needs to pick out the best attempt on its own. So saying pass@1 means you are not cheating in evaluation in this way.
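
A sketch of the distinction, with hypothetical stand-ins `solve_once` (one model attempt), `oracle_check` (the external verifier), and `self_score` (the system's own ranking of its attempts):

```python
def pass_at_k(solve_once, oracle_check, problem, k):
    """'Cheating' metric: counts as solved if ANY of k attempts passes the oracle verifier."""
    return any(oracle_check(problem, solve_once(problem)) for _ in range(k))

def best_of_k(solve_once, self_score, oracle_check, problem, k):
    """Non-cheating: the system picks its best attempt itself; only that one is graded."""
    attempts = [solve_once(problem) for _ in range(k)]
    chosen = max(attempts, key=lambda a: self_score(problem, a))
    return oracle_check(problem, chosen)
```

At k=1 the two coincide, which is why reporting pass@1 signals that no verifier-assisted selection was used.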

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-12-20T21:51:56.247Z · LW(p) · GW(p)

pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-12-20T22:45:29.571Z · LW(p) · GW(p)

For coding, a problem statement won't have exhaustive formal requirements that will be handed to the solver, only evals and formal proofs can be expected to have adequate oracle verifiers. If you do have an oracle verifier, you can just wrap the system in it and call it pass@1. Affordance to reliably verify helps in training (where the verifier is applied externally), but not in taking the tests (where the system taking the test doesn't itself have a verifier on hand).

comment by boazbarak · 2024-12-21T07:32:53.515Z · LW(p) · GW(p)

As I say here https://x.com/boazbaraktcs/status/1870369979369128314

Constitutional AI is great work, but Deliberative Alignment is fundamentally different. The difference is basically System 1 vs System 2. In RLAIF, ultimately the generative model that answers the user prompt is trained with (prompt, good response, bad response). Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and most importantly it is not taught how to reason about this text in the context of a particular example.

This ability to reason is crucial to OOD performance such as training only on English and generalizing to other languages or encoded output.

See also https://x.com/boazbaraktcs/status/1870285696998817958

Replies from: boazbarak
comment by boazbarak · 2024-12-21T07:37:18.168Z · LW(p) · GW(p)

Also, the thing I am most excited about with deliberative alignment is that it becomes better as models are more capable. o1 is already more robust than o1-preview, and I fully expect this to continue.

(P.s. apologies in advance if I’m unable to keep up with comments; popped from holiday to post on the DA paper.)

comment by O O (o-o) · 2024-12-20T22:27:19.733Z · LW(p) · GW(p)

While I'm not surprised by the pessimism here, I am surprised at how much of it is focused on personal job loss. I thought there would be more existential dread. 

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-12-20T22:50:10.467Z · LW(p) · GW(p)

Existential dread doesn't necessarily follow from this specific development if training only works around verifiable tasks and not for everything else, like with chess. Could soon be game-changing for coding and other forms of engineering, without full automation even there and without applying to a lot of other things.

Replies from: o-o
comment by O O (o-o) · 2024-12-20T22:57:27.857Z · LW(p) · GW(p)

Oh I guess I was assuming automation of coding would result in a step change in research in every other domain. I know that coding is actually one of the biggest blockers in much of AI research and automation in general.  

It might soon become cost effective to write bespoke solutions for a lot of labor jobs for example. 

comment by Taleuntum · 2024-12-21T13:33:17.368Z · LW(p) · GW(p)

I just straight up don't believe the Codeforces rating. I guess only a small subset of people solve algorithmic problems for fun in their free time, so it's probably opaque to many here, but a rating of 2727 (the one in the table) would be what's called an International Grandmaster and is the 176th best rating among all actively competing users on the site. I hope they will soon release details about how they got that performance measure.

Replies from: joel-burget
comment by Joel Burget (joel-burget) · 2024-12-21T14:48:59.510Z · LW(p) · GW(p)

It's hard to compare across domains but isn't the FrontierMath result similarly impressive?

comment by anaguma · 2024-12-21T09:18:33.483Z · LW(p) · GW(p)

How have your AGI timelines changed after this announcement?

Replies from: Thane Ruthenis, mateusz-baginski
comment by Thane Ruthenis · 2024-12-21T09:52:42.641Z · LW(p) · GW(p)

~No update, priced it all in after the Q* rumors first surfaced in November 2023.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-12-21T10:02:40.869Z · LW(p) · GW(p)

A rumor is not the same as a demonstration.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-12-21T10:18:35.393Z · LW(p) · GW(p)

It is if you believe the rumor and can extrapolate its implications, which I did. Why would I need to wait to see the concrete demonstration that I'm sure would come, if I can instead update on the spot?

It wasn't hard to figure out what "something like an LLM with A*/MCTS stapled on top" would look like, or where it'd shine, or that OpenAI might be trying it and succeeding at it (given that everyone in the ML community had already been exploring this direction at the time).

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-12-21T10:42:35.863Z · LW(p) · GW(p)

Suppose I flip a coin but I don't show you the outcome. Your friend's cousin tells you they think the bias is 80/20 in favor of heads.

If I then show you that the outcome was indeed heads, should you still update? (Yes)

Replies from: Thane Ruthenis, mattmacdermott
comment by Thane Ruthenis · 2024-12-21T10:48:24.229Z · LW(p) · GW(p)

Sure. But if you know the bias is 95/5 in favor of heads, and you see heads, you don't update very strongly.

And yes, I was approximately that confident that something-like-MCTS was going to work, that it'd demolish well-posed math problems, and that this is the direction OpenAI would go in (after weighing in the rumor's existence). The only question was the timing, and this is mostly within my expectations as well.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-12-21T14:38:59.401Z · LW(p) · GW(p)

That's significantly outside the prediction intervals of forecasters, so I will need to see a Metaculus/Manifold/etc. account where you explicitly make this prediction, sir.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-12-21T15:29:49.278Z · LW(p) · GW(p)

Fair! Except I'm not arguing that you should take my other predictions at face value on the basis of my supposedly having been right that one time. Indeed, I wouldn't do that without just the sort of receipt you're asking for! (Which I don't have. Best I can do is a December 1, 2023 private message I sent to Zvi making correct predictions regarding what o1-3 could be expected to be, but I don't view these predictions as impressive and it notably lacks credences.)

I'm only countering your claim that no internally consistent version of me could have validly updated all the way here from November 2023. You're free to assume that the actual version of me is dissembling or confabulating.

comment by mattmacdermott · 2024-12-21T11:58:38.436Z · LW(p) · GW(p)

The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected.

Like if you had thrown 100 coins and then revealed that 80 were heads.

comment by Mateusz Bagiński (mateusz-baginski) · 2024-12-21T09:42:20.321Z · LW(p) · GW(p)

I guess one's timelines might have gotten longer if one had very high credence that the paradigm opened by o1 is a blind alley (relative to the goal of developing human-worker-omni-replacement-capable AI) but profitable enough that OA gets distracted from its official most ambitious goal.

I'm not that person.

comment by jacobjacob · 2024-12-20T23:43:45.584Z · LW(p) · GW(p)

For people who don't expect a strong government response... remember that Elon is First Buddy now. 🎢

comment by RussellThor · 2024-12-21T06:35:10.825Z · LW(p) · GW(p)

That's some significant progress, but I don't think it will lead to TAI.

However, there is a realistic best-case scenario where LLMs/Transformers stop just short of that and can still give useful lessons and capabilities.

I would really like to see such an LLM system get as good as a top human team at security, so it could then be used to inspect and hopefully fix masses of security vulnerabilities. Note that this could give a false sense of security - an unknown-unknowns situation where it wouldn't find a totally new type of attack, say a combined SW/HW attack like Rowhammer/Meltdown but more creative. A superintelligence not based on LLMs could, however.

comment by Jonas V (Jonas Vollmer) · 2024-12-21T00:56:48.274Z · LW(p) · GW(p)

OpenAI didn't say what the light blue bar is

Presumably light blue is o3 high, and dark blue is o3 low?

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-12-21T01:03:52.642Z · LW(p) · GW(p)

I think they only have formal high and low versions for o3-mini

Edit: nevermind idk