Comments
I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise
My model is that "people don't do things" is the bigger bottleneck on capabilities progress than "no-one's thought of that yet".
I'm sure there is a person in each AGI lab who has had, at some point, an idea for capability-improvement isomorphic to almost any idea an alignment researcher had (perhaps with some exceptions). But the real blockers are, "Was this person one of the people deciding the direction of company research?", and, "If yes, do they believe in this idea enough to choose to allocate some of the limited research budget to it?".
And the research budget appears very limited. o1 seems to be incredibly simple, so simple all the core ideas were floating around back in 2022. Yet it took perhaps a year (until the Q* rumors in November 2023) to build a proof-of-concept prototype, and two years to ship it. Making even something as straightforward-seeming as that was overwhelmingly fiddly. (Arguably it was also delayed by OpenAI researchers having to star in a cyberpunk soap opera, except what was everyone else doing?)
So making a bad call regarding what bright idea to pursue is highly costly, and there are only so many ideas you can pursue in parallel. This goes tenfold for any ideas that might only work at sufficiently big scale – imagine messing up a GPT-5-level training run because you decided to try out something daring.
But: this still does not mean you can freely share capability insights. Yes, "did an AI capability researcher somewhere ever hear of this idea?" doesn't matter as much as you'd think. What does matter is, "is this idea being discussed widely enough to be fresh on the leading capability researchers' minds?". If yes, then:
- They may be convinced by one of the justifications regarding why this is a good idea.
- This idea may make it to the top of a leading researcher's mind, such that they would be idly musing on it 24/7 until finding a variant of it/an implementation of it that they'd be willing to try.
- If the idea is the talk of the town, they may not face as much reputational damage if they order R&D departments to focus on it and then it fails. (A smaller factor, but likely still in play.)
So I think avoiding discussion of potential capability insights is still a good policy.
Edit: I. e., don't give capability insights steam.
Notably, Eliezer, Nate, and John don't spend much if any of their time assessing research at all (at least recently) as far as I can tell.
- Perhaps not specific research projects, but they've communicated a lot regarding their models of what types of research are good/bad. (See e. g. Eliezer's list of lethalities, John's Why Not Just... sequence, this post of Nate's.)
- I would assume this is because this doesn't scale and their reviews are not, in any given instance, the ultimate deciding factor regarding what people do or what gets funded. Spending time evaluating specific research proposals is therefore cost-inefficient compared to reviewing general research trends/themes.
My current view is that more of the bottleneck in grantmaking is not having good stuff to fund
Because no entity that I know of is currently explicitly asking for proposals that Eliezer/Nate/John would fund. Why would people bother coming up with such proposals in these circumstances? The system explicitly doesn't select for it.
I expect that if there were actual explicit financial pressure to Goodhart to their preferences, many more research proposals that successfully do so would be around.
Hm. Eliezer has frequently complained that the field has no recognition function for good research he's satisfied with besides "he personally looks at the research and passes his judgement", and that this obviously doesn't scale.
Stupid idea: Set up a grantmaker that funds proposals based on a prediction market tasked with evaluating how likely Eliezer/Nate/John is to approve of a given research project. Each round, after the funding is assigned to the highest-credence projects, Eliezer/Nate/John evaluate a random subset of proposals to provide a ground-truth signal; the corresponding prediction markets pay out, the others resolve N/A.
This should effectively train a reward function that emulates the judges' judgements in a scalable way.
Is there an obvious reason this doesn't work? (One possible issue is the amount of capital that'd need to be frozen in those markets by market participants, but we can e. g. upscale the effective amounts of money each participant has as some multiple of the actual dollars invested, based on how many of their bets are likely to actually pay out.)
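To make the mechanism concrete, here's a minimal sketch of the per-round resolution logic I have in mind (all names hypothetical; `judge` stands in for the Eliezer/Nate/John evaluation):

```python
import random

def run_funding_round(proposals, market_credence, budget_slots, audit_fraction, judge):
    """Hypothetical sketch of the prediction-market grantmaker's round logic.

    proposals: list of proposal ids.
    market_credence: proposal id -> market-implied P(judge approves).
    budget_slots: how many proposals get funded this round.
    audit_fraction: share of proposals the judges actually review.
    judge: proposal id -> bool, the ground-truth approval signal.
    """
    # 1. Fund the proposals the market considers most likely to be approved.
    ranked = sorted(proposals, key=lambda p: market_credence[p], reverse=True)
    funded = ranked[:budget_slots]

    # 2. The judges review a random subset, providing the ground-truth signal.
    audited = random.sample(proposals, max(1, int(audit_fraction * len(proposals))))

    # 3. Markets on audited proposals pay out against the judges' verdicts;
    #    all other markets resolve N/A (bets are refunded).
    resolutions = {
        p: (("YES" if judge(p) else "NO") if p in audited else "N/A")
        for p in proposals
    }
    return funded, resolutions
```

Since traders only get paid when a market happens to be audited, their expected payoff still tracks the judges' true approval probability, which is what's supposed to make the market a scalable stand-in for their judgement.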
- Yup, I read through it after writing the previous response and now see that you don't need to be convinced of that point. Sorry about dragging you into this.
- I could nitpick the details here, but I think the discussion has kind of wandered away from any pivotal points of disagreement, plus John didn't want object-level arguments under this post. So I petition to leave it at that.
Also, random nitpick, who is talking about inference runs of billions of dollars???
There's a log-scaling curve, OpenAI have already spent on the order of a million dollars just to score well on some benchmarks, and people are talking about "how much would you be willing to pay for the proof of the Riemann Hypothesis?". It seems like a straightforward conclusion that if o-series/inference-time scaling works as well as ML researchers seem to hope, there'd be billion-dollar inference runs funded by some major institutions.
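To spell out the arithmetic behind that (purely illustrative; $\alpha, \beta$ are whatever the empirical fit gives):

$$s(C) \;\approx\; \alpha + \beta \log_{10} C \quad\Longrightarrow\quad C_{\text{new}} = C_{\text{old}} \cdot 10^{\Delta s/\beta}$$

I. e., if benchmark score scales roughly log-linearly with inference spend $C$, then each fixed score increment $\Delta s$ costs a constant factor more compute, and a $\$10^9$ run is only three more factor-of-ten steps along the same line as the ${\sim}\$10^6$ runs we've already seen.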
To lay out my arguments properly:
"Search is ruinously computationally inefficient" does not work as a counter-argument against the retargetability of search, because the inefficiency argument applies to babble-and-prune search, not to the top-down heuristical-constraint-based search that was/is being discussed.
There are valid arguments against easily-retargetable heuristics-based search as well (I do expect many learned ML algorithms to be much messier than that). But this isn't one of them.
ML researchers are currently incredibly excited about the inference-time scaling laws, talking about inference runs costing millions/billions of dollars, and how much capability will be unlocked this way.
The o-series paradigm would use this compute to, essentially, perform babble-and-prune search. The pruning would seem to be done by some easily-swappable evaluator (either the system's own judgement based on the target specified in a prompt, or an external theorem-prover, etc.).
If things will indeed go this way, then it would seem that a massive amount of capabilities will be based on highly inefficient babble-and-prune search, and that this search would be easily retargetable by intervening on one compact element of the system (the prompt, or the evaluator function).
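As a concrete illustration of what "retargetable by intervening on one compact element" means here, a minimal sketch (the `generate`/`evaluate` callables are hypothetical stand-ins for the base model's sampler and the swappable evaluator):

```python
from typing import Callable, List

def babble_and_prune(
    generate: Callable[[str, int], List[str]],  # sample N candidate chains of thought for a task
    evaluate: Callable[[str], float],           # swappable evaluator: prompt-specified judge, theorem-prover score, etc.
    task: str,
    n_candidates: int = 100,
) -> str:
    """Inference-time search as generate-then-select. Retargeting the search
    means swapping out `evaluate` (or the task prompt) -- one compact element --
    while the expensive generator stays untouched."""
    candidates = generate(task, n_candidates)  # "babble"
    return max(candidates, key=evaluate)       # "prune"

# Retargeting = same generator, different evaluator:
#   babble_and_prune(generate, theorem_prover_score, "Prove lemma 3")
#   babble_and_prune(generate, business_plan_rater, "Draft a go-to-market plan")
```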
I almost want to say that it sounds like we should recruit people from the same demographic as good startup founders. Almost.
Per @aysja's list, we want creative people with an unusually good ability to keep themselves on-track, who can fluently reason at several levels of abstraction, and who don't believe in the EMH. This fits pretty well with the stereotype of a successful technical startup founder – an independent vision, an ability to think technically and translate that technical vision into a product customers would want (i. e., develop novel theory and carry it across the theory-practice gap), high resilience in the face of adversity, high agency, willingness to believe you can spot an exploitable pattern where no-one did, etc.
... Or, at least, that is the stereotype of a successful startup founder from Paul Graham's essays. I expect that this idealized image diverges from reality in quite a few ways. (I haven't been following Silicon Valley a lot, but from what I've seen, I've not been impressed with all the LLM and LLM-wrapper startups. Which made me develop quite a dim image of what a median startup actually looks like.)
Still, when picking whom to recruit, it might be useful to adopt some of the heuristics Y Combinator/Paul Graham (claim to) employ when picking which startup-founder candidates to support?
(Connor Leahy also makes a similar point here: that pursuing some ambitious non-templated vision in the real world is a good way to learn lessons that may double as insights regarding thorny philosophical problems.)
I'm not strongly committed to the view that the costs won't rapidly reduce: I can certainly see worlds in which it's possible to efficiently distill trees-of-thought unrolls into single chains of thought. Perhaps it scales iteratively, where we train an ML model to handle the next layer of complexity by generating big ToTs, distilling them into CoTs, then generating the next layer of ToTs using these more-competent CoTs, etc.
Or perhaps distillation doesn't work that well, and the training/inference costs grow exponentially (combinatorially?).
Rohin Shah has already explained the basic reasons why I believe the mesa-optimizer-type search probably won't exist/be findable in the inner workings of the models we encounter: "Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency."
I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work (at train-time and at inference-time both), and how much AI researchers hype it up.
By contrast, my understanding is that the sort of search John is talking about retargeting isn't the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search.
So it is in fact the ML researchers now who believe in the superiority of the computationally inefficient search; not the agency theorists.
Exploitation is when one rational economic agent violates the validity of another rational economic agent's abstraction layer, using non-economic side-channels to make the latter accept an economically unfair division of gains from trade. Examples:
- Violence.
- Psychological manipulations. (Exerting psychological pressure that leads to bad, emotion-driven choices; gaslighting someone into thinking their work is less valuable than it is.)
- "Frogboiling" is a subset of this to which many of your examples apply, where an agent inflicts costs on another agent that are dramatically smaller than terminating the economic relationship, but which add up. (Exploits e. g. hyperbolic discounting.)
- Deception. (Lying about the meaning of the terms of the contract until the counterparty already commits to it; reneging on the contract; misrepresenting or concealing the actual state of the economy.)
- Cultivating irrationality. (E. g., propagandizing CDT over FDT, so that other agents accept blackmail.)
- Cultural influences. (Creating a culture in which e. g. working for Company A is seen as its own reward.)
- Destroying the interfaces other economic agents can use to coordinate against you. (E. g., destroying communications, making agents (/employees) distrust each other, etc.)
Roughly speaking, exploitation can target one of the following:
- Algorithms. (Make an economic agent behave not as an economic agent, but as some entirely different type of system; make a set of rational economic agents unable to act as a set of rational economic agents.)
- Values. (Modify a self-interested agent into an agent that wants to pursue something other than its interests.)
- Rationality. (Warp the target agent's world-model into an incorrect but beneficial-to-you state, making it unable to make correctly-informed choices.)
That's mostly my experience as well: experiments are near-trivial to set up, and setting up any experiment that isn't near-trivial to set up is a poor use of time that could instead be spent thinking about the topic a bit more and realizing what the experimental outcome would be, or why it would be entirely the wrong experiment to run.
But the friction costs of setting up an experiment aren't zero. If it were possible to sort of ramble an idea at an AI and then have it competently execute the corresponding experiment (or set up a toy formal model and prove things about it), I think this would be able to speed up even deeply confused/non-paradigmatic research.
... That said, I think the sorts of experiments we do aren't the sorts of experiments ML researchers do. I expect they're often things like "do a pass over this lattice of hyperparameters and output the values that produce the best loss" (and more abstract equivalents of this that can't be as easily automated using mundane code). And which, due to the atheoretic nature of ML, can't be "solved in the abstract".
So ML research perhaps could be dramatically sped up by menial-software-labor AIs. (Though I think even now the compute needed for running all of those experiments would be the more pressing bottleneck.)
in domains where there is a way to verify that the solution actually works, RL can scale to superhuman performance
Sure, the theory on that is solid. But how efficiently does it scale off-distribution, in practice?
The inference-time scaling laws, much like the pretraining scaling laws, are ultimately based on test sets whose entries are "shallow" (in the previously discussed sense). They don't tell us much regarding how well the technique scales with the "conceptual depth" of a problem.
o3 took a million dollars in inference-time compute and unknown amounts in training-time compute just to solve the "easy" part of the FrontierMath benchmark (which likely take human experts single-digit hours, maybe <1 hour for particularly skilled humans). How much would be needed for beating the "hard" subset of FrontierMath? How much more still would be needed for problems that take individual researchers days; or problems that take entire math departments months; or problems that take entire fields decades?
It's possible that the "synthetic data flywheel" works so well that the amount of human-researcher-hour-equivalents per unit of compute scales, say, exponentially with some aspect of o-series' training, and so o6 in 2027 solves the Riemann Hypothesis.
Or it scales not that well, and o6 can barely clear real-life equivalents of hard FrontierMath problems. Perhaps instead the training costs (generating all the CoT trees on which RL training is then done) scale exponentially, while researcher-hour-equivalents per compute units scale linearly.
It doesn't seem to me that we know which one it is yet. Do we?
Convincing.
It's only now that LLMs are reasonably competent in at least some hard problems
I don't think that's the limiter here. Reports in the style of "my unpublished PhD thesis was about doing X using Y methodology, I asked an LLM to do that and it one-shot a year of my work! the equations it derived are correct!" have been around for quite a while. I recall it at least in relation to Claude 3, and more recently, o1-preview.
If LLMs are prompted to combine two ideas, they've been perfectly capable of "innovating" for ages now, including at fairly high levels of expertise. I'm sure there's some sort of cross-disciplinary GPQA-like benchmark that they've saturated a while ago, so this is even legible.
The trick is picking which ideas to combine/in what direction to dig. This doesn't appear to be something LLMs are capable of doing well on their own, nor do they seem to speed up human performance on this task. (All cases of them succeeding at it so far have been, by definition, "searching under the streetlight": checking whether they can appreciate a new idea that a human already found on their own and evaluated as useful.)
I suppose it's possible that o3 or its successors change that (the previous benchmarks weren't measuring that, but surely FrontierMath does...). We'll see.
I expect RL to basically solve the domain
Mm, I think it's still up in the air whether even the o-series efficiently scales (as in, without requiring a Dyson Swarm's worth of compute) to beating the Millennium Prize Eval (or some less legendary yet still major problems).
I expect such problems don't pass the "can this problem be solved by plugging the extant crystallized-intelligence skills of a number of people into each other in a non-contrived[1] way?" test. Does RL training allow the model to sidestep this, letting it generate new crystallized-intelligence skills?
I'm not confident one way or another.
we have another scale-up that's coming up
I'm bearish on that. I expect GPT-4 to GPT-5 to be palpably less of a jump than GPT-3 to GPT-4, same way GPT-3 to GPT-4 was less of a jump than GPT-2 to GPT-3. I'm sure it'd show lower loss, and saturate some more benchmarks, and perhaps an o-series model based on it would clear FrontierMath, and perhaps programmers and mathematicians would be able to use it in an ever-so-bigger number of cases...
But I predict, with low-moderate confidence, that it still won't kick off a deluge of synthetically derived innovations. It'd have even more breadth and eye for nuance, but somehow, perplexingly, still no ability to use those capabilities autonomously.
- ^
"Non-contrived" because technically, any cognitive skill is just a combination of e. g. NAND gates, since those are Turing-complete. But obviously that doesn't mean any such skill is accessible if you've learned the NAND gate. Intuitively, a combination of crystallized-intelligence skills is only accessible if the idea of combining them is itself a crystallized-intelligence skill (e. g., in the math case, a known ansatz).
Which perhaps sheds some light on why LLMs can't innovate even via trivial ideas combinations. If a given idea-combination "template" weren't present in the training data, the LLM can't reliably independently conceive of it except by brute-force enumeration...? This doesn't seem quite right, but maybe in the right direction.
The internet and the mathematical literature is so vast that, unless you are doing something truly novel, there's some relevant subfield there
Previously, I'd intuitively assumed the same as well: that it doesn't matter if LLMs can't "genuinely research/innovate", because there is enough potential for innovative-yet-trivial combination of existing ideas that they'd still massively speed up R&D by finding those combinations. ("Innovation overhang", as @Nathan Helm-Burger puts it here.)
Back in early 2023, I'd considered it fairly plausible that the world would start heating up in 1-2 years due to such synthetically-generated innovations.
Except this... just doesn't seem to be happening? I've yet to hear of a single useful scientific paper or other meaningful innovation that was spearheaded by an LLM.[1] And they're already adept at comprehending such innovative-yet-trivial combinations if a human prompts them with those combinations. So it's not a matter of not yet being able to understand or appreciate the importance of such synergies. (If Sonnet 3.5.1 or o1 pro didn't do it, I doubt o3 would.)
Yet this is still not happening. My guess is that "innovative-yet-trivial combinations of existing ideas" are not actually "trivial", and LLMs can't do that for the same reasons they can't do "genuine research" (whatever those reasons are).
- ^
Admittedly it's possible that this is totally happening all over the place and people are just covering it up in order to have all of the glory/status for themselves. But I doubt it: there are enough remarkably selfless LLM enthusiasts that if this were happening, I'd expect it would've gone viral already.
Thanks, that's important context!
And fair enough, I used excessively sloppy language. By "instantly solvable", I did in fact mean "an expert would very quickly ("instantly") see the correct high-level approach to solving it, with the remaining work being potentially fiddly, but conceptually straightforward". "Instantly solvable" in the sense of "instantly know how to solve"/"instantly reducible to something that's trivial to solve".[1]
Which was based on this quote of Litt's:
FWIW the "medium" and "low" problems I say I immediately knew how to do are very close to things I've thought about; the "high"-rated problem above is a bit further, and I suspect an expert closer to it would similarly "instantly" know the answer.
That said,
if you are hard-pressed to find humans that could solve it "instantly" when seeing it the first time, then I wouldn't describe it in those terms
If there are no humans who can "solve it instantly" (in the above sense), then yes, I wouldn't call it "shallow". But if such people do exist (even if they're incredibly rare), this implies that the conceptual machinery (in the form of theorems or ansatzes) for translating the problem into a trivial one already exists as well. Which, in turn, means it's likely present in the LLM's training data. And therefore, from the LLM's perspective, that problem is trivial to translate into a conceptually trivial problem.
It seems you'd largely agree with that characterization?
Note that I'm not arguing that LLMs are useless or unimpressive-in-every-sense. This is mainly an attempt to build a model of why LLMs seem to perform so well on apparently challenging benchmarks while reportedly falling flat on their faces on much simpler real-life problems.
- ^
Or, closer to the way I natively think of it: In the sense that there are people (or small teams of people) with crystallized-intelligence skillsets such that they would be able to solve this problem by plugging their crystallized-intelligence skills one into another, without engaging in prolonged fluid-intelligence problem-solving.
Here's something that confuses me about o1/o3. Why was the progress there so sluggish?
My current understanding is that they're just LLMs trained with RL to solve math/programming tasks correctly, hooked up to some theorem-verifier and/or an array of task-specific unit tests to provide ground-truth reward signals. There are no sophisticated architectural tweaks, no runtime MCTS or A* search, nothing clever.
Why was this not trained back in, like, 2022 or at least early 2023; tested on GPT-3/3.5 and then default-packaged into GPT-4 alongside RLHF? If OpenAI was too busy, why was this not done by any competitors, at decent scale? (I'm sure there are tons of research papers trying it at smaller scales.)
The idea is obvious; doubly obvious if you've already thought of RLHF; triply obvious after "let's think step-by-step" went viral. In fact, I'm pretty sure I've seen "what if RL on CoTs?" discussed countless times in 2022-2023 (sometimes in horrified whispers regarding what the AGI labs might be getting up to).
The mangled hidden CoT and the associated greater inference-time cost are superfluous. DeepSeek r1/QwQ/Gemini Flash Thinking have perfectly legible CoTs which would be fine to present to customers directly; just let them pay on a per-token basis as normal.
Were there any clever tricks involved in the training? Gwern speculates about that here. But none of the follow-up reasoning models have an o1-style deranged CoT, so the more straightforward approaches probably Just Work.
Did nobody have the money to run the presumably compute-intensive RL-training stage back then? But DeepMind exists. Did nobody have the attention to spare, with OpenAI busy upscaling/commercializing and everyone else catching up? Again, DeepMind exists: my understanding is that they're fairly parallelized and they try tons of weird experiments simultaneously. And even if not DeepMind, why have none of the numerous LLM startups (the likes of Inflection, Perplexity) tried it?
Am I missing something obvious, or are industry ML researchers surprisingly... slow to do things?
(My guess is that the obvious approach doesn't in fact work and you need to make some weird unknown contrivances to make it work, but I don't know the specifics.)
I do think that something like dumb scaling can mostly just work
The exact degree of "mostly" is load-bearing here. You'd mentioned provisions for error-correction before. But are the necessary provisions something simple, such that the most blatantly obvious wrappers/prompt-engineering works, or do we need to derive some additional nontrivial theoretical insights to correctly implement them?
Last I checked, AutoGPT-like stuff has mostly failed, so I'm inclined to think it's closer to the latter.
Generalizing the lesson here: the supposedly-hard benchmarks for which I have seen a few problems (e.g. GPQA, software eng) turn out to be mostly quite easy, so my prior on other supposedly-hard benchmarks which I haven't checked (e.g. FrontierMath) is that they're also mostly much easier than they're hyped up to be
Daniel Litt's account here supports this prejudice. As a math professor, he knew instantly how to solve the low/medium-level problems he looked at, and he suggests that each "high"-rated problem would be likewise instantly solvable by an expert in that problem's subfield.
And since LLMs have eaten ~all of the internet, they essentially have the crystallized-intelligence skills for all (sub)fields of mathematics (and human knowledge in general). So from their perspective, all of those problems are very "shallow". No human shares their breadth of knowledge, so math professors specialized even in slightly different subfields would indeed have to do a lot of genuine "deep" cognitive work; this is not the case for LLMs.
GPQA is even worse: a literal advanced trivia quiz that seems moderately resistant to humans literally googling things, but not to the way the knowledge gets distilled into LLMs.
Basically, I don't think any extant benchmark (except I guess the Millennium Prize Eval) actually tests "deep" problem-solving skills, in a way LLMs can't cheat at using their overwhelming knowledge breadth.
My current strong-opinion-weakly-held is that they're essentially just extensive knowledge databases with a nifty natural-language interface on top.[1] All of the amazing things they do should be considered surprising facts about how far this trick can scale; not surprising facts about how close we are to AGI.
- ^
Which is to say: this is the central way to characterize what they are; not merely "isomorphic to a knowledge database with a natural-language search engine on top if you think about them in a really convoluted way". Obviously a human can also be considered isomorphic to database search if you think about it in a really convoluted way, but that wouldn't be the most-accurate way to describe a human.
the structure of jobs is shaped to accommodate human unreliability by making mistakes less fatal
Mm, so there's a selection effect on the human end, where the only jobs/pursuits that exist are those which humans happen to be able to reliably do, and there's a discrepancy between the things humans and AIs are reliable at, so we end up observing AIs being more unreliable, even though this isn't representative of the average difference between the human vs. AI reliability across all possible tasks?
I don't know that I buy this. Humans seem pretty decent at becoming reliable at ~anything, and I don't think we've observed AIs being more-reliable-than-humans at anything? (Besides trivial and overly abstract tasks such as "next-token prediction".)
(2) seems more plausible to me.
I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn't go into a benchmark (even if it were initially intended for a benchmark), it'd go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don't have a bird's-eye-view on the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.
Which perhaps suggests a better way to do math evals is to scope out a set of novel math publications made after a given knowledge-cutoff date, and see if the new model can replicate those? (Though this also needs to be done carefully, since tons of publications are also trivial and formulaic.)
Reliability is way more important than people realized
Yes, but whence human reliability? What makes humans so much more reliable than the SotA AIs? What are AIs missing? The gulf in some cases is so vast it's a quantity-is-a-quality-all-its-own thing.
Math and code specialist speeds up AI R&D
Does it? ML progress is famously achieved by atheoretical empirical tinkering, i. e. by having a very well-developed intuitive research taste: the exact opposite of well-posed math problems on which o1-3 shine. Something similar seems to be the case with programming: AIs seem bad at architecture/system design.
So it only speeds up the "drudge work", not the actual load-bearing theoretical work. Which is a nonzero speedup, as it lets you test intuitive-theoretical ideas quicker, but it's more or less isomorphic to having a team of competent-ish intern underlings.
I am also very confused. The space of problems has a really surprising structure, permitting algorithms that are incredibly adept at some forms of problem-solving, yet utterly inept at others.
We're only familiar with human minds, in which there's a tight coupling between the performances on some problems (e. g., between the performance on chess or sufficiently well-posed math/programming problems, and the general ability to navigate the world). Now we're generating other minds/proto-minds, and we're discovering that this coupling isn't fundamental.
(This is an argument for longer timelines, by the way. Current AIs feel on the very cusp of being AGI, but there in fact might be some vast gulf between their algorithms and human-brain algorithms that we just don't know how to talk about.)
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer
I don't think that's strictly true, the peer-review system often approves utter nonsense. But yes, I don't think any AI system can generate an actually worthwhile research paper.
@Scott Alexander, correction to the above: there are rumors that, like o1, o3 doesn't generate runtime trees of thought either, and that they spent thousands-of-dollars' worth of compute on single tasks by (1) having it generate a thousand separate CoTs, (2) outputting the answer the model produced most frequently. I. e., the "pruning meta-heuristic" I speculated about might just be the (manually-implemented) majority vote.
I think the guy in the quotes might be misinterpreting OpenAI researchers' statements, but it's possible.
In which case:
- We have to slightly reinterpret the reason for having the model try a thousand times. Rather than outputting the correct answer if at least one try is correct, it outputs the correct answer if, in N tries, it produces the correct answer more frequently than incorrect ones. The fact that they had to set N = 1024 for best performance on ARC-AGI still suggests there's a large amount of brute-forcing involved.
- This implies that at N = 100, the correct answer still isn't more frequent than the incorrect ones. So on the problems which o3 got wrong in the N = 6 regime but got right in the N = 1024 regime, the probability of any given CoT producing the correct answer is quite low.
- This has similar implications for the FrontierMath performance, if the interpretation of the dark-blue vs. light-blue bars is that dark-blue is for N = 1 or N = 6, and light-blue is for N = bignumber.
- We have to throw out everything about the "pruning" meta-heuristics; only the "steering" meta-heuristics exist. In this case, the transfer-of-performance problem would be that the "steering" heuristics only become better for math/programming; that RL only skews the distribution over CoTs towards the high-quality ones for problems in those domains. (The metaphorical "taste" then still exists, but only within CoTs.)
- (I now somewhat regret introducing the "steering vs. pruning meta-heuristic" terminology.)
Again, I think this isn't really confirmed, but I can very much see it.
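If that interpretation is right, the inference-time procedure would be something like the following minimal sketch (`sample_cot_answer` is a hypothetical stand-in for a single CoT rollout that returns a final answer):

```python
from collections import Counter

def majority_vote_answer(sample_cot_answer, task, n_samples=1024):
    """Sketch of the rumored procedure: no runtime tree and no learned pruning
    heuristic, just many independent chains of thought plus a manually
    implemented majority vote over their final answers."""
    answers = [sample_cot_answer(task) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples  # winning answer and its vote share

# On problems solved at N = 1024 but not at N = 6, the winning vote share is
# presumably only barely above that of the runner-up answers.
```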
I don't think counting from announcement to announcement is valid here, no. They waited to announce o1 until they had o1-mini and o1-preview ready to ship: i. e., until they'd already gotten around to optimizing these models for compute-efficiency and to setting up the server infrastructure for running them. That couldn't have taken zero time. Separately, there's evidence they've had them in-house for a long time, between the Q* rumors from a year ago and the Orion/Strawberry rumors from a few months ago.
This is not the case for o3. At the very least, it is severely unoptimized, taking thousands of dollars per task (i. e., it's not even ready for the hypothetical $2000/month subscription they floated).
That is,
Do you think that o1 wasn't the best model (of this type) that OpenAI had internally at the point of the o1 announcement? If so, do you think that o3 isn't the best model (of this type) that OpenAI has internally now?
Yes and yes.
The case for "o3 is the best they currently have in-house" is weaker, admittedly. But even if it's not the case, and they already have "o4" internally, the fact that o1 (or powerful prototypes) existed well before the September announcement seem strongly confirmed, and that already disassembles the narrative of "o1 to o3 took three months".
There are degrees of Goodharting. It's not Goodharting to ARC-AGI specifically, but it is optimizing for performance on the array of easily-checkable benchmarks. Which plausibly have some common factor between them to which you could "Goodhart"; i. e., a way to get good at them without actually training generality.
I concur with all of this.
Two other points:
- It's unclear to what extent the capability advances brought about by moving from LLMs to o1/3-style stuff generalize beyond math and programming (i. e., domains in which it's easy to set up RL training loops based on machine-verifiable ground-truth).
Empirical evidence: "vibes-based evals" of o1 hold that it's much better than standard LLMs in those domains, but is at best as good as Sonnet 3.5.1 outside them. Theoretical justification: if there are easy-to-specify machine verifiers, then the "correct" solution for the SGD to find is to basically just copy these verifiers into the model's forward passes. And if we can't use our program/theorem-verifiers to verify the validity of our real-life plans, it'd stand to reason the corresponding SGD-found heuristics won't generalize to real-life stuff either.
Math/programming capabilities were coupled to general performance in the "just scale up the pretraining" paradigm: bigger models were generally smarter. It's unclear whether the same coupling holds for the "just scale up the inference-compute" paradigm; I've seen no evidence of that so far.
- The claim that "progress from o1 to o3 was only three months" is likely false/misleading. The talk of Q*/Strawberry has been around since the board drama of November 2023, at which point it had already supposedly beaten some novel math benchmarks. So o1, or a meaningfully capable prototype of it, has been around for more than a year now. They only chose to announce and release it three months ago. (See e. g. gwern's related analysis here.)
o3, by contrast, seems to be their actual current state-of-the-art model, which they've only recently trained. They haven't been sitting on it for months, haven't spent months making it ready/efficient enough for a public release.
Hence the illusion of insanely fast progress. (Which was probably exactly OpenAI's aim.)
I'm open to being corrected on any of these claims if anyone has relevant data, of course.
One can almost argue that some Kaggle models are very slightly trained on the test set
I'd say they're more-than-trained on the test set. My understanding is that humans were essentially able to do an architecture search, picking the best architecture for handling the test set, and then also put in whatever detailed heuristics they wanted into it based on studying the test set (including by doing automated heuristics search using SGD, it's all fair game). So they're not "very slightly" trained, they're trained^2.
Arguably the same is the case for o3, of course. ML researchers are using benchmarks as targets, and while they may not be directly trying to Goodhart to them, there's still a search process over architectures-plus-training-loops whose termination condition is "the model beats a new benchmark". And SGD itself is, in some ways, a much better programmer than any human.
So o3's development and training process essentially contained the development-and-training process for Kaggle models. They've iteratively searched for an architecture that can be trained to beat several benchmarks, then did so.
[ARC-AGI] is designed in such a way such that unlocking the training set cannot allow you to do well on the test set
I don't know whether I would put it this strongly. I haven't looked deep into it, but isn't it basically a non-verbal IQ test? Those very much do have a kind of "character" to them, such that studying how they work in general can let you derive plenty of heuristics for solving them. Those heuristics would be pretty abstract, yet far below the abstraction level of "general intelligence" (or the pile of very-abstract heuristics we associate with "general intelligence").
Do you not consider that ultimately isomorphic to what o3 does?
The basic guess regarding how o3's training loop works is that it generates a bunch of chains of thoughts (or, rather, a branching tree), then uses some learned meta-heuristic to pick the best chain of thought and output it.
As part of that, it also learns a meta-heuristic for which chains of thought to generate to begin with. (I. e., it continually makes judgement calls regarding which trains of thought to pursue, rather than e. g. generating all combinatorially possible combinations of letters.)
It would indeed work best in domains that allow machine verification, because then there's an easily computed ground-truth RL signal for training the meta-heuristic. Run each CoT through a proof verifier/an array of unit tests, then assign reward based on that. The learned meta-heuristics can then just internalize that machine verifier. (I. e., they'd basically copy the proof-verifier into the meta-heuristics. Then (a) once a spread of CoTs is generated, it can easily prune those that involve mathematically invalid steps, and (b) the LLM would become ever-more-unlikely to generate a CoT that involves mathematically invalid steps to begin with.)
However, arguably, the capability gains could transfer to domains outside math/programming.
There are two main possibilities here:
- You can jury-rig "machine verification" for "soft domains" by having an LLM inspect the spread of ideas it generated (e. g., 100 business plans), then pick the best one, using the LLM's learned intuition as the reward function. (See e. g. how Constitutional AI works, compared to RLHF.)
- You can hope that the meta-heuristics, after being trained on math/programming, learn some general-purpose "taste", an ability to tell which CoTs are better or worse, in a way that automatically generalizes to "soft" domains (perhaps with some additional fine-tuning using the previous idea).
That said, empirically, if we compare o1-full to Claude Sonnet 3.5.1, it doesn't seem that the former dominates the latter in "soft" domains as dramatically as it does at math. So the transfer, if it happens at all, isn't everything AI researchers could hope for.
Also, there's another subtle point here:
- o1's public version doesn't seem to actually generate trees of thought in response to user queries and then prune them. It just deterministically picks the best train of thought to pursue as judged by the learned meta-heuristic (the part of it that's guiding which trees to generate; see the previous point regarding how it doesn't just generate all possible combinations of letters, but makes judgement calls regarding that as well).
- By contrast, o3 definitely generates that tree (else it couldn't have spent thousands-of-dollars' worth of compute on individual tasks, due to the context-window limitations).
Am I understanding right that this is all just clever ways of having it come up with many different answers or subanswers or preanswers, then picking the good ones to expand upon?
The best guess based on the publicly available information is that yes, this is the case.
Why should this be good for eg proving difficult math theorems, where many humans using many different approaches have failed, so it doesn't seem like it's as simple as trying a hundred times, or even trying using a hundred different strategies?
Which strategies you're trying matters. It indeed wouldn't do much good if you just pick completely random steps/generate totally random messages. But if you've trained some heuristic for picking the best-seeming strategies among the strategy-space, and this heuristic has superhuman research taste...
What do people mean when they say that o1 and o3 have "opened up new scaling laws" and that inference-time compute will be really exciting?
That for a given LLM model being steered by a given meta-heuristic, the performance on benchmarks steadily improves with the length of CoTs / the breadth of the ToTs generated.
Why do we expect this to scale?
Straight lines on graphs go brr? Same as with the pre-training laws. We see a simple pattern, we assume it extrapolates.
What is o3 doing that you couldn't do by running o1 on more computers for longer?
I'm not sure. It's possible that a given meta-heuristic can only keep the LLM on-track for a fixed length of CoT / for a fixed breadth of ToT. You would then need to learn how to train better meta-heuristics to squeeze out more performance.
A possible explanation is that you need "more refined" tastes to pick between a broader range of CoTs. E. g., suppose that the quality of CoTs is on a 0-100 scale. Suppose you've generated a spread of CoTs, and the top 5 of them have the "ground-truth quality" of 99.6, 99.4, 98, 97, 96. Suppose your meta-heuristic is of the form Q + e, where Q is the ground-truth quality and e is some approximation-error term. If e is on the order of 0.5, then the model can't distinguish between the top 2 guesses, and picks one at random. If e is on the order of 0.05, however, it reliably picks the best guess of those five. This can scale: then, depending on how "coarse" your model's tastes are, it can pick out the best guess among 10^4, 10^5 guesses, etc.
(I. e., then the inference-time scaling law isn't just "train any good-enough meta-heuristic and pour compute", it's "we can train increasingly better meta-heuristics, and the more compute they can usefully consume at inference-time, the better the performance".)
(Also: notably, the issue with the transfer-of-performance might be that how "refined" the meta-heuristic's taste is depends on the domain. E. g., for math, the error term might be 10^-5, for programming 10^-4, and for "soft" domains, 10^-1.)
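Here's a quick simulation of the Q + e toy model above, with arbitrary illustrative numbers, to show how the evaluator's error term bounds the useful number of guesses:

```python
import random

def selection_regret(n_guesses: int, error_scale: float, trials: int = 2000) -> float:
    """Toy Q + e model: each guess has ground-truth quality Q ~ U(0, 100); the
    'pruning' meta-heuristic only observes Q + e with e ~ N(0, error_scale) and
    picks the apparent best. Returns the average gap between the true best
    guess in the batch and the guess actually picked."""
    total_gap = 0.0
    for _ in range(trials):
        qs = [random.uniform(0, 100) for _ in range(n_guesses)]
        picked = max(qs, key=lambda q: q + random.gauss(0, error_scale))
        total_gap += max(qs) - picked
    return total_gap / trials

# With a sharp evaluator (e ~ 0.05) the picked guess is essentially the true best
# even among 1000 candidates; with a blurry one (e ~ 5) the top candidates become
# indistinguishable near-ties, and it effectively picks among them at random.
for e in (0.05, 5.0):
    print(e, [round(selection_regret(n, e), 2) for n in (10, 100, 1000)])
```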
Does inference compute scaling mean that o3 will use ten supercomputers for one hour per prompt, o4 will use a hundred supercomputers for ten hours per prompt, and o5 will use a thousand supercomputers for a hundred hours per prompt?
Not necessarily. The strength of the LLM model being steered, and the quality of the meta-heuristics doing the steering, matters. GPT-5 can plausibly outperform o3-full for much less inference-time compute, by needing shorter CoTs. "o3.5", using the same LLM but equipped with a better-trained meta-level heuristic, can likewise outperform o3 by having better judgement regarding which trains of thought to pursue (roughly, the best guess of o3.5 among 10 guesses would be as good as the best guess of o3 among 100 guesses).
And then if my guess regarding different meta-heuristics only being able to make use of no more than a fixed quantity of compute is right, then yes, o[3+n] models would also be able to usefully consume more raw compute.
Edit: I. e., there are basically three variables at play here:
- How many guesses it needs to find a guess with a ground-truth quality above some threshold. (How refined the "steering" meta-heuristic is. What is the ground-truth quality of the best guess in 100 guesses it generated? How much is the probability distribution over guesses skewed towards the high-quality guesses?)
- How refined the tastes of the "pruning" meta-heuristic are. (I. e., the size of the error e in the toy Q + e model above. Mediates the number of guesses among which it can pick the actual best one, assuming that they're drawn from a fixed distribution.)
- How long the high-quality CoTs are. (E. g., recall how much useless work/backtracking o1's publicly shown CoTs seem to do, and how much more efficient it'd be if the base LLM were smart enough to just instantly output the correct answer, on pure instinct.)
Improving on (1) and (3) increases the efficiency of the compute used by the models. Improving on (2) lets models productively use more compute.
And notably, capabilities could grow either from improving on (2), in a straightforward manner, or from improving (1) and (3). For example, suppose that there's a "taste overhang", in that o3's tastes are refined enough to reliably pick the best guess out of 10^9 guesses (drawn from a fixed distribution), but it is only economical to let it generate 10^5 guesses. Then improving on (1) (skewing the distribution towards the high-quality guesses) and (3) (making guesses cheaper) would not only reduce the costs, but also increase the quality of the ultimately-picked guesses.
(My intuition is that there's no taste overhang, though; and also that the tastes indeed get increasingly less refined the farther you move from the machine-verifiable domains.)
I'll admit I'm not very certain in the following claims, but here's my rough model:
- The AGI labs focus on downscaling the inference-time compute costs inasmuch as this makes their models useful for producing revenue streams or PR. They don't focus on it as much beyond that; it's a waste of their researchers' time. The amount of compute at OpenAI's internal disposal is well, well in excess of even o3's demands.
- This means an AGI lab improves the computational efficiency of a given model up to the point at which they could sell it/at which it looks impressive, then drop that pursuit. And making e. g. GPT-4 10x cheaper isn't a particularly interesting pursuit, so they don't focus on that.
- Most of the models of the past several years have only been announced near the point at which they were ready to be released as products. I. e.: past the point at which they've been made compute-efficient enough to be released.
- E. g., they've spent months post-training GPT-4, and we only hear about stuff like Sonnet 3.5.1 or Gemini Deep Research once it's already out.
- o3, uncharacteristically, is announced well in advance of its release. I'm getting the sense, in fact, that we might be seeing the raw bleeding edge of the current AI state-of-the-art for the first time in a while. Perhaps because OpenAI felt the need to urgently counter the "data wall" narratives.
- Which means that, unlike the previous AIs-as-products releases, o3 has undergone ~no compute-efficiency improvements, and there's a lot of low-hanging fruit there.
Or perhaps any part of this story is false. As I said, I haven't been keeping a close enough eye on this part of things to be confident in it. But it's my current weakly-held strong view.
Fair! Except I'm not arguing that you should take my other predictions at face value on the basis of my supposedly having been right that one time. Indeed, I wouldn't do that without just the sort of receipt you're asking for! (Which I don't have. Best I can do is a December 1, 2023 private message I sent to Zvi making correct predictions regarding what o1-3 could be expected to be, but I don't view these predictions as impressive and it notably lacks credences.)
I'm only countering your claim that no internally consistent version of me could have validly updated all the way here from November 2023. You're free to assume that the actual version of me is dissembling or confabulating.
Sure. But if you know the bias is 95/5 in favor of heads, and you see heads, you don't update very strongly.
And yes, I was approximately that confident that something-like-MCTS was going to work, that it'd demolish well-posed math problems, and that this is the direction OpenAI would go in (after weighing in the rumor's existence). The only question was the timing, and this is mostly within my expectations as well.
It is if you believe the rumor and can extrapolate its implications, which I did. Why would I need to wait to see the concrete demonstration that I'm sure would come, if I can instead update on the spot?
It wasn't hard to figure out what "something like an LLM with A*/MCTS stapled on top" would look like, or where it'd shine, or that OpenAI might be trying it and succeeding at it (given that everyone in the ML community had already been exploring this direction at the time).
~No update, priced it all in after the Q* rumors first surfaced in November 2023.
This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem
I don't think anchoring to o3's current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in-principle, buying this capability is probably going to get orders of magnitude cheaper within the next ~~five minutes~~ months, as they find various algorithmic shortcuts.
I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e. g. the $900k they'd save by running the benchmarks half a year later.
Edit: In fact, if, against these expectations, the implementation of o3's trick can't be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per FrontierMath problem and you can't get more efficient than one try per try), that would make me do a massive update against the "inference-time compute" paradigm.
Counterpoints: nuclear power, pharmaceuticals, bioengineering, urban development.
If we slack here, China will probably raise armies of robots with unlimited firepower and take over the world
Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy, as autocracies are wont to do, all the while remaining just as clueless regarding what's happening as their US counterparts banning AI to protect the jobs.
I don't really expect that to happen, but survival-without-dignity scenarios do seem salient.
I actually think doing the former is considerably more in line with the way things are done/closer to the Overton window.
I agree that this seems like an important factor. See also this post making a similar point.
I'm going to go against the flow here and not be easily impressed. I suppose it might just be copium.
Any actual reason to expect that the new model beating these challenging benchmarks, which have previously remained unconquered, is any more of a big deal than the last several times a new model beat a bunch of challenging benchmarks that have previously remained unconquered?
Don't get me wrong, I'm sure it's amazingly more capable in the domains in which it's amazingly more capable. But I see quite a lot of "AGI achieved" panicking/exhilaration in various discussions, and I wonder whether it's more justified this time than the last several times this pattern played out. Does anything indicate that this capability advancement is going to generalize in a meaningful way to real-world tasks and real-world autonomy, rather than remaining limited to the domain of extremely well-posed problems?
One of the reasons I'm skeptical is the part where it requires thousands of dollars' worth of inference-time compute. Implies it's doing brute force at extreme scale, which is a strategy that'd only work for, again, domains of well-posed problems with easily verifiable solutions. Similar to how o1 blows Sonnet 3.5.1 out of the water on math, but isn't much better outside that.
Edit: If we actually look at the benchmarks here:
- The most impressive-looking jump is FrontierMath from 2% to 25.2%, but it's also exactly the benchmark where the strategy of "generate 10k candidate solutions, hook them up to a theorem-verifier, see if one of them checks out, output it" would shine.
- (With the potential theorem-verifier having been internalized by o3 over the course of its training; I'm not saying there was a separate theorem-verifier manually wrapped around o3.)
- Significant progress on ARC-AGI has previously been achieved using "crude program enumeration", which made the authors conclude that "about half of the benchmark was not a strong signal towards AGI".
- The SWE jump from 48.9 to 71.7 is significant, but it's not much of a qualitative improvement.
Not to say it's a nothingburger, of course. But I'm not feeling the AGI here.
But yeah, personally, I think this is all a result of a kind of precious view about experiential continuity that I don't share
Yeah, I don't know that this glyphisation process would give us what we actually want.
"Consciousness" is a confused term. Taking on a more executable angle, we presumably value some specific kinds of systems/algorithms corresponding to conscious human minds. We especially value various additional features of these algorithms, such as specific personality traits, memories, et cetera. A system that has the features of a specific human being would presumably be valued extremely highly by that same human being. A system that has fewer of those features would be valued increasingly less (in lockstep with how unlike "you" it becomes), until it's only as valuable as e. g. a randomly chosen human/sentient being.
So if you need to mold yourself into a shape where some or all of the features which you use to define yourself are absent, each loss is still a loss, even if it happens continuously/gradually.
So from a global perspective, it's not much different from acausal aliens resurrecting Schelling-point Glyph Beings without you having warped yourself into a Glyph Being over time. If you value systems that are like Glyph Beings, their creation somewhere in another universe is still positive by your values. If you don't, if you only value human-like systems, then someone creating Glyph Beings brings you no joy. Whether you or your friends warped yourselves into Glyph Beings in the process doesn't matter.
A dog will change the weather dramatically, which will substantially affect your perceptions.
In this case, it's about alt-complexity again. Sure, a dog causes a specific weather-pattern change. But could this specific weather-pattern change have been caused only by this specific dog? Perhaps if we edit the universe to erase this dog, but add a cat and a bird five kilometers away, the chaotic weather dynamic would play out the same way? Then, from your perceptions' perspective, you wouldn't be able to distinguish between a dog timeline and a cat-and-bird timeline.
In some sense, this is common-sensical. The mapping from reality's low-level state to your perceptions is non-injective: the low-level state contains more information than you perceive on a moment-to-moment basis. Therefore, for any observation-state, there are several low-level states consistent with it. Scaling up: for any observed lifetime, there are several low-level histories consistent with it.
Sure. This setup couldn't really be exploited for optimizing the universe. If we assume that the self-selection assumption is a reasonable assumption to make, inducing amnesia doesn't actually improve outcomes across possible worlds. One out of 100 prisoners still dies.
It can't even be considered "re-rolling the dice" on whether the specific prisoner that you are dies. Under the SSA, there's no such thing as a "specific prisoner", "you" are implemented as all 100 prisoners simultaneously, and so regardless of whether you choose to erase your memory or not, 1/100 of your measure is still destroyed. Without SSA, on the other hand, if we consider each prisoner's perspective to be distinct, erasing memory indeed does nothing: it doesn't return your perspective to the common pool of prisoner-perspectives, so if "you" were going to get shot, "you" are still going to get shot.
I'm not super interested in that part, though. What I'm interested in is whether there are in fact 100 clones of me: whether, under the SSA, "microscopically different" prisoners could be meaningfully considered a single "high-level" prisoner.
Agreed. I think a type of "stop AGI research" argument that's under-deployed is that there's no process or actor in the world that society would trust with unilateral godlike power. At large, people don't trust their own governments, don't trust foreign governments, don't trust international organizations, and don't trust corporations or their CEOs. Therefore, preventing anyone from building ASI anywhere is the only thing we can all agree on.
I expect this would be much more effective messaging with some demographics, compared to even very down-to-earth arguments about loss of control. For one, it doesn't need to dismiss the very legitimate fear that the AGI would be aligned to values that a given person would consider monstrous. (Unlike "stop thinking about it, we can't align it to any values!".)
And it is, of course, true.
That's probably not what Page meant. On consideration, he would probably have clarified that AI that includes what we value about humanity would be a worthy successor. He probably wasn't even clear on his own philosophy at the time.
I don't see reasons to be so confident in this optimism. If I recall correctly, Robin Hanson explicitly believes that putting any constraints on future forms of life, including on its values, is undesirable/bad/regressive, even though lack of such constraints would eventually lead to a future with no trace of humanity left. Similar for Beef Jezos and other hardcore e/acc: they believe that a worthy future involves making a number go up, a number that corresponds to some abstract quantity like "entropy" or "complexity of life" or something, and that if making it go up involves humanity going extinct, too bad for humanity.
Which is to say: there are existence proofs that people with such beliefs can exist, and can retain these beliefs across many years and in the face of what's currently happening.
I can readily believe that Larry Page is also like this.
Also this:
From Altman: [...] Admitted that he lost a lot of trust with Greg and Ilya through this process. Felt their messaging was inconsistent and felt childish at times. [...] Sam was bothered by how much Greg and Ilya keep the whole team in the loop with happenings as the process unfolded. Felt like it distracted the team.
Apparently airing such concerns is "childish" and should only be done behind closed doors, otherwise it "distracts the team", hm.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out, such that that part of the solution doesn't tell you as much about the other parts of the solution.
I agree with that.
I'd think you can define a tetrahedron for non-Euclidean space
If you relax the definition of a tetrahedron to cover figures embedded in non-Euclidean spaces, sure. It wouldn't be the exact same concept, however. In a similar way to how "a number" is different if you define it as a natural number vs. real number.
Perhaps more intuitively, then: the notion of a geometric figure with specific properties is dependent on the notion of a space in which it is embedded. (You can relax it further – e. g., arguably, you can define a "tetrahedron" for any set with a distance function over it – but the general point stands, I think.)
Just consider if you take the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that e.g. a self-modifying agent would never change itself such that it would expect that in the future it would get less utility than if it would not self-modify.
Yes, but: those constraints are precisely the principles you'd need to code into your AI to give it general-intelligence capabilities. If your notion of alignment only needs to be robust to certain classes of changes, because you've figured out that an efficient generally intelligent system would only change in such-and-such ways, then you've figured out a property of how generally intelligent systems ought to work – and therefore, something about how to implement one.
Speaking abstractly, the "negative image" of the theory of alignment is precisely the theory of generally intelligent embedded agents. A robust alignment scheme would likely be trivial to transform into an AGI recipe.
I am pretty sure you can figure out alignment in advance as you suggest
I'm not so sure about that. How do you figure out how to robustly keep a generally intelligent dynamically updating system on-target without having a solid model of how that system is going to change in response to its environment? Which, in turn, would require a model of what that system is?
I expect the formal definition of "alignment" to be directly dependent on the formal framework of intelligence and embedded agency, the same way a tetrahedron could only be formally defined within the context of Euclidean space.