A brief history of things that have defined my timelines to AGI since learning about AI safety <2 years ago
- Bio anchors gave me a rough ceiling around 1e40 FLOP for how much compute will easily make AGI.
- Fun with +12 OOMs of Compute brought that same 'training-compute-FLOP needed for AGI' down a bunch to around 1e35 FLOP.
- Researching how much compute is scaling in the near future.
At this point I think my distribution was pretty concentrated across ~1e27 - 1e33 FLOP, with a very long tail, and something like a 2030-2040 50% CI.
- The benchmarks+gaps argument to partial AI research automation.
- The takeoff forecast for how partial AI research automation will translate to algorithmic progress.
- The trend in METR's time horizon data.
At this point my middle 50% CI is like 2027 - 2035, and would be tighter if not for a long tail that I keep around just because I think there's still a bunch of uncertainty. Though I do wish I had more arguments in place to justify the tail or make it bigger, ones that compete with the ones above in how compelling they feel to me.
Thanks for linking. I skimmed the early part of this post because you labelled it explicitly as viewpoints. Then I see that you engaged with a bunch of arguments about short timelines, but they are all pretty weak/old ones that I never found very convincing (the only exception is that bio anchors gave me an early ceiling around 1e40 FLOP for the compute needed to make AGI). Then you got to LLMs and acknowledged:
- The existence of today's LLMs is scary and should somewhat shorten people's expectations about when AGI comes.
But then gave a bunch of points about the things LLMs are missing and suck at, which I already agree with.
Aside: Have I mischaracterized so far? Please let me know if so.
So, do you think you have arguments against the 'benchmarks+gaps argument' for timelines to AI research automation, or arguments for why AI research automation won't translate to much algorithmic progress? Or any of the other things that I listed as ones that moved my timelines down:
- Fun with +12 OOMs of Compute: IMO a pretty compelling writeup that brought my 'timelines to AGI uncertainty-over-training-compute-FLOP' down a bunch to around 1e35 FLOP
- Researching how much compute is scaling.
- The benchmarks+gaps argument to partial AI research automation
- The takeoff forecast for how partial AI research automation will translate to algorithmic progress.
- The recent trend in METR's time horizon data.
But instead of discussing the crux of which system is relevant (which has to be about details of recursive self-improvement), only the proponents will tend to talk about software-only singularity, while the opponents will talk about different systems whose scaling they see as more relevant, such as the human economy or datacenter economy.
Totally agree! Thank you for phrasing it elegantly. This is basically what I commented on Ege's post yesterday: I asked him to engage with the actual crux and make arguments about why the software-only singularity is unlikely.
There's an entire class of problem within ML that I would see as framing problems and the one thing I think LLMs don't help that much with is framing.
Could you say more about this? What do you mean by framing in this context?
There's this quote I've been seeing from Situational Awareness that all you have to do is "believe in a straight line on a curve" and when I hear that and see the general trend extrapolations my spider senses start tingling.
Yeah, that's not really compelling to me either. SitA didn't move my timelines. Curious if you have engaged with the benchmarks+gaps argument for AI R&D automation (timelines forecast), and then the algorithmic progress that automation would drive (takeoff forecast). These are the things that actually moved my view.
Thanks for the link, that's compelling.
By far the most top-voted answer was one arguing that we might get AI to substantially accelerate AI progress because a particular AI research engineering benchmark looks like it will get saturated within a couple of years.
The list of things I see as concrete arguments that have moved my timelines down includes exactly this!
But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability.
...and I think this characterization is importantly false! This timelines forecast does not assume that. It breaks things down into gaps between benchmarks and real-world capability and tries to forecast how long it will take to cross each.
The benchmarks also do not take into account the fact that the vast majority of them measure a model's performance in a situation where the model is only given one task at a time, and it can completely focus on solving that...
Agree that there are many such 'gaps'! Would be curious to hear if you think there are important ones missing from the timelines forecast or if you have strong views that some of them will be importantly harder!
Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don't meet the spirit of it.
Thanks for this! This is a good point. Do you think you can go further and say why you think it will be very hard to fix in the near term, so much so that models won't be useful for AI research?
I agree re Leopold's piece, it didn't move my timelines.
A brief history of the things that have most collapsed my timelines down since becoming aware of AI safety <2 years ago:
- Fun with +12 OOMs of Compute: IMO a pretty compelling writeup that brought my 'timelines to AGI uncertainty-over-training-compute-FLOP' down a bunch
- Generally working on AI 2027, which has included:
- Writing and reading the capabilities progression where each step seems plausible.
- Researching how much compute is scaling.
- Thinking about how naive and limiting current algorithms and architectures seem, and what changes they are plausibly going to be able to implement soon.
- The detailed benchmarks+gaps argument in the timelines forecast.
- The recent trend in METR's time horizon data.
Maybe I'm in an echo chamber or have just had my head in the sand while working on AI 2027, but now that I've been paying attention to AI safety for almost 2 years and seen my timelines gradually collapse, I really want to engage with compelling arguments that might lengthen my timelines again.
I feel like there are a bunch of viewpoints expressed about long timelines/slow takeoff but a lack of arguments. This is me reaching out in the hope that people might point me to the best existing write-ups or maybe make new ones!
I am tracking things like: "takeoff will be slow because of experiment compute bottlenecks," or "timelines to AIs with good research taste are very long," or even more general "look how bad AI is at all this (not-super-relevant-to-a-software-only-singularity-)stuff that is so easy for humans!" but in my opinion, these are just viewpoints (which, by the way, often seem to get stated very confidently in a way that makes me not trust the epistemology behind them). So sadly these statements don't tend to lengthen my timelines.
In my view, these viewpoints would become arguments if they were more like (excuse the spitballing):
- "1e28 FLOPs of experiment compute is unlikely to produce much algorithmic progress + give a breakdown of why a compelling allocation of 1e28 FLOP doesn't get very far"
- "Research taste is in a different reference class to the things AI has been making progress on recently + compelling reasoning, like, maybe:
- 'it has O(X) more degrees of freedom,'
- 'it has way less existing data and/or it's way harder to create data, or give a reward signal'
- 'things are looking grim for the likelihood of generalization to these kinds of skills'
- "there are XYZ properties intelligence needed that can't be simulated by current hardware paradigms"
Currently I feel like I have a heavy tail on my timelines and takeoff speeds as a placeholder in lieu of arguments like this, which I'm hoping exist.
Thanks for the detailed comments! We really appreciate it. Regarding revenue, here are some thoughts:
"AI 2027" forecasts that OpenAI's revenue will reach 140 billion in 2027. This considerably exceeds even OpenAI's own forecast, which surpasses 125 billion in revenue in 2029. I believe that the AI 2027 forecast is implausible.[4]
AI 2027 is not a median forecast but a modal forecast, i.e. a plausible story for the faster side of the capability progression expected by the team. If you condition on the capability progression in the scenario, I actually think $140B in 2027 is potentially on the conservative side. My favourite part of the FutureSearch report is the examples from the ~$100B/year reference class, e.g., 'Microsoft’s Productivity and Business Process segment.' If you take the AI's agentic capabilities and reliability from the scenario seriously, I think it's intuitively easy to imagine how a similar-scale business booms relatively quickly, and I'm glad that FutureSearch was able to give a breakdown as an example of how that could look.
So maybe I should just ask whether you are conditioning on the capabilities progression or not in this disagreement? Do you think $140B in 2027 is implausible even if you condition on the AI 2027 capability progression?
If you just think $140B in 2027 is not a good unconditional median forecast all things considered, then I think we all agree!
Note: "AI 2027" chooses to call the leading lab "OpenBrain", but FutureSearch is explicit that they're talking about OpenAI.
We aren't forecasting OpenAI revenue but OpenBrain revenue, which is different because it's ~MAX(OpenAI, Anthropic, GDM (AI-only), xAI, etc.).[1] In some places FutureSearch indeed seems to have given the 'plausible $100B ARR breakdown' under the assumption that OpenAI is the leading company in 2027, but that doesn't mean the two are supposed to be equal, either in their own revenue forecast or in any of the AI 2027 work.
- FutureSearch's estimate of paid subscribers for April 2025 was 27 million; the actual figure is 20 million. They justify high expected consumer growth with reference to the month-on-month increase in unpaid users from December 2024 -> February 2025. Data from Semrush replicates that increase, but also shows that traffic has since declined rather than continuing to increase.
The exact breakdown FutureSearch use seems relatively unimportant compared to the high-level argument that the headline numbers, (1) $/month and (2) number of subscribers, very plausibly reach the $100B ARR range, given the expected quality of agents that they will be able to offer.
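To make the headline arithmetic explicit, here's a tiny sketch of the (price per month) x (number of subscribers) combinations that reach ~$100B ARR; the price points are hypothetical tiers I made up for illustration, not FutureSearch's figures.

```python
# Subscribers needed to reach ~$100B ARR at various hypothetical price points.
TARGET_ARR = 100e9                           # dollars per year
hypothetical_prices = [20, 200, 1000, 2000]  # $/month, assumed tiers

for price in hypothetical_prices:
    subscribers_needed = TARGET_ARR / (price * 12)
    print(f"${price}/month -> ~{subscribers_needed / 1e6:.1f}M subscribers")
```

At agent-tier prices the required subscriber counts look comparable to existing large enterprise software businesses, which is roughly the reference-class point above.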
- Looking at market size estimates, FutureSearch seems to implicitly assume that OpenAI will achieve a near-monopoly on Agents, the same way they have for Consumer subscriptions. Enterprise sales are significantly different from consumer signups, and OpenAI doesn't currently have a significant technical advantage.
I don't think a monopoly is necessary; there's a significant OpenBrain lead time in the scenario, and I think it seems plausible that OpenBrain would convert that into a significant market share.
[1] Not exactly equal, since maybe the leading company in AI capabilities (measured by AI R&D prog. multiplier), i.e., OpenBrain, is not the one making the most revenue.
Also see FutureSearch's report on a plausible breakdown for how OpenBrain hits $100B ARR by mid-2027.
I think if you condition on the capability progression in the scenario and look at existing subscription services generating in the $100B range, it feels very plausible intuitively, independently from the 'tripling' extrapolation.
Seconding Daniel, thanks for the comment! I decided to adjust the early numbers down to be below the human professional range until Dec 2025[1] due to agreeing with the considerations you raised about longer-horizon tasks, which should be included in how these ranges are defined.
[1] Note that these are based on internal capabilities, so that translates to the best public models reaching the low human range in early-mid 2026.
Thanks for the comment Vladimir!
[...] for the total of 2x in performance.
I never got around to updating based on the GTC 2025 announcement, but I do have the Blackwell-to-Rubin efficiency gain down as ~2.0x adjusted by die size, so it looks like we are in agreement there (though I attributed it a little differently based on the information I could find at the time).
So the first models will start being trained on Rubin no earlier than late 2026, much more likely only in 2027 [...]
Agreed! I have them coming into use in early 2027 in this chart.
This predicts at most 1e28 BF16 FLOPs (2e28 FP8 FLOPs) models in late 2026
Agreed! As you noted, we have the early version of Agent-2 at 1e28 FP16 FLOP in late 2026.
Rubin Ultra is another big step ~1 year after Rubin, with 2x more compute dies per chip and 2x more chips per rack, so it's a reason to plan pacing the scaling a bit rather than rushing it in 2026-2027. Such plans will make rushing it more difficult if there is suddenly a reason to do so, and 4 GW with non-Ultra Rubin seems a bit sudden.
Agree! I wrote this before knowing about the Rubin Ultra roadmap, but this part of the forecast starts to be affected somewhat by the intelligence explosion, specifically an urgent demand for research experiment compute and for inference-specialised chips to run automated researchers.
Good point, thanks. Previously I would have pretty confidently read "100K GB200 GPUs" or "100K GB200 cluster" as 200K B200s (~= 500K H100s), but I can see how it's easily ambiguous. Now that I think of it, I remembered this Tom's Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...
That's indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that 'GB200' mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a 'NVL__'). Are there counterexamples to this? I scanned the links you mentioned and only saw 'GB200 NVL2,' 'GB200 NVL4,' 'GB200 NVL72' respectively.
I was operating on this pretty confidently, but I'm unsure where else I saw it described (apart from the column I linked above). On a quick search of 'GB200 vs B200', the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: the second link also says: "the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU..."
I think 'GB200' refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low.
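Spelling out the conversion as a quick sketch (assuming 1 GB200 superchip = 2 B200 GPUs + 1 Grace CPU, and the rough 1 B200 ~= 2.5 H100s performance factor used in this thread):

```python
# Convert a GB200 count into B200- and H100-equivalents.
# Assumptions: 1 GB200 superchip = 2 B200 GPUs + 1 Grace CPU,
# and 1 B200 ~= 2.5 H100s (rough performance factor, as used in this thread).
B200_PER_GB200 = 2
H100_PER_B200 = 2.5  # assumed

def gb200_to_equivalents(n_gb200: int) -> tuple[int, float]:
    n_b200 = n_gb200 * B200_PER_GB200
    n_h100_equiv = n_b200 * H100_PER_B200
    return n_b200, n_h100_equiv

print(gb200_to_equivalents(16_000))   # (32000, 80000.0), matching the numbers above
print(gb200_to_equivalents(200_000))  # (400000, 1000000.0), i.e. ~1M H100s
```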
My guess is that Bloomberg's phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I'd be very surprised if OpenAI don't have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1/4 of what Microsoft alone plan to invest this year.
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]
[1] There's a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that's 5e26 FLOP/month.
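As a minimal sketch of this footnote's arithmetic (the ~1e21 FLOP/month/H100 figure follows from assuming roughly 1e15 FLOP/s dense 16-bit peak per H100 at 40% utilisation, and 1 GB200 ~= 5 H100-equivalents):

```python
# Back-of-envelope: training FLOP from 100K GB200s over a 4-month run.
# Assumptions: ~1e15 FLOP/s dense 16-bit peak per H100, 40% utilisation,
# and 1 GB200 ~= 5 H100-equivalents (2 B200s x ~2.5 H100s each).
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12   # ~2.63e6 s
H100_PEAK_FLOPS = 1e15                        # assumed
UTILISATION = 0.40

flop_per_month_per_h100 = H100_PEAK_FLOPS * UTILISATION * SECONDS_PER_MONTH
print(f"{flop_per_month_per_h100:.1e} FLOP/month per H100")    # ~1.1e21

n_h100_equivalents = 100_000 * 5              # 100K GB200s ~= 500K H100s
total_flop = n_h100_equivalents * flop_per_month_per_h100 * 4  # 4 months
print(f"{total_flop:.1e} FLOP")               # ~2e27, matching the claim above
```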
Thanks Vladimir, this is really interesting!
Re: OpenAI's compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would've guessed they'd be on track to be using around 400k H100s by now.
So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible?
The co-location of the Trainium2 cluster might give Anthropic a short-term advantage, though I think it's actually quite unclear whether their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively.
[1] $6e9 / 365.25d / 24h / $2.5/hr ≈ 274k
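And the footnote's calculation as code, for reference (the $6B compute spend and the $2.50/hr rental price are the estimates from the comment above, not measured figures):

```python
# Implied average H100 count from an estimated annual compute spend.
ANNUAL_COMPUTE_SPEND = 6e9      # dollars, inferred from the NYT article's cost figures
H100_RENTAL_PRICE = 2.50        # dollars per H100-hour, assumed average
HOURS_PER_YEAR = 365.25 * 24

implied_h100s = ANNUAL_COMPUTE_SPEND / (HOURS_PER_YEAR * H100_RENTAL_PRICE)
print(f"~{implied_h100s / 1e3:.0f}k H100s running on average")  # ~274k
```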