What o3 Becomes by 2028
post by Vladimir_Nesov · 2024-12-22T12:37:20.929Z · LW · GW · 7 comments
Funding for $150bn training systems just turned less speculative, with OpenAI o3 reaching 25% on FrontierMath, 70% on SWE-bench Verified, 2700 on Codeforces, and 80% on ARC-AGI. These systems will be built in 2026-2027 and enable pretraining models for 5e28 FLOPs, while o3 itself is plausibly based on an LLM pretrained for only 8e25-4e26 FLOPs. The natural text data wall won't seriously interfere until 6e27 FLOPs, and it might be possible to push it to 5e28 FLOPs. Scaling of pretraining won't end just yet.
Reign of GPT-4
Since the release of GPT-4 in March 2023, there has subjectively been no qualitative change in frontier capabilities. In 2024, everyone in the running merely caught up. To the extent this is true, the reason might be that the original GPT-4 was probably a 2e25 FLOPs MoE model trained on 20K A100. If you don't already have a cluster that big, and experience training MoE models at that scale, no amount of money lets you immediately match the feat.
We now know that 16K H100 and more than a year are sufficient to do that with a 4e25 FLOPs dense model. Until 2024, there probably wasn't a training cluster larger than about 30K H100, which enables pretraining for 1e26 FLOPs in BF16 at 40% compute utilization when running for 4 months. That's the range of LLMs we've seen deployed in 2023-2024, between 2e25 and 1e26 FLOPs, likely at most 5x up from the original GPT-4.
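As a sanity check, the cluster-compute arithmetic here can be reproduced directly (a rough sketch, assuming ~1e15 dense BF16 FLOP/s per H100 and 30-day months):

```python
def pretraining_flops(n_gpus, flops_per_gpu=1e15, utilization=0.4, months=4):
    """Total compute of a pretraining run: peak FLOP/s times utilization times wall time."""
    seconds = months * 30 * 24 * 3600
    return n_gpus * flops_per_gpu * utilization * seconds

# 30K H100 at 40% utilization for 4 months: ~1.2e26 FLOPs
print(f"{pretraining_flops(30_000):.1e}")
```

The same function gives ~4e26 FLOPs for a 100K H100 system, matching the figure in the next section.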
Engines of Scaling
In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft's 3 buildings in Goodyear, Arizona, xAI's Memphis cluster, Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.
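The equivalence between chip counts follows from per-chip throughput. A sketch, using ~1e15 dense BF16 FLOP/s per H100 and ~0.65e15 per Trn2 (the figures quoted in footnote 1):

```python
H100_FLOPS = 1e15       # dense BF16 FLOP/s, approximate
TRN2_FLOPS = 0.65e15

def h100_equivalents(n_chips, chip_flops):
    """Cluster compute expressed in H100 units."""
    return n_chips * chip_flops / H100_FLOPS

# 400K Trn2 comes out to ~260K H100 worth of compute,
# consistent with "about as much compute as 250K H100"
print(h100_equivalents(400_000, TRN2_FLOPS))
```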
Anthropic might need more time than the other players to get its new hardware running, but Trn2 and TPUv6e also have an advantage over H100: larger scale-up domains, which enable more tensor parallelism and smaller minibatch sizes. This might be an issue when training on H100 at this scale[1], and could explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
We haven't yet seen AIs made from compute optimal LLMs pretrained on these systems, but the systems have been around for 6+ months, so such AIs should start getting deployed imminently, and will become ubiquitous in 2025. This is a change from 4e25-1e26 FLOPs to 2e26-6e26 FLOPs, up to 30x the original GPT-4. More might follow later in 2025 from the 400K Trn2 cluster, the possibility that Google connects more than one 100K TPUv6e cluster into a single training system, and xAI's upcoming doubling of its 100K H100 cluster.
Two More Turns of the Crank
Funding, power, and data are constraints that might plausibly slow things down at some point. But when specifically does that become a possibility? Hyperscalers spend about $30-50bn a year on CAPEX (building things like datacenters around the world), so in 2024, directing $5-10bn of that toward clusters useful for frontier model training is not yet painful.
In 2025, Microsoft might be building a 300K B200 cluster, possibly part of a geographically distributed training system of 500K-700K B200. A cluster of 100K B200 needs about 200 MW, so the whole thing would need 600-1400 MW. Google is doing something similar, with sites under construction and in close proximity adding up to about 1 GW by the end of 2025. There are two such 1 GW collections of sites, one in Ohio and another in Iowa/Nebraska.
A training system built in parts across multiple sites makes it easier to quickly secure enough power, and an inter-datacenter network with bandwidth of about 300 Tbps[2] might be sufficient for a 1 GW training system, which is practical even for oceanfloor cables. Overland cables enable more than that, as long as they can actually get in place quickly.
At this pace, it looks like the next step of scaling would need 5 GW and take another 18-24 months, with systems ready for training by 2028, maybe late 2027. The 1 GW systems will already cost $30-50bn, on the order of annual CAPEX. For datacenters planned to last 6 years, a better anchor might be a fraction of CAPEX over 6 years, which is $200-300bn. But even then $100-200bn for a 5 GW training system seems too much to allocate without a stronger case.
The impact of OpenAI o3 on timelines might be primarily in its successors making the case for building these 5 GW training systems. In 2026, even if the 5 GW training systems are still not given a go-ahead, the 1 GW training systems built in 2025 will start producing 5e27 FLOPs models (250x original GPT-4). The case made by the successors of o3 will be strengthened by their scale, and only if that still fails could a scaling slowdown before 2028 become a possibility.
Peak Data
The largest datasets with disclosed size used in training LLMs are 15T and 18T tokens. The FineWeb dataset is 15T tokens; RedPajama2 is 30T tokens. A 4e25-2e26 FLOPs compute optimal model doesn't need more data than that; it needs better selection of data. As the scale changes, the goals become different. The DCLM paper details which data gets thrown out, starting with DCLM-Pool, a raw 240T token Common Crawl dataset (see Figure 4). I would guess at least 50T tokens are not completely useless, and there are many tokens outside Common Crawl.
Starting with 50T tokens, it's possible to repeat them in training, 5 times with little penalty and 15 times in a still-useful way (see Figure 4). The paper systematically studies things like how perplexity for a model trained on 5X tokens differs from that for a model trained on X tokens repeated 5 times, with up to 1e22 FLOPs per datapoint.
With 50T tokens repeated 5 times, and a 60 tokens/parameter[3] estimate for a compute optimal dense transformer, we have enough data for 6e27 FLOPs of compute, the scale of 1 GW training systems. Sacrificing some quality and repeating data 15 times, we could get 5e28 FLOPs. Though at that point, feeding the model video or scanned documents might become more valuable in making it smarter.
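These thresholds follow from C ≈ 6·N·D for a dense transformer, with the available data D setting the compute-optimal parameter count N via the tokens/parameter ratio. A sketch (the 60 tokens/parameter figure is the footnote 3 estimate):

```python
def data_limited_compute(unique_tokens, repeats, tokens_per_param=60):
    """Largest compute-optimal dense transformer run the data supports, via C = 6*N*D."""
    data = unique_tokens * repeats       # total training tokens D
    params = data / tokens_per_param     # compute-optimal N given D
    return 6 * params * data

print(f"{data_limited_compute(50e12, 5):.1e}")   # ~6e27, the 1 GW scale
print(f"{data_limited_compute(50e12, 15):.1e}")  # ~5.6e28
```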
Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it's trivial to make it so, so they are probably already too large. They can't be made smaller, because only tensor parallelism divides the size of minibatches, and it's only feasible to apply within scale-up domains, smaller collections of accelerators connected with a highly performant network. For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's 16 or 64 chips, in the standard and Ultra variants respectively. Each Trn2 produces 0.65e15 BF16 FLOP/s, compared to 1e15 FLOP/s for an H100, so a Trn2 scale-up domain produces the compute of 10-40 H100s, dividing the minibatch size by 1.3x to 5x compared to an H100 cluster. ↩︎
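The 1.3x-5x division factor at the end of the footnote above is the ratio of scale-up-domain compute to that of an 8-GPU H100 domain; a sketch:

```python
H100_FLOPS, TRN2_FLOPS = 1e15, 0.65e15  # dense BF16 FLOP/s, approximate

def minibatch_division(domain_size, chip_flops):
    """How much smaller minibatches can be, relative to an 8-GPU H100 domain:
    proportional to the scale-up domain's total compute."""
    return domain_size * chip_flops / (8 * H100_FLOPS)

print(minibatch_division(16, TRN2_FLOPS))  # standard Trn2: ~1.3x
print(minibatch_division(64, TRN2_FLOPS))  # Trn2 Ultra: ~5.2x
```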
For Llama 3 405B, about 6 seconds passed between optimizer steps, and there might be about 1.6 TB of gradients to communicate between parts of a hypothetical geographically distributed training system, which should happen much faster, say in 1 second. A 500K B200 system offers 80 times more compute than Llama 3's 16K H100, so by Chinchilla scaling the model might be 9 times larger. A scale-up domain in an NVL72 GB200 setup is 72 B200s or 180 H100s worth of compute, 22 times more than for an H100 cluster. So even with a model that's 9 times larger, tensor parallelism will allow processing a minibatch 2.5 times faster. Communicating 15 TB of gradients in 0.4 seconds takes 300 Tbps of bandwidth. If trained for 4x longer, both compute optimal model size and minibatch processing time would increase 2x, so the necessary bandwidth stays the same. ↩︎
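The bandwidth estimate is a one-line calculation (a sketch: gradients in bytes, converted to bits, divided by the time budget):

```python
def required_tbps(gradient_bytes, seconds):
    """Inter-site bandwidth needed to exchange gradients within the time budget."""
    return gradient_bytes * 8 / seconds / 1e12  # bytes -> bits, then Tbps

print(required_tbps(15e12, 0.4))   # 15 TB in 0.4 s: ~300 Tbps
print(required_tbps(1.6e12, 1.0))  # Llama 3 405B-sized gradients in 1 s: ~13 Tbps
```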
There is a well-known estimate from the Chinchilla paper of 20 tokens/parameter being compute optimal, but subsequent studies show that this ratio can vary significantly. It also changes with scale, slowly but not insignificantly. The Llama 3 paper estimates a ratio of 40 tokens/parameter at 4e25 FLOPs, increasing by 15% with every 10x of compute, using experiments of up to 1e22 FLOPs per datapoint (see Figure 3). This in particular predicts 30 tokens/parameter at Chinchilla's 6e23 FLOPs, 55 tokens/parameter at 6e27 FLOPs, and 60 tokens/parameter at 5e28 FLOPs. Extrapolating from 1e22 FLOPs to 5e28 FLOPs and using the best 15T tokens with no repetition means wide error bars. ↩︎
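The extrapolated ratios can be reproduced with a simple power law in decades of compute (a sketch of the fit as described, not the Llama 3 paper's exact functional form; values land within rounding of the figures quoted above):

```python
import math

def tokens_per_param(compute, anchor=4e25, anchor_ratio=40, per_decade=1.15):
    """Tokens/parameter ratio growing 15% per 10x of compute past the anchor point."""
    decades = math.log10(compute / anchor)
    return anchor_ratio * per_decade ** decades

for c in (6e23, 4e25, 6e27, 5e28):
    print(f"{c:.0e}: {tokens_per_param(c):.0f} tokens/param")
```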
7 comments
Comments sorted by top scores.
comment by lunatic_at_large · 2024-12-22T17:43:38.211Z · LW(p) · GW(p)
I guess I'm a bit confused where o3 comes into this analysis. This discussion appears to be focused on base models to me? Is data really the bottleneck these days for o-series-type advancements? I thought that compute available to do RL in self-play / CoT / long-time-horizon-agentic-setups would be a bigger consideration.
Edit: I guess upon another reading this article seems like an argument against AI capabilities hitting a plateau in the coming years, whereas the o3 announcement makes me more curious about whether we're going to hyper-accelerate capabilities in the coming months.
↑ comment by Vladimir_Nesov · 2024-12-22T20:59:53.875Z · LW(p) · GW(p)
My thesis is that the o3 announcement is timelines-relevant in a strange way. The causation goes from o3 to the impressiveness or utility of its successors trained on 1 GW training systems, then to decisions to build 5 GW training systems, and it's those 5 GW training systems that have a proximate effect on timelines (in comparison to the world only having 1 GW training systems for a few years). The argument goes through even if o3 and its successors don't particularly move timelines directly through their capabilities; they can remain a successful normal technology.
The funding constraint stopping $150bn training systems previously seemed more plausible, but with o3 it might be lifted. This is timelines-relevant precisely because there aren't any other constraints that come into play before that point.
comment by Logan Riggs (elriggs) · 2024-12-22T15:19:27.882Z · LW(p) · GW(p)
I really appreciate your post and all the links! This and your other recent posts/comments have really helped make a clearer model of timelines.
comment by avturchin · 2024-12-22T16:21:08.047Z · LW(p) · GW(p)
With 50T tokens repeated 5 times, and a 60 tokens/parameter[3] estimate for a compute optimal dense transformer,
Does it mean that the optimal size of the model will be around 4.17Tb?
↑ comment by Vladimir_Nesov · 2024-12-22T17:03:18.973Z · LW(p) · GW(p)
About 4T parameters, which is 8 TB in BF16. With about 100x more compute (compared to Llama 3 405B), we get a 10x larger model by Chinchilla scaling; the correction from a higher tokens/parameter ratio is relatively small (and in this case it cancels out the extra 1.5x, compute actually being 150x).
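A sketch of that arithmetic (50T tokens repeated 5 times, 60 tokens/parameter, 2 bytes per parameter in BF16):

```python
def model_size(unique_tokens=50e12, repeats=5, tokens_per_param=60):
    """Compute-optimal parameter count implied by the data, and its BF16 footprint."""
    params = unique_tokens * repeats / tokens_per_param
    return params, params * 2  # BF16 stores 2 bytes per parameter

params, size_bytes = model_size()
print(f"{params:.1e} params, {size_bytes / 1e12:.0f} TB in BF16")  # ~4.2e12 params, ~8 TB
```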
Not completely sure if BF16 remains sufficient at 6e27-5e28 FLOPs, as these models will have more layers and larger sums in matrix multiplications. If BF16 doesn't work, the same clusters will offer less compute (at a higher precision). Seems unlikely though, as 3 OOMs of compute only increase model size 30x, which means 3x more layers and 3x larger matrices (in linear size), which is not that much. There are block number formats like microscaling that might help if this is somehow a problem, but usability of this remains unclear, as everyone is still training in BF16 in practice.
In the other direction, there is a Nov 2024 paper that suggests 7-8 bit precision might be compute optimal at any scale, that the proper way to adapt to scale is by increasing the number of parameters rather than increasing precision (Section 4.3.2). If this can be made practical at a given scale, there'll be 2x more compute, and even more in effective compute, which is essentially the paper's claim. (I don't know how this interacts with scarce data, possibly either higher or lower precision can improve the situation.)
comment by 152334H (152334h) · 2024-12-23T11:11:20.187Z · LW(p) · GW(p)
[minor technical disputes below; ignore if disinterested]
This might be an issue when training on H100 at this scale[1], and could explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it's trivial to make it so, so they are probably already too large.
I'm a bit confused by this part. I believe the l3 paper indicates the training seqlen was increased mid-training.
In general, I don't understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.
For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's either 16 or 64 GPUs in either standard or Ultra variants.
I think it's plausible the combination of torus topology + poor PCIe 5.0 bandwidth/latency will make a full TP=64 Trn2 config underperform your expectations, but we may have to wait for SemiAnalysis to provide good numbers on this.
comment by anaguma · 2024-12-22T22:09:20.130Z · LW(p) · GW(p)
In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft's 3 buildings in Goodyear, Arizona, xAI's Memphis cluster, Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.
Anthropic might need more time than the other players to get its new hardware running, but Trn2 and TPUv6e also have an advantage over H100: larger scale-up domains, which enable more tensor parallelism and smaller minibatch sizes. This might be an issue when training on H100 at this scale[1], and could explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
Do we know much about TPU and Trn2 performance at lower precision? I expect most training runs are using 4-8 bit precision by this point.