What o3 Becomes by 2028
post by Vladimir_Nesov · 2024-12-22T12:37:20.929Z
Funding for $150bn training systems just turned less speculative, with OpenAI o3 reaching 25% on FrontierMath, 70% on SWE-bench Verified, 2700 on Codeforces, and 80% on ARC-AGI. These systems will be built in 2026-2027 and enable pretraining models for 5e28 FLOPs, while o3 itself is plausibly based on an LLM pretrained only for 8e25-4e26 FLOPs. The natural text data wall won't seriously interfere until 6e27 FLOPs, and it might be possible to push it out to 5e28 FLOPs. Scaling of pretraining won't end just yet.
Reign of GPT-4
Since the release of GPT-4 in March 2023, subjectively there was no qualitative change in frontier capabilities. In 2024, everyone in the running merely caught up. To the extent this is true, the reason might be that the original GPT-4 was probably a 2e25 FLOPs MoE model trained on 20K A100. And if you don't already have a cluster this big, and experience in training MoE models at that scale, no amount of money can let you immediately match this feat.
We now know that 16K H100 and more than a year are sufficient to do that with a 4e25 FLOPs dense model. Until 2024, there probably wasn't a training cluster larger than about 30K H100, which enables pretraining for 1e26 FLOPs in BF16 at 40% compute utilization when running for 4 months. That's the range of LLMs we've seen deployed in 2023-2024, between 2e25 and 1e26 FLOPs, likely at most 5x up from original GPT-4.
Engines of Scaling
In 2024, there were multiple sightings of training systems at the scale of 100K H100: Microsoft's 3 buildings in Goodyear, Arizona; xAI's Memphis cluster; and Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
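The 1e26 and 4e26 FLOPs estimates for 30K and 100K H100 can be reproduced with a short sketch. It assumes the 1e15 dense BF16 FLOP/s per H100 quoted later in the post, 40% compute utilization, and 30-day months; the function name is made up for illustration.

```python
# Rough pretraining compute of a cluster, under the assumptions stated above.
H100_BF16_FLOPS = 1e15               # dense BF16 FLOP/s per H100 (figure quoted in the post)
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day months


def pretraining_flops(num_chips, months, utilization=0.4,
                      flops_per_chip=H100_BF16_FLOPS):
    """Total FLOPs delivered by the cluster over the whole training run."""
    return num_chips * flops_per_chip * utilization * months * SECONDS_PER_MONTH


for chips in (30_000, 100_000):
    print(f"{chips} H100 for 4 months: {pretraining_flops(chips, 4):.1e} FLOPs")
```

This gives roughly 1.2e26 FLOPs for 30K H100 and 4.1e26 FLOPs for 100K H100, in line with the rounded figures in the text.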
Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.
Anthropic might need more time than the other players to get its new hardware running, but Trn2 and TPUv6e also have an advantage over H100: larger scale-up domains that enable more tensor parallelism and smaller minibatch sizes. The small scale-up domains might be an issue when training on H100 at this scale[1], and could explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
We haven't seen AIs made from compute optimal LLMs pretrained on these systems yet, but the systems were around for 6+ months, so the AIs should start getting deployed imminently, and will become ubiquitous in 2025. This is a change from 4e25-1e26 FLOPs to 2e26-6e26 FLOPs, up to 30x original GPT-4. More might follow later in 2025 from the 400K Trn2 cluster, the possibility that Google gets more than one 100K TPUv6e cluster connected into a single training system, and xAI's upcoming doubling of the 100K H100 cluster.
Two More Turns of the Crank
Funding, power, and data are constraints that might plausibly slow things down at some point. But when specifically does that become a possibility? Hyperscalers spend about $30-50bn a year on CAPEX (building things like datacenters around the world), so in 2024 shaping $5-10bn in the form of clusters useful for frontier model training is not yet painful.
In 2025, Microsoft might be building a 300K B200 cluster, possibly part of a geographically distributed training system of 500K-700K B200. A cluster of 100K B200 needs about 200 MW, so the whole thing would need 600-1400 MW. Google is doing something similar, with sites under construction and in close proximity adding up to about 1 GW by the end of 2025. There are two such 1 GW collections of sites, one in Ohio and another in Iowa/Nebraska.
A training system built in parts across multiple sites makes it easier to quickly secure enough power, and an inter-datacenter network with bandwidth of about 300 Tbps[2] might be sufficient for a 1 GW training system, which is practical even for oceanfloor cables. Overland cables enable more than that, as long as they can actually get in place quickly.
At this pace, it looks like the next step of scaling would need 5 GW and take another 18-24 months, with systems ready for training by 2028, maybe late 2027. The 1 GW systems will already cost $30-50bn, on the order of annual CAPEX. For datacenters planned to last 6 years, a better anchor might be a fraction of CAPEX over 6 years, which is $200-300bn. But even then $100-200bn for a 5 GW training system seems too much to allocate without a stronger case.
The impact of OpenAI o3 on timelines might be primarily in its successors making the case for building these 5 GW training systems. In 2026, even if the 5 GW training systems are still not given a go-ahead, the 1 GW training systems built in 2025 will start producing 5e27 FLOPs models (250x original GPT-4). The case made by the successors of o3 will be strengthened by their scale, and only if that still fails could a scaling slowdown before 2028 become a possibility.
Peak Data
Largest datasets used in training LLMs with disclosed size are 15T and 18T tokens. FineWeb dataset is 15T tokens, RedPajama2 dataset is 30T tokens. A 4e25-2e26 FLOPs compute optimal model doesn't need more data than that, it needs better selection of data. As the scale changes, the goals become different. The DCLM paper details which data gets thrown out, starting with DCLM-Pool, a raw 240T token Common Crawl dataset (see Figure 4). I would guess at least 50T tokens are not completely useless, and there are many tokens outside Common Crawl.
Starting with 50T tokens, it's possible to repeat them in training, 5 times with little penalty and 15 times in a still-useful way (see Figure 4). The paper systematically studies things like how perplexity for a model trained on 5X tokens differs from that for a model trained on X tokens repeated 5 times, with up to 1e22 FLOPs per datapoint.
With 50T tokens repeated 5 times, and a 60 tokens/parameter[3] estimate for a compute optimal dense transformer, we have enough data for 6e27 FLOPs of compute, the scale of 1 GW training systems. Sacrificing some quality and repeating data 15 times, we could get 5e28 FLOPs. Though at that point, feeding the model video or scanned documents might become more valuable in making it smarter.
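The arithmetic behind the 6e27 and 5e28 FLOPs figures can be checked with a minimal sketch. It assumes the common approximation that training compute is about 6 × parameters × tokens, which the post does not state explicitly, and uses the 60 tokens/parameter estimate for both cases; the function name is hypothetical.

```python
def data_limited_compute(unique_tokens, repeats, tokens_per_param):
    """Return (parameters, training FLOPs) for a compute optimal dense model
    that consumes all available tokens, using C ~ 6 * N * D (an assumption)."""
    tokens = unique_tokens * repeats
    params = tokens / tokens_per_param
    flops = 6 * params * tokens
    return params, flops


for repeats in (5, 15):
    params, flops = data_limited_compute(50e12, repeats, 60)
    print(f"{repeats} repeats: {params:.2e} parameters, {flops:.2e} FLOPs")
```

With 5 repetitions this comes out to about 4.2e12 parameters and 6.3e27 FLOPs, and with 15 repetitions to about 1.3e13 parameters and 5.6e28 FLOPs.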
[1] Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it would be trivial to make them so, so they are probably already too large. They can't be made smaller, because only tensor parallelism divides the size of minibatches, and it's only feasible to apply within scale-up domains, smaller collections of accelerators connected with a highly performant network. For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's either 16 or 64 chips, in the standard and Ultra variants respectively. Each Trn2 produces 0.65e15 BF16 FLOP/s, compared to 1e15 FLOP/s of an H100, so a Trn2 scale-up domain produces the compute of 10-40 H100, dividing the minibatch size by 1.3x to 5x compared to an H100 cluster.
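The comparison of scale-up domains in this footnote reduces to a small calculation, sketched here using only the per-chip figures quoted above. The function name is made up for illustration.

```python
TRN2_FLOPS = 0.65e15   # BF16 FLOP/s per Trn2, as quoted in the footnote
H100_FLOPS = 1e15      # BF16 FLOP/s per H100, as quoted in the footnote
H100_DOMAIN = 8        # GPUs in a standard H100 scale-up domain


def trn2_domain_in_h100s(trn2_chips):
    """Compute of a Trn2 scale-up domain, expressed in H100 equivalents."""
    return trn2_chips * TRN2_FLOPS / H100_FLOPS


for chips in (16, 64):
    h100_equiv = trn2_domain_in_h100s(chips)
    ratio = h100_equiv / H100_DOMAIN
    print(f"{chips} Trn2 ~ {h100_equiv:.1f} H100, {ratio:.1f}x an 8-GPU H100 domain")
```

The two cases work out to about 10.4 and 41.6 H100 equivalents, or 1.3x and 5.2x the compute of an 8-GPU H100 domain.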
[2] For Llama 3 405B, about 6 seconds passed between optimizer steps, and there might be about 1.6 TB of gradients to communicate between parts of a hypothetical geographically distributed training system, which should happen much faster, say in 1 second. A 500K B200 system offers 80 times more compute than Llama 3's 16K H100, so by Chinchilla scaling the model might be 9 times larger. A scale-up domain in an NVL72 GB200 setup is 72 B200s or 180 H100s worth of compute, 22 times more than for an H100 cluster. So even with a model that's 9 times larger, tensor parallelism will allow processing a minibatch 2.5 times faster. Communicating 15 TB of gradients in 0.4 seconds takes 300 Tbps of bandwidth. If trained for 4x longer, both compute optimal model size and minibatch processing time would increase 2x, so the necessary bandwidth stays the same.
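The chain of estimates in this footnote can be followed step by step in a short sketch. It assumes that a B200 is worth 2.5 H100 (implied by 72 B200 being 180 H100) and that model size grows as the square root of compute under Chinchilla scaling; the variable and function names are made up for illustration.

```python
import math


def required_bandwidth_tbps(gradient_terabytes, seconds):
    """Bandwidth in Tbps needed to move the gradients within the time budget."""
    return gradient_terabytes * 8 / seconds   # 8 bits per byte


# With the rounded figures from the footnote: 15 TB in 0.4 seconds.
print(f"rounded: {required_bandwidth_tbps(15, 0.4):.0f} Tbps")

# The same chain without intermediate rounding.
compute_ratio = 500_000 * 2.5 / 16_000     # 500K B200 vs 16K H100
model_scale = math.sqrt(compute_ratio)     # assumption: params scale as sqrt(compute)
gradient_tb = 1.6 * model_scale            # gradients grow with model size
domain_ratio = 180 / 8                     # NVL72 domain vs 8-GPU H100 domain
step_speedup = domain_ratio / model_scale  # how much faster a minibatch is processed
comm_seconds = 1.0 / step_speedup          # the 1 second budget shrinks accordingly
print(f"unrounded: {required_bandwidth_tbps(gradient_tb, comm_seconds):.0f} Tbps")
```

Without intermediate rounding the result lands slightly below 300 Tbps, so the rounded figure in the text is on the conservative side.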
[3] There is a well-known estimate from the Chinchilla paper of 20 tokens/parameter being compute optimal, but subsequent studies show that this ratio can significantly vary. It also slowly but not insignificantly changes with scale. The Llama 3 paper estimates a ratio of 40 tokens/parameter at 4e25 FLOPs, increasing by 15% with every 10x of compute, using experiments of up to 1e22 FLOPs per datapoint (see Figure 3). This in particular predicts 30 tokens/parameter at Chinchilla's 6e23 FLOPs, 55 tokens/parameter at 6e27 FLOPs, and 60 tokens/parameter at 5e28 FLOPs. Extrapolating from 1e22 FLOPs to 5e28 FLOPs and using the best 15T tokens with no repetition means wide error bars.
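The extrapolation described in this footnote can be written as a one-line rule: start from 40 tokens/parameter at 4e25 FLOPs and multiply by 1.15 for every 10x of compute. The sketch below uses only those two numbers from the text; the function name and its parameters are hypothetical.

```python
import math


def tokens_per_param(flops, anchor_ratio=40.0, anchor_flops=4e25,
                     growth_per_10x=1.15):
    """Compute optimal tokens/parameter, extrapolated from the anchor point."""
    decades = math.log10(flops / anchor_flops)   # number of 10x steps from the anchor
    return anchor_ratio * growth_per_10x ** decades


for flops in (6e23, 6e27, 5e28):
    print(f"{flops:.0e} FLOPs: {tokens_per_param(flops):.0f} tokens/parameter")
```

The unrounded values are about 31, 54, and 62 tokens/parameter, which the text rounds to 30, 55, and 60.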
15 comments
Comments sorted by top scores.
comment by lunatic_at_large · 2024-12-22T17:43:38.211Z
I guess I'm a bit confused where o3 comes into this analysis. This discussion appears to be focused on base models to me? Is data really the bottleneck these days for o-series-type advancements? I thought that compute available to do RL in self-play / CoT / long-time-horizon-agentic-setups would be a bigger consideration.
Edit: I guess upon another reading this article seems like an argument against AI capabilities hitting a plateau in the coming years, whereas the o3 announcement makes me more curious about whether we're going to hyper-accelerate capabilities in the coming months.
↑ comment by Vladimir_Nesov · 2024-12-22T20:59:53.875Z
My thesis is that the o3 announcement is timelines-relevant in a strange way. The causation goes from o3 to impressiveness or utility of its successors trained on 1 GW training systems, then to decisions to build 5 GW training systems, and it's those 5 GW training systems that have a proximate effect on timelines (in comparison to the world only having 1 GW training systems for a few years). The argument goes through even if o3 and its successors don't particularly move timelines directly through their capabilities, they can remain a successful normal technology.
The funding constraint stopping $150bn training systems previously seemed more plausible, but with o3 it might be lifted. This is timelines-relevant precisely because there aren't any other constraints that come into play before that point.
comment by Logan Riggs (elriggs) · 2024-12-22T15:19:27.882Z
I really appreciate your post and all the links! This and your other recent posts/comments have really helped make a clearer model of timelines.
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-25T14:11:44.573Z
Do you have thoughts on the apparent recent slowdown/disappointing results in scaling up pretraining? These might suggest very diminishing returns in scaling up pretraining significantly before 6e27 FLOP.
↑ comment by Vladimir_Nesov · 2024-12-25T15:33:32.053Z
They've probably scaled up 2x-4x compared to the previous scale of about 8e25 FLOPs, which is not that far (from 30K H100 to 100K H100). One issue, as I mentioned in the post, is the inability to reduce minibatch size, which might make this scaling step even less impactful than it should be judging from compute alone, though that doesn't apply to Google.
In any case this doesn't matter yet: the 1 GW training systems are already being built (in the case of Nvidia GPUs, with the larger scale-up worlds of GB200 NVL72), so the decision to proceed to the yet-unobserved next level of scaling doesn't depend on what's observed right now. The 1 GW training systems allow training up to about 5e27 FLOPs, about 60x[1] the currently deployed models, a more significant change. We'll see its impact in late 2026.
[1] The number of chips increases 5x from 100K H100 to 500K B200, and the new chips are 2.5x faster. If 1 GW systems are not yet expected to be quickly followed by larger systems, more time will be given to individual frontier model training runs, let's say 1.5x more. And there was that 3x factor from 30K H100 to 100K H100.
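The factors in this footnote multiply out as follows. This is a minimal sketch using only the numbers given in the comment, with the 8e25 FLOPs baseline taken from the start of the reply; the dictionary keys are just labels.

```python
import math

# Each factor as stated in the footnote.
factors = {
    "more chips (100K H100 -> 500K B200)": 5,
    "faster chips (B200 vs H100)": 2.5,
    "longer training runs": 1.5,
    "earlier step (30K -> 100K H100)": 3,
}

multiplier = math.prod(factors.values())
baseline_flops = 8e25   # previous scale, per the comment
print(f"multiplier: {multiplier}x, resulting scale: {baseline_flops * multiplier:.1e} FLOPs")
```

The product is 56.25, which the comment rounds to 60x, and applying it to 8e25 FLOPs gives 4.5e27 FLOPs, consistent with "about 5e27 FLOPs".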
comment by wassname · 2024-12-24T03:20:41.700Z
"Peak Data"
We don't know how o3 works, but we can speculate. If it's like the open source huggingface kinda-replication then it uses all kinds of expensive methods to make the next level of reward model, and this model teaches a simpler student model. That means that the expensive methods are only needed once, during the training.
In other words, you use all kinds of expensive methods (process supervision, test time compute, MCTS) to bootstrap the next level of labels/supervision, which teaches a cheaper student model. This is essentially bootstrapping superhuman synthetic data/supervision.
o3 seems to have shown that this bootstrapping process can be repeated beyond the limits of human training data.
If this is true, we've reached peak cheap data. Not peak data.
comment by avturchin · 2024-12-22T16:21:08.047Z
"With 50T tokens repeated 5 times, and a 60 tokens/parameter[3] estimate for a compute optimal dense transformer,"
Does it mean that the optimal size of the model will be around 4.17Tb?
↑ comment by Vladimir_Nesov · 2024-12-22T17:03:18.973Z
About 4T parameters, which is 8 TB in BF16. With about 100x more compute (compared to Llama 3 405B), we get a 10x larger model by Chinchilla scaling. The correction from a higher tokens/parameter ratio is relatively small (and in this case it cancels out the factor of 1.5 from the compute actually being 150x larger).
Not completely sure if BF16 remains sufficient at 6e27-5e28 FLOPs, as these models will have more layers and larger sums in matrix multiplications. If BF16 doesn't work, the same clusters will offer less compute (at a higher precision). Seems unlikely though, as 3 OOMs of compute only increase model size 30x, which means 3x more layers and 3x larger matrices (in linear size), which is not that much. There are block number formats like microscaling that might help if this is somehow a problem, but usability of this remains unclear, as everyone is still training in BF16 in practice.
In the other direction, there is a Nov 2024 paper that suggests 7-8 bit precision might be compute optimal at any scale, that the proper way to adapt to scale is by increasing the number of parameters rather than increasing precision (Section 4.3.2). If this can be made practical at a given scale, there'll be 2x more compute, and even more in effective compute, which is essentially the paper's claim. (I don't know how this interacts with scarce data, possibly either higher or lower precision can improve the situation.)
comment by 152334H (152334h) · 2024-12-23T11:11:20.187Z
[minor technical disputes below; ignore if disinterested]
"This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful."
"Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it's trivial to make it so, so they are probably already too large."
I'm a bit confused by this part. I believe the l3 paper indicates the training seqlen was increased mid-training.
In general, I don't understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.
"For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's either 16 or 64 GPUs in either standard or Ultra variants."
I think it's plausible the combination of torus topology + poor PCIe5.0 bw/latency will make a full TP=64 Trn2 config underperform your expectations, but we may have to wait for Semianalysis to provide good numbers on this.
↑ comment by Vladimir_Nesov · 2024-12-23T22:11:28.870Z
"In general, I don't understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms."
Pipeline parallelism doesn't reduce batch size, it just moves the processing of a given sequence around the cluster in stages, but the number of sequences being processed by the cluster at a given time doesn't change (the time needed to process a layer for some sequence doesn't change, so the time between optimizer steps doesn't change, other than through bubbles). Tensor parallelism spreads the processing of a sequence across multiple GPUs, so there are fewer sequences processed at once within the cluster, which can be used to reduce the batch size (the time needed to process a layer for some sequence is divided by degree of tensor parallelism, so the time between optimizer steps reduces, and so does the total compute expended in a batch, proportional to the total number of sequences in it). You can only do tensor parallelism within a scale-up world without murdering compute utilization, which puts a bound on how much you can reduce the batch size.
"I believe the l3 paper indicates the training seqlen was increased mid-training."
Section 3.4 says they start with sequences of length 4K, move to sequences of length 8K after 250M tokens, then to 16M tokens per batch after 2.9T tokens, and finally to long context training in the last 800B tokens (out of about 15T tokens in total). So 11T out of 15T tokens were learned in batches of 2K sequences of length 8K.
"I think it's plausible the combination of torus topology + poor PCIe5.0 bw/latency will make a full TP=64 Trn2 config underperform your expectations"
Good catch, TP=32 on 400K Trn2 gives the same batch size as TP=8 on 100K H100, so there is only an advantage with TP=64, which is not a priori a sure thing to work well. And a hypothetical non-Ultra 400K Trn2 cluster with its 16-chip scale-up worlds is worse, even though there's more compute in 16 Trn2 than in 8 H100. Though it would be surprising if the Rainier cluster doesn't have the Ultra config, as what else is it supposed to be for.
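The batch size comparison here follows from counting tensor-parallel groups. The sketch below uses a deliberately simplified model, assuming the minimum number of sequences in a minibatch is proportional to the number of chips divided by the degree of tensor parallelism, and ignoring everything else about the parallelism setup; the function name is made up for illustration.

```python
def tensor_parallel_groups(num_chips, tp_degree):
    """Number of tensor-parallel groups, used here as a proxy for the minimum
    number of sequences processed at once (a simplifying assumption)."""
    return num_chips // tp_degree


configs = [
    ("100K H100, TP=8", 100_000, 8),
    ("400K Trn2, TP=16", 400_000, 16),
    ("400K Trn2, TP=32", 400_000, 32),
    ("400K Trn2, TP=64", 400_000, 64),
]
for name, chips, tp in configs:
    print(f"{name}: {tensor_parallel_groups(chips, tp)} groups")
```

Under this proxy, TP=32 on 400K Trn2 matches TP=8 on 100K H100 at 12,500 groups, TP=64 halves that, and TP=16 doubles it.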
comment by anaguma · 2024-12-22T22:09:20.130Z
"In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft's 3 buildings in Goodyear, Arizona, xAI's Memphis cluster, Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months."
"Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100."
"Anthropic might need more time than the other players to get its new hardware running, but there is also an advantage to Trn2 and TPUv6e over H100, larger scale-up domains that enable more tensor parallelism and smaller minibatch sizes. This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful."
Do we know much about TPU and Trn2 performance at lower precision? I expect most training runs are using 4-8 bit precision by this point.
↑ comment by Vladimir_Nesov · 2024-12-23T22:18:28.678Z
Are there any signs to be found in public that anyone is training 10B+ LLMs in a precision that is not 16 bits? There are experiments that are specifically about precision on smaller LLMs, but they don't seem to get adopted in practice for larger models, despite the obvious advantage of getting to 2x the compute.
↑ comment by anaguma · 2024-12-27T02:06:32.867Z
Deepseek v3 is one example, and semianalysis has claimed that most labs use FP8.
"FP8 Training is important as it speeds up training compared to BF16 & most frontier labs use FP8 Training."
↑ comment by Vladimir_Nesov · 2024-12-28T18:50:21.461Z
DeepSeek-V3 might be the only example (and it's from the future, released after I asked the question). Not sure if it generalizes to expecting more FP8 training, as it's a MoE model with 257 experts and uses relatively small 7Kx2K matrices in its experts, while GPT-3-175B tested in FP8 in the Sep 2022 paper has much larger matrices, and that result wasn't sufficient to promote widespread adoption (at least where it's possible to observe).
On the other hand, if DeepSeek-V3 really is as good for its compute (4e24-6e24 FLOPs) as the benchmarks indicate, it might motivate more training with a huge number of smaller experts (it activates 8 experts per token, so the number of experts is even higher than one would expect from its ratio of total to active parameters). There was a Feb 2024 paper claiming 20x or higher compute multipliers for MoE models compared to dense (Figure 1b), appearing only if they activate a lot of experts per token, predicting 64 to be optimal at 1e24-1e25 FLOPs (the usual practice is to activate 2 experts). So DeepSeek-V3 weakly supports this surprising claim, though actual experimental results with more compute than that paper's 3e19-4e20 FLOPs per datapoint would be better. The paper also predicts reduction in tokens per parameter with more compute (Table 2), reaching 8 tokens per active parameter at 5e25 FLOPs (in a MoE model with 4096 experts, 64 of which get activated per token). If this too is somehow correct, natural text data can be sufficient for 10 times more compute than with dense models.