Scaling of AI training runs will slow down after GPT-5
post by Maxime Riché (maxime-riche) · 2024-04-26T16:05:59.957Z · LW · GW · 5 comments
My credence: 33% confidence in the claim that the growth in the number of GPUs used for training SOTA AI will slow down significantly directly after GPT-5. It is not higher because (1) decentralized training may be possible, (2) GPT-5 may be able to increase hardware efficiency significantly, (3) GPT-5 may be smaller than assumed in this post, and (4) race dynamics.
TLDR: Because energy access is becoming a bottleneck for data centers, and continuing to scale would require building data centers that are OOMs larger.
Update: See Vladimir_Nesov [LW(p) · GW(p)]'s comment below for why this claim is likely wrong, since decentralized training seems to be solved. As a consequence, I updated my credence in the claim made in this post from 33% to 15%.
The reasoning behind the claim:
- Current large data centers consume around 100 MW of power, while a single nuclear power plant generates around 1 GW. The largest data center seems to consume around 150 MW.
- An A100 GPU uses 250 W, or around 1 kW with overhead; a B200 GPU uses ~1 kW even without overhead. Thus a 1 MW data center can support a maximum of roughly 1k to 2k GPUs.
- GPT-4 used something like 15k to 25k GPUs to train, thus around 15 to 25 MW (at ~1 kW per GPU including overhead).
- Large data centers are around 10-100 MW. This is likely one of the reasons why top AI labs are mostly still using ~GPT-4 levels of FLOP to train new models.
- GPT-5 will mark the end of the fast scaling of training runs.
- A 10-fold increase in the number of GPUs above GPT-5 would require a 1 to 2.5 GW data center, which doesn’t exist and would take years to build, OR would require decentralized training across several data centers. Thus GPT-5 is expected to mark a significant slowdown in the scaling of training runs (see the sketch after this list). The power consumption required to continue scaling at the current rate is becoming unsustainable, as it would require the equivalent of multiple nuclear power plants. I think this is basically what Sam Altman, Elon Musk, and Mark Zuckerberg are saying in public interviews.
- The main focus for increasing capabilities will once again be on improving software efficiency. In the next few years, investment will also focus on scaling at inference time and on decentralized training across several data centers.
- If GPT-5 doesn’t unlock research capabilities, then after GPT-5, capability scaling will slow down for some time towards historical rates, with most gains coming from software improvements, a bit from hardware improvements, and significantly less than currently from scaling spending.
- Scaling the number of GPUs will be slowed down by regulations on land, energy production, and construction times. Training data centers may be located and built in low-regulation countries, e.g., the Middle East, for cheap land, fast construction, low regulation, and cheap energy, which may explain some of the talks with Middle-East investors.
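A minimal sketch of the power arithmetic used in the list above (the ~1 kW/GPU figure and the GPU counts are the rough assumptions from this post, not measured values):

```python
# Rough power arithmetic for training clusters (assumptions from this post).

WATTS_PER_GPU = 1_000  # ~1 kW per GPU, including overhead

def training_power_mw(n_gpus: int) -> float:
    """Approximate power draw of a training cluster, in MW."""
    return n_gpus * WATTS_PER_GPU / 1e6

clusters = [
    ("GPT-4 (assumed)", 25_000),        # ~15k-25k GPUs -> ~15-25 MW
    ("GPT-5 (10x GPT-4)", 250_000),     # ~250 MW, above the largest current data center
    ("GPT-6 (100x GPT-4)", 2_500_000),  # ~2.5 GW, a few 1 GW power plants' worth
]
for label, n_gpus in clusters:
    print(f"{label}: ~{training_power_mw(n_gpus):,.0f} MW")
```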
Unrelated to the claim:
- Hopefully, GPT-5 is still insufficient for self-improvement:
- Research involves pretty long-horizon tasks that may require several OOMs more compute.
- More accurate world models may be necessary for longer horizon tasks and especially for research (hopefully requiring the use of compute-inefficient real, non-noisy data, e.g., real video).
- “Hopefully”, moving to above human level requires RL.
- “Hopefully”, RL training to finetune agents is still several OOMs less efficient than pretraining and/or is currently too noisy to improve the world model (this is different from simply shaping propensities), and doesn’t work in the end.
- My guess is that GPT-5 will be at expert human level on short-horizon tasks but not on long-horizon tasks nor on doing research (improving SOTA), and that we can’t scale as fast as we currently do beyond that point.
How big is that effect going to be?
Using values from https://epochai.org/blog/the-longest-training-run, we have estimates that, in one year, effective compute is increased by:
- Software efficiency: x1.7/year (1 OOM in 3.9 y)
- Hardware efficiency: x1.3/year (1 OOM in 5.9 y)
- Investment increase:
- x2.8/year (before ChatGPT) (1 OOM in 2.3 y)
- x10/year (since ChatGPT) (1 OOM in 1 y) (my guess for GPT-4 => GPT-5)
Let's assume GPT-5 uses 10 times more GPUs than GPT-4 for training. 250k GPUs would mean around 250 MW needed for training. This is already larger than the largest reported data center (~150 MW, mentioned above). Then, moving to GPT-6 with 2.5M GPUs would require 2.5 GW.
Building the infrastructure for GPT-6 may require a few years (e.g., using existing power plants and building a 2.5M-GPU data center). For reference, OpenAI and Microsoft seem to have a $100B data center project running until 2028 (4 years); that’s worth around 3M B200 GPUs (at $30k per unit).
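As a rough sanity check of that GPU count, assuming the entire budget went to GPUs at ~$30k per unit (which it would not in practice):

```python
# Hypothetical sanity check: GPUs purchasable with a $100B budget at ~$30k/unit.
budget_usd = 100e9
price_per_b200_usd = 30e3
print(f"~{budget_usd / price_per_b200_usd / 1e6:.1f}M GPUs")  # ~3.3M
```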
Building the infrastructure for GPT-7 may require even more time (e.g., building 25 power plant units).
If the infrastructure for GPT-6 takes 4 years to be assembled, then the increase in GPUs is limited to 1 OOM in 4 years (~ x1.8/year).
The total growth rate of effective compute between GPT-4 and GPT-5 is then x22/year (1.7 x 1.3 x 10), or x6.2/year when using the investment growth value from before ChatGPT.
Taking into account this decrease in the growth of investment in training runs, the total growth rate between GPT-5 and GPT-6 would then be x4/year (1.7 x 1.3 x 1.8). Compared to the GPT-4 to GPT-5 period, the growth rate would be divided by 5.5, or by 1.55 when compared to the pre-ChatGPT rate.
These estimates assume no efficient decentralized training.
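A minimal sketch reproducing the growth-rate arithmetic above (the x1.7 and x1.3 multipliers are the Epoch-based estimates quoted earlier; the x1.8/year figure is the power-limited GPU scaling assumed in this post):

```python
# Effective-compute growth = software efficiency x hardware efficiency x GPU scaling.
SOFTWARE = 1.7  # software efficiency growth per year
HARDWARE = 1.3  # hardware efficiency growth per year

gpu_scaling_regimes = {
    "pre-ChatGPT investment growth": 2.8,
    "GPT-4 -> GPT-5 (assumed)": 10.0,
    "GPT-5 -> GPT-6, power-limited (1 OOM in 4 years)": 1.8,
}

for label, gpu_growth in gpu_scaling_regimes.items():
    total = SOFTWARE * HARDWARE * gpu_growth
    print(f"{label}: ~x{total:.1f}/year effective compute")
# -> ~x6.2, ~x22.1, ~x4.0 per year, respectively
```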
Impact of GPT-5
One could assume that software and hardware efficiency will see their growth rates increased by something like 100% (i.e., doubled relative to before ChatGPT) because of the increased productivity brought by GPT-5.
In that case, the growth rate of effective compute after GPT-5 would be significantly above the growth rate before ChatGPT (~ x8.8/year vs. ~ x6/year before ChatGPT).
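A sketch of this scenario, under the (speculative) assumption that GPT-5 doubles the software and hardware progress rates; doubling a rate expressed in OOM/year corresponds to squaring the yearly multiplier:

```python
# Scenario sketch: GPT-5 doubles the software/hardware progress rates
# (squares the yearly multipliers), while GPU scaling stays power-limited.
SOFTWARE, HARDWARE = 1.7, 1.3
GPU_SCALING = 1.8                      # 1 OOM in 4 years

boosted = SOFTWARE**2 * HARDWARE**2 * GPU_SCALING
baseline = SOFTWARE * HARDWARE * 2.8   # pre-ChatGPT investment growth

print(f"after GPT-5 (boosted): ~x{boosted:.1f}/year")  # ~x8.8
print(f"before ChatGPT:        ~x{baseline:.1f}/year")  # ~x6.2
```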
5 comments
Comments sorted by top scores.
comment by Vladimir_Nesov · 2024-04-26T16:55:40.947Z · LW(p) · GW(p)
Distributed training seems close enough to being a solved problem that a project costing north of a billion dollars might get it working on schedule. It's easier to stay within a single datacenter, and so far it wasn't necessary to do more than that, so distributed training not being routinely used yet is hardly evidence that it's very hard to implement.
There's also this snippet in the Gemini report:
Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. [...] we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
I think the crux for feasibility of further scaling (beyond $10-$50 billion) is whether systems with currently-reasonable cost keep getting sufficiently more useful, for example enable economically valuable agentic behavior, things like preparing pull requests based on feature/bug discussion on an issue tracker, or fixing failing builds. Meaningful help with research is a crux for reaching TAI and ASI, but it doesn't seem necessary for enabling existence of a $2 trillion AI company.
Replies from: maxime-riche
↑ comment by Maxime Riché (maxime-riche) · 2024-04-26T20:35:52.228Z · LW(p) · GW(p)
Thanks for the great comment!
Do we know whether distributed training is expected to scale well to GPT-6-sized models (100 trillion parameters) trained over something like 20 data centers? How does the communication cost scale with the size of the model and the number of data centers? Linearly in both?
After reading this for 3 minutes: "Google Cloud demonstrates the world’s largest distributed training job for large language models across 50,000+ TPU v5e chips" (Google, November 2023), it seems that scaling works efficiently at least up to 50k chips (GPT-6 would be more like 2.5M GPUs). There is also a surprisingly linear increase in start time with the number of chips: 13 min for 32k chips. What is the SOTA?
comment by Chris_Leong · 2024-04-26T19:16:58.409Z · LW(p) · GW(p)
Only 33% confidence? It seems strange to state X will happen if your odds are < 50%
Replies from: maxime-riche
↑ comment by Maxime Riché (maxime-riche) · 2024-04-26T19:36:43.517Z · LW(p) · GW(p)
The title is clearly an overstatement. It expresses that I updated in that direction more than that I am confident in it.
Also, since learning from other comments that decentralized training is likely solved, I am now even less confident in the claim: maybe only a 15% chance that it will happen in the strong form stated in the post.
Maybe I should edit the post to make it even more clear that the claim is retracted.
comment by jsd · 2024-04-26T21:56:45.388Z · LW(p) · GW(p)
Amazon recently bought a 960MW nuclear-powered datacenter.
I think this doesn't contradict your claim that "The largest seems to consume 150 MW" because the 960MW datacenter hasn't been built (or there is already a datacenter there but it doesn't consume that much energy for now)?