Estimates of GPU or equivalent resources of large AI players for 2024/5
post by CharlesD · 2024-11-28T23:01:58.522Z
AI infrastructure numbers are hard to find with any precision. There are many reported numbers of “[company] spending Xbn on infrastructure this quarter” and “[company] has bought 100k H100s” or “has a cluster of 100k H100s”, but when I went looking for an estimate of how much compute a given company had access to, I could not find consistent numbers available. Here I’ve tried to pull together information from a variety of sources to get ballpark estimates of (i) as of EOY 2024, who do we expect to have how much compute? and (ii) how do we expect that to change in 2025? I then spend a little time talking about what that might mean for training compute availability at the main frontier labs. Before going into this, I want to lay out a few caveats:
- These numbers are all estimates I’ve made from publicly available data, in limited time, and are likely to contain errors and miss some important information somewhere.
- There are very likely much better estimates available from paywalled vendors, who can spend more time going into detail of how many fabs there are, what each fab is likely producing, where the data centers are and how many chips are in each one, and other detailed minutiae and come to much more accurate numbers. This is not meant to be a good substitute for that, and if you need very accurate estimates I suggest you go pay one of several vendors for that data.
With that said, let’s get started.
Nvidia chip production
The first place to start is with the producer of the most important data center GPUs, Nvidia. As of November 21st, after Nvidia reported fiscal 2025 Q3 earnings[1], calendar year 2024 Data Center revenues for Nvidia look to be around $110bn. This is up from $42bn in 2023, and is projected to be $173bn in 2025 (based on this estimate of $177bn for fiscal 2026).[2]
Data Center revenues are overwhelmingly based on chip sales. 2025 chip sales are estimated to be 6.5-7m GPUs, which will almost entirely be Hopper and Blackwell models. I have estimated 2m Hopper models and 5m Blackwell models based on the proportion of each expected from the CoWoS-S and CoWoS-L manufacturing processes and the expected pace of Blackwell ramp up.
2024 production
Sources for 2024 production numbers were thin and often conflicting. Estimates of 1.5m Hopper GPUs for Q4 2024 (an upper bound, since this will include some H20 chips, a significantly inferior chip), combined with quarter-by-quarter data center revenue ratios, suggest an upper bound of 5m produced. This would assume approx $20k of revenue per H100-equivalent, which seems low - using a more plausible $25k we get 4m. This is in conflict with estimates from earlier in the year of 1.5-2m H100s produced. Whether this difference could plausibly be attributed to H100 vs H200, expanded capacity, or another factor is unclear, but since the lower estimate is incongruent with Nvidia's revenue numbers I have chosen to use the higher figure.
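As a rough check, the two figures above come from dividing one revenue total by two assumed prices per chip. The sketch below back-calculates that revenue total from the "5m at $20k" figure rather than taking it from a separate source, so it is an assumption, and the function name is purely illustrative.

```python
def implied_units(chip_revenue_usd: float, revenue_per_chip_usd: float) -> float:
    """H100-equivalents implied by a revenue total and an average revenue per chip."""
    return chip_revenue_usd / revenue_per_chip_usd

# Assumption: back-calculated from the post's own bound (5m chips at $20k each),
# which implies roughly $100bn of chip revenue for 2024.
chip_revenue = 5_000_000 * 20_000

upper_bound = implied_units(chip_revenue, 20_000)  # low price per chip -> more chips
central = implied_units(chip_revenue, 25_000)      # more plausible price per chip

print(f"{upper_bound / 1e6:.1f}m at $20k, {central / 1e6:.1f}m at $25k")
```

The estimate is quite sensitive to the assumed price: a 25% higher revenue per chip removes a million units from the count.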
Previous production
For the purpose of knowing who has the most compute now and especially going forward, pre 2023 numbers are not going to significantly move the needle, due to improvements in GPUs themselves and big increases in the production numbers, based on Nvidia sales.
Based on estimates that Microsoft and Meta each got 150k H100s in 2023, and looking at Nvidia Data Center revenues, something in the 1m range for H100 equivalent production in 2023 seems likely.
GPU/TPU counts by organisation
Here I try to get estimates for how many chips (expressed as H100 equivalents) each of Microsoft, Meta, Google, Amazon and XAI will have access to at Year End 2024, and project numbers for 2025.
Numerous sources report things to the effect that “46% of Nvidia’s revenue came from 4 customers”. However, this is potentially misleading. If we look at Nvidia 10-Qs and 10-Ks, we can see that they distinguish between direct and indirect customers, and the 46% number here refers to direct customers. However, direct customers are not what we care about here. Direct customers are mostly middlemen like SMC, HPE and Dell, who purchase the GPUs and assemble the servers used by indirect customers, such as public cloud providers, consumer internet companies, enterprises, public sector and startups.
The companies we care about fall under “indirect customers”, and the disclosures around these are slightly looser, and possibly less reliable. For fiscal year 2024 (approx 2023 as discussed) Nvidia’s annual report disclosed that “One indirect customer which primarily purchases our products through system integrators and distributors [..] is estimated to have represented approximately 19% of total revenue”. They are required to disclose customers with >10% revenue share[3], so either their second customer is at most half as big as the first, or there are measurement errors here[4]. Who is this largest customer? The main candidate seems to be Microsoft. There are sporadic disclosures on a quarterly basis of a second customer exceeding 10% briefly[5], but not consistently and not for either the full year 2023 or the first 3 quarters of 2024[6].
Estimating H100 equivalent chip counts at year end 2024
Microsoft, Meta
Given Microsoft has one of the largest public clouds, is the major provider of compute to OpenAI, does not (unlike Google and possibly Amazon) have a significant installed base of its own custom chips, and appears to have a privileged relationship with Nvidia relative to peers (they were apparently the first to get Blackwell chips, for example) it seems very likely that this largest customer is Microsoft in both years. The revenue share for 2024 is not specified as precisely as for 2023, with 13% of H1 revenue mentioned in the Nvidia Q2 10-Q and just “over 10%” for Q3, but 13% seems a reasonable estimate, suggesting their share of Nvidia sales decreased from 2023.
There are other estimates of customer sizes - Bloomberg data estimates that Microsoft makes up 15% of Nvidia's revenue, followed by Meta Platforms at 13% of revenue, Amazon at 6% of revenue, and Google at about 6% of revenue - it is not clear from the source which years this refers to. Reports of the numbers of H100 chips possessed by these cloud providers as of year end 2023 (150k for Meta and Microsoft, and 50k each for Amazon, Google and Oracle) align better with the Bloomberg numbers.
An anchoring data point here is Meta’s claim that Meta would have 600k H100 equivalents of compute by year end 2024. This was said to include 350k H100s, and it seems likely most of the balance would be H200s and a smaller number of Blackwell chips arriving in the last quarter[7].
If we take this 600k as accurate and use the proportion of revenue numbers, we can get better estimates for Microsoft’s available compute as being somewhere between 25% and 50% higher than this, which would be 750k-900k H100 equivalents.
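The Microsoft range is just Meta's 600k scaled up by 25% and by 50%. A minimal sketch of that step, where the helper name is hypothetical and the uplift percentages are the ones stated above:

```python
META_YE2024 = 600_000  # Meta's claimed H100 equivalents by year end 2024

def scale_range(base: int, low_uplift: float, high_uplift: float) -> tuple[int, int]:
    """Apply a low and a high fractional uplift to a base chip count."""
    return round(base * (1 + low_uplift)), round(base * (1 + high_uplift))

msft_low, msft_high = scale_range(META_YE2024, 0.25, 0.50)
print(f"Microsoft: {msft_low:,} to {msft_high:,} H100 equivalents")
```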
Google, Amazon
Amazon and Google are consistently suggested to be behind here in terms of their contribution to Nvidia revenues. However, these are two quite different cases.
Google already has substantial amounts of its own custom TPUs, which are the main chips used for their own internal workloads[8]. It seems very likely that Amazon’s internal AI workloads are much smaller than this, and that their comparable amounts of Nvidia chips reflect mostly what they expect to need to service external demand for GPUs via their cloud platforms (most significantly, demand from Anthropic).
Let’s take Google first. As mentioned, TPUs are the main chip used for their internal workloads. A leading subscription service providing data on this sector, Semianalysis, claimed in late 2023 that “[Google] are the only firm with great in-house chips” and “Google has a near-unmatched ability to deploy AI at scale reliably with low cost and high performance”, and that they were “The Most Compute Rich Firm In The World”. Their infrastructure spend has remained high[9] since these stories were published.
Taking a 2-1 estimate for TPU vs GPU spend[9] and assuming (possibly conservatively) that TPU performance per dollar is equivalent to that of Microsoft’s GPU spend, I get to numbers in the range of 1m-1.5m H100 equivalents as of year end 2024.
Amazon, on the other hand, also has their own custom chips, Trainium and Inferentia, but they got started on these far later than Google did with its TPUs. It seems like they are quite a bit behind the cutting edge with these chips, even offering $110m in free credits to get people to try them out, suggesting they’ve not seen great adoption to date. Semianalysis suggest “Our data shows that both Microsoft and Google’s 2024 spending plans on AI Infrastructure would have them deploying far more compute than Amazon” and “Furthermore, their upcoming in-house chips, Athena and Trainium2 still lag behind significantly.”
What this means in terms of H100 equivalents is not clear, and numbers on the count of Trainium or Trainium2 chips are hard to come by, with the exception of 40,000 being available for use in the free credits programme mentioned above.
However, as of mid 2024 this may have changed - on their Q3 2024 earnings call CEO Andy Jassy said regarding Trainium2 “We're seeing significant interest in these chips, and we've gone back to our manufacturing partners multiple times to produce much more than we'd originally planned.” At that point however, they were “starting to ramp up in the next few weeks” so it seems unlikely they will have huge supply on board in 2024.
XAI
The last significant player I will cover here is XAI. They have grown rapidly, and have some of the largest clusters and biggest plans in the space. They revealed an operational 100k H100 cluster in late 2024, but there seem to be issues with them getting enough power to the site at the moment.
2025 - Blackwell
The 2024 State of AI report has estimates of Blackwell purchases by major providers - “Large cloud companies are buying huge amounts of these GB200 systems: Microsoft between 700k - 1.4M, Google 400k and AWS 360k. OpenAI is rumored to have at least 400k GB200 to itself.” These numbers are for the chips in total and so we are at risk of double counting 2024 Blackwell purchases, so I have discounted them by 15%.
The Google and AWS numbers here are consistent with their typical ratio to Microsoft in Nvidia purchases, if we take 1m as the Microsoft estimate. This would also leave Microsoft at 12% of Nvidia total revenues[10], consistent with a small decline in its share of Nvidia revenue as was seen in 2024.
No Meta estimate was given in this report. However, Meta anticipates a “significant acceleration” in artificial intelligence-related infrastructure expenses next year, suggesting its share of Nvidia spending will remain high. I have assumed they will remain at approximately 80% of Microsoft spend in 2025.
For XAI, they are not mentioned much in the context of these chips, but Elon Musk claimed they would have a 300k Blackwell cluster operational in summer 2025. Assuming some typical hyperbole on Musk's part it seems plausible they could have 200k-400k of these chips by year end 2025.
How many H100s is a B200 worth? For the purpose of measuring capacity growth, this is an important question. Different numbers are cited for training and for inference, but for training 2.2x is the current best estimate (Nov 2024).
For Google, I have assumed the Nvidia chips continue to be ⅓ of their total marginal compute. For Amazon, I have assumed they are 75%. These numbers are quite uncertain and the estimates are sensitive to them.
It is worth noting that there are still many, many H100s and GB200s unaccounted for here, and that there could be significant aggregations of them elsewhere, especially under Nvidia’s 10% reporting threshold. Cloud providers like Oracle and other smaller cloud providers likely hold many, and there are likely some non-US customers of significance too, as Nvidia in Q3 2025 said that 55% of revenue came from outside the US in the year to date (down from 62% the previous year). As this is direct revenue, it may not all correspond to non-US final customers.
Summary of estimated chip counts [11]
Organisation | 2024 YE (H100 equivalent) | 2025 (GB200) | 2025 YE (H100 equivalent) |
MSFT | 750k-900k | 800k-1m | 2.5m-3.1m |
GOOG | 1m-1.5m | 400k | 3.5m-4.2m |
META | 550k-650k | 650k-800k | 1.9m-2.5m |
AMZN | 250k-400k | 360k | 1.3m-1.6m |
XAI | ~100k | 200k-400k | 550k-1m |
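The last column can be approximately reconstructed from the other two using the assumptions stated earlier: 2.2 H100s per GB200 for training, Nvidia chips as ⅓ of Google's marginal compute, and 75% of Amazon's. The sketch below is one reading of that method, not a confirmed reproduction of it. In particular, treating Nvidia as 100% of marginal compute for Microsoft, Meta and XAI is an assumption made here, and the results differ from the table by small rounding amounts.

```python
H100_PER_GB200 = 2.2  # training-equivalence factor quoted in the post

# name: (2024 YE low, 2024 YE high, 2025 GB200 low, 2025 GB200 high,
#        Nvidia share of 2025 marginal compute)
# Shares of 1.0 for MSFT, META and XAI are an assumption of this sketch.
ROWS = {
    "MSFT": (750e3, 900e3, 800e3, 1000e3, 1.0),
    "GOOG": (1000e3, 1500e3, 400e3, 400e3, 1 / 3),
    "META": (550e3, 650e3, 650e3, 800e3, 1.0),
    "AMZN": (250e3, 400e3, 360e3, 360e3, 0.75),
    "XAI": (100e3, 100e3, 200e3, 400e3, 1.0),
}

def year_end_2025(ye_2024: float, gb200: float, nvidia_share: float) -> float:
    """2024 stock plus 2025 additions, grossed up for non-Nvidia chips."""
    added_h100_equiv = gb200 * H100_PER_GB200 / nvidia_share
    return ye_2024 + added_h100_equiv

for name, (lo24, hi24, gb_lo, gb_hi, share) in ROWS.items():
    low = year_end_2025(lo24, gb_lo, share)
    high = year_end_2025(hi24, gb_hi, share)
    print(f"{name}: {low / 1e6:.2f}m-{high / 1e6:.2f}m H100 equivalents")
```

Changing `H100_PER_GB200` or the Google and Amazon shares shows how sensitive the 2025 totals are to those inputs.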
Model training notes
The above numbers are estimates for total available compute. However, many people are likely to care more about how much compute might be used to train the latest frontier models. I will focus on OpenAI, Google, Anthropic, Meta and XAI here. This is all quite speculative, as all these companies are either private or so large they do not have to disclose the breakdowns of costs for this, which in Google’s case is a tiny fraction of their business as it stands.
OpenAI 2024 training costs were expected to reach $3bn, with inference costs at $4bn. Anthropic, per one source, “are expected to lose about ~$2B this year, on revenue in the high hundreds of millions”. This suggests total compute costs more on the order of $2bn than OpenAI’s $7bn. Their inference costs will be substantially lower; given their revenue mostly comes from the API and should have positive gross margins, this suggests that most of that $2bn was for training. Let’s say $1.5bn. A factor of two disadvantage for training costs vs OpenAI does not seem like it would prohibit them being competitive. Such a disadvantage also seems likely, as their primary cloud provider is AWS, which as we’ve seen has typically had fewer resources than Microsoft, which provides OpenAI’s compute. The State of AI report mentioned earlier suggested 400k GB200 chips were rumoured to be available to OpenAI from Microsoft, which would exceed AWS's entire rumoured GB200 capacity and therefore likely keep them well above Anthropic’s training capacity.
Google is less clear. The Gemini Ultra 1.0 model was trained on approximately 2.5x the compute of GPT-4 (but published 9 months later) and 25% more than the latest Llama model. Google, as we have seen, probably has more compute available than peers. However, as a major cloud provider and a large business it has more demands[12] on its compute than Anthropic or OpenAI or even Meta, which also has substantial internal workflows separate from frontier model training, such as recommendation algorithms for its social media products. Llama 3 being smaller in compute terms than Gemini despite being published 8 months later suggests Meta has so far been allocating slightly fewer resources to these models than OpenAI or Google.
XAI allegedly used 20k H100s to train Grok 2, and projected up to 100k H100s would be used for Grok 3. Given GPT-4 was allegedly trained on 25,000 Nvidia A100 GPUs over 90-100 days, and an H100 is about 2.25x an A100, this would put Grok 2 at around double the compute of GPT-4 and project another 5x for Grok 3, putting it towards the leading edge.
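The comparison above can be checked with a few lines of arithmetic. This sketch assumes equal training duration and utilisation across the runs, which the reported figures do not confirm, so it compares cluster throughput only. The function name is illustrative.

```python
H100_PER_A100 = 2.25   # ratio quoted in the post
GPT4_A100S = 25_000    # alleged GPT-4 cluster size
GROK2_H100S = 20_000   # alleged Grok 2 cluster size
GROK3_H100S = 100_000  # projected Grok 3 cluster size

def relative_to_gpt4(h100s: int) -> float:
    """Cluster throughput relative to GPT-4's, assuming equal duration and utilisation."""
    return h100s * H100_PER_A100 / GPT4_A100S

grok2 = relative_to_gpt4(GROK2_H100S)
grok3 = relative_to_gpt4(GROK3_H100S)
print(f"Grok 2: {grok2:.1f}x GPT-4, Grok 3: {grok3:.1f}x GPT-4, step: {grok3 / grok2:.0f}x")
```

Under these assumptions Grok 2 comes out at 1.8x GPT-4, which is the "around double" above, and the step to Grok 3 is the 5x ratio of cluster sizes.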
Note that not all of this has historically come from their own chips - they are estimated to rent 16,000 H100s from Oracle cloud. If XAI is able to devote a similar fraction of its compute to training as OpenAI or Anthropic, I would guess its training is likely to be similar in scale to Anthropic and somewhat below OpenAI and Google.
Thanks to Josh You for feedback on a draft of this post. All errors are my own. Note that Epoch have an estimate of numbers for 2024 here which mostly lines up with the figures I estimated, which I only found after writing this post, though I expect we used much of the same evidence so the estimates are not independent.
[1] Yes, 2025 - Nvidia’s fiscal year annoyingly runs from Feb-Jan and so their earnings in calendar year 2024 are mostly contained in fiscal year 2025.
[2] Note that for ease of comparison with other numbers, I have attempted to adjust Nvidia numbers back by a month, allowing calendar years to line up.
[3] Note that this is >10% of total revenue, not Data Center revenue, but Nvidia confirms it is attributable to their Data Center segment for all these customers.
[4] From the Q2 2025 report - “Indirect customer revenue is an estimation based upon multiple factors including customer purchase order information, product specifications, internal sales data and other sources. Actual indirect customer revenue may differ from our estimates”.
[5] Q2 2025 - “For the second quarter of fiscal year 2025, two indirect customers which primarily purchase our products through system integrators and distributors, including through Customer B and Customer E, are estimated to each represent 10% or more of total revenue attributable to the Compute & Networking segment. For the first half of fiscal year 2025, an indirect customer which primarily purchases our products from system integrators and distributors, including from Customer E, is estimated to represent 10% or more of total revenue, attributable to the Compute & Networking segment.” This implies one customer exceeded the threshold only for Q2 and not for H1.
[6] Q3 2025 - “For the third quarter and first nine months of fiscal year 2025, an indirect customer which primarily purchases our products through system integrators and distributors, including through Customer C, is estimated to represent 10% or more of total revenue, attributable to the Compute & Networking segment.”
[7] This source suggests 500k H100s, but I think this possibly stems from a misreading of the original Meta announcement which referred to 350k H100s total, and this source also omits H200s entirely.
[8] From Google: “TPUs have long been the basis for training and serving AI-powered products like YouTube, Gmail, Google Maps, Google Play, and Android. In fact, Gemini was trained on, and is served, using TPUs.”
[9] Google's Q3 2024 earnings report contained an estimate of $13bn for AI CapEx in Q3 2024, “the majority” on technical infra, 60% of which was servers (GPUs, TPUs). Taking “the majority” to mean $7-11bn, 60% of this being on servers suggests they spent $4.5-7bn that quarter on TPUs/GPUs. If we estimate them as being 6% of Nvidia total revenue as Bloomberg suggests, then they spent about $1.8bn on Nvidia GPUs, so that leaves $2.7bn-$5.2bn to spend on other servers. Given internal workloads run on TPUs, it seems likely the TPU spend is quite a bit higher than GPU spend, so taking the middle of this range we get just under $4bn on TPUs.
[10] Taking the 7m 2025 GPU production numbers from above, assuming 850k of the 5m Blackwell chips go to Microsoft in 2025 (as they will begin receiving them in 2024 and that is in their 2024 estimate already) and assuming Nvidia revenue is 90% Data Center and Blackwell costs 60-70% more than Hopper per Nvidia Q3 2025 earnings.
[11] Note that the ranges in these estimates are not confidence intervals, but rather ranges in which I think a plausible best guess based on the evidence I looked at might land. I have not attempted to construct confidence intervals here.
[12] “Today, more than 60 percent of funded gen AI start-ups and nearly 90 percent of gen AI unicorns are Google Cloud customers.” said Google CEO Sundar Pichai on their Q1 2024 earnings call.
1 comment
comment by Vladimir_Nesov · 2024-11-29T00:38:43.753Z
Llama-3-405B is an important anchor for compute of other models. With 4e25 FLOPs and conservative training techniques it's about as capable, so the other models probably don't use much more. If they have better techniques, they need less compute to get similar performance, not more. And they probably didn't train for more than 6 months. At $2 per H100-hour[1], $3 billion buys 6 months of time on 300K H100s. There are no publicly known training systems this large, the first 100K H100s systems started appearing in the later part of this year. Thus the training cost figures must include smaller experiments that in aggregate eat more compute than the largest training runs, through the now-ubiquitous smaller clusters also used for inference.
So anchoring to total number of GPUs is misleading about frontier model training because most GPUs are used for inference and smaller experiments, and the above estimate shows that figures like $3 billion for training are also poor anchors. If instead we look at 20K H100s as the typical scale of largest clusters in mid 2023 to early 2024, and 4 months as a typical duration of frontier model training, we get $120 million at $2 per H100-hour or 8e25 dense BF16 FLOPs at 40% compute utilization, only about 2x Llama-3-405B compute. This agrees with how Dario Amodei claimed that in Jun 2024 the scale of deployed models is about $100 million.
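The figures in this comment can be reproduced with some simple arithmetic. The sketch below uses the comment's $2 per H100-hour and 40% utilisation. The peak throughput of 1e15 FLOP/s is an assumption, rounded from the H100 SXM dense BF16 peak of roughly 989 TFLOP/s, and the 730-hour month and function names are also choices made here.

```python
HOURS_PER_MONTH = 730      # assumption: 8,760 hours per year / 12
PRICE_PER_H100_HOUR = 2.0  # USD, from the comment
H100_BF16_FLOPS = 1e15     # assumption: rounded dense BF16 peak, FLOP/s
UTILISATION = 0.40         # compute utilisation, from the comment

def gpus_affordable(budget_usd: float, months: float) -> float:
    """How many H100s a budget rents for the given number of months."""
    return budget_usd / (PRICE_PER_H100_HOUR * months * HOURS_PER_MONTH)

def run_cost(gpus: float, months: float) -> float:
    """Rental cost in USD of a training run on `gpus` H100s."""
    return gpus * months * HOURS_PER_MONTH * PRICE_PER_H100_HOUR

def run_flops(gpus: float, months: float) -> float:
    """Total training FLOPs delivered at the assumed peak and utilisation."""
    seconds = months * HOURS_PER_MONTH * 3600
    return gpus * H100_BF16_FLOPS * UTILISATION * seconds

print(f"$3bn over 6 months: {gpus_affordable(3e9, 6):,.0f} H100s")
print(f"20K H100s for 4 months: ${run_cost(20_000, 4) / 1e6:.0f}m")
print(f"20K H100s for 4 months: {run_flops(20_000, 4):.1e} FLOPs")
```

Under these assumptions the results land close to the comment's rounded figures: a few hundred thousand H100s for $3bn, roughly $120 million for the 20K-GPU run, and about 8e25 FLOPs, or around twice the 4e25 quoted for Llama-3-405B.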
[1] For what it's worth: training the largest models requires building the training system yourself, which makes the market price of renting fewer GPUs from much smaller clusters not that relevant.