Counting AGIs
post by cash (cshunter), Will Taylor · 2024-11-26T00:06:17.845Z · LW · GW · 19 comments
“The resources used to train the model can be repurposed to run millions of instances of it (this matches projected cluster sizes by ~2027), and the model can absorb information and generate actions at roughly 10x-100x human speed. … We could summarize this as a ‘country of geniuses in a datacenter’.”
Dario Amodei, CEO of Anthropic, Machines of Loving Grace
“Let’s say each copy of GPT-4 is producing 10 words per second. It turns out they would be able to run something like 300,000 copies of GPT-4 in parallel. And by the time they are training GPT-5 it will be a more extreme situation where just using the computer chips they used to train GPT-5, using them to kind of run copies of GPT-5 in parallel, you know, again, each producing 10 words per second, they’d be able to run 3 million copies of GPT-5 in parallel. And for GPT-6, it’ll just increase again, there’ll be another factor of 10 at play, and so it’ll be 30 million copies in parallel.”
Tom Davidson, researcher at OpenPhil, Future of Life Institute interview
“Once we get to AGI, we won’t just have one AGI. … given inference GPU fleets by then, we’ll likely be able to run many millions of them (perhaps 100 million human-equivalents, and soon after at 10x+ human speed).”
Leopold Aschenbrenner, Situational Awareness pg. 47
Table of Contents
- Section I - The Question
- Section II - The Scenario
- Section III - Existing Estimates
- Section IV - Compute
- Section V - Inference
- Section VI - Human Equivalents
- Section VII - The Estimates
  - Method 1: Total training to inference per token ratio
  - Method 2: Flat inference costs
  - Method 3: Human brain equivalent
  - Method 4: Chip capabilities
  - Method 5: Adjusting for capabilities per token
- Section VIII - Implications
- Acknowledgements
Section I - The Question
What will the initial AGI population be?
Artificial intelligence (AI) systems have become significantly more capable and general in the last decade, especially since the launch of ChatGPT in December 2022. Many people believe that the technological trajectory of AI will lead to the advent of artificial general intelligence (AGI), an AI system that can autonomously do virtually anything a human professional can do. Leading AI scientists, like Geoffrey Hinton, Yoshua Bengio, and Shane Legg, are publicly raising the alarm that such a system is incoming. There are several AI enterprises premised on the business model of creating AGI (Anthropic, OpenAI, Safe Superintelligence, to name a few).
AGI would be a transformative technology, but the scale of transformation we should expect depends hugely on how many copies of AGI we can run simultaneously. If AGI is computationally expensive, we might only be able to run a small number, and the immediate post-AGI world would be virtually unchanged. Alternatively, if AGIs are computationally cheap, we might be able to run hundreds of millions or more, which would entail sudden and ubiquitous transformation. For a sense of scale, consider that a hundred million AGIs, each as productive as a typical American worker, would have an impact similar to doubling the US workforce, which in 2024 had 135 million full-time workers.
There are only a few calculations estimating the likely size of the initial AGI population. This post attempts to add some approaches, while also articulating major considerations for this kind of exercise along the way.
At a high level, our approach involves estimating two variables, namely, the total computing power (“compute”) that is likely to be available for instantiating AGIs, and the amount of compute likely to be required to run (“inference”) a single AGI. With these two variables, we can calculate the AGI population by dividing the available compute by a per-AGI inference rate:
Compute ÷ Inference per AGI = AGI Population
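For readers who want to follow the arithmetic throughout the post, here is a minimal sketch of that division in Python. The function and variable names are ours and purely illustrative; the example numbers are placeholders, not estimates.

```python
def agi_population(compute_flop_per_s: float, inference_flop_per_s_per_agi: float) -> float:
    """Divide the compute available for inference by the compute one AGI needs to run."""
    return compute_flop_per_s / inference_flop_per_s_per_agi

# Placeholder figures: 1e21 FLOP/s of cluster capacity and 1e15 FLOP/s per AGI
# would yield a population of one million.
print(agi_population(1e21, 1e15))  # 1e6
```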
One reason an AGI population estimate may be valuable is to assess the likelihood of AI systems rapidly recursively self-improving. If the AGI population is very small, perhaps human effort will still dominate capability gains. Another reason that the AGI population may be pivotal is in scenarios where recursive self-improvement does not occur. Before proceeding to the calculations, we articulate this second scenario.
Section II - The Scenario
AI Coldsnap
It is a common trope in predictions of AGI that such a system will recursively self-improve in a rapid takeoff resulting in superintelligence. The argument goes, AGI will be general enough to do anything humans can do, and one thing humans can do is work on improving AI capabilities. Current capabilities progress in AI has been driven by somewhere between thousands and tens of thousands of human researchers. If the number of human-level AI workers we can deploy to the task of machine learning (ML) research is in this range or higher, we should expect those AI workers to substantially accelerate progress in capabilities. These new capabilities would feed back into these AI workers, compounding in an intelligence explosion.
However, there are reasons to believe that recursive self-improvement and superintelligence might not immediately follow the advent of AGI. Below are five scenarios whose cumulative probability may be sufficient to militate against recursion:
- Plateau: There may be unexpected development plateaus that come into effect at around human-level intelligence. These plateaus could be architecture-specific (scaling laws break down; getting past AGI requires something outside the deep learning paradigm) or fundamental to the nature of machine intelligence.
- Pause: Government intervention could pause frontier AI development. Such a pause could be international. It is plausible that achieving or nearly achieving an AGI system would constitute exactly the sort of catalyzing event that would inspire governments to sharply and suddenly restrict frontier AI development.
- Collapse: Advances in AI are dependent on the semiconductor industry, which is composed of several fragile supply chains. A war between China and Taiwan is considered reasonably possible by experts and forecasters. Such an event would dramatically disrupt the semiconductor industry (not to mention the world economy). If this happens around the time that AGI is first developed, AI capabilities could be artificially suspended at human-level for years while computer chip supply chains and AI firms recover.
- Abstention: Many frontier AI firms appear to take the risks of advanced AI seriously, and have risk management frameworks in place (see those of Google DeepMind, OpenAI, and Anthropic). Some contain what Holden Karnofsky calls if-then commitments: “If an AI model has capability X, risk mitigations Y must be in place. And, if needed, we will delay AI deployment and/or development to ensure the mitigations can be present in time.” Commitments to pause further development may kick in at human-level capabilities. AGI firms might avoid recursive self-improvement to avoid existential or catastrophic risks.
- Windup: There are hard-to-reduce windup times in the production process of frontier AI models. For example, a training run for future systems may run into the hundreds of billions of dollars, consuming vast amounts of compute and taking months of processing. Other bottlenecks, like the time it takes to run ML experiments, might extend this windup period.
If any of these arguments hold up, development of the first AGI would be followed by a non-trivial period in which AI capabilities are about human-level and stay that way. This future scenario would be odd - the engine of AI progress stalling at the same time as AGI has been achieved and is perhaps transforming society, albeit at a fixed scale. This might feel less like an AI winter and more like an AI coldsnap, especially for scenarios where capabilities stop due to exogenous shock or government intervention.
In an AI coldsnap, the transformativeness of AGI would substantially depend on the initial AGI population. For example, in the event of a supply chain collapse, the compute clusters used for inferencing AI models would fail to benefit from the full force of Moore’s law, locking the AGI population into hardware built up prior to the shock.
Additionally, some of these coldsnap triggers have the interesting feature that they seem reasonably likely to occur specifically when AI capabilities approach human-level. Government intervention seems the likeliest to occur around human-level. Less-than-human-level AI systems may not capture the imagination of non-expert political leaders sufficiently to catalyze a binding international frontier development pause, whereas human-level systems would be immediately recognizable as dangerous to non-experts, and would also refute skepticism about AI capabilities (i.e. the refrain that “AI will never be able to do x” becomes implausible once AGI can do any x).
An architectural plateau could result from last mile problems in creating an AGI, making the first generation of AGI-like products only 90-95% general. Frontier science is hard, perhaps one of the hardest cognitive tasks, and that last 5-10% might be necessary for recursive self-improvement. These nearly AGI systems would still be quite general at 90-95%, and could be deployed widely across the economy (aside from e.g. science), transforming society in proportion to their population.
A multi-year post-AGI period of relatively flat capabilities is a distinct possibility, even if not clearly a probable outcome. It is therefore valuable to consider how many AGIs might exist during this time. Some calculations have been performed, which we move onto in the next section.
Section III - Existing Estimates
Tom Davidson, Leopold Aschenbrenner, Dario Amodei
In the course of researching this question we found three existing attempts to estimate the initial AGI population. These estimates are by Tom Davidson (at Open Philanthropy), Leopold Aschenbrenner (in a private capacity, after leaving OpenAI’s superalignment team), and Dario Amodei (CEO of Anthropic).
Tom Davidson
Tom Davidson is a Senior Research Analyst at Open Philanthropy. In a post on the blog Planned Obsolescence, Davidson calculates that OpenAI is likely able to inference a population of GPT-4s in the hundreds of thousands.
In footnote 2 of that post, Davidson imputes OpenAI’s compute available for inference by the amount of training compute in GPT-4’s training run. The logic here is that if you have x amount of compute for training, then you also have x amount of compute for inference. One measure of compute is floating-point operations or FLOP, which is what Davidson uses. He pins GPT-4’s training run at 3e25 FLOP (citing Epoch AI, though we note Epoch AI’s current number is 2e25). Davidson assumes training took 115 days, and calculates compute available for inference at 3e18 FLOP/s in the following manner:
3e25 FLOP ÷ 115 days ÷ 24 hours ÷ 60 minutes ÷ 60 seconds ≈ 3e18 FLOP/s
Davidson then estimates the inference required for one AI system. He does this by finding the inference required to produce a single token of output. In LLMs, tokens are word-pieces, approximately 3/4ths of a word according to OpenAI.[1] To generate one token, an LLM must compute a single “forward-pass” of a model’s weights, which roughly requires two FLOP for each parameter in the model. To get GPT-4’s parameter count, Davidson uses Chinchilla scaling, which asserts that, given a fixed compute budget, optimal training tokens and parameters should scale together at ~20:1.
Training Compute ≈ 6 × Parameters × Training Tokens[2]
Davidson obtains a parameter count around 5e11 and multiplies that by 2 to get 1e12 FLOP per token. Dividing 3e18 FLOP/s (available compute) by 1e12 FLOP/token (inference per token) results in ~3e6 (3 million) tokens per second. Davidson sets human equivalence at a benchmark of 10 tokens per second, translating the output of 3 million tokens per second to an AI population of 300,000.
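A minimal sketch of Davidson's back-of-the-envelope arithmetic as we reconstruct it, assuming GPT-4 took 3e25 FLOP to train over 115 days, the standard training-compute approximation of ~6 FLOP per parameter per training token with ~20 tokens per parameter, and a 10-tokens-per-second human benchmark. All names are ours.

```python
import math

TRAINING_FLOP = 3e25          # Davidson's figure for GPT-4 (Epoch AI now says ~2e25)
TRAINING_DAYS = 115
HUMAN_TOKENS_PER_S = 10       # Davidson's human-equivalence benchmark

# Compute available for inference, imputed from the training run.
available_flop_per_s = TRAINING_FLOP / (TRAINING_DAYS * 24 * 60 * 60)   # ~3e18

# Chinchilla-optimal parameter count: C ≈ 6 * N * D with D ≈ 20 * N, so N ≈ sqrt(C / 120).
parameters = math.sqrt(TRAINING_FLOP / 120)                             # ~5e11

# A forward pass costs roughly 2 FLOP per parameter per token.
flop_per_token = 2 * parameters                                         # ~1e12

tokens_per_s = available_flop_per_s / flop_per_token                    # ~3e6
population = tokens_per_s / HUMAN_TOKENS_PER_S                          # ~300,000
print(f"{population:,.0f} GPT-4 copies")
```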
To extrapolate to future models, Davidson says in footnote 3 “I make the simple assumption that GPT-5 will be the same as GPT-4 except for having 10X the parameters and being trained on 10X the data, and that GPT-6 will have an additional 10X parameters and 10X data.” Since inference increases 10x but available compute increases much more than 10x, the AI population increases faster than inference costs.
This ratio will only become more extreme as models get bigger. Once OpenAI trains GPT-5 it’ll have enough compute for GPT-5 to perform millions of tasks in parallel, and once they train GPT-6 it’ll be able to perform tens of millions of tasks in parallel.
We believe Davidson came up with his AGI population estimate in the course of researching and writing his AGI takeoff speeds report that was published in June 2023. In that report he also remarks on the potential AGI population:
Let’s say AGI requires 1e36 FLOP, one OOM more than the Bio Anchors median for TAI. And let’s say it runs in 1e16 FLOP/s. In this case, I think the possibility of trading-off runtime and training compute would significantly shorten timelines. Let’s assume that 10% of a year’s FLOP are ultimately used to train AGI. In that year, 1e37 FLOP were available in total. Let’s also assume that 10% of those FLOP are used to run AGIs doing AI software R&D: 1e36 FLOP. You could run ~3e12 AGIs doing software R&D (and more in total).
3e12 would be 3 trillion instances of AGI. Overall this puts Davidson’s range of AGIs between 30 million and 3 trillion, exploding from returns to software R&D before achieving superintelligence.
Leopold Aschenbrenner
In June 2024, Leopold Aschenbrenner published Situational Awareness. Aschenbrenner had previously worked on OpenAI’s superalignment team before publishing his forecast for the future of AI. In the document, he argues for an initial AGI population in the many millions, or perhaps hundreds of millions (p. 47). He explains his reasoning in footnote 35 on page 50:
… GPT-4 API costs less today than GPT-3 when it was released—this suggests that the trend of inference efficiency wins is fast enough to keep inference costs roughly constant even as models get much more powerful. Similarly, there have been huge inference cost wins in just the year since GPT-4 was released; for example, the current version of Gemini 1.5 Pro outperforms the original GPT-4, while being roughly 10x cheaper.
We can also ground this somewhat more by considering Chinchilla scaling laws. On Chinchilla scaling laws, model size—and thus inference costs—grow with the square root of training cost, i.e. half the OOMs of the OOM scaleup of effective compute. However, in the previous piece, I suggested that algorithmic efficiency was advancing at roughly the same pace as compute scaleup, i.e. it made up roughly half of the OOMs of effective compute scaleup. If these algorithmic wins also translate into inference efficiency, that means that the algorithmic efficiencies would compensate for the naive increase in inference cost.
In practice, training compute efficiencies often, but not always, translate into inference efficiency wins. However, there are also separately many inference efficiency wins that are not training efficiency wins. So, at least in terms of the rough ballpark, assuming the $/token of frontier models stays roughly similar doesn’t seem crazy.
Aschenbrenner assumes flat inference costs over time thanks to algorithmic efficiencies. This is a more aggressive assumption than Davidson’s, implying a bigger initial AGI population (hundreds of millions). Overall Aschenbrenner has an initial AGI population estimate between 100,000 and 100 million, or even billions if their operating speed (10x) is taken into account.
Dario Amodei
In Oct 2024, the CEO of Anthropic, Dario Amodei, published a blog post Machines of Loving Grace. In the post, Amodei envisions a “country of geniuses in a datacenter”:
The resources used to train the model can be repurposed to run millions of instances of it (this matches projected cluster sizes by ~2027), and the model can absorb information and generate actions at roughly 10x-100x human speed. … We could summarize this as a ‘country of geniuses in a datacenter’.
In footnote 5 he explains his reasoning:
5. This is roughly the current speed of AI systems – for example they can read a page of text in a couple seconds and write a page of text in maybe 20 seconds, which is 10-100x the speed at which humans can do these things. Over time larger models tend to make this slower but more powerful chips tend to make it faster; to date the two effects have roughly canceled out.
Amodei doesn’t provide as detailed a reasoning as Davidson and Aschenbrenner, but his estimate and what reasoning he does provide seems to conform roughly to both Davidson’s and Aschenbrenner’s. Overall Amodei expects millions of AGIs running 10-100x human speed (so effectively hundreds of millions of human-equivalents).
Section IV - Compute
How much compute will be available to instantiate AGIs?
The first term in our AGI population calculation is a variable for the amount of computing power that is available for instantiating AGIs:
Compute ÷ Inference per AGI = AGI Population
The central method existing estimates use to derive available compute is by imputing it from the size of training runs. We ultimately agree that this method makes sense, though it is worth considering ways the inference budget might be larger than what is imputed by training run size and providing reasons to reject them or incorporate them.
Multinational Compute
The training run imputation approach assumes a firm will reuse its compute for inference after training, but if firms have access to surplus compute for inference, their inference budget might be much larger than their training budget.
There exists a substantial number of supercomputers and datacentres across the world, and if inferencing AI is lucrative, these compute clusters may increasingly lease out compute to AI developers. Furthermore, inference compute does not need to be at the same quality of hardware as training compute, and can be done on older chips.[3]
One might therefore want to estimate the overall commercially available compute spread out across a country with major AI labs (the US, China, the UK) or spread out across an entire geopolitical bloc of friendly compute-rich nations. For example, if AGI is developed by a US company, that firm might be able to buy compute across the US as well as other developed economies like Canada, the United Kingdom, Japan, countries in the European Union, and others.
One reason the training run imputation approach is likely still solid is that competition between firms or countries will crowd out compute or compute will be excluded on national security grounds. Consider the two main actors that could build AGI. If a company builds AGI, they are unlikely to have easy access to commodified compute that they have not themselves built, since they will be in fierce competition with other firms buying chips and obtaining compute. If a government builds AGI, it seems plausible they would impose strict security measures on their compute, reducing the likelihood that anything not immediately in the project would be employable at inference.
We can also incorporate multiple firms or governments building AGI, by multiplying the initial AGI population by the number of such additional AGI projects. For example, 2x if we believe China and the US will be the only two projects, or 3x if we believe OpenAI, Anthropic, and DeepMind each achieve AGI.
Training to Inference Allocation
One consideration that weighs on the training run imputation approach is the relative allocation of compute between training and inference. For example, if a firm only uses a small fraction, say, 10%, of their compute budget on training runs, and the rest is allocated to inferencing the latest model, then when AGI is developed, we should have 10x the amount of compute involved in the training run for inference (the 10% used in training which we can convert over, plus the 90% always reserved for inference).
Epoch AI has researched this question and found the oddly convenient answer that approximately optimal distribution of resources between training and inference centres on 50:50. Epoch AI notes that this appears to be the actual distribution of training to inference compute used by OpenAI, based on public statements by Sam Altman.
While a firm might opt to use all of its compute budget for inference the moment it achieves AGI, it is also possible, even in an AI coldsnap, that training may continue afterward. For example, after developing a very inefficient AGI, the actor building it may continue training runs to improve algorithmic efficiency of an AGI of a given level of capabilities (such as in a plateau scenario). If the 50:50 ratio holds into the future, then the training run imputation approach directly translates into available compute. If training ceases after the first AGI, we multiply the compute available for inference by 2x.
Training Run Duration
One method of increasing the size and capabilities of a model is by simply training it for longer. For example, GPT-3 was trained for 35 days, while GPT-4 was trained for 95. This increases the amount of compute allocated to training, without requiring any more hardware, because you simply use the same hardware over a longer period of time. Since one of the terms in the training run imputation approach is the number of days of the training run, this number impacts the imputed compute available for inference.
So far, frontier AI models have not been trained for more than 180 days, with 100 days being a common duration (Epoch AI). Epoch AI argues: “This is because both the hardware and software used for a training run risks becoming obsolete at timescales longer than this, and no lab would want to release a model which has become outdated immediately upon release. This sets a practical limit on how long training runs can become.”
Since training run duration does not seem to increase significantly over time, we ignore it and choose the 100 day duration for all of our estimates.
Training Runs
In order to use the training run imputation approach, we need to establish how large we expect training runs to be in future years. For our numbers we lean on Epoch AI’s research on the question of future training run sizes. Epoch extrapolates based on the current aggressive scaling rate and investigates four possible limiting factors: power, chips, data, and the “latency wall” (a training speed limit to do with the speed of communication within and between chips).
Epoch AI ends their analysis in 2030, but have made their code freely available, so we ran it to get an idea of largest possible training runs through 2040, tabulated below. (Of course, uncertainty grows the further into the future we get, but these should be reasonable ballpark figures. It is interesting to note that according to these estimates power will become the limiting factor in 2030, and data in 2038.)
Year | Compute Projection | Limiting Factor |
2025 | 2.43e26 | Historic growth rate |
2026 | 9.70e26 | Historic growth rate |
2027 | 3.88e27 | Historic growth rate |
2028 | 1.55e28 | Historic growth rate |
2029 | 6.21e28 | Historic growth rate |
2030 | 1.90e29 | Power |
2031 | 3.10e29 | Power |
2032 | 4.90e29 | Power |
2033 | 7.60e29 | Power |
2034 | 1.20e30 | Power |
2035 | 1.80e30 | Power |
2036 | 2.60e30 | Power |
2037 | 3.70e30 | Power |
2038 | 4.90e30 | Data |
2039 | 5.60e30 | Data |
2040 | 6.50e30 | Data |
(a spreadsheet running these numbers to 2040 can be found here)
With compute estimates for training runs in FLOP, we can proceed to impute the amount of compute in FLOP/s that is theoretically available to the firm that creates AGI, supposing AGI is created in any of these given years. For example, the following table illustrates the FLOP/s that the AGI firm must logically be in possession of, at a minimum, assuming they trained their AGI over the course of 100 days and they continue training:
Year | Compute Projection | Imputed FLOP/s for 100 day runs |
2025 | 2.43e26 | 2.81e19 |
2026 | 9.70e26 | 1.12e20 |
2027 | 3.88e27 | 4.49e20 |
2028 | 1.55e28 | 1.79e21 |
2029 | 6.21e28 | 7.19e21 |
2030 | 1.90e29 | 2.20e22 |
2031 | 3.10e29 | 3.59e22 |
2032 | 4.90e29 | 5.67e22 |
2033 | 7.60e29 | 8.80e22 |
2034 | 1.20e30 | 1.39e23 |
2035 | 1.80e30 | 2.08e23 |
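As a check on the arithmetic, here is a short sketch that regenerates the imputed-FLOP/s column from the training-run projections above, assuming a 100-day run. The dictionary keys and values are simply copied from the table.

```python
SECONDS_PER_100_DAYS = 100 * 24 * 60 * 60  # 8.64e6 seconds

# Epoch AI-based training-run projections (FLOP), from the earlier table.
training_runs = {
    2025: 2.43e26, 2026: 9.70e26, 2027: 3.88e27, 2028: 1.55e28,
    2029: 6.21e28, 2030: 1.90e29, 2031: 3.10e29, 2032: 4.90e29,
    2033: 7.60e29, 2034: 1.20e30, 2035: 1.80e30,
}

# Imputed compute available for inference if the training cluster is reused.
imputed_flop_per_s = {year: flop / SECONDS_PER_100_DAYS for year, flop in training_runs.items()}

for year, flops in imputed_flop_per_s.items():
    print(year, f"{flops:.2e} FLOP/s")  # e.g. 2030 -> 2.20e+22 FLOP/s
```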
With plausible training runs and imputed compute figures, we can proceed to the next term in our AGI population equation.
Section V - Inference
Inference-intensity of the first AGIs
The second term in our AGI population equation is the amount of compute required to run (“inference”) a single AGI:
Compute ÷ Inference per AGI = AGI Population
There are different assumptions one can make about the inference-intensity of future AI models. Existing estimates from Davidson and Aschenbrenner make different assumptions. Davidson’s assumption is that inference-intensity rises slowly but not nearly as fast as available compute, while Aschenbrenner’s is that inference-intensity remains flat. Davidson’s method assumes that future models conform to Chinchilla optimal scaling, while Aschenbrenner’s assumption focuses on the empirical observation that API costs did not increase between GPT-3 and GPT-4.
Differential Progress
One argument in favour of Aschenbrenner’s assumption is that it accounts for algorithmic efficiencies that reduce the inference-intensity of future models. Google’s invention of the Transformer architecture, for example, was an algorithmic milestone in compute efficiency, reducing the compute necessary to train a translation model by 61x (OpenAI). Other innovations that have contributed substantially to algorithmic progress include quantization (which shrinks models by half or more without major performance degradation) and OpenAI’s mixture-of-experts system for GPT-4 (which saves substantial compute at inference by training well past Chinchilla optimal and doing a forward pass on only part of the model rather than the full weights).
However, since algorithmic efficiencies affect both training and inference, what we really care about is if there is differential progress. That is because algorithmic efficiencies that economize compute at training merely shorten timelines to AGI, whereas algorithmic efficiencies at inference increase the initial AGI population.
Aschenbrenner indicates he believes efficiencies are happening at the same rate for training and inference. Benjamin Todd’s best guess [LW(p) · GW(p)] is that inference compute efficiency will be 10x better by 2030 relative to 2024, and, if we are reading him correctly, he seems to imply equal improvements for training and inference.
One argument that supports equal improvements follows from Epoch AI’s research that optimal compute distribution between training and inference is about 50:50. If firms are spending equal amounts of their compute on both tasks, then optimal distribution of talent and attention should also be 50:50 so as to gain efficiencies across both domains.
In our AGI population estimates we primarily assume that algorithmic progress does not differ between inference and training.
Inference-Intensive Algorithms
While algorithmic progress can economize compute by finding efficiencies, some algorithmic innovations increase compute spent at inference to obtain greater capabilities. Leopold Aschenbrenner differentiates these two camps of algorithmic progress as in-paradigm progress (efficiencies) and unhobbling progress (more capabilities at the cost of more inference compute). Tom Davidson also anticipates this kind of algorithmic regime, considering it in footnote 102 of part II of his takeoff speeds report.[4]
One of the chief reasons current commercial LLMs are bad at many tasks is that they are productized as chatbots. An analogy is to imagine if you had to carry out your day’s work instantaneously and deprived of all external tools (no notepad, no calculator, just straight from your brain). You would likely find yourself making all the same mistakes an LLM makes - you might hazily remember the name and title of some paper you want to cite, blurting out something wrong (a “hallucination”) that sounds almost like the right answer; you would have to guess the answer to any complicated mathematical question by intuition, rather than have the chance to work through it step-by-step with pen and paper.
One way of spending inference compute to improve capabilities is by giving a model the opportunity to “think” before responding. This process would involve the LLM inferencing for minutes, hours, or days, before returning to the user with a final product. Since inferencing for more time is necessarily employing more compute, this form of algorithmic progress increases compute in exchange for capabilities. This change in the algorithmic regime is the idea behind OpenAI’s o1 model.
These capability gains can still be conceived of as compute efficiencies, since any capability gained from greater inference saves compute resources at training (what Epoch AI calls a “compute-equivalent-gain”). Aschenbrenner provides a table of how much “effective compute” he thinks a given amount of extra inferencing would potentially equate to:
Number of tokens | Equivalent to me working on something for… | |
100s | A few minutes | ChatGPT (we are here) |
1,000s | Half an hour | +1 OOMs test-time compute |
10,000s | Half a workday | +2 OOMs |
100,000s | A workweek | +3 OOMs |
Millions | Multiple months | +4 OOMs |
(Situational Awareness, pg. 35)
In Aschenbrenner’s table, an AI model productized with the ability to think for a workweek would be similar to two orders of magnitude of capabilities at training.
This type of algorithmic progress may shorten timelines (an AGI model is trainable on less compute because of capability gains from greater inference), but actually decrease the initial AGI population if the initial AGI population can only achieve human-level performance by spending lots of tokens sequentially.
Notably, an optimal product should show a progressive relationship between capabilities from training and capabilities from inference. For example, if an AI model has the baseline capability to correctly respond to the prompt “Who is that guy who starred in Edge of Tomorrow” with “Tom Cruise” when productized as a chatbot, then an algorithmically optimal version of that same AI model should be able to produce the same correct output (“Tom Cruise”) given the same input, without employing extra compute at inference (even though it could waste time doing so). It would seem obviously inefficient if an AI model with algorithms for thinking and agency “knew” the answer to a question “instinctively” (that is, could have answered “Tom Cruise” right away as a chatbot), but only outputted the answer after wasting inference compute in a multi-hour internet search involving deep pondering of the question. An optimally productized general agent AI model should employ compute at inference to obtain capabilities that are inaccessible when “hobbled” as a chatbot. Given that AI firms are improving algorithms at tremendous pace, we might reasonably assume producing this sort of algorithmic outcome is likely to happen if it is possible, and it would seem quite possible (surely an AI agent’s algorithms could include a procedure for using minimal inference compute on questions it can answer instantly with high confidence).
If AGI is simply an AI system with sufficient capabilities, and capabilities are acquirable from either training or inference, then it would seem naively possible for either approach to create an AGI. For example, perhaps GPT-4 is already capable of being AGI, if it is simply productized with the right algorithms, and you are willing to spend enough compute at inference. Likewise, perhaps a very very large training run, without thinking or agency, can one-shot any problem it is tasked with because it has the answers “innately”. This is the classic scaling hypothesis, which is supported by so-called “scaling laws” found by Kaplan et al, which show that “cross-entropy loss” scales “as a power-law with model size, dataset size, and the amount of compute used for training”. Perhaps scale is all you need to achieve a baseline of capabilities so deep and wide that LLMs can accurately answer virtually any question without needing to “think”.
In reality, it seems reasonable to expect firms to use both methods to achieve the suite of capabilities that counts as AGI. This should mean the first AGI is unlikely to possess capabilities that are human-level “out of the box”, and will instead achieve human-level performance by having both a highly capable baseline of capabilities from training a large model, and endowing it with the right algorithms at inference and enough inference compute to achieve human-level performance.
If general intelligence is achievable by properly inferencing a model with a baseline of capability that is lower than human-level, then we can account for the gap between baseline capabilities and inference-enabled capabilities by a number representing the capabilities per token. We can adjust our basic approach by adding a term for this:
Compute ÷ (Inference per AGI ÷ Capabilities per Token) = AGI Population
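One minimal way to fold this into the earlier sketch (again, the names are ours and purely illustrative):

```python
def agi_population(compute_flop_per_s: float,
                   inference_flop_per_s_per_agi: float,
                   capabilities_per_token: float = 1.0) -> float:
    # A capabilities-per-token below 1 means the model must spend proportionally more
    # tokens (and hence more inference compute) to match human-level output, shrinking
    # the population; above 1, it needs fewer tokens and the population grows.
    effective_inference = inference_flop_per_s_per_agi / capabilities_per_token
    return compute_flop_per_s / effective_inference
```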
The capabilities per token may or may not be equal to 1. As of 2024, AI systems have demonstrated extremely uneven capabilities. Some systems are narrowly superintelligent, such as AlphaGo, which plays the game of Go at a superhuman level but is incapable of anything else. More general-purpose systems, such as generative AI chatbots, can achieve high marks on numerous university-grade tests, but then fail surprisingly easy tests, like counting how many “r”s there are in “strawberry”. AI systems exhibiting this extremely uneven suite of capabilities have been referred to as “unbalanced programs” (Cotra, part 1, p. 23), or “uneven and peaky” (Aschenbrenner p. 62), or simply “dumbsmart” (Kelly), creating a “jagged frontier” of AI strengths and weaknesses across task types (Mollick).
The actual model that is AGI may be more capable per token than humans in some domains (capabilities per token > 1) and less in others (capabilities per token < 1), and in some domains, pretty close to human level (capabilities per token ~ 1). If the average is ~1, then the system is AGI. Using the average allows us to at least get a general sense of these systems, even if future work should aim to be domain-specific.
Intuitively, it seems plausible that highly advanced AI systems may continue to be unbalanced and have quirky, hard-to-predict failure modes in their reasoning and agency that may be difficult to understand and rapidly improve upon. It may be the case that the first AGI retains artefacts like these due to AI coldsnap scenarios where capabilities are paused or plateau, resulting in a stable post-AGI capabilities per token that is not equal to 1.
What are some reasonable capabilities per token rates? A range spanning a few orders of magnitude below 1 seems reasonable:
Capabilities per Token | Token Multiplier |
1 | 1x, no additional tokens |
0.1 | 10x more tokens |
0.01 | 100x more tokens |
0.001 | 1,000x more tokens |
0.0001 | 10,000x more tokens |
To illustrate, suppose the capabilities per token of some future model on some task turn out to be 0.001. This would mean that for this model to achieve human-level performance, it needs to compute 1,000x more tokens than a human would need to “compute” on that task. If one were to count the number of words a human worker actually generates in their thoughts (thinking algorithm) and writing (chatbot algorithm), we could express their work as some number of tokens over the period of the task. Human-professional performance for this hypothetical AI model would then take 1,000x that number of tokens, given a capabilities per token of 0.001. Perhaps we can imagine that the AI model, when it tries to do the same task, does the same kinds of things (monologuing, researching, writing words down for future reference, and eventually producing a writeup), but much more laboriously. If we were to look at the AI’s “internal monologue”, we would notice numerous ways the AI system is inefficient - perhaps it makes unnecessary mistakes that require later correction (costing inference compute that the human did not need to employ), or perhaps it is worse at prioritization.
With a good sense of our assumptions for the inference-intensity of future AI models we can proceed to choosing human-equivalence benchmarks.
Section VI - Human Equivalents
What is one, singular, AGI?
Before we can count AGI population, we need to know what counts as one AGI. This means we need to elaborate on the per AGI part of our equation:
Compute ÷ Inference per AGI = AGI Population
One potential approach to this is to imagine AGIs as individually independent agents. Some forecasting work on future AI systems looks at them from this perspective. For example, Ajeya Cotra’s Bio Anchors report (part 1, p. 22) focuses on the notion of future AI systems that are like “drop-in remote workers”. Leopold Aschenbrenner in Situational Awareness also characterizes the future of AI workers this way:
I expect us to get something that looks a lot like a drop-in remote worker. An agent that joins your company, is onboarded like a new human hire, messages you and your colleagues on Slack and uses your softwares, makes pull requests, and that, given big projects, can do the model-equivalent of a human going away for weeks to independently complete the project. … The drop-in remote worker will be dramatically easier to integrate - just, well, drop them in to automate all the jobs that could be done remotely. It seems plausible … by the time the drop-in remote worker is able to automate a large number of jobs, intermediate models won’t yet have been fully harnessed or integrated - so the jump in economic value generated could be somewhat discontinuous. … We are on course for AGI by 2027. These AI systems will basically be able to automate basically all cognitive tasks (think: all jobs that could be done remotely). (p. 37-38, 41)
While in the future we might count the AGIs by counting the actually-existing number of independent AI agents, this approach has the downside that it requires separately accounting for how much faster and smarter these AIs are. For example, suppose you had one drop-in remote worker AI that operates at 10x the speed of a normal human worker. Wouldn’t such an AI system really be more like 10 human workers?
Further, the way that AGI systems are productized could defy easy counting. An AGI might exist less as a series of autonomous agents and more like a single central system that “exists” across numerous computers as needed. In this case, “one” such AGI system would in fact be doing the work of many humans, e.g. a whole company.
If we care more about getting a sense of how transformative AGI will be in human terms, it might be more productive to consider how many human professional equivalents AGI is like. To do this we need to bridge the work done by humans in a way that is comparable to the work done by AI systems.
There are likely many potential economic and psychological anchors for constituting a single AGI. For our purposes, we consider two types of anchors:
- Word-based anchors, and
- Human brain FLOP anchor
Word-Based Anchors
One potential way of anchoring a single AGI to a human professional is by asserting a word production rate for human professionals and then finding how many output tokens an LLM needs to meet that rate.
This approach is considered by Aschenbrenner (p. 50), who benchmarks human word production in an internal monologue to 100 tokens per minute. Using OpenAI’s conversion rate of 1 token to 3/4 of a word, Aschenbrenner’s 100 tokens per minute would equate to 1.25 words per second. Tom Davidson’s approach benchmarks to 10 words per second (he also says 10 tokens per second; we assume he is rounding) as a reasonable over-estimate of how much a single person’s writing, thinking, and speaking might sum to over time.
We agree with the basic logic that it is reasonable to add up human word output in typing, in reading, and in one’s thinking or internal monologue over time, and use the average of that for the words per period of time.
Human typing has been assessed as averaging 52 words per minute (0.9 per second). Average human reading speed has been estimated at 238 words per minute (3.96 per second). These numbers present a reasonable range of the tokens per second that would be required for AI systems to match human performance:
Human Benchmark | Source |
13.33 tokens per second | Davidson |
5.28 tokens per second | Human reading speed |
1.66 tokens per second | Aschenbrenner |
1.16 tokens per second | Human typing speed |
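These token rates follow directly from the word rates via OpenAI's ~3/4-words-per-token conversion; a quick sketch (the small differences from the table are rounding):

```python
WORDS_PER_TOKEN = 0.75  # OpenAI's rough conversion rate

word_rates_per_s = {
    "Davidson (10 words/s)": 10.0,
    "Human reading speed (238 wpm)": 238 / 60,
    "Aschenbrenner (100 tokens/min, i.e. 1.25 words/s)": 1.25,
    "Human typing speed (52 wpm)": 52 / 60,
}

for label, words_per_s in word_rates_per_s.items():
    tokens_per_s = words_per_s / WORDS_PER_TOKEN
    print(f"{label}: {tokens_per_s:.2f} tokens/s")
# ~13.33, ~5.29, ~1.67, and ~1.16 tokens per second respectively
```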
Human Brain FLOP Anchor
Another potential way to benchmark AGI is to use overall human brain activity in terms of computation. Joe Carlsmith produced an extensive report in 2020 for Open Philanthropy attempting to find a reasonable number for how much FLOP/s human brain activity is likely equivalent to. Here is his topline number:
Overall, I think it more likely than not that 10^15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create).
Disanalogies with Humans
In Tom Davidson’s takeoff report (p. 78), he notes some disanalogies with humans:
- No sleep/leisure (3x, since humans work only 8 hours a day)
- Better motivation (2x)
- Faster serial speed (10x)
If the human professional can only work 8 hours a day, then 3x more human professional equivalents from 24/7 AGI compute clusters seems quite reasonable. We can incorporate this multiplier along with other factors like the number of AGI projects.
We are less confident in the other disanalogies. Motivation gains may be cancelled out by the first AGI having other non-human quirks, like getting stuck in rabbit holes or strange loops that humans would never go down, but which perhaps map quite well onto the sorts of things that reduce human productivity under the motivation banner. Faster serial speed is better accounted for by summing available compute and dividing it by the benchmark, rather than multiplying the benchmark after the fact. We therefore exclude these two considerations.
Section VII - The Estimates
Several Methods for an Initial AGI Population
We are now ready to estimate the AGI population by dividing available compute by inference per AGI.
Compute ÷ Inference per AGI = AGI Population
There are a number of ways of doing this. We walk through several methods that make different assumptions.
Method 1: Total training to inference per token ratio
One naive way of estimating the initial AGI population is to assume that the amount of inference compute per token required for a given model is in direct proportion to the total training compute required to make that model. For example, the training run of GPT-3 was 3.14e23 FLOP, and its inference requirement per token is 3.5e11 FLOP (2 × GPT-3’s parameter count of 175 billion). That gives a Training-to-Inference Ratio of 900 billion to 1 (rounded up).
If we retain our assumptions about total available compute, and we assume this fixed ratio, this gives us the curious result that our AGI population estimate is the same for any training run size we declare to be AGI. This is the case because total available compute is derived from the training run, and we are asserting that that training run is proportional to inference per token.
For example, suppose Epoch AI’s largest training run before 2030 is sufficiently large to create an AGI model. The largest training run they believe is possible before 2030 is 2e29 FLOP. If this training run takes 100 days, we can impute the available FLOP/s for inference:
Compute = 2e29 ÷ 100 days ÷ 24 hours ÷ 60 minutes ÷ 60 seconds
Which equals 2.31e22 FLOP/s, and gives us the Compute variable for our equation:
2.31e22 ÷ Inference per AGI = AGI Population
Since both inference per AGI and total available compute are derived from the same number, the ratio between Compute and Inference per AGI will always be the same, no matter the size of the training run. Nonetheless, to complete this method, if we calculate the inference per token by applying the Training-to-Inference Ratio to 2e29 FLOP of training compute, we get 2.2e17 FLOP per token of inference. We can multiply this figure by our human-equivalence benchmarks to get the AGI populations:
Benchmark | Compute ÷ Inference per AGI | AGI Population |
13.33 tokens/s | 2.31e22 FLOP/s ÷ 2.96e18 FLOP/AGI | = 7,814 |
5.28 tokens/s | 2.31e22 FLOP/s ÷ 1.17e18 FLOP/AGI | = 19,729 |
1.66 tokens/s | 2.31e22 FLOP/s ÷ 3.69e17 FLOP/AGI | = 62,751 |
1.16 tokens/s | 2.31e22 FLOP/s ÷ 2.56e17 FLOP/AGI | = 90,234 |
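A minimal sketch of this arithmetic for the 2e29 FLOP case, using the GPT-3-derived ratio (all inputs as above; small differences from the table are rounding):

```python
TRAIN_TO_INFERENCE_RATIO = 3.14e23 / 3.5e11   # GPT-3: ~9e11 to 1
TRAINING_FLOP_2030 = 2e29
SECONDS_PER_100_DAYS = 100 * 24 * 60 * 60

available_flop_per_s = TRAINING_FLOP_2030 / SECONDS_PER_100_DAYS          # ~2.31e22
inference_flop_per_token = TRAINING_FLOP_2030 / TRAIN_TO_INFERENCE_RATIO  # ~2.2e17

for tokens_per_s in (13.33, 5.28, 1.66, 1.16):   # human-equivalence benchmarks
    inference_per_agi = inference_flop_per_token * tokens_per_s
    print(tokens_per_s, round(available_flop_per_s / inference_per_agi))
# roughly 7,800 / 19,700 / 62,500 / 89,500 AGIs
```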
If we use GPT-4, we get different results because the ratio between training compute and inference per token is different. GPT-4 is believed to have 1.76 trillion parameters. If a forward-pass takes 2 FLOP per parameter, that’s 3.52e12 FLOP per token of inference (in actuality, GPT-4 is a mixture-of-experts model, and therefore perhaps only 10% of the weights are activated during a forward-pass, which would give us a lower number of 3.52e11).
GPT-4’s total training compute is believed to be 2.1e25 FLOP, which gives a Ratio of roughly 6 trillion to 1, or roughly 60 trillion to 1 under the mixture-of-experts assumption. If GPT-4 was trained over 100 days, then compute should equal 2.43e18 FLOP/s, giving us the following AGI populations:
Benchmark | AGI Population (full weights, ~6 trillion to 1) | AGI Population (mixture-of-experts, ~60 trillion to 1) |
13.33 tokens/s | 51,800 AGIs | 518,004 |
5.28 tokens/s | 130,776 AGIs | 1.3 million |
1.66 tokens/s | 415,963 AGIs | 4.2 million |
1.16 tokens/s | 595,258 AGIs | 5.9 million |
One odd implication of this method is that training per token decreases relative to inference per token. To understand this point, break total training compute into training compute per token × tokens in training run to give:
(Training Compute per token × Tokens) ÷ Inference Compute (per token) = Training-to-Inference Ratio
Since the number of tokens will increase and we are assuming the training-to-inference ratio remains constant, training compute per token must increase less than inference compute per token.
Overall this approach was our first and least appealing one. Its construction is the least justifiable, since it seems hard to imagine future scenarios where a ratio between total compute and inference per token emerges.
Method 2: Flat inference costs
Another assumption one can make about inference costs per token is that they remain flat over time. In this method, the total compute required for AGI dictates the AGI population. For example, if training GPT-4 involved 2.1e25 FLOP and training some hypothetical future GPT involves 2.1e29 FLOP over the same number of days, then the population of AIs would be 10,000x greater for the 2.1e29 system than for the 2.1e25 system, because four orders of magnitude separate them.
This method requires grounding an initial AI population in current systems, such as GPT-4, before extrapolating into the future.
OpenAI does not release full ChatGPT usage statistics, so the best we can do is use estimates from informed industry observers. Last year, SemiAnalysis estimated that ChatGPT had 13 million daily active users, each served around 15 queries per day with an average response of 2,000 tokens. In November 2023, OpenAI announced that they had 100 million weekly active users; by August 2024, that had grown to over 200 million. We can therefore double the SemiAnalysis numbers to reach an order-of-magnitude estimate of current daily usage, though the true number may well be higher.
2 × daily users × responses per user × tokens per response ÷ seconds/day = tokens/s
2 × 13,000,000 × 15 × 2,000 ÷ 86,400 ≈ 9 million tokens/s
OpenAI is therefore potentially inferencing 9 million tokens/s on average, which we can divide by our different word-based human-professional-equivalent benchmarks to get a “population” for GPT-4:
Benchmark | GPT-4 “Population” |
13.33 tokens/s | 675,168 GPT-4s |
5.28 tokens/s | 1.7 million GPT-4s |
1.66 tokens/s | 5.42 million GPT-4s |
1.16 tokens/s | 7.76 million GPT-4s |
Let’s take one of these benchmarks, Tom Davidson’s ~10 words per second, and consider what the AGI population would be if AGI were achieved for different training runs through the next decade:
Year | Training Run (Epoch AI) | Increased By | AGI Population |
GPT-4 | 2.1e25 | - | 675,168 |
2025 | 2.43e26 | 11.57x | 7.8 million |
2026 | 9.70e26 | 3.99x | 31.2 million |
2027 | 3.88e27 | 4.00x | 124.7 million |
2028 | 1.55e28 | 3.99x | 498.3 million |
2029 | 6.21e28 | 4.01x | 1.996 billion |
2030 | 1.90e29 | 3.06x | 6.108 billion |
2031 | 3.10e29 | 1.63x | 9.966 billion |
2032 | 4.90e29 | 1.58x | 15.754 billion |
2033 | 7.60e29 | 1.55x | 24.435 billion
2034 | 1.20e30 | 1.58x | 38.581 billion
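Under the flat-inference-cost assumption, the extrapolation is just a proportional scale-up of the GPT-4 "population" by the growth in training compute. A short sketch of the arithmetic behind the table above (names are ours):

```python
GPT4_TRAINING_FLOP = 2.1e25
GPT4_TOKENS_PER_S = 9e6               # from the usage estimate above (~9 million tokens/s)
DAVIDSON_BENCHMARK = 13.33            # tokens/s per human-professional equivalent

gpt4_population = GPT4_TOKENS_PER_S / DAVIDSON_BENCHMARK     # ~675,000

# With flat FLOP/token, the population scales linearly with training compute.
training_runs = {2027: 3.88e27, 2030: 1.90e29, 2034: 1.20e30}  # Epoch AI projections
for year, flop in training_runs.items():
    print(year, f"{gpt4_population * flop / GPT4_TRAINING_FLOP:,.0f}")
# ~125 million (2027), ~6.1 billion (2030), ~38.6 billion (2034), matching the table up to rounding
```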
If inference costs are flat, even using the most exacting word-based human performance benchmark results in an initial AGI population that is very large. This makes sense, since the possibility of future compute scaleup is enormous and current inference per token is not too onerous.
Method 3: Human brain equivalent
Another approach to counting the AGIs is to switch to our raw FLOP benchmark. Joe Carlsmith estimates that the human brain performs the equivalent of about 10^15 (1e15) FLOP/s in his research for Open Philanthropy, How Much Computational Power Does It Take to Match the Human Brain?:
“Overall, I think it more likely than not that 10^15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create).”
In this method we simply assert the inference cost of AGI is 1e15 FLOP/s, and can divide the total compute for a variety of years to get an estimate of the AGI population at those times. Continuing our assumptions about training duration and imputability of inference compute from training run size, we get the following populations:
Year | Training Run (Epoch AI) | Implied Compute | AGI Population |
2025 | 2.43e26 | 2.81e19 | 28,125 |
2026 | 9.70e26 | 1.12e20 | 112,269 |
2027 | 3.88e27 | 4.49e20 | 449,074 |
2028 | 1.55e28 | 1.79e21 | 1.79 million |
2029 | 6.21e28 | 7.19e21 | 7.19 million |
2030 | 1.90e29 | 2.20e22 | 21.99 million |
2031 | 3.10e29 | 3.59e22 | 35.88 million |
2032 | 4.90e29 | 5.67e22 | 56.71 million |
2033 | 7.60e29 | 8.80e22 | 87.96 million
2034 | 1.20e30 | 1.39e23 | 138.89 million
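In code, this method is a single division of the imputed FLOP/s by Carlsmith's brain estimate; for the 2030 figure, for example:

```python
HUMAN_BRAIN_FLOP_PER_S = 1e15                               # Carlsmith's central estimate
imputed_flop_per_s_2030 = 1.90e29 / (100 * 24 * 60 * 60)    # ~2.20e22 FLOP/s

print(round(imputed_flop_per_s_2030 / HUMAN_BRAIN_FLOP_PER_S))
# ~22 million AGIs in 2030 (the table's 21.99 million)
```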
These numbers are quite a bit lower than the other estimates, more closely resembling a country in a datacentre, though the “genius” of its inhabitants would depend on capabilities per token.
Method 4: Chip capabilities
Another approach is to consider the capabilities and number of future GPUs.
The SemiAnalysis article on GPT-4’s architecture provides an estimate of the number of tokens GPT-4 can produce per unit cost of compute. Benjamin Todd’s article on AI inference [LW · GW] builds on this to estimate that GPT-4 running on Nvidia’s H100 chips can output approximately 256 tokens per second. Importantly, these estimates account for the fact that GPU throughput is often limited not by FLOPs but by memory bandwidth (as mentioned earlier; see footnote 3). Epoch AI projects that a frontier AI company may have access to 1.02e8 H100 equivalents in 2030.
Next we need to estimate the size of a 2030 model. In terms of training compute, the Epoch AI projection suggests it may be ~10,000x larger than GPT-4. Contributions to this compute increase will come from model size (number of parameters) and training set size, with Hoffmann et al suggesting equal contributions should come from both. That means we might expect the model to have ~100x more parameters than GPT-4.
How does that translate into inference speed? The naive approach is to assume it is linearly correlated with model size, in which case we could just divide by 100. Lu et al find that runtime memory usage is generally linearly correlated with parameters. We are uncertain if that holds for both memory and memory bandwidth, but do not have a better assumption at present. We might therefore expect the throughput of H100s running a 2030 model to be ~2.56 tokens per second, with total output per second being 1.02e8 × 2.56 = 261,120,000 tokens. Using our human-equivalence benchmarks, we get the following AGI populations:
Benchmark | AGI Population (2030) |
13.33 tokens/s | 19.6 million |
5.28 tokens/s | 49.5 million |
1.66 tokens/s | 157.3 million |
1.16 tokens/s | 225.1 million |
A further adjustment may be required to account for the probably improved memory bandwidth of future chips. Todd suggests the H100 is limited to ~15% of its theoretical upper bound, and that future chips might approach ~50% of theoretical upper bound. Adjusting for that (AGI Populations × (0.5 ÷ 0.15)) would give the following populations:
Benchmark | AGI Population (2030) |
13.33 tokens/s | 65.3 million |
5.28 tokens/s | 164.7 million |
1.66 tokens/s | 524.3 million |
1.16 tokens/s | 750.3 million |
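A sketch of this chip-based arithmetic under the assumptions above (the H100-equivalent count, per-chip throughput, and bandwidth-utilisation figures are the cited estimates, not measurements of ours):

```python
H100_EQUIVALENTS_2030 = 1.02e8            # Epoch AI projection for a frontier lab
GPT4_TOKENS_PER_S_PER_H100 = 256          # Todd's estimate, memory-bandwidth limited
MODEL_SIZE_FACTOR = 100                   # ~100x more parameters than GPT-4 by 2030
BANDWIDTH_UTILISATION_GAIN = 0.5 / 0.15   # future chips closer to theoretical bandwidth

tokens_per_s_per_chip = GPT4_TOKENS_PER_S_PER_H100 / MODEL_SIZE_FACTOR   # ~2.56
total_tokens_per_s = H100_EQUIVALENTS_2030 * tokens_per_s_per_chip       # ~2.6e8

for benchmark in (13.33, 5.28, 1.66, 1.16):      # human-equivalence benchmarks
    base = total_tokens_per_s / benchmark
    print(benchmark, round(base), round(base * BANDWIDTH_UTILISATION_GAIN))
# e.g. at 13.33 tokens/s: ~19.6 million AGIs, or ~65.3 million with the bandwidth adjustment
```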
Method 5: Adjusting for capabilities per token
Methods 1 through 4 have given some staggering numbers. However, common to all of these methods is the presumption that capabilities per token are equal to 1. What happens if we break that assumption, adding a term to our equation?
Compute ÷ (Inference per AGI ÷ Capabilities per Token) = AGI Population
Using our schedule of plausible Capabilities per Token numbers, we can modify our initial AGI population estimates from methods 1 through 4. For this we’ll benchmark to human reading speed and use 2030 as the year of AGI:
Capabilities per Token | Method 1 Rising inference | Method 2 Flat inference | Method 3 Human brain | Method 4 Chips |
0.1 | 13,077 | 610,800,000 | 2,120,000 | 4,950,000 |
0.01 | 1,307 | 61,080,000 | 212,000 | 495,000 |
0.001 | 130 | 6,108,000 | 21,200 | 49,500
0.0001 | 13 | 610,800 | 2,200 | 4,950 |
0.00001 | 1 | 61,080 | 220 | 495 |
Section VIII - Implications
Transformativeness of AGI
Our calculations suggest that the initial AGI population will likely fall somewhere between tens of thousands and hundreds of millions of human-professional equivalents, with most methods pointing toward the higher end of this range. This wide spread has profound implications for how transformative the emergence of AGI might be, even in an AI coldsnap scenario where capabilities remain at roughly human level for an extended period of time.
If we take an equal-weighted average of every AGI population estimate printed in Methods 1 through 4, across all years considered, we get an average guess of 2.8 billion AGIs. If we remove the top 10 and bottom 10 most extreme numbers, we get 61 million AGIs. Note that this is an average over all the years under consideration. In practice, a more valuable way to use this report may be to pick a year of interest and interrogate the numbers for that year.
We also have multipliers we can apply to these numbers:
- 3x for no sleep
- 4x for multiple AGI projects
- 2x for switching training to inference
Together that gives us an overall average weighted number of 67.2 billion or a less extreme 1.5 billion (equal weighted after removing top 10 and bottom 10).
If recursive self-improvement requires at least matching the human ML researcher population of perhaps a thousand to tens of thousands, then most of our estimates suggest easily surpassing this threshold. Even after accounting for substantial inference-intensity via reduced capabilities per token, most of our estimates remain in the tens or hundreds of thousands or more.
At our lowest estimates (~40,000-100,000 AGIs), the immediate impact of AGI would be significant but not necessarily transformative at a societal level. This would be roughly equivalent to adding a mid-sized city's worth of human professionals to the economy. Given that AI systems are liable to have unbalanced and peaky capabilities, the first AGIs may transform a number of industries where their value substantially outpaces humans in the same jobs.
At the higher end (~100-300 million AGIs), we would see an effective doubling or tripling of the workforce of the United States. This would likely trigger rapid and massive economic restructuring. AGIs would likely not only be deployed where their skills are strongest, but could in principle automate all jobs immediately automatable by a drop-in remote worker.
Acknowledgements
The authors would like to thank BlueDot Impact for providing the opportunity to research and write this project. We wrote it as part of BlueDot Impact's AI Safety Fundamentals course.
- ^
You can play around with OpenAI’s tokenizer tool to get a sense of the letters per token.
- ^
Where Training Tokens ≈ 20x Parameters. The general form of this equation, C = 2DN, is originally from Kaplan et al’s 2020 paper establishing scaling laws.
- ^
This is because the main bottleneck for inference is memory bandwidth, among other reasons articulated well by IFP (see e.g. footnote 47): How to Build the Future of AI in the United States.
- ^
Davidson: “... at some earlier time you will have been able to perform 100% of tasks by using a huge amount of runtime FLOP. Each ‘AGI equivalent’ will have taken lots of compute to run, but because you have so much compute lying around you can do it. So initially, your AGI workforce will be smaller than your human one, because it’s so compute-expensive to run.”
19 comments
Comments sorted by top scores.
comment by Steven Byrnes (steve2152) · 2024-11-26T04:04:29.696Z · LW(p) · GW(p)
I understand that you’re basically assuming that the “initial AGI population” is running on only the same amount of compute that was used to train that very AGI. It’s fine to make that assumption but I think you should emphasize it more. There are a lot of situations where that’s not an appropriate assumption, but rather the relevant question is “what’s the AGI population if most of the world’s compute is running AGIs”.
For example, if the means to run AGIs (code, weights, whatever) gets onto the internet, then everybody all over the world would be doing that immediately. Or if a power-seeking AGI escapes human control, then a possible thing it might do is work to systematically get copies of itself running on most of the world’s compute. Or another possible thing it might do is wipe out humanity and then get copies of itself running on most of the world’s compute, and then we’ll want to know if that’s enough AGIs for a self-sufficient stable supply chain (see “Argument 2” here [LW · GW]). Or if we’re thinking more than a few months after AGI becomes possible at all, in a world like today’s where the leader is only slightly ahead of a gaggle of competitors and open-source projects, then AGI would again presumably be on most of the world’s compute. Or if we note that a company with AGI can make unlimited money by renting more and more compute to run more AGIs to do arbitrary remote-work jobs, then we might guess that they would decide to do so, which would lead to scaling up to as much compute around the world as money can buy.
OK, here’s the part of the post where you justified your decision to base your analysis on one training run worth of compute rather than one planet worth of compute, I think:
One reason the training run imputation approach is likely still solid is that competition between firms or countries will crowd out compute or compute will be excluded on national security grounds. Consider the two main actors that could build AGI. If a company builds AGI, they are unlikely to have easy access to commodified compute that they have not themselves built, since they will be in fierce competition with other firms buying chips and obtaining compute. If a government builds AGI, it seems plausible they would impose strict security measures on their compute, reducing the likelihood that anything not immediately in the project would be employable at inference.
The first part doesn’t make sense to me:
Let’s say Company A can make AGIs that are drop-in replacements for highly-skilled humans at any existing remote job (including e.g. “company founder”), and no other company can. And Company C is a cloud provider. Then Company A will be able to outbid every other company for Company C’s cloud compute, since Company A is able to turn cloud compute directly into massive revenue. It can just buy more and more cloud compute from C and every other company, funding itself with rapid exponential growth, until the whole world is saturated.
If Company A and Company B can BOTH make AGIs that are drop-in replacements for highly-skilled humans, and Company C doesn’t do AI research but is just a giant cloud provider, then Company A and Company B will bid against each other to rent Company C’s compute, and no other bidders will be anywhere close to those two. It doesn’t matter whether Company A or Company B wins the auction—Company C’s compute is going to be running AGIs either way. Right?
Next, the second part.
Yes it’s possible that a government would be sufficiently paranoid about IP theft (or loss of control or other things) that it doesn’t want to run its AGI code on random servers that it doesn’t own itself. (We should be so lucky!) It’s also possible that a company would make the same decision for the same reason. Yeah OK, that’s indeed a scenario where one might be interested in the question of what AGI population you get for its training compute. But that’s really only relevant if the government or company rapidly does a pivotal act, I think. Otherwise that’s just an interesting few-month period of containment before AGIs are on most of the world’s compute as above.
we found three existing attempts to estimate the initial AGI population
FWIW Holden Karnofsky wrote a 2022 blog post “AI Could Defeat All Of Us Combined” that mentions the following: “once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each.” Brief justification in his footnote 5. Not sure that adds much to the post, it just popped into my head as a fourth example.
~ ~ ~
For what it’s worth, my own opinion [LW · GW] is that 1e14 FLOP/s is a better guess than 1e15 FLOP/s for human brain compute, and also that we should divide all the compute in the world including consumer PCs by 1e14 FLOP/s to guess (what I would call) “initial AGI population”, for all planning purposes apart from pivotal acts. But you’re obviously assuming that AGI will be an LLM, and I’m assuming that it won’t, so you should probably ignore my opinion. We’re talking about different things. Just thought I’d share anyway ¯\_(ツ)_/¯
Replies from: nathan-helm-burger, ryan_b, nathan-helm-burger
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-26T18:25:21.120Z · LW(p) · GW(p)
FWIW, I spent a few months researching and thinking specifically about the brain FLOP/s question, and carefully compared my conclusions to Joe Carlsmith's. With a different set of sources, and different reasoning paths, I also came to an estimate approximately centered on 1e15 FLOP/s. If I were to try to be more specific than nearest OOM, I'd move that slightly upward, but still below 5e15 FLOP/s. This is just one more reasonably-well-researched opinion though. I'd be fascinated by having a conversation about why 1e14 FLOP/s might be a better estimate.
A further consideration is that this is an estimate for the compute that occurs over the course of seconds. Which, for simplicity's sake, can focus on action potentials. For longer term brain processes, you need to take into account fractional shares of relatively-slow-but-high-complexity processes like protein signalling cascades and persistent changes in gene expression and protein structures in the cytoplasm and membrane receptor populations. I mention this not because it changes the FLOP/s estimate much (since these processes are relatively slow), but because keeping these in mind should shape one's intuition about the complexity of the computation and learning processes that are occurring. I feel like some set of people greatly underestimates this complexity, while a different set overestimates it. Relevant further thoughts from me on this: https://www.lesswrong.com/posts/uPi2YppTEnzKG3nXD/nathan-helm-burger-s-shortform?commentId=qCSJ2nPsNXC2PFvBW [LW(p) · GW(p)]
Replies from: steve2152
↑ comment by Steven Byrnes (steve2152) · 2024-11-27T16:11:59.469Z · LW(p) · GW(p)
I'd be fascinated by having a conversation about why 1e14 FLOP/s might be a better estimate.
I think I don’t want to share anything publicly beyond what I wrote in Section 3 here [LW · GW]. ¯\_(ツ)_/¯
For longer term brain processes, you need to take into account fractional shares of relatively-slow-but-high-complexity processes
Yeah I’ve written about that too (here [LW · GW]). :) I think that’s much more relevant to how hard it is to create AGI rather than how hard it is to run AGI.
But also, I think it’s easy to intuitively mix up “complexity” with “not-knowing-what’s-going-on”. Like, check out this code, part of an AlphaZero-chess clone project. Imagine knowing nothing about chess, and just looking at a minified (or compiled) version of that code. It would feel like an extraordinarily complex, inscrutable, mess. But if you do know how chess works and you’re trying to write that code in the first place, no problem, it’s a few days of work to get it basically up and running. And it would no longer feel very complex to you, because you would have a framework for understanding it.
By analogy, if we don’t know what all the protein cascades etc. are doing in the brain, then they feel like an extraordinarily complex, inscrutable, mess. But if you have a framework for understanding them, and you’re writing code that does the same thing (e.g. sets certain types of long-term memory traces in certain conditions, or increments a counter variable, or whatever) in your AGI, then that code-writing task might feel pretty straightforward.
↑ comment by ryan_b · 2024-11-27T15:32:18.933Z · LW(p) · GW(p)
Let’s say Company A can make AGIs that are drop-in replacements for highly-skilled humans at any existing remote job (including e.g. “company founder”), and no other company can. And Company C is a cloud provider. Then Company A will be able to outbid every other company for Company C’s cloud compute, since Company A is able to turn cloud compute directly into massive revenue. It can just buy more and more cloud compute from C and every other company, funding itself with rapid exponential growth, until the whole world is saturated.
I think this is outside the timeline under consideration. Transforming compute into massive revenue is still gated by the ability of non-AGI enabled customers to decide to spend with Company A; regardless of price the ability of Company C to make more compute available to sell depends quite a bit on the timelines of their contracts with other companies, etc. The ability to outbid the whole rest of the world for commercial compute already crosses the transformational threshold, I claim. This remains true regardless of whether it is a single dominant bidder or several.
I think the timeline we are looking at is from initial launch through the first round of compute-buy. This still leaves all normal customers of compute as bidders, so I would expect the amount of additional compute going to AGI to be a small fraction of the total.
Though let the record reflect that, based on the other details in the estimate, this could still be an enormous increase in the population.
Replies from: steve2152
↑ comment by Steven Byrnes (steve2152) · 2024-11-27T16:31:58.354Z · LW(p) · GW(p)
Yeah it’s fine to assume that there might be some period of time that (1) the AGIs don’t escape control, (2) the code doesn’t leak or get stolen, (3) nobody else reinvents the same thing, (4) Company A doesn’t have infinite capital (yet) to spend on renting cloud compute (or the contracts haven’t yet been signed or whatever). And it’s fine to be curious about how many AGIs would Company A have available during this period of time.
And then a key question is whether anything happens during that period of time that would change what happens after that period of time. (And if not, then the analysis isn't too important.) A pivotal act would certainly qualify. I'm kinda cynical in this area; I think the most likely scenario by far is that nothing happens during this period that has an appreciable impact on what happens afterwards. Like, I'm sure that Company A will try to get their AGIs to beat benchmarks, do scientific research, make money, etc. I also expect them to have lots of very serious meetings, both internally and with government officials. But I don't expect that Company A would succeed at making the world resilient to future out-of-control AGIs, because that's just a crazy hard thing to do even with millions of intent-aligned AGIs at your disposal. I discussed some of the practical challenges at What does it take to defend the world against out-of-control AGIs? [LW · GW].
Well anyway. My comment above was just saying that the OP could be clearer on what they’re trying to estimate, not that they’re wrong to be trying to estimate it. :)
Replies from: Will Taylor
↑ comment by Will Taylor · 2024-11-27T19:20:49.209Z · LW(p) · GW(p)
There are a lot of situations where that’s not an appropriate assumption, but rather the relevant question is “what’s the AGI population if most of the world’s compute is running AGIs”.
Agreed. It would be interesting to extend this to answer that question and in-between scenarios (like having access to a large chunk of the compute in China or the US + allies).
FWIW Holden Karnofsky wrote a 2022 blog post “AI Could Defeat All Of Us Combined” that mentions the following: “once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each.” Brief justification in his footnote 5.
Thanks for pointing us to this. It looks to be the same method as our method 3.
Yeah it’s fine to assume that there might be some period of time that (1) the AGIs don’t escape control, (2) the code doesn’t leak or get stolen, (3) nobody else reinvents the same thing, (4) Company A doesn’t have infinite capital (yet) to spend on renting cloud compute (or the contracts haven’t yet been signed or whatever). And it’s fine to be curious about how many AGIs would Company A have available during this period of time.
We think that period might be substantial, for reasons discussed in Section II.
Replies from: steve2152
↑ comment by Steven Byrnes (steve2152) · 2024-11-27T21:13:23.491Z · LW(p) · GW(p)
Yeah it’s fine to assume that there might be some period of time that (1) the AGIs don’t escape control, (2) the code doesn’t leak or get stolen, (3) nobody else reinvents the same thing, (4) Company A doesn’t have infinite capital (yet) to spend on renting cloud compute (or the contracts haven’t yet been signed or whatever). And it’s fine to be curious about how many AGIs would Company A have available during this period of time.
We think that period might be substantial, for reasons discussed in Section II.
I don’t think Section II is related to that. Again, the question I’m asking is How long is the period where an already-existing AGI model type / training approach is only running on the compute already owned by the company that made that AGI, rather than on most of the world’s then-existing compute? If I compare that question to the considerations that you bring up in Section II, they seem almost entirely irrelevant, right? I’ll go through them:
Plateau: There may be unexpected development plateaus that come into effect at around human-level intelligence. These plateaus could be architecture-specific (scaling laws break down; getting past AGI requires something outside the deep learning paradigm) or fundamental to the nature of machine intelligence.
That doesn’t prevent any of those four things I mentioned: it doesn’t prevent (1) the AGIs escaping control and self-reproducing, nor (2) the code / weights leaking or getting stolen, nor (3) other companies reinventing the same thing, nor (4) the AGI company (or companies) having an ability to transform compute into profits at a wildly higher exchange rate than any other compute customer, and thus making unprecedented amounts of money off their existing models, and thus buying more and more compute to run more and more copies of their AGI (e.g. see the “Everything, Inc.” scenario of §3.2.4 here [LW · GW]).
Pause: Government intervention could pause frontier AI development. Such a pause could be international. It is plausible that achieving or nearly achieving an AGI system would constitute exactly the sort of catalyzing event that would inspire governments to sharply and suddenly restrict frontier AI development.
That definitely doesn’t prevent (1) or (2), and it probably doesn’t prevent (3) or (4) either depending on implementation details.
Collapse: Advances in AI are dependent on the semiconductor industry, which is composed of several fragile supply chains. A war between China and Taiwan is considered reasonably possible by experts and forecasters. Such an event would dramatically disrupt the semiconductor industry (not to mention the world economy). If this happens around the time that AGI is first developed, AI capabilities could be artificially suspended at human-level for years while computer chip supply chains and AI firms recover.
That doesn’t prevent any of (1,2,3,4). Running an already-existing AGI model on the world’s already-existing stock of chips is unrelated to how many new chips are being produced. And war is not exactly a time when governments tend to choose caution and safety over experimenting with powerful new technologies at scale. Likewise, war is a time when rival countries are especially eager to steal each other’s military-relevant IP.
Abstention: Many frontier AI firms appear to take the risks of advanced AI seriously, and have risk management frameworks in place (see those of Google DeepMind, OpenAI, and Anthropic). Some contain what Holden Karnofsky calls if-then commitments: “If an AI model has capability X, risk mitigations Y must be in place. And, if needed, we will delay AI deployment and/or development to ensure the mitigations can be present in time.” Commitments to pause further development may kick at human-level capabilities. AGI firms might avoid recursive self-improvement to avoid existential or catastrophic risks.
That could be relevant to (1,2,4) with luck. As for (3), it might buy a few months, before Meta and the various other firms and projects that are extremely dismissive of the risks of advanced AI catch up to the front-runners.
Windup: There are hard-to-reduce windup times in the production process of frontier AI models. For example, a training run for future systems may run into the hundreds of billions of dollars, consuming vast amounts of compute and taking months of processing. Other bottlenecks, like the time it takes to run ML experiments, might extend this windup period.
That doesn’t prevent any of (1,2,3,4). Again, we’re assuming the AGI already exists, and discussing how many servers will be running copies of it, and how soon. The question of training next-generation even-more-powerful AGIs is irrelevant to that question. Right?
Replies from: Will Taylor
↑ comment by Will Taylor · 2024-11-28T11:06:11.125Z · LW(p) · GW(p)
Plateau: There may be unexpected development plateaus that come into effect at around human-level intelligence. These plateaus could be architecture-specific (scaling laws break down; getting past AGI requires something outside the deep learning paradigm) or fundamental to the nature of machine intelligence.
That doesn’t prevent any of those four things I mentioned: it doesn’t prevent (1) the AGIs escaping control and self-reproducing, nor (2) the code / weights leaking or getting stolen, nor (3) other companies reinventing the same thing, nor (4) the AGI company (or companies) having an ability to transform compute into profits at a wildly higher exchange rate than any other compute customer, and thus making unprecedented amounts of money off their existing models, and thus buying more and more compute to run more and more copies of their AGI
It doesn't prevent (1) but it does make it less likely. A 'barely general' AGI is less likely to be able to escape control than an ASI. It doesn't prevent (2). We acknowledge (3) in section IV: "We can also incorporate multiple firms or governments building AGI, by multiplying the initial AGI population by the number of such additional AGI projects. For example, 2x if we believe China and the US will be the only two projects, or 3x if we believe OpenAI, Anthropic, and DeepMind each achieve AGI." We think there are likely to be a small number of companies near the frontier, so this is likely to be a modest multiplier. Re. (4), I think ryan_b made relevant points. I would expect some portion of compute to be tied up in long-term contracts. I agree that I would expect the developer of AGI to be able to increase their access to compute over time, but it's not obvious to me how fast that would be.
Pause: Government intervention could pause frontier AI development. Such a pause could be international. It is plausible that achieving or nearly achieving an AGI system would constitute exactly the sort of catalyzing event that would inspire governments to sharply and suddenly restrict frontier AI development.
That definitely doesn’t prevent (1) or (2), and it probably doesn’t prevent (3) or (4) either depending on implementation details.
I mostly agree on this one, though again think it makes (1) less likely for the same reason. As you say, the implementation details matter for (3) and (4), and it's not clear to me that it 'probably' wouldn't prevent them. It might be that a pause would target all companies near the frontier, in which case we could see a freeze at AGI for its developer, and near AGI for competitors.
Abstention: Many frontier AI firms appear to take the risks of advanced AI seriously, and have risk management frameworks in place (see those of Google DeepMind, OpenAI, and Anthropic). Some contain what Holden Karnofsky calls if-then commitments: “If an AI model has capability X, risk mitigations Y must be in place. And, if needed, we will delay AI deployment and/or development to ensure the mitigations can be present in time.” Commitments to pause further development may kick at human-level capabilities. AGI firms might avoid recursive self-improvement to avoid existential or catastrophic risks.
That could be relevant to (1,2,4) with luck. As for (3), it might buy a few months, before Meta and the various other firms and projects that are extremely dismissive of the risks of advanced AI catch up to the front-runners.
Again, mostly agreed. I think it's possible that the development of AGI would precipitate a wider change in attitude towards it, including at other developers. Maybe it would be exactly what is needed to make other firms take the risks seriously. Perhaps it's more likely it would just provide a clear demonstration of a profitable path and spur further acceleration though. Again, we see (3) as a modest multiplier.
Windup: There are hard-to-reduce windup times in the production process of frontier AI models. For example, a training run for future systems may run into the hundreds of billions of dollars, consuming vast amounts of compute and taking months of processing. Other bottlenecks, like the time it takes to run ML experiments, might extend this windup period.
That doesn’t prevent any of (1,2,3,4). Again, we’re assuming the AGI already exists, and discussing how many servers will be running copies of it, and how soon. The question of training next-generation even-more-powerful AGIs is irrelevant to that question. Right?
The question of training next-generation even-more-powerful AGIs is relevant to containment, and is therefore relevant to how long a relatively stable period running a 'first generation AGI' might last. It doesn't prevent (2) and (3). It doesn't prevent (4) either, though presumably a next-gen AGI would further increase a company's ability in this regard.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-26T18:39:35.730Z · LW(p) · GW(p)
In regards to whether early AGI will start out as an LLM, I find myself both agreeing and disagreeing with you Steven. I do think that "LLM alone" will almost certainly not be a good description of even the early crude AGIs. On the other hand, I do think that "multiple LLMs embedded in a complex scaffolding system that does various sorts of RL in reaction to interactions with simulators/human-feedback/real-world-sensors-and-actuators" is a pretty reasonable guess. In which case, the bulk of the compute would still be in those component LLMs, and thus compute estimates based on LLMs would still be relevant for approximate estimates.
comment by Matt Goldenberg (mr-hire) · 2024-11-26T16:03:53.192Z · LW(p) · GW(p)
while this paradigm of 'training a model that's an agi, and then running it at inference' is one way we get to transformative agi, i find myself thinking that probably WON'T be the first transformative AI, because my guess is that there are lots of tricks using lots of compute at inference to get not quite transformative ai to transformative ai.
my guess is that getting to that transformative level is gonna require ALL the tricks and compute, and will therefore eke out being transformative BY utilizing all those resources.
one of those tricks may be running millions of copies of the thing in an agentic swarm, but i would expect that to be merely a form of inference time scaling, and therefore wouldn't expect ONE of those things to be transformative AGI on its own.
and i doubt that these tricks can funge against train time compute, as you seem to be assuming in your analysis. my guess is that you hit diminishing returns for various types of train compute, then diminishing returns for various types of inference compute, and that we'll get to a point where we need to push both of them to that point to get transformative ai
Replies from: nathan-helm-burger, Will Taylor
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-26T18:33:28.238Z · LW(p) · GW(p)
Okay, so I am inclined to agree with Matt that the scenario of "crazy inefficient hacks burning absurd amounts of inference compute" would likely be a good description of the very first ever instance of an AGI.
However!
How long would that situation last? I expect, not long enough to be strategically relevant enough to include in a forecast like this one. If such inefficiencies in inference compute are in place, and the system was trained on and is running on many orders of magnitude more compute than the human brain runs on... Surely there's a huge amount of low-hanging fruit which the system itself will be able to identify to render itself more efficient. Thus, in just a few hours or days you should expect a rapid drop in this inefficiency, until the low-hanging fruit is picked and you end up closer to the estimates in the post.
If this is correct, then the high-inefficiency-initial-run is mainly relevant for informing the search space of the frontier labs for scaffolding experiments.
Replies from: mr-hire
↑ comment by Matt Goldenberg (mr-hire) · 2024-11-26T18:48:45.174Z · LW(p) · GW(p)
Why do you imagine this? I imagine we'd get something like one Einstein from such a regime, which would maybe increase the timelines over existing AI labs by 1.2x or something? Eventually this gain compounds but I imagine that could be relatively slow and smooth, with the occasional discontinuous jump when something truly groundbreaking is discovered
Replies from: nathan-helm-burger
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-26T19:22:19.054Z · LW(p) · GW(p)
I'm not sure how to answer this in a succinct way. I have rather a lot of ideas on the subject, including predictions about several likely ways components x/y/z may materialize. I think one key piece I'd highlight is that there's a difference between:
- coming up with a fundamental algorithmic insight that then needs not only experiments to confirm but also a complete retraining of the base model to take advantage of
- coming up with other sorts of insights that offer improvements to the inference scaffolding or adaptability of the base model, which can be rapidly and cheaply experimented on without needing to retrain the base model.
It sounds to me that the idea of scraping together a system roughly equivalent to an Albert Einstein (or Ilya Sutskever or Geoffrey Hinton or John von Neumann) would put us in a place where there were improvements that the system itself could seek in type 1 or type 2. The trajectory you describe around gradually compounding gains sounds like what I imagine type 1 to look like in a median case. I think there's also some small chance for getting a lucky insight and having a larger type 1 jump forwards. More importantly for expected trajectories is that I expect type 2 insights to have a very rapid feedback cycle, and thus even while having a relatively smooth incremental improvement curve the timeline for substantial improvements would be better measured in days than in years.
Does that make sense? Am I interpreting you correctly?
Replies from: mr-hire
↑ comment by Matt Goldenberg (mr-hire) · 2024-11-26T22:02:47.565Z · LW(p) · GW(p)
I still don't quite get it. We already have an Ilya Sutskever who can make type 1 and type 2 improvements, and we don't see the sort of jumps in days you're talking about (I mean, maybe we do, and they just look discontinuous because of the release cycles?)
↑ comment by Will Taylor · 2024-11-27T19:49:33.757Z · LW(p) · GW(p)
while this paradigm of 'training a model that's an agi, and then running it at inference' is one way we get to transformative agi, i find myself thinking that probably WON'T be the first transformative AI, because my guess is that there are lots of tricks using lots of compute at inference to get not quite transformative ai to transformative ai.
Agreed that this is far from the only possibility, and we have some discussion of increasing inference time to make the final push up to generality in the bit beginning "If general intelligence is achievable by properly inferencing a model with a baseline of capability that is lower than human-level..." We did a bit more thinking around this topic which we didn't think was quite core to the post, so Connor has written it up on his blog here: https://arcaderhetoric.substack.com/p/moravecs-sea
and i doubt that these tricks can funge against train time compute, as you seem to be assuming in your analysis.
Our method 5 is intended for this case - we'd use an appropriate 'capabilities per token' multiplier to account for needing extra inference time to reach human level.
comment by Cleo Nardo (strawberry calm) · 2024-11-26T03:02:56.730Z · LW(p) · GW(p)
Thanks for putting this together — very useful!
comment by Anders Lindström (anders-lindstroem) · 2024-11-26T21:51:14.057Z · LW(p) · GW(p)
Thanks for writing this post!
I don't know what the correct definition of AGI is, but to me it seems that AGI is ASI. Imagine an AI that is at super expert level in most (>95%) subjects, that has access to pretty much all human knowledge, is capable of digesting millions of tokens at a time, and can draw inferences and conclusions from that in seconds. "We" normally have a handful of real geniuses per generation. So now a simulated person that is like Stephen Hawking in physics, Terence Tao in math, Rembrandt in painting etc etc, all at the same time. Now imagine that you have "just" 40,000-100,000 of these simulated persons, able to communicate at the speed of light and to use all the knowledge in the world within milliseconds. I think that will be a very transformative experience for our society from the get go.
Replies from: Will Taylor
↑ comment by Will Taylor · 2024-11-27T19:35:14.232Z · LW(p) · GW(p)
Our pleasure!
I'm not convinced a first generation AGI would be "super expert level in most subjects". I think it's more likely they'd be extremely capable in some areas but below human level in others. (This does mean the 'drop-in worker' comparison isn't perfect, as presumably people would use them for the stuff they're really good at rather than any task.) See the section which begins "As of 2024, AI systems have demonstrated extremely uneven capabilities" for more discussion of this and some relevant links. I agree on the knowledge access and communication speed, but think they're still likely to suffer from hallucination (if they're LLM-like) which could prove limiting for really difficult problems with lots of steps.
Replies from: anders-lindstroem
↑ comment by Anders Lindström (anders-lindstroem) · 2024-11-27T22:32:42.463Z · LW(p) · GW(p)
It's interesting that you mention hallucination as a bug/artefact; I think hallucination is what we humans do all day and every day when we are trying to solve a new problem. We think up a solution we really believe is correct, and then we try it and more often than not realize that we had it all wrong, and we try again and again and again. I think AIs will never be free of this; I just think it will be part of their creative process, just as it is in ours. It took Albert Einstein a decade or so to figure out relativity theory, and I wonder how many times he "hallucinated" a solution that turned out to be wrong during those years. The important part is that he could self-correct and dive deeper and deeper into the problem and finally solve it. I firmly believe that AI will very soon be very good at self-correcting, and if you then give your "remote worker" a day or 10 to think through a really hard problem, not even the sky will be the limit...