We are headed into an extreme compute overhang
post by devrandom · 2024-04-26T21:38:21.694Z
Contents: Definitions · Thesis · From AGI to ASI · Counterpoints · Conclusion · Appendix - Training Data Volume · References
If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running on the order of 1,000,000 concurrent instances of the model.
Definitions
Although there is some debate [LW · GW] about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.
I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here [LW · GW]).
I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.
Thesis
For practical reasons, the compute required to train an LLM is several orders of magnitude larger than the compute required to run a single inference instance. In particular, a single NVIDIA H100 GPU can run inference at a throughput of about 2,000 tokens/s, while Meta trained Llama3 70B on a GPU cluster[1] of about 24,000 GPUs. Assuming we require a performance of 30–40 tokens/s per instance, the training cluster can run roughly 1.2–1.6 million concurrent instances of the resulting 70B model (24,000 GPUs × 2,000 tokens/s ÷ 30–40 tokens/s).
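A quick back-of-envelope version of that calculation (the figures are the rough ones quoted above):

```python
# Rough overhang estimate using the figures above (all approximate).
gpus_in_training_cluster = 24_000   # Llama3 70B training cluster
tokens_per_s_per_gpu = 2_000        # approximate H100 inference throughput for a 70B model

for required_tps in (30, 40):       # assumed tokens/s needed per AGI instance
    instances = gpus_in_training_cluster * tokens_per_s_per_gpu / required_tps
    print(f"{required_tps} tokens/s -> {instances:,.0f} concurrent instances")
# 30 tokens/s -> 1,600,000 ; 40 tokens/s -> 1,200,000
```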
I will assume that the above ratios hold for an AGI level model. Considering the amount of data children absorb via the vision pathway, the amount of training data for LLMs may not be that much higher than the data humans are trained on, and so the current ratios are a useful anchor. This is explored further in the appendix [LW · GW].
Given the above ratios, we will have the capacity for ~1e6 AGI instances at the moment that training is complete. This will likely lead to superintelligence [LW · GW] via a "collective superintelligence" approach. Additional speed may then be available via accelerators such as GroqChip, which produces 300 tokens/s for a single instance of a 70B model. This would result in a "speed superintelligence" or a combined "speed+collective superintelligence".
From AGI to ASI
With 1e6 AGIs, we may be able to construct an ASI, with the AGIs collaborating in a "collective superintelligence". Similar to groups of collaborating humans, a collective superintelligence divides tasks among its members for concurrent execution.
AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.
Tasks that are inherently serial would benefit more from a speedup than from a division of tasks. An accelerator such as GroqChip will be able to accelerate serial thought speed by a factor of 10x or more.
Counterpoints
- It may be the case that a collective of sub-AGI models can reach AGI capability. It would be advantageous if we could achieve AGI earlier, with sub-AGI components, at a higher hardware cost per instance. This would reduce the compute overhang at the critical point in time.
- There may be a paradigm change on the path to AGI resulting in smaller training clusters, reducing the overhang at the critical point.
Conclusion
A single AGI may be able to replace one human worker, presenting minimal risk. A fleet of 1,000,000 AGIs may give rise to a collective superintelligence. This capability is likely to be available immediately upon training the AGI model.
We may be able to mitigate the overhang by achieving AGI with a cluster of sub-AGI components.
Appendix - Training Data Volume
A calculation of training data processed by humans during development:
- time: ~20 years ≈ 6e8 seconds
- raw sensory input rate: ~10 Mbit/s = 1e7 bits/s
- total human training data: 6e8 s × 1e7 bits/s = 6e15 bits
- Llama3 training data: 1.5e13 tokens × 16 bits/token ≈ 2e14 bits
The amount of data used to train current-generation LLMs (~2e14 bits) is within roughly a factor of 30 of the amount processed by humans during childhood (~6e15 bits).
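The same arithmetic as a short script (a sketch using the rough figures above):

```python
# Appendix arithmetic (inputs are the rough estimates above).
seconds_of_childhood = 20 * 365 * 24 * 3600       # ~6e8 s
human_bits = seconds_of_childhood * 1e7           # ~10 Mbit/s sensory input -> ~6e15 bits
llama3_bits = 1.5e13 * 16                         # ~1.5e13 tokens * 16 bits/token -> ~2e14 bits
print(f"human: {human_bits:.1e} bits, Llama3: {llama3_bits:.1e} bits, "
      f"ratio: {human_bits / llama3_bits:.0f}x")  # ~26x, i.e. within ~1.5 orders of magnitude
```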
References
- Measuring hardware overhang [LW · GW] - discusses a slightly different dynamic - hardware overhang related to algorithmic improvements
- [added] Before smart AI, there will be many mediocre or specialized AIs [LW · GW]
[1] Two clusters are actually in production, and a 400B model is still being trained.
33 comments
comment by faul_sname · 2024-04-26T22:39:36.935Z · LW(p) · GW(p)
AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.
I think this only holds if fine tunes are composable, which as far as I can tell they aren't (fine tuning on one task subtly degrades performance on a bunch of other tasks, which isn't a big deal if you fine tune a little for performance on a few tasks but does mean you probably can't take a million independently-fine-tuned models and merge them into a single super model of the same size with the same performance on all million tasks).
Also there are sometimes mornings where I can't understand code I wrote the previous night when I had all of the necessary context fresh to me, despite being the same person. I expect that LLMs will exhibit the same behavior of some things being hard to understand when examined out of the context which generated them.
That's not to say a world in which there are a billion copies of GPT-5 running concurrently will have no major changes, but I don't think a single coherent ASI falls out of that world.
↑ comment by gwern · 2024-05-01T21:39:04.439Z · LW(p) · GW(p)
I think this only holds if fine tunes are composable, which as far as I can tell they aren't
You know 'finetunes are composable', because a finetune is just a gradient descent step on a batch of data and a parameter update, and if you train on more than one GPU and share updates, DL training still works {{citation needed}}.
If you can train asynchronously on a thousand, or 20,000, or 100,000 GPUs, that is what you are doing; this is especially true in DRL, where you might be, say, training across 170,000 CPU-cores. This works because you don't insist on everything being up to date every moment and you accept that there will be degrees of inconsistency/outdatedness. (You are certainly not accumulating the gradient across the entire cluster by waiting for every single node, pausing everything, calculating a single global step, and pushing it out, and only then resuming, as if it were a single GPU! Really, you don't even want to do that on a single GPU for DRL if you gotta go fast.) This works so well that people will casually talk about training "an" AlphaZero, even though they actually mean something more like "the 512 separate instances of AlphaZero we are composing finetunes of" (or more).*
You do have issues with stale gradients and off-policyness of updates and how to best optimize throughput of all of the actors vs training nodes and push out model updates efficiently so nodes stop executing outdated parameters as quickly as possible, and DeepMind & OpenAI etc have done a lot of work on that - but at that point, as in the joke, you have conceded that finetunes are composable and you can keep a very large number of replicas in sync, and it is merely a matter of haggling over how much efficiency you lose.
Also note that it takes a lot less compute to keep a model up to date doing simple online learning on new data than it does to train it from scratch on all historical data summed together (obviously), so what devrandom is talking about is actually a lot easier than creating the model in the first place.
A better model to imagine is not "somehow finetunes from millions of independent models magically compose" (although actually they would compose pretty well), but more like, "millions of independent actors do their ordinary business, while spending their spare bandwidth downloading the latest binary delta from peer nodes (which due to sparsity & not falling too far out of sync, is always on the order of megabytes, not terabytes), and once every tens of thousands of forward passes, discover a novel or hard piece of data, and mail back a few kilobytes of text to the central training node of a few thousand GPUs, who are continually learning on the hard samples being passed back to them by the main fleet, and who keep pushing out an immediately updated model to all of the actor models, and so 'the model' is always up to date and no instance is more than hours out of date with 'the model' (aside from the usual long tail of stragglers or unhealthy nodes which will get reaped)".
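A toy sketch of that sync pattern (numpy only; the "model", the "hard data", and every constant here are stand-ins for illustration, not a real training setup):

```python
# Actors run on a slightly-stale copy of the weights, mail "hard" data back to a
# central trainer, and periodically pull a compact delta to get back in sync.
import numpy as np

rng = np.random.default_rng(0)
dim = 1_000
central = rng.normal(size=dim)                  # "the model" on the training cluster
replicas = [central.copy() for _ in range(8)]   # the actor fleet, each slightly stale

def train_step(weights, hard_batch, lr=0.01):
    # stand-in for a gradient step on the hard examples sent back by actors
    return weights + lr * (hard_batch.mean(axis=0) - weights)

for step in range(100):
    # each replica "discovers" occasional hard data and mails it back
    hard_batch = np.stack([w + rng.normal(scale=0.1, size=dim) for w in replicas])
    central = train_step(central, hard_batch)

    if step % 10 == 0:                          # periodic sync: push a compact delta
        for i, w in enumerate(replicas):
            delta = central - w                 # in practice: sparse/quantized, megabytes not terabytes
            replicas[i] = w + delta

staleness = max(np.linalg.norm(central - w) for w in replicas)
print(f"max replica drift since last sync: {staleness:.3f}")
```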
* I fear this is one of those cases where our casual reification of entities leads to poor intuitions, akin to asking 'how many computers are in your computer you are using right now?'; usually, the answer is just '1', because really, who cares how exactly your 'smartphone' or 'laptop' or 'desktop' or 'server' is made up of a bunch of different pieces of silicon - unless you're discussing something like device performance or security, in which case it may matter quite a lot and you'd better not think of yourself as owning 'a' smartphone.
↑ comment by faul_sname · 2024-05-01T22:50:54.815Z · LW(p) · GW(p)
I think we may be using words differently. By "task" I mean something more like "predict the next token in a nucleotide sequence" and less like "predict the next token in this one batch of training data that is drawn from the same distribution as all the other batches of training data that the parallel instances are currently training on".
It's not an argument that you can't train a little bit on a whole bunch of different data sources, it's an argument that running 1.2M identical instances of the same model leaves a lot of predictive power on the table compared to having those models specialize. For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained both on the 20TB GenBank dataset and on all 14TB of code on Github.
Once you have a bunch of specialized models "the weights are identical" and "a fine tune can be applied to all members" no longer holds.
↑ comment by gwern · 2024-05-02T00:36:50.750Z · LW(p) · GW(p)
For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained both on the 20TB GenBank dataset and on all 14TB of code on Github.
I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).
Once you have a bunch of specialized models "the weights are identical" and "a fine tune can be applied to all members" no longer holds.
Nor do I see how this is relevant to your original claim. If you have lots of task-specialist models, how does this refute the claim that those will be able to coordinate? Of course they will. They will just share weight updates in exactly the way I just outlined, which works so well in practice. You may not be able to share parameter-updates across your protein-only and your Python-only LLMs, but they will be able to share updates within that model family and the original claim ("AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.") remains true, no matter how you swap out your definition of 'model'.
DL models are fantastically good at collaborating and updating each other, in many ways completely impossible for humans, whether you are talking about AGI models or narrow specialist models.
↑ comment by faul_sname · 2024-05-02T01:39:03.126Z · LW(p) · GW(p)
I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).
Man, that Li et al paper has pretty wild implications if it generalizes. I'm not sure how to square those results with the Chinchilla paper though (I'm assuming it wasn't something dumb like "wall-clock time was better with larger models because training was constrained by memory bandwidth, not compute")
In any case, my point was more "I expect dumb throw-even-more-compute-at-it approaches like MoE, which can improve their performance quite a bit at the cost of requiring ever more storage space and ever-increasing inference costs, to outperform clever attempts to squeeze more performance out of single giant models". If models just keep getting bigger while staying monolithic, I'd count that as pretty definitive evidence that my expectations were wrong.
Edit: For clarity, I specifically expect that MoE-flavored approaches will do better because, to a first approximation, sequence modelers will learn heuristics in order of most to least predictive of the next token. That depends on the strength of the pattern and the frequency with which it comes up.
As a concrete example, the word "literally" occurs with a frequency of approximately 1/100,000. About 1/6,000 times it occurs, the word "literally" is followed by the word "crying", while about 1/40,000 of occurrences of the word "literally" are followed by "sobbing". If you just multiply it out, you should assume that if you saw the word "literally", the word "crying" should be about 7x more likely to occur than the word "sobbing". One of the things a language model could learn, though, is that if your text is similar to text from the early 1900s, that ratio should be more like 4:1, whereas if it's more like text from the mid 1900s it should be more like 50:1. Learning the conditional effect of the year of authorship on the relative frequencies of those 2-grams will improve overall model loss by about 3e-10 bits per word, if I'm calculating correctly (source: google ngrams).
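A rough sanity check of that ~3e-10 figure (the 50/50 era split below is an assumption made purely for illustration; only the frequencies are from the comment):

```python
import math

p_literally = 1e-5              # P(word == "literally")
p_cry = 1 / 6_000               # P(next == "crying" | "literally"), marginal
p_sob = 1 / 40_000              # P(next == "sobbing" | "literally"), marginal
print(round(p_cry / p_sob, 1))  # ~6.7, i.e. "about 7x"

# Assume half the corpus behaves like early-1900s text (4:1) and half like
# mid-1900s text (50:1), with the same total mass on {crying, sobbing}.
mass = p_cry + p_sob
gain_bits = 0.0
for ratio in (4, 50):
    cry, sob = mass * ratio / (ratio + 1), mass / (ratio + 1)
    kl = cry * math.log2(cry / p_cry) + sob * math.log2(sob / p_sob)
    gain_bits += 0.5 * kl       # each era weighted 1/2

print(f"{p_literally * gain_bits:.1e}")  # ~1e-10 bits/word, the same order as the estimate above
```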
If there's some important fact about one specific unexpected nucleotide which occurs in half of mammalian genomes, but nucleotide sequence data is only 1% of your overall data and the other data you're feeding the model includes text, your model will prefer to learn a gajillion little linguistic facts on the level of the above over learning this cool tidbit of information about genomes. Whereas if you separate out the models learning linguistic tidbits from the ones predicting nucleotide sequences, learning little linguistic tricks will trade off against learning other little linguistic tricks, and learning little genetics facts will trade off against learning other little genetics facts.
And if someone accidentally dumps some database dumps containing a bunch of password hashes into the training dataset then only one of your experts will decide that memorizing a few hundred million md5 digests is the most valuable thing it could be doing, while the rest of your experts continue chipping happily away at discovering marginal patterns in their own little domains.
↑ comment by gwern · 2024-05-16T00:52:28.130Z · LW(p) · GW(p)
I'm not sure how to square those results with the Chinchilla paper though
Apples and oranges. The Chinchilla paper simply optimizes the final trained model's loss given a fixed compute budget. It doesn't say anything about any downstream uses - similar to how it doesn't tell you (directly) how you should allocate your compute if you have X GPUs and you want to run a model for your users for Y requests, and you have a tradeoff between spending your GPUs at training time to create a smaller model which needs fewer GPUs to serve Y requests. Likewise, you've probably seen some "overtraining" analyses which argue that you should overtrain a Chinchilla by some large amount Z to get the model which best balances train vs run - but those also answer a different question because they assume that you will deploy that Chinchilla model without any sparsification or lower precision, even though that's hardly what anyone actually does.
(While no one I know of has done a Li et al-style analysis for MoEs, I would expect that the results will be fairly similar, but shifted up/down, because you can often think of a MoE as a bunch of smaller dense models.)
↑ comment by mako yass (MakoYass) · 2024-05-15T22:10:13.687Z · LW(p) · GW(p)
I'm not sure why people would think LLMs understand their own output; we know they're not up to spotting inconsistencies in it that are sometimes obvious to humans (as soon as they are, things will start moving very quickly).
↑ comment by faul_sname · 2024-05-15T22:59:46.379Z · LW(p) · GW(p)
LLMs can sometimes spot some inconsistencies in their own outputs -- for example, here I ask ChatGPT to produce a list of three notable individuals that share a birth date and year, and here I ask it to judge the correctness of the response to that question, and it is able to tell that the response was inaccurate.
It's certainly not perfect or foolproof, but it's not something they're strictly incapable of either.
Although in fairness you would not be wrong if you said "LLMs can sometimes spot human-obvious inconsistencies in their outputs, but also things are currently moving very quickly".
↑ comment by Algon · 2024-04-27T17:43:40.781Z · LW(p) · GW(p)
I think this only holds if fine tunes are composable, which as far as I can tell they aren't (fine tuning on one task subtly degrades performance on a bunch of other tasks, which isn't a big deal if you fine tune a little for performance on a few tasks but does mean you probably can't take a million independently-fine-tuned models and merge them into a single super model of the same size with the same performance on all million tasks).
I don't think I've ever heard of any evidence for this being the case.
↑ comment by faul_sname · 2024-04-29T19:20:15.809Z · LW(p) · GW(p)
Probably the best search terms are "catastrophic interference" or "catastrophic forgetting". Basically, the issue is that if you take some model that is tuned on some task, and then fine-tune it on a different, unrelated task, performance on the first task will tend to degrade.
From a certain perspective, it's not particularly surprising that this happens. If you have a language model with 7B 32 bit parameters, that language model can at most contain 28GB of compressed information. If the model is "full", any new information you push into it must necessarily "push" some other information out of it.
There are a number of ways to mitigate this issue, and in fact there's a whole field of research into ways to mitigate this issue. Examples:
- Multitask Learning: Instead of training on a bunch of examples of task A, and then a bunch of examples of task B, interleave the examples of A and B. The model trained on A and B will perform better on both tasks than the pretrained base model, though it will not perform as well on A as a model trained only on A, nor as well on B as a model trained only on B.
- Knowledge Distillation: Like multitask learning, except that instead of directly fine-tuning a model on both tasks A and B, you instead do separate fine-tunes on A and on B and use knowledge distillation to train a third model to imitate the outputs of the fine-tuned-on-A or fine-tuned-on-B model, as appropriate for the training datapoint
- Mixture of Experts: Fine tune one model on A, and another on B, and then train a third model to predict which model should be used to make a prediction for each input (or more accurately, how the predictions of each expert model should be weighted in determining the output). This can scale to an almost arbitrary number of tasks, but the cost scales linearly with the number of experts (or better-than-linearly if you're clever about it, though the storage requirements still scale linearly with the number of experts).
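As a toy illustration of the last item, a minimal mixture-of-experts forward pass (numpy; the linear "experts", the softmax gate, and all shapes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 16, 4, 3

experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]  # e.g. tuned on tasks A, B, C
gate = rng.normal(size=(d_in, n_experts))                             # routing network

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x):
    weights = softmax(x @ gate)                   # how much to trust each expert for this input
    outputs = np.stack([x @ W for W in experts])  # each expert's prediction
    return weights @ outputs                      # weighted combination

print(moe_forward(rng.normal(size=d_in)))  # inference cost scales with the experts evaluated
```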
↑ comment by devrandom · 2024-05-01T11:54:35.146Z · LW(p) · GW(p)
I think this only holds if fine tunes are composable [...] you probably can't take a million independently-fine-tuned models and merge them [...]
The purpose of a fine-tune is to "internalize" some knowledge - either because it is important to have implicit knowledge of it, or because you want to develop a skill.
Although you may have a million instances executing tasks, the knowledge you want to internalize is likely much more sparse. For example, if an instance is tasked with exploring a portion of a search space, and it doesn't find a solution in that portion, it can just summarize its finding in a few words. There might not even be a reason to internalize this summary - it might be merged with other summaries for a more global view of the search landscape.
So I don't see the need for millions of fine-tunes. It seems more likely that you'd have periodic fine-tunes to internalize recent progress - maybe once an hour.
The main point is that the single periodic fine-tune can be copied to all instances. This ability to copy the fine-tune is the main advantage of instances being identical clones.
comment by ryan_greenblatt · 2024-04-26T23:04:47.549Z · LW(p) · GW(p)
See also Before smart AI, there will be many mediocre or specialized AIs [LW · GW].
comment by snewman · 2024-04-26T22:15:31.552Z · LW(p) · GW(p)
Assuming we require a performance of 40 tokens/s, the training cluster can run concurrent instances of the resulting 70B model
Nit: you mixed up 30 and 40 here (should both be 30 or both be 40).
I will assume that the above ratios hold for an AGI level model.
If you train a model with 10x as many parameters, but use the same training data, then it will cost 10x as much to train and 10x as much to operate, so the ratios will hold.
In practice, I believe it is universal to use more training data when training larger models? Implying that the ratio would actually increase (which further supports your thesis).
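One way to see this, using the common approximations of ~6·N·D FLOPs to train an N-parameter model on D tokens and ~2·N FLOPs per inference token (rough constants; a sketch):

```python
# Training compute / per-token inference compute ~= 6*N*D / (2*N) = 3*D,
# i.e. independent of parameter count N and growing only with training tokens D.
def train_to_inference_ratio(n_params: float, train_tokens: float) -> float:
    return (6 * n_params * train_tokens) / (2 * n_params)  # = 3 * train_tokens

print(f"{train_to_inference_ratio(70e9, 15e12):.1e}")    # 4.5e13
print(f"{train_to_inference_ratio(700e9, 15e12):.1e}")   # 10x params, same data: ratio unchanged
```

So with fixed data the ratio holds, and if larger models are trained on more data (as is standard practice), the ratio grows.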
On the other hand, the world already contains over 8 billion human intelligences. So I think you are assuming that a few million AGIs, possibly running at several times human speed (and able to work 24/7, exchange information electronically, etc.), will be able to significantly "outcompete" (in some fashion) 8 billion humans? This seems worth further exploration / justification.
↑ comment by Brendan Long (korin43) · 2024-04-27T04:50:05.348Z · LW(p) · GW(p)
Having 1.6 million identical twins seems like a pretty huge advantage though.
↑ comment by snewman · 2024-04-28T15:19:10.238Z · LW(p) · GW(p)
Can you elaborate? This might be true but I don't think it's self-evidently obvious.
In fact it could in some ways be a disadvantage; as Cole Wyeth notes in a separate top-level comment, "There are probably substantial gains from diversity among humans". 1.6 million identical twins might all share certain weaknesses or blind spots.
↑ comment by devrandom · 2024-05-01T12:01:05.221Z · LW(p) · GW(p)
The main advantage is that you can immediately distribute fine-tunes to all of the copies. This is much higher bandwidth compared to our own low-bandwidth/high-effort knowledge dissemination methods.
The monolithic aspect may potentially be a disadvantage, but there are a couple of mitigations:
- AGIs are by definition generalists
- you can segment the population into specialists (see also this comment [LW(p) · GW(p)] about MoE)
↑ comment by devrandom · 2024-04-27T10:09:06.419Z · LW(p) · GW(p)
On the other hand, the world already contains over 8 billion human intelligences. So I think you are assuming that a few million AGIs, possibly running at several times human speed (and able to work 24/7, exchange information electronically, etc.), will be able to significantly "outcompete" (in some fashion) 8 billion humans? This seems worth further exploration / justification.
Good point, but a couple of thoughts:
- the operational definition of AGI referred to in the article is significantly stronger than the average human
- the humans are poorly organized
- the 8 billion humans are supporting a civilization, while the AGIs can focus on AI research and self-improvement
↑ comment by snewman · 2024-04-28T15:23:25.186Z · LW(p) · GW(p)
All of this is plausible, but I'd encourage you to go through the exercise of working out these ideas in more detail. It'd be interesting reading and you might encounter some surprises / discover some things along the way.
Note, for example, that the AGIs would be unlikely to focus on AI research and self-improvement if there were more economically valuable things for them to be doing, and if (very plausibly!) there were not more economically valuable things for them to be doing, why wouldn't a big chunk of the 8 billion humans have been working on AI research already (such that an additional 1.6 million agents working on this might not be an immediate game changer)? There might be good arguments to be made that the AGIs would make an important difference, but I think it's worth spelling them out.
comment by jacob_cannell · 2024-06-05T17:45:26.102Z · LW(p) · GW(p)
For practical reasons, the compute required to train an LLM is several orders of magnitude larger than the compute required to run a single inference instance. In particular, a single NVIDIA H100 GPU can run inference at a throughput of about 2,000 tokens/s, while Meta trained Llama3 70B on a GPU cluster[1] of about 24,000 GPUs. Assuming we require a performance of 30–40 tokens/s per instance, the training cluster can run roughly 1.2–1.6 million concurrent instances of the resulting 70B model.
I agree directionally with your headline, but your analysis here assumes flops are the primary constraint on inference scaling. Actually it looks like VRAM is already the more important constraint, and it would likely become even more dominant if AGI requires more brain-like models.
LLMs need VRAM for both 'static' and 'dynamic' weights. The static weights are the output of the long training process and are shared across all instances of the same model or fine-tune (LoRAs share most). However, the dynamic 'weights' - the attention KV cache - are essentially unique to each individual instance of the model, specific to its current working-memory context and chain of thought.
So the key parameters here are total model size and the dynamic-vs-static ratio (which depends heavily on context length and many other factors). For example, if the dynamic portion is 50% of the RAM usage, then 1M concurrent instances would require almost as many GPUs as running 1M fully independent copies of the model.
If AGI requires scaling up to very large brain-size models of ~100T params (which seems likely), and the dynamic ratio is even just 1%, then 1M concurrent instances would require on the order of 10M GPUs.
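A back-of-envelope version of that bound (every constant below - bytes per parameter, the 1% dynamic ratio, 80 GB per GPU - is an illustrative assumption):

```python
weight_bytes = 100e12 * 2               # hypothetical ~100T-param model at 2 bytes/param
kv_per_instance = 0.01 * weight_bytes   # "dynamic ratio" of 1% -> ~2 TB of KV cache per instance
gpu_hbm_bytes = 80e9                    # e.g. an 80 GB H100-class GPU
instances = 1_000_000

# static weights can be shared across instances; the per-instance KV cache cannot
total_bytes = weight_bytes + instances * kv_per_instance
print(f"{total_bytes / gpu_hbm_bytes:.1e} GPUs just to hold the memory")  # ~2.5e7, order 10M
```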
↑ comment by devrandom · 2024-06-06T19:26:17.809Z · LW(p) · GW(p)
These are good points.
But don't the additional GPU requirements apply equally to training and inference? If that's the case, then the number of inference instances that can be run on training hardware (post-training) will still be on the order of 1e6.
↑ comment by jacob_cannell · 2024-06-08T02:24:28.738Z · LW(p) · GW(p)
Not for transformers, for which training and inference are fundamentally different.
Transformer training parallelizes over time, but that isn't feasible for inference. So transformer inference backends have to parallelize over batch/space, just like RNNs, which is enormously less efficient in RAM and RAM bandwidth use.
So if you had a large attention model that uses say 1TB of KV cache (fast weights) and 1TB of slow weights, transformer training can often run near full efficiency, flop limited, parallelizing over time.
But similarly efficient transformer inference would require running about K instances/agents in parallel, where K is the flop/mem_bw ratio (currently up to ~1000 on an H100). So that would be 1000 * 1TB of RAM for the KV cache (fast weights), since it is unique per agent instance.
This, in a nutshell, is part of why we don't already have AGI. Transformers are super efficient at absorbing book knowledge, but just as inefficient as RNNs at inference (generating new experiences, which is a key bottleneck on learning from experience).
Thus there is of course much research into more efficient long KV caches, tree/graph inference that can share some of the KV cache across similar branching agents, etc.
↑ comment by ryan_greenblatt · 2024-06-08T04:35:15.721Z · LW(p) · GW(p)
In practice, throughput for generating tokens is only perhaps 3-10x worse than reading (input/prompt) tokens. This is true even while optimizing for latency on generation (rather than throughput).
(This is for well optimized workloads: additional inference optimizations are needed for generation.)
For instance, see the pricing on various APIs. (OpenAI charges about 3x more for output tokens than input tokens; Anthropic prices input about 5x cheaper than output.)
I'm skeptical this will change importantly with future larger models.
↑ comment by jacob_cannell · 2024-06-08T16:44:12.802Z · LW(p) · GW(p)
Input vs output tokens are both unique per agent history (prompt + output), so that differentiation doesn't matter for my core argument about the RAM constraint. If you have a model which needs 1TB of KV cache, and you aren't magically sharing that significantly between instances, then you'll need at least 1000 * 1TB of RAM to run 1000 inferences in parallel.
The 3x - 10x cost ratio that model providers charge is an economic observation that tells us something about the current cost vs utility tradeoffs, but it's much complicated by the oversimplification of current pricing models (they are not currently charging their true costs, probably because that would be too complicated, but also perhaps because it would reveal too much information - their true cost would be more like charging rent on RAM for every timestep). It just tells you, very roughly, that on average (over many customer requests) the flop utilization of the generation phase (parallel over instances) is perhaps 3x to 10x lower than that of the prefill phase (parallel over time) - but it doesn't directly tell you why.
This is all downstream of model design and economics. There are many useful requests that LLMs can fulfill using barely any KV cache - essentially all google/oracle type use cases where you are just asking the distilled wisdom of the internet a question. If those were all of the request volume, then the KV cache RAM per instance would be inconsequential, inference batch sizes would be > 1000, inference flop utilization would be the same for prefill vs generation, and providers would charge the same price for input and output tokens.
On the other extreme, if all requests used up the full training context window, then the flop utilization of inference would be constrained by approximately ((max_KV_cache_RAM + weight_RAM) / max_KV_cache_RAM) / alu_ratio. For example, if the KV cache is 10% of RAM and the alu_ratio is 1000:1, generation would have a max efficiency of 1%. If prefill efficiency were 30%, then output tokens would presumably be priced 30x more than input tokens.
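Restating that bound as a function (this just reproduces the simplified formula above; real utilization depends on many more factors):

```python
def max_generation_efficiency(kv_fraction: float, alu_ratio: float) -> float:
    """Rough upper bound on flop utilization of batched generation.

    kv_fraction: share of total RAM taken by one instance's full KV cache
    alu_ratio:   flop / memory-bandwidth ratio of the hardware
    Best-case batch size is ~ total_RAM / KV_per_instance = 1 / kv_fraction,
    and utilization is roughly batch_size / alu_ratio.
    """
    return (1.0 / kv_fraction) / alu_ratio

print(max_generation_efficiency(kv_fraction=0.10, alu_ratio=1000))  # 0.01 -> ~1% max efficiency
```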
So the observed input:output token pricing depends on the combination of the KV cache RAM fraction (largely a model design decision), the current efficiency of prefill vs generation implementations, and, most importantly, the distribution of request prompt lengths, which itself depends on the current economic utility of shorter vs longer prompts for current models.
In practice most current models have a much smaller KV-cache-to-weight RAM fraction than my simple 1:1 example, but the basic point holds: training is more flop & interconnect limited, while inference is more RAM and RAM-bandwidth limited. These constraints already shape the design space of models and how they are deployed.
LLMs currently excel at anything a human knowledge worker can do without any specific training (minimal input prompt length), but largely aren't yet competitive with human experts at most real world economic tasks that require significant unique per-job training. Coding is a good example - human thoughtspeed is roughly 9 token/s, or 32K/hour, or 256K per 8 hour work day, or roughly 1M tokens per week.
Current GPT4-turbo (one of the current leaders for coding), for example, has a max context length of 128K (roughly 4 hours of human thought). But if you actually use all of that context for typical coding requests that generate, say, 1K of useful output (equivalent to a few minutes of human thought), the input tokens will cost you about $1.25 while the output tokens cost only about $0.03. That is roughly as expensive as a human worker, per minute of output thought tokens. The cost of any LLM agent today (per minute of output thought) increases linearly with input prompt length - i.e. with the agent's unique differentiating short-term memory. Absent more sophisticated algorithms, the cost of running a ReAct-like LLM agent thus grows quadratically with time, vs linearly for humans (because each small observe-act time step has a cost proportional to the input context length, which grows with each time step).
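A small sketch of that quadratic growth (the price and step size below are illustrative assumptions, not actual API pricing):

```python
# If each agent step re-reads the whole history, cumulative input tokens grow ~ t^2 / 2.
input_price_per_mtok = 10.0   # $ per 1M input tokens (illustrative)
tokens_per_step = 1_000       # new tokens appended to the context each step

def cumulative_input_cost(n_steps: int) -> float:
    total_input_tokens = sum(t * tokens_per_step for t in range(1, n_steps + 1))
    return total_input_tokens / 1e6 * input_price_per_mtok

for steps in (10, 100, 1000):
    print(steps, round(cumulative_input_cost(steps), 2))  # 10x more steps -> ~100x the cost
```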
Human programmers aren't being replaced en masse (yet) in part because current models aren't especially smarter than humans at equivalent levels of job-specific knowledge/training.
Normalized for similar ability, LLMs currently are cheaper than humans at most any knowledge work that requires very little job-specific knowledge/training, and much more expensive than humans for tasks that require extensive job-specific knowledge/training - and this has everything to do with how transformers currently consume and utilize VRAM.
comment by lemonhope (lcmgcd) · 2024-04-27T00:01:08.510Z · LW(p) · GW(p)
This seems correct and important to me.
comment by Seth Herd · 2024-04-27T14:32:12.791Z · LW(p) · GW(p)
The big question here, it seems like, is: does intelligence stack? Does a hundred thousand instances of GPT4 working together make an intelligence as smart as GPT7?
Thus far the answer seems to be no. There are some intelligence improvements from combining multiple calls in tree-of-thought-type setups, but not much. And those need carefully hand-structured algorithms.
So I think the limitation is in scaffolding techniques, not the sheer number of instances you can run. I do expect scaffolding LLMs into cognitive architectures to achieve human level fully general AGI, but how and when we get there is tricky to predict.
When we have that, I expect it to stack a lot like human organizations. They can do a lot more work at once, but they're not much smarter than a single individual because it's really hard to coordinate and stack all of that cognitive work.
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-04-27T13:56:00.602Z · LW(p) · GW(p)
As a population of AGI copies, the obvious first step towards 'taking over the world' is to try to improve oneself.
I expect that the described workforce could find improvements within a week of clock time, including one or more of:
- Improvements to peak intelligence without needing to fully retrain.
- Improvements to inference efficiency.
- Improvements to the ability to cooperate and share knowledge.
comment by Cole Wyeth (Amyr) · 2024-04-27T17:27:00.638Z · LW(p) · GW(p)
I have no reason to question your evidence, but I don't agree with your arguments. It is not clear that a million LLMs coordinate better than a million humans. There are probably substantial gains from diversity among humans, so the identical weights you mentioned could cut in either direction. An additional million human-level intelligences would have a large economic impact, but not necessarily a transformative one. Also, your argument for speed superintelligence is probably flawed; since you're discussing what happens immediately after the first human-level AGI is created, gains from any speedup in thinking should already be factored in and will not lead to superintelligence in the short term.
comment by Stephen McAleese (stephen-mcaleese) · 2024-04-27T09:25:19.358Z · LW(p) · GW(p)
Currently, groups of LLM agents can collaborate using frameworks such as ChatDev, which simulates a virtual software company using LLM agents with different roles. Though I think human organizations are still more effective for now. For example, corporations such as Microsoft have over 200,000 employees and can work on multi-year projects. But it's conceivable that in the future there could be virtual companies composed of millions of AIs that can coordinate effectively and can work continuously at superhuman speed for long periods of time.
comment by devrandom · 2024-06-26T09:57:57.621Z · LW(p) · GW(p)
New transformer-specific chips from Etched are in the works. This might make inference even cheaper relative to training compute.
comment by devrandom · 2024-06-18T09:10:47.418Z · LW(p) · GW(p)
Post from Epoch AI about trading off training compute against inference compute.
comment by devrandom · 2024-05-08T19:58:35.478Z · LW(p) · GW(p)
https://www.lesswrong.com/posts/aH9R8amREaDSwFc97/rapid-capability-gain-around-supergenius-level-seems [LW · GW] also seems relevant to this discussion.