Posts

Musings on LLM Scale (Jul 2024) 2024-07-03T18:35:48.373Z
No Anthropic Evidence 2012-09-23T10:33:06.994Z
A Mathematical Explanation of Why Charity Donations Shouldn't Be Diversified 2012-09-20T11:03:48.603Z
Consequentialist Formal Systems 2012-05-08T20:38:47.981Z
Predictability of Decisions and the Diagonal Method 2012-03-09T23:53:28.836Z
Shifting Load to Explicit Reasoning 2011-05-07T18:00:22.319Z
Karma Bubble Fix (Greasemonkey script) 2011-05-07T13:14:29.404Z
Counterfactual Calculation and Observational Knowledge 2011-01-31T16:28:15.334Z
Note on Terminology: "Rationality", not "Rationalism" 2011-01-14T21:21:55.020Z
Unpacking the Concept of "Blackmail" 2010-12-10T00:53:18.674Z
Agents of No Moral Value: Constrained Cognition? 2010-11-21T16:41:10.603Z
Value Deathism 2010-10-30T18:20:30.796Z
Recommended Reading for Friendly AI Research 2010-10-09T13:46:24.677Z
Notion of Preference in Ambient Control 2010-10-07T21:21:34.047Z
Controlling Constant Programs 2010-09-05T13:45:47.759Z
Restraint Bias 2009-11-10T17:23:53.075Z
Circular Altruism vs. Personal Preference 2009-10-26T01:43:16.174Z
Counterfactual Mugging and Logical Uncertainty 2009-09-05T22:31:27.354Z
Bloggingheads: Yudkowsky and Aaronson talk about AI and Many-worlds 2009-08-16T16:06:18.646Z
Sense, Denotation and Semantics 2009-08-11T12:47:06.014Z
Rationality Quotes - August 2009 2009-08-06T01:58:49.178Z
Bayesian Utility: Representing Preference by Probability Measures 2009-07-27T14:28:55.021Z
Eric Drexler on Learning About Everything 2009-05-27T12:57:21.590Z
Consider Representative Data Sets 2009-05-06T01:49:21.389Z
LessWrong Boo Vote (Stochastic Downvoting) 2009-04-22T01:18:01.692Z
Counterfactual Mugging 2009-03-19T06:08:37.769Z
Tarski Statements as Rationalist Exercise 2009-03-17T19:47:16.021Z
In What Ways Have You Become Stronger? 2009-03-15T20:44:47.697Z
Storm by Tim Minchin 2009-03-15T14:48:29.060Z

Comments

Comment by Vladimir_Nesov on Ben Millwood's Shortform · 2024-07-26T07:05:40.200Z · LW · GW

A new Bloomberg article says xAI is building a datacenter in Memphis, planned to become operational by the end of 2025, mentioning a new-to-me detail that the datacenter targets 150 megawatts (more details on DCD). This means a scale of about 100,000 GPUs, or $4 billion in infrastructure, the bulk of the $6 billion xAI recently secured in its Series B.

This should be good for training runs that could be said to cost $1 billion in cost of time (lasting a few months). And Dario Amodei is saying that this is the scale of today, for models that are not yet deployed. This puts xAI at 18 months behind, a difficult place to rebound from unless long-horizon task capable AI that can do many jobs (a commercially crucial threshold that is not quite AGI) is many more years away.

Comment by Vladimir_Nesov on Leon Lang's Shortform · 2024-07-25T07:46:51.771Z · LW · GW

New data! The Llama 3.1 report includes data from a Chinchilla optimality study on their setup. The surprise is that Llama 3.1 405b was chosen to have the optimal size rather than being 2x overtrained. Their actual extrapolation for an optimal point is 402b parameters, 16.55T tokens, and 3.8e25 FLOPs.

Fitting to the tokens per parameter framing, this gives a ratio of 41 (not 20) around the scale of 4e25 FLOPs. More importantly, their fitted dependence of the optimal number of tokens on compute has exponent 0.53, compared to 0.51 from the Chinchilla paper (which was almost 0.5, hence tokens being proportional to parameters). The data only goes up to 1e22 FLOPs (3e21 FLOPs for Chinchilla), though, so what actually happens at 4e25 FLOPs (6e23 FLOPs for Chinchilla) is all extrapolation in both cases; there are no isoFLOP plots at those scales. At least Chinchilla had Gopher as a point of comparison, and there was only a 200x FLOPs gap in the extrapolation, while for Llama 3.1 405b the gap is 4000x.
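As a quick sanity check on the reported optimum (using the usual C ≈ 6ND rule of thumb, which is my own assumption here rather than something taken from the report):

```python
# Sanity check of the reported Llama 3.1 optimum under the common C ~= 6*N*D approximation
# (the approximation is an assumption; the three numbers are from the report as quoted above).
N = 402e9      # reported optimal parameters
D = 16.55e12   # reported optimal tokens
C = 3.8e25     # reported optimal compute, FLOPs

print(D / N)          # tokens per parameter: ~41, vs ~20 from the Chinchilla paper
print(6 * N * D / C)  # ~1.05, so the reported numbers are roughly consistent with C ~= 6*N*D
```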

So data needs grow faster than parameters with more compute. This looks bad for the data wall, though the more relevant question is what would happen after 16 repetitions, or how this dependence really works with more FLOPs (with the optimal ratio of tokens to parameters changing with scale).

Comment by Vladimir_Nesov on On extinction risk over time and AI · 2024-07-24T05:35:23.629Z · LW · GW

For a given AGI lab, the decision to keep working on the project despite believing there is at least a 10% risk of extinction depends on the character of the counterfactuals. Success is not just another draw out of the extinction urn, another step on the path to eventual doom; instead it promises that the new equilibrium involves robust safety with no future draws. So it's all about the alternatives.

One issue for individual labs is that their alternative is likely that the other labs develop AGI instead; they personally have little power to pause AI globally unless they involve themselves in coordination with all other capable actors. Many arguments stop here, considering such coordination infeasible.

The risk of literal extinction for reasons other than AGI seems vanishingly small for the foreseeable future. There are many global catastrophic risks with moderate probability when added up over decades, some of which might disrupt the course of civilization for millennia, but not literal extinction. The closest risk of actual extinction that doesn't involve AGI that I can imagine is advanced biotechnology of the kind that's not even on the horizon yet. It's unclear how long it would take to get there without AI, while dodging civilization-wrecking catastrophes that precede its development, but I would guess a lower bound of many decades before this becomes a near-term possibility. Even then it won't become a certainty of immediate doom, in a similar way to how large nuclear arsenals still haven't cashed out in a global nuclear conflict for many decades. So it makes sense to work towards global coordination to pause AI for at least this long, as long as there is vigorous effort to develop AI alignment theory and prepare in all ways that make sense during this time.

Comment by Vladimir_Nesov on Ben Millwood's Shortform · 2024-07-23T04:00:54.761Z · LW · GW

For some reason current labs are not already running $10 billion training runs and didn't build the necessary datacenters immediately. It would take a million H100s and 1.5 gigawatts, so supply issues seem likely. There is also a lot of engineering detail to iron out, so the scaling proceeds gradually.

But some of this might be risk aversion, an unwillingness to waste capital where a slower pace makes better use of it. Since a new contender has no other choice, we'll get to see if it's possible to leapfrog scaling after all. And Musk has an affinity for impossible deadlines (not necessarily for meeting them), so the experiment will at least be attempted.

Comment by Vladimir_Nesov on Leon Lang's Shortform · 2024-07-21T12:37:53.535Z · LW · GW

Data varies in the loss it enables, but doesn't seem to vary greatly in the ratio between the number of tokens and the number of parameters that extracts the best loss out of training with given compute. That is, I'm usually keeping this question in mind and didn't see evidence to the contrary in the papers, but relevant measurements are very rarely reported, even in model series training reports where the ablations were probably actually done. So this could be very wrong, a generalization from 2.5 examples. With repetition, there's a gradual increase from 20 to 60. Probably something similar is there for distillation (in the opposite direction), but I'm not aware of papers that measure this, so that could also be wrong.

One interesting point is the isoFLOP plots in the StripedHyena post (search "Perplexity scaling analysis"). With hybridization where standard attention remains in 8-50% of the blocks, perplexity is quite insensitive to change in model size while keeping compute fixed, while for pure standard attention the penalty for deviating from the optimal ratio to a similar extent is much greater. This suggests that one way out for overtrained models might be hybridization with these attention alternatives. That is, loss for an overtrained model might be closer to Chinchilla optimal loss with a hybrid model than it would be for a similarly overtrained pure standard attention model. Out of the big labs, visible moves in this direction were made by DeepMind with their Griffin team (Griffin paper, RecurrentGemma). So that's one way the data wall might get pushed a little further for the overtrained models.

Comment by Vladimir_Nesov on Leon Lang's Shortform · 2024-07-19T20:40:10.257Z · LW · GW

To make a Chinchilla optimal model smaller while maintaining its capabilities, you need more data. At 15T tokens (the amount of data used in Llama 3), a Chinchilla optimal model has 750b active parameters, and training it invests 7e25 FLOPs (Gemini 1.0 Ultra or 4x original GPT-4). A larger $1 billion training run, which might be the current scale that's not yet deployed, would invest 2e27 FP8 FLOPs if using H100s. A Chinchilla optimal run for these FLOPs would need 80T tokens when using unique data.
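A minimal sketch of this arithmetic, assuming the usual C ≈ 6ND approximation and the 20 tokens per parameter ratio (both rules of thumb, not exact):

```python
# Chinchilla-optimal size for a given compute budget, assuming C ~= 6*N*D and D ~= 20*N
# (standard rules of thumb; real fits vary somewhat with scale and dataset).

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

print(chinchilla_optimal(7e25))  # ~7.6e11 params, ~1.5e13 tokens (the 750b / 15T point above)
print(chinchilla_optimal(2e27))  # ~4.1e12 params, ~8.2e13 tokens (the 80T token point above)
```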

Starting with a Chinchilla optimal model, if it's made 3x smaller, maintaining performance requires training it on 9x more data, so that it needs 3x more compute. That's already too much data, and we are only talking 3x smaller. So we need ways of stretching the data that is available. By repeating data up to 16 times, it's possible to make good use of 100x more compute than by only using unique data once. So with say 2e26 FP8 FLOPs (a $100 million training run on H100s), we can train a 3x smaller model that matches performance of the above 7e25 FLOPs Chinchilla optimal model while needing only about 27T tokens of unique data (by repeating them 5 times) instead of 135T unique tokens, and the model will have about 250b active parameters. That's still a lot of data, and we are only repeating it 5 times where it remains about as useful in training as unique data, while data repeated 16 times (that lets us make use of 100x more compute from repetition) becomes 2-3 times less valuable per token.
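The same arithmetic for the overtraining tradeoff described above (again assuming C ≈ 6ND):

```python
# Making a Chinchilla optimal model k times smaller while matching performance
# needs ~k^2 times the data and ~k times the compute (under C ~= 6*N*D).
n_params, n_tokens, compute = 750e9, 15e12, 7e25   # the Chinchilla optimal point above
k = 3

small_params = n_params / k          # ~250b active parameters
needed_tokens = n_tokens * k**2      # ~135T tokens seen in training
needed_compute = compute * k         # ~2e26 FLOPs
unique_tokens = needed_tokens / 5    # ~27T unique tokens if each one is repeated 5 times
print(small_params, needed_tokens, needed_compute, unique_tokens)
```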

There is also distillation, where a model is trained to predict the distribution generated by another model (Gemma-2-9b was trained this way). But this sort of distillation still happens while training on real data, and it only allows getting similar performance with about 2x less data, so it only slightly pushes back the data wall. And rumors of synthetic data for pre-training (as opposed to post-training) remain rumors. With distillation on 16x repeated 50T tokens of unique data, we then get the equivalent of training on 800T tokens of unique data (it gets 2x less useful per token through repetition, but 2x more useful through distillation). This enables reducing active parameters 3x (as above, maintaining performance), compared to a Chinchilla optimal model trained for 80T tokens with 2e27 FLOPs (a $1 billion training run for the Chinchilla optimal model). This overtrained model would cost $3 billion (and have 1300b active parameters).

So the prediction is that the trend for getting models that are both cheaper for inference and smarter might continue into the imminent $1 billion training run regime but will soon sputter out when going further due to the data wall. Overcoming this requires algorithmic progress that's not currently publicly in evidence, and visible success in overcoming it in deployed models will be evidence of such algorithmic progress within LLM labs. But Chinchilla optimal models (with corrections for inefficiency of repeated data) can usefully scale to at least 8e28 FLOPs ($40 billion in cost of time, 6 gigawatts) with a mere 50T tokens of unique data.

Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading "Blackwell GPU" instead of "H100".

Comment by Vladimir_Nesov on dirk's Shortform · 2024-07-17T19:29:34.771Z · LW · GW

Something that sounds patronizing is not a social reward. It's not necessarily possible to formulate it in a way that avoids this problem without doing something significantly indirect. Right now this is upvoting for unspecified reasons.

Comment by Vladimir_Nesov on Seeking feedback on a critique of the paperclip maximizer thought experiment · 2024-07-17T02:18:09.377Z · LW · GW

The point is control over this process, ability to make decisions over development of oneself, instead of leaving it largely in the hands of the inscrutable low level computational dynamics of the brain and influence of external data. Digital immortality doesn't guard against this, and in a million subjective years you might just slip away bit by bit for reasons you don't endorse, not having had enough time to decide how to guide this process. But if there is a way to put uncontrollable drift on hold, then it's your own goal slots, you can do with them what you will when you are ready.

Comment by Vladimir_Nesov on A simple case for extreme inner misalignment · 2024-07-16T05:28:01.507Z · LW · GW

I think the FDT dictum of treating an agent like an abstract algorithm rather than any given physical instance of it ("I am an algorithm") extends to treating goals as being about the collective abstract consequences of the behavior of abstract algorithms (other algorithms, not necessarily the agent itself), rather than about any given incarnation of those algorithms or their consequences in any given incarnation, such as the physical consequences of running algorithms on computers in a physical world.

In this ontology, goals are not about optimizing configurations of the world, they are about optimizing behaviors of abstract algorithms or optimizing properties of mathematical structures. Physically, this predicts computronium (to run acausal interactions with all the abstract things, in order to influence their properties and behaviors) and anti-predicts squiggles or any such focus on the physical form of what's going on, other than efficiency at accessing more computation.

Comment by Vladimir_Nesov on Seeking feedback on a critique of the paperclip maximizer thought experiment · 2024-07-15T23:38:46.429Z · LW · GW

As a spaghetti behavior executor, I'm worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift, especially throughout a much longer life than presently feasible, so I'd like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.

Comment by Vladimir_Nesov on MIRI's July 2024 newsletter · 2024-07-15T21:37:04.606Z · LW · GW

Eliezer also speaks with Bloomberg’s Nate Lanxon and Jackie Davalos, making the case for international coordination to shut down frontier AI development.

This happened in July 2023, a year ago.

Comment by Vladimir_Nesov on Zach Stein-Perlman's Shortform · 2024-07-15T19:41:25.571Z · LW · GW

(The tweet includes a screenshot from The Washington Post article "OpenAI illegally barred staff from airing safety risks, whistleblowers say" that references a letter to SEC.)

Edit: This was in response to the original version of the above comment that only linked to the tweet without other links or elaboration.

Comment by Vladimir_Nesov on Seeking feedback on a critique of the paperclip maximizer thought experiment · 2024-07-15T19:06:37.447Z · LW · GW

Squiggle maximizer (which is tagged for this post) and paperclip maximizer are significantly different points. Paperclip maximizer (as opposed to squiggle maximizer) is centrally an illustration for the orthogonality thesis (see greaterwrong mirror of arbital if the arbital page doesn't load).

What the orthogonality thesis says and the paperclip maximizer example illustrates is that it's possible in principle to construct arbitrarily effective agents deserving of moniker superintelligence with arbitrarily silly or worthless goals (in human view). This seems clearly true, but valuable to notice to fix intuitions that would claim otherwise. Then there's a "practical version of orthogonality thesis", which shouldn't be called "orthogonality thesis", but often enough gets confused with it. It says that by default goals of AIs that will be constructed in practice will tend towards arbitrary things that humans wouldn't find agreeable, including something silly or simple. This is much less obviously correct, and the squiggle maximizer sketch is closer to arguing for some version of this.

Comment by Vladimir_Nesov on Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy · 2024-07-15T17:45:09.265Z · LW · GW

Some levels also collapse. As a capability, reasoning plausibly requires agentic behavior; you need fluency in System 2 skills to be effective at non-routine reasoning. Reasoning at the level of highly intelligent humans might be harder than agentic behavior alone, but then if agentic behavior gets unlocked sufficiently late, it might immediately come with reasoning at the level of highly intelligent humans. And agentic behavior seems even more like the same thing as the ability to coordinate organizations of individual instances, as long as it passes some ARA threshold (which is still lower than what's necessary to do research).

None of these are superintelligence. Ability to agentically run organizations together with a research level of reasoning is instead what it takes to start making research progress much faster than humans and soon unlock superintelligence, but it's not by itself superintelligence.

Comment by Vladimir_Nesov on Aaron_Scher's Shortform · 2024-07-15T17:13:33.964Z · LW · GW

The point is that you need to get quantitative in these estimates to claim that data is running out, since it has to run out compared to available compute, not merely on its own. And the repeated data argument seems by itself sufficient to show that it doesn't in fact run out in this sense.

Data still seems to be running out for overtrained models, which is a major concern for LLM labs, so from their point of view there is indeed a salient data wall that's very soon going to become a problem. There are rumors of synthetic data (which often ambiguously gesture at post-training results while discussing the pre-training data wall), but no published research for how something like that improves the situation with pre-training over using repeated data.

Comment by Vladimir_Nesov on Aaron_Scher's Shortform · 2024-07-15T15:50:48.934Z · LW · GW

Might run out of data.

Data is running out for making overtrained models, not Chinchilla-optimal models, because you can repeat data (there's also a recent hour-long presentation by one of the authors). This systematic study was published only in May 2023, though the Galactica paper from Nov 2022 also has a result to this effect (see Figure 6). The preceding popular wisdom was that you shouldn't repeat data for language models, so cached thoughts that don't take this result into account are still plentiful, and also it doesn't sufficiently rescue highly overtrained models, so the underlying concern still has some merit.

As you repeat data more and more, the Chinchilla multiplier of data/parameters (data in tokens divided by number of active parameters for an optimal use of given compute) gradually increases from 20 to 60 (see the data-constrained efficient frontier curve in Figure 5 that tilts lower on the parameters/data plot, deviating from the Chinchilla efficient frontier line for data without repetition). You can repeat data essentially without penalty about 4 times, efficiently 16 times, and with any use at all 60 times (at some point even increasing parameters while keeping data unchanged starts decreasing rather than increasing performance). This gives a use for up to 100x more compute, compared to Chinchilla optimal use of data that is not repeated, while retaining some efficiency (at 16x repetition of data). Or up to 1200x more compute for the marginally useful 60x repetition of data.
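To see roughly where the 100x and 1200x multipliers come from (a sketch under C ≈ 6ND; the exact tokens/params ratio at 16x repetition is my guess, somewhere between 20 and 60):

```python
# Rough source of the "up to 100x more compute" and "1200x more compute" figures,
# assuming C ~= 6*N*D and a tokens/params ratio that grows from 20 (unique data)
# towards 60 (heavily repeated data).

def compute_for(unique_tokens, repetitions, ratio):
    total_tokens = unique_tokens * repetitions
    params = total_tokens / ratio
    return 6 * params * total_tokens

D = 50e12                              # e.g. a 50T token dataset
base = compute_for(D, 1, 20)           # Chinchilla optimal without repetition: ~7.5e26 FLOPs
print(compute_for(D, 16, 40) / base)   # ~130x if the ratio at 16x repetition is ~40 (a guess)
print(compute_for(D, 16, 60) / base)   # ~85x if the ratio is already ~60
print(compute_for(D, 60, 60) / base)   # ~1200x for the marginally useful 60x repetition
```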

The datasets you currently see at 15-30T tokens scale are still highly filtered compared to available raw data (see Figure 4). The scale feasible within a few years is about 2e28-1e29 FLOPs (accounting for hypothetical hardware improvement and larger datacenters of early 2030s; this is physical, not effective compute). Chinchilla optimal compute for a 50T token dataset is about 8e26 FLOPs, which turns into 8e28 FLOPs with 16x repetition of data, up to 9e29 FLOPs for the barely useful 60x repetition. Note that sometimes it's better to perplexity-filter away half of a dataset and repeat it twice than to use the whole original dataset (yellow star in Figure 6; discussion in the presentation), so using highly repeated data on 50T tokens might still outperform less-repeated usage of less-filtered data, which is to say finding 100T tokens by filtering less doesn't necessarily work at all. There's also some double descent for repetition (Appendix D; discussion in the presentation), which suggests that it might be possible to overcome the 60x repetition barrier (Appendix E) with sufficient compute or better algorithms.

In any case the OOMs match between what repeated data allows and the compute that's plausibly available in the near future (4-8 years). There's also probably a significant amount of data to be found that's not on the web, and every 2x increase in unique reasonable quality data means 4x increase in compute. Where data gets truly scarce soon is for highly overtrained inference-efficient models.

Comment by Vladimir_Nesov on Aaron_Scher's Shortform · 2024-07-15T14:49:01.521Z · LW · GW

it's estimated that the efficiency of algorithms has improved about 3x/year

There was about a 5x increase since GPT-3 for dense transformers (see Figure 4), and then there's MoE. So assuming GPT-3 is not much better than the 2017 baseline (after anyone seriously bothered to optimize), it's more like 30% per year, though plausibly slower recently.
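Annualizing this (the 2017-2024 window of about 7 years and treating MoE as roughly another 2x are my own illustrative assumptions):

```python
# Annualized rates implied by the figures above; the 7-year window and the ~2x for MoE are guesses.
print(5 ** (1 / 7))      # ~1.26: ~26%/year from the ~5x dense-transformer improvement alone
print(10 ** (1 / 7))     # ~1.39: closer to ~40%/year if MoE adds roughly another 2x
print(2 ** (12 / 8.5))   # ~2.7: Epoch's 8-9 month doubling time expressed per year (~2.5x/year)
```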

The relevant Epoch paper says point estimate for compute efficiency doubling is 8-9 months (Section 3.1, Appendix G), about 2.5x/year. Though I can't make sense of their methodology, which aims to compare the incomparable. In particular, what good is comparing even transformers without following the Chinchilla protocol (finding minima on isoFLOP plots of training runs with individually optimal learning rates, not continued pre-training with suboptimal learning rates at many points). Not to mention non-transformers where the scaling laws won't match and so the results of comparison change as we vary the scale, and also many older algorithms probably won't scale to arbitrary compute at all.

(With JavaScript mostly disabled, the page you linked lists "Compute-efficiency in language models" as 5.1%/year (!!!). After JavaScript is sufficiently enabled, it starts saying "3 ÷/year", with a '÷' character, though "90% confidence interval: 2 times to 6 times" disambiguates it. In other places on the same page there are figures like "2.4 x/year" with the more standard 'x' character for this meaning.)

Comment by Vladimir_Nesov on An AI Manhattan Project is Not Inevitable · 2024-07-11T23:21:27.101Z · LW · GW

Algorithmic improvements relevant to my argument are those that happen after long-horizon task capable AIs are demonstrated; in particular, it doesn't matter how much progress is happening now, other than as evidence about what happens later.

heavily overtrained by Chinchilla standards

This is necessarily part of it. It involves using more compute, not less, which is natural given that new training environments are coming online, and it doesn't need any algorithmic improvements at all to produce models that are both cheaper for inference and smarter. You can take a Chinchilla optimal model, make it 3x smaller, and train it on 9x more data, expending 3x more compute, and get approximately the same result. If you up the compute and data a bit more, the model will become more capable. Some current improvements are probably due to better use of pre-training data, but these things won't survive significant further scaling intact. There are also improvements in post-training, but they are even less relevant to my argument, assuming they are not lagging behind too badly in unlocking the key thresholds of capability.

Comment by Vladimir_Nesov on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T17:18:11.823Z · LW · GW

The recent Carl Shulman podcast (part 1, part 2) is informative on this question (though it should be taken in the spirit of exploratory engineering, not forecasting). In particular, in a post-AGI magically-normal world, jobs that humans are uniquely qualified to do won't be important to the industry and will be worked around. What remains of them will have the character of billionaires hiring other billionaires as waiters, so treating this question as being about careers seems noncentral.

Comment by Vladimir_Nesov on An AI Manhattan Project is Not Inevitable · 2024-07-11T16:04:29.212Z · LW · GW

The question is whether research capable TAI can lag behind government-alarming long-horizon task capable AI (AI that does many jobs, so that even Robin Hanson starts paying attention). These are two different thresholds that might both be called "AGI", so it's worth making a careful distinction. Even if it turns out that in practice they coincide and the same system becomes the first to qualify for both, for now we don't know if that's the case, and conceptually they are different.

If this lag is sufficient, governments might be able to succeed in locking down enough compute to prevent independent development of research capable TAI for many more years. This includes stopping or even reversing improvements in AI accelerators. If governments only become alarmed once there is a research capable TAI, that gives the other possibility, where TAI is developed by everyone very quickly and the opportunity to do it more carefully is lost.

Increasing investment is the crucial consideration in the sense that if research capable TAI is possible with modest investment, then there is no preventing its independent development. But if the necessary investment turns out to be sufficiently outrageous, controlling development of TAI by controlling hardware becomes feasible. Advancements in hardware are easy to control if most governments are alarmed, since the supply chains are large and the datacenters are large. And algorithmic improvements have a sufficiently low ceiling to keep what would otherwise be $10 trillion training runs infeasible for independent actors even if done with better methods. The hypothetical I was describing has research capable TAI 2-3 OOMs above the $100 billion necessary for long-horizon task capable AI, which as a barrier for feasibility can survive some algorithmic improvements.

I also think the improvements themselves are probably running out. There's only about 5x improvement in all these years for the dense transformer, a significant improvement from MoE, possibly some improvement from Mixture of Depths. All attention alternatives remain in the ballpark despite having very different architectures. Something significantly non-transformer-like is probably necessary to get more OOMs of algorithmic progress, which is also the case if LLMs can't be scaled to research capable TAI at all.

(Recent unusually fast improvement in hardware was mostly driven by moving to lower precision, first BF16, then FP8 with H100s, and now Microscaling (FP4, FP6) with Blackwell. This process is also at an end; lower-level hardware improvement will be slower. But this point is irrelevant to the argument, since improvement in hardware available to independent actors can be stopped or reversed by governments, unlike algorithmic improvements.)

Comment by Vladimir_Nesov on Bogdan Ionut Cirstea's Shortform · 2024-07-11T00:13:03.298Z · LW · GW

The $100 million figure is used in the same sentence for cost of currently deployed models. Original GPT-4 was probably trained on A100s in BF16 (A100s can't do FP8 faster), which is 6e14 FLOP/s, 7 times less than 4e15 FLOP/s in FP8 from an H100 (there is no change in quality of trained models when going from BF16 to FP8, as long as training remains stable). With A100s in BF16 at 30% utilization for 150 days, you need 9K A100s to get 2e25 FLOPs. Assuming $30K per A100 together with associated infrastructure, the cluster would cost $250 million, but again assuming $2 per hour, the time would only cost $60 million. This is 2022, deployed in early 2023. I expect recent models to cost at least somewhat more, so for early 2024 frontier models $100 million would be solidly cost of time, not cost of infrastructure.
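Re-deriving that arithmetic explicitly, using the figures as stated above (6e14 FLOP/s per A100, 30% utilization, 150 days, $30K per GPU including infrastructure, $2/hour):

```python
# GPT-4 scale estimate from the assumptions in the comment above.
flops_target = 2e25
per_gpu = 6e14 * 0.30                    # effective FLOP/s per A100 at 30% utilization
seconds = 150 * 24 * 3600                # 150 days
n_gpus = flops_target / (per_gpu * seconds)
print(n_gpus)                            # ~8.6K, i.e. roughly 9K A100s
print(n_gpus * 30e3)                     # ~$260 million for the cluster at $30K per GPU
print(n_gpus * 150 * 24 * 2)             # ~$60 million cost of time at $2/hour
```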

The $1 billion for cost of time suggests ability to train on multiple clusters, and Gemini 1.0 report basically says they did just that. So the $10 billion figure needs to be interpreted as being about scale of many clusters taken together, not individual clusters. The estimate for training on H100s for 200 days says you need 150 megawatts for $1 billion in training time, or 1.5 gigawatts for $10 billion in training time. And each hyperscaler has datacenters that consume 2-3 gigawatts in total (they are much smaller individually) with current plans to double. So at least the OOMs match the $10 billion claim interpreted as cost of training time.

Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading "Blackwell GPU" instead of "H100".

Comment by Vladimir_Nesov on Bogdan Ionut Cirstea's Shortform · 2024-07-10T21:06:01.287Z · LW · GW

Dario Amodei claims there are current $1 billion training runs. At $2/hour with H100s, this means 2e12 H100-seconds. Assuming 30% utilization and 4e15 FP8 FLOP/s, this is 2e27 FLOPs, 2 OOMs above estimates for the original GPT-4. This corresponds to 200 days with 100K H100s (and 150 megawatts). 100K H100 clusters don't seem to be built yet, the largest publicly known ones are Meta's two clusters with 24K H100s each. But it might be possible to train on multiple clusters if the inter-cluster network is good enough.
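The same arithmetic for the $1 billion figure, using the assumptions above ($2/hour, 30% utilization, the sparse 4e15 FP8 number later corrected in the edit below, and roughly 1.5 kW per GPU including overhead, which is my own assumption):

```python
# Scale of a $1 billion training run from the assumptions in the comment above.
budget = 1e9
gpu_seconds = budget / 2 * 3600           # ~1.8e12 H100-seconds at $2/hour
flops = gpu_seconds * 0.30 * 4e15         # ~2.2e27 FLOPs, ~2 OOMs above ~2e25 for original GPT-4
n_gpus = gpu_seconds / (200 * 24 * 3600)  # ~100K H100s if the run lasts 200 days
power_mw = n_gpus * 1.5e3 / 1e6           # ~150 MW at ~1.5 kW per GPU including overhead (a guess)
print(flops, n_gpus, power_mw)
```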

Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading "Blackwell GPU" instead of "H100".

Comment by Vladimir_Nesov on An AI Manhattan Project is Not Inevitable · 2024-07-09T15:39:37.484Z · LW · GW

It depends on how much time there is between the first impactful demonstration of long-horizon task capabilities (doing many jobs) and commoditization of research capable TAI, even with governments waking up during this interval and working to extend it. It might be that by default this is already at least a few years, and if the bulk of compute is seized, it extends to even longer. This seems to require long-horizon task capabilities to be found at the limits of scaling, and TAI significantly further.

But we don't know until it's tried if even a $3 billion training run won't already enable long-horizon task capabilities (with appropriate post-training, even if it arrives a bit later), and we don't know if the first long-horizon task capable AI won't immediately be capable of research, with no need for further scaling (even if it helps). And if it's not immediately obvious how to elicit these capabilities with post-training, there will be an overhang of sufficient compute and sufficiently strong base models in many places before the alarm is sounded. If enough of such things align, there won't be time for anyone to prevent prompt commoditization of research capable TAI. And then there's ASI 1-2 years later, with the least possible time for anyone to steer any of this.

Comment by Vladimir_Nesov on Response to Dileep George: AGI safety warrants planning ahead · 2024-07-08T19:57:11.548Z · LW · GW

if you assign an extremely low credence to that scenario, then whatever

I don't assign low credence to the scenario where LLMs don't scale to AGI (and my point doesn't depend on this). I assign low credence to the scenario where it's knowable today that LLMs very likely won't scale to AGI. That is, that there is a thing I could study that should change my mind on this. This is more of a crux than the question as a whole, studying that thing would be actionable if I knew what it is.

whether or not LLMs will scale to AGI

This wording mostly answers one of my questions, I'm now guessing that you would say that LLMs are (in hindsight) "the right kind of algorithm" if the scenario I described comes to pass, which wasn't clear to me from the post.

Comment by Vladimir_Nesov on Response to Dileep George: AGI safety warrants planning ahead · 2024-07-08T18:57:23.733Z · LW · GW

expecting LLMs to not be the right kind of algorithm for future powerful AGI—the kind that can ... do innovative science

I don't know what could serve as a crux for this. When I don't rule out LLMs, what I mean is that I can't find an argument with the potential to convince me to become mostly confident that scaling LLMs to 1e29 FLOPs in the next few years won't produce something clunky and unsuitable for many purposes, but still barely sufficient to then develop a more reasonable AI architecture within 1-2 more years. And by an LLM that does this I mean the overall system that allows the LLM's scaffolding environment to create and deploy new tuned models, using new preference data that lets the new LLM variant do better on particular tasks as the old LLM variant encounters them, or even to pre-train models on datasets with heavy doses of LLM-generated problem sets with solutions, distilling the topics that the previous generation of models needed extensive search to stumble through navigating. This takes a lot of time and compute to retrain models in a particular stilted way where a more reasonable algorithm would do it much more efficiently.

Many traditionally non-LLM algorithms reduce to such a setup, at an unreasonable but possibly still affordable cost. So this quite fits the description of LLMs as not being "the right kind of algorithm", but the prediction is that the scaling experiment could go either way, that there is no legible way to be confident in either outcome before it's done.

Comment by Vladimir_Nesov on An AI Manhattan Project is Not Inevitable · 2024-07-07T00:21:04.740Z · LW · GW

The relevant distinction is between compute that proliferated before there were long-horizon task capable AIs, and compute that's necessary to train autonomous researcher AIs. A lot of compute might even be needed to maintain their ability to keep working on novel problems, since an AI trained on data that didn't include the very recent progress might be unable to make further progress, and continued training isn't necessarily helpful enough compared to full retraining, so that stolen weights would be relatively useless for getting researcher AIs to do deep work.

There are only 2-3 OOMs of compute scaling left to explore if capabilities of AIs don't dramatically improve, and LLMs at current scale robustly fail at long-horizon tasks. If AIs don't become very useful at something, there won't be further OOMs until many years pass and there are larger datacenters, possibly well-tested scalable asynchronous distributed training algorithms, more energy-efficient AI accelerators, more efficient training, and ways of generating more high quality data. Now imagine that long-horizon task capable AIs were developed just before or even during this regime of stalled scaling, that it took more than a year, $100 billion, and 8 gigawatts to train one, and that it's barely working well enough to unlock the extreme value of there being cheap and fast autonomous digital workers capable of routine jobs, going through long sequences of unreliable or meandering reasoning but eventually catching the systematic problems in a particular train of thought, recovering well enough to do their thing. And further scaling resulting from a new investment boom still fails to produce a researcher AI, as it might take another 2-3 OOMs and we are all out of AI accelerators and gigawatts for the time being.

In this scenario, which seems somewhat plausible, the governments both finally actually notice the astronomical power of AI, and have multiple years to get all large quantities of compute under control, so that the compute available for arbitrary non-government use gets somewhat lower than what it takes to train even a barely long-horizon task capable AI that's not at all a researcher. Research-capable TAI then by default won't appear in all these years, and after the transition to centralized control over compute is done, future progress towards such AI can only happen under government control.

Comment by Vladimir_Nesov on An AI Manhattan Project is Not Inevitable · 2024-07-06T17:29:24.263Z · LW · GW

I think the main reason governments may fail to take control (no comment on keeping it) is that TAI might be both the first effective wakeup call and the point when it's too late to take control. It can be too late if there is already too much proliferation, sufficient-if-not-optimal code and theory and models already widely available, sufficient compute to compete with potential government projects already abundant and impossible to sufficiently take down. So even if the first provider of TAI is taken down, in a year everyone has TAI, and the government fails to take sufficient advantage of its year of lead time to dissuade the rest of the world.

The alternative where government control is more plausible is first making a long-horizon task capable AI that can do many jobs, but can't itself do research or design AIs, and a little bit of further scaling or development isn't sufficient to get there. The economic impact then acts as a wakeup call, but the AI itself isn't yet a crucial advantage, can be somewhat safely used by all sides, and doesn't inevitably lead to ASI a few years later. At this point governments might get themselves a monopoly on serious compute, so that any TAI projects would need to go through them.

Comment by Vladimir_Nesov on Can agents coordinate on randomness without outside sources? · 2024-07-06T16:44:27.232Z · LW · GW

The other known-code agent is only an ingredient in defining the outcome you care about (the source code of the world), but not on its own a salient thing when considering coordination. Acausal coordination acts in a broader way: potential contracts you might decide to sign (and so introduce as processes with influence over the a priori defined outcome, the result of computing the code of the world) are not restricted by what's already explicitly in the initial code of either yourself or your opponents. Coordination here refers to how a contract acts in the same way in all places where it has influence, since it's the same thing in all these places, and in particular the same contract can act through actions of multiple agents, coordinating them. But contracts are new programs chosen by the players as they engage in setting up coordination between each other; they are not the initial players themselves.

So the distinction between the other agent and its parent rigging the outcome shouldn't help. Even though both of these are technically potential contracts and could be involved in coordination, both are obviously bad choices for contracts. You wouldn't want to directly give your opponent control over your own action, hence your opponent shouldn't be an acausal contract that coordinates your action. Similarly for your opponent's parent: acting by proxy through your opponent's code, it's not a good contract to grant influence over the outcome through your action.

I don't know how to solve this puzzle; it seems potentially important if it can be solved. What comes to mind is picking an obvious Schelling point, like the procedure of taking the hash of the agent codes concatenated in ascending order as natural numbers, and using it as a contract. Parents of agents might try to manipulate the behavior of such a contract (by rigging otherwise behaviorally irrelevant details of the codes of the agents). Or they might manipulate the choice of a contract (by rigging agents' intuition for which procedure would be a Schelling point). But that is symmetric, and the parents have an incentive to let the agents settle on some procedure rather than never agree, and so the outcome remains pseudorandom. Which also incentivises the parents to expend fewer resources on competing over it.

Comment by Vladimir_Nesov on Can agents coordinate on randomness without outside sources? · 2024-07-06T14:31:26.778Z · LW · GW

Acausal coordination is centrally about choosing a procedure based on its merits other than the concrete actions it determines, and then following the procedure without faltering at its verdict. As in signing a contract, where the thing you agree to is the overall contract that governs many possibilities, and not a particular outcome.

This is very similar to what makes a procedure a pseudorandomness generator, the numbers it determines shouldn't be related to any relevant considerations, instead you choose the generator for reasons other than the consequences of the actual numbers it determines. And like in ASP, being technically capable of simulating the generator in order to rig it to choose a favorable outcome (or to prepare to what it generates) doesn't mean that this is the correct thing to be doing, since doing this breaks down coordination.

Comment by Vladimir_Nesov on AI #71: Farewell to Chevron · 2024-07-05T13:46:07.875Z · LW · GW

[Carl Shulman] is assuming normality where he shouldn’t, and this is one of the key places for that. It is a vision of AGI without ASI

This is valid exploratory engineering, which assumes some capabilities and considers what can be done with at least those capabilities. There is no implication that this is what will be done, or that capabilities won't be much greater. We can still conclude that what can be done with merely these capabilities will remain an option given greater capabilities. Forecasting of optionality, not of actuality.

Comment by Vladimir_Nesov on Static Analysis As A Lifestyle · 2024-07-04T15:56:01.864Z · LW · GW

My objection is to "IRL" and "in practice" in your top level comment, as if this is what static analysis is actually concerned with. In various formal questions, the halting problem is a big deal, and I expect the same shape to be important in decision theory of the future (which might be an invention that humans will be too late to be first to develop). Being cooperative about loops of mutual prediction seems crucial for (acausal) coordination; the halting problem just observes that the same ingredients can be used to destroy all possibility of coordination.

Comment by Vladimir_Nesov on Static Analysis As A Lifestyle · 2024-07-04T15:30:24.749Z · LW · GW

Rice's theorem and the halting problem are completely irrelevant in practice as sources of difficulty. Take a look at the proofs. The program being analyzed would basically need to reason about the static analyzer, and then act contrary to the analyzer's expectation. Programs you find in the real world don't do that. Also, they wouldn't know (or care about) the specific analyzer enough to anticipate its predictions.

The halting problem is a thing because there exists such an ornery program that makes a deliberate effort to be unpredictable specifically by this analyzer. But it's not something that happens on its own in the real world, not unless that program is intelligent and wants to make itself less vulnerable to the analyzer's insight.
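A minimal sketch of that ornery construction, with a purely hypothetical `halts` analyzer (nothing like this arises in ordinary code):

```python
# The diagonal construction behind the halting problem: a program that consults the
# (hypothetical) analyzer about itself and then does the opposite of whatever it predicts.
def make_ornery(halts):
    def ornery():
        if halts(ornery):    # the analyzer predicts this function halts...
            while True:      # ...so loop forever instead
                pass
        return               # the analyzer predicts non-termination, so halt immediately
    return ornery
```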

Comment by Vladimir_Nesov on Decaeneus's Shortform · 2024-07-03T21:15:14.220Z · LW · GW

I'm more certain about ASI being 1-2 years after TAI than about TAI in 2-5 years from now, as the latter could fail if the current training setups can't make LLMs long-horizon capable at a scale that's economically feasible absent TAI. But probably 20 years is sufficient to get TAI in any case, absent civilization-scale disruptions like an extremely deadly pandemic.

A model can update on discussion of its gears. Given predictions that don't cite particular reasons, I can only weaken it as a whole, not improve it in detail (when I believe the predictions know better, without me knowing what specifically they know). So all I can do is mirror this concern by citing particular reasons that shape my own model.

Comment by Vladimir_Nesov on Decaeneus's Shortform · 2024-07-03T20:13:10.804Z · LW · GW

My model is that the current scaling experiment isn't done yet but will be mostly done in a few years, and LLMs can plausibly surpass the data they are training on. Also, LLMs are digital and 100x faster than humans. Then once there are long-horizon task capable AIs that can do many jobs (the AGI/TAI milestone), even if the LLM scaling experiment failed and it took 10-15 years instead, we get another round of scaling and significant in-software improvement of AI within months that fixes all remaining crippling limitations, making them cognitively capable of all jobs (rather than only some jobs). At that point growth of industry goes off the charts, closer to biological anchors of say doubling in fruit fly biomass every 1.5 days than anything reasonable in any other context. This quickly gives the scale sufficient for ASI even if for some unfathomable reason it's not possible to create with less scale.

Unclear what cryonics not yet working could mean: even highly destructive freezing is not a cryptographically secure method for erasing data, and redundant clues about everything relevant will endure. A likely reason to expect cryonics not to work is not believing that ASI is possible, with the actual capabilities of a superintelligence. This is similar to how economists project "reasonable" levels of post-TAI growth by not really accepting the premise of AIs actually capable of all jobs, including all new jobs their introduction into the economy creates. More practical issues are the unreliability of arrangements that make cryopreservation happen for a given person and of subsequent storage all the way until ASI, through all the pre-ASI upheaval.

Comment by Vladimir_Nesov on Decaeneus's Shortform · 2024-07-03T19:32:12.813Z · LW · GW

Not for those who think AGI/TAI plausible within 2-5 years, and ASI 1-2 years after. Accelerating even further, rather than applying whatever feasible caution can hopefully slow it down a bit and shape it more carefully, would mostly increase doom, not personal survival. Also, there's cryonics.

Comment by Vladimir_Nesov on An AI Race With China Can Be Better Than Not Racing · 2024-07-02T19:47:55.681Z · LW · GW

I think the PRC is behind on TAI, compared to the US, but only about one year.

Unless TAI is close to current scale, there will be an additional issue with hardware in the future that's not yet relevant today. It's not insurmountable, but it costs more years.

Comment by Vladimir_Nesov on Habryka's Shortform Feed · 2024-07-01T16:02:40.162Z · LW · GW

I left [...] and am not under any such agreement.

Neither is Daniel Kokotajlo. Context and wording strongly suggest that what you mean is that you weren't ever offered paperwork with such an agreement and incentives to sign it, but there remains a slight ambiguity on this crucial detail.

Comment by Vladimir_Nesov on Habryka's Shortform Feed · 2024-06-30T21:22:58.064Z · LW · GW

(I'm a full-time employee at Anthropic.)
I carefully read my contract both before signing and a few days ago [...] there wasn't anything like this in there.

Current employees of OpenAI also wouldn't yet have signed or even known about the non-disparagement agreement that is part of "general release" paperwork on leaving the company. So this is only evidence about some ways this could work at Anthropic, not others.

Comment by Vladimir_Nesov on mesaoptimizer's Shortform · 2024-06-30T18:26:37.119Z · LW · GW

Collections of datacenter campuses sufficiently connected by appropriate fiber optic probably should count as one entity for purposes of estimating training potential, even in the current synchronous training paradigm. My impression is that laying such fiber optic is both significantly easier and significantly cheaper than building power plants or setting up power transmission over long distances in the multi-GW range.

Thus for training 3M B100s/6GW scale models ($100 billion in infrastructure, $10 billion in cost of training time), hyperscalers "only" need to upgrade the equipment and arrange for "merely" on the order of 1GW in power consumption at multiple individual datacenter campuses connected to each other, while everyone else is completely out of luck. This hypothetical advantage makes collections of datacenter campuses an important unit of measurement, and also it would be nice to have a more informed refutation or confirmation that this is a real thing.
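A rough consistency check on these figures (the per-GPU cost, power, and hourly rate below are my own guesses, not numbers from anywhere in particular):

```python
# Rough consistency check of the 3M B100s / 6 GW / $100 billion figures.
# Per-GPU all-in cost (~$35K), power (~2 kW including datacenter overhead),
# and the ~$2/hour rate are guesses for illustration only.
n_gpus = 3e6
print(n_gpus * 35e3 / 1e9)   # ~$105 billion in infrastructure
print(n_gpus * 2e3 / 1e9)    # ~6 GW of power
print(10e9 / (n_gpus * 2))   # ~1700 hours (~70 days) of training time for $10 billion of time
```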

Comment by Vladimir_Nesov on Population ethics and the value of variety · 2024-06-29T18:17:24.461Z · LW · GW

Incidentally, since weights need to be maintained in hardware available for computation, spinning up another thread of an existing person might be 10,000 times cheaper than instantiating a novel person.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-06-29T02:00:16.083Z · LW · GW

Okay, but why isn't this exactly the same as them just thinking to themselves "conditional on me taking action K, here's the distribution over their actions" for each of N actions they could take, and then maximizing expected value?

The main trick with PD is that instead of an agent only having two possible actions C and D, we consider many programs the agent might self-modify into (commit to becoming) that each might in the end compute C or D. This effectively changes the action space; there are now many more possible actions. And these programs/actions can be given access (like quines, by their own construction) to the initial source code of all the agents, allowed to reason about them. But then programs have logical uncertainty about how they in the end behave, so the things you'd be enumerating don't immediately cash out in expected values. And these programs can decide to cause different expected values depending on what you'll do with their behavior, anticipating how you reason about them through reasoning about you in turn. It's hard to find clear arguments for why any particular desirable thing could happen as a result of this setup.

UDT is notable for being one way of making this work. The "open source game theory" of PD (through Löb's theorem, modal fixpoints, Payor's lemma) pinpoints some cases where it's possible to say that we get cooperation in PD. But in general it's proven difficult to say anything both meaningful and flexible about this seemingly in-broad-strokes-inevitable setup, in particular for agents with different values that are doing more general things than playing PD.

(The following relies a little bit on motivation given in the other comment.)

When both A and B consider listening to a shared subagent C, subagent C is itself considering what it should be doing, depending on what A and B do with C's behavior. So for example with A there are two stages of computation to consider: first, it was just A and didn't yet decide to sign the contract; then it became a composite system of A together with A's policy for giving influence to C's behavior (possibly A and B include a larger part of the world where the first agent exists, not just the agent itself). The commitment of A is to the truth of an equality that gives C influence over the computational consequences of A in that particular shape. The trick with the logical time of this process is that C should be able to know (something about) A updatelessly, without being shown observations of what it is, so that the instance of C within B would also know of A and be able to take it into account in choosing its joint policy that acts both through A and B. (Of course, the same is happening within B.)

This sketch frames decision making without directly appealing to consequentialism. Here, A controls C through the incentives it creates for C (a particular way in which C gets to project influence from A's place in the world), where C also has influence over A. So A doesn't seek to manipulate C directly by considering the consequences for C's behavior of various ways that A might behave.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-06-29T00:59:57.233Z · LW · GW

UDT doesn't do multistage commitments; it has a single all-powerful "past" version that looks into all possible futures before pronouncing a global policy that all of them would then follow. This policy is not a collection of commitments in a reasonable informal sense; it's literally all details of behavior of future versions of the agent in response to all possible observations. In case of logical updatelessness, also in response to all possible observations of computational facts. (UDT for the idealized past version defines a single master model; future versions are just passively running inference from the contexts of their particular situations.)

The convergent idea for acausal coordination between systems A and B seems to be constructing a shared subagent C whose instances exist as part of both A and B (after A and B successfully both construct the same C, not before), so that C can then act within them in the style of FDT, though really it's mostly about thinking of the effects of its behavior in terms of "I am an algorithm" rather than "I am a physical object". (For UDT, the shared subagent is the idealized common past version of its different possible future versions A and B. This assumes that A and B already have a lot in common, so maybe C is instead Buddha.)

The bulk of the blind alleys seems to be about allowing subagents various superpowers, instead of focusing on managing the fallout of making them small and bounded (but possibly more plentiful). I think this is where investigations into logical updatelessness go wrong. It does need solving, but not by considering some fact unknown globally, or even at certain logical times. Instead a fact can remain unknown to some small subagent, and can be observed by it at some point, or computed by another subagent. Values are also knowledge, so sufficiently small subagents shouldn't even by default know the full values of the larger system, and should be prepared to learn more about them. This is a consideration that doesn't even depend on there initially being multiple big agents with different values.

Another point is that coordination doesn't necessarily need construction of exactly the same shared subagent, or it doesn't need to be "exactly the same" in a straightforward sense, which the results on coordination in PD illustrate. The role of subagents in this case is that A can create a subagent C, while B creates a subagent D. And even where A and B remain intractable for each other, C and D can be much smaller and by construction prepared to coordinate with each other, from within A and B. (It seems natural for the big agents to treat such subagents as something like their copies of an assurance contract, which is signed through commitment to give them influence over the big agent's thinking or behavior. And letting contracts be agents in their own right gives a lot of flexibility in coordination they can arrange.)

Comment by Vladimir_Nesov on OpenAI #8: The Right to Warn · 2024-06-28T20:30:24.960Z · LW · GW

Do you think Sam Altman is seen as a reckless

How can it NOT be reckless to pursue something without extreme caution that is believed by people with the most knowledge in the field to be close to a round of Russian roulette for humankind?

It doesn't follow that he is seen as reckless even by those giving the 5-10% answer on the human extinction question, and this is a distinct fact from actually being reckless.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-06-28T15:26:23.885Z · LW · GW

UDT never got past the setting of unchanging preferences, so the present agent blindly defers to all decisions of the idealized past agent (which doesn't physically exist). And if the past agent doesn't try to wade into the murky waters of logical updatelessness, it's not really dumber or more vulnerable to trickery; it can see everything the way a universal Turing machine or Solomonoff induction can "see everything". Coordinating agents with different values was instead explored under the heading of Prisoner's Dilemma. Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.

Comment by Vladimir_Nesov on p.b.'s Shortform · 2024-06-28T15:07:56.645Z · LW · GW

Nice! And the "scaling laws" terminology in this sense goes way back:

Comment by Vladimir_Nesov on How Big a Deal are MatMul-Free Transformers? · 2024-06-28T14:07:35.665Z · LW · GW

has been a lot of interest in this going back to at least early this year

This is 2015-2016 tech though. The value of the recent ternary BitNet result is demonstrating that it works well for transformers (which wasn't nearly as much the case for binary BitNet).

The immediate practical value of this recent paper is more elusive: they try to do even more by exorcising multiplication from attention, which is a step in an important direction, but the data they get doesn't seem sufficient to overcome the prior that this is very hard to do successfully. Only Mamba got close to attention as a pure alternative (without the constraint of avoiding multiplication), and even then it has issues unless we hybridize it with (local) attention (which also works well with other forms of attention alternatives, better even than vanilla attention on its own).

Comment by Vladimir_Nesov on Yitz's Shortform · 2024-06-27T18:27:26.583Z · LW · GW

(See this comment for more context.) The point is to make inference cheaper in operations and energy, which seems crucial primarily for local inference on smartphones, but in principle might make datacenter inference cheaper in the long run, if a new generation of hardware specialized for inference adapts to this development. The bulk of the improvement (without significant degradation of performance) was already demonstrated for transformers with ternary BitNet (see also this "Code and FAQ" followup report with better data on degradation of performance; only "download raw file" button works for me).

What they attempt to do in the paper you link is extract even more improvement by getting rid of multiplication in attention, and so they explore alternative ways of implementing attention, since the general technique doesn't work with standard attention out of the box. But attention has long evaded attempts to approximate it without degradation of performance (usually when trying to enable long context); the best general approach seems to be to hybridize an efficient attention alternative with precise sliding window (local) attention (by including one or the other in different layers). They reference the Griffin paper, but don't seem to engage with this point on hybridization, so it's something for future work to pick up.

Comment by Vladimir_Nesov on New fast transformer inference ASIC — Sohu by Etched · 2024-06-26T14:54:36.880Z · LW · GW

One 8xSohu server replaces 160 H100 GPUs.
Benchmarks are for Llama-3 70B in FP8 precision, 2048 input/128 output lengths.
What would happen if AI models get 20x faster and cheaper overnight?

So there is an oblique claim that they might potentially offer 20x cheaper inference in a setup with unknown affordances. Can it run larger models, or use more context? Is generation latency reasonable, and at what cost?

The claims of being "faster" and "500k tokens per second" are about throughput per black box with unspecified characteristics, so in isolation they are meaningless. You can correctly say exactly the same thing about "speed" for Llama-3 70B inference using giant black boxes powered by a sufficient number of Pentium 4s.

Comment by Vladimir_Nesov on Andrew Burns's Shortform · 2024-06-26T14:35:48.872Z · LW · GW

Papers like the one involving elimination of matrix-multiplication suggest that there is no need for warehouses full of GPUs to train advanced AI systems.

The paper is about getting rid of multiplication in inference, not in training (specifically, it focuses on attention rather than MLP). Quantization aware training creates models with extreme levels of quantization that are not much worse than full precision models (this is currently impossible to do post-training, if training itself wasn't built around targeting this outcome). The important recent result is ternary quantization where weights in MLP become {-1, 0, 1}, and thus multiplication by such a matrix no longer needs multiplication by weights. So this is relevant for making inference cheaper or running models locally.
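A toy illustration of why ternary weights remove the multiplications (numpy for clarity; real BitNet kernels are of course implemented very differently):

```python
import numpy as np

# With weights restricted to {-1, 0, 1}, a matrix-vector product reduces to
# additions and subtractions of activations; no multiplications by weights are needed.
W = np.array([[1, 0, -1],
              [0, 1, 1]])         # ternary weight matrix
x = np.array([0.5, -2.0, 3.0])    # activations

y_matmul = W @ x                  # ordinary matrix-vector product
y_adds = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])
assert np.allclose(y_matmul, y_adds)   # same result, using only adds and subtracts
```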

Comment by Vladimir_Nesov on Zachary's Shortform · 2024-06-25T22:58:05.200Z · LW · GW

These are considerations about prior plans, not change of plans caused by recent events ("pushed back GPT-5 to late 2025"). They don't necessarily need much more compute than for other recent projects either, just ease up on massive overtraining to translate similar compute into more capability at greater inference cost, and then catch up on efficiency with "turbo" variants later.