almost no difference between 180b vs 800b model, when r=1(table 4)
It's a 3B parameter model, so training it for 180B tokens already overtrains it maybe 3x, and training for 800B tokens overtrains it 13x. The loss of compute efficiency from the latter is about 1.6x greater than from the former, but with 4.4x more raw compute the 800B-token run still has about 2.7x more effective compute, so it acts like a compute optimal model that's 1.6x larger, trained on 1.6x more tokens. So the distinction is smaller than 180 vs. 800 suggests.
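A small sketch of this arithmetic, assuming a ~20 tokens/param compute-optimal ratio and the cube-root overtraining penalty rule of thumb discussed elsewhere in these comments:

```python
# Rough sanity check, assuming ~20 tokens/param compute-optimal ratio
# and the (degree of overtraining)^(1/3) compute-penalty rule of thumb.
params = 3e9
optimal_tokens = 20 * params            # ~60B tokens

def penalty(tokens):
    """Effective-compute penalty from overtraining by tokens/optimal_tokens."""
    return (tokens / optimal_tokens) ** (1 / 3)

raw_ratio = 800e9 / 180e9                          # ~4.4x more raw compute
penalty_ratio = penalty(800e9) / penalty(180e9)    # ~1.6x extra penalty
effective_ratio = raw_ratio / penalty_ratio        # ~2.7x more effective compute
scale_up = effective_ratio ** 0.5                  # ~1.6x larger compute optimal model
print(raw_ratio, penalty_ratio, effective_ratio, scale_up)
```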
I think this framing doesn't work, programs almost never control each other. Instead they can coordinate with each other by agreeing to follow decisions of a third program, which is identical between them, a "contract". Initially, the contract isn't yet "signed", so seeing each other's code sets up the conditions for defining a shared contract (deciding to follow what it'll say once computed).
There could be many contracts simultaneously, each weakly nudging the decisions of multiple agents coordinated through them. Social norms are contracts in this sense. I think some computations of circuits of deep learning models are contracts with the environment: these computations (numerous and small) decide both what the environment will do (the way arithmetic decides what a physical calculator will do) and what the model will predict, and so the two become coordinated, even in situations never seen in the training dataset.
Whether upvotes need to be explained overall is not relevant to my comment, as I'm talking about the specific considerations named by Noah Birnbaum.
It's not yet known if there is a way of turning R1-like training into RSI with any amount of compute. This is currently gated by quantity and quality of graders for outcomes of answering questions, which resist automated development.
If the reasons to leave are too legible, they are either toothless, or they will be gamed and become too costly to actually enforce (with costs that include injustice and drama). Trivial inconveniences that differentially apply to people who should leave anyway are still effective, but don't have these downsides.
(My own policy is to almost always avoid downvoting precisely when I have a comment to make. Otherwise the vote is all the feedback I have to give, so I'm going to give it rather than metaphorically slash their tires by staying silent and maintaining a misleading impression about the reception of their post/comment.)
These considerations also apply to upvotes (to the extent that they do).
It's crucial that some people get discouraged and leave for illegible reasons, without a need for hard enforcement, which has unwieldy externalities. For almost everyone who should stay, figuring out reasons for significant downvoting is probably not very difficult. Any discussion would then be about correctness or endorsement of those reasons, not about finding out what they are.
For scaling to larger training systems, the trend is probably increasing, since larger datasets have lower quality, and soon repetition in training will become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), it's not going to be ignored if at all possible. There are other studies that show a decreasing trend, but this probably won't hold up in practice as we get to 250T and then 750T tokens within a few years even for a dense model.
For 1:32 MoE at 5e28 FLOPs (5 GW $150bn training systems of 2028), we get maybe 700 tokens/param optimal (counting effect of sparsity, effect of repetition, and effect of more compute), so that's 3.5T active and 110T total params trained for 2.5e15 tokens (maybe 80T tokens repeated 30 times). Not sure if this kind of total params can be made to work.
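A sketch of where these numbers come from, assuming C ≈ 6·N·D and the ~700 tokens per active param optimum estimated above:

```python
# Sketch of the 5e28 FLOPs projection, assuming C = 6*N*D and an optimal
# ratio of ~700 tokens per active param for a 1:32-sparse MoE at this scale.
C = 5e28
ratio = 700
N_active = (C / (6 * ratio)) ** 0.5     # ~3.5e12 active params
N_total = 32 * N_active                 # ~1.1e14 total params
D = ratio * N_active                    # ~2.5e15 tokens
epochs = D / 80e12                      # ~30 repetitions of an 80T-token dataset
print(N_active, N_total, D, epochs)
```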
Chinchilla's 20 tokens/param (at 6e23 FLOPs) changes significantly when working with different datasets, architectures, or amounts of compute. For Llama-3-405B, it's 37 tokens/param at 4e25 FLOPs, increasing 1.5x for every 1000x of compute (Figure 3). When training on data repeated 60 times, optimal tokens/param increases about 2.5x (Figure 3).
For MoE models with 87% (1:8) sparsity, optimal tokens/param increases 3x, and at 97% (1:32) sparsity by 6x (Figure 12, left). This suggests that if Llama-3-405B were instead a MoE model with 97% sparsity, its optimal ratio would be 220 tokens/param rather than 37.
Overtraining or undertraining is the use of a suboptimal tokens/param ratio. The effect is not that large: the rule of thumb is that the compute multiplier penalty is the degree of overtraining raised to the power 1/3. So 30x overtraining (using 600 tokens/param instead of 20 tokens/param) results in the same penalty as training a compute optimal model with 3x less compute, and 10x overtraining (or undertraining) corresponds to using 2x less compute (which can be compensated by using 2x more compute instead, in order to maintain the same performance).
This curiously suggests that original GPT-4 was also undertrained, similarly to GPT-3. Rumored compute is 2e25 FLOPs, and rumored architecture is 1.8T total parameter MoE with 2:16 sparsity, so 220B params for active experts, and say another 40B for non-expert params, for the total of 260B. This gives 13T tokens or 50 tokens/param. If the dataset has Llama-3's 37 tokens/param optimal for a dense model at 2e25 FLOPs, then with 1:8 sparsity the optimal ratio would be 110 tokens/param, so at 50 tokens/param it's undertrained about 2x. The effect of this is losing 1.3x in effective compute, not a whole lot but more than nothing.
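A hedged check of the original GPT-4 estimate above, using C ≈ 6·N·D and the cube-root penalty rule of thumb; all inputs are the rumored figures, not independently sourced:

```python
# Hedged check of the rumored original GPT-4 numbers, using C = 6*N*D
# and the (degree of undertraining)^(1/3) penalty rule of thumb from above.
C = 2e25            # rumored raw compute, FLOPs
N_active = 260e9    # ~220B active expert + ~40B non-expert params (rumor)
D = C / (6 * N_active)                 # ~1.3e13 tokens
ratio = D / N_active                   # ~50 tokens/param
optimal_ratio = 37 * 3                 # ~110: assumed dense optimum * 1:8-sparsity factor
undertraining = optimal_ratio / ratio  # ~2x undertrained
penalty = undertraining ** (1 / 3)     # ~1.3x effective compute loss
print(D, ratio, undertraining, penalty)
```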
With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.
With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining.
Now consider the 1e20 FLOPs plot in Figure 12, left. If there's only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
10.5-13% on text only part of HLE
This is for o3-mini, while the ~25% figure for o3 from the tweet you linked is simply restating deep research evals.
And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).
But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen in the isoFLOP plots of Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), the compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments at the other compute budgets, the reduction in active parameters relative to dense seems to be about 2.5x. Keeping compute unchanged, 2.5x fewer active parameters means 2.5x more data, or a 6x greater tokens/parameter ratio for a compute optimal training run.
Thus a dense model can be replaced with a 97% sparse MoE model trained using 6x less compute that will achieve the same perplexity, but the tokens/parameter ratio of this MoE model will be 6x greater than for the original dense model. Both data and active parameters would go down by 2.5x from reducing compute 6x if the ratio didn't change, but since it does change, in actuality only the number of active parameters goes down 6x, while the number of tokens stays the same.
Let's take Llama-3-405B as an example, which is a 405B parameter compute optimal model trained for 15T tokens at 40 tokens/parameter, using 4e25 FLOPs. An equivalent 97% sparse model will have 70B active parameters, 2T total parameters, and will need to be trained for the same 15T tokens to reach the same perplexity/loss at 220 tokens/parameter, using 6e24 FLOPs. (Which is close to DeepSeek-V3's 4e24-5e24 FLOPs actually, so anchoring to Llama-3-405B might be a good way of framing its compute efficiency.)
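A sketch of this comparison, assuming C ≈ 6·N·D and a 6x compute multiplier at unchanged data:

```python
# Sketch of the Llama-3-405B vs 97%-sparse-MoE comparison, assuming C = 6*N*D.
dense_N, D = 405e9, 15e12
dense_C = 6 * dense_N * D              # ~4e25 FLOPs
moe_N = dense_N / 6                    # ~70B active params (6x compute multiplier, same data)
moe_total = moe_N * 32                 # ~2T total params at 1:32 sparsity
moe_C = 6 * moe_N * D                  # ~6e24 FLOPs
moe_ratio = D / moe_N                  # ~220 tokens per active param
print(dense_C, moe_N, moe_total, moe_C, moe_ratio)
```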
didn't run red-teaming and persuasion evals on the actually-final-version
Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won't be redone, so it's equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.
they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks."
Ah, I failed to take note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is convergence, both from near-saturation of the benchmark and from the graph apparently leveling off. But if replotted in log-steps instead of linear steps, there isn't any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: accuracy for pass@1 is 0.45 after 2K steps, 0.55 (+0.10) after 4K steps, then 0.67 (+0.12) after 8K steps; it just keeps going up by about +0.10 every doubling in training steps. And the plots-that-don't-level-off in the o1 post are in log-steps. Also, the average number of reasoning steps for R1-Zero in Figure 3 is a straight line that's probably good for something if it goes further up. So I guess I might even disagree with the authors in characterizing step 10K as "at convergence", though your quote is about R1 rather than R1-Zero, which is what the plots in the paper show...
your analysis of GPT-5--which is worrying for short-term scaling
Well, I mostly argued about naming, not facts, though the recent news seems to suggest that the facts are a bit better than I expected only a month ago: namely, 1 GW training systems might only get built in 2026 rather than in 2025, except possibly at Google. And as a result even Google might feel less pressure to actually get this done in 2025.
DeepSeek-R1 ... Run RL to convergence
Not to convergence, the graphs in the paper keep going up. Which across the analogy might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.
o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023)
It seems like o1-mini is its own thing, might even start with a base model that's unrelated to GPT-4o-mini (it might be using its own specialized pretraining data mix). So a clue about o3-mini data doesn't obviously transfer to o3.
if it used GPT-5 as a base model
The numbering in GPT-N series advances with roughly 100x in raw compute at a time. If original GPT-4 is 2e25 FLOPs, then a GPT-5 would need 2e27 FLOPs, and a 100K H100s training system (like the Microsoft/OpenAI system at the site near the Goodyear airport) can only get you 3e26 FLOPs or so (in BF16 in 3 months). The initial Stargate training system at Abilene site, after it gets 300K B200s, will be 7x stronger than that, so will be able to get 2e27 FLOPs. Thus I expect GPT-5 in 2026 if OpenAI keeps following the naming convention, while the new 100K H100s model this year will be GPT-4.5o or something like that.
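A rough version of the 3e26 FLOPs estimate; the per-GPU throughput and utilization below are assumptions rather than figures from the comment:

```python
# Hedged estimate of 3-month training-run FLOPs for a 100K-H100 system,
# assuming ~1e15 dense BF16 FLOP/s per H100 and ~40% utilization.
gpus = 100_000
peak_flops = 1e15          # per-GPU dense BF16 FLOP/s (assumed)
mfu = 0.4                  # assumed model FLOPs utilization
seconds = 90 * 24 * 3600   # ~3 months
total = gpus * peak_flops * mfu * seconds   # ~3e26 FLOPs
print(total)
```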
The fact that RL seems to be working well on LLMs now, without special tricks, as reported by many replications of r1, suggests to me that AGI is indeed not far off.
Still, at least as long as base model effective training compute isn't scaled another 1,000x (which is 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM) rewards, which for now don't let RL scale as much as with explicitly coded verifiers.
This is an obvious thing to try, but it's not what currently already works, and it's not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn't work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
Relative to GPT-4o, which was trained at a time when 30K H100s clusters were around, and so in BF16 could be expected to be around 8e25 FLOPs, possibly overtrained to a degree that's not too different from DeepSeek-V3 itself.
Amodei's post you linked says a few tens of millions of dollars for Claude 3.5 Sonnet, which is maybe 4e25 FLOPs in BF16, but I think Claude 3.5 Sonnet is better than DeepSeek-V3, which is not as clearly the case for GPT-4o vs. DeepSeek-V3, making the latter pair easier to compare. Being better than GPT-4o at 2x fewer FLOPs, Claude 3.5 Sonnet has at least a 4x compute multiplier over it (under all the assumptions), but not necessarily more, while with DeepSeek-V3 there's evidence for more. As DeepSeek-V3 was trained more than half a year later than Claude 3.5 Sonnet, it's somewhat "on trend" in getting a compute multiplier of 16x instead of 4x, if we anchor to Amodei's claim of 4x per year.
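One hedged way to get from "a few tens of millions of dollars" to roughly 4e25 FLOPs; the cost, rental rate, and utilization below are all assumptions chosen to land in that ballpark, not figures from the post:

```python
# Dollars-to-FLOPs sketch; every input is an assumption.
cost = 45e6                   # dollars, assumed "a few tens of millions"
dollars_per_h100_hour = 2.0   # assumed rental-equivalent rate
mfu = 0.5                     # assumed utilization of ~1e15 BF16 FLOP/s per H100
flops = (cost / dollars_per_h100_hour) * 3600 * 1e15 * mfu
print(flops)                  # ~4e25 FLOPs
```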
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural, as there are millions of datacenter GPUs but only a few 100K-GPU frontier training systems, which are a tiny fraction compared to inference and smaller/research training compute. The $500bn figure is not relevant, as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.
OpenAI would want to get from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is some evidence of slowdown, since it only motivates saying you want to build frontier training systems even faster, but doesn't in itself motivate actually going through with it, beyond building a competitive training system that makes you independent.
So the clues that support the prospect of scaling to 1 GW in 2025 and to 5 GW in 2027 could be misleading, running contrary to hyperscaler attitudes and not aligning even with OpenAI's immediate incentives.
I previously expected that $80bn is evidence that they are building a large training system this year, but it now seems that they are building more inference instead. ↩︎
As Satya Nadella said, "If OpenAI disappeared tomorrow... we have all the IP rights and all the capability. We have the people, we have the compute, we have the data, we have everything. We are below them, above them, around them." ↩︎
From what I remember, the training-compute optimal number of experts was like 64
I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.
Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so "number of experts" doesn't really capture the ratio of total to activated, probably not a good anchor by itself.
Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
This still doesn't help with the question of why 37B active parameters is sufficient. Even with 100500 experts you can't expect 1B active parameters to be sufficient to maintain GPT-4 quality. The rumor for original GPT-4 is that it has 2 activated experts out of 16 in total, so the ratio is 1:8, while for DeepSeek-V3 it's 1:32.
that's why I wrote: "possibly 4x fewer training steps for the same number of tokens if predicting tokens only once" (assuming predicting 4 tokens at a time), but that's not demonstrated nor published (given my limited knowledge on this)
Not sure how to parse this; my point is that the number of training steps remains the same and training efficiency doesn't significantly increase, there's even slight overhead from adding the predict-the-token-after-next blocks of parameters. This is described in Section 2.2 of the paper. You get better speculative decoding at inference time (and also better quality of output), but training time is the same, not 2x or 4x fewer steps.
32B active parameters instead of likely ~220-280B for GPT-4 => 6.8-8.7x lower training cost per token.
It's 37B active parameters, not 32B.
The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.
But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in Jevons paradox), that argument doesn't work. Also, the ceiling of quality at the possible scaling slowdown point depends on efficiency of training (compute multiplier) applied to the largest training system that the AI economics will support (maybe 5-15 GW without almost-AGI), and improved efficiency of DeepSeek-V3 raises that ceiling.
Taken in isolation, DeepSeek-V3 looks like a 15x compute multiplier. But if a lot of it is data, the multiplier won't scale (when you need much more data, it necessarily becomes worse, or alternatively you need a teacher model that's already better). In any case, this raises the ceiling for what 5 GW training systems can do (at which point there's either almost-AGI or scaling slows down a lot). And there the 15x multiplier of DeepSeek-V3 (or what remains of it after scaling) needs to be compared with the algorithmic advancements of 2025-2028, which would've included most of the things in DeepSeek-V3 anyway, so the counterfactual impact is small.
32B active parameters instead of likely ~220B for GPT4
It's 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn't an issue, why not 8B active parameters, or 1M active parameters?
32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost
Doesn't follow; training cost also scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
The training costs are maybe 5e24 FLOPs and 2e25 FLOPs, differing by 4x. DeepSeek-V3 is better than original GPT-4 though; you need to compare with GPT-4o, which almost certainly uses more compute in training than original GPT-4 (maybe 4x more, so maybe 16x more than DeepSeek-V3).
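A rough cross-check of the ~5e24 vs ~2e25 figures; the per-GPU throughput and utilization are assumptions, and the GPT-4 inputs are the rumored ones from the comments above:

```python
# Hedged cross-check of the ~5e24 vs ~2e25 FLOPs figures.
# DeepSeek-V3: from the reported 2.788M H800-hours, assuming ~2e15 FP8 FLOP/s
# peak per H800 and ~25% utilization (both assumptions).
v3_flops = 2.788e6 * 3600 * 2e15 * 0.25      # ~5e24 FLOPs
# Original GPT-4: from C = 6*N*D with rumored ~260B active params, ~13T tokens.
gpt4_flops = 6 * 260e9 * 13e12               # ~2e25 FLOPs
print(v3_flops, gpt4_flops, gpt4_flops / v3_flops)   # ratio ~4x
```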
8bits training instead of 16bits => 4x lower training cost
FLOP/s for FP8 are almost always 2x the FLOP/s for BF16, not 4x.
Multi-token training => ~2x training efficiency
You still train on every token. There is an additional "layer" in model parameters that predicts the token-after-next (Figure 3 in the paper), so there's a bit of overhead in training (not much, with 61 total layers). The results are better, but not that much better (Table 4).
training on O1 outputs
Outputs of o1 don't include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.
Imitation helps with post-training, but the compute-heavy part is pretraining, and obtaining good quality with little pretraining is a novel feat that isn't known to be explainable by good post-training, or by including a lot of outputs from good models in the pretraining/annealing mix.
This seems unlikely to be a neglected concern, unless there are specific signs that it is.
could end up being the most important thing I’ve ever written
The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.
The prior reference is a Dylan Patel tweet from Nov 2024, in the wake of R1-Lite-Preview release:
Deepseek has over 50k Hopper GPUs to be clear.
People need to stop acting like they only have that 10k A100 cluster.
They are omega cracked on ML research and infra management but they aren't doing it with that many fewer GPUs
DeepSeek explicitly states that
DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
This seems unlikely to be a lie: the reputational damage would've motivated not mentioning the amount of compute at all, but the most interesting thing about DeepSeek-V3 is precisely this claim, that its quality is possible with so little compute.
Certainly designing the architecture, the data mix, and the training process that made it possible required much more compute than the final training run, so in total it cost much more to develop than $6 million. And the 50K H100/H800 system is one way to go about that, though renting a bunch of 512-GPU instances from various clouds probably would've sufficed as well.
Found the following in the Jan 23 newsletter:
AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.
What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
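The H100-equivalence ratios implied by these figures, made explicit; the per-chip ratios below are assumptions backed only by the numbers in this comment:

```python
# H100-equivalence implied by the figures above (assumed per-chip ratios:
# Trn2 ~ 0.625x and B200 ~ 2.5x the FLOP/s of an H100).
anthropic = 400_000 * 0.625     # ~250K H100-equivalents
openai_low = 200_000 * 2.5      # ~500K H100-equivalents
openai_high = 300_000 * 2.5     # ~750K H100-equivalents
print(anthropic, openai_low, openai_high)
```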
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.
Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.
What can be done for $6 million, can be done even better with 6 million GPUs[1]. What can be done with 6 million GPUs, can't be done for $6 million. Giant training systems are the moat.
By "3rd person perspective" I mean considering the world itself, there is no actual third person needed for it. It's the same framing as used by a physicist when talking about the early stages of the universe when humans were not yet around, or when talking about a universe with alternative laws of physics, or when talking about a small system that doesn't include any humans as its part. Or when a mathematician talks about a curve on a plane.
Knowing absolutely everything is not necessary to know the relevant things, and in this case we know all the people at all times, and the states of their minds, their remembered experiences, and the possible reasoning they might perform based on those experiences. Observations take time and cognition to process, so they should always be considered from slightly in the future relative to when raw data enters a mind. Thus it's misleading to talk about a person who will experience an observation shortly and what that experience entails; the clearer situation is looking at a person who has already experienced that observation a bit in the past and can now think about it. When a copied person looks back at their memories, or a person about to be copied considers what's about to happen, the "experience" of being copied is nowhere to be found; there is only the observation of the new situation that the future copies find themselves in, and that has nothing to do with the splitting into multiple copies of the person from the past.
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1
Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.
GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million
Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it's in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
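Rough checks of the numbers in this comparison; all inputs are the estimates from the two paragraphs above, not independently sourced:

```python
# Hardware comparison: original GPT-4's 2e25 FLOPs reproduced on 2024 hardware
# in FP8, vs DeepSeek-V3's actual cluster in H100-equivalents.
gpt4_h100s_fp8 = 8_000 / 2            # ~4K H100s (FP8 assumed to double FLOP/s)
v3_h100_equiv = 2_000 * 0.75          # 2K H800s ~ 1.5K H100-equivalents
print(gpt4_h100s_fp8 / v3_h100_equiv) # ~2.7x, "about 3x, not 20x"

# Architecture comparison: compute-optimal-equivalent FLOPs.
gpt4o_optimal = 4e25                  # assumed compute optimal equivalent of GPT-4o
v3_optimal = 3e24                     # compute optimal variant of DeepSeek-V3
print(gpt4o_optimal / v3_optimal)     # ~13x, "about 15x"
```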
toy model ... f(x) = Ax and g(x) = Bx, where x is the compute invested
Returns on compute are logarithmic though, advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion or $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
That is, raw utilized compute. I'm assuming the same compute utilization for all models. ↩︎
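A tiny illustration of the logarithmic-returns point in the comment above, assuming advantage scales with the log of compute:

```python
# Decades (log10) of compute between the budgets named above.
import math
full_gap = math.log10(150e9 / 150e6)   # ~3.0 decades: $150B vs $150M
top_gap = math.log10(150e9 / 5e9)      # ~1.5 decades: $150B vs $5B
mid_gap = math.log10(5e9 / 150e6)      # ~1.5 decades: $5B vs $150M
print(full_gap, top_gap, mid_gap)      # 3.0 is roughly twice 1.5
```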
all copies ... will claim to be the original ... regardless of whether they are the original
Not if they endorse Litany of Tarski and understand the thought experiment!
Any "perceive yourself to X" phenomenon is something that happens within cognition of some abstract agent/person instance, whether they exist in some world or not. What kind of person instance is "perceiving themselves to black out" (that is, having blacked out)? Ghosts and afterlife seem more grounded than that. But for Earth/Mars question, both options are quite clear, and there is a you that perceives either of them in some of the possibilities, we can point to where those that perceive each of them are, and that is what would be correct for those instances to conclude about themselves, that they exist in the situations that contain them, known from the statement of the thought experiment.
A 3rd person perspective is there anyway, can be used regardless, even if other perspectives are also applicable. In this case it explains everything already, so we can't learn additional things in other ways.
There is a full explanation right there, in the description of the thought experiment. It describes all outcomes, including all observations and theoretical conclusions made by all the people-instances. We can look at this and ask whether those theoretical conclusions are correct, whether the theories the people-instances use to arrive at them are valid. You can tell what all the details of outcomes are in advance of actually doing this.
Personal experience of people existing in the world is mediated by the physical states of their brains (or other physical hardware). So we can in principle predict what it says by asking about the physical content of the world. There are agents/people that don't have concrete instances in the world, and we can ask what they experience. They might leave the physical world, or enter it back, getting instantiated once more or for the first time. They might persistently exist outside concrete instantiation in the world, only communicating with it through reasoning about their behavior, which might be a more resource efficient way to implement a person than a mere concrete upload. But that's a different setting, not what this post describes.
One you in the worlds with total weight of 0.001 will observe remaining on Earth, while either the exact or approximate you in the worlds with total weight of 1.000 will observe arriving on Mars. That is all that actually happens.
Then they'll start making strange proclamations about their newfound epistemic states and empirical observations from the personal observation stream relevant to theories of identity, but that's beside the point.
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
This could also work for general intelligence and not only narrow math/coding olympiad sort of problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities; there are no clear signs so far that this is happening.
But this is a constraint on how the data can be generated, not on how efficiently other models can be retrained using such data to channel the capabilities. If at some point there will be a process for generating high quality training data for general intelligence, that data might also turn out to be effective for cheaply training other models. The R1-generated data used to train the distill models is 800K samples[1], which is probably 1B-10B tokens, less than 0.1% of typical amounts of pretraining data.
This is according to the report, though they don't seem to have released this data, so distill models can't be reproduced by others in the same way they were made by DeepSeek. ↩︎
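The token arithmetic behind the "less than 0.1%" estimate above, assuming (as a guess) 1K-10K tokens per sample and ~15T tokens of typical pretraining data:

```python
# Rough token count for the 800K R1-generated samples (per-sample length assumed).
samples = 800_000
low, high = samples * 1_000, samples * 10_000    # ~0.8B to ~8B tokens
print(low / 15e12, high / 15e12)                 # ~0.005% to ~0.05% of 15T tokens
```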
it took people about 8 months to accelerate Andrej Karpathy's PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2
The baseline is weak, and the 8 months is mostly just catching up to the present. They update the architecture (giving maybe a 4x compute multiplier) and shift to a more compute optimal tokens/parameter ratio (a 1.5x multiplier). Maybe there is another 2x from the more obscure changes (which are still in the literature, so the big labs have the opportunity to measure how useful they are and select what works).
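Roughly multiplying the factors named above (all of them approximate guesses) against the reported 14x:

```python
# Rough decomposition of the speedup into the guessed factors above.
arch = 4.0         # architecture updates, maybe ~4x compute multiplier
ratio_shift = 1.5  # more compute optimal tokens/parameter ratio
misc = 2.0         # more obscure changes
print(arch * ratio_shift * misc)   # ~12x, in the ballpark of the reported 14x
```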
It's much harder to improve on GPT-4 or Llama-3 that much.
what's even more remarkable is that almost all that acceleration is due to better sample efficiency with the required training data dropping from 10 billion tokens to 0.73 billion tokens on the same training set with the fixed order of training tokens
That's just in the rules of the game: the number of model parameters isn't allowed to change, so in order to reduce training FLOPs (preserving perplexity) they reduce the amount of data. This also incidentally improves the optimality of the tokens/parameter ratio, though at 0.73B tokens it already overshoots, turning the initially overtrained 10B-token model into a slightly undertrained one.
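The tokens/param arithmetic for the fixed 124M-parameter model, against the ~20 tokens/param Chinchilla anchor:

```python
# Tokens/param before and after the speedup, for the fixed 124M-param GPT-2.
params = 124e6
print(10e9 / params)     # ~80 tokens/param at the start: overtrained vs ~20
print(0.73e9 / params)   # ~6 tokens/param at the end: now under the ~20 anchor
```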
There is a difference in external behavior only if you need to communicate knowledge about the environment and the other players explicitly. If this knowledge is already part of an agent (or rock), there is no behavior of learning it, and so no explicit dependence on its observation. Yet still there is a difference in how one should interact with such decision-making algorithms.
I think this describes minds/models better (there are things they've learned long ago in obscure ways and now just know) than learning that establishes explicit dependence of actions on observed knowledge in behavior (which is more like in-context learning).
What distinguishes a cooperate-rock from an agent that cooperates in coordination with others is the decision-making algorithm. Facts about this algorithm also govern the way outcome can be known in advance or explained in hindsight, how for a cooperate-rock it's always "cooperate", while for a coordinated agent it depends on how others reason, on their decision-making algorithms.
So in the same way that Newcomblike problems are the norm, "unfair" interaction with decision-making algorithms is the norm as well. I think it's just a very technical assumption that doesn't make sense conceptually and shouldn't be framed as "unfairness".
Training frontier models needs a lot of chips, situations where "a chip notices something" (and any self-destruct type things) are unimportant because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisan, they need to be applied to chips in bulk and those chips then need to be able to work for weeks in a datacenter without further interventions (that can't be made into part of the datacenter).
AI accelerator chips have 80B+ transistors, much more than an instance of certificate verification circuitry would need, so you can place multiple instances (and have them regularly recheck the certificates). There are EUV-pitch metal connections several layers deep within a chip; you'd need to modify many of them all over the chip without damaging the layers above, so I expect this to be completely infeasible to do for 10K+ chips on general principle (rather than from specific knowledge of how any of this works).
For clocks or counters, I guess AI accelerators normally don't have any rewritable persistent memory at all, and I don't know how hard it would be to add some in a way that makes it too complicated to keep resetting automatically.
Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can't be circumvented. I'm guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can't be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.
You don't survive for anthropic reasons. Anthropic reasons explain the situations where you happen to survive by blind luck.
for example Zvi insisting that anyone who is not using LLMs to 10x their productivity is not serious ... a vibe not a direct quote
I expect he'd disagree, for example I vaguely recall him mentioning that LLMs are not useful in a productivity-changing way for his own work. And 10x specifically seems clearly too high for most things even where LLMs are very useful, other bottlenecks will dominate before that happens.
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately correct for GPT-3, it's 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.
(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn't worth mentioning compared to everything else about it that's different from GPT-4.)
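A sketch tying these numbers together, using C ≈ 6·N·D and the cube-root under/overtraining penalty rule from the other comments here:

```python
# 30x overtraining costs ~30^(1/3) ~ 3.1x effective compute, matching the
# "10x data, 1/3 params, 3x more compute preserves perplexity" observation.
print(30 ** (1 / 3))
# GPT-3: raw compute from C = 6*N*D, and undertraining vs a ~20 tokens/param optimum.
N, D = 175e9, 300e9
print(6 * N * D)                 # ~3e23 FLOPs
undertraining = 20 / (D / N)     # ~12x, "about 10x"
print(undertraining ** (1 / 3))  # ~2.3x effective compute penalty, "about 2x"
```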
There is enough natural text data until 2026-2028, as I describe in the Peak Data section of the linked post. It's not very good data, but with 2,500x raw compute of original GPT-4 (and possibly 10,000x-25,000x in effective compute due to algorithmic improvement in pretraining), that's a lot of headroom that doesn't depend on inventing new things (such as synthetic data suitable for improving general intelligence through pretraining the way natural text data is).
Insufficient data could in principle be an issue with making good use of 5e28 FLOPs, but actually getting 5e28 FLOPs by 2028 (from a single training system) only requires funding. The decisions about this don't need to be taken based on AIs that exist today; they'll be taken based on AIs that exist in 2026-2027, trained on 1 GW training systems being built this year. With o3-like post-training, the utility and impressiveness of an LLM improves, so the chances of getting that project funded improve (compared to the absence of such techniques).
A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.
The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the outer agent's cognition don't even need to have its safety properties. This is one framing for the way people might live within a superintelligence.