Posts

Musings on Text Data Wall (Oct 2024) 2024-10-05T19:00:21.286Z
Vladimir_Nesov's Shortform 2024-10-04T14:20:52.975Z
Superintelligence Can't Solve the Problem of Deciding What You'll Do 2024-09-15T21:03:28.077Z
OpenAI o1, Llama 4, and AlphaZero of LLMs 2024-09-14T21:27:41.241Z
Musings on LLM Scale (Jul 2024) 2024-07-03T18:35:48.373Z
No Anthropic Evidence 2012-09-23T10:33:06.994Z
A Mathematical Explanation of Why Charity Donations Shouldn't Be Diversified 2012-09-20T11:03:48.603Z
Consequentialist Formal Systems 2012-05-08T20:38:47.981Z
Predictability of Decisions and the Diagonal Method 2012-03-09T23:53:28.836Z
Shifting Load to Explicit Reasoning 2011-05-07T18:00:22.319Z
Karma Bubble Fix (Greasemonkey script) 2011-05-07T13:14:29.404Z
Counterfactual Calculation and Observational Knowledge 2011-01-31T16:28:15.334Z
Note on Terminology: "Rationality", not "Rationalism" 2011-01-14T21:21:55.020Z
Unpacking the Concept of "Blackmail" 2010-12-10T00:53:18.674Z
Agents of No Moral Value: Constrained Cognition? 2010-11-21T16:41:10.603Z
Value Deathism 2010-10-30T18:20:30.796Z
Recommended Reading for Friendly AI Research 2010-10-09T13:46:24.677Z
Notion of Preference in Ambient Control 2010-10-07T21:21:34.047Z
Controlling Constant Programs 2010-09-05T13:45:47.759Z
Restraint Bias 2009-11-10T17:23:53.075Z
Circular Altruism vs. Personal Preference 2009-10-26T01:43:16.174Z
Counterfactual Mugging and Logical Uncertainty 2009-09-05T22:31:27.354Z
Bloggingheads: Yudkowsky and Aaronson talk about AI and Many-worlds 2009-08-16T16:06:18.646Z
Sense, Denotation and Semantics 2009-08-11T12:47:06.014Z
Rationality Quotes - August 2009 2009-08-06T01:58:49.178Z
Bayesian Utility: Representing Preference by Probability Measures 2009-07-27T14:28:55.021Z
Eric Drexler on Learning About Everything 2009-05-27T12:57:21.590Z
Consider Representative Data Sets 2009-05-06T01:49:21.389Z
LessWrong Boo Vote (Stochastic Downvoting) 2009-04-22T01:18:01.692Z
Counterfactual Mugging 2009-03-19T06:08:37.769Z
Tarski Statements as Rationalist Exercise 2009-03-17T19:47:16.021Z
In What Ways Have You Become Stronger? 2009-03-15T20:44:47.697Z
Storm by Tim Minchin 2009-03-15T14:48:29.060Z

Comments

Comment by Vladimir_Nesov on Akash's Shortform · 2024-11-20T23:55:25.426Z · LW · GW

Still consistent with great concern. I'm pointing out that O O's point isn't locally valid: observing concern shouldn't translate into observing belief that alignment is impossible.

Comment by Vladimir_Nesov on Akash's Shortform · 2024-11-20T18:51:18.063Z · LW · GW

A mere 5% chance that the plane will crash during your flight is consistent with finding this extremely concerning and doing everything in your power to avoid getting on it. "Alignment is impossible" is not necessary for great concern, and isn't implied by it.

Comment by Vladimir_Nesov on Q Home's Shortform · 2024-11-19T04:35:41.146Z · LW · GW

I'm talking about finding world-models in which real objects (such as "strawberries" or "chairs") can be identified.

My point is that chairs and humans can be considered in a similar way.

The most straightforward way of finding a world-model is just predicting your sensory input. But then you're not guaranteed to get a model in which something corresponding to "real objects" can be easily identified.

There's the world as a whole that generates observations, and particular objects on their own. A model that cares about individual objects needs to consider them separately from the world. The same object in a different world/situation should still make sense, so there are many possibilities for the way an object can be when placed in some context and allowed to develop. This can be useful for modularity, but also for formulating properties of particular objects in a way that doesn't get distorted by the influence of the rest of the world. Human preferences are one such property.

Comment by Vladimir_Nesov on Q Home's Shortform · 2024-11-18T18:48:34.493Z · LW · GW

Models or real objects or things capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them. Hypotheses are another toy example.

One of the features of models/things seems to be how they capture the many possibilities of a system simultaneously, rather than isolated particular possibilities. So what I gestured at was that when considering models of humans, the real objects or models behind a human capture the many possibilities of the way that human could be, rather than only how they actually are. And this seems useful for figuring out their preferences.

Path-dependence is the way outcomes depend on the path that was taken to reach them. A path-independent outcome is convergent: it's always the same destination regardless of the path that was taken. Human preferences seem to be path-dependent on human timescales; growing up in Egypt may lead to a persistently different mindset than the same human would develop growing up in Canada.

Comment by Vladimir_Nesov on O O's Shortform · 2024-11-17T19:39:16.862Z · LW · GW

for anything related to human judgement, in theory this isn’t why it’s not doing well

The facts are in there, but not in the form of a sufficiently good reward model that can tell as well as human experts which answer is better or whether a step of an argument is valid. In the same way, RLHF still works better with humans on some queries; it hasn't been fully automated to superior results by replacing humans with models in all cases.

Comment by Vladimir_Nesov on Q Home's Shortform · 2024-11-17T15:59:16.338Z · LW · GW

Creating an inhumanly good model of a human is related to formulating their preferences. A model captures many possibilities and the way many hypothetical things are simulated in the training data. Thus it's a step towards eliminating path-dependence of particular life stories (and the preferences they motivate), by considering these possibilities altogether. Even if some of the possible life stories interact with distortionary influences, others remain untouched, and so must continue deciding their own path, for there are no external influences there and they are the final authority for what counts as aiding them anyway.

Comment by Vladimir_Nesov on Alexander Gietelink Oldenziel's Shortform · 2024-11-17T15:45:53.998Z · LW · GW

Creativity is RL, converting work into closing the generation-discrimination gap wherever it's found (or laboriously created by developing good taste). The resulting generations can be novelty-worthy; imitating them makes it easier to close the gap, reducing the need for creativity.

Comment by Vladimir_Nesov on O O's Shortform · 2024-11-17T15:28:54.922Z · LW · GW

A reasoning model depends on starting from a sufficiently capable base model that captures the relevant considerations. Solving AIME is like winning at chess, except the rules of chess are trivial and the rules of AIME are much harder. But the rules of AIME are still not that hard; it's using them to win that is hard.

In the real world, the rules get much harder than that, so it's unclear how far o1 can go if the base model doesn't get sufficiently better (at knowing the rules), and it's unclear how much better it needs to get. Plausibly it needs to get so good that o1-like post-training won't be needed for it to pursue long chains of reasoning on its own, as an emergent capability. (This includes the possibility that RL is still necessary in some other way, as an engine of optimization to get better at the rules of the real world, that is, to get better reward models.)

Comment by Vladimir_Nesov on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-16T13:02:31.930Z · LW · GW

Having preferences is very different from knowing them. There's always a process of reflection that refines preferences, so any current guess is always wrong at least in detail. For a decision theory to have a shot at normativity, it needs to be able to adapt to corrections and ideally anticipate their inevitability (not locking in the older guess and preventing further reflection; instead facilitating further reflection and being corrigible).

Orthogonality asks that the domain of applicability be wide enough that both various initial guesses and longer-term refinements to them won't fall out of scope. When a theory makes assumptions about value content, that makes it a moral theory rather than a decision theory. A moral theory explores particular guesses about preferences of some nature.

So in the way you use the term, quantum immortality seems to be a moral theory, involving claims that quantum suicide can be a good idea. For example "use QI to earn money" is a recommendation that depends on this assumption about preferences (of at least some people in some situations).

Comment by Vladimir_Nesov on johnswentworth's Shortform · 2024-11-15T23:47:04.299Z · LW · GW

Use of repeated data was first demonstrated in the 2022 Galactica paper (Figure 6 and Section 5.1), at 2e23 FLOPs but without a scaling law analysis that compares with unique data or checks what happens for different numbers of repeats that add up to the same number of tokens-with-repetition. The May 2023 paper does systematic experiments with up to 1e22 FLOPs datapoints (Figure 4).

So that's what I called "tiny experiments". When I say that it wasn't demonstrated at scale, I mean 1e25+ FLOPs, a scale essentially absent from the research literature[1]. Anchoring to this kind of scale (and being properly suspicious of results several orders of magnitude lower) is relevant because we are discussing the fate of 4e27 FLOPs runs.


  1. The largest datapoints in measuring the Chinchilla scaling laws for Llama 3 are 1e22 FLOPs. This is then courageously used to choose the optimal model size for the 4e25 FLOPs run that uses 4,000 times more compute than the largest of the experiments. ↩︎

Comment by Vladimir_Nesov on johnswentworth's Shortform · 2024-11-15T22:24:42.909Z · LW · GW

Nobody has admitted to trying repeated data at scale yet (so we don't know that it doesn't work), though the tiny experiments suggest it can 5x the data with little penalty and 15x the data in a still-useful way. It's not yet relevant for large models, but it might turn out that small models would greatly benefit already.

There are 15-20T tokens in datasets whose size is disclosed for current models (Llama 3, Qwen 2.5); plausibly 50T tokens of tolerable quality can be found (pretraining only needs to create useful features, not relevant behaviors). With 5 repetitions of 50T tokens, even at 80 tokens/parameter[1] we can make good use of 5e27-7e27 FLOPs[2], which even a 1 gigawatt 500K B200s system of early 2026 would need 4-6 months to provide.

The isoFLOP plots (varying tokens per parameter at fixed compute) seem to get loss/perplexity basins that are quite wide once compute reaches about 1e20 FLOPs. The basins also get wider for hybrid attention (compare the 100% Attention isoFLOPs in the "Perplexity scaling analysis" figure to the others). So it's likely that using a slightly suboptimal tokens/parameter ratio of say 40 won't hurt performance much at all, in which case we get to use 9e27-2e28 FLOPs by training a larger model on the same 5x 50T tokens dataset. The data wall for text data is unlikely to be a 2024-2026 issue.
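As a sanity check, here's a minimal sketch of the arithmetic above (the 6ND/9.6ND training-compute estimates are from the footnotes below; the ~2.2e15 dense BF16 FLOP/s per B200 and 40% utilization figures are assumptions):

```python
# Back-of-the-envelope check of the data-wall arithmetic above.
# Assumptions: 5 repetitions of a 50T-token dataset, 80 tokens/parameter,
# training compute of 6*N*D (dense) to 9.6*N*D (MoE, per the Tencent estimate),
# and a 500K B200 system at ~2.2e15 dense BF16 FLOP/s per GPU, 40% utilization.

tokens = 5 * 50e12          # 5 repetitions of 50T tokens
params = tokens / 80        # 80 tokens per parameter -> ~3.1e12 parameters

flops_dense = 6 * params * tokens    # ~4.7e27 FLOPs
flops_moe   = 9.6 * params * tokens  # ~7.5e27 FLOPs

useful_flops_per_s = 500_000 * 2.2e15 * 0.4   # 500K B200s at 40% utilization
months_dense = flops_dense / useful_flops_per_s / (30 * 24 * 3600)
months_moe   = flops_moe   / useful_flops_per_s / (30 * 24 * 3600)

print(f"{flops_dense:.1e} to {flops_moe:.1e} FLOPs")
print(f"{months_dense:.1f} to {months_moe:.1f} months on the 500K B200s system")
```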


  1. Conservatively asking for much more data than Chinchilla's 20 tokens per parameter, in light of the range of results in more recent experiments and adding some penalty for repetition of data. For example, Llama 3 had 40 tokens per parameter estimated as optimal for 4e25 FLOPs from isoFLOPs for smaller runs (up to 1e22 FLOPs, Figure 2), and linear extrapolation in log-coordinates (Figure 3) predicts that this value slowly increases with compute. But other experiments have it decreasing with compute, so this is unclear. ↩︎

  2. The usual estimate for training compute of a dense transformer is 6ND, but a recent Tencent paper estimates 9.6ND for their MoE model (Section 2.3.1). ↩︎

Comment by Vladimir_Nesov on johnswentworth's Shortform · 2024-11-15T19:47:00.085Z · LW · GW

Original GPT-4 is rumored to be a 2e25 FLOPs model. With the 20K H100s clusters that have been around for more than a year, 4 months at 40% utilization gives 8e25 BF16 FLOPs. Llama 3 405B is 4e25 FLOPs. The 100K H100s clusters that are only starting to come online in the last few months give 4e26 FLOPs when training for 4 months, and the 1 gigawatt 500K B200s training systems that are currently being built will give 4e27 FLOPs in 4 months.

So lack of scaling-related improvement in deployed models since GPT-4 is likely the result of only seeing the 2e25-8e25 FLOPs range of scale so far. The rumors about the new models being underwhelming are less concrete, and they are about the very first experiments in the 2e26-4e26 FLOPs range. Only by early 2025 will there be multiple 2e26+ FLOPs models from different developers to play with, the first results of the experiment in scaling considerably past GPT-4.

And in 2026, once the 300K-500K B200s clusters train some models, we'll be observing the outcomes of scaling to 2e27-6e27 FLOPs. Only by late 2026 will there be a significant chance of reaching a scaling plateau that lasts for years, since scaling further would need $100 billion training systems that won't get built without sufficient success, with AI accelerators improving much slower than the current rate of funding-fueled scaling.
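The underlying arithmetic, as a small sketch (the ~1e15 dense BF16 FLOP/s per H100, ~2.2e15 per B200, and 40% utilization figures are assumptions):

```python
# Training compute from cluster size, duration, per-GPU throughput and utilization.
# Per-GPU dense BF16 throughputs (~1e15 FLOP/s for H100, ~2.2e15 for B200) and
# 40% utilization are assumptions used throughout this estimate.

def training_flops(gpus, months, flops_per_gpu, utilization=0.4):
    seconds = months * 30 * 24 * 3600
    return gpus * seconds * flops_per_gpu * utilization

H100, B200 = 1e15, 2.2e15

print(f"{training_flops(20_000, 4, H100):.1e}")   # ~8e25, the 20K H100s clusters
print(f"{training_flops(100_000, 4, H100):.1e}")  # ~4e26, the new 100K H100s clusters
print(f"{training_flops(500_000, 4, B200):.1e}")  # ~4e27, a 1 GW 500K B200s system
```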

Comment by Vladimir_Nesov on AI Craftsmanship · 2024-11-13T22:27:13.339Z · LW · GW

Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [male] vector, subtract the [female] vector, add the [king] vector, and get out something close to the [queen] vector?

Incidentally, there's a recent paper that investigates how this works in SAEs on transformers:

we search for what we term crystal structure in the point cloud of SAE features ... initial search for SAE crystals found mostly noise ... consistent with multiple papers pointing out that (man,woman,king,queen) is not an accurate parallelogram

We found the reason to be the presence of what we term distractor features. ... To eliminate such semantically irrelevant distractor vectors, we wish to project the data onto a lower-dimensional subspace orthogonal to them. ... Figure 1 illustrates that this dramatically improves the cluster and trapezoid/parallelogram quality, highlighting that distractor features can hide existing crystals.

Comment by Vladimir_Nesov on Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims · 2024-11-13T21:09:26.269Z · LW · GW

turns out that Ilya Sutskever was misinterpreted

That's not exactly my claim. If he said more to the reporters than his words quoted in the article[1], then it might've been justified to interpret him as saying that pretraining is plateauing. The article isn't clear on whether he said more. If he said nothing more, then the interpretation about plateauing doesn't follow, but could in principle still be correct.

Another point is that Sutskever left OpenAI before they trained the first 100K H100s model, and in any case one datapoint of a single training run isn't much evidence. The experiment that could convincingly demonstrate plateauing hasn't been performed yet. Give it at least a few months, for multiple labs to try and fail.


  1. “The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.”

    ↩︎
Comment by Vladimir_Nesov on Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims · 2024-11-13T18:55:58.245Z · LW · GW

There is a newer post-o1 Noam Brown talk from Sep 2024 that covers similar ground.

Comment by Vladimir_Nesov on Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims · 2024-11-13T18:47:23.575Z · LW · GW

Scaling progress is constrained by the physical training systems[1]. The scale of the training systems is constrained by funding. Funding is constrained by the scale of the tech giants and by how impressive current AI is. The largest companies backing AGI labs are spending on the order of $50 billion a year on capex (building infrastructure around the world). The 100K H100s clusters that at least OpenAI, xAI, and Meta recently got access to cost about $5 billion. The next generation of training systems is currently being built, will cost $25-$40 billion each (at about 1 gigawatt), and will become available in late 2025 or early 2026.

Without a shocking level of success, for the next 2-3 years the scale of the training compute that the leading AGI labs have available to them is out of their hands: it's the systems they already have or the systems already being built. They need to make optimal use of this compute in order to secure funding for the generation of training systems that comes after and will cost $100-$150 billion each (at about 5 gigawatts). The decisions about these systems will be made in the next 1-2 years, so that they might get built in 2026-2027.

Thus, paradoxically, there is no urgency for the AGI labs to use all of their compute to improve their products in the next few months. What they need instead is to maximize how their technology looks in a year or two, which motivates more research use of compute now rather than immediately going for the most scale that current training systems enable. One exception might be xAI, which still needs to raise money for the $25-$40 billion training system. And of course even newer companies like SSI, but they don't even have the $5 billion training systems to demonstrate their current capabilities, unless they do something sufficiently different.


  1. Training systems are currently clusters located on a single datacenter campus. But this might change soon, possibly even in 2025-2026, which would let the power needs at each campus remain manageable. ↩︎

Comment by Vladimir_Nesov on Bogdan Ionut Cirstea's Shortform · 2024-11-13T14:02:31.176Z · LW · GW

I think the journalists might have misinterpreted Sutskever, if the quote provided in the article is the basis for the claim about plateauing:

Ilya Sutskever ... told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.
“The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.”

What he's likely saying is that there are new algorithmic candidates for making even better use of scaling. It's not that scaling LLM pre-training has plateaued, but rather that other things became available that might be even better targets for scaling. Focusing on these alternatives could be more impactful than scaling LLM pre-training further.

He's also currently motivated to air such implications, since his SSI only has $1 billion, which might buy a 25K H100s cluster, while OpenAI, xAI, and Meta recently got 100K H100s clusters (Google and Anthropic likely have that scale of compute as well, or will imminently).

Comment by Vladimir_Nesov on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-12T23:37:17.153Z · LW · GW

A decision theory needs to have orthogonality; otherwise it's not going to be applicable. Decisions about the content of values are always wrong; the only prudent choice is to defer them.

Comment by Vladimir_Nesov on papetoast's Shortforms · 2024-11-11T15:23:57.661Z · LW · GW

I just don't think it applies to example 1 because there exists alternatives that can keep the sunk resources.

What matters is whether those alternatives are better (and can be executed on, rather than being counterfactual). It doesn't matter why they are better. Being better because they made use of the sunk resources (and might've become cheaper as a result) is no different from being better for other reasons. The sunk cost fallacy is giving additional weight to the alternatives that specifically use sunk resources, instead of simply choosing based on which alternatives are now better.

Comment by Vladimir_Nesov on papetoast's Shortforms · 2024-11-11T14:58:46.748Z · LW · GW

spending more resources beforehand really increases the chance of "success" most of the time

The decision to go on with the now-easier rest-of-the-plan can be correct; it's not the case that all plans must always be abandoned on the grounds of "sunk cost fallacy". The fallacy is when the prior spending didn't actually secure the rest of the current plan as the best course of action going forward. Alternatives can emerge that are better than continuing and don't make any use of the sunk resources.

Comment by Vladimir_Nesov on papetoast's Shortforms · 2024-11-11T14:03:48.777Z · LW · GW

Now Bob cannot just spend $300 to get a quality headphone. He would also waste Tim's $100

That's a form of sunk cost fallacy, a collective "we've sacrificed too much to stop now".

Andy and Bob never touching it again because they have other books to work on

That doesn't follow; the other books would've also been there without this book's poor translation existing. If the poor translation eats some market share, so that competing with it is less appealing, that could be a valid reason.

Comment by Vladimir_Nesov on AI #89: Trump Card · 2024-11-10T23:17:36.662Z · LW · GW

Yes, my mistake, thank you. Should be 2ND or something when not computing gradients. I'll track down the details shortly.

Comment by Vladimir_Nesov on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-10T12:52:16.576Z · LW · GW

Death/survival/selection have the might-makes-right issue of maintaining the normativity/actuality distinction. I think a major use of the weak orthogonality thesis is in rescuing these framings. That is, for most aims, there is a way of formulating their pursuit as "maximally ruthless" without compromising any nuance of the aims/values/preferences, including any aspects of respect for autonomy or kindness within them. But that's only the strange framing adding up to normality, useful where you need that framing for technical reasons.

Making decisions in a way that ignores declining measure of influence on the world due to death in most eventualities doesn't add up to normality. It's a bit like saying that you can be represented by a natural number, and so don't need to pay attention to reality at all, since all natural numbers are out there somewhere, including those representing you. I don't see a way of rescuing this kind of line of argument.

Comment by Vladimir_Nesov on LLMs Look Increasingly Like General Reasoners · 2024-11-10T04:12:32.175Z · LW · GW

Cost of inference is determined by the shape of the model, things like the number of active parameters, which screens off the compute used in training (the compute could be anything; cost of inference doesn't depend on it as long as the model shape doesn't change).

So compare specific prices with those of models of known size[1]. GPT-4o costs $2.5 per million input tokens, while Llama-3-405B costs $3.5 per million input tokens. That is, GPT-4o could be a 200-300B model (in active parameters). Original GPT-4 is rumored to have about 270B active parameters (at 1.8T total parameters). It's OpenAI serving its own model, not an API provider for an open weights model, so in principle the price could be misleading (below cost), but what data we have points to it being about the same size, maybe 2x smaller if there's still margin in the price.

Edit: There's a mistake in the estimate; I got confused between training and inference. Correcting the mistake points to even larger models, though comparing to Llama-3-405B suggests that there is another factor counterbalancing the correction, probably practical issues with getting sufficient batch sizes, so the original conclusion should still be about right.


  1. I just did this exercise for Claude 3.5 Haiku, more details there. ↩︎

Comment by Vladimir_Nesov on AI #89: Trump Card · 2024-11-10T03:55:41.948Z · LW · GW

Sully likes Claude Haiku 3.5 but notes that it’s in a weird spot after the price increase - it costs a lot more than other small models

The price is $1 per million input tokens (which are compute bound, so easier to evaluate than output tokens), while Llama-3-405B costs $3.5. At $2 per H100-hour we buy 3600 seconds of 1e15 FLOP/s at, say, 40% utilization, which is $1.4e-18 per useful FLOP. So $1 buys 7e17 useful FLOPs, or inference with 75-120B[1] active parameters for 1 million tokens. That's with zero margin and perfect batch size, so the real model should be smaller.

Edit: 6ND is wrong here; it counts the computation of gradients, which isn't done during inference. The corrected estimate would suggest that the model could be even larger, but anchoring to open weights API providers says otherwise and still points to about 100B.
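A sketch of the estimate, with the corrected 2N per-token inference compute shown alongside the 6N used originally (the $2 per H100-hour, 1e15 FLOP/s, and 40% utilization figures are assumptions, as are zero margin and perfect batching):

```python
# Back out active parameter count from API price, assuming the provider's cost
# is dominated by GPU time. Assumed figures: $2 per H100-hour, ~1e15 dense BF16
# FLOP/s per H100, 40% utilization, zero margin and perfect batching.

usd_per_gpu_hour = 2.0
useful_flops_per_hour = 3600 * 1e15 * 0.4                  # 1.44e18 useful FLOPs per $2
usd_per_flop = usd_per_gpu_hour / useful_flops_per_hour    # ~1.4e-18 $/FLOP

price_per_token = 1.0 / 1e6                        # $1 per million input tokens
flops_per_token = price_per_token / usd_per_flop   # ~7e11 FLOPs per token

# A forward pass is roughly 2*N FLOPs per token for N active parameters;
# 6*N additionally counts gradients, which applies to training, not inference.
print(f"2N per token -> N ~ {flops_per_token / 2:.1e}")  # ~360B if cost were pure compute
print(f"6N per token -> N ~ {flops_per_token / 6:.1e}")  # ~120B, the original estimate
```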


  1. The usual estimate of compute for a dense transformer is 6ND (N is the number of active parameters, D the number of tokens); a recent Tencent paper estimates about 9.6ND for their MoE model (see Section 2.3.1). I get 420B with the same calculation for $3.5 of Llama-3-405B (using 6ND, since it's dense), so that checks out. ↩︎

Comment by Vladimir_Nesov on LLMs Look Increasingly Like General Reasoners · 2024-11-10T01:52:22.223Z · LW · GW

I think GPT-4o is a distilled version of GPT-4

Original GPT-4 is rumored to have been a 2e25 FLOPs model (trained on A100s). Then there was GPT-4T, which might've been smaller, and now GPT-4o. In early 2024, 1e26 FLOPs doesn't seem out of the question, so GPT-4o was potentially trained on 5x the compute of original GPT-4.

There is a technical sense of knowledge distillation[1] where in training you target the logits of a smarter model rather than raw tokens. It's been used for training Gemma 2 and Llama 3.2. It's unclear if knowledge distillation is useful for training similarly capable models, let alone more capable ones, and GPT-4o seems in most ways more capable than original GPT-4.
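For concreteness, a minimal sketch of what logit-targeting distillation looks like in training code (a generic PyTorch sketch, not the specific recipe used for Gemma 2 or Llama 3.2):

```python
# Minimal knowledge-distillation loss: the student is trained to match the
# teacher's next-token distribution (soft targets) instead of only the raw tokens.
# Generic sketch, not the specific recipe used by any particular lab.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=1.0, alpha=0.5):
    # student_logits, teacher_logits: [batch, seq, vocab]; targets: [batch, seq]
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    return alpha * soft + (1 - alpha) * hard
```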


  1. See this recent paper for example. ↩︎

Comment by Vladimir_Nesov on Cole Wyeth's Shortform · 2024-11-09T19:38:17.041Z · LW · GW

I'm in Canada so can't access the latest Claude

Use Chatbot Arena; both versions of Claude 3.5 Sonnet are accessible in Direct Chat (third tab). There's even o1-preview in Battle Mode (first tab); you just need to keep asking the question until you get o1-preview. In general, Battle Mode (for a fixed question you keep asking over multiple rounds) is a great tool for developing intuition about model capabilities, since it also hides the model name from you while you are evaluating the response.

Comment by Vladimir_Nesov on Cole Wyeth's Shortform · 2024-11-09T19:27:22.522Z · LW · GW

Base model scale has only increased maybe 3-5x in the last 2 years, from 2e25 FLOPs (original GPT-4) up to maybe 1e26 FLOPs[1]. So I think to a significant extent the experiment of further scaling hasn't been run, and the 100K H100s clusters that have just started training new models in the last few months promise another 3-5x increase in scale, to 2e26-6e26 FLOPs.

possibly have already plateaued a year or so ago

Right, the metrics don't quite capture how smart a model is, and the models haven't been getting much smarter for a while now. But it might be simply because they weren't scaled much further (compared to original GPT-4) in all this time. We'll see in the next few months as the labs deploy the models trained on 100K H100s (and whatever systems Google has).


  1. This is 3 months on 30K H100s, $140 million at $2 per H100-hour, which is plausible, but not rumored about specific models. Llama-3-405B is 4e25 FLOPs, but not MoE. It could well be that 6e25 FLOPs is the most compute anyone has used for a model deployed so far. ↩︎

Comment by Vladimir_Nesov on LLMs Look Increasingly Like General Reasoners · 2024-11-09T17:03:27.760Z · LW · GW

Performance after post-training degrades if behavior gets too far from that of the base/SFT model (see Figure 1). Solving this issue would be an entirely different advancement from what o1-like post-training appears to do. So I expect that the model remains approximately as smart as the base model and the corresponding chatbot; it's just better at packaging its intelligence into relevant long reasoning traces.

Comment by Vladimir_Nesov on LLMs Look Increasingly Like General Reasoners · 2024-11-09T12:43:42.273Z · LW · GW

I now think it doesn't work easily, because the training on written language doesn't have enough examples of people explicitly stating their cognitive steps in applying System 2 reasoning.

The cognitive steps are still part of the hidden structure that generated the data. That GPT-4 level models are unable to capture them is not necessarily evidence that it's very hard. They've only just breached the reading comprehension threshold and started to reliably understand most nuanced meaning given directly in the text.

Only in the second half of 2024 is there enough compute to start experimenting with scale significantly beyond GPT-4 level (with possible recent results still hidden within frontier labs). Before that there was no opportunity to see if something else starts appearing just past GPT-4 scale, so the absence of such evidence isn't yet evidence of absence, i.e. evidence that additional currently-absent capabilities aren't within easy reach. It's been 2 years at about the same scale of base models, but that isn't evidence that additional scale stops helping in crucial ways, as no experiments with significant additional scale have been performed in those 2 years.

Comment by Vladimir_Nesov on LLMs Look Increasingly Like General Reasoners · 2024-11-09T12:20:43.789Z · LW · GW

Most of my additional credence is on something like 'the full o1 turns out to already be close to the grand prize mark'

Keep in mind that o1 is still probably a derivative of GPT-4o's or GPT-4T's base model, which was probably trained on at most 1e26 FLOPs[1]. Meanwhile, the new 100K H100s cluster can train 4e26+ FLOPs models, and the next OpenAI model at this scale will probably be ready early next year. The move from o1-preview to full o1 is not obviously as significant as what happens when you also upgrade the base model. If some Orion rumors are correct, it might additionally improve beyond what scale alone provides, by using o1-generated synthetic data in pretraining.


  1. WSD and Power learning rate schedules might enable effective continued pretraining, and it's possible to continue training on repeated data, so fixed-compute base model scale is not obviously the correct assumption. That is, even though GPT-4o was released in May 2024, that doesn't necessarily mean that its base model didn't get stronger since then, or that the stronger performance is entirely a result of additional post-training. And 1e26 FLOPs is about 3 months on 30K H100s, which could be counted as a $140 million training run at $2 per H100-hour (not contradicting the claim that $100 million training runs were still the scale of models deployed by Jun 2024). ↩︎

Comment by Vladimir_Nesov on Phib's Shortform · 2024-11-09T09:27:12.749Z · LW · GW

I (of course?) buy that emissions don't matter in short term

Emissions don't matter in the long term; ASI can reshape the climate (if Earth is not disassembled outright). They might matter before ASI, especially if there is an AI Pause. Which I think is still a non-negligible possibility if there is a recoverable scare at some point; probably not otherwise. It might be enforceable by international treaty through hobbling semiconductor manufacturing, if AI of that time still needs significant compute to adapt and advance.

Comment by Vladimir_Nesov on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-08T00:36:58.484Z · LW · GW

who seems to think that first-person perspective is illusion and only third-person perspective is real

The taste of cheese is quite real; it's just not a technical consideration relevant for chip design. Concepts worth noticing are usually meaningful in some way, but most of them are unclear and don't offer a technical foothold in any given endeavor.

Comment by Vladimir_Nesov on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-07T20:47:19.168Z · LW · GW

A person is a complicated machine; we can observe how this machine develops or could develop through processes that we could set up in the world or hypothetically. This is already quite clear, and things like "first person perspective" or "I will observe" don't make this clearer.

So I don't see a decision theory proclaiming "QI is false!"; it's just not a consideration it needs to deal with at any point, even if somehow there were a way of saying more clearly what that consideration means. Just as a chip designer doesn't need to appreciate the taste of good cheese to make better AI accelerators.

Comment by Vladimir_Nesov on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-07T20:11:55.700Z · LW · GW

If quantum immortality is true...

To discuss the truth of a claim, it's first crucial to clarify what it means. What does it mean for quantum immortality to be true or not? The only relevant thing that comes to mind is whether MWI is correct. Large quantum computers might give evidence for that claim (though ASI very likely will be here first, unless there is a very robust AI Pause).

Once we know there are physical branching worlds, there is no further fact of "quantum immortality" to figure out. There are various instances of yourself in various world branches, a situation that doesn't seem that different from multiple instances that can occur within a single world. Decision theory then ought to say how to weigh the consequences of possible influences and behaviors spread across those instances.

Comment by Vladimir_Nesov on In the Name of All That Needs Saving · 2024-11-07T17:28:47.575Z · LW · GW

and makes it satisfy the desires of all the things which ought be happy

Some things don't endorse their own happiness as an important consideration.

Comment by Vladimir_Nesov on Winning isn't enough · 2024-11-06T06:28:37.697Z · LW · GW

A choice can influence the reality of the situation where it could be taken. Thus a "dominated strategy" can be winning when choosing the "better possibilities" prevents the situation where you would be considering the decision from occurring. Problem statements in classical forms (such as payoff matrices of games) prohibit such considerations. In Newcomb's problem, where "winning" is a good way of looking at what's wrong with two-boxing, the issue is that the game theory way of framing possible outcomes doesn't recognize that some of the outcomes refute the situation where the outcomes are being chosen. This is clearer in examples like Transparent Newcomb. Overall behavior of an algorithm influences whether it's given the opportunity to run in the first place.

So the relevance of "winning" isn't so much about balancing the many senses of winning across the many possibilities where some winning occurs or doesn't (expected utility vs. other framings). It's more about paying attention to which possibilities are real, and whether winning in the more central senses occurs on those possibilities or not.

Comment by Vladimir_Nesov on The Compendium, A full argument about extinction risk from AGI · 2024-11-01T00:03:09.503Z · LW · GW

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for the 1.5B replication of GPT-2 rather than for the 124M? If so, this phrasing is somewhat misleading. We only need $250 even for the 1.5B version, but still.

Comment by Vladimir_Nesov on The Compendium, A full argument about extinction risk from AGI · 2024-10-31T23:13:57.484Z · LW · GW

From chapter The state of AI today:

Later this year, the first 100,000 GPU cluster will go online

It's not the first; there's the xAI cluster from September, and likely a Microsoft cluster from May.

Even the cited The Information article says about the Meta cluster in question that

The previously unreported cluster, which could be fully completed by October or November, comes as two other companies have touted their own.

Comment by Vladimir_Nesov on The Compendium, A full argument about extinction risk from AGI · 2024-10-31T22:13:58.449Z · LW · GW

From chapter The state of AI today:

The most likely and proximal blocker is power consumption (data-centers training modern AIs use enormous amounts of electricity, up to the equivalent of the yearly consumption of 1000 average US households) and ...

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100s cluster is equivalent to that of 121,000 average US households, not 1,000 average US households. If we take a cluster of 16K H100s that trained Llama-3-405B, that's still 24 megawatts and equivalent to 19,000 average US households.

So you likely mean the amount of energy (as opposed to power) consumed in training a model ("yearly consumption of 1000 average US households"). A cluster's total power works out to about 1,500 watts per H100, and each GPU at 40% compute utilization produces 0.4e15 FLOP/s of useful dense BF16 compute. Thus about 3.75e-12 joules are expended per FLOP that goes into training a model. For the 4e25 FLOPs of Llama-3-405B, that's 1.5e14 joules, or 41e6 kilowatt-hours, which is what 3,800 average US households consume in a year[1].

This interpretation fits the numbers better, but it's a bit confusing, since the model is trained for much less than a year, while the clusters will go on consuming their energy all year long. And the power constraints that are a plausible proximal blocker of scaling are about power, not energy.
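The arithmetic above, as a small sketch (the 1,500 watts per GPU and 40% utilization figures are the assumptions stated in this comment):

```python
# Reproduces the power/energy arithmetic above.
# Assumptions: ~1,500 W of total datacenter power per H100, 40% compute
# utilization of ~1e15 dense BF16 FLOP/s, 10,800 kWh/year per US household.

household_kwh_per_year = 10_800
household_watts = household_kwh_per_year * 1000 / (365 * 24)   # ~1.23 kW average

print(f"{150e6 / household_watts:,.0f} households")   # 100K H100s cluster, ~121,000
print(f"{24e6 / household_watts:,.0f} households")    # 16K H100s cluster, ~19,000

joules_per_flop = 1500 / (0.4 * 1e15)                 # ~3.75e-12 J per useful FLOP
llama3_joules = 4e25 * joules_per_flop                # ~1.5e14 J for Llama-3-405B
llama3_kwh = llama3_joules / 3.6e6                    # ~41 million kWh
print(f"{llama3_kwh / household_kwh_per_year:,.0f} household-years")  # ~3,800
```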


  1. If we instead take 2e25 FLOPs attributed to original GPT-4, and 700 watts of a single H100, while ignoring the surrounding machinery of a datacenter (even though you are talking about what a datacenter consumes in this quote, so this is an incorrect way of estimating energy consumption), and train on H100s (instead of A100s used for original GPT-4), then this gives 9.7e6 kilowatt-hours, or the yearly consumption of 900 average US households. With A100s, we instead have 400 watts and 0.3e15 FLOP/s (becoming 0.12e15 FLOP/s at 40% utilization), which gets us 18.5e6 kilowatt-hours for a 2e25 FLOPs model, or yearly consumption of 1,700 average US households (again, ignoring the rest of the datacenter, which is not the correct thing to do). ↩︎

Comment by Vladimir_Nesov on The Alignment Trap: AI Safety as Path to Power · 2024-10-31T16:50:23.895Z · LW · GW

A posthuman king is not centrally a king (not mortal, very different incentives), and "an AI" is a very vague bag-of-everything that might include things like simulated worlds or bureaucracies with checks and balances as special cases. The reason His Majesty's Democratic Government doesn't really work while the king retains ultimate authority is that the next king can be incompetent or malevolent, or the government's activities start threatening the king's position, so that the king is motivated to restrict them. So even "giving the keys to the universe back" is not necessarily that important in the case of a posthuman god-king, but it remains a possibility after the acute risk period passes and it's more clear how to make the next thing work.

Comment by Vladimir_Nesov on A path to human autonomy · 2024-10-30T05:05:42.559Z · LW · GW

I do think that these things are relevant to 'compute it takes to get to a given capability level'.

In practice, there are no 2e23 FLOPs models that cost $300K to train and are anywhere close to being as smart as Llama-3-405B. If the leading labs had such models (based on unpublished experimental results and more algorithmic insights), their models would be much smarter than Llama-3-405B when trained with the 8e25 FLOPs they have to give, rather than the reference 2e23 FLOPs. Better choice of ways of answering questions doesn't get us far in actual technical capabilities.

(Post-training like o1's is a kind of "better choice of ways of answering questions" that might help, but we don't know how much compute it saves. Noam Brown gestures at 100,000x from his earlier work, but we haven't seen Llama 4 yet; it might just spontaneously become capable of coherent long reasoning traces as a result of more scale, the bitter lesson making the Strawberry Team's efforts moot.)

Many improvements observed at smaller scale disappear at greater scale, or don't stack with each other. Many papers have horrible methodologies, plausibly born of scarcity of research compute, that don't even try (or make it possible) to estimate the compute multiplier. Most of them will eventually be forgotten, for good reason. So most papers that seem to demonstrate improvements are not strong evidence for the hypothesis of a 1000x cumulative compute efficiency improvement, while this hypothesis predicts observations about what's already possible in practice that we are not in fact getting, which is strong evidence against it. There are multiple competent teams that don't have Microsoft compute, and they don't win over Llama-3-405B, which we know doesn't have all of these speculative algorithmic improvements and uses 4e25 FLOPs (2.5 months on 16K H100s, rather than 1.5 months on 128 H100s for 2e23 FLOPs).

In other words, the importance of Llama-3-405B for the question about speculative algorithmic improvements is that the detailed report shows it has no secret sauce; it merely competently uses about as much compute as the leading labs, in very conservative ways. And yet it's close in capabilities to all the other frontier models. This means the leading labs don't have a significantly effective secret sauce either, which means nobody does, since the leading labs would've already borrowed it if it were that effective.

There's clearly a case in principle for it being possible to learn with much less data, anchoring to humans blind from birth. But there's probably much more compute happening in a human brain per the proverbial external data token. And a human has the advantage of not learning everything about everything, with greater density of capability over encyclopedic knowledge, which should help save on compute.

Comment by Vladimir_Nesov on A path to human autonomy · 2024-10-30T03:20:27.900Z · LW · GW

I'm talking about the compute multiplier as a measure of algorithmic improvement: how much less compute it takes to get to the same place. Half of these things are not relevant to it. Maybe another datapoint: Mosaic's failure with DBRX, when their entire thing was hoarding compute multipliers.

Consider Llama-3-405B, a 4e25 FLOPs model that is just Transformer++ from the Mamba paper I referenced above, not even MoE. A compute multiplier of 1000x over the original transformer would be a 200x multiplier over this Llama, meaning matching its performance with 2e23 FLOPs (1.5 months of training on 128 H100s). Yi-Lightning is exceptional for its low 2e24 FLOPs of compute (10x more than our target), but it feels like a lot of that is better post-training; subjectively it doesn't appear quite as smart, so it would probably lose the perplexity competition.
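A sketch of that arithmetic (the ~5x multiplier of Transformer++ over the original transformer is my rough reading of the Mamba paper; the 1e15 FLOP/s per H100 and 40% utilization figures are assumptions):

```python
# Compute-multiplier arithmetic: what a 1000x cumulative improvement over the
# original transformer would imply for matching Llama-3-405B.
# Assumptions: ~5x of that multiplier is already captured by the Transformer++
# recipe Llama 3 uses, ~1e15 dense BF16 FLOP/s per H100, 40% utilization.

llama3_flops = 4e25
multiplier_over_llama = 1000 / 5                       # ~200x left over the Llama recipe
target_flops = llama3_flops / multiplier_over_llama    # 2e23 FLOPs

useful_flops_per_s = 128 * 1e15 * 0.4                  # 128 H100s at 40% utilization
months = target_flops / useful_flops_per_s / (30 * 24 * 3600)
print(f"{target_flops:.0e} FLOPs, ~{months:.1f} months on 128 H100s")  # ~1.5 months
```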

Comment by Vladimir_Nesov on Habryka's Shortform Feed · 2024-10-30T02:21:51.204Z · LW · GW

Bug: I can no longer see the number of agreement-votes (which is distinct from the number of Karma-votes). It shows the Agreement Downvote tooltip when hovering over the agreement score (the same for Karma score works correctly, saying for example "This comment has 31 overall karma (17 Votes)").

Edit: The number of agreement votes can be seen when hovering over two narrow strips, probably 1 pixel high, one right above and one right below the agreement rating.

Comment by Vladimir_Nesov on A path to human autonomy · 2024-10-30T02:04:05.913Z · LW · GW

but have so far only found relatively incremental improvements to transformers (in the realm of 1000x improvement)

What 1000x improvement? Better hardware and larger scale are not algorithmic improvements. Careful study of scaling laws to get Chinchilla scaling and set tokens per parameter more reasonably[1] is not an algorithmic improvement. There was maybe a 5x-20x algorithmic improvement, meaning the compute multiplier: how much less compute one would need to get the same perplexity on some test data. The upper bound is speculation, based on published research for which there are no public results of large-scale experiments (including for combinations of multiple methods) and on the absence of very strong compute multiplier results from developers of open weights models who publish detailed reports, like DeepSeek and Meta. The lower bound can be observed in the Mamba paper (Figure 4, Transformer vs. Transformer++), though it doesn't test MoE over dense transformer (which should be a further 2x or so, but I still don't know of a paper that demonstrates this clearly).

Recent Yi-Lightning is an interesting example that wins on Chatbot Arena in multiple categories over all but a few of the strongest frontier GPT-4 level models (original GPT-4 itself is far behind). It was trained for about 2e24 FLOPs, 10x less than original GPT-4, and it's a small overtrained model, so its tokens per parameter are very unfavorable; that is, it would've been possible to make it even more capable with the same compute.


  1. It's not just 20 tokens per parameter. ↩︎

Comment by Vladimir_Nesov on The Alignment Trap: AI Safety as Path to Power · 2024-10-30T01:09:02.271Z · LW · GW

The point is that the "controller" of a "controllable AI" is a role that can be filled by an AI, and not only by a human or a human institution. AI is going to quickly grow the pie to the extent that makes current industry and economy (controlled by humans) a rounding error, so it seems unlikely that among the entities vying for control over controllable AIs, humans and human institutions are going to be worth mentioning. It's not even about a takeover; Google didn't take over Gambia.

Comment by Vladimir_Nesov on The Alignment Trap: AI Safety as Path to Power · 2024-10-30T00:26:20.128Z · LW · GW

If your work makes AI systems more controllable, who will ultimately wield that control?

A likely answer is "an AI".

Comment by Vladimir_Nesov on The Alignment Trap: AI Safety as Path to Power · 2024-10-30T00:22:54.138Z · LW · GW

Recent discussions about artificial intelligence safety have focused heavily on ensuring AI systems remain under human control. While this goal seems laudable on its surface, we should carefully examine whether some proposed safety measures could paradoxically enable rather than prevent dangerous concentrations of power.

The aim of avoiding an AI takeover that ends poorly for humanity is not about preventing dangerous concentrations of power. Power that is distributed among AIs and not concentrated is entirely compatible with an AI takeover that ends poorly for humanity.

Comment by Vladimir_Nesov on Habryka's Shortform Feed · 2024-10-29T23:23:36.537Z · LW · GW

would not want the comment font be the same as the post font [...] the small font-size that you want to display comments as

I had to increase the zoom level by about 20% (from 110% to 130%) after this change to make the comments readable[1]. This made the post text too big, to the point where I would normally adjust the zoom level downward, but I can't in this case[2], since the comments are on the same site as the posts. Also, the lines in both posts and comments are now too long (with the greater zoom).

I sit closer to the monitor than standard to avoid the need for glasses[3], so long lines span a greater angular distance. In practice modern sites usually have a sufficiently narrow column of text in the middle, so this is almost never a problem. Before the update, LW line lengths were OK (at 110% zoom). At a monitor/window width of 1920px, Substack's 728px seems fine (at default zoom), but LW's 682px get ballooned too wide at 130% zoom.

The point is not that accommodating sitting closer to the monitor is an important use case for a site's designer, but that somehow the convergent design of most of the web manages to pass this test, so there might be more reasons for that.

Incidentally, the footnote font size is 12.21px, even smaller than the comment font size of 15.08px.


  1. The comment font still doesn't feel "sharp", like there's more anti-aliasing at work. It's Gill Sans Nova Medium, size 15.08px (130% zoom applies on top of that). OpenSans Regular 18px on RoyalRoad (100% zoom; as an example sans font) doesn't have this problem. LW post text is fine (at either zoom), Warnock Pro 18.2px. I'm in Firefox on Arch Linux, 1920x1080.
    Here's a zoomed-in screenshot from LW (from 130% zoom in Firefox):

    Here's a zoomed-in screenshot from RoyalRoad (from 100% zoom in Firefox):
    ↩︎

  2. I previously never felt compelled to figure out how to automate font change in some places of a site. ↩︎

  3. That is, with more myopia than I have I would wear glasses, and will less myopia I would put the monitor further back on the desk. ↩︎

Comment by Vladimir_Nesov on Vladimir_Nesov's Shortform · 2024-10-26T17:30:40.050Z · LW · GW

Kai-Fu Lee, CEO of 01 AI, posted on LinkedIn:

Yi-Lightning is a small MOE model that is extremely fast and inexpensive. Yi-Lightning costs only $0.14 (RMB0.99 ) /mil tokens [...] Yi-Lightning was pre-trained on 2000 H100s for 1 month, costing about $3 million, a tiny fraction of Grok-2.

Assuming it's trained in BF16 with 40% compute utilization, that's a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it's not MoE, so the FLOPs are not used as well). Assuming from the per-token price that it has 10-20B active parameters, it's trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
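A sketch of how those numbers follow (the ~1e15 dense BF16 FLOP/s per H100, 40% utilization, and 6ND compute estimate are assumptions):

```python
# Yi-Lightning back-of-the-envelope: training FLOPs from the disclosed cluster,
# then tokens from an assumed active-parameter count.
# Assumptions: ~1e15 dense BF16 FLOP/s per H100, 40% utilization, compute ~ 6*N*D.

flops = 2000 * (30 * 24 * 3600) * 1e15 * 0.4   # 2000 H100s for 1 month -> ~2e24

for n_active in (10e9, 20e9):                  # 10-20B active parameters (from price)
    tokens = flops / (6 * n_active)
    print(f"{n_active/1e9:.0f}B active params -> ~{tokens/1e12:.0f}T tokens")
# -> roughly 17T and 35T tokens, i.e. the 15-30T ballpark above
```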