almost no difference between 180b vs 800b model, when r=1(table 4)
It's a 3B parameter model, so training it for 180B tokens already overtrains it maybe 3x, and training for 800B tokens overtrains it 13x. The loss of compute efficiency from the latter is about 1.6x greater than from the former, but with 4.4x more raw compute the 800B-token run still has about 2.7x more effective compute, so it acts like a compute optimal model that's 1.6x larger, trained on 1.6x more tokens. So the distinction is smaller than 180 vs. 800 suggests.
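A small sketch of this arithmetic, assuming a ~20 tokens/param compute-optimal ratio and the cube-root overtraining penalty rule of thumb discussed elsewhere in these comments:

```python
# Rough sanity check, assuming ~20 tokens/param compute-optimal ratio
# and the (degree of overtraining)^(1/3) compute-penalty rule of thumb.
params = 3e9
optimal_tokens = 20 * params            # ~60B tokens

def penalty(tokens):
    """Effective-compute penalty from overtraining by tokens/optimal_tokens."""
    return (tokens / optimal_tokens) ** (1 / 3)

raw_ratio = 800e9 / 180e9                          # ~4.4x more raw compute
penalty_ratio = penalty(800e9) / penalty(180e9)    # ~1.6x extra penalty
effective_ratio = raw_ratio / penalty_ratio        # ~2.7x more effective compute
scale_up = effective_ratio ** 0.5                  # ~1.6x larger compute optimal model
print(raw_ratio, penalty_ratio, effective_ratio, scale_up)
```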
I think this framing doesn't work, programs almost never control each other. Instead they can coordinate with each other by agreeing to follow decisions of a third program, which is identical between them, a "contract". Initially, the contract isn't yet "signed", so seeing each other's code sets up the conditions for defining a shared contract (deciding to follow what it'll say once computed).
There could be many contracts simultaneously, each weakly nudging the decisions of multiple agents coordinated through them. Social norms are contracts in this sense. I think some computations of circuits of deep learning models are contracts with the environment: these computations (numerous and small) decide both what the environment will do (the way arithmetic decides what a physical calculator will do) and what the model will predict, and so the two become coordinated, even in situations never seen in the training dataset.
Whether upvotes need to be explained overall is not relevant to my comment, as I'm talking about the specific considerations named by Noah Birnbaum.
It's not yet known if there is a way of turning R1-like training into RSI with any amount of compute. This is currently gated by quantity and quality of graders for outcomes of answering questions, which resist automated development.
If the reasons to leave are too legible, they are either toothless, or they will be gamed and become too costly to actually enforce (with costs that include injustice and drama). Trivial inconveniences that differentially apply to people who should leave anyway are still effective, but don't have these downsides.
(My own policy is to almost always avoid downvoting precisely when I have a comment to make. Otherwise the vote is all the feedback I have to give, so I'm going to give it rather than metaphorically slash their tires by staying silent and maintaining a misleading impression about the reception of their post/comment.)
These considerations also apply to upvotes (to the extent that they do).
It's crucial that some people get discouraged and leave for illegible reasons, without a need for hard enforcement, which has unwieldy externalities. For almost everyone who should stay, figuring out reasons for significant downvoting is probably not very difficult. Any discussion would then be about correctness or endorsement of those reasons, not about finding out what they are.
For scaling to larger training systems, the trend is probably increasing, since larger datasets have lower quality, and soon repetition in training will become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), it's not going to be ignored if at all possible. There are other studies that show a decreasing trend, but this probably won't hold up in practice as we get to 250T and then 750T tokens within a few years even for a dense model.
For 1:32 MoE at 5e28 FLOPs (5 GW $150bn training systems of 2028), we get maybe 700 tokens/param optimal (counting effect of sparsity, effect of repetition, and effect of more compute), so that's 3.5T active and 110T total params trained for 2.5e15 tokens (maybe 80T tokens repeated 30 times). Not sure if this kind of total params can be made to work.
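A sketch of where these numbers come from, assuming C ≈ 6·N·D and the ~700 tokens per active param optimum estimated above:

```python
# Sketch of the 5e28 FLOPs projection, assuming C = 6*N*D and an optimal
# ratio of ~700 tokens per active param for a 1:32-sparse MoE at this scale.
C = 5e28
ratio = 700
N_active = (C / (6 * ratio)) ** 0.5     # ~3.5e12 active params
N_total = 32 * N_active                 # ~1.1e14 total params
D = ratio * N_active                    # ~2.5e15 tokens
epochs = D / 80e12                      # ~30 repetitions of an 80T-token dataset
print(N_active, N_total, D, epochs)
```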
Chinchilla's 20 tokens/param (at 6e23 FLOPs) changes significantly when working with different datasets, architectures, or amounts of compute. For Llama-3-405B, it's 37 tokens/param at 4e25 FLOPs, increasing 1.5x for every 1000x of compute (Figure 3). When training on data repeated 60 times, optimal tokens/param increases about 2.5x (Figure 3).
For MoE models with 87% (1:8) sparsity, optimal tokens/param increases 3x, and at 97% (1:32) sparsity by 6x (Figure 12, left). This suggests that if Llama-3-405B were instead a MoE model with 97% sparsity, its optimal ratio would be 220 tokens/param rather than 37.
Overtraining or undertraining is the use of a suboptimal tokens/param ratio. The effect is not that large: the rule of thumb is that the compute multiplier penalty is the degree of overtraining raised to the power 1/3. So 30x overtraining (using 600 tokens/param instead of 20 tokens/param) results in the same penalty as training a compute optimal model with 3x less compute, and 10x overtraining (or undertraining) corresponds to using 2x less compute (which can be compensated by using 2x more compute instead, in order to maintain the same performance).
This curiously suggests that original GPT-4 was also undertrained, similarly to GPT-3. Rumored compute is 2e25 FLOPs, and rumored architecture is 1.8T total parameter MoE with 2:16 sparsity, so 220B params for active experts, and say another 40B for non-expert params, for the total of 260B. This gives 13T tokens or 50 tokens/param. If the dataset has Llama-3's 37 tokens/param optimal for a dense model at 2e25 FLOPs, then with 1:8 sparsity the optimal ratio would be 110 tokens/param, so at 50 tokens/param it's undertrained about 2x. The effect of this is losing 1.3x in effective compute, not a whole lot but more than nothing.
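A hedged check of the original GPT-4 estimate above, using C ≈ 6·N·D and the cube-root penalty rule of thumb; all inputs are the rumored figures, not independently sourced:

```python
# Hedged check of the rumored original GPT-4 numbers, using C = 6*N*D
# and the (degree of undertraining)^(1/3) penalty rule of thumb from above.
C = 2e25            # rumored raw compute, FLOPs
N_active = 260e9    # ~220B active expert + ~40B non-expert params (rumor)
D = C / (6 * N_active)                 # ~1.3e13 tokens
ratio = D / N_active                   # ~50 tokens/param
optimal_ratio = 37 * 3                 # ~110: assumed dense optimum * 1:8-sparsity factor
undertraining = optimal_ratio / ratio  # ~2x undertrained
penalty = undertraining ** (1 / 3)     # ~1.3x effective compute loss
print(D, ratio, undertraining, penalty)
```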
With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.
With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining.
Now consider the 1e20 FLOPs plot in Figure 12, left. If there's only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
10.5-13% on text only part of HLE
This is for o3-mini, while the ~25% figure for o3 from the tweet you linked is simply restating deep research evals.
And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).
But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen in the isoFLOP plots of Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), the compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments at the other compute budgets, the reduction in active parameters relative to dense seems to be about 2.5x. Keeping compute unchanged, 2.5x fewer active parameters means 2.5x more data, or a 6x greater tokens/parameter ratio for a compute optimal training run.
Thus a dense model can be replaced with a 97% sparse MoE model trained using 6x less compute that will achieve the same perplexity, but the tokens/parameter ratio of this MoE model will be 6x greater than for the original dense model. Both data and active parameters would go down by 2.5x from reducing compute 6x if the ratio didn't change, but since it does change, in actuality only the number of active parameters goes down 6x, while the number of tokens stays the same.
Let's take Llama-3-405B as an example, which is a 405B parameter compute optimal model trained for 15T tokens at 40 tokens/parameter, using 4e25 FLOPs. An equivalent 97% sparse model will have 70B active parameters, 2T total parameters, and will need to be trained for the same 15T tokens to reach the same perplexity/loss at 220 tokens/parameter, using 6e24 FLOPs. (Which is close to DeepSeek-V3's 4e24-5e24 FLOPs actually, so anchoring to Llama-3-405B might be a good way of framing its compute efficiency.)
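A sketch of this comparison, assuming C ≈ 6·N·D and a 6x compute multiplier at unchanged data:

```python
# Sketch of the Llama-3-405B vs 97%-sparse-MoE comparison, assuming C = 6*N*D.
dense_N, D = 405e9, 15e12
dense_C = 6 * dense_N * D              # ~4e25 FLOPs
moe_N = dense_N / 6                    # ~70B active params (6x compute multiplier, same data)
moe_total = moe_N * 32                 # ~2T total params at 1:32 sparsity
moe_C = 6 * moe_N * D                  # ~6e24 FLOPs
moe_ratio = D / moe_N                  # ~220 tokens per active param
print(dense_C, moe_N, moe_total, moe_C, moe_ratio)
```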
didn't run red-teaming and persuasion evals on the actually-final-version
Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won't be redone, so it's equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.
they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks."
Ah, I failed to take note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is convergence, both from near-saturation of the benchmark and from the graph apparently leveling off. But if replotted in log-steps instead of linear steps, there isn't any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: accuracy for pass@1 is 0.45 after 2K steps, 0.55 (+0.10) after 4K steps, then 0.67 (+0.12) after 8K steps; it just keeps going up by about +0.10 every doubling in training steps. And the plots-that-don't-level-off in the o1 post are in log-steps. Also, the average number of reasoning steps for R1-Zero in Figure 3 is a straight line that's probably good for something if it goes further up. So I guess I might even disagree with the authors in characterizing step 10K as "at convergence", though your quote is about R1 rather than R1-Zero, which is what the plots in the paper show...
your analysis of GPT-5--which is worrying for short-term scaling
Well, I mostly argued about naming, not facts, though the recent news seems to suggest that the facts are a bit better than I expected only a month ago: namely, 1 GW training systems might only get built in 2026 rather than in 2025, except possibly at Google. And as a result even Google might feel less pressure to actually get this done in 2025.
DeepSeek-R1 ... Run RL to convergence
Not to convergence, the graphs in the paper keep going up. Which across the analogy might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.
o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023)
It seems like o1-mini is its own thing, might even start with a base model that's unrelated to GPT-4o-mini (it might be using its own specialized pretraining data mix). So a clue about o3-mini data doesn't obviously transfer to o3.
if it used GPT-5 as a base model
The numbering in GPT-N series advances with roughly 100x in raw compute at a time. If original GPT-4 is 2e25 FLOPs, then a GPT-5 would need 2e27 FLOPs, and a 100K H100s training system (like the Microsoft/OpenAI system at the site near the Goodyear airport) can only get you 3e26 FLOPs or so (in BF16 in 3 months). The initial Stargate training system at Abilene site, after it gets 300K B200s, will be 7x stronger than that, so will be able to get 2e27 FLOPs. Thus I expect GPT-5 in 2026 if OpenAI keeps following the naming convention, while the new 100K H100s model this year will be GPT-4.5o or something like that.
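A rough version of the 3e26 FLOPs estimate; the per-GPU throughput and utilization below are assumptions rather than figures from the comment:

```python
# Hedged estimate of 3-month training-run FLOPs for a 100K-H100 system,
# assuming ~1e15 dense BF16 FLOP/s per H100 and ~40% utilization.
gpus = 100_000
peak_flops = 1e15          # per-GPU dense BF16 FLOP/s (assumed)
mfu = 0.4                  # assumed model FLOPs utilization
seconds = 90 * 24 * 3600   # ~3 months
total = gpus * peak_flops * mfu * seconds   # ~3e26 FLOPs
print(total)
```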
The fact that RL seems to be working well on LLMs now, without special tricks, as reported by many replications of r1, suggests to me that AGI is indeed not far off.
Still, at least as long as base model effective training compute isn't scaled another 1,000x (which is 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM) rewards, which for now don't let RL scale as much as with explicitly coded verifiers.
This is an obvious thing to try, but it's not what currently already works, and it's not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn't work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
Relative to GPT-4o, which was trained at a time when 30K H100s clusters were around, and so in BF16 could be expected to be around 8e25 FLOPs, possibly overtrained to a degree that's not too different from DeepSeek-V3 itself.
Amodei's post you linked says a few tens of millions of dollars for Claude 3.5 Sonnet, which is maybe 4e25 FLOPs in BF16, but I think Claude 3.5 Sonnet is better than DeepSeek-V3, which is not as clearly the case for GPT-4o vs. DeepSeek-V3, making the latter pair easier to compare. Being better than GPT-4o at 2x fewer FLOPs, Claude 3.5 Sonnet has at least a 4x compute multiplier over it (under all the assumptions), but not necessarily more, while with DeepSeek-V3 there's evidence for more. As DeepSeek-V3 was trained more than half a year later than Claude 3.5 Sonnet, it's somewhat "on trend" in getting a compute multiplier of 16x instead of 4x, if we anchor to Amodei's claim of 4x per year.
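One hedged way to get from "a few tens of millions of dollars" to roughly 4e25 FLOPs; the cost, rental rate, and utilization below are all assumptions chosen to land in that ballpark, not figures from the post:

```python
# Dollars-to-FLOPs sketch; every input is an assumption.
cost = 45e6                   # dollars, assumed "a few tens of millions"
dollars_per_h100_hour = 2.0   # assumed rental-equivalent rate
mfu = 0.5                     # assumed utilization of ~1e15 BF16 FLOP/s per H100
flops = (cost / dollars_per_h100_hour) * 3600 * 1e15 * mfu
print(flops)                  # ~4e25 FLOPs
```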
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural, as there are millions of datacenter GPUs but only a few 100K-GPU frontier training systems, which are a tiny fraction compared to inference and smaller/research training compute. The $500bn figure is not relevant, as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.
OpenAI would want to get from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is some evidence of slowdown, since it only motivates saying you want to build frontier training systems even faster, but doesn't in itself motivate actually going through with it, beyond building a competitive training system that makes you independent.
So the clues that support the prospect of scaling to 1 GW in 2025 and to 5 GW in 2027 could be misleading, running contrary to hyperscaler attitudes and not aligning even with OpenAI's immediate incentives.
I previously expected that $80bn is evidence that they are building a large training system this year, but it now seems that they are building more inference instead. ↩︎
As Satya Nadella said, "If OpenAI disappeared tomorrow... we have all the IP rights and all the capability. We have the people, we have the compute, we have the data, we have everything. We are below them, above them, around them." ↩︎
From what I remember, the training-compute optimal number of experts was like 64
I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.
Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so "number of experts" doesn't really capture the ratio of total to activated, probably not a good anchor by itself.
Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
This still doesn't help with the question of why 37B active parameters is sufficient. Even with 100500 experts you can't expect 1B active parameters to be sufficient to maintain GPT-4 quality. The rumor for original GPT-4 is that it has 2 activated experts out of 16 in total, so the ratio is 1:8, while for DeepSeek-V3 it's 1:32.
that's why I wrote: "possibly 4x fewer training steps for the same number of tokens if predicting tokens only once" (assuming predicting 4 tokens at a time), but that's not demonstrated nor published (given my limited knowledge on this)
Not sure how to parse this; my point is that the number of training steps remains the same and training efficiency doesn't significantly increase, there's even slight overhead from adding the predict-the-token-after-next blocks of parameters. This is described in Section 2.2 of the paper. You get better speculative decoding at inference time (and also better quality of output), but training time is the same, not 2x or 4x fewer steps.
32B active parameters instead of likely ~220-280B for GPT-4 => 6.8-8.7x lower training cost per token.
It's 37B active parameters, not 32B.
The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.
But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in Jevons paradox), that argument doesn't work. Also, the ceiling of quality at the possible scaling slowdown point depends on efficiency of training (compute multiplier) applied to the largest training system that the AI economics will support (maybe 5-15 GW without almost-AGI), and improved efficiency of DeepSeek-V3 raises that ceiling.
Taken in isolation, DeepSeek-V3 looks like a 15x compute multiplier. But if a lot of it is data, the multiplier won't scale (when you need much more data, it necessarily becomes worse, or alternatively you need a teacher model that's already better). In any case, this raises the ceiling for what 5 GW training systems can do (at which point there's either almost-AGI or scaling slows down a lot). And there the 15x multiplier of DeepSeek-V3 (or what remains of it after scaling) needs to be compared with the algorithmic advancements of 2025-2028, which would've included most of the things in DeepSeek-V3 anyway, so the counterfactual impact is small.
32B active parameters instead of likely ~220B for GPT4
It's 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn't an issue, why not 8B active parameters, or 1M active parameters?
32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost
Doesn't follow; training cost also scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
The training costs are maybe 5e24 FLOPs and 2e25 FLOPs, differing by 4x. DeepSeek-V3 is better than original GPT-4 though; you need to compare with GPT-4o, which almost certainly uses more compute in training than original GPT-4 (maybe 4x more, so maybe 16x more than DeepSeek-V3).
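A rough cross-check of the ~5e24 vs ~2e25 figures; the per-GPU throughput and utilization are assumptions, and the GPT-4 inputs are the rumored ones from the comments above:

```python
# Hedged cross-check of the ~5e24 vs ~2e25 FLOPs figures.
# DeepSeek-V3: from the reported 2.788M H800-hours, assuming ~2e15 FP8 FLOP/s
# peak per H800 and ~25% utilization (both assumptions).
v3_flops = 2.788e6 * 3600 * 2e15 * 0.25      # ~5e24 FLOPs
# Original GPT-4: from C = 6*N*D with rumored ~260B active params, ~13T tokens.
gpt4_flops = 6 * 260e9 * 13e12               # ~2e25 FLOPs
print(v3_flops, gpt4_flops, gpt4_flops / v3_flops)   # ratio ~4x
```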
8bits training instead of 16bits => 4x lower training cost
FLOP/s for FP8 are almost always 2x the FLOP/s for BF16, not 4x.
Multi-token training => ~2x training efficiency
You still train on every token. There is an additional "layer" in model parameters that predicts the token-after-next (Figure 3 in the paper), so there's a bit of overhead in training (not much, with 61 total layers). The results are better, but not that much better (Table 4).
training on O1 outputs
Outputs of o1 don't include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.
Imitation helps with post-training, but the compute-heavy part is pretraining, and obtaining good quality with little pretraining is a novel feat that isn't known to be explainable by good post-training, or by including a lot of outputs from good models in the pretraining/annealing mix.
This seems unlikely to be a neglected concern, unless there are specific signs that it is.
could end up being the most important thing I’ve ever written
The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.
The prior reference is a Dylan Patel tweet from Nov 2024, in the wake of R1-Lite-Preview release:
Deepseek has over 50k Hopper GPUs to be clear.
People need to stop acting like they only have that 10k A100 cluster.
They are omega cracked on ML research and infra management but they aren't doing it with that many fewer GPUs
DeepSeek explicitly states that
DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
This seems unlikely to be a lie: the reputational damage would've motivated not mentioning the amount of compute at all, but the most interesting thing about DeepSeek-V3 is precisely this claim, that its quality is possible with so little compute.
Certainly designing the architecture, the data mix, and the training process that made it possible required much more compute than the final training run, so in total it cost much more to develop than $6 million. And the 50K H100/H800 system is one way to go about that, though renting a bunch of 512-GPU instances from various clouds probably would've sufficed as well.
Found the following in the Jan 23 newsletter:
AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.
What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
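The H100-equivalence ratios implied by these figures, made explicit; the per-chip ratios below are assumptions backed only by the numbers in this comment:

```python
# H100-equivalence implied by the figures above (assumed per-chip ratios:
# Trn2 ~ 0.625x and B200 ~ 2.5x the FLOP/s of an H100).
anthropic = 400_000 * 0.625     # ~250K H100-equivalents
openai_low = 200_000 * 2.5      # ~500K H100-equivalents
openai_high = 300_000 * 2.5     # ~750K H100-equivalents
print(anthropic, openai_low, openai_high)
```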
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.
Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.
What can be done for $6 million, can be done even better with 6 million GPUs[1]. What can be done with 6 million GPUs, can't be done for $6 million. Giant training systems are the moat.
By "3rd person perspective" I mean considering the world itself, there is no actual third person needed for it. It's the same framing as used by a physicist when talking about the early stages of the universe when humans were not yet around, or when talking about a universe with alternative laws of physics, or when talking about a small system that doesn't include any humans as its part. Or when a mathematician talks about a curve on a plane.
Knowing absolutely everything is not necessary to know the relevant things, and in this case we know all the people at all times, and the states of their minds, their remembered experiences, and the possible reasoning they might perform based on those experiences. Observations take time and cognition to process, so they should always be considered from slightly in the future relative to when raw data enters a mind. Thus it's misleading to talk about a person who will experience an observation shortly and what that experience entails; the clearer situation is looking at a person who has already experienced that observation a bit in the past and can now think about it. When a copied person looks back at their memories, or a person about to be copied considers what's about to happen, the "experience" of being copied is nowhere to be found; there is only the observation of the new situation that the future copies find themselves in, and that has nothing to do with the splitting into multiple copies of the person from the past.
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1
Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.
GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million
Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it's in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
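Rough checks of the numbers in this comparison; all inputs are the estimates from the two paragraphs above, not independently sourced:

```python
# Hardware comparison: original GPT-4's 2e25 FLOPs reproduced on 2024 hardware
# in FP8, vs DeepSeek-V3's actual cluster in H100-equivalents.
gpt4_h100s_fp8 = 8_000 / 2            # ~4K H100s (FP8 assumed to double FLOP/s)
v3_h100_equiv = 2_000 * 0.75          # 2K H800s ~ 1.5K H100-equivalents
print(gpt4_h100s_fp8 / v3_h100_equiv) # ~2.7x, "about 3x, not 20x"

# Architecture comparison: compute-optimal-equivalent FLOPs.
gpt4o_optimal = 4e25                  # assumed compute optimal equivalent of GPT-4o
v3_optimal = 3e24                     # compute optimal variant of DeepSeek-V3
print(gpt4o_optimal / v3_optimal)     # ~13x, "about 15x"
```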
toy model ... f(x) = Ax and g(x) = Bx, where x is the compute invested
Returns on compute are logarithmic though, advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion or $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
That is, raw utilized compute. I'm assuming the same compute utilization for all models. ↩︎
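A tiny illustration of the logarithmic-returns point in the comment above, assuming advantage scales with the log of compute:

```python
# Decades (log10) of compute between the budgets named above.
import math
full_gap = math.log10(150e9 / 150e6)   # ~3.0 decades: $150B vs $150M
top_gap = math.log10(150e9 / 5e9)      # ~1.5 decades: $150B vs $5B
mid_gap = math.log10(5e9 / 150e6)      # ~1.5 decades: $5B vs $150M
print(full_gap, top_gap, mid_gap)      # 3.0 is roughly twice 1.5
```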
all copies ... will claim to be the original ... regardless of whether they are the original
Not if they endorse Litany of Tarski and understand the thought experiment!
Any "perceive yourself to X" phenomenon is something that happens within cognition of some abstract agent/person instance, whether they exist in some world or not. What kind of person instance is "perceiving themselves to black out" (that is, having blacked out)? Ghosts and afterlife seem more grounded than that. But for Earth/Mars question, both options are quite clear, and there is a you that perceives either of them in some of the possibilities, we can point to where those that perceive each of them are, and that is what would be correct for those instances to conclude about themselves, that they exist in the situations that contain them, known from the statement of the thought experiment.
A 3rd person perspective is there anyway, can be used regardless, even if other perspectives are also applicable. In this case it explains everything already, so we can't learn additional things in other ways.
There is a full explanation right there, in the description of the thought experiment. It describes all outcomes, including all observations and theoretical conclusions made by all the people-instances. We can look at this and ask whether those theoretical conclusions are correct, whether the theories the people-instances use to arrive at them are valid. You can tell what all the details of outcomes are in advance of actually doing this.
Personal experience of people existing in the world is mediated by the physical states of their brains (or other physical hardware). So we can in principle predict what it says by asking about the physical content of the world. There are agents/people that don't have concrete instances in the world, and we can ask what they experience. They might leave the physical world, or enter it back, getting instantiated once more or for the first time. They might persistently exist outside concrete instantiation in the world, only communicating with it through reasoning about their behavior, which might be a more resource efficient way to implement a person than a mere concrete upload. But that's a different setting, not what this post describes.
One you in the worlds with total weight of 0.001 will observe remaining on Earth, while either the exact or approximate you in the worlds with total weight of 1.000 will observe arriving on Mars. That is all that actually happens.
Then they'll start making strange proclamations about their newfound epistemic states and empirical observations from the personal observation stream relevant to theories of identity, but that's beside the point.
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
This could also work for general intelligence and not only narrow math/coding olympiad sort of problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities; there are no clear signs so far that this is happening.
But this is a constraint on how the data can be generated, not on how efficiently other models can be retrained using such data to channel the capabilities. If at some point there will be a process for generating high quality training data for general intelligence, that data might also turn out to be effective for cheaply training other models. The R1-generated data used to train the distill models is 800K samples[1], which is probably 1B-10B tokens, less than 0.1% of typical amounts of pretraining data.
This is according to the report, though they don't seem to have released this data, so distill models can't be reproduced by others in the same way they were made by DeepSeek. ↩︎
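The token arithmetic behind the "less than 0.1%" estimate above, assuming (as a guess) 1K-10K tokens per sample and ~15T tokens of typical pretraining data:

```python
# Rough token count for the 800K R1-generated samples (per-sample length assumed).
samples = 800_000
low, high = samples * 1_000, samples * 10_000    # ~0.8B to ~8B tokens
print(low / 15e12, high / 15e12)                 # ~0.005% to ~0.05% of 15T tokens
```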
it took people about 8 months to accelerate Andrej Karpathy's PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2
The baseline is weak, and the 8 months is mostly just catching up to the present. They update the architecture (giving maybe a 4x compute multiplier) and shift to a more compute optimal tokens/parameter ratio (a 1.5x multiplier). Maybe there is another 2x from the more obscure changes (which are still in the literature, so the big labs have the opportunity to measure how useful they are and select what works).
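Roughly multiplying the factors named above (all of them approximate guesses) against the reported 14x:

```python
# Rough decomposition of the speedup into the guessed factors above.
arch = 4.0         # architecture updates, maybe ~4x compute multiplier
ratio_shift = 1.5  # more compute optimal tokens/parameter ratio
misc = 2.0         # more obscure changes
print(arch * ratio_shift * misc)   # ~12x, in the ballpark of the reported 14x
```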
It's much harder to improve on GPT-4 or Llama-3 that much.
what's even more remarkable is that almost all that acceleration is due to better sample efficiency with the required training data dropping from 10 billion tokens to 0.73 billion tokens on the same training set with the fixed order of training tokens
That's just in the rules of the game: the number of model parameters isn't allowed to change, so in order to reduce training FLOPs (preserving perplexity) they reduce the amount of data. This also incidentally improves the optimality of the tokens/parameter ratio, though at 0.73B tokens it already overshoots, turning the initially overtrained 10B-token model into a slightly undertrained one.
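The tokens/param arithmetic for the fixed 124M-parameter model, against the ~20 tokens/param Chinchilla anchor:

```python
# Tokens/param before and after the speedup, for the fixed 124M-param GPT-2.
params = 124e6
print(10e9 / params)     # ~80 tokens/param at the start: overtrained vs ~20
print(0.73e9 / params)   # ~6 tokens/param at the end: now under the ~20 anchor
```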
There is a difference in external behavior only if you need to communicate knowledge about the environment and the other players explicitly. If this knowledge is already part of an agent (or rock), there is no behavior of learning it, and so no explicit dependence on its observation. Yet still there is a difference in how one should interact with such decision-making algorithms.
I think this describes minds/models better (there are things they've learned long ago in obscure ways and now just know) than learning that establishes explicit dependence of actions on observed knowledge in behavior (which is more like in-context learning).
What distinguishes a cooperate-rock from an agent that cooperates in coordination with others is the decision-making algorithm. Facts about this algorithm also govern the way outcome can be known in advance or explained in hindsight, how for a cooperate-rock it's always "cooperate", while for a coordinated agent it depends on how others reason, on their decision-making algorithms.
So in the same way that Newcomblike problems are the norm, "unfair" interaction with decision-making algorithms is the norm as well. I think it's just a very technical assumption that doesn't make sense conceptually and shouldn't be framed as "unfairness".
Training frontier models needs a lot of chips, situations where "a chip notices something" (and any self-destruct type things) are unimportant because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisan, they need to be applied to chips in bulk and those chips then need to be able to work for weeks in a datacenter without further interventions (that can't be made into part of the datacenter).
AI accelerator chips have 80B+ transistors, much more than an instance of certificate verification circuitry would need, so you can place multiple instances (and have them regularly recheck the certificates). There are EUV-pitch metal connections several layers deep within a chip; you'd need to modify many of them all over the chip without damaging the layers above, so I expect this to be completely infeasible to do for 10K+ chips on general principle (rather than from specific knowledge of how any of this works).
For clocks or counters, I guess AI accelerators normally don't have any rewritable persistent memory at all, and I don't know how hard it would be to add some in a way that makes it too complicated to keep resetting automatically.
Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can't be circumvented. I'm guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can't be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.
You don't survive for anthropic reasons. Anthropic reasons explain the situations where you happen to survive by blind luck.
for example Zvi insisting that anyone who is not using LLMs to 10x their productivity is not serious ... a vibe not a direct quote
I expect he'd disagree, for example I vaguely recall him mentioning that LLMs are not useful in a productivity-changing way for his own work. And 10x specifically seems clearly too high for most things even where LLMs are very useful, other bottlenecks will dominate before that happens.
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately correct for GPT-3, it's 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.
(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn't worth mentioning compared to everything else about it that's different from GPT-4.)
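A sketch tying these numbers together, using C ≈ 6·N·D and the cube-root under/overtraining penalty rule from the other comments here:

```python
# 30x overtraining costs ~30^(1/3) ~ 3.1x effective compute, matching the
# "10x data, 1/3 params, 3x more compute preserves perplexity" observation.
print(30 ** (1 / 3))
# GPT-3: raw compute from C = 6*N*D, and undertraining vs a ~20 tokens/param optimum.
N, D = 175e9, 300e9
print(6 * N * D)                 # ~3e23 FLOPs
undertraining = 20 / (D / N)     # ~12x, "about 10x"
print(undertraining ** (1 / 3))  # ~2.3x effective compute penalty, "about 2x"
```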
There is enough natural text data until 2026-2028, as I describe in the Peak Data section of the linked post. It's not very good data, but with 2,500x raw compute of original GPT-4 (and possibly 10,000x-25,000x in effective compute due to algorithmic improvement in pretraining), that's a lot of headroom that doesn't depend on inventing new things (such as synthetic data suitable for improving general intelligence through pretraining the way natural text data is).
Insufficient data could in principle be an issue with making good use of 5e28 FLOPs, but actually getting 5e28 FLOPs by 2028 (from a single training system) only requires funding. The decisions about this don't need to be taken based on AIs that exist today; they'll be taken based on AIs that exist in 2026-2027, trained on 1 GW training systems being built this year. With o3-like post-training, the utility and impressiveness of an LLM improves, so the chances of getting that project funded improve (compared to the absence of such techniques).
A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.
The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the outer agent's cognition don't even need to have its safety properties. This is one framing for the way people might live within a superintelligence.