The $25-40bn figure is an estimate for about 1 GW worth of GB200s. SemiAnalysis expects 1 GW training systems for Google in 2025 and something comparable for Microsoft/OpenAI. Dylan Patel discusses this publicly on Dwarkesh Podcast, claiming that there is a 300K B200s cluster and 500K-700K B200s in total currently being constructed, possibly networked into a single training system. So my takeaway is that a planned Microsoft capex of $60bn would've been surprising, too little for this project without cutting something else, while $80bn fits this story.
With Stargate, $100bn is still too much for the training systems of 2024-2025, so it's either not about what's being built in 2024-2025 at all, or a larger project that has current activities as part (which wouldn't fit building a big training system using a specific generation of hardware). Musk's 100K H100s Colossus tells me that building a training system in a year is feasible, even though it normally takes longer. The preliminary steps (land, power, permits, buildings) are much cheaper, but securing power and permits can require starting years in advance. So talking about a $100bn Stargate in 2024 is consistent with building it mostly in late 2026: once there is a plot with 3-5 GW of power and datacenter permits, most of the expense will then be in 2026 (Nvidia Rubin probably).
Stargate is not 2025, and going from $50bn to $80bn is in line with building a $25-40bn training system this year (an unusual expense distinct from other projects), a clue separate from SemiAnalysis claims.
When personal life expectancy of these same people alive today is something like 1e34 years, billions of years is very little.
How long would it even take to reach any of these places? Billions of years, right?
Noticing progress in long reasoning models like o3 creates a different blind spot compared to popular reporting on how scaling of pretraining is stalling out. It can appear that long reasoning models reconcile the central point of pretraining stalling out with AI progress moving fast. But plausible success of reasoning models instead suggests that pretraining will continue scaling even more[1] than could be expected before.
Training systems were already on track to go from 50 MW, training current models for up to 1e26 FLOPs, to 150 MW in late 2024, and then 1 GW by end of 2025, training models for up to 5e27 FLOPs in 2026, 250x compute of original GPT-4. But with o3, it now seems more plausible that $150bn training systems will be built in 2026-2027, training models for up to 5e28 FLOPs in 2027-2028, which is 500x compute of the currently deployed 1e26 FLOPs models or 2500x compute of original GPT-4.
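The multipliers in this paragraph are plain ratios of the quoted compute figures, which a short sketch can confirm. The 2e25 FLOPs value for the original GPT-4 is an assumption here, implied by the 250x ratio rather than stated directly.

```python
# Training compute in FLOPs, as quoted in the comment.
GPT4_ORIGINAL = 2e25   # assumption: implied by 5e27 being "250x" of original GPT-4
DEPLOYED_NOW = 1e26    # currently deployed models
SYSTEM_1GW = 5e27      # models trained on 1 GW systems in 2026
SYSTEM_150BN = 5e28    # models trained on $150bn systems in 2027-2028


def ratio(a, b):
    """Compute multiplier of a over b, rounded to a whole number."""
    return round(a / b)


gw_vs_gpt4 = ratio(SYSTEM_1GW, GPT4_ORIGINAL)
big_vs_deployed = ratio(SYSTEM_150BN, DEPLOYED_NOW)
big_vs_gpt4 = ratio(SYSTEM_150BN, GPT4_ORIGINAL)

print(gw_vs_gpt4, big_vs_deployed, big_vs_gpt4)
```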
Scaling of pretraining is not stalling out, even without the new long reasoning paradigm. It might begin stalling out in 2026 at the earliest, but now more likely only in 2028. The issue is that the scale of training systems is not directly visible: there is a 1-2 year lag between decisions to build them and the observed resulting AI progress.
[1] Reporting on how scaling is stalling out might have a point in returns on scale getting worse than expected. But if scale still keeps increasing despite that, there will be capabilities resulting from additional scale. Scaling by 10x in compute might do very little, and this is compatible with scaling by 500x in compute bringing a qualitative change.
Grabby distributed backup computronium that can't be overtaken in the expansion chase.
AGI-powered defense
Concrete existence, they point out, is less resource efficient than dreams of the machine. Hard to tell how much value is tied up in physical form and not computation, if humans would agree on this either way on reflection.
Ignoring such confusion is good for hardening the frame where the content is straightforward. It's inconvenient to always contextualize, refusing to do so carves out the space for more comfortable communication.
Some benchmarks got saturated across this range, so we can imagine "anti-saturated" benchmarks that didn't yet noticeably move from zero, operationalizing intuitions of lack of progress. Performance on such benchmarks still has room to change significantly even with pretraining scaling in the near future, from 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, 500x more.
Eating of the Sun is reversible, it's letting it burn that can't be reversed. The environmentalist option is to eat the Sun as soon as possible.
As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory.
I think most FDT/embeddedness weirdness is about explaining the environment using bounded computations that are not (necessarily) literally already found in the environment as part of it. Not about sharing the actual source code, just any information about what's going on, captured in the form of computations, known to have captured that information before they are carried out. Things like static program analysis and deep learning models try to do this, but don't confront the weirdness of FDT/embeddedness.
Solomonoff induction is a very clean way of doing something like this, but doesn't go into decision theory. AIXI is closest to both doing it cleanly and confronting the weirdness, but something basic might be missing to make it applicable, that should be possible to fix.
To 10x the compute, you might need to 10x the funding, which AI capable of automating AI research can secure in other ways. Smaller-than-frontier experiments don't need unusually giant datacenters (which can be challenging to build quickly), they only need a lot of regular datacenters and the funding to buy their time. Currently there are millions of H100 chips out there in the world, so 100K H100 chips in a giant datacenter is not the relevant anchor for the scale of smaller experiments, the constraint is funding.
you can replace a lot of human labor. But an equivalent replacement for physical space or raw materials for manufacturing does not exist.
There is a lot of space and raw materials in the universe. AI thinks faster, so technological progress happens faster, which opens up access to new resources shortly after takeoff. Months to years, not decades to centuries.
An algorithm that computes 22+117 or something like that is free to compute it correctly, even as it's running on a physical computer that might be broken in a subtle way, possibly producing a different result. Identifying with an algorithm that your brain currently implements when making a decision doesn't seem different, you are just a more complicated algorithm, producing some result. What the physical world does with that result is a separate issue, but for purposes of this argument the algorithm is selected to be in tune with the world, it's an algorithm that the brain is currently simulating in detail.
This narrative (on timing) promotes building $150bn training systems in 2026-2027. AGI is nigh, therefore it makes sense to build them. If they aren't getting built, that might be the reason AGI hasn't arrived yet, so build them already (implies the narrative).
Actual knowledge that this last step of scaling is just enough to be relevant doesn't seem likely. This step of scaling seems to be beyond what happens by default, so a last push to get it done might be necessary. And the step after it won't be possible to achieve with mere narrative. While funding keeps scaling, the probability of triggering an intelligence explosion is higher; once it stops scaling, the probability (per year) goes down (if intelligence hasn't exploded by then). In this sense the narrative has a point.
I'm not making any claims about feasibility, I only dispute the claim that it's known that permanently giving up the potential for human control is an acceptable thing to do, or that making such a call (epistemic call about what is known) is reasonable in the foreseeable future. To the extent it's possible to defer this call, it should therefore be deferred (this is a normative claim, not a plan or a prediction of feasibility). If it's not possible to keep the potential for human control despite this uncertainty, then it's not possible, but that won't be because the uncertainty got resolved to the extent that it could be humanly resolved.
It was to stop treating any solution that didn't involve human control as axiomatically unacceptable, without regard to other outcomes.
The issue is that it's unclear if it's acceptable, so should be avoided if at all possible, pending more consideration. In principle there is more time for that than what's relevant for any other concerns that don't involve the risk of losing control in a less voluntary way. The revealed preference looks the same as finding it unacceptable to give up the potential for human control, but the argument is different, so long term implied behavior following from that argument is different. It might only take a million years to decide to give up control.
Learning from human data might have large attractors that motivate AIs to build towards better alignment, in which case prosaic alignment might find them. If those attractors are small, and there are more malign attractors in the prior that remain after learning human data, short-term manual effort of prosaic alignment fails. So malign priors have the same mechanism of action as effectiveness of prosaic alignment, it's the question of how learning on human data ends up being expressed in the models, what happens after the AIs built from them are given more time to reflect.
Managing to scale RL too early can make this irrelevant, enabling sufficiently competent paperclip maximization without dominant influence from either malign priors or from beneficial attractors in human data. Unclear if o1/o3 are pointing in this direction yet, so far they might just be getting better at eliciting human System 2 capabilities from base models, rather than being creative at finding novel ways of effective problem solving.
But humans have never had much control.
Not yet. There's been barely thousands of years of civilization, and there are 1e34-1e100 years more to figure it out.
There is a Feb 2024 paper that predicts high compute multipliers from using more, finer-grained experts in MoE models, optimally about 64 experts activated per token at 1e24-1e25 FLOPs, whereas MoE models with known architecture usually have 2 experts activated per token. DeepSeek-V3 has 8 routed experts activated per token, a step in that direction.
On the other hand, things like this should've already been tested at the leading labs, so the chances that it's a new idea being brought to attention there seem slim. Runners-up like xAI and Meta might find this more useful, if that's indeed the reason, rather than extremely well-done post-training or even pretraining dataset construction.
Its pretraining recipe is now public, so it could get reproduced with much more compute soon. It might also suggest that scaling of pretraining has already plateaued, that leading labs have architectures that are at least as good as DeepSeek-V3, pump 20-60 times more compute into them, and get something only marginally better.
There is water, H2O, drinking water, liquid, flood. Meanings can abstract away some details of a concrete thing from the real world, or add connotations that specialize it into a particular role. This is very useful in clear communication. The problem is sloppy or sneaky equivocation between different meanings, not the content of meanings getting to involve emotions, connotations, things not found in the real world, or combining them with concrete real world things into compound meanings.
best-of-n sampling which solved ARC-AGI
The low resource configuration of o3 that only aggregates 6 traces already improved on results of previous contenders a lot, the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since ground truth is in verifiers anyway.
There are many things that can't be done at all right now. Some of them can become possible through scaling, and it's unclear if it's scaling of pretraining or scaling of test-time compute that gets them first, at any price, because scaling is not just amount of resources, but also the tech being ready to apply them. In this sense there is some equivalence.
When they tested the original GPT-4, under far less dangerous circumstances, for months.
My impression is that it's the product-relevant post-training effort for GPT-4 that took months, the fact that there was also safety testing in the meantime is incidental rather than the cause of it taking months. This claim gets repeated, but I'm not aware of a reason to attribute the gap between Aug 2022 end of pretraining (if I recall the rumors or possibly claims by developers correctly) and Mar 2023 release to safety testing rather than to getting post-training right (in ways that are not specifically about safety).
Test time compute is applied to solving a particular problem, so it's very worthwhile to scale, getting better and better at solving an extremely hard problem by spending compute on this problem specifically. For some problems, no amount of pretraining with only modest test-time compute would be able to match an effort that starts with the problem and proceeds from there with a serious compute budget.
How many symbols are there for it to eat? Are there enough to give the same depth of understanding that a human gets from processing spatial info for instance?
Yes. It's not the case that humans blind from birth are dramatically less intelligent, learning from sound and touch is sufficient. LLMs are much less data efficient with respect to external data, because they only learn external data. For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue. For LLMs, there aren't yet general ways of generating synthetic data that can outright compensate for scarcity of external data and improve their general intelligence the way natural text data does, instead of propping up particular narrow capabilities (and hoping for generalization).
The way performance of o1 falls off much faster than for o3 depending on size of ARC-AGI problems is significant evidence in favor of o3 being built on a different base model than o1, with better long context training or different handling of attention in model architecture. So probably post-trained Orion/GPT-4.5o.
Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.
It's a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it's very overtrained, that is, not even compute optimal.
It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters kept the same (Figure 5b), the rest is extrapolation from fitted scaling laws.
[1] A new architecture has a compute multiplier M (at a given level of compute) if it would take M times more compute to train a compute optimal model with a reference architecture (in this case, a dense transformer) to match the perplexity it achieves when trained on data sampled from the same dataset.
Mixture of Experts AI's "experts."
Experts in MoE transformers are just smaller MLPs[1] within each of the dozens of layers, and when processing a given prompt can be thought of as instantiated on top of each of the thousands of tokens. Each of them only does a single step of computation, not big enough to implement much of anything meaningful. There are only vague associations between individual experts and any coherent concepts at all.
For example, in DeepSeek-V3, which is an MoE transformer, there are 257 experts in each of the layers 4-61[2] (so about 15K experts), and each expert consists of two 2048x7168 matrices, about 30M parameters per expert, out of the total of 671B parameters.
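The expert counts follow directly from the figures in this paragraph. A minimal sketch, taking the layer range, experts per layer, and matrix shapes exactly as stated above (it does not check them against the model's published config):

```python
# Figures as stated in the comment; not verified against the model config here.
FIRST_MOE_LAYER, LAST_MOE_LAYER = 4, 61
EXPERTS_PER_LAYER = 257
MATRICES_PER_EXPERT = 2
MATRIX_SHAPE = (2048, 7168)

moe_layers = LAST_MOE_LAYER - FIRST_MOE_LAYER + 1          # inclusive range
total_experts = moe_layers * EXPERTS_PER_LAYER              # "about 15K"
params_per_expert = MATRICES_PER_EXPERT * MATRIX_SHAPE[0] * MATRIX_SHAPE[1]  # "about 30M"

print(moe_layers, total_experts, params_per_expert)
```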
reasoning (o3) is largely solved
Solving competition problems could well be like a chess-playing AI playing chess well. Does it generalize far enough, can the method be applied to train the AI on a wide variety of tasks that are not like competition problems (distinct in it being possible to write verifiers for attempted solutions)? We know that this is not the case with AlphaZero. Is it the case with o3-like methods? Hard to tell. I don't see how it could be known either way yet.
The range of capabilities between what can be gained at a reasonable test-time cost and at an absurd cost (but in reasonable time) can remain small, with most improvements to the system exceeding this range, likely to move what could only be obtained at an absurd cost before into the reasonable range. This is true right now (for general intelligence), and it could well remain true until the intelligence explosion.
DeepSeek-V3 might be the only example (and it's from the future, released after I asked the question). Not sure if it generalizes to expecting more FP8 training, as it's a MoE model with 257 experts and uses relatively small 7Kx2K matrices in its experts, while GPT-3-175B tested in FP8 in the Sep 2022 paper has much larger matrices, and that result wasn't sufficient to promote widespread adoption (at least where it's possible to observe).
On the other hand, if DeepSeek-V3 really is as good for its compute (4e24-6e24 FLOPs) as the benchmarks indicate, it might motivate more training with a huge number of smaller experts (it activates 8 experts per token, so the number of experts is even higher than one would expect from its ratio of total to active parameters). There was a Feb 2024 paper claiming 20x or higher compute multipliers for MoE models compared to dense (Figure 1b), appearing only if they activate a lot of experts per token, predicting 64 to be optimal at 1e24-1e25 FLOPs (the usual practice is to activate 2 experts). So DeepSeek-V3 weakly supports this surprising claim, though actual experimental results with more compute than that paper's 3e19-4e20 FLOPs per datapoint would be better. The paper also predicts reduction in tokens per parameter with more compute (Table 2), reaching 8 tokens per active parameter at 5e25 FLOPs (in a MoE model with 4096 experts, 64 of which get activated per token). If this too is somehow correct, natural text data can be sufficient for 10 times more compute than with dense models.
DeepSeek-V3 is a MoE model with 37B active parameters trained for 15T tokens, so at 400 tokens per parameter it's very overtrained and could've been smarter with similar compute if hyperparameters were compute optimal. It's probably the largest model known to be trained in FP8, it extracts 1.4x more compute per H800 than most models trained in BF16 get from an H100, for about 6e24 FLOPs total[1], about as much as Llama-3-70B. And it activates 8 routed experts per token (out of 256 total routed experts), which a Feb 2024 paper[2] suggests to be a directionally correct thing to do (compared to a popular practice of only activating 2 experts), with about 64 experts per token being optimal around 1e24-1e25 FLOPs. Taken together, these advantages predict that it should be smarter than Llama-3-70B, if done well.
Models that are smarter than Llama-3-70B can show impressive benchmark performance that then doesn't cash out in the hard-to-operationalize impression of being as smart as Claude 3.5 Sonnet. The jury is still out, but it's currently available even in Direct Chat on Chatbot Arena, so there will be more data on this soon. It would be shocking if a 37B active parameter model actually manages that though.
[1] H800 seems to produce 1.4e15 dense FP8 FLOP/s, the model was trained for 2.8e6 H800-hours, and I'm assuming 40% compute utilization.
[2] That same paper estimates the compute multiplier of a compute optimal MoE at about 20x compared to a dense model, see Figure 1b, which is hard to believe. It's based on experiments of up to about 3e19-4e20 FLOPs per datapoint. Still, the claim of many more activated experts than 2 being better might survive in practice.
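The 6e24 FLOPs estimate and the 400 tokens per parameter figure can be reproduced from the numbers given in the comment and its H800 footnote. This is a sketch under those stated assumptions (peak throughput, GPU-hours, and a guessed 40% utilization), not a measured value.

```python
SECONDS_PER_HOUR = 3600


def training_flops(gpu_hours, peak_flops_per_s, utilization):
    """Total training compute: GPU time times peak throughput times utilization."""
    return gpu_hours * SECONDS_PER_HOUR * peak_flops_per_s * utilization


# Inputs from the footnote: H800 dense FP8 peak, H800-hours, assumed utilization.
deepseek_v3_flops = training_flops(
    gpu_hours=2.8e6, peak_flops_per_s=1.4e15, utilization=0.40
)

# Overtraining ratio: 15T training tokens over 37B active parameters.
tokens_per_active_param = 15e12 / 37e9

print(f"{deepseek_v3_flops:.2e} FLOPs, {tokens_per_active_param:.0f} tokens/param")
```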
Aggregating from independent reasoning traces is a well-known technique that helps somewhat but quickly plateaus, which is the reason o1/o3 are an important innovation, they use additional tokens much more efficiently and reach greater capability, as long as those tokens are within a single reasoning trace. Once a trace is done, more compute can only go to consensus or best-of-k aggregation from multiple traces, which is more wasteful in compute and quickly plateaus.
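The two aggregation schemes mentioned here differ only in how the final answer is picked from finished traces. A toy sketch, where the answers and reward-model scores are made-up illustrative values and `consensus` / `best_of_k` are hypothetical helper names:

```python
from collections import Counter


def consensus(answers):
    """Pick the most common final answer across independent traces."""
    return Counter(answers).most_common(1)[0][0]


def best_of_k(answers, scores):
    """Pick the answer whose trace got the highest reward-model score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Hypothetical outcomes of 6 traces and hypothetical reward-model scores.
answers = ["A", "B", "A", "C", "A", "B"]
scores = [0.2, 0.9, 0.3, 0.1, 0.4, 0.5]

consensus_pick = consensus(answers)
best_of_k_pick = best_of_k(answers, scores)
print(consensus_pick, best_of_k_pick)
```

The two picks disagree in this example, which is the case where a systematically mistaken reward model would matter.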
The $4000 high resource config of o3 for ARC-AGI was using 1024 traces of about 55K tokens, the same length as with the low resource config that runs 6 traces. Possibly longer reasoning traces don't work yet, otherwise a pour-money-on-the-problem option would've used longer traces. So a million dollar config would just use 250K reasoning traces of length 55K, which is probably slightly better than what 1K traces produce already.
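The 250K figure comes from scaling the trace count with budget. A rough sketch, assuming cost is linear in the number of traces at a fixed trace length (an assumption for illustration, not something OpenAI has stated):

```python
HIGH_CONFIG_COST_USD = 4000
HIGH_CONFIG_TRACES = 1024


def traces_for_budget(budget_usd):
    """Traces affordable per task if cost scales linearly with trace count."""
    cost_per_trace = HIGH_CONFIG_COST_USD / HIGH_CONFIG_TRACES
    return int(budget_usd / cost_per_trace)


million_dollar_traces = traces_for_budget(1_000_000)
print(million_dollar_traces)  # roughly the "250K traces" in the text
```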
I think explicitly computing details in full (as opposed to abstract reasoning about approximate properties) has no bearing on moral weight (degree of being real), but some kind of computational irreducibility forces the simulation of interesting things to get quite close to low level detail in order to figure out most global facts about what's going on there, such as values/culture of people living in a world after significant time passes.
They've probably scaled up 2x-4x compared to the previous scale of about 8e25 FLOPs, it's not that far (from 30K H100 to 100K H100). One point, as I mentioned in the post, is the inability to reduce minibatch size, which might make this scaling step even less impactful than it should be judging from compute alone, though that doesn't apply to Google.
In any case this doesn't matter yet, since the 1 GW training systems are already being built (in case of Nvidia GPUs with larger scale-up worlds of GB200 NVL72), the decision to proceed to the yet-unobserved next level of scaling doesn't depend on what's observed right now. The 1 GW training systems allow training up to about 5e27 FLOPs, about 60x[1] the currently deployed models, a more significant change. We'll see its impact in late 2026.
[1] The number of chips increases 5x from 100K H100 to 500K B200, and the new chips are 2.5x faster. If 1 GW systems are not yet expected to be quickly followed by larger systems, more time will be given to individual frontier model training runs, let's say 1.5x more. And there was that 3x factor from 30K H100 to 100K H100.
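The 60x estimate is the product of the four factors listed in this footnote. A quick check, using the comment's 8e25 FLOPs figure for the previous scale as the baseline (treating that as the 30K H100 baseline is my reading of the text):

```python
import math

# Factors from the footnote.
factors = {
    "chips_100k_h100_to_500k_b200": 5.0,
    "per_chip_speedup": 2.5,
    "longer_training_runs": 1.5,   # the comment's "let's say" guess
    "chips_30k_to_100k_h100": 3.0,
}

total_multiplier = math.prod(factors.values())   # "about 60x"
BASELINE_FLOPS = 8e25                            # assumed baseline at 30K H100 scale
implied_flops = total_multiplier * BASELINE_FLOPS  # close to "about 5e27"

print(total_multiplier, f"{implied_flops:.1e}")
```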
It's as efficient to work on many frames while easily switching between them. Some will be poorly developed, but won't require commitment and can anchor curiosity, progress on blind spots of other frames.
Don't just disagree and walk away!
Feeding this norm creates friction, filters evidence elicited in the agreement-voting. If there is a sense that a vote needs to be explained, it often won't be cast.
Are there any signs to be found in public that anyone is training 10B+ LLMs in a precision that is not 16 bits? There are experiments that are specifically about precision on smaller LLMs, but they don't seem to get adopted in practice for larger models, despite the obvious advantage of getting to 2x the compute.
In general, I don't understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.
Pipeline parallelism doesn't reduce batch size, it just moves the processing of a given sequence around the cluster in stages, but the number of sequences being processed by the cluster at a given time doesn't change (the time needed to process a layer for some sequence doesn't change, so the time between optimizer steps doesn't change, other than through bubbles). Tensor parallelism spreads the processing of a sequence across multiple GPUs, so there are fewer sequences processed at once within the cluster, which can be used to reduce the batch size (the time needed to process a layer for some sequence is divided by degree of tensor parallelism, so the time between optimizer steps reduces, and so does the total compute expended in a batch, proportional to the total number of sequences in it). You can only do tensor parallelism within a scale-up world without murdering compute utilization, which puts a bound on how much you can reduce the batch size.
I believe the l3 paper indicates the training seqlen was increased mid-training.
Section 3.4 says they start with sequences of length 4K, move to sequences of length 8K after 250M tokens, then to 16M tokens per batch after 2.9T tokens, and finally to long context training in the last 800B tokens (out of about 15T tokens in total). So 11T out of 15T tokens were learned in batches of 2K sequences of length 8K.
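The batch composition and token counts here are simple arithmetic on the schedule as summarized above. A small sketch, assuming "16M" and "8K" are binary units (an assumption; decimal units give nearly the same result):

```python
TOKENS_PER_BATCH = 16 * 2**20   # "16M tokens per batch", assuming binary millions
SEQ_LEN = 8 * 2**10             # "8K" sequence length

sequences_per_batch = TOKENS_PER_BATCH // SEQ_LEN   # "2K sequences"

# Token budget in trillions, as summarized in the comment.
TOTAL_T = 15.0
BEFORE_16M_BATCH_T = 2.9        # tokens seen before switching to 16M batches
LONG_CONTEXT_T = 0.8            # final long context stage

tokens_at_2k_by_8k_T = TOTAL_T - BEFORE_16M_BATCH_T - LONG_CONTEXT_T  # "11T"

print(sequences_per_batch, round(tokens_at_2k_by_8k_T, 1))
```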
I think it's plausible the combination of torus topology + poor PCIe5.0 bw/latency will make a full TP=64 Trn2 config underperform your expectations
Good catch, TP=32 on 400K Trn2 gives the same batch size as TP=8 on 100K H100, so there is only an advantage with TP=64, which is not a priori a sure thing to work well. And a hypothetical non-Ultra 400K Trn2 cluster with its 16 GPU scale-up worlds is worse even though there's more compute in 16 Trn2 than in 8 H100. Though it would be surprising if the Rainier cluster doesn't have the Ultra config, as what else is it supposed to be for.
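The comparison can be sketched with a deliberately simplified model: assume each tensor-parallel group works on one sequence at a time, so the number of sequences in flight (a floor on batch size) is the chip count divided by the TP degree. This ignores pipeline stages, microbatching, and everything else about a real training setup.

```python
def min_sequences_in_flight(num_chips, tp_degree):
    """Sequences processed at once if each TP group handles one sequence."""
    return num_chips // tp_degree


h100_tp8 = min_sequences_in_flight(100_000, 8)
trn2_tp16 = min_sequences_in_flight(400_000, 16)   # hypothetical non-Ultra config
trn2_tp32 = min_sequences_in_flight(400_000, 32)
trn2_tp64 = min_sequences_in_flight(400_000, 64)   # full Ultra scale-up world

print(h100_tp8, trn2_tp16, trn2_tp32, trn2_tp64)
```

Under this model only TP=64 gets the Trn2 cluster below the H100 cluster's batch size, which matches the point made above.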
given that this is RL, there isn't any clear reason this won't work (with some additional annoyances) for scaling through very superhuman performance
Not where they don't have a way of generating verifiable problems. Improvement where they merely have some human-written problems is likely bounded by their amount.
An AGI broadly useful for humans needs to be good at general tasks for which currently there is no way of finding legible problem statements (where System 2 reasoning is useful) with verifiable solutions. Currently LLMs are slightly capable at such tasks, and there are two main ways in which they become more capable, scaling and RL.
Scaling is going to continue rapidly showing new results at least until 2026-2027, probably also 2028-2029. If there's no AGI or something like a $10 trillion AI company by then, there won't be a trillion dollar training system and the scaling experiments will fall back to the rate of semiconductor improvement.
Then there's RL, which as o3 demonstrates applies to LLMs as a way of making them stronger and not merely eliciting capabilities formed in pretraining. But it only works directly around problem statements with verifiable solutions, and it's unclear how to generate them for more general tasks or how far the capabilities will generalize from the training problems that are possible to construct in bulk. (Arguably self-supervised learning is good at instilling general capabilities because the task of token prediction is very general, it subsumes all sorts of things. But it's not legible.) Here too scale might help with generalization stretching further from the training problems, and with building verifiable problem statements for more general tasks, and we won't know how much it will help until the experiments are done.
So my timelines are concentrated on 2025-2029, after that the rate of change in capabilities goes down. Probably 10 more years of semiconductor and algorithmic progress after that are sufficient to wrap it up though, so 2040 without AGI seems unlikely.
My thesis is that the o3 announcement is timelines-relevant in a strange way. The causation goes from o3 to impressiveness or utility of its successors trained on 1 GW training systems, then to decisions to build 5 GW training systems, and it's those 5 GW training systems that have a proximate effect on timelines (in comparison to the world only having 1 GW training systems for a few years). The argument goes through even if o3 and its successors don't particularly move timelines directly through their capabilities, they can remain a successful normal technology.
The funding constraint stopping $150bn training systems previously seemed more plausible, but with o3 it might be lifted. This is timelines-relevant precisely because there aren't any other constraints that come into play before that point.
About 4T parameters, which is 8 TB in BF16. With about 100x more compute (compared to Llama 3 405B), we get a 10x larger model by Chinchilla scaling, the correction from a higher tokens/parameter ratio is relatively small (and in this case cancels out the factor of 1.5 from compute actually being 150x).
Not completely sure if BF16 remains sufficient at 6e27-5e28 FLOPs, as these models will have more layers and larger sums in matrix multiplications. If BF16 doesn't work, the same clusters will offer less compute (at a higher precision). Seems unlikely though, as 3 OOMs of compute only increase model size 30x, which means 3x more layers and 3x larger matrices (in linear size), which is not that much. There are block number formats like microscaling that might help if this is somehow a problem, but usability of this remains unclear, as everyone is still training in BF16 in practice.
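The size estimates in this comment follow from the rule of thumb that compute optimal parameter count grows as the square root of compute. A sketch under that assumption, with parameter count taken as proportional to layers times width squared:

```python
import math


def param_multiplier(compute_multiplier):
    """Chinchilla-style rule of thumb: parameters scale as sqrt(compute)."""
    return math.sqrt(compute_multiplier)


# 100x compute over Llama 3 405B gives a 10x larger model.
params = 405e9 * param_multiplier(100)   # about 4T parameters
bf16_bytes = params * 2                  # 2 bytes per parameter, about 8 TB

# 3 OOMs of compute give about 30x the parameters...
size_growth = param_multiplier(1000)
# ...which is roughly 3x more layers and 3x wider matrices (params ~ layers * width^2).
layers_growth, width_growth = 3, 3
approx_growth = layers_growth * width_growth**2

print(f"{params:.2e} params, {bf16_bytes / 1e12:.1f} TB, "
      f"{size_growth:.1f}x vs {approx_growth}x")
```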
In the other direction, there is a Nov 2024 paper that suggests 7-8 bit precision might be compute optimal at any scale, that the proper way to adapt to scale is by increasing the number of parameters rather than increasing precision (Section 4.3.2). If this can be made practical at a given scale, there'll be 2x more compute, and even more in effective compute, which is essentially the paper's claim. (I don't know how this interacts with scarce data, possibly either higher or lower precision can improve the situation.)
What is o3 doing that you couldn't do by running o1 on more computers for longer?
Unclear, but with $20 per test settings on ARC-AGI it only uses 6 reasoning traces and still gets much better results than o1, so it's not just about throwing $4000 at the problem. Possibly it's based on GPT-4.5 or trained on more tests.
Should explicitly depend on values instead of gesturing at conflationary social approval. It could be undignified for a credentialist student to pass on an opportunity to safely cheat. It's undignified to knowingly do the clearly wrong thing, for some notion of "wrong" you endorse (or would endorse on reflection, if it was working properly).
OpenAI shared they trained the o3 we tested on 75% of the Public Training set
Probably a dataset for RL, that is the model was trained to try and try again to solve these tests with long chains of reasoning, not just tuned or pretrained on them, as a detail like 75% of examples sounds like a test-centric dataset design decision, with the other 25% going to the validation part of the dataset.
Altman: "didn't go do specific work ... just the general effort"
Seems plausible they trained on ALL the tests, specifically targeting various tests. The public part of ARC-AGI is "just" a part of that dataset of all the tests. Could be some part of explaining the o1/o3 difference in $20 tier.
who lose sight of the point of it all
Pursuing some specific "point of it all" can be much more misguided.
In the same terms as the $100-200bn I'm talking about, o3 is probably about $1.5-5bn, meaning 30K-100K H100, the system needed to train GPT-4o or GPT-4.5o (or whatever they'll call it) that it might be based on. But that's the cost of a training system, its time needed for training is cheaper (since the rest of its time can be used for other things). In the other direction, it's more expensive than just that time because of research experiments. If OpenAI spent $3bn in 2024 on training, this is probably mostly research experiments.
$100-200bn 5 GW training systems are now a go. So in the worlds that slow down for years if there are only $30bn systems available and would need an additional scaling push, timelines moved up a few years. Not sure how unlikely $100-200bn systems would've been without o1/o3, but they seem likely now.