An LLM trained with a sufficient amount of RL could perhaps learn to compress its thoughts into more efficient representations than English text, which seems consistent with the statement. I'm not sure whether this is possible in practice; I've asked here if anyone knows of public examples.
Makes sense. Perhaps we'll know more when o3 is released. If the model doesn't offer a summary of its CoT, that makes neuralese more likely.
I've often heard it said that doing RL on chain of thought will lead to 'neuralese' (e.g. most recently in Ryan Greenblatt's excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
(Based on public knowledge, it seems plausible (perhaps 25% likely) that o3 uses neuralese, which could put it in this category.)
What public knowledge has led you to this estimate?
I was able to replicate this result. Given other impressive results from o1, I wonder if the model is intentionally sandbagging. If it’s trained to maximize human feedback, this might be an optimal strategy when playing zero-sum games.
> I was grateful for the experiences and the details of how he prepares for conversations and framing AI that he imparted on me.
I'm curious: what was his strategy for preparing for these discussions? What did he discuss?
> This updated how I perceive the “show down” focused crowd
possible typo?
Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.
I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g. if there’s an eval measuring knowledge of virology, then I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increase in performance. This might be true even if the 1B tokens were already in the pretraining dataset, because in some sense it’s the most recent data that the model has seen.
I am also increasingly wondering if talking too much to LLMs is an infohazard, akin to taking up psychedelics, TikTok, or meditation as a habit.
Why is meditation an infohazard?
For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue.
Could you say more about this? What do you think is the ratio of external to internal data?
That’s a good point, it could be consensus.
Thoughts on o3 and search:
[Epistemic status: not new, but I thought I should share]
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions, he can search much further for a win, such that I would have virtually no chance against him; the difference between me and an amateur chess player is similarly vast. It’s not just about winning either: in shogi, top professionals who know they have already won will keep searching over future variations to find the most aesthetically appealing mate.
This is one of the reasons I’m concerned about AI. It isn’t bound by the same constraints of time, energy, and memory as humans, so it can search through possible futures very deeply to find the narrow path in which it achieves its goal. o3 looks to be on this path. It has both very long chains of thought (depth of search) and the ability to parallelize across multiple instances (the best-of-n sampling which solved ARC-AGI). To be clear, I don’t think this search is very efficient, and there are many obvious ways it could be improved, e.g. recurrent architectures which don’t waste as much compute computing logprobs for several tokens only to sample one, or multi-token objectives for the base model as shown in DeepSeek-V3. But the basis for search is there. Until now, it seemed like AI was improving its intuition; now it can finally begin to think.
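As a rough illustration of the parallel half of this, here's a minimal best-of-n sketch (my own toy example, not OpenAI's actual pipeline, which isn't public; `generate` and `score` stand in for an LLM sampler and a verifier/reward model):

```python
import random

def best_of_n(prompt, generate, score, n=16):
    """Sample n candidates and return the highest-scoring one.

    `generate` and `score` are hypothetical placeholders for an LLM sampler
    and a verifier/reward model; o3's real setup is not public.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: "search" over random guesses for a target number.
target = 37
best = best_of_n(
    prompt=None,
    generate=lambda _: random.randint(0, 100),
    score=lambda x: -abs(x - target),  # closer to the target scores higher
    n=64,
)
print(best)  # very likely close to 37; more samples ~ deeper effective search
```

The point is that the same wrapper gets stronger purely by increasing n, which is the sense in which inference-time compute substitutes for better intuition.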
Concretely, I expect that by 2030, AI systems will use as much compute in inference time for hard problems as is currently used for pretraining the largest models. Possibly more if humanity is not around by then.
This makes sense, I think you could be right. Llama 4 should give us more evidence on numerical precision and scaling of experts.
DeepSeek-V3 is one example, and SemiAnalysis has claimed that most labs use FP8.
FP8 training is important as it speeds up training compared to BF16, and most frontier labs use FP8 training.
In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft's 3 buildings in Goodyear, Arizona, xAI's Memphis cluster, Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.
Anthropic might need more time than the other players to get its new hardware running, but there is also an advantage to Trn2 and TPUv6e over H100: larger scale-up domains that enable more tensor parallelism and smaller minibatch sizes. This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
Do we know much about TPU and Trn2 performance at lower precision? I expect most training runs are using 4-8 bit precision by this point.
Note that "The AI Safety Community" is not part of this list. I think external people without much capital just won't have that much leverage over what happens.
What would you advise for external people with some amount of capital, say $5M? How would this change for each of the years 2025-2027?
What do you think is the current cost of o3, for comparison?
How have your AGI timelines changed after this announcement?
Being overweight and weak hurts in ways that fade into the background and you might not even know there is a comfier way of being; being pathologically weak hurts in ways that you notice now and it ruins your quality of life.[2]
Could you say more about this?
I believe the OpenAI API does offer logprobs in the chat completions endpoint (https://platform.openai.com/docs/api-reference/chat/create). Not sure about Anthropic.
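For reference, a minimal sketch of requesting logprobs through the chat completions endpoint (the `logprobs`/`top_logprobs` parameters are documented at the link above; the model name is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
    logprobs=True,     # return the logprob of each sampled token
    top_logprobs=5,    # also return the 5 most likely alternatives per position
)

for token_info in response.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
```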
This is a great post, and I like the research process. Do you know if the LLM code completion in Cursor is compatible with .ipynb notebooks?
Nice work! It's surprising that something so simple works so well. Have you tried applying this to more recent models like o1 or QwQ?
What do you think would be the cause(s) of the slowdown?
I donated $10, but I hope to be able to donate much more next year. Thank you!
I wonder how you found out about the Curve conference? I’ve been to a few fun Lighthaven events (e.g. Manifest), and I’m looking for a way to be alerted to new ones.
What signal do we get from DeepSeek continuing to publish?
I wonder by what factor you think automated AI R&D speeds up research. How large is the effective compute overhang made possible by the discovery of better architectures and training methods?
I'm confused about why we would expect the misaligned behavior to be displayed in the Shoggoth rather than the Face. For example, let's take the setting from the AI control paper, where a misaligned model is trying to insert a backdoor into its solution to a coding problem while still passing a series of unit tests. This could be incentivized in the training loop if e.g. the reward supervision is weak and the backdoor grants the model access to additional resources/compute. I can see a few strategies succeeding:
1. The Shoggoth speculates on different ways to insert a backdoor, and the Face writes a solution using one of these backdoors.
2. The Shoggoth outputs the correct solution to the coding question, and the Face modifies it to insert a backdoor.
3. Some combination of 1) and 2): for instance, the Shoggoth writes several versions of the function, one of which contains a backdoor, and the Face returns the version with the backdoor.
All of these result in the Face outputting potentially the same backdoored function, and all should be equally rewarded by the reward model. Why should we expect 2) to be less likely than 1) or 3)?
I wonder if you could say more about how the training pipeline would work. E.g. if the reward model is applied to the outputs of the Face, how do we train the Shoggoth to produce useful CoTs for the Face? Is the idea to fix the Face, sample many CoTs from the Shoggoth, and fine-tune the Shoggoth on the ones which achieve high reward according to the Face's output?
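To make the question concrete, here's the kind of loop I'm imagining (my guess at one possible setup, not the post's actual proposal; `shoggoth`, `face`, and `reward_model` are hypothetical stand-ins):

```python
def train_shoggoth_step(prompts, shoggoth, face, reward_model,
                        samples_per_prompt=8, top_fraction=0.25):
    """One iteration of the scheme sketched above: hold the Face fixed,
    sample CoTs from the Shoggoth, score the Face's final answers, and
    fine-tune the Shoggoth on the CoTs that led to high-reward answers.
    All objects here are hypothetical stand-ins, not a real API."""
    winners = []
    for prompt in prompts:
        scored = []
        for _ in range(samples_per_prompt):
            cot = shoggoth.sample(prompt)        # hidden reasoning
            answer = face.sample(prompt, cot)    # user-facing output
            scored.append((reward_model.score(prompt, answer), prompt, cot))
        scored.sort(key=lambda x: x[0], reverse=True)
        keep = max(1, int(samples_per_prompt * top_fraction))
        winners.extend((p, c) for _, p, c in scored[:keep])
    shoggoth.finetune(winners)  # e.g. supervised fine-tuning on (prompt, CoT) pairs
```

If that's roughly the intended pipeline, the selection pressure on the Shoggoth runs entirely through the Face's final answer, which is what my question above is getting at.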
So $1 buys 7e17 useful FLOPs, or inference with 75-120B[1] active parameters for 1 million tokens.
Is this right? My impression was that the 6ND (or 9.6 ND) estimate was for training, not inference. E.g. in the original scaling law paper, it states
C ~ 6 NBS – an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (ie parameter updates).
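For context, a quick back-of-the-envelope comparison (my own arithmetic, assuming the standard ~6N training FLOPs and ~2N forward-pass FLOPs per token, and taking the 7e17 FLOPs per dollar figure at face value):

$$
N_{\text{train-style}} \approx \frac{7\times 10^{17}}{(6\text{ to }9.6)\times 10^{6}} \approx 73\text{ to }117\text{B}, \qquad
N_{\text{inference}} \approx \frac{7\times 10^{17}}{2\times 10^{6}} \approx 350\text{B}.
$$

The quoted 75-120B range matches the 6-9.6 ND (training-style) accounting; with the ~2N forward-pass estimate, the same budget would support roughly 350B active parameters per million tokens.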
An important caveat to the data movement limit:
“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.
The post argues that there is a latency limit at 2e31 FLOP, and I've found it useful to put this scale into perspective.
Current public models such as Llama 3 405B are estimated to have been trained with ~4e25 FLOPs, so such a model would require 500,000× more compute. Since Llama 3 405B was trained on 16,000 H100 GPUs, the model would require 8 billion H100-equivalents, at a cost of $320 trillion at H100 prices (or ~$100 trillion if we use B200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled out by another factor: the 2e31 limit assumes a training time of only 3 months. If we were to build such a system over several years and had the patience to wait an additional 3 years for the training run to complete, this pushes the latency limit out by another order of magnitude. So at the point where we are bound by the latency limit, we are either investing a significant percentage of world GDP into the project, or we have already reached ASI at a smaller scale of compute and are using it to dramatically reduce compute costs for successor models.
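Spelling out the arithmetic (my own rough numbers; the ~$40k per H100 is an assumed price):

$$
\frac{2\times 10^{31}}{4\times 10^{25}} = 5\times 10^{5}, \qquad
1.6\times 10^{4}\times 5\times 10^{5} = 8\times 10^{9}\ \text{GPUs}, \qquad
8\times 10^{9}\times \$4\times 10^{4} \approx \$3.2\times 10^{14}.
$$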
Of course none of this analysis applies to the earlier data movement limit of 2e28 FLOP, which I think is more relevant and interesting.
Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the Chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt
This would imply that the data movement bottleneck is a bit further out.
Maybe he thinks that much faster technological progress would cause social problems and thus wouldn’t be implemented by an aligned AI, even if it were possible. Footnote 2 points at this:
“I do anticipate some minority of people’s reaction will be ‘this is pretty tame’… But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence.”
A separate part of the introduction argues that causing this extreme societal turbulence would be unaligned:
“Many things cannot be done without breaking laws, harming humans, or messing up society. An aligned AI would not want to do these things (and if we have an unaligned AI, we’re back to talking about risks). Many human societal structures are inefficient or even actively harmful, but are hard to change while respecting constraints like legal requirements on clinical trials, people’s willingness to change their habits, or the behavior of governments.”
I found this comment valuable, and it caused me to change my mind about how I think about misalignment/scheming examples. Thank you for writing it!
It’s still not trivial to fine-tune Llama 405B. You need about 16 bytes/parameter with Adam, plus activation memory, so a minimum of ~100 H100s.
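Roughly, under the usual mixed-precision Adam accounting (my own arithmetic, assuming 80 GB per H100), 16 bytes/parameter covers weights, gradients, and optimizer states, with activations on top:

$$
405\times 10^{9}\ \text{params}\times 16\ \text{bytes} \approx 6.5\ \text{TB}, \qquad
\frac{6.5\ \text{TB}}{80\ \text{GB}} \approx 81\ \text{H100s}\ (\sim 100\ \text{with activations}).
$$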
A particularly notable section (pg. 19):
“The current implementation of The AI Scientist has minimal direct sandboxing in the code, leading to several unexpected and sometimes undesirable outcomes if not appropriately guarded against. For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention. In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage. In some cases, when The AI Scientist's experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime. While creative, the act of bypassing the experimenter's imposed constraints has potential implications for AI safety (Lehman et al., 2020).”
I’m rated about 2100 USCF and 2300 Lichess, and I’m open to any of the roles. I’m free on weekends and weekdays after 3 pm Pacific. I’m happy to play any time control, including multi-month correspondence.