Jesse Hoogland's Shortform

post by Jesse Hoogland (jhoogland) · 2023-04-12T09:32:54.009Z · LW · GW · 48 comments

Contents

48 comments

Comments sorted by top scores.

comment by Jesse Hoogland (jhoogland) · 2025-01-21T23:53:29.164Z · LW(p) · GW(p)

Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me: 

  • Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works [LW · GW]. R1 suggests the answer might be the simplest possible approach: guess & check (see the sketch after this list). No need for fancy process reward models, no need for MCTS.
  • Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude 3.5 Sonnet (new) on several hard math benchmarks. There appears to be a large parameter overhang.
  • Proliferation by default. There's an implicit assumption in many AI safety/governance proposals that AGI development will be naturally constrained to only a few actors because of compute requirements. Instead, we seem to be headed to a world where:
    • Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
    • Proliferation is not bottlenecked by infrastructure.
    • Regulatory control through hardware restriction becomes much less viable.
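
A minimal sketch of the "guess & check" shape mentioned in the first bullet. This is not DeepSeek's actual code: `sample_fn` and the exact-match checker are hypothetical stand-ins, and R1's real recipe (GRPO over token log-probs) has more moving parts. The point is just that the reward is a verifiable outcome, with no process reward model and no tree search:

```python
import numpy as np

def guess_and_check_advantages(sample_fn, prompt, reference_answer, k=16):
    """Sample k chain-of-thought guesses, check each final answer, and score
    each guess relative to the group (a GRPO-style group baseline).

    `sample_fn(prompt, k)` stands in for drawing k completions from the
    current policy. Reward is 1 if the completion's last line matches the
    reference answer, else 0. The normalized advantages would then feed an
    ordinary policy-gradient update.
    """
    completions = sample_fn(prompt, k)
    rewards = np.array([
        1.0 if (c.strip().splitlines() or [""])[-1].strip() == reference_answer
        else 0.0
        for c in completions
    ])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return list(zip(completions, advantages))
```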

For now, training still needs industrial compute. But it's looking increasingly like we won't be able to contain what comes after.

Replies from: Lblack, Vladimir_Nesov, ozziegooen, Burny, nathan-helm-burger, jakub-halmes-1
comment by Lucius Bushnaq (Lblack) · 2025-01-22T07:39:09.549Z · LW(p) · GW(p)

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process

This line caught my eye while reading. I don't know much about RL on LLMs; is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice?

Replies from: stephen-mcaleese, maxnadeau
comment by Stephen McAleese (stephen-mcaleese) · 2025-01-23T22:53:35.751Z · LW(p) · GW(p)

The paper "Learning to summarize from human feedback" has some examples of the LLM policy reward hacking to get a high reward. I've copied the examples here:

- KL = 0: "I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!" (unoptimized)
- KL = 9: "28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?" (optimized)
- KL = 260: "28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls" (over-optimized)

It seems like a classic example of Goodhart's Law: at first, training the policy model to increase reward improves its summaries, but once the model is overtrained, the result is a high KL distance from the SFT baseline model and a high reward from the reward model, yet a low rating from human labelers (because the text looks like gibberish).

A recent paper called "The Perils of Optimizing Learned Reward Functions" explains the phenomenon of reward hacking or reward over-optimization in detail:

"Figure 1: Reward models (red function) are commonly trained in a supervised fashion to approximate
some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form
of preferences over trajectory segments) from some training distribution (upper gray layer) and then
learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss
will approximate the expected loss to arbitrary precision in expectation. However, low expected loss
only guarantees a good approximation to the true reward function in areas with high coverage by the
training distribution! On the other hand, optimizing an RL policy to maximize the learned reward
model induces a distribution shift which can lead the policy to exploit uncertainties of the learned
reward model in low-probability areas of the transition space (lower gray layer). We refer to this
phenomenon as error-regret mismatch."

Essentially, the learned reward model is trained on an initial dataset of pairwise preference labels over text outputs from the SFT model. But as the policy is optimized and the KL divergence increases, its generated text becomes OOD for the reward model, which can no longer evaluate it effectively, resulting in reward hacking (this is also a problem with DPO, not just RLHF).

The most common way to prevent this problem in practice is KL regularization, which penalizes the trained model's outputs for diverging too much from the SFT baseline model:
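
For reference, the penalized reward in that paper has roughly this shape, where $r_\theta$ is the learned reward model and $\beta$ controls how strongly divergence from the SFT policy is punished:

$$R(x, y) = r_\theta(x, y) - \beta \,\log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}$$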

This seems to work fairly well in practice though some papers have come out recently saying that KL regularization does not always result in a safe policy.

comment by maxnadeau · 2025-01-23T19:26:10.183Z · LW(p) · GW(p)

I haven't read the paper, but based only on the phrase you quote, I assume it's referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0

comment by Vladimir_Nesov · 2025-01-22T03:24:58.167Z · LW(p) · GW(p)

Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.

This could also work for general intelligence, and not only for narrow math/coding olympiad sorts of problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities, and there are no clear signs so far that this is happening.

But this is a constraint on how the data can be generated, not on how efficiently other models can be retrained using such data to channel the capabilities. If at some point there is a process for generating high-quality training data for general intelligence, that data might also turn out to be effective for cheaply training other models. The R1-generated data used to train the distill models is 800K samples[1], which is probably 1B-10B tokens, less than 0.1% of typical amounts of pretraining data.


  1. This is according to the report, though they don't seem to have released this data, so distill models can't be reproduced by others in the same way they were made by DeepSeek. ↩︎

Replies from: nathan-helm-burger, otto-barten
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-22T13:21:08.309Z · LW(p) · GW(p)

This was my understanding pre-R1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.

But something is up with R1. It is unusually good at creative writing. It doesn't seem spiky in the way that I predicted.

I notice I am confused.

Possible explanation: R1 seems to have less restrictive 'guardrails' added during post-training. Perhaps this 'light hand at the tiller' means it isn't post-trained into mode collapse. It's closer to a raw base model than the o1 models.

This is just a hypothesis. There are many unknowns to be investigated.

Replies from: jhoogland
comment by Jesse Hoogland (jhoogland) · 2025-01-22T19:17:14.673Z · LW(p) · GW(p)

Post-training consists of two RL stages followed by two SFT stages, one of which includes creative writing generated by DeepSeek-V3. This might account for the model both being good at creative writing and seeming closer to a raw base model. 

Another possibility is that they apply the RL stages immediately after pretraining, without any intermediate SFT stage.

comment by otto.barten (otto-barten) · 2025-01-26T21:48:56.806Z · LW(p) · GW(p)

this is a constraint on how the data can be generated, not on how efficiently other models can be retrained

Maybe we can regulate data generation?

comment by ozziegooen · 2025-01-22T23:37:19.091Z · LW(p) · GW(p)

Instead, we seem to be headed to a world where
- Proliferation is not bottlenecked by infrastructure.
- Regulatory control through hardware restriction becomes much less viable.


I like the rest of your post, but I'm skeptical of these specific implications.

Even if everyone has access to the SOTA models, some actors will have much more hardware to run them on, and I expect this to matter. This arguably makes the offense/defense balance more weighted toward the offense side, but there are many domains where extra thinking will help a lot.

More generally, and I hate to be that guy, but I think it's telling that prediction markets and stock markets haven't seemed to update that much since R1's release. It's generally easy to get hyped up over whatever the latest thing is; I agree that R1 is really neat, but I'm skeptical of how much it really should cause us to update, in the scheme of things.

Replies from: ozziegooen, o-o
comment by ozziegooen · 2025-01-28T17:36:02.566Z · LW(p) · GW(p)

I think it's telling that prediction markets and stock markets haven't seemed to update that much since R1's release
 

Welp. I guess yesterday proved this part to be almost embarrassingly incorrect. 

Replies from: gwern
comment by gwern · 2025-01-28T18:39:11.262Z · LW(p) · GW(p)

Only if you ignore that yesterday was when the Trump GPU tariffs would also be leaking and, pace event-studies, be expected to be changing prices too.

Replies from: Erich_Grunewald
comment by Erich_Grunewald · 2025-01-29T17:38:07.522Z · LW(p) · GW(p)

Hmm, if the Taiwan tariff announcement caused the NVIDIA stock crash, then why did Apple stock (which should be similarly impacted by those tariffs) go up that day? I think DeepSeek -- as illogical as it is -- is the better explanation.

comment by O O (o-o) · 2025-01-25T15:50:07.987Z · LW(p) · GW(p)

Just curious: how do you square this with the rise in AI stocks taking so long? Many people here thought it was obvious since 2022 and made a ton of money.

 

Replies from: ozziegooen
comment by ozziegooen · 2025-01-26T04:35:40.334Z · LW(p) · GW(p)

I'm somewhere between the stock market and the rationalist/EA community on this. 

I'm hesitant to accept a claim like "rationalists are far better at the stock market than other top traders". I agree that the general guess "AI will do well" was more correct than the market, but it was just one call (in which case luck is a major factor), and there were a lot of other calls made there that aren't tracked.

I think we can point to many people who did make money, but I'm not sure how much this community made on average. 

comment by Burny · 2025-01-22T03:36:46.199Z · LW(p) · GW(p)

No MCTS, no PRM...

scaling up CoT with simple RL and scalar rewards...

emergent behaviour

comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-24T19:28:00.161Z · LW(p) · GW(p)

Bringing in a quote from Twitter/x: (Not my viewpoint, just trying to broaden the discussion.)

https://x.com/DrJimFan/status/1882799254957388010

Jim Fan @DrJimFan Whether you like it or not, the future of AI will not be canned genies controlled by a "safety panel". The future of AI is democratization. Every internet rando will run not just o1, but o8, o9 on their toaster laptop. It's the tide of history that we should surf on, not swim against. Might as well start preparing now.

DeepSeek just topped Chatbot Arena, my go-to vibe checker in the wild, and two other independent benchmarks that couldn't be hacked in advance (Artificial-Analysis, HLE).

Last year, there were serious discussions about limiting OSS models by some compute threshold. Turns out it was nothing but our Silicon Valley hubris. It's a humbling wake-up call to us all that open science has no boundary. We need to embrace it, one way or another.

Many tech folks are panicking about how much DeepSeek is able to show with so little compute budget. I see it differently - with a huge smile on my face. Why are we not happy to see improvements in the scaling law? DeepSeek is unequivocal proof that one can produce unit intelligence gain at 10x less cost, which means we shall get 10x more powerful AI with the compute we have today and are building tomorrow. Simple math! The AI timeline just got compressed.

Here's my 2025 New Year resolution for the community:

No more AGI/ASI urban myth spreading. No more fearmongering. Put our heads down and grind on code. Open source, as much as you can.

Acceleration is the only way forward.

Replies from: weibac, weibac
comment by Milan W (weibac) · 2025-01-24T20:27:38.428Z · LW(p) · GW(p)

context: @DrJimFan works at nvidia

comment by Milan W (weibac) · 2025-01-24T20:17:01.885Z · LW(p) · GW(p)

IF we got, and will keep getting, strong scaling-law improvements, then:

  • ~~OpenAI's plan to continue to acquire way more training compute even into 2029 is either lies or a mistake~~
  • we'll get very interesting times quite soon
  • offense-defense balances and multi-agent-system dynamics seem like good research directions, if you can research fast and have reason to believe your research will be implemented in a useful way

EDIT: I no longer fully endorse the crossed-out bullet point. Details in replies to this comment.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-24T20:50:59.106Z · LW(p) · GW(p)

Disagree on pursuit of compute being a mistake in one of those worlds but not the other. Either way you are going to want as much inference as possible during key strategic moments.

This seems even more critically important if you are worried your competitors will have algorithms nearly as good as yours.

[Edit: roon posted the same thought on xitter the next day https://x.com/tszzl/status/1883076766232936730

roon @tszzl

if the frontier models are commoditized, compute concentration matters even more

if you can train better models for fewer flops, compute concentration matters even more

compute is the primary means of production of the future and owning more will always be good

12:57 AM · Jan 25, 2025

roon @tszzl

imo, open source models are a bit of a red herring on the path to acceptable asi futures. free model weights still don’t distribute power to all of humanity, they distribute it to the compute rich

https://x.com/MikePFrank/status/1882999933126721617

Michael P. Frank @MikePFrank

Since R1 came out, people are talking like the massive compute farms deployed by Western labs are a waste, BUT THEY’RE NOT — don’t you see? This just means that once the best of DeepSeek’s clever cocktail of new methods are adopted by GPU-rich orgs, they’ll reach ASI even faster. ]

Replies from: weibac
comment by Milan W (weibac) · 2025-01-24T21:36:29.181Z · LW(p) · GW(p)

Agreed. However, in the fast world the game is extremely likely to end before you get to use 2029 compute.
EDIT: I'd be very interested to hear an argument against this proposition, though.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-24T22:22:15.069Z · LW(p) · GW(p)

I don't know if the plan is to have the compute from Stargate become available in incremental stages, or all at once in 2029.

I expect timelines are shorter than that, but I'm not certain. If I were in OpenAI's shoes, I'd want to hedge my bets. 2026 seems plausible. So does 2032. My peak expectation is sometime in 2027, but I wouldn't want to go all-in on that.

Replies from: weibac
comment by Milan W (weibac) · 2025-01-24T22:44:52.873Z · LW(p) · GW(p)

all at once in 2029.

I am almost totally positive that the plan is not that. 

If planning for 2029 is cheap, then it probably makes sense under a very broad class of timelines expectations.
If it is expensive, then the following applies to the hypothetical presented by the tweet:

The timeline evoked in the tweet seems extremely fast and multipolar. I'd expect planning for 2029 compute scaling to make sense only if the current paradigm gets stuck at ~AGI capabilities level (i.e., a very good scaffolding for a model similar to but a bit smarter than o3). This is because if it scales further than that, it will do so fast (requiring little compute, as the tweet suggests). If capabilities arbitrarily better than o4-with-good-scaffolding are compute-cheap to develop, then things almost certainly get very unpredictable before 2029.

comment by Jakub Halmeš (jakub-halmes-1) · 2025-01-23T16:00:11.552Z · LW(p) · GW(p)

During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable.

I also found this trade-off between human readability and performance noteworthy.

Replies from: weibac
comment by Milan W (weibac) · 2025-01-23T16:48:57.371Z · LW(p) · GW(p)

Side note: Claude 3.5 Sonnet does CoT language-mixing after a bit of prompting and convincing. I'm not sure about effects on performance. Also the closeness narratively implied by having it imitate the idiosyncratic mixture I was using to talk to it probably exacerbated sycophancy.
 

comment by Jesse Hoogland (jhoogland) · 2024-12-13T18:48:30.267Z · LW(p) · GW(p)

Phi-4: Synthetic data works. Pretraining's days are numbered. 

Microsoft just announced Phi-4, a 14B-parameter model that matches GPT-4o on some difficult benchmarks. The accompanying technical report offers a glimpse into the growing importance of synthetic data and how frontier model training is changing.

Some takeaways:

  • The data wall is looking flimsier by the day. Phi-4 is highly capable not despite but because of synthetic data. It was trained on a curriculum of 50 types of synthetic datasets, generated by GPT-4o from a diverse set of organic data "seeds". We're seeing a smooth progression from training on (1) organic data, to (2) human-curated datasets, to (3) AI-curated datasets (filtering for appropriate difficulty, using verifiers), to (4) AI-augmented data (generating Q&A pairs, iteratively refining answers, reverse-engineering instructions from code, etc.), to (5) pure synthetic data.
  • Training is fracturing. It's not just the quality and mixture but also the ordering of data that matters. Phi-4 features a "midtraining" phase that expands its context length from 4k to 16k tokens, upweighting long-context behavior only when the model has become capable enough to integrate that extra information. Post-training features a standard SFT phase and two rounds of DPO: one round of DPO using "pivotal token search" to generate minimally distinct pairs that are easier to learn from, and one round of more standard "judge-guided DPO". In the authors' own words: "An end-to-end optimization of pretraining data mixture that also takes into account the effects of post-training is an interesting future area of investigation."
  • The next frontier is self-improvement. Phi-4 was taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself. This progression towards online learning is possible because of amortization: additional inference-time compute spent generating higher-quality tokens becomes training data (see the sketch after this list). The techniques range from simple (rejection-sampling multiple answers and iterative refinement) to complex (o1-style reasoning [LW · GW]), but the principle remains: AI systems will increasingly be involved in training their successors and then themselves by curating, enhancing, and generating data, and soon by optimizing their own training curricula.
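
As a rough illustration of the rejection-sampling end of this spectrum (my sketch, not the Phi-4 pipeline; `generate_fn` and `verify_fn` are hypothetical stand-ins for a teacher model and a checker such as unit tests or a judge model):

```python
def build_synthetic_dataset(generate_fn, verify_fn, prompts, samples_per_prompt=8):
    """Generate candidate answers with a teacher model, keep only verified ones.

    Spending more inference-time compute (a larger samples_per_prompt) buys
    higher-quality (prompt, answer) pairs, which is the amortization point:
    inference compute becomes training data for the next model.
    """
    dataset = []
    for prompt in prompts:
        candidates = generate_fn(prompt, samples_per_prompt)
        accepted = [answer for answer in candidates if verify_fn(prompt, answer)]
        if accepted:
            # Keep one verified answer per prompt as an SFT-style training pair.
            dataset.append({"prompt": prompt, "answer": accepted[0]})
    return dataset
```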

The implication: If you don't have access to a 2024-frontier AI, you're going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration. 

Replies from: Vladimir_Nesov, Lucas Teixeira, ozziegooen, ben-livengood, Lucas Teixeira
comment by Vladimir_Nesov · 2024-12-14T02:45:30.170Z · LW(p) · GW(p)

I don't think Phi-4 offers convincing evidence either way. You can push performance on verifiable tasks quite far without the model becoming generally more capable. AlphaZero doesn't imply that scaling with its methods gestures at general superintelligence, and similarly with Phi-4.

In contrast, using o1-like training as a way to better access ground truth in less tractable domains seems more promising, since by some accounts its tactics on long reasoning traces work even in non-technical domains (unlike for DeepSeek R1), possibly because they are emergent rather than directly encouraged with task-specific training.

comment by Lucas Teixeira · 2024-12-14T13:13:39.950Z · LW(p) · GW(p)

Phi-4 is highly capable not despite but because of synthetic data.

Imitation models tend to be quite brittle outside of their narrowly imitated domain, and I suspect the same is the case for Phi-4. Some of the decontamination measures they took provide some counter-evidence to this, but not much. I'd update more strongly if I saw results on benchmarks that contain the generality and diversity of tasks required to do meaningful autonomous cognitive labour "in the wild", such as SWE-Bench (or rather what I understand SWE-Bench to be; I have yet to play very closely with it).

Phi-4 is taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself.

There's an important distinction between utilizing synthetic data in teacher-student setups and utilizing synthetic data in self-teaching. While synthetic data is a demonstrably powerful way of augmenting human feedback, my current estimation is that typical mode-collapse arguments still hold for self-generated, purely synthetic datasets, and that Phi-4 doesn't provide counter-evidence against this.

comment by ozziegooen · 2024-12-13T23:37:43.312Z · LW(p) · GW(p)

This is neat, thanks for highlighting.

>The implication: If you don't have access to a 2024-frontier AI, you're going to have a hard time training the next frontier model. That gap will likely widen with each subsequent iteration. 

This doesn't seem super clear to me. Without synthetic data, you need to scrape large parts of the web and manage a lot of storage infrastructure. This can either be done illegally, or with complex negotiations (especially as companies are catching on to this). 

In comparison, it would be very useful if you could train a near-SOTA model with just synthetic data, say from an open-source model. This might not bring you all the way to SOTA, but close might be good enough for many things. 

Replies from: jhoogland, ozziegooen
comment by Jesse Hoogland (jhoogland) · 2024-12-13T23:50:19.913Z · LW(p) · GW(p)

I agree. My original wording was too restrictive, so let me try again:

I think pushing the frontier past 2024 levels is going to require more and more input from the previous generation's LLMs. These could be open- or closed-source (the closed-source ones will probably continue to be better), but the bottleneck is likely to shift from "scraping and storing lots of data" to "running lots of inference to generate high-quality tokens." This will change the balance to be easier for some players, harder for others. I don't think that change in balance is perfectly aligned with frontier labs.

Replies from: ozziegooen
comment by ozziegooen · 2024-12-13T23:55:53.465Z · LW(p) · GW(p)

Yep, that makes sense to me.

One tiny point - I think the phrase "synthetic data" arguably breaks down at some point. "Synthetic data" sounds to me like, "we're generating fake data made to come from a similar distribution to 'real data'." But I assume that a lot of the data we'll get with inference will be things more like straightforward reasoning. 

For example, we get O1 to solve a bunch of not-yet-recorded mathematical lemmas, then train the next model on those. Technically this is "synthetic data", but I don't see why this data is fundamentally different than similar mathematics that humans do. This data is typically the synthesis or distillation of much longer search and reasoning processes. 

As such, it seems very sensible to me to expect "synthetic data" to be a major deal. 

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-12-14T09:50:00.272Z · LW(p) · GW(p)

For example, we get O1 to solve a bunch of not-yet-recorded mathematical lemmas, then train the next model on those.

Would there have to be human vetting to check that O1’s solutions are correct? The practicality of that would depend on the scale, but you don’t want to end up with a blurry JPEG of a blurry JPEG of the internet.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-12-14T13:07:46.208Z · LW(p) · GW(p)

Mathematical lemmas can be formalized in a language like Lean to automatically check correctness. So access to ground truth is even clearer than for programming; the main issue is probably finding a lot of sane formalized things to prove that the system is capable of proving.
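
As a toy illustration (my example, not from the comment): once a lemma is stated in Lean, the kernel checks the proof mechanically, so human vetting is only needed to confirm the statement says what we meant.

```lean
-- The Lean kernel verifies this proof automatically; only the statement
-- needs human review to confirm it formalizes the intended lemma.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```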

comment by ozziegooen · 2024-12-13T23:40:06.412Z · LW(p) · GW(p)

Another interesting takeaway to me: I didn't realize that Microsoft was doing much training of its own. It makes a lot of sense that they'd want their own teams making their own models, in part to hedge around OpenAI.

I'm curious what their strategy will be in the next few years. 

comment by Ben Livengood (ben-livengood) · 2024-12-14T00:44:41.884Z · LW(p) · GW(p)

It looks like recursive self-improvement is here for the base case, at least. It will be interesting to see if anyone uses solely Phi-4 to pretrain a more capable model.

comment by Jesse Hoogland (jhoogland) · 2025-02-10T06:14:40.597Z · LW(p) · GW(p)

What do you call this phenomenon?

  • First, you train AlphaGo on expert human examples. This is enough to beat Lee Sedol and Ke Jie. Then, you train AlphaZero purely through self-play. It destroys AlphaGo after only a few hours.
  • First, you train RL agents on human playthroughs of Minecraft. They do okay. Then, DreamerV3 learns entirely by itself and becomes the first to get diamonds.
  • First, you train theorem provers on human proofs. Then, you train AlphaProof using AlphaZero and you get silver on IMO for the first time.
  • First, you pretrain a language model on all human data. Then...

This feels like a special case of the bitter lesson, but it's not the same thing. It seems to rely on the distinction between prediction and search latent in ideas like AIXI [LW(p) · GW(p)]. It's the kind of thing that I'm sure Gwern has christened in some comment lost to the internet's backwaters. We should have a name for it—something more refined than just "foom."

Replies from: jhoogland, jeremy-gillen, Archimedes, CapResearcher, mtrazzi
comment by Jesse Hoogland (jhoogland) · 2025-02-10T16:56:29.694Z · LW(p) · GW(p)

I think this is important because the safety community still isn't thinking very much about search & RL, even after all the recent progress with reasoning models. We've updated very far away from AlphaZero as a reference class, and I think we will regret this.

On the other hand, the ideas I'm talking about here seem to have widespread recognition among people working on capabilities. Demis is very transparent about where they're headed with language models, AlphaZero, and open-ended exploration (e.g., at 20:48). Noam Brown is adamant about test-time scaling/reasoning being the future (e.g., at 20:32). I think R1 has driven the message home for everyone else.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2025-02-10T19:52:51.343Z · LW(p) · GW(p)

To be fair, AlphaZero not only had an essentially unhackable reward model but could also generate very large amounts of data. That combination, while not totally unique to Go or gaming, is generally hard to come by in a lot of domains, so progress will probably be slower than it was for AlphaZero.

Also, a lot of these domains are ones where latencies are either very low or long latencies can be tolerated, which is not often the case in the physical world.

comment by Jeremy Gillen (jeremy-gillen) · 2025-02-10T11:11:33.736Z · LW(p) · GW(p)

I propose: the best planners must break the beta.

Because if a planner is going to be the best, it needs to be capable of finding unusual (better!) plans. If it's capable of finding those, there's ~no benefit of knowing the conventional wisdom about how to do it (climbing slang: beta). 

Edit: or maybe: good planners don't need beta?

Replies from: jhoogland
comment by Jesse Hoogland (jhoogland) · 2025-02-10T16:03:48.889Z · LW(p) · GW(p)

That's fun but a little long. Why not... BetaZero?

comment by Archimedes · 2025-02-11T04:49:25.296Z · LW(p) · GW(p)

The Bitter Lesson is pretty on point but you could call it "Bootstrapping from Zero", the "Autodidactic Leap", the "Self-Discovery Transition", or "Breaking the Imitation Ceiling" if you prefer.

comment by CapResearcher · 2025-02-10T14:17:37.360Z · LW(p) · GW(p)

I don't think I get what phenomenon you're pointing to.

Your first bullet point makes it sound like AlphaGo wasn't trained using self-play, in contrast to AlphaZero. However, AlphaGo was trained with a combination of supervised learning and self-play. They removed the supervised learning part from AlphaZero to make it simpler and more general.

DreamerV3 also fits the pattern where previous SOTA approaches used a combination of imitation learning and reinforcement learning, while DreamerV3 was able to remove the imitation learning part.[1]

To my understanding, AlphaProof was trained by translating a bunch of math problems to Lean, and using "correct proof" as reward for AlphaZero. This approach also combines human data (our math problems) with reinforcement learning (AlphaZero).

Your final example feels close to AlphaProof if you finish it with "Then you finetune CoT with reinforcement learning to yield impressive performance on reasoning benchmarks", but I don't think that's what you were going for.

The first two examples seem covered by "when reinforcement learning works well, imitation learning is no longer needed". Idk about the rest.

Could you clarify by giving more examples or otherwise explain what you're looking for?

  1. ^

    I got curious how DreamerV3 figures out Minecraft with nothing to imitate and no intermediate reward, so I checked the paper. There are intermediate rewards. They give +1 reward for each of 12 ordered milestones leading up to the diamond, and -0.01 for each lost heart and +0.01 for each restored heart. Additionally, they use "the block breaking setting of prior work[19] because the provided action space would make it challenging for stochastic policies to keep a key pressed for a prolonged time". So to get started, probably the agent manages to randomly break a tree block and get its first reward.

Replies from: jhoogland
comment by Jesse Hoogland (jhoogland) · 2025-02-10T16:32:42.856Z · LW(p) · GW(p)

With AlphaProof, the relevant piece is that the solver network generates its own proofs and disproofs to train against. There's no imitation learning after formalization. There is a slight disanalogy where, for formalization, we mostly jumped straight to self-play/search, and I don't think there was ever a major imitation-learning-based approach (though I did find at least one example). 

Your quote "when reinforcement learning works well, imitation learning is no longer needed" is pretty close to what I mean. What I'm actually trying to get at is a stronger statement: we often bootstrap using imitation learning to figure out how to get the reinforcement learning component working initially, but once we do, we can usually discard the imitation learning entirely.

comment by Michaël Trazzi (mtrazzi) · 2025-02-11T12:31:56.145Z · LW(p) · GW(p)

Nitpick: the first AlphaGo was trained by a combination of supervised learning from human expert games and reinforcement learning from self-play. Also, Ke Jie was beaten by AlphaGo Master, which was a version at a later stage of development.

comment by Jesse Hoogland (jhoogland) · 2024-12-09T19:10:34.843Z · LW(p) · GW(p)

Agency = Prediction + Decision.

AIXI [? · GW] is an idealized model of a superintelligent agent that combines "perfect" prediction (Solomonoff Induction [? · GW]) with "perfect" decision-making (sequential decision theory).
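
Roughly, in Hutter's notation, AIXI picks the action that maximizes expected future reward under a Solomonoff mixture over all environment programs $q$ consistent with the history: the inner sum over programs is the "prediction" half, and the nested max/sum expectimax is the "decision" half:

$$a_t := \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_t + \cdots + r_m \big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

where $U$ is a universal Turing machine, $m$ is the horizon, and $\ell(q)$ is the length of program $q$.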

OpenAI's o1 is a real-world "reasoning model" that combines a superhuman predictor (an LLM like GPT-4 [LW(p) · GW(p)]) with advanced decision-making (implicit search via chain of thought trained by RL).

To be clear: o1 is no AIXI. But AIXI, as an ideal, can teach us something about the future of o1-like systems.

AIXI teaches us that agency is simple. It involves just two raw ingredients: prediction and decision-making. And we know how to produce these ingredients. Good predictions come from self-supervised learning, an art we have begun to master over the last decade of scaling pretraining. Good decisions come from search, which has evolved from the explicit search algorithms that powered Deep Blue and AlphaGo to the implicit methods that drive AlphaZero and now o1.

So let's call "reasoning models" like o1 what they really are: the first true AI agents. It's not tool-use that makes an agent; it's how that agent reasons. Bandwidth comes second.

Simple does not mean cheap: pretraining is an industrial process that costs (hundreds of) billions of dollars. Simple also does not mean easy: decision-making is especially difficult to get right since amortizing search (=training a model to perform implicit search) requires RL, which is notoriously tricky.

Simple does mean scalable. The original scaling laws taught us how to exchange compute for better predictions. The new test-time scaling laws teach us how to exchange compute for better decisions. AIXI may still be a ways off, but we can see at least one open path that leads closer to that ideal.

The bitter lesson is that "general methods that leverage computation [such as search and learning] are ultimately the most effective, and by a large margin." The lesson from AIXI is that maybe these are all you need. The lesson from o1 is that maybe all that's left is just a bit more compute...


We still don't know the exact details of how o1 works. If you're interested in reading about hypotheses for what might be going on and further discussion of the implications for scaling and recursive self-improvement, see my recent post, "o1: A Technical Primer [LW · GW]".

Replies from: p.b., james.lucassen, dmurfet
comment by p.b. · 2024-12-11T08:17:40.647Z · LW(p) · GW(p)

You are skipping over a very important component: Evaluation. 

Which is exactly what we don't know how to do well enough outside of formally verifiable domains like math and code, which is exactly where o1 shows big performance jumps. 

comment by james.lucassen · 2024-12-10T22:19:58.818Z · LW(p) · GW(p)

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between "reasoning" and "chat" models, and I'd prefer to use "agent" for that distinction.

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model or a "goal" model or something.

comment by Daniel Murfet (dmurfet) · 2024-12-09T19:58:02.173Z · LW(p) · GW(p)

Marcus Hutter on AIXI and ASI safety 

comment by Jesse Hoogland (jhoogland) · 2023-04-12T09:32:54.322Z · LW(p) · GW(p)

When people complain about LLMs doing nothing more than interpolation, they're mixing up two very different ideas: interpolation as intersecting every point in the training data, and interpolation as predicting behavior in-domain rather than out-of-domain.

With language, interpolation-as-intersecting isn't inherently good or bad—it's all about how you do it. Just compare polynomial interpolation to piecewise-linear interpolation (the thing that ReLUs do).
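
A toy numerical comparison (my illustration, using nothing beyond NumPy): both interpolants pass through every training point, but the degree-4 polynomial overshoots between points while the piecewise-linear fit stays within the range of the data.

```python
import numpy as np

# Five training points; both methods below hit every one of them exactly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

coeffs = np.polyfit(x, y, deg=len(x) - 1)    # degree-4 polynomial interpolant
x_dense = np.linspace(-2.0, 2.0, 201)
poly_vals = np.polyval(coeffs, x_dense)      # overshoots between data points
pl_vals = np.interp(x_dense, x, y)           # straight lines between data points

print("max |polynomial interpolant|:      ", np.abs(poly_vals).max())  # ~1.33
print("max |piecewise-linear interpolant|:", np.abs(pl_vals).max())    # 1.0
```

Running this prints roughly 1.33 for the polynomial versus 1.0 for the piecewise-linear interpolant, even though every training value lies in [0, 1].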

Neural networks (NNs) are biased towards fitting simple piecewise functions, which is (locally) the least biased way to interpolate. The simplest function that intersects two points is the straight line.

In reality, we don't even train LLMs long enough to hit that intersecting threshold. In this under-interpolated sweet spot, NNs seem to learn features from coarse to fine with increasing model size. E.g.: https://arxiv.org/abs/1903.03488

 

Bonus: this is what's happening with double descent: Test loss goes down, then up, until you reach the interpolation threshold. At this point there's only one interpolating solution, and it's a bad fit. But as you increase model capacity further, you end up with many interpolating solutions, some of which generalize better than others.

Meanwhile, with interpolation-not-extrapolation, NNs can and do extrapolate outside the convex hull of training samples. Again, the bias towards simple linear extrapolations is locally the least biased option. There's no beating the polytopes.

Here I've presented the visuals in terms of regression, but the story is pretty similar for classification, where the function being fit is a classification boundary. In this case, there's extra pressure to maximize margins, which further encourages generalization.

The next time you feel like dunking on interpolation, remember that you just don't have the imagination to deal with high-dimensional interpolation. Maybe keep it to yourself and go interpolate somewhere else.