MiloSal's Shortform

post by MiloSal (milosal) · 2025-02-01T03:35:10.936Z · LW · GW · 2 comments



comment by MiloSal (milosal) · 2025-02-01T03:35:10.931Z · LW(p) · GW(p)

What is o3's base model?

To create DeepSeek-R1, DeepSeek:

  1. Start with DeepSeek-V3-Base as a base model
  2. Fine-tune base model on synthetic long CoT problem solving examples
  3. Run RL to convergence on challenging verifiable math/coding/etc. problems, with reward for (a) formatting and (b) correctness
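The reward in step 3 can be sketched as a simple function of the model's completion. This is a toy illustration, not DeepSeek's actual implementation: the `<think>`/`<answer>` tag format matches their reported setup, but the weights and exact-match correctness check are my assumptions.

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Toy reward combining (a) format adherence and (b) answer correctness.

    Weights (0.2 for format, 1.0 for correctness) are hypothetical.
    """
    # (a) Formatting: completion should be exactly <think>...</think><answer>...</answer>
    format_ok = bool(re.fullmatch(
        r"<think>.*?</think>\s*<answer>.*?</answer>\s*", completion, re.DOTALL
    ))
    # (b) Correctness: extract the answer and compare to a verified reference
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    correct = answer == reference_answer.strip()
    return 0.2 * format_ok + 1.0 * correct
```

A well-formatted correct completion scores 1.2; a well-formatted wrong one scores 0.2. Real verifiers are more involved (e.g., running unit tests for code, or checking mathematical equivalence rather than string equality).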

Therefore, I roughly expect o1's training process was:

  1. Start with 4o as a base model
  2. Some sort of SFT on problem solving examples
  3. Run RL on verifiable problems with some similar reward setup.

An important question for the near-term scaling picture is whether o3 uses 4o as its base model. This question arises because we need some way to explain the capability gains from o1 to o3. A convenient explanation is that o3 was trained using approximately the same process as above, except the base model is something like GPT-4.5 or GPT-5. 

However, some recent evidence has come to light against this view. As a friend points out, o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023). This seems like strong evidence that o3 uses 4o as the base model. Additionally, if o3 used GPT-5 as a base model, I would expect it to be more capable than it currently is.

My current best guess is that o3 actually comes from a process like this:

  1. Start with 4o+ as a base model (that is, 4o fine-tuned with some o1 distillation)
  2. Some sort of SFT on problem solving examples, as before
  3. A somewhat improved RL setup, again on verifiable problems. I am imagining a setup that takes slightly better advantage of compute, in the spirit of the bitter lesson. This is because o1 feels like it was a bit of an experiment, while o3 probably got "full-scale" compute resources.
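The distillation step in (1) amounts to supervised fine-tuning on traces sampled from the stronger reasoner. A minimal sketch of the data-collection side, where `teacher_generate` is a hypothetical stand-in for sampling from o1:

```python
from typing import Callable

def build_distillation_dataset(
    prompts: list[str],
    teacher_generate: Callable[[str], str],
) -> list[dict[str, str]]:
    """Collect (prompt, teacher trace) pairs for SFT-based distillation.

    `teacher_generate` stands in for sampling a long-CoT completion from
    the teacher model (e.g., o1); the student (e.g., 4o) is then
    fine-tuned on these pairs with a standard next-token objective.
    """
    return [
        {"prompt": p, "completion": teacher_generate(p)}
        for p in prompts
    ]
```

In practice labs would also filter these traces (e.g., keeping only verified-correct ones), which blurs the line between distillation and rejection-sampling SFT.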

In other words, I suspect o3's base model is 4o+. If this view is correct, it has startling consequences for near-term scaling: once the reasoning paradigm is plugged into GPT-5, we'll have big problems.

Replies from: milosal
comment by MiloSal (milosal) · 2025-02-01T03:35:21.747Z · LW(p) · GW(p)

Another possibility is that only o3-mini has this knowledge cutoff and the full o3 has a later one. This could happen if o3-mini is the full o3 distilled into an older, smaller model (e.g., 4o-mini). If the full o3 turns out to have a knowledge cutoff later than 2023, I'd take that as convincing evidence that 4o is not the base model.