MiloSal's Shortform
post by MiloSal (milosal) · 2025-02-01T03:35:10.936Z · LW · GW · 2 comments
comment by MiloSal (milosal) · 2025-02-01T03:35:10.931Z · LW(p) · GW(p)
What is o3's base model?
To create DeepSeek-R1, DeepSeek:
- Start with DeepSeek-V3-Base as a base model
- Fine-tune base model on synthetic long CoT problem solving examples
- Run RL to convergence on challenging verifiable math/coding/etc. problems, with reward for (a) formatting and (b) correctness (a toy sketch of such a reward follows this list)
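For concreteness, here is a minimal sketch of what such a reward might look like. The <think>/<answer> tag format, the weights, and the exact-match check are illustrative assumptions on my part, not DeepSeek's actual implementation.

```python
import re

def r1_style_reward(response: str, gold_answer: str) -> float:
    """Toy reward in the spirit of the R1 recipe (illustrative, not DeepSeek's code)."""
    reward = 0.0
    # (a) formatting reward: did the model wrap its reasoning and answer in the expected tags?
    if re.fullmatch(r"(?s)\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", response):
        reward += 0.1
    # (b) correctness reward: does the extracted answer match the reference?
    match = re.search(r"(?s)<answer>(.*?)</answer>", response)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward
```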
Therefore, I roughly expect o1's training process was:
- Start with 4o as a base model
- Some sort of SFT on problem solving examples
- Run RL on verifiable problems with some similar reward setup (see the toy verifier sketched after this list)
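By "verifiable" I mean problems where a cheap programmatic check can grade the output. A toy grader for numeric math answers might look like the following; the \boxed{} convention and the float comparison are assumptions for illustration, since real graders handle symbolic equivalence, units, and so on.

```python
import re

def verify_numeric_answer(response: str, gold: str) -> bool:
    """Toy grader: extract the final \\boxed{...} answer and compare to the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if not match:
        return False  # no parseable answer -> no reward
    candidate = match.group(1).strip()
    try:
        return float(candidate) == float(gold)  # numeric comparison when possible
    except ValueError:
        return candidate == gold.strip()        # fall back to string match
```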
An important question for the near-term scaling picture is whether o3 uses 4o as its base model. This question arises because we need some way to explain the capability gains from o1 to o3. A convenient explanation is that o3 was trained using approximately the same process as above, except the base model is something like GPT-4.5 or GPT-5.
However, some recent evidence has come to light against this view. As a friend points out, o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023). This seems like strong evidence that o3 uses 4o as the base model. Additionally, I would expect o3 to be more performant than it currently is if it used GPT-5 as a base model.
My current best guess is that o3 actually comes from a process like this:
- Start with 4o+ as a base model (that is, 4o fine-tuned with some o1 distillation; a sketch of what I mean by distillation follows this list)
- Some sort of SFT on problem solving examples, as before
- A somewhat improved RL setup, again on verifiable problems. I am imagining a setup that takes slightly better advantage of compute, in the spirit of the bitter lesson, because o1 feels like it was a bit of an experiment while o3 probably got "full-scale" compute resources.
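To make "fine-tuned with some o1 distillation" concrete: I imagine collecting solutions from the stronger reasoner and using them as supervised fine-tuning data for the base model. The sketch below is a guess at the shape of that pipeline; the model names, API calls, and JSONL schema are assumptions for illustration, not anything OpenAI has described.

```python
# Sketch of an "o1 distillation" SFT dataset: collect the teacher's solutions
# and package them as chat-format fine-tuning examples for the student (e.g., 4o).
# Model names and the JSONL schema are illustrative assumptions, not a known
# description of OpenAI's internal pipeline.
import json
from openai import OpenAI

client = OpenAI()

def distill_example(problem: str) -> dict:
    # Ask the stronger reasoning model for a solution.
    resp = client.chat.completions.create(
        model="o1",  # hypothetical choice of teacher
        messages=[{"role": "user", "content": problem}],
    )
    solution = resp.choices[0].message.content
    # One supervised fine-tuning example in chat format.
    return {"messages": [
        {"role": "user", "content": problem},
        {"role": "assistant", "content": solution},
    ]}

def build_distillation_set(problems: list[str], path: str = "distill.jsonl") -> None:
    with open(path, "w") as f:
        for p in problems:
            f.write(json.dumps(distill_example(p)) + "\n")
```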
In other words, I suspect o3's base model is 4o+ rather than a next-generation pretrained model like GPT-4.5 or GPT-5. If this view is correct, it has startling consequences for near-term scaling. Once the reasoning paradigm is plugged into GPT-5, we'll have big problems.
↑ comment by MiloSal (milosal) · 2025-02-01T03:35:21.747Z · LW(p) · GW(p)
Another possibility is that only o3-mini has this knowledge cutoff and the full o3 has a later one. This could happen if o3-mini is distilled into an older small model (e.g., 4o-mini). If the full o3 turns out to have a knowledge cutoff later than 2023, I'd take that as convincing evidence that 4o is not its base model.