Comment by Lawrence Tang (lawrence-tang) on o1: A Technical Primer · 2025-02-23T00:31:01.805Z · LW · GW
What evidence is there that a model's own labels can benefit its training? Or that an "ORM" (outcome reward model) or a "PRM" (process reward model) can benefit an LLM? This is the big problem that the article does not address.
Comment by Lawrence Tang (lawrence-tang) on o1: A Technical Primer · 2025-02-23T00:24:47.597Z · LW · GW
The reinforcement learning is a train-time innovation, not a test-time one. This was not clear to me from your article. Little changes at test time: the model is simply allowed to keep outputting text and to decide for itself when to terminate, which 4o does not do.
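
To make the test-time point concrete, here is a toy sketch in Python. Everything in it is hypothetical: `STOP`, `generate`, and `make_toy_model` are made-up names, and the "model" is a hand-written stub, not a real LLM or anything taken from the article. It only illustrates that the decoding loop can stay the same while the policy, shaped at train time, chooses how long to keep going before it stops.

```python
from typing import Callable, List

STOP = "<stop>"  # hypothetical end-of-output token


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_tokens: int = 1000) -> List[str]:
    """Plain autoregressive loop: keep sampling until the model emits STOP."""
    out: List[str] = []
    while len(out) < max_tokens:
        tok = next_token(prompt + out)
        if tok == STOP:  # the model, not the harness, decides when to end
            break
        out.append(tok)
    return out


def make_toy_model(think_steps: int) -> Callable[[List[str]], str]:
    """Stub policy: emits `think_steps` reasoning tokens, an answer, then STOP.

    Assumption for this sketch: `think_steps` stands in for behaviour
    that would be learned during training.
    """
    def next_token(context: List[str]) -> str:
        produced = len(context) - 1  # the prompt is a single token here
        if produced < think_steps:
            return f"step{produced}"
        if produced == think_steps:
            return "answer"
        return STOP
    return next_token


# Same loop, two different stub policies.
short_out = generate(make_toy_model(think_steps=0), ["question"])
long_out = generate(make_toy_model(think_steps=5), ["question"])
print(len(short_out), len(long_out))
```

The only difference between the two calls is the policy passed in; `generate` itself is untouched, which is the sense in which little changes at test time.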