lawrence-tang

Comments

Comment by Lawrence Tang (lawrence-tang) on o1: A Technical Primer · 2025-02-23T00:31:01.805Z · LW · GW

What evidence is there that a model's own labels can benefit its training? Or that an outcome reward model ("ORM") or a process reward model ("PRM") can benefit an LLM? This is the big problem, and the article does not address it.

Comment by Lawrence Tang (lawrence-tang) on o1: A Technical Primer · 2025-02-23T00:24:47.597Z · LW · GW

The reinforcement learning is a train-time innovation, not a test-time one. This was not clear to me from your article. Few changes are made at test time: the model is simply allowed to keep outputting text and to decide when to terminate, which 4o does not do.
