Thane Ruthenis's Shortform
post by Thane Ruthenis · 2024-09-13T20:52:23.396Z · LW · GW · 7 comments
Comments sorted by top scores.
comment by Thane Ruthenis · 2024-12-24T20:30:22.936Z · LW(p) · GW(p)
Here's something that confuses me about o1/o3. Why was the progress there so sluggish?
My current understanding is that they're just LLMs trained with RL to solve math/programming tasks correctly, hooked up to some theorem-verifier and/or an array of task-specific unit tests to provide ground-truth reward signals. There are no sophisticated architectural tweaks, no runtime MCTS or A* search, nothing clever.
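To be concrete, the recipe I have in mind is something like the toy sketch below (all names are hypothetical stand-ins: `generate_cot` for the model's sampling, `verify` for the theorem-verifier/unit tests, `policy_update` for a PPO-style gradient step on actual model weights):

```python
import random

def generate_cot(problem):
    # Sample a (chain-of-thought, final answer) pair from the current policy.
    # Stubbed here with a coin flip standing in for an LLM that is sometimes right.
    answer = random.choice([problem["x"] + problem["y"], 0])
    return "step-by-step reasoning...", answer

def verify(problem, answer):
    # Ground-truth check: a theorem verifier, task-specific unit tests,
    # or (as here) exact-match against a known answer.
    return answer == problem["x"] + problem["y"]

def policy_update(trajectory, reward):
    # Placeholder: a real setup would backprop the log-probabilities of the
    # sampled CoT tokens, weighted by the reward.
    pass

problems = [{"x": a, "y": b} for a in range(3) for b in range(3)]
for step in range(100):
    problem = random.choice(problems)
    cot, answer = generate_cot(problem)
    reward = 1.0 if verify(problem, answer) else 0.0  # binary, verifiable reward
    policy_update((problem, cot, answer), reward)
```

The point being: nothing in this loop requires 2024-era insights. It's RLHF with the human preference model swapped out for a programmatic verifier.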
Why was this not trained back in, like, 2022 or at least early 2023, tested on GPT-3/3.5, and then default-packaged into GPT-4 alongside RLHF? If OpenAI was too busy, why was this not done by any competitors, at decent scale? (I'm sure there are tons of research papers trying it at smaller scales.)
The idea is obvious; doubly obvious if you've already thought of RLHF; triply obvious after "let's think step-by-step" went viral. In fact, I'm pretty sure I've seen "what if RL on CoTs?" discussed countless times in 2022-2023 (sometimes in horrified whispers regarding what the AGI labs might be getting up to).
The mangled hidden CoT and the associated greater inference-time cost are superfluous. DeepSeek r1/QwQ/Gemini Flash Thinking have perfectly legible CoTs which would be fine to present to customers directly; just let them pay on a per-token basis as normal.
Were there any clever tricks involved in the training? Gwern speculates about that here. But none of the follow-up reasoning models have an o1-style deranged CoT, so the more straightforward approaches probably Just Work.
Did nobody have the money to run the presumably compute-intensive RL-training stage back then? But DeepMind exists. Did nobody have the attention to spare, with OpenAI busy upscaling/commercializing and everyone else catching up? Again, DeepMind exists: my understanding is that they're fairly parallelized and they try tons of weird experiments simultaneously. And even if not DeepMind, why have none of the numerous LLM startups (the likes of Inflection, Perplexity) tried it?
Am I missing something obvious, or are industry ML researchers surprisingly... slow to do things?
(My guess is that the obvious approach doesn't in fact work and you need to make some weird unknown contrivances to make it work, but I don't know the specifics.)
Replies from: brambleboy, leogao
↑ comment by brambleboy · 2024-12-24T22:49:00.509Z · LW(p) · GW(p)
While I don't have specifics either, my impression of ML research is that it's a lot of work to get a novel idea working, even if the idea is simple. If you're trying to implement your own idea, you'll be banging your head against the wall for weeks or months wondering why your loss is worse than the baseline. If you try to replicate a promising-sounding paper, you'll bang your head against the wall as your loss is worse than the baseline. It's hard to tell whether you made a subtle error in your implementation, or whether the idea simply doesn't work for reasons you don't understand, because ML has little in the way of theoretical backing. Even when it works, it won't be optimized, so you need engineers to improve the performance and make it stable when training at scale. If you want to ship a working product quickly, it's best to choose what's tried and true.
comment by Thane Ruthenis · 2024-09-13T20:52:24.130Z · LW(p) · GW(p)
On the topic of o1's recent release: wasn't Claude Sonnet 3.5 (the subscription version at least, maybe not the API version) already using hidden CoT? That's the impression I got from it, at least.
The responses don't seem to be produced in constant time. It sometimes literally displays a "thinking deeply" message which accompanies an unusually delayed response. Other times, the following pattern would play out:
- I pose it some analysis problem, with a yes/no answer.
- It instantly produces a generic response like "let's evaluate your arguments".
- There's a 1-2 second delay.
- Then it continues, producing a response that starts with "yes" or "no", then outlines the reasoning justifying that yes/no.
That last point is particularly suspicious. As we all know, the power of "let's think step by step" is that LLMs don't commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it'd be strange if it were trained to sabotage its own CoTs by "writing down the bottom line first" like this, instead of being taught not to commit to a yes/no before doing the reasoning.
On the other hand, from a user-experience perspective, the LLM immediately giving a yes/no answer followed by the reasoning is certainly more convenient.
From that, plus the minor-but-notable delay, I'd been assuming that it's using some sort of hidden CoT/scratchpad, then summarizing its thoughts from it.
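Concretely, I'm imagining a serving pattern like this (pure speculation on my part; `StubModel`, its `generate` method, and the scratchpad tags are all made up for illustration):

```python
class StubModel:
    # Stand-in for whatever Anthropic's actual serving stack is.
    def generate(self, prompt, stop=None):
        return "..."

def respond(model, user_query):
    # Hidden pass: the model reasons in a scratchpad the user never sees.
    scratchpad = model.generate(f"{user_query}\n<scratchpad>", stop="</scratchpad>")
    # Visible pass: the answer is conditioned on the completed hidden reasoning,
    # so it can safely lead with "yes"/"no" before justifying it.
    answer = model.generate(
        f"{user_query}\n<scratchpad>{scratchpad}</scratchpad>\nAnswer:"
    )
    return answer  # only this part is streamed to the user
```

Something like the hidden pass would explain the 1-2 second delay, and the visible pass would explain why the answer can lead with a yes/no without sabotaging its own reasoning.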
I haven't seen people mention that, though. Is that not the case?
(I suppose it's possible that these delays are on the server side, my requests getting queued up...)
(I'd also maybe noticed a capability gap between the subscription and the API versions of Sonnet 3.5, though I didn't really investigate it and it may be due to the prompt.)
Replies from: habryka4, quetzal_rainbow
↑ comment by habryka (habryka4) · 2024-09-13T20:59:22.446Z · LW(p) · GW(p)
My model was that Claude Sonnet has tool access and sometimes does some tool usage behind the scenes (which later gets revealed to the user), but that it wasn't running a whole CoT behind the scenes. I might be wrong, though.
↑ comment by quetzal_rainbow · 2024-09-13T23:42:48.972Z · LW(p) · GW(p)
I think you heard about this thread (I didn't try to replicate it myself).
Replies from: Thane Ruthenis
↑ comment by Thane Ruthenis · 2024-09-14T00:13:16.778Z · LW(p) · GW(p)
Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use "<antThinking>" tags when creating artifacts. It'd make sense if it's also using these tags to hide parts of its CoT.
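If so, hiding them client-side could be as trivial as the following (a guess; only the tag name comes from the system prompt, the rest is made up):

```python
import re

def strip_hidden_thinking(raw_output: str) -> str:
    # Drop everything between <antThinking>...</antThinking> before display;
    # DOTALL lets the hidden reasoning span multiple lines.
    return re.sub(r"<antThinking>.*?</antThinking>", "", raw_output, flags=re.DOTALL)
```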