Comments

Comment by moozooh (vladimir-sergeyev) on Implications of the inference scaling paradigm for AI safety · 2025-01-17T23:20:54.860Z · LW · GW

I agree with this and would like to add that scaling along the inference-time axis seems more likely to rapidly push performance on certain closed-domain reasoning tasks far beyond human capability (likely as soon as this year!), which will serve as a very convincing show of safety to many people and will lead to wide adoption of such models for intellectual task automation. But without the various forms of experiential and common-sense reasoning humans have, there's no telling where and how such a "superhuman" model may catastrophically mess up, simply because it doesn't understand a lot of things any human being takes for granted. Given the current state of AI development, this strikes me as literally the shortest path to a paperclip maximizer. Well, maybe not that catastrophic, but hey, you never know.

In terms of how quickly it accelerates certain adoption-related risks, I don't think this bodes particularly well. I would prefer a more evenly spread cognitive capability profile.

Comment by moozooh (vladimir-sergeyev) on Implications of the inference scaling paradigm for AI safety · 2025-01-17T22:55:22.295Z · LW · GW

I don't think o3 is a bigger model if we're talking raw parameter count. I am reasonably sure that o1, o3, and the future o-series models are, for the time being, all based on 4o and scale its fundamental capabilities and knowledge. I also think 4o itself was created specifically for the test-time compute scaffolding, because the previous GPT-4 versions were far too bulky. You might've noticed that pretty much all of 2024 at the top labs was about distillation and miniaturization, with the best-performing models all significantly smaller than the best performers up through the winter of 2023/2024.

In my understanding, the cost increase comes from the fact that better, tighter chains of thought make it possible to juice the "creative" settings, like higher temperature, which expand the search space a little bit at a time. So the model actually searches for longer despite being more efficient than its predecessor, because it's able to search more broadly and further outside the box instead of running in circles. Using this approach with o1 likely wouldn't be possible, because it would take several times more tokens to meaningfully explore each possibility and would risk running out of the context window before matching o3's performance.
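To make the "broader search" intuition concrete, here is a minimal, purely illustrative sketch (a toy softmax sampler, not anything OpenAI has described about the o-series): raising the temperature flattens the next-token distribution, so repeated sampling covers more distinct continuations, at the cost of spending more tokens per branch explored. The logits below are made-up numbers for illustration only.

```python
# Toy illustration: higher sampling temperature flattens the next-token
# distribution, so repeated sampling explores more distinct continuations.
import math
import random

def softmax(logits, temperature):
    # Divide logits by the temperature; larger temperatures flatten the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    # Draw one index from a categorical distribution.
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Hypothetical next-token logits: one dominant "in the box" candidate,
# plus several plausible but less likely alternatives.
logits = [4.0, 2.5, 2.0, 1.5, 1.0, 0.5]

random.seed(0)
for temperature in (0.3, 0.7, 1.2):
    probs = softmax(logits, temperature)
    draws = [sample(probs) for _ in range(1000)]
    print(f"T={temperature}: {len(set(draws))} distinct tokens sampled, "
          f"top-token share ~{draws.count(0) / 1000:.2f}")
```

At low temperature nearly every draw is the single most likely token; at higher temperature the draws spread across more alternatives, which is the sense in which the search gets "broader" per unit of sampling.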

I also believe the main reason Altman is hesitant to release GPT-5 (which they already have and most likely already use internally as a teacher model for the others) is that, strictly economically speaking, it's just a hard sell in a world where o3 exists, o4 is coming in three months, and then o5, etc. It can't outsmart those models, and unless it has speed and latency similar to 4o or o1-mini, it can't be used for real-time tasks like conversation mode or computer vision. The remaining scope of real-world problems for which it is the best option then narrows down to something like creative writing and other strictly open-ended tasks that don't benefit from "overthinking". I think the other labs ran into the same issue with Claude 3.5 Opus, Gemini 1.5 Ultra (if it was ever a thing at all), and any other trillion-scale models we were promised in 2024. The age of trillion-scale models is very likely over.