Posts

PapersToAGI's Shortform 2025-04-11T19:54:51.440Z
Why Bigger Models Generalize Better 2025-04-11T19:54:51.423Z

Comments

Comment by PapersToAGI (nee-1) on Debunk the myth -Testing the generalized reasoning ability of LLM · 2025-04-12T08:58:13.353Z · LW · GW

Amazing post and very valuable research. As another comment said, if you can adjust the writing a bit, this could be a top post.

Comment by PapersToAGI (nee-1) on METR: Measuring AI Ability to Complete Long Tasks · 2025-04-11T18:20:58.044Z · LW · GW

Might Claude 3.7's performance above the predicted line be because it was specifically trained for long-horizon tasks? I know Claude 3.7 acts more like a coding agent than a standard LLM, which makes me suspect it had considerable RL on larger tasks than, say, next-word prediction or solving AIME math questions. If that is the case, we can expect the slope to increase as RL is scaled in the direction of long-horizon tasks.

Comment by PapersToAGI (nee-1) on PapersToAGI's Shortform · 2025-04-11T16:45:57.573Z · LW · GW

Could we scale to true world understanding? LLMs possess extensive knowledge, yet we don’t often see them produce cross-domain insights in the way a person deeply versed in, say, physics and medicine might. Why doesn’t their breadth of knowledge translate into a similar depth of insight? I suspect a lack of robust conceptual world models is behind this shortfall, and that using outcome-based RL during post-training might be the missing element for achieving deeper understanding and more effective world models.

Let’s start with a pretrained LLM that was trained only to predict the next token. Research shows these models do form some abstractions and world models. Due to implicit and explicit regularization, gradient descent prefers generalizing solutions over memorization, because generalizations can be represented with fewer (lower-magnitude) weights, while memorization demands many more. How well such a pretrained model generalizes (versus overfits) can vary, and overall, these models still display notable overfitting when tested on out-of-distribution (OOD) tasks.
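To make the regularization intuition concrete, here is a minimal sketch (in PyTorch, with a hypothetical toy model and placeholder data, not anything from the post) of the explicit side of that pressure: with weight decay, two parameter settings that fit the training data equally well are not equally preferred, because the decay term penalizes the higher-magnitude one.

```python
import torch
import torch.nn as nn

# Toy setup (hypothetical model and data): explicit L2 regularization via weight decay.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(256, 32), torch.randn(256, 1)  # placeholder training data

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # data-fit term
    loss.backward()
    optimizer.step()  # SGD adds weight_decay * w to each gradient, which is equivalent
                      # to also minimizing (weight_decay / 2) * ||w||^2

# Among parameter settings with equal training loss, the lower-norm one is preferred,
# which is the bias toward generalization (over memorization) described above.
```

Implicit regularization (the tendencies of SGD itself, independent of any explicit penalty) is claimed to add a similar low-complexity bias, but it is harder to illustrate in a few lines.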

Now consider the post-training approach: RL scaling. It’s been shown that reasoning-oriented models generalize extremely well OOD, with almost no drop in performance. That could be because RL cares about reaching the correct final answer, not necessarily how it’s reached. As a result, there’s less pressure to overfit, since multiple chains of thought can produce the same reward. Essentially, outcome-based RL (as in the GRPO method from the DeepSeek-R1 paper) reinforces correct conceptual understanding rather than specific reasoning traces in narrow cases (otherwise you’d see a drop in OOD performance, which we don’t).
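To pin down what "outcome-based" means here, below is a minimal sketch of the group-relative advantage computation in the spirit of GRPO (omitting the policy-gradient update and KL penalty from the actual method); the `exact_match` reward is a hypothetical stand-in for rule-based answer checking. The point is that the reward inspects only the final answer, so any chain of thought that reaches it is credited equally.

```python
from typing import Callable, List

def grpo_advantages(
    completions: List[str],
    reference_answer: str,
    outcome_reward: Callable[[str, str], float],
) -> List[float]:
    """Group-relative advantages for one prompt, in the spirit of GRPO.

    The reward depends only on the final outcome, not on which reasoning
    trace produced it, so every completion that reaches the correct answer
    receives the same credit.
    """
    rewards = [outcome_reward(c, reference_answer) for c in completions]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # Normalize within the sampled group; each completion's advantage then
    # weights its token log-probabilities in the policy-gradient update.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Hypothetical outcome reward: 1 if the final token matches the reference answer, else 0.
def exact_match(completion: str, reference: str) -> float:
    return float(completion.strip().split()[-1] == reference)

print(grpo_advantages(["... so the answer is 42", "... the answer is 41"], "42", exact_match))
```

Because the gradient signal is computed entirely from outcomes, there is no term that locks in one particular reasoning trace, which is the "less motivation to overfit" point above.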

So the key question is: do reasoning models have an enhanced model of the world compared to non-reasoning models? That is, are their representations more coherent and consistent, and less reliant on heuristics and statistical shortcuts? Based on their ability to generalize, plus the GRPO-style outcome-based RL, one might infer that they reinforce conceptual understanding and a consistent internal world model, rather than just memorizing chains of thought.

One expected outcome here is that their hallucination rate declines, even when they aren’t explicitly engaging in chain-of-thought. During post-training, if the model generates inconsistent information (hallucinations), those pathways get penalized because they lead to faulty reasoning and therefore incorrect answers. Hence, simply scaling RL might enrich the internal world models within LLMs, not only boosting reasoning performance but also refining the more intuitive facets of world knowledge, areas we typically attribute solely to pretraining.

What do you think?