Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

post by Matrice Jacobine · 2025-04-24T14:11:27.625Z · LW · GW · 0 comments

This is a link post for https://limit-of-rlvr.github.io/

Recent breakthroughs in reasoning-focused large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have largely relied on Reinforcement Learning with Verifiable Rewards (RLVR), which replaces human annotations with automated rewards (e.g., verified math solutions or passing code tests) to scale self-improvement. While RLVR enhances reasoning behaviors such as self-reflection and iterative refinement, we challenge a core assumption:

Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?
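To ground the setup: below is a minimal sketch of what a "verifiable reward" could look like for a math task, i.e. a binary signal from an automated checker rather than a human annotator. The `\boxed{}` answer convention and the `extract_final_answer` parser are illustrative assumptions, not the paper's actual implementation.

```python
import re

def extract_final_answer(output: str) -> str | None:
    # Hypothetical parser: take the last \boxed{...} expression,
    # a common convention for final answers on math benchmarks.
    matches = re.findall(r"\\boxed\{([^}]*)\}", output)
    return matches[-1].strip() if matches else None

def verifiable_reward(output: str, reference: str) -> float:
    # Binary verifiable reward: 1.0 iff the extracted final answer
    # exactly matches the reference; no human annotation involved.
    answer = extract_final_answer(output)
    return 1.0 if answer is not None and answer == reference else 0.0
```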

Evaluating models with pass@k, where success requires just one correct solution among k attempts, we find that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by their base models at high k (e.g., pass@256). This indicates that RLVR narrows the model's exploration, favoring known high-reward paths over the discovery of new reasoning strategies. Crucially, all correct solutions produced by RL-trained models already exist within the base model's output distribution, showing that RLVR improves sampling efficiency rather than reasoning capacity, while inadvertently shrinking the solution space.
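For reference, pass@k is typically computed with the unbiased estimator from Chen et al. (2021): given n sampled solutions per problem, c of which are correct, it estimates the probability that at least one of k random draws is correct. The sketch below assumes that standard estimator; the paper's exact evaluation code may differ.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021): probability that
    # at least one of k samples drawn from n total (c correct) succeeds.
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so some draw must be correct
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

On this metric, the paper's finding is that the curves cross: for a fixed budget of n samples per problem, the RL-trained model scores higher at k = 1 while the base model scores higher by k = 256.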
