TLDR: A new research paper introduces a principled framework and a practical recipe, ScaleRL, for predictably scaling reinforcement learning (RL) compute in large language models (LLMs). Through a massive 400,000 GPU-hour study, the authors fit sigmoidal compute-performance curves to extrapolate RL performance, revealing that some design choices affect asymptotic performance while others primarily modulate compute efficiency. ScaleRL, a combination of best practices, demonstrates state-of-the-art performance and predictable scaling across various compute axes, bringing RL training closer to the predictability seen in LLM pre-training.
Reinforcement Learning (RL) has become a cornerstone in the training of large language models (LLMs), enabling many of their advanced capabilities, from complex reasoning to agentic behaviors. However, unlike the well-understood scaling laws in LLM pre-training, the field of RL for LLMs has largely lacked a principled, predictive methodology for scaling compute. This has made it challenging to evaluate algorithmic improvements and understand how different design choices impact performance at scale.
A recent research paper, titled “The Art of Scaling Reinforcement Learning Compute for LLMs,” addresses this critical gap. Authored by Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal, this extensive study involved over 400,000 GPU-hours of experimentation. The researchers aimed to establish a scientific framework for analyzing and predicting RL scaling in LLMs, moving the methodology from an ‘art’ to a ‘science’.
The core of their framework involves fitting sigmoidal compute-performance curves to RL training data. These curves help predict how performance will evolve with increasing compute. The key parameters of this sigmoid are: ‘A’ (asymptotic performance, the maximum achievable reward), ‘B’ (scaling exponent, indicating compute efficiency), and ‘Cmid’ (the compute midpoint where half of the total gain is achieved). This framework allows researchers to extrapolate performance from smaller-scale runs to much larger compute budgets, significantly reducing the cost and time of experimentation.
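To make the curve-fitting idea concrete, here is a minimal sketch in Python. The functional form below (reward rising toward an asymptote A, with scaling exponent B and compute midpoint Cmid) is one plausible parameterization consistent with the description above; the paper's exact equation, data, and fitting procedure may differ, and the numbers here are synthetic.

```python
# Sketch: fit a sigmoidal compute-performance curve on cheap, low-compute
# measurements, then extrapolate to a larger budget. Synthetic data; the
# functional form is an assumption based on the description above.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, A, B, C_mid):
    """Expected reward at a given training compute (GPU-hours).

    A     : asymptotic performance (the ceiling on achievable reward)
    B     : scaling exponent (how efficiently compute converts to reward)
    C_mid : compute at which half of the asymptotic gain is reached
    """
    return A / (1.0 + (C_mid / compute) ** B)

# Hypothetical measurements from a small-scale run.
compute = np.array([100.0, 300.0, 1_000.0, 3_000.0, 10_000.0])  # GPU-hours
reward = np.array([0.05, 0.12, 0.24, 0.38, 0.51])               # mean pass rate

# Fit A, B, C_mid on the low-compute prefix of training ...
(A, B, C_mid), _ = curve_fit(
    sigmoid_perf, compute, reward,
    p0=[0.7, 1.0, 2_000.0],
    bounds=([0.0, 0.0, 0.0], [1.0, 10.0, 1e6]),
)
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f} GPU-hours")

# ... then extrapolate to a much larger budget before paying for it.
print("predicted reward at 100k GPU-hours:",
      round(float(sigmoid_perf(100_000.0, A, B, C_mid)), 3))
```

In this setup, a high fitted A signals a method with a high performance ceiling, while B and C_mid describe how quickly and cheaply that ceiling is approached, which is exactly the distinction the paper's three principles turn on.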
Through a comprehensive empirical study, the team identified three crucial principles:
RL Performance Ceilings are Not Universal
Different RL methods encounter varying ceilings on their achievable performance (A) as training compute scales. Choices like loss type and batch size can shift this limit.
Embracing the Bitter Lesson
Methods that appear superior at low compute budgets might perform worse when extrapolated to large-compute regimes. The framework helps identify truly scalable methods by estimating scaling parameters early on.
Re-evaluating Common Wisdom
Many interventions often thought to improve peak performance, such as loss aggregation, data curriculum, and advantage normalization, primarily modulate compute efficiency (B) rather than significantly altering the performance ceiling (A).
Based on these insights, the researchers propose a best-practice recipe called ScaleRL. ScaleRL integrates several existing methods rather than inventing new ones. Key components include an asynchronous Pipeline-RL setup, interruption-based length control, FP32 precision for logits, prompt-level loss aggregation, batch-level advantage normalization, truncated importance-sampling REINFORCE loss (CISPO), zero-variance filtering, and no-positive resampling. Each component’s contribution was validated through rigorous leave-one-out ablations.
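To see how these pieces fit together, here is an illustrative configuration sketch. The field names and values are invented for exposition and do not correspond to the authors' code or any released ScaleRL implementation; they simply restate the components listed above.

```python
# Illustrative only: these field names are invented for this sketch and are
# not an official ScaleRL API.
from dataclasses import dataclass

@dataclass
class ScaleRLConfig:
    # Generation/training overlap
    async_setup: str = "pipeline_rl"          # asynchronous Pipeline-RL setup
    length_control: str = "interruption"      # interrupt overlong generations
    # Numerics
    logits_dtype: str = "fp32"                # FP32 precision at the logits
    # Loss
    loss_type: str = "cispo"                  # truncated importance-sampling REINFORCE
    loss_aggregation: str = "prompt_level"    # aggregate per prompt, not per token
    advantage_norm: str = "batch_level"       # normalize advantages across the batch
    # Data handling
    zero_variance_filtering: bool = True      # drop prompts whose samples all agree
    no_positive_resampling: bool = True       # don't re-queue already-solved prompts
```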
ScaleRL not only scales predictably but also achieves state-of-the-art performance, showing higher asymptotic performance and better compute efficiency than established RL recipes such as DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, and MiniMax-M1. The recipe's effectiveness was most strikingly demonstrated in a single RL run scaled to 100,000 GPU-hours, where validation performance extrapolated from the early portion of training closely matched the actual results.
Furthermore, ScaleRL maintains predictable scaling across various training axes, including larger batch sizes, longer generation lengths (up to 32,768 tokens), multi-task RL (math and code), and larger Mixture-of-Experts (MoE) models (e.g., Llama-4 17B×16). The benefits consistently transferred to downstream tasks, highlighting the recipe’s robustness and generalizability.
This work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training. It offers a rigorous methodology for cost-effectively predicting the scalability of new RL algorithms. For more details, you can read the full paper here.


