TLDR: A new research paper introduces a principled framework and a practical recipe, ScaleRL, for predictably scaling reinforcement learning (RL) compute in large language models (LLMs). Through a massive 400,000 GPU-hour study, the authors fit sigmoidal compute-performance curves to extrapolate RL performance, revealing that some design choices affect asymptotic performance while others primarily modulate compute efficiency. ScaleRL, a combination of best practices, demonstrates state-of-the-art performance and predictable scaling across various compute axes, bringing RL training closer to the predictability seen in LLM pre-training.
Reinforcement Learning (RL) has become a cornerstone in the training of large language models (LLMs), enabling many of their advanced capabilities, from complex reasoning to agentic behaviors. However, unlike the well-understood scaling laws in LLM pre-training, the field of RL for LLMs has largely lacked a principled, predictive methodology for scaling compute. This has made it challenging to evaluate algorithmic improvements and understand how different design choices impact performance at scale.
A recent research paper, titled “The Art of Scaling Reinforcement Learning Compute for LLMs,” addresses this critical gap. Authored by Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal, this extensive study involved over 400,000 GPU-hours of experimentation. The researchers aimed to establish a scientific framework for analyzing and predicting RL scaling in LLMs, moving the methodology from an ‘art’ to a ‘science’.
The core of their framework involves fitting sigmoidal compute-performance curves to RL training data. These curves help predict how performance will evolve with increasing compute. The key parameters of this sigmoid are: ‘A’ (asymptotic performance, the maximum achievable reward), ‘B’ (scaling exponent, indicating compute efficiency), and ‘Cmid’ (the compute midpoint where half of the total gain is achieved). This framework allows researchers to extrapolate performance from smaller-scale runs to much larger compute budgets, significantly reducing the cost and time of experimentation.
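To make the curve-fitting idea concrete, here is a minimal sketch in Python. The functional form below (reward rising toward an asymptote A, with scaling exponent B and compute midpoint Cmid) is one plausible parameterization consistent with the description above; the paper's exact equation, data, and fitting procedure may differ, and the numbers here are synthetic.

```python
# Sketch: fit a sigmoidal compute-performance curve on cheap, low-compute
# measurements, then extrapolate to a larger budget. Synthetic data; the
# functional form is an assumption based on the description above.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, A, B, C_mid):
    """Expected reward at a given training compute (GPU-hours).

    A     : asymptotic performance (the ceiling on achievable reward)
    B     : scaling exponent (how efficiently compute converts to reward)
    C_mid : compute at which half of the asymptotic gain is reached
    """
    return A / (1.0 + (C_mid / compute) ** B)

# Hypothetical measurements from a small-scale run.
compute = np.array([100.0, 300.0, 1_000.0, 3_000.0, 10_000.0])  # GPU-hours
reward = np.array([0.05, 0.12, 0.24, 0.38, 0.51])               # mean pass rate

# Fit A, B, C_mid on the low-compute prefix of training ...
(A, B, C_mid), _ = curve_fit(
    sigmoid_perf, compute, reward,
    p0=[0.7, 1.0, 2_000.0],
    bounds=([0.0, 0.0, 0.0], [1.0, 10.0, 1e6]),
)
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f} GPU-hours")

# ... then extrapolate to a much larger budget before paying for it.
print("predicted reward at 100k GPU-hours:",
      round(float(sigmoid_perf(100_000.0, A, B, C_mid)), 3))
```

In this setup, a high fitted A signals a method with a high performance ceiling, while B and C_mid describe how quickly and cheaply that ceiling is approached, which is exactly the distinction the paper's three principles turn on.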
Through a comprehensive empirical study, the team identified three crucial principles:
RL Performance Ceilings are Not Universal
Different RL methods encounter varying ceilings on their achievable performance (A) as training compute scales. Choices like loss type and batch size can shift this limit.
Embracing the Bitter Lesson
Methods that appear superior at low compute budgets might perform worse when extrapolated to large-compute regimes. The framework helps identify truly scalable methods by estimating scaling parameters early on.
Re-evaluating Common Wisdom
Many interventions often thought to improve peak performance, such as loss aggregation, data curriculum, and advantage normalization, primarily modulate compute efficiency (B) rather than significantly altering the performance ceiling (A).
Based on these insights, the researchers propose a best-practice recipe called ScaleRL. ScaleRL integrates several existing methods rather than inventing new ones. Key components include an asynchronous Pipeline-RL setup, interruption-based length control, FP32 precision for logits, prompt-level loss aggregation, batch-level advantage normalization, truncated importance-sampling REINFORCE loss (CISPO), zero-variance filtering, and no-positive resampling. Each component’s contribution was validated through rigorous leave-one-out ablations.
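To see how these pieces fit together, here is an illustrative configuration sketch. The field names and values are invented for exposition and do not correspond to the authors' code or any released ScaleRL implementation; they simply restate the components listed above.

```python
# Illustrative only: these field names are invented for this sketch and are
# not an official ScaleRL API.
from dataclasses import dataclass

@dataclass
class ScaleRLConfig:
    # Generation/training overlap
    async_setup: str = "pipeline_rl"          # asynchronous Pipeline-RL setup
    length_control: str = "interruption"      # interrupt overlong generations
    # Numerics
    logits_dtype: str = "fp32"                # FP32 precision at the logits
    # Loss
    loss_type: str = "cispo"                  # truncated importance-sampling REINFORCE
    loss_aggregation: str = "prompt_level"    # aggregate per prompt, not per token
    advantage_norm: str = "batch_level"       # normalize advantages across the batch
    # Data handling
    zero_variance_filtering: bool = True      # drop prompts whose samples all agree
    no_positive_resampling: bool = True       # don't re-queue already-solved prompts
```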
ScaleRL not only scales predictably but also achieves state-of-the-art performance, showing higher asymptotic performance and better compute efficiency than established RL recipes such as DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, and MiniMax-M1. The recipe's effectiveness was most strikingly demonstrated in a single RL run scaled to 100,000 GPU-hours, where validation performance extrapolated from the early portion of training closely matched the actual results.
Furthermore, ScaleRL maintains predictable scaling across various training axes, including larger batch sizes, longer generation lengths (up to 32,768 tokens), multi-task RL (math and code), and larger Mixture-of-Experts (MoE) models (e.g., Llama-4 17B×16). The benefits consistently transferred to downstream tasks, highlighting the recipe’s robustness and generalizability.
This work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training. It offers a rigorous methodology for cost-effectively predicting the scalability of new RL algorithms. For more details, you can read the full paper here.


