UloRL: Boosting LLM Reasoning with Efficient Ultra-Long Output Training

TLDR: UloRL is a new reinforcement learning approach designed to improve the reasoning abilities of Large Language Models (LLMs) by efficiently handling ultra-long output sequences. It addresses traditional RL inefficiencies through ‘segment rollouts’ for faster training and ‘Dynamic Masking of Well-Mastered Positive Tokens’ (DMMPTs) to prevent entropy collapse. Combined with a generative verifier for accurate rewards and rigorous data cleaning, UloRL significantly enhances LLM performance on complex reasoning tasks, even allowing smaller models to outperform larger ones.

Large Language Models (LLMs) have made incredible strides in complex tasks like mathematics and programming, largely thanks to a technique called reinforcement learning with verifiable rewards (RLVR). This method uses rule-based systems to check final answers, providing a strong signal for the model to learn and generate correct, well-reasoned solutions, often through very long chains of thought.

However, a significant challenge arises when these models need to produce extremely long outputs, sometimes up to 128,000 tokens. Traditional reinforcement learning struggles here because all samples in a training batch must finish decoding before the next step can begin. A few very long outputs can therefore hold up the entire batch, leaving compute idle while the shorter samples wait.

Introducing UloRL: A New Approach for Ultra-Long Outputs

To tackle these issues, researchers have developed UloRL, or Ultra-Long Output Reinforcement Learning. This innovative approach introduces several key techniques to make training LLMs with ultra-long outputs more efficient and effective.

Segment Rollouts: Speeding Up Training

One of UloRL’s core ideas is ‘segment rollouts’. Instead of waiting for an entire ultra-long output to complete, the decoding process is divided into smaller segments. As soon as a segment is decoded, or if the entire output is complete, that data can immediately be used for training. Incomplete outputs simply continue decoding in the next step. For example, an output of 128,000 tokens might be broken into eight segments of 16,000 tokens each. This significantly boosts training speed; experiments showed a 2.06x increase in speed when using four segments compared to one.
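To make the scheduling idea concrete, here is a minimal simulation of the segment loop. It is only a sketch: the `Rollout` class, the `segment_rollout_step` function and the sample lengths are hypothetical, no real model is decoded, and it assumes that only completed outputs are handed to the trainer while the rest carry over.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    target_len: int   # simulated final length of this output, in tokens
    decoded: int = 0  # tokens decoded so far

    @property
    def done(self) -> bool:
        return self.decoded >= self.target_len


def segment_rollout_step(pool, segment_len):
    """Decode at most `segment_len` more tokens for every unfinished sample.

    Returns (finished, unfinished). Finished samples are ready for training;
    unfinished ones continue decoding in the next step.
    """
    finished, unfinished = [], []
    for r in pool:
        r.decoded = min(r.target_len, r.decoded + segment_len)
        (finished if r.done else unfinished).append(r)
    return finished, unfinished


max_len, num_segments = 128_000, 8
segment_len = max_len // num_segments  # 16,000 tokens per segment

# Three hypothetical outputs: short, medium, and maximum length.
pool = [Rollout(4_000), Rollout(20_000), Rollout(128_000)]
steps_to_finish = []
step = 0
while pool:
    step += 1
    finished, pool = segment_rollout_step(pool, segment_len)
    steps_to_finish.extend((r.target_len, step) for r in finished)

print(steps_to_finish)
```

In this toy run the short output is available after the first step and the medium one after the second, rather than both waiting for the 128,000-token output to finish.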

To ensure stable training with these segments, UloRL employs ‘Pseudo On-policy Importance Sampling’ (POIS). This method helps the model learn effectively even when parts of the output were generated by slightly older versions of the model, mimicking the benefits of on-policy training where all data is generated by the current model.
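The article does not spell out the POIS formula, so the following is only one plausible reading, not the authors' implementation. It contrasts ordinary importance ratios, computed against the log-probabilities recorded when each token was sampled, with a "pseudo on-policy" variant that treats the current policy as the sampler for every segment. All numbers and names are made up for illustration.

```python
import math


def importance_ratios(new_logps, behavior_logps):
    """Per-token ratio pi_new(token) / pi_behavior(token) from log-probs."""
    return [math.exp(n - b) for n, b in zip(new_logps, behavior_logps)]


# Hypothetical log-probs for a 4-token output built from two segments.
# Tokens 0-1 were sampled by an older policy, tokens 2-3 by the current one.
logps_when_sampled = [-0.90, -1.20, -0.40, -0.70]  # recorded at sampling time
logps_current = [-0.60, -1.00, -0.40, -0.70]       # recomputed with current policy

# Off-policy view: compare against the policy that actually sampled each token.
stale_ratios = importance_ratios(logps_current, logps_when_sampled)

# Pseudo on-policy view (assumption): use the current policy as the reference
# for all segments, so every ratio starts at exactly 1 before the update.
pseudo_ratios = importance_ratios(logps_current, logps_current)

print(stale_ratios)
print(pseudo_ratios)
```

Under this reading, tokens from older segments no longer start with ratios far from 1, which is what makes the update behave like on-policy training.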

Dynamic Masking of Well-Mastered Positive Tokens (DMMPTs): Preventing Entropy Collapse

Another common problem in reinforcement learning is ‘entropy collapse’, where the model’s diversity in generating responses diminishes too quickly, leading to suboptimal performance. UloRL addresses this by identifying ‘Well-Mastered Positive Tokens’ (MPTs) – tokens the model already predicts with very high confidence in correct answers. The UloRL approach, called Dynamic Masking of MPTs (DMMPTs), adaptively controls whether these MPTs are included in training. If the model’s diversity (entropy) drops below a certain level, these well-mastered tokens are temporarily excluded from training. This prevents the model from over-optimizing on what it already knows, helping it maintain a healthy level of exploration and diversity in its outputs.
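The masking rule described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name `dmmpt_mask`, the entropy target of 0.2 and the confidence threshold of 0.99 are placeholders chosen for the example, not values taken from this article.

```python
def dmmpt_mask(token_probs, is_positive_sample, batch_entropy,
               target_entropy=0.2, mastery_threshold=0.99):
    """Return one bool per token: True means the token contributes to the loss.

    A token is treated as a well-mastered positive token (MPT) when it belongs
    to a correct (positive) sample and the model assigns it a probability of
    at least `mastery_threshold`. MPTs are masked only while the measured
    entropy is below `target_entropy`.
    """
    entropy_too_low = batch_entropy < target_entropy
    mask = []
    for p in token_probs:
        well_mastered = is_positive_sample and p >= mastery_threshold
        mask.append(not (entropy_too_low and well_mastered))
    return mask


probs = [0.999, 0.5, 0.995]

# Entropy below target: the two high-confidence tokens are excluded.
low_entropy_mask = dmmpt_mask(probs, is_positive_sample=True, batch_entropy=0.1)

# Entropy above target: every token is trained on as usual.
healthy_mask = dmmpt_mask(probs, is_positive_sample=True, batch_entropy=0.3)

# Tokens from incorrect answers are never masked by this rule.
negative_mask = dmmpt_mask(probs, is_positive_sample=False, batch_entropy=0.1)
```

Because the decision depends on the current entropy, the masking switches itself off again once diversity recovers, which is the "dynamic" part of the name.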

Generative Verifier Model: Ensuring Accurate Rewards

For reinforcement learning to work, the model needs accurate feedback, or ‘rewards’. Traditional rule-based systems for checking if an answer is correct can sometimes make mistakes, especially with complex or semantically equivalent answers (like “27cm” and “0.27m”). UloRL incorporates a ‘generative verifier model’ trained to understand if two answers are semantically equivalent, leading to more precise reward signals for the LLM.
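A short sketch shows why a purely rule-based check falls short and what a verifier-based reward might look like. Everything here is hypothetical: the prompt wording, the function names, and the `judge` callable, which stands in for whatever verifier model is actually used and is replaced by a stub so the example runs on its own.

```python
def exact_match_reward(reference: str, prediction: str) -> float:
    """Rule-based check: reward 1.0 only when the strings are identical."""
    return 1.0 if reference.strip() == prediction.strip() else 0.0


def build_verifier_prompt(question: str, reference: str, prediction: str) -> str:
    """Hypothetical prompt asking a verifier model for a yes/no judgement."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Are the two answers semantically equivalent? Reply Yes or No."
    )


def verifier_reward(question, reference, prediction, judge) -> float:
    """`judge` is any callable that maps a prompt string to a text reply."""
    reply = judge(build_verifier_prompt(question, reference, prediction))
    return 1.0 if reply.strip().lower().startswith("yes") else 0.0


question = "How long is the rod?"

# The rule-based check rejects a correct answer written in different units.
rule_based = exact_match_reward("27cm", "0.27m")

# Stub judges standing in for a real verifier model's replies.
accepted = verifier_reward(question, "27cm", "0.27m", judge=lambda p: "Yes")
rejected = verifier_reward(question, "27cm", "0.3m", judge=lambda p: "No")
```

The reward is only as good as the verifier behind `judge`, which is why the model is trained specifically for this equivalence task.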

Refining the Data: Quality Matters

The quality of training data is crucial. UloRL includes extensive data cleaning and transformation steps. This involves removing questions with multiple sub-questions, converting various question formats into short-answer types, and filtering out overly simple or incorrectly answered questions. This meticulous data preparation ensures that the model learns from high-quality, unambiguous examples.
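As a rough illustration, the filtering part of these steps could look like the snippet below. The field names (`num_sub_questions`, `pass_rate`, `reference_verified`) and the 0.9 cutoff are invented for the example, and it reads "incorrectly answered" as questions whose reference answer failed a check; the article does not describe the actual pipeline at this level of detail.

```python
def keep_question(item, max_pass_rate=0.9):
    """Decide whether a question stays in the training set."""
    if item["num_sub_questions"] > 1:
        return False  # several sub-questions make a single reward ambiguous
    if item["pass_rate"] > max_pass_rate:
        return False  # the model already solves it almost every time
    if not item["reference_verified"]:
        return False  # the reference answer itself looks wrong
    return True


# Hypothetical records; `pass_rate` is the fraction of sampled answers
# that were judged correct.
items = [
    {"id": "a", "num_sub_questions": 1, "pass_rate": 0.4, "reference_verified": True},
    {"id": "b", "num_sub_questions": 3, "pass_rate": 0.4, "reference_verified": True},
    {"id": "c", "num_sub_questions": 1, "pass_rate": 1.0, "reference_verified": True},
    {"id": "d", "num_sub_questions": 1, "pass_rate": 0.4, "reference_verified": False},
]

kept = [q["id"] for q in items if keep_question(q)]
print(kept)
```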

Impressive Results

UloRL has shown remarkable improvements. When applied to the Qwen3-30B-A3B model, training with 128,000-token outputs boosted its performance on the AIME2025 benchmark from 70.9% to 85.1%, and on BeyondAIME from 50.7% to 61.9%. These gains are so significant that the UloRL-trained Qwen3-30B-A3B even outperformed the much larger Qwen3-235B-A22B model. The research clearly demonstrates that extending the output length, combined with UloRL’s innovative training methods, is a powerful way to enhance the reasoning capabilities of large language models. For more technical details, you can refer to the original research paper.
