UloRL: Boosting LLM Reasoning with Efficient Ultra-Long Output Training

TLDR: UloRL is a new reinforcement learning approach designed to improve the reasoning abilities of Large Language Models (LLMs) by efficiently handling ultra-long output sequences. It addresses traditional RL inefficiencies through ‘segment rollouts’ for faster training and ‘Dynamic Masking of Well-Mastered Positive Tokens’ (DMMPTs) to prevent entropy collapse. Combined with a generative verifier for accurate rewards and rigorous data cleaning, UloRL significantly enhances LLM performance on complex reasoning tasks, even allowing smaller models to outperform larger ones.

Large Language Models (LLMs) have made incredible strides in complex tasks like mathematics and programming, largely thanks to a technique called reinforcement learning with verifiable rewards (RLVR). This method uses rule-based systems to check final answers, providing a strong signal for the model to learn and generate correct, well-reasoned solutions, often through very long chains of thought.

However, a significant challenge arises when these models need to produce extremely long outputs, sometimes up to 128,000 tokens. Traditional reinforcement learning struggles here because all samples in a training batch must finish decoding before the next step can begin. A few very long outputs can therefore stall the entire batch, leaving hardware idle while it waits and wasting computational resources.

Introducing UloRL: A New Approach for Ultra-Long Outputs

To tackle these issues, researchers have developed UloRL, or Ultra-Long Output Reinforcement Learning. This innovative approach introduces several key techniques to make training LLMs with ultra-long outputs more efficient and effective.

Segment Rollouts: Speeding Up Training

One of UloRL’s core ideas is ‘segment rollouts’. Instead of waiting for an entire ultra-long output to complete, the decoding process is divided into smaller segments, and each training step decodes only one segment. Outputs that finish within that segment can immediately be used for training, while incomplete outputs simply continue decoding in the next step. For example, an output of 128,000 tokens might be broken into eight segments of 16,000 tokens each. This significantly boosts training speed; experiments showed a 2.06x speedup when using four segments compared to one.
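
The scheduling idea can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' implementation: the `Sample` class, the `train_with_segments` function, the top-up rule for the batch, and the idea that the policy version equals the step index are all hypothetical, and decoding is simulated by counting tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    target_len: int                      # simulated length at which the output ends
    tokens: int = 0                      # tokens decoded so far
    policy_versions: list = field(default_factory=list)  # policy used per segment

    @property
    def done(self) -> bool:
        return self.tokens >= self.target_len


def train_with_segments(target_lens, segment_len, batch_size):
    """Simulate segment rollouts; returns (number of steps, samples in finish order)."""
    pending = [Sample(n) for n in target_lens]   # prompts not yet started
    in_flight = []                               # unfinished samples carried across steps
    finished_order = []
    step = 0
    while pending or in_flight:
        # Assumption: freed slots are refilled with fresh prompts.
        while pending and len(in_flight) < batch_size:
            in_flight.append(pending.pop(0))
        # Decode at most one segment per sample in this step.
        for s in in_flight:
            s.tokens = min(s.target_len, s.tokens + segment_len)
            s.policy_versions.append(step)       # assumes one policy update per step
        # Finished samples would be handed to the trainer here.
        finished_order.extend(s for s in in_flight if s.done)
        in_flight = [s for s in in_flight if not s.done]
        step += 1
    return step, finished_order


steps, finished = train_with_segments([16_000, 128_000], segment_len=16_000, batch_size=2)
```

In this toy run the short sample is ready for training after the first step, while the 128,000-token sample spans eight segments, each decoded by a different (simulated) policy version. That mix of policy versions within one output is what makes the importance-sampling question in the next paragraph relevant.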

To ensure stable training with these segments, UloRL employs ‘Pseudo On-policy Importance Sampling’ (POIS). This method helps the model learn effectively even when parts of the output were generated by slightly older versions of the model, mimicking the benefits of on-policy training where all data is generated by the current model.
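
The article does not spell out the POIS formula, so the following is only one plausible reading, shown on made-up numbers. It contrasts a standard per-token importance ratio, computed against the older policy that actually generated each segment, with a "pseudo on-policy" ratio computed against log-probabilities recomputed under the latest policy. The function names, the clipping range `eps`, and all log-probability values are assumptions for illustration.

```python
import math


def token_ratios(logp_new, logp_ref):
    """Per-token importance ratios pi_new / pi_ref from log-probabilities."""
    return [math.exp(n - r) for n, r in zip(logp_new, logp_ref)]


def clipped_surrogate(logp_new, logp_ref, advantage, eps=0.2):
    """Mean PPO-style clipped objective over tokens (a quantity to maximise)."""
    total = 0.0
    for ratio in token_ratios(logp_new, logp_ref):
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * advantage, clipped * advantage)
    return total / len(logp_new)


# Hypothetical 4-token output: the first two tokens came from an older policy.
logp_current = [-0.10, -0.20, -0.30, -0.40]    # under the latest policy
logp_behavior = [-0.90, -0.80, -0.30, -0.40]   # under the policies that generated them

stale_ratios = token_ratios(logp_current, logp_behavior)   # drift shows up as ratios > 1
pseudo_ratios = token_ratios(logp_current, logp_current)   # all exactly 1.0
stale_objective = clipped_surrogate(logp_current, logp_behavior, advantage=1.0)
```

With stale reference probabilities, the tokens from the older segment have ratios well outside the clipping range, so their contribution is clipped. Under the pseudo on-policy reading every ratio starts at 1, which is the sense in which training would mimic the on-policy case.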

Dynamic Masking of Well-Mastered Positive Tokens (DMMPTs): Preventing Entropy Collapse

Another common problem in reinforcement learning is ‘entropy collapse’, where the model’s diversity in generating responses diminishes too quickly, leading to suboptimal performance. UloRL addresses this by identifying ‘Well-Mastered Positive Tokens’ (MPTs) – tokens the model already predicts with very high confidence in correct answers. The UloRL approach, called Dynamic Masking of MPTs (DMMPTs), adaptively controls whether these MPTs are included in training. If the model’s diversity (entropy) drops below a certain level, these well-mastered tokens are temporarily excluded from training. This prevents the model from over-optimizing on what it already knows, helping it maintain a healthy level of exploration and diversity in its outputs.
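
A minimal sketch of the masking rule described above, assuming hypothetical threshold values: `target_entropy` and `mastery_threshold` are placeholders chosen for illustration, and `dmmpt_mask` is not a function from the paper. The mask would multiply the per-token loss, so a value of 0.0 removes a token from the update.

```python
def dmmpt_mask(token_probs, is_positive_sample, batch_entropy,
               target_entropy=0.2, mastery_threshold=0.99):
    """Return a per-token loss mask: 1.0 keeps the token, 0.0 excludes it.

    token_probs: probability the current policy assigns to each sampled token.
    is_positive_sample: True if the output was judged correct.
    batch_entropy: current mean policy entropy, measured elsewhere.
    """
    entropy_too_low = batch_entropy < target_entropy
    mask = []
    for p in token_probs:
        # A well-mastered positive token: confident prediction in a correct answer.
        well_mastered = is_positive_sample and p >= mastery_threshold
        mask.append(0.0 if (entropy_too_low and well_mastered) else 1.0)
    return mask


probs = [0.999, 0.50, 0.995]
masked = dmmpt_mask(probs, is_positive_sample=True, batch_entropy=0.1)
unmasked = dmmpt_mask(probs, is_positive_sample=True, batch_entropy=0.5)
```

The masking is dynamic in the sense that the same tokens are excluded only while entropy is below the target; once entropy recovers, they are trained on again.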

Generative Verifier Model: Ensuring Accurate Rewards

For reinforcement learning to work, the model needs accurate feedback, or ‘rewards’. Traditional rule-based systems for checking if an answer is correct can sometimes make mistakes, especially with complex or semantically equivalent answers (like “27cm” and “0.27m”). UloRL incorporates a ‘generative verifier model’ trained to understand if two answers are semantically equivalent, leading to more precise reward signals for the LLM.
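
To make the failure mode concrete, here is a small illustration of why a naive rule-based check rejects the “27cm” versus “0.27m” pair. The unit-aware comparison below is a hand-written stand-in, not UloRL's generative verifier (which is a trained model); the function names and the unit table are assumptions for this example only.

```python
import re

# Conversion factors to metres for the few units this toy example understands.
UNIT_TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def exact_match(pred: str, ref: str) -> bool:
    """Naive rule: answers must be identical strings."""
    return pred.strip() == ref.strip()


def parse_length(text: str):
    """Parse strings such as '27cm' into metres; return None if not a length."""
    m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(mm|cm|km|m)\s*", text)
    if m is None:
        return None
    return float(m.group(1)) * UNIT_TO_METRES[m.group(2)]


def length_equivalent(pred: str, ref: str, tol: float = 1e-9) -> bool:
    """Compare two lengths after converting both to metres."""
    a, b = parse_length(pred), parse_length(ref)
    return a is not None and b is not None and abs(a - b) <= tol


naive = exact_match("27cm", "0.27m")          # the rule-based check rejects a correct answer
aware = length_equivalent("27cm", "0.27m")    # a semantic comparison accepts it
```

Hand-written rules like this only cover the cases someone anticipated, which is the motivation for using a learned verifier that judges semantic equivalence more generally.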

Refining the Data: Quality Matters

The quality of training data is crucial. UloRL includes extensive data cleaning and transformation steps. This involves removing questions with multiple sub-questions, converting various question formats into short-answer types, and filtering out overly simple or incorrectly answered questions. This meticulous data preparation ensures that the model learns from high-quality, unambiguous examples.
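
The filtering part of this pipeline might look roughly like the sketch below. The record fields (`num_subquestions`, `pass_rate`, `reference_verified`) and the `max_pass_rate` cutoff are hypothetical, since the article does not describe the actual schema or thresholds, and the format-conversion step is omitted because it would need a model or manual rewriting.

```python
def keep_example(example: dict, max_pass_rate: float = 0.9) -> bool:
    """Decide whether a training question survives cleaning (illustrative rules)."""
    if example["num_subquestions"] > 1:        # drop multi-part questions
        return False
    if example["pass_rate"] > max_pass_rate:   # drop questions the model already finds easy
        return False
    if not example["reference_verified"]:      # drop questions whose reference answer is suspect
        return False
    return True


# Made-up records for illustration only.
dataset = [
    {"id": "q1", "num_subquestions": 1, "pass_rate": 0.30, "reference_verified": True},
    {"id": "q2", "num_subquestions": 3, "pass_rate": 0.30, "reference_verified": True},
    {"id": "q3", "num_subquestions": 1, "pass_rate": 0.98, "reference_verified": True},
    {"id": "q4", "num_subquestions": 1, "pass_rate": 0.10, "reference_verified": False},
]
kept_ids = [ex["id"] for ex in dataset if keep_example(ex)]
```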

Impressive Results

UloRL has shown remarkable improvements. When applied to the Qwen3-30B-A3B model, training with 128,000-token outputs boosted its performance on the AIME2025 benchmark from 70.9% to 85.1%, and on BeyondAIME from 50.7% to 61.9%. These gains are so significant that the UloRL-trained Qwen3-30B-A3B even outperformed the much larger Qwen3-235B-A22B model. The research clearly demonstrates that extending the output length, combined with UloRL’s innovative training methods, is a powerful way to enhance the reasoning capabilities of large language models. For more technical details, you can refer to the original research paper.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]