
Boosting Language Model Reasoning with Reinforcement Learning on Pre-Training Data

TLDR: Reinforcement Learning on Pre-Training data (RLPT) is a new method for optimizing large language models (LLMs) that uses reinforcement learning directly on vast amounts of unlabeled pre-training data. Unlike traditional methods that rely on human feedback, RLPT teaches LLMs to predict subsequent text segments, rewarding accurate predictions. This approach significantly improves reasoning capabilities across general and mathematical tasks, demonstrating strong scalability and offering a foundation for further advancements in LLM training.

Large Language Models (LLMs) have made incredible strides, powering everything from conversational AI to autonomous agents. However, their growth is increasingly challenged by two main factors: the sheer computational resources needed for training and the dwindling supply of high-quality text data. To tackle this, researchers have introduced a novel approach called Reinforcement Learning on Pre-Training data (RLPT).

RLPT represents a significant shift in how LLMs are optimized. Traditionally, scaling LLMs has relied heavily on supervised learning, where models learn from explicitly labeled data. In contrast, RLPT lets the model learn autonomously from vast amounts of unlabeled pre-training data using reinforcement learning (RL): the model explores different ways to understand and generate text, improving its abilities without human-provided reward signals.

The core innovation of RLPT lies in its “next-segment reasoning objective.” Instead of requiring human annotations for rewards, as seen in methods like Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning with Verifiable Rewards (RLVR), RLPT generates its own reward signals directly from the pre-training data. It does this by challenging the model to accurately predict subsequent text segments based on the preceding context. This objective encourages the model to explore richer and broader reasoning pathways, leading to more generalized reasoning skills.
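To make this concrete, below is a minimal sketch of one RLPT-style training step. The policy.generate, policy.update, and reward_fn interfaces are hypothetical stand-ins for illustration; the paper's actual rollout and optimization details are not reproduced here.

    # Hedged sketch of a single RLPT training step. All interfaces here
    # (policy.generate, policy.update, reward_fn) are hypothetical stand-ins.
    def rlpt_step(policy, batch, reward_fn):
        samples, rewards = [], []
        for example in batch:
            # The policy reasons over the context and proposes the next segment.
            prediction = policy.generate(example["context"])
            # The reward comes from the pre-training data itself: semantic
            # agreement between the prediction and the true next segment.
            rewards.append(reward_fn(prediction, example["target"]))
            samples.append((example["context"], prediction))
        # Any policy-gradient style update could consume these rollouts.
        policy.update(samples, rewards)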

How RLPT Works

RLPT trains the model on two complementary next-segment reasoning tasks:

  • Autoregressive Segment Reasoning (ASR): the model is given a context and must predict the complete next sentence. This matches how LLMs typically generate text sequentially.
  • Middle Segment Reasoning (MSR): the model is presented with a context that has masked (hidden) tokens in the middle, and must use both the preceding and following text to infer the missing segment. This is particularly useful for tasks like code completion or understanding context from both directions. A sketch of how both tasks might be constructed from raw text follows this list.
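As referenced above, here is a hedged sketch of how ASR and MSR examples might be carved out of a raw pre-training document. The sentence-level segmentation, field names, and 50/50 task sampling are illustrative assumptions, not the paper's exact data pipeline.

    # Hypothetical construction of ASR and MSR examples from raw text.
    import random

    def split_sentences(text: str) -> list[str]:
        # Naive sentence splitter, for illustration only.
        return [s.strip() + "." for s in text.split(".") if s.strip()]

    def make_asr_example(sentences: list[str], i: int) -> dict:
        # ASR: predict segment i given everything before it.
        return {
            "task": "ASR",
            "context": " ".join(sentences[:i]),
            "target": sentences[i],
        }

    def make_msr_example(sentences: list[str], i: int) -> dict:
        # MSR: infer segment i from both the preceding and following text.
        return {
            "task": "MSR",
            "context": " ".join(sentences[:i]) + " [MASK] " + " ".join(sentences[i + 1:]),
            "target": sentences[i],
        }

    def sample_example(text: str) -> dict:
        sentences = split_sentences(text)
        assert len(sentences) >= 3, "need room for a prefix, target, and suffix"
        i = random.randrange(1, len(sentences) - 1)
        # Interleave the two objectives during training.
        return random.choice([make_asr_example, make_msr_example])(sentences, i)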

During training, these ASR and MSR tasks are interleaved, allowing the model to simultaneously improve its autoregressive generation and its bidirectional understanding of context. The reward for the model’s predictions is produced by a “generative reward model,” which assesses the semantic consistency between the predicted segment and the actual subsequent text. This reward is designed to be flexible, tolerating linguistic variation as long as the core meaning is preserved.
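A minimal sketch of such a reward, assuming the judge is any callable that takes a prompt and returns a judge LLM's text completion; the prompt wording and the binary 0/1 scoring below are illustrative assumptions.

    # Hedged sketch of a generative reward: a judge LLM checks whether the
    # predicted segment is semantically consistent with the reference text.
    from typing import Callable

    JUDGE_PROMPT = """You are grading a predicted continuation against a reference.
    Answer YES if the prediction preserves the meaning of the reference,
    allowing different wording; otherwise answer NO.

    Reference: {reference}
    Prediction: {prediction}
    Answer:"""

    def segment_reward(prediction: str, reference: str,
                       judge: Callable[[str], str]) -> float:
        # judge is any function that sends a prompt to a judge LLM and
        # returns its completion (hypothetical interface).
        prompt = JUDGE_PROMPT.format(reference=reference, prediction=prediction)
        verdict = judge(prompt).strip().upper()
        # Flexible scoring: semantic consistency earns full reward.
        return 1.0 if verdict.startswith("YES") else 0.0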

Impressive Results and Future Potential

Extensive experiments validate the effectiveness of RLPT across various benchmarks. On general-domain tasks such as MMLU, MMLU-Pro, GPQA-Diamond, and KOR-Bench, RLPT consistently delivered substantial improvements. For instance, when applied to the Qwen3-4B-Base model, it showed absolute gains of 3.0 points on MMLU and 8.1 points on GPQA-Diamond. Similar gains were observed on mathematical reasoning benchmarks such as AIME24 and AIME25, with Pass@1 improvements of 6.6 and 5.3 points, respectively.

Beyond standalone performance, RLPT also strengthens the foundational reasoning capabilities of LLMs, making it an excellent starting point for other advanced RL strategies such as RLVR. When RLPT was used as an initialization for RLVR, it further boosted performance on mathematical reasoning tasks, suggesting that RLPT improves both the model’s ability to exploit known information and its ability to explore new solutions.

A key finding from the research is that RLPT’s performance follows a favorable scaling law with respect to training compute. This suggests that as more computational resources become available, RLPT has strong potential for continued gains, pushing the boundaries of what LLMs can achieve. The paper also highlights that RLPT fosters structured reasoning processes within the LLM, akin to human problem-solving, which contributes to its effectiveness.
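The article does not give the fitted functional form, but compute-scaling trends of this kind are often summarized with a saturating power law. Purely as an illustration of what such a fit looks like (not the paper's exact equation), with P(C) the benchmark performance at training compute C and P_max, a, b > 0 fitted constants:

    P(C) = P_{\max} - a \cdot C^{-b}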

This innovative approach, detailed in the paper Reinforcement Learning on Pre-Training Data, offers a promising path forward for training more capable and generalizable large language models by leveraging the vast, untapped potential of unlabeled data.

