TLDR: A new method called Rollout Routing Replay (R3) addresses instability in Reinforcement Learning (RL) for Mixture-of-Experts (MoE) models. MoE models often suffer from training collapse because their “routers” (which select specialized experts) behave differently during inference (generating data) and training (updating the model). R3 fixes this by recording the expert selections made during inference and replaying them during training, ensuring consistency. This significantly reduces discrepancies, stabilizes training, prevents collapse, and improves performance without slowing down the process.
Reinforcement Learning (RL) has become a cornerstone for enhancing the capabilities of large language models (LLMs), enabling them to tackle complex problems from advanced mathematics to practical coding tasks. However, a significant challenge arises when applying RL to Mixture-of-Experts (MoE) models: instability in the training process, often leading to catastrophic collapse.
A recent research paper, “Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers”, delves into this critical issue. Authored by Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo from Peking University and LLM-Core Xiaomi, the paper identifies a fundamental inconsistency in MoE models’ routing mechanisms as the primary culprit behind this instability.
The Root of the Problem: Routing Discrepancies
MoE models work by dynamically selecting a subset of specialized “experts” for each input token via a component called a router. The researchers found a notable discrepancy in how these routers behave during the inference phase (when the model generates responses) and the training phase (when the model learns from those responses). Because of this training-inference inconsistency, the same input can lead to different expert selections in the two phases. What’s more, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes, introducing noise and making the RL process unreliable.
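To make the failure mode concrete, consider a standard top-k softmax router. The sketch below is our own illustration, not code from the paper: it perturbs the router logits by a tiny amount (the 1e-3 magnitude is an assumption), mimicking the numerical differences that can separate training and inference kernels, and shows that the selected expert set can flip whenever two expert scores are nearly tied.

```python
# Illustration only: a standard top-k MoE router, showing how tiny numerical
# differences between two engines can change which experts a token is sent to.
import torch

torch.manual_seed(0)

def top_k_route(router_logits: torch.Tensor, k: int = 2):
    """Pick the k highest-scoring experts for each token."""
    gate_probs = torch.softmax(router_logits, dim=-1)
    gate_vals, expert_idx = torch.topk(gate_probs, k, dim=-1)
    return expert_idx, gate_vals

num_experts = 8
logits_inference = torch.randn(1, num_experts)  # router scores in the rollout engine
# The training engine sees the "same" token, but different kernels and fusions
# can produce logits that differ at roughly the 1e-3 level (assumed magnitude).
logits_training = logits_inference + 1e-3 * torch.randn(1, num_experts)

experts_inf, _ = top_k_route(logits_inference)
experts_train, _ = top_k_route(logits_training)
# The two index sets agree for most tokens, but flip whenever the k-th and
# (k+1)-th expert scores are nearly tied: top-k is discontinuous there.
print(experts_inf.tolist(), experts_train.tolist())
```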
This inconsistency is particularly problematic for MoE models compared to their dense counterparts. Expert selection is a discrete top-k operation, so even tiny changes to the router’s scores can select an entirely different set of experts, causing large shifts in model output probabilities. The paper empirically demonstrates this, showing that MoE models exhibit significantly higher KL divergence (a measure of how one probability distribution differs from another) and a greater proportion of “extreme tokens” (tokens with large probability discrepancies) between training and inference than dense models.
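The sketch below shows how such a diagnostic can be computed, assuming we have the probability each engine assigned to every sampled token; the ratio threshold for flagging “extreme” tokens is our placeholder, and the paper’s exact definition may differ.

```python
# Hedged sketch: estimate the training-inference gap from per-token probabilities.
import math
import torch

def train_inference_gap(p_train: torch.Tensor, p_infer: torch.Tensor,
                        ratio_threshold: float = 2.0):
    """p_train, p_infer: probabilities of the sampled tokens under each engine, shape (seq_len,)."""
    # Tokens were sampled from the inference engine, so the mean per-token
    # log-ratio is a Monte Carlo estimate of KL(inference || training).
    log_ratio = torch.log(p_infer) - torch.log(p_train)
    kl_estimate = log_ratio.mean()
    # "Extreme" tokens: probability ratio beyond the threshold in either direction.
    extreme_frac = (log_ratio.abs() > math.log(ratio_threshold)).float().mean()
    return kl_estimate.item(), extreme_frac.item()

p_infer = torch.tensor([0.30, 0.12, 0.45, 0.01])
p_train = torch.tensor([0.28, 0.13, 0.40, 0.05])
print(train_inference_gap(p_train, p_infer))  # last token is "extreme": 0.01 vs 0.05
```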
Introducing Rollout Routing Replay (R3)
To address this foundational inconsistency, the researchers propose a novel method called Rollout Routing Replay (R3). R3 is a simple yet highly effective approach that tackles the instability by aligning the routing behavior between training and inference.
The core idea is straightforward: during the inference stage, R3 records the exact routing distributions, essentially which experts were selected for each token. Then, during the training engine’s forward pass, these recorded routing distributions are “replayed,” so the training computation uses the same expert selections that occurred during inference. Crucially, while the expert selection mask is replayed, the gradient flow for optimizing the router is preserved, allowing the router itself to keep learning.
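Here is a minimal sketch of the replay mechanism, assuming a standard top-k softmax router. It mirrors the idea described above rather than the paper’s actual implementation: the expert indices come from the recorded rollout, while the gating weights are recomputed from the training router’s logits, so the router still receives gradients.

```python
from typing import Optional
import torch

def moe_route(router_logits: torch.Tensor, k: int = 2,
              replayed_experts: Optional[torch.Tensor] = None):
    """router_logits: (num_tokens, num_experts); replayed_experts: (num_tokens, k) or None."""
    gate_probs = torch.softmax(router_logits, dim=-1)
    if replayed_experts is None:
        # Rollout: select experts normally; the indices are recorded for later replay.
        gate_vals, expert_idx = torch.topk(gate_probs, k, dim=-1)
    else:
        # Training: reuse the experts chosen at rollout time. gather() stays
        # differentiable w.r.t. router_logits, so the router is still optimized.
        expert_idx = replayed_experts
        gate_vals = torch.gather(gate_probs, dim=-1, index=expert_idx)
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize over selected experts
    return expert_idx, gate_vals

# Rollout pass: record which experts each token used.
recorded_idx, _ = moe_route(torch.randn(4, 8))

# Training pass: same tokens, replayed expert selections, live gradients.
logits_train = torch.randn(4, 8, requires_grad=True)
_, gates = moe_route(logits_train, replayed_experts=recorded_idx)
gates.sum().backward()  # gradients flow back to the training router's logits
```

Fixing the indices while leaving the gate values differentiable is what removes the routing mismatch without freezing the router.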
Significant Improvements in Stability and Performance
The empirical analysis of R3’s impact is compelling. After applying R3, the KL divergence between training and inference in MoE models was nearly halved, bringing it close to the levels observed in stable dense models. The frequency of tokens with large training-inference discrepancies was reduced by an order of magnitude.
Extensive experiments on mathematical reasoning tasks confirmed R3’s effectiveness. It consistently stabilized RL training, preventing the catastrophic collapses often seen in MoE models without it, and it outperformed existing methods such as GSPO (Group Sequence Policy Optimization) and TIS (Truncated Importance Sampling) in both stability and overall performance across various training configurations (e.g., multi-step and single-step updates, and different base models). Training runs without R3 frequently collapsed, a failure mode directly linked to abnormally high KL divergence and extreme-token rates. In contrast, R3 kept these values consistently low, ensuring stable learning.
Beyond preventing collapse, R3 also enhanced the overall optimization and generation behavior. Models trained with R3 exhibited smaller gradient norms, indicating a more stable optimization process. They also showed a smoother and faster increase in generated sequence length and more stable entropy, suggesting better exploration and quicker convergence to effective strategies.
Broader Implications
The R3 method is designed to be compatible with existing infrastructure, including KVCache prefix caching strategies used in many inference engines. This makes it particularly efficient for multi-turn dialogue and agent tasks, where reusing cached routing masks avoids redundant computations. The authors believe this work offers a new and practical solution for stabilizing RL in MoE models, paving the way for more robust and powerful LLMs.
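As a rough, hypothetical illustration of that reuse (the class and method names below are ours, not an API from the paper or any inference engine), a routing-mask cache can be keyed by token prefixes the same way a KV cache is, so an unchanged conversation prefix never has its routes recomputed:

```python
from typing import Dict, List, Tuple

class RoutingMaskCache:
    """Toy prefix cache mapping token-id prefixes to recorded expert selections."""

    def __init__(self) -> None:
        # prefix of token ids -> per-token lists of selected expert ids
        self._cache: Dict[Tuple[int, ...], List[List[int]]] = {}

    def put(self, token_ids: List[int], expert_ids: List[List[int]]) -> None:
        self._cache[tuple(token_ids)] = expert_ids

    def longest_prefix(self, token_ids: List[int]) -> Tuple[int, List[List[int]]]:
        """Return how many leading tokens a cached prefix covers, plus its masks."""
        for end in range(len(token_ids), 0, -1):
            masks = self._cache.get(tuple(token_ids[:end]))
            if masks is not None:
                return end, masks
        return 0, []

cache = RoutingMaskCache()
cache.put([1, 2, 3], [[0, 5], [2, 7], [1, 4]])       # turn 1: record routes
covered, masks = cache.longest_prefix([1, 2, 3, 9])  # turn 2 extends the prefix
# covered == 3: only token 9 needs fresh routing; the rest replay from cache.
```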


