TLDR: A new method called Rollout Routing Replay (R3) addresses instability in Reinforcement Learning (RL) for Mixture-of-Experts (MoE) models. MoE models often suffer from training collapse because their “routers” (which select specialized experts) behave differently during inference (generating data) and training (updating the model). R3 fixes this by recording the expert selections made during inference and replaying them during training, ensuring consistency. This significantly reduces discrepancies, stabilizes training, prevents collapse, and improves performance without slowing down the process.
Reinforcement Learning (RL) has become a cornerstone for enhancing the capabilities of large language models (LLMs), enabling them to tackle complex problems from advanced mathematics to practical coding tasks. However, a significant challenge arises when applying RL to Mixture-of-Experts (MoE) models: instability in the training process, often leading to catastrophic collapse.
A recent research paper, “Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers”, delves into this critical issue. Authored by Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo from Peking University and LLM-Core Xiaomi, the paper identifies a fundamental inconsistency in MoE models’ routing mechanisms as the primary culprit behind this instability.
The Root of the Problem: Routing Discrepancies
MoE models work by dynamically selecting a subset of specialized “experts” for each input token via a component called a router. The researchers found a notable discrepancy in how these routers behave during the inference phase (when the model generates responses) and the training phase (when the model learns from those responses). Because of this training-inference inconsistency, the same input can lead to different expert selections in the two phases. What’s more, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes, introducing noise and making the RL process unreliable.
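To make the failure mode concrete, consider a standard top-k softmax router. The sketch below is our own illustration, not code from the paper: it perturbs the router logits by a tiny amount (the 1e-3 magnitude is an assumption), mimicking the numerical differences that can separate training and inference kernels, and shows that the selected expert set can flip whenever two expert scores are nearly tied.

```python
# Illustration only: a standard top-k MoE router, showing how tiny numerical
# differences between two engines can change which experts a token is sent to.
import torch

torch.manual_seed(0)

def top_k_route(router_logits: torch.Tensor, k: int = 2):
    """Pick the k highest-scoring experts for each token."""
    gate_probs = torch.softmax(router_logits, dim=-1)
    gate_vals, expert_idx = torch.topk(gate_probs, k, dim=-1)
    return expert_idx, gate_vals

num_experts = 8
logits_inference = torch.randn(1, num_experts)  # router scores in the rollout engine
# The training engine sees the "same" token, but different kernels and fusions
# can produce logits that differ at roughly the 1e-3 level (assumed magnitude).
logits_training = logits_inference + 1e-3 * torch.randn(1, num_experts)

experts_inf, _ = top_k_route(logits_inference)
experts_train, _ = top_k_route(logits_training)
# The two index sets agree for most tokens, but flip whenever the k-th and
# (k+1)-th expert scores are nearly tied: top-k is discontinuous there.
print(experts_inf.tolist(), experts_train.tolist())
```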
This inconsistency is particularly problematic for MoE models compared to their dense counterparts. Expert selection is a discrete top-k operation, so even tiny changes to the router’s scores can select an entirely different set of experts, causing large shifts in model output probabilities. The paper empirically demonstrates this, showing that MoE models exhibit significantly higher KL divergence (a measure of how one probability distribution differs from another) and a greater proportion of “extreme tokens” (tokens with large probability discrepancies) between training and inference than dense models.
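The sketch below shows how such a diagnostic can be computed, assuming we have the probability each engine assigned to every sampled token; the ratio threshold for flagging “extreme” tokens is our placeholder, and the paper’s exact definition may differ.

```python
# Hedged sketch: estimate the training-inference gap from per-token probabilities.
import math
import torch

def train_inference_gap(p_train: torch.Tensor, p_infer: torch.Tensor,
                        ratio_threshold: float = 2.0):
    """p_train, p_infer: probabilities of the sampled tokens under each engine, shape (seq_len,)."""
    # Tokens were sampled from the inference engine, so the mean per-token
    # log-ratio is a Monte Carlo estimate of KL(inference || training).
    log_ratio = torch.log(p_infer) - torch.log(p_train)
    kl_estimate = log_ratio.mean()
    # "Extreme" tokens: probability ratio beyond the threshold in either direction.
    extreme_frac = (log_ratio.abs() > math.log(ratio_threshold)).float().mean()
    return kl_estimate.item(), extreme_frac.item()

p_infer = torch.tensor([0.30, 0.12, 0.45, 0.01])
p_train = torch.tensor([0.28, 0.13, 0.40, 0.05])
print(train_inference_gap(p_train, p_infer))  # last token is "extreme": 0.01 vs 0.05
```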
Introducing Rollout Routing Replay (R3)
To address this foundational inconsistency, the researchers propose a novel method called Rollout Routing Replay (R3). R3 is a simple yet highly effective approach that tackles the instability by aligning the routing behavior between training and inference.
The core idea is straightforward: during the inference stage, R3 records the exact routing distributions, essentially which experts were selected for each token. Then, during the training engine’s forward pass, these recorded routing distributions are “replayed,” so the training computation uses the same expert selections that occurred during inference. Crucially, while the expert selection mask is replayed, the gradient flow for optimizing the router is preserved, allowing the router itself to keep learning.
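Here is a minimal sketch of the replay mechanism, assuming a standard top-k softmax router. It mirrors the idea described above rather than the paper’s actual implementation: the expert indices come from the recorded rollout, while the gating weights are recomputed from the training router’s logits, so the router still receives gradients.

```python
from typing import Optional
import torch

def moe_route(router_logits: torch.Tensor, k: int = 2,
              replayed_experts: Optional[torch.Tensor] = None):
    """router_logits: (num_tokens, num_experts); replayed_experts: (num_tokens, k) or None."""
    gate_probs = torch.softmax(router_logits, dim=-1)
    if replayed_experts is None:
        # Rollout: select experts normally; the indices are recorded for later replay.
        gate_vals, expert_idx = torch.topk(gate_probs, k, dim=-1)
    else:
        # Training: reuse the experts chosen at rollout time. gather() stays
        # differentiable w.r.t. router_logits, so the router is still optimized.
        expert_idx = replayed_experts
        gate_vals = torch.gather(gate_probs, dim=-1, index=expert_idx)
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize over selected experts
    return expert_idx, gate_vals

# Rollout pass: record which experts each token used.
recorded_idx, _ = moe_route(torch.randn(4, 8))

# Training pass: same tokens, replayed expert selections, live gradients.
logits_train = torch.randn(4, 8, requires_grad=True)
_, gates = moe_route(logits_train, replayed_experts=recorded_idx)
gates.sum().backward()  # gradients flow back to the training router's logits
```

Fixing the indices while leaving the gate values differentiable is what removes the routing mismatch without freezing the router.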
Significant Improvements in Stability and Performance
The empirical analysis of R3’s impact is compelling. After applying R3, the KL divergence between training and inference in MoE models was nearly halved, bringing it close to the levels observed in stable dense models. The frequency of tokens with large training-inference discrepancies was reduced by an order of magnitude.
Extensive experiments on mathematical reasoning tasks confirmed R3’s effectiveness. It consistently stabilized RL training, preventing the catastrophic collapses often seen in MoE models without it, and it outperformed existing methods such as GSPO (Group Sequence Policy Optimization) and TIS (Truncated Importance Sampling) in both stability and overall performance across various training configurations (e.g., multi-step and single-step updates, and different base models). Training runs without R3 frequently collapsed, a failure mode directly linked to abnormally high KL divergence and extreme-token rates. In contrast, R3 kept these values consistently low, ensuring stable learning.
Beyond preventing collapse, R3 also enhanced the overall optimization and generation behavior. Models trained with R3 exhibited smaller gradient norms, indicating a more stable optimization process. They also showed a smoother and faster increase in generated sequence length and more stable entropy, suggesting better exploration and quicker convergence to effective strategies.
Broader Implications
The R3 method is designed to be compatible with existing infrastructure, including KVCache prefix caching strategies used in many inference engines. This makes it particularly efficient for multi-turn dialogue and agent tasks, where reusing cached routing masks avoids redundant computations. The authors believe this work offers a new and practical solution for stabilizing RL in MoE models, paving the way for more robust and powerful LLMs.
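As a rough, hypothetical illustration of that reuse (the class and method names below are ours, not an API from the paper or any inference engine), a routing-mask cache can be keyed by token prefixes the same way a KV cache is, so an unchanged conversation prefix never has its routes recomputed:

```python
from typing import Dict, List, Tuple

class RoutingMaskCache:
    """Toy prefix cache mapping token-id prefixes to recorded expert selections."""

    def __init__(self) -> None:
        # prefix of token ids -> per-token lists of selected expert ids
        self._cache: Dict[Tuple[int, ...], List[List[int]]] = {}

    def put(self, token_ids: List[int], expert_ids: List[List[int]]) -> None:
        self._cache[tuple(token_ids)] = expert_ids

    def longest_prefix(self, token_ids: List[int]) -> Tuple[int, List[List[int]]]:
        """Return how many leading tokens a cached prefix covers, plus its masks."""
        for end in range(len(token_ids), 0, -1):
            masks = self._cache.get(tuple(token_ids[:end]))
            if masks is not None:
                return end, masks
        return 0, []

cache = RoutingMaskCache()
cache.put([1, 2, 3], [[0, 5], [2, 7], [1, 4]])       # turn 1: record routes
covered, masks = cache.longest_prefix([1, 2, 3, 9])  # turn 2 extends the prefix
# covered == 3: only token 9 needs fresh routing; the rest replay from cache.
```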


