Stable LLM Training in Decentralized Networks: The HeteroRL and GEPO Approach

TLDR: A new research paper introduces HeteroRL, an asynchronous reinforcement learning framework, and GEPO, an optimization algorithm, to enable stable and efficient training of Large Language Models (LLMs) in decentralized, heterogeneous computing environments. HeteroRL decouples data sampling from model learning, while GEPO addresses training instability caused by network latency by significantly reducing the variance of importance sampling weights. Experiments show GEPO maintains high performance even under extreme network delays, outperforming existing methods.

As the demand for powerful Large Language Models (LLMs) continues to grow, the traditional approach of training these models in a single, massive computing center is reaching its physical limits. This has led to a significant shift towards decentralized, distributed training, where computing resources are spread across different locations and operate asynchronously. While this offers immense potential for scalability, it introduces complex challenges, particularly for Reinforcement Learning (RL)-driven post-training, a crucial step for enhancing LLM reasoning capabilities.

The core issue in these decentralized setups is network latency. When data samplers (which generate reasoning trajectories) and parameter learners (which update the model) are geographically separated, network delays become inevitable. These delays cause a mismatch between the policy version used by the sampler and the latest policy version on the learner. This mismatch, quantified as increased KL divergence, can lead to the failure of importance sampling, a technique vital for correcting distributional shifts in off-policy learning, ultimately causing training instability and performance degradation.
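The link between policy mismatch and importance-weight variance can be checked on a toy example. The sketch below is not from the paper: it assumes two made-up four-outcome distributions (`learner`, `drifted`) and a hypothetical `staleness` knob that interpolates the sampler's policy away from the learner's. It then computes the exact KL divergence and the exact variance of the weight p/q.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def importance_weight_variance(p, q):
    """Exact variance of w = p(y) / q(y) when y is drawn from q."""
    # E_q[w] = 1, so Var_q[w] = E_q[w^2] - 1 = sum(p^2 / q) - 1.
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1.0

def mix(a, b, t):
    """Linear interpolation between two distributions (t=0 gives a, t=1 gives b)."""
    return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]

# Illustrative numbers only: the learner's current policy and a very different one.
learner = [0.70, 0.20, 0.05, 0.05]
drifted = [0.05, 0.05, 0.20, 0.70]

rows = []
for staleness in (0.0, 0.25, 0.5, 0.75, 1.0):
    sampler = mix(learner, drifted, staleness)  # hypothetical stale sampler policy
    rows.append((
        staleness,
        kl_divergence(sampler, learner),
        importance_weight_variance(learner, sampler),
    ))

for staleness, kl, var in rows:
    print(f"staleness={staleness:.2f}  KL={kl:.4f}  Var[w]={var:.4f}")
```

Both quantities are zero when the two policies match and rise together as the sampler's policy drifts, which is the failure mode described above.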

Introducing HeteroRL: A Framework for Asynchronous Training

To tackle these challenges, researchers have proposed HeteroRL, an asynchronous RL architecture specifically designed for heterogeneous environments. HeteroRL’s fundamental innovation is decoupling the rollout sampler from the parameter learner. This allows them to be deployed on independent computing nodes, even if those nodes have different capabilities and are subject to varying network delays. Both the sampler and learner operate continuously without waiting for each other, exchanging model parameters and sampled data only infrequently or over high-latency links.
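To make the effect of this decoupling concrete, here is a deliberately simplified, deterministic simulation. It is a sketch under stated assumptions, not HeteroRL's implementation: the function `simulate_staleness` and its parameters `param_delay` and `sync_every` are hypothetical, time is counted in abstract ticks, and only the parameter transfer (not the rollout transfer) is delayed.

```python
from collections import deque

def simulate_staleness(total_ticks, param_delay, sync_every):
    """Return, per learner update, how many versions old the rollout's policy was.

    Assumptions (illustrative only): the learner performs one update per tick,
    pushes a parameter snapshot every `sync_every` updates, and each snapshot
    reaches the sampler `param_delay` ticks later.
    """
    learner_version = 0
    sampler_version = 0
    in_flight = deque()   # (arrival_tick, version) snapshots travelling to the sampler
    rollouts = deque()    # policy version that generated each pending rollout
    staleness_log = []

    for tick in range(total_ticks):
        # Deliver every snapshot that has arrived by now.
        while in_flight and in_flight[0][0] <= tick:
            sampler_version = in_flight.popleft()[1]

        # The sampler never waits: it generates with whatever version it holds.
        rollouts.append(sampler_version)

        # The learner never waits either: it trains on the oldest pending rollout.
        rollout_version = rollouts.popleft()
        staleness_log.append(learner_version - rollout_version)
        learner_version += 1

        if learner_version % sync_every == 0:
            in_flight.append((tick + 1 + param_delay, learner_version))

    return staleness_log

for delay in (0, 5, 20):
    log = simulate_staleness(total_ticks=50, param_delay=delay, sync_every=1)
    print(f"param_delay={delay:2d}  max staleness={max(log)}")
```

In this toy model neither side blocks on the other, and the price is that the learner trains on rollouts from an older policy version whenever parameters arrive late or are sent rarely.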

GEPO: Stabilizing Importance Sampling Under Latency

While HeteroRL provides the architectural foundation, unpredictable network latency over the internet still poses a significant threat to training stability. To mitigate this, the paper introduces Group Expectation Policy Optimization (GEPO). GEPO refines the standard importance sampling calculation to reduce the variance of importance weights, thereby maintaining training stability and enhancing performance.

GEPO’s approach involves two key ideas:

1. Sample-Level Importance Weighting: Traditional methods often compute importance weights at the token level (one weight per token in a sequence). However, since rewards are typically assigned to the entire sequence, this can lead to high variance and inefficient optimization. GEPO shifts to sample-level importance weighting, treating the entire response as a single sampling unit. This aligns the optimization unit with the reward unit, leading to more stable and effective policy updates.

2. Group Expectation Smoothing (GES): To further enhance stability, GEPO replaces the individual proposal probability in the importance weight calculation with its group-wise expected value. By leveraging statistical information from a group of responses, this mechanism avoids extreme weight values that can occur when individual probabilities are very low. This makes the importance weights more robust, even under large policy divergence caused by significant network delays.
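The two ideas above can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, the log-probabilities are made up, and the group expectation is estimated here as the plain average of the proposal probabilities within the group, which may differ from the exact estimator GEPO uses.

```python
import math

def sequence_logprob(token_logprobs):
    """Sample-level log-probability: the sum of per-token log-probabilities."""
    return sum(token_logprobs)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sample_level_weights(target_logps, proposal_logps):
    """One weight per response: p(y_i) / q(y_i)."""
    return [math.exp(p - q) for p, q in zip(target_logps, proposal_logps)]

def group_expectation_weights(target_logps, proposal_logps):
    """One weight per response: p(y_i) / mean_j q(y_j).

    Assumption: the group expectation of q is taken as the simple mean of
    q over the responses sampled for the same prompt.
    """
    log_mean_q = logsumexp(proposal_logps) - math.log(len(proposal_logps))
    return [math.exp(p - log_mean_q) for p in target_logps]

# Made-up sequence-level log-probabilities for a group of four responses.
# The third response is very unlikely under the stale sampler policy q.
target_logps = [sequence_logprob([-4.0, -4.0, -4.0]), -11.0, -13.0, -12.5]
proposal_logps = [-12.5, -11.5, -20.0, -12.0]

standard = sample_level_weights(target_logps, proposal_logps)
smoothed = group_expectation_weights(target_logps, proposal_logps)

print("standard:", [round(w, 3) for w in standard])
print("smoothed:", [round(w, 3) for w in smoothed])
```

With these numbers the single low-probability response produces a very large standard weight, while the weights that share a group-level denominator stay within a narrow range.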

The researchers also incorporated a ‘defensive sampling’ mechanism, which blends the target policy probability into the denominator of the importance weight. This helps to mitigate potential bias and smoothly transitions the objective towards a standard policy gradient update when the variance of the behavior policy is high, further enhancing robustness.
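A minimal sketch of such a blended denominator follows. It assumes the common defensive importance sampling form p / (alpha * p + (1 - alpha) * q); the function name `defensive_weight` and the mixing coefficient `alpha` are hypothetical, and the paper's exact formulation and choice of coefficient may differ.

```python
import math

def defensive_weight(target_logp, proposal_logp, alpha):
    """Importance weight p / (alpha * p + (1 - alpha) * q), computed in log space.

    `proposal_logp` stands for whatever proposal term the weight uses
    (for example a group-level expectation). `alpha` is an assumed
    mixing coefficient strictly between 0 and 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    a = math.log(alpha) + target_logp        # log(alpha * p)
    b = math.log1p(-alpha) + proposal_logp   # log((1 - alpha) * q)
    m = max(a, b)
    log_denominator = m + math.log(math.exp(a - m) + math.exp(b - m))
    return math.exp(target_logp - log_denominator)

# Made-up log-probabilities: the response is far likelier under the target.
plain = math.exp(-12.0 - (-20.0))            # unblended ratio p / q
blended = defensive_weight(-12.0, -20.0, alpha=0.1)
print(f"plain ratio:   {plain:.1f}")
print(f"blended ratio: {blended:.3f}")
```

Because the target probability appears in the denominator, the weight in this form can never exceed 1 / alpha, and it approaches 1 as alpha approaches 1, which corresponds to an ordinary unweighted policy gradient update.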

Experimental Validation and Impact

Extensive experiments were conducted using the Qwen3-1.7B model on levels 3–5 of the MATH dataset, simulating heterogeneous computing environments with network delays of up to 1800 seconds. GEPO demonstrated superior training stability and performance compared to popular methods like GRPO and GSPO. Even under extreme network delays, GEPO’s performance degradation remained within 3% of synchronous training, which supports its potential for efficient decentralized RL training over high-latency networks.

The research highlights a strong correlation between network latency, KL divergence between the sampler and learner, and the variance of importance weights. Higher latency leads to increased KL divergence, which in turn inflates importance weight variance and ultimately causes training instability. GEPO effectively addresses this by significantly reducing this variance, leading to more stable gradient estimates and preventing training collapse.

This work not only offers new system and algorithmic perspectives for overcoming data bottlenecks in pre-training and unlocking the post-training potential of LLMs but also lays a practical foundation for building large-scale distributed AI training systems adapted to future heterogeneous compute network infrastructures.
