Stable LLM Training in Decentralized Networks: The HeteroRL and GEPO Approach

TLDR: A new research paper introduces HeteroRL, an asynchronous reinforcement learning framework, and GEPO, an optimization algorithm, to enable stable and efficient training of Large Language Models (LLMs) in decentralized, heterogeneous computing environments. HeteroRL decouples data sampling from model learning, while GEPO addresses training instability caused by network latency by significantly reducing the variance of importance sampling weights. Experiments show GEPO maintains high performance even under extreme network delays, outperforming existing methods.

As the demand for powerful Large Language Models (LLMs) continues to grow, the traditional approach of training these models on single, massive computing centers is reaching its physical limits. This has led to a significant shift towards decentralized, distributed training, where computing resources are spread across different locations and operate asynchronously. While this offers immense potential for scalability, it introduces complex challenges, particularly for Reinforcement Learning (RL)-driven post-training, a crucial step for enhancing LLM reasoning capabilities.

The core issue in these decentralized setups is network latency. When data samplers (which generate reasoning trajectories) and parameter learners (which update the model) are geographically separated, network delays become inevitable. These delays cause a mismatch between the policy version used by the sampler and the latest policy version on the learner. This mismatch, quantified as increased KL divergence, can lead to the failure of importance sampling, a technique vital for correcting distributional shifts in off-policy learning, ultimately causing training instability and performance degradation.
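To make the mismatch concrete, here is a minimal sketch that measures the KL divergence between a stale sampler policy and a learner policy that has kept updating. Everything in it is an illustrative assumption rather than the paper's setup: the three-action toy policy, the `drift` direction, the 0.5 step size, and the choice to measure KL from the sampler to the learner.

```python
import numpy as np

def softmax(logits):
    """Turn logits into a categorical distribution (numerically stable)."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical toy policy over three actions, frozen on the sampler side.
SAMPLER_LOGITS = np.array([2.0, 1.0, 0.0])
# Hypothetical direction in which each learner update moves the logits.
DRIFT = np.array([-1.0, 0.0, 1.0])

def kl_for_staleness(steps_behind, step_size=0.5):
    """KL between the stale sampler policy and a learner that is
    `steps_behind` updates ahead of it."""
    learner_logits = SAMPLER_LOGITS + step_size * steps_behind * DRIFT
    return kl_divergence(softmax(SAMPLER_LOGITS), softmax(learner_logits))

for steps in (0, 1, 2, 4):
    print(steps, kl_for_staleness(steps))
```

In this toy setting the divergence is zero when the sampler is up to date and grows as the sampler falls further behind, which is the effect the paragraph above describes.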

Introducing HeteroRL: A Framework for Asynchronous Training

To tackle these challenges, researchers have proposed HeteroRL, an asynchronous RL architecture designed specifically for heterogeneous environments. HeteroRL's fundamental innovation is decoupling the rollout sampler from the parameter learner. The two can then be deployed on independent computing nodes, even if those nodes have different capabilities and are subject to varying network delays. Both the sampler and the learner run continuously without waiting for each other, exchanging model parameters and sampled data at low frequency or under high latency.
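The effect of this decoupling on policy staleness can be sketched with a small discrete-time simulation. This is a toy model under stated assumptions, not the HeteroRL implementation: the function name `simulate_staleness`, the tick-based clock, the fixed one-way `delay`, and the rule of one learner update per consumed rollout are all hypothetical.

```python
from collections import deque

def simulate_staleness(num_ticks, sync_every, delay):
    """Return, for each rollout the learner consumes, how many updates
    the learner was ahead of the policy that generated the rollout."""
    learner_version = 0
    sampler_version = 0
    params_in_flight = deque()    # (arrival_tick, policy_version)
    rollouts_in_flight = deque()  # (arrival_tick, policy_version)
    staleness = []

    for tick in range(num_ticks):
        # Parameters that have crossed the network reach the sampler.
        while params_in_flight and params_in_flight[0][0] <= tick:
            sampler_version = params_in_flight.popleft()[1]

        # The sampler never waits: it generates a rollout with whatever
        # policy version it currently holds and sends it to the learner.
        rollouts_in_flight.append((tick + delay, sampler_version))

        # The learner never waits either: it trains on whatever arrived.
        while rollouts_in_flight and rollouts_in_flight[0][0] <= tick:
            _, rollout_version = rollouts_in_flight.popleft()
            staleness.append(learner_version - rollout_version)
            learner_version += 1  # one update per consumed rollout

        # Low-frequency parameter broadcast back to the sampler.
        if tick % sync_every == 0:
            params_in_flight.append((tick + delay, learner_version))

    return staleness

for delay in (0, 3, 6):
    print(delay, max(simulate_staleness(50, sync_every=1, delay=delay)))
```

In this model neither side blocks, so throughput is unaffected by the delay, but the learner ends up training on rollouts from increasingly old policy versions as the delay grows.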

GEPO: Stabilizing Importance Sampling Under Latency

While HeteroRL provides the architectural foundation, the unpredictable nature of internet network latency still poses a significant threat to training stability. To mitigate this, the paper introduces Group Expectation Policy Optimization (GEPO). GEPO refines the standard importance sampling calculation to reduce the variance of importance weights, thereby maintaining training stability and enhancing performance.

GEPO’s approach involves two key ideas:

1. Sample-Level Importance Weighting: Traditional methods often compute importance weights at the token level, with one weight per token (roughly a word or word fragment) in a sequence. However, rewards are typically assigned to the entire sequence, so token-level weights can lead to high variance and inefficient optimization. GEPO shifts to sample-level importance weighting, treating the entire response as a single sampling unit. This aligns the optimization unit with the reward unit, leading to more stable and effective policy updates.

2. Group Expectation Smoothing (GES): To further enhance stability, GEPO replaces the individual proposal probability in the importance weight calculation with its group-wise expected value. By leveraging statistical information from a group of responses, this mechanism avoids extreme weight values that can occur when individual probabilities are very low. This makes the importance weights more robust, even under large policy divergence caused by significant network delays.
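The two ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the log-probabilities are made-up numbers, and the group expectation is estimated here as a plain mean of the old-policy sequence probabilities over the group, which may differ from the estimator GEPO actually uses.

```python
import numpy as np

def token_level_weights(tok_logp_new, tok_logp_old):
    """One importance ratio per token."""
    return np.exp(np.asarray(tok_logp_new) - np.asarray(tok_logp_old))

def sample_level_weight(tok_logp_new, tok_logp_old):
    """One importance ratio for the whole response: the ratio of
    sequence probabilities, computed from summed token log-probs."""
    return float(np.exp(np.sum(tok_logp_new) - np.sum(tok_logp_old)))

def group_expectation_weights(seq_logp_new, seq_logp_old):
    """Sample-level weights whose denominator is a group-wide estimate of
    the old-policy probability instead of each response's own value.
    Assumption: the estimate is a simple mean over the group."""
    seq_logp_new = np.asarray(seq_logp_new, dtype=float)
    seq_logp_old = np.asarray(seq_logp_old, dtype=float)
    # Log of the mean probability, computed stably in log space.
    m = seq_logp_old.max()
    log_q_group = m + np.log(np.mean(np.exp(seq_logp_old - m)))
    return np.exp(seq_logp_new - log_q_group)

# Made-up sequence log-probs for a group of three responses to one prompt.
# The third response was very unlikely under the stale sampler policy.
seq_logp_old = np.array([-2.0, -3.0, -10.0])
seq_logp_new = np.array([-2.5, -2.5, -4.0])

standard = np.exp(seq_logp_new - seq_logp_old)
smoothed = group_expectation_weights(seq_logp_new, seq_logp_old)
print("standard:", standard)
print("smoothed:", smoothed)
```

With the individual denominator, the unlikely third response receives a weight in the hundreds and dominates the update. With the shared group denominator, no single low-probability response can blow up its own weight, at the cost of weights that are no longer exact importance ratios.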

The researchers also incorporated a ‘defensive sampling’ mechanism, which blends the target policy probability into the denominator of the importance weight. This helps to mitigate potential bias and smoothly transitions the objective towards a standard policy gradient update when the variance of the behavior policy is high, further enhancing robustness.
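A minimal sketch of this idea, using the classic defensive importance sampling form, looks like the following. The mixing coefficient `alpha`, its default value, and the function name are assumptions for illustration; the article does not give the exact blending rule used in the paper.

```python
import numpy as np

def defensive_weight(p, q, alpha=0.1):
    """Importance weight with the target probability p blended into the
    denominator: p / (alpha * p + (1 - alpha) * q).
    alpha is a hypothetical mixing coefficient in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return p / (alpha * p + (1.0 - alpha) * q)

# As q approaches zero the plain ratio p / q explodes, while the
# defensive weight stays bounded by 1 / alpha.
print(defensive_weight(0.5, 1e-12, alpha=0.1))
# With alpha = 1 every weight is exactly 1: no reweighting at all.
print(defensive_weight(0.5, 1e-12, alpha=1.0))
# With alpha = 0 the plain ratio p / q is recovered.
print(defensive_weight(0.5, 0.25, alpha=0.0))
```

The coefficient trades bias for variance: a larger `alpha` pulls all weights toward 1, which corresponds to the unweighted policy gradient update the paragraph above mentions, while a smaller `alpha` stays closer to the exact importance ratio.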

Experimental Validation and Impact

Experiments used the Qwen3-1.7B model on levels 3–5 of the MATH dataset, simulating heterogeneous computing environments with network delays of up to 1800 seconds. GEPO showed better training stability and performance than popular methods such as GRPO and GSPO. Even under the most extreme delays, GEPO's performance degradation relative to synchronous training stayed within 3%, which supports its potential for efficient decentralized RL training over high-latency networks.

The research highlights a strong correlation between network latency, KL divergence between the sampler and learner, and the variance of importance weights. Higher latency leads to increased KL divergence, which in turn inflates importance weight variance and ultimately causes training instability. GEPO effectively addresses this by significantly reducing this variance, leading to more stable gradient estimates and preventing training collapse.
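The link between KL divergence and weight variance can be seen in closed form in a simple textbook case. The following is an illustration chosen for this article, not a derivation from the paper: it assumes the sampler policy q and the learner policy p are unit-variance Gaussians whose means differ by μ.

```latex
q = \mathcal{N}(0, 1), \qquad p = \mathcal{N}(\mu, 1), \qquad
w(x) = \frac{p(x)}{q(x)} = \exp\!\left(\mu x - \tfrac{\mu^2}{2}\right)

D_{\mathrm{KL}}(p \,\|\, q) = \tfrac{\mu^2}{2}

\mathbb{E}_q[w] = 1, \qquad
\mathbb{E}_q[w^2] = e^{-\mu^2}\, \mathbb{E}_q\!\left[e^{2\mu x}\right]
                  = e^{-\mu^2}\, e^{2\mu^2} = e^{\mu^2}

\mathrm{Var}_q[w] = e^{\mu^2} - 1 = e^{2 D_{\mathrm{KL}}(p \,\|\, q)} - 1
```

In this toy case the variance of the importance weights grows exponentially with the KL divergence, so even a moderate increase in policy mismatch can make the weights far noisier.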

This work not only offers new system and algorithmic perspectives for overcoming data bottlenecks in pre-training and unlocking the post-training potential of LLMs but also lays a practical foundation for building large-scale distributed AI training systems adapted to future heterogeneous compute network infrastructures.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]