TLDR: G-Core is a new RLHF training framework developed by Tencent that significantly improves the scalability and efficiency of training large language models and diffusion models. It achieves this through a parallel controller programming model, which eliminates single-point bottlenecks, and a dynamic scaling placement schema that optimizes GPU utilization by adaptively partitioning resources and scheduling workloads. Successfully deployed in WeChat, G-Core demonstrates robust performance in real-world, large-scale AI training environments.
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone in training advanced AI models, particularly large language models (LLMs) and diffusion models. While RLHF has driven significant progress, existing systems often struggle with scaling to complex multi-modal tasks, adapting to changing workloads, and efficiently managing resources. These challenges include limitations in controller scalability, inflexible resource allocation, and inefficient orchestration of intricate RLHF pipelines, especially when dealing with dynamic data sampling or generative reward modeling.
Introducing G-Core: A New Era in RLHF Training
To address these critical issues, researchers from Tencent have introduced G-Core, a novel RLHF training framework designed for simplicity, scalability, and balance. G-Core aims to overcome the bottlenecks of traditional systems, providing a robust foundation for developing large-scale, human-aligned AI models.
Parallel Controllers: Breaking the Centralized Bottleneck
One of G-Core’s key innovations is its parallel controller programming model. Unlike conventional systems that rely on a single, centralized controller, G-Core distributes control across multiple parallel controllers. This approach prevents a single point of failure or bottleneck, which can occur when transferring large features like images or videos, or when complex procedures overwhelm a single CPU or network bandwidth. By partitioning RL tasks using a Single Program Multiple Data (SPMD) approach, G-Core ensures that each controller manages only a portion of the resources, leading to a more balanced workload distribution, especially with larger batch sizes. This design allows multiple stages of the RLHF workflow to coexist and enables flexible, local state transitions, which are crucial for advanced sampling processes like dynamic sampling or reward-augmented generation.
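As a rough illustration of the SPMD idea described above, the following minimal sketch splits one global batch into shards and runs the same controller routine on each shard. It is not G-Core's actual code: the names `ControllerShard`, `partition_batch`, and `run_controller`, as well as the `generate` and `score` callables, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ControllerShard:
    """The slice of the global batch owned by one controller (hypothetical type)."""
    rank: int
    sample_ids: List[int]


def partition_batch(sample_ids: List[int], num_controllers: int) -> List[ControllerShard]:
    """Split a global batch into contiguous, near-equal shards, one per controller."""
    if num_controllers <= 0:
        raise ValueError("num_controllers must be positive")
    base, extra = divmod(len(sample_ids), num_controllers)
    shards = []
    start = 0
    for rank in range(num_controllers):
        # The first `extra` controllers take one additional sample each.
        size = base + (1 if rank < extra else 0)
        shards.append(ControllerShard(rank, sample_ids[start:start + size]))
        start += size
    return shards


def run_controller(
    shard: ControllerShard,
    generate: Callable[[int], int],
    score: Callable[[int], int],
) -> List[Tuple[int, int]]:
    """The same program on every controller, touching only its own shard (SPMD)."""
    results = []
    for sample_id in shard.sample_ids:
        response = generate(sample_id)   # stand-in for policy generation
        results.append((sample_id, score(response)))  # stand-in for rewarding
    return results


if __name__ == "__main__":
    for shard in partition_batch(list(range(10)), num_controllers=4):
        print(shard.rank, run_controller(shard, lambda s: s * 2, lambda r: r + 1))
```

Because each controller holds only its own shard, no single process has to move or track the whole batch, which is the property the parallel controller model relies on.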
Dynamic Placement: Optimizing Resource Utilization
G-Core also introduces a dynamic scaling placement schema that significantly improves efficiency, particularly in scenarios involving generative rewarding and dynamic sampling. Traditional co-location strategies, where multiple models share the same GPUs, can introduce overhead from model swapping, especially during frequent re-sampling. While this overhead might be negligible in some cases, it can become a bottleneck as training progresses and models improve, leading to more frequent re-sampling and increased swapping. Furthermore, long-tail outputs in the generation stage can reduce GPU cluster utilization, a problem amplified by frequent model swapping.
G-Core tackles this by integrating both co-existing (asynchronous workflow) and co-location (synchronous workflow) strategies. It intelligently partitions the GPU cluster, allowing policy generation and reward model generation to co-exist on separate portions of devices, eliminating the need for frequent model swaps. For the preparation and training stages, G-Core retains the co-location approach, utilizing all GPUs to minimize idle time. This dynamic adjustment of GPU cluster partitioning based on workload ensures that hardware utilization remains high, even under highly variable training conditions. G-Core continuously monitors hardware utilization and reallocates resources from underutilized roles to others, balancing the workload across training roles and maximizing overall efficiency.
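The monitor-and-reallocate loop described above can be sketched as a single rebalancing step. This is an assumption-laden toy, not the framework's scheduler: the function name `rebalance`, the role names, the utilization thresholds (`low`, `high`), and the one-GPU step size are all hypothetical values picked for illustration.

```python
from typing import Dict


def rebalance(
    allocation: Dict[str, int],
    utilization: Dict[str, float],
    low: float = 0.5,    # assumed threshold: below this a role counts as underutilized
    high: float = 0.9,   # assumed threshold: above this a role counts as saturated
    step: int = 1,       # assumed number of GPUs moved per rebalancing round
    min_gpus: int = 1,   # never shrink a role below this many GPUs
) -> Dict[str, int]:
    """Move up to `step` GPUs from the least-utilized role to the most-utilized one."""
    new_alloc = dict(allocation)
    donor = min(utilization, key=utilization.get)
    receiver = max(utilization, key=utilization.get)
    if donor == receiver:
        return new_alloc
    # Only act when the imbalance is clear on both sides.
    if utilization[donor] >= low or utilization[receiver] <= high:
        return new_alloc
    movable = min(step, new_alloc[donor] - min_gpus)
    if movable <= 0:
        return new_alloc
    new_alloc[donor] -= movable
    new_alloc[receiver] += movable
    return new_alloc


if __name__ == "__main__":
    alloc = {"policy_generation": 6, "reward_generation": 2}
    util = {"policy_generation": 0.95, "reward_generation": 0.30}
    print(rebalance(alloc, util))
```

In a real system this step would run periodically against measured utilization, and the total GPU count stays constant because GPUs are only moved between roles.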
Under the Hood: Implementation and Real-World Impact
G-Core is implemented in Python and PyTorch, with vLLM and SGLang for generation serving and Megatron-Core as the training backend. The system distributes all modules across different processes, enabling collaboration via Remote Procedure Calls (RPCs) while minimizing interference with their internal orchestration mechanisms. This multi-processing approach enhances stability and simplifies issue diagnosis.
The framework also incorporates features like asynchronous checkpointing to minimize progress loss during interruptions and adapts to elastic resource scaling by reusing checkpoints across GPU clusters of varying sizes. For workload balancing, G-Core employs a simple yet effective method of sorting data by simulated workload, which significantly reduces wasted compute time without compromising model accuracy. It also supports distributed attention mechanisms, enabling the training of models with extremely long context sequences.
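To make the workload-balancing idea concrete, here is a minimal sketch of sorting samples by a simulated cost and spreading them across data-parallel ranks. The source only says that data is sorted by simulated workload; the cost model (`simulated_cost`, with assumed linear and quadratic coefficients) and the greedy "assign to the least-loaded rank" step are assumptions added for this example, and may differ from what G-Core actually does.

```python
import heapq
from typing import List


def simulated_cost(num_tokens: int, linear: float = 1.0, quadratic: float = 0.001) -> float:
    """Hypothetical cost model: a linear term plus a quadratic, attention-like term."""
    return linear * num_tokens + quadratic * num_tokens ** 2


def balance_by_sorted_cost(token_lengths: List[int], num_ranks: int) -> List[List[int]]:
    """Sort samples by simulated cost (largest first), then greedily give each
    sample to the rank with the smallest accumulated cost so far."""
    order = sorted(
        range(len(token_lengths)),
        key=lambda i: simulated_cost(token_lengths[i]),
        reverse=True,
    )
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (accumulated cost, rank)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_ranks)]
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + simulated_cost(token_lengths[i]), rank))
    return assignment


if __name__ == "__main__":
    lengths = [4000, 200, 300, 3500, 150, 900, 1200, 50]
    for rank, indices in enumerate(balance_by_sorted_cost(lengths, num_ranks=3)):
        total = sum(simulated_cost(lengths[i]) for i in indices)
        print(rank, indices, round(total, 1))
```

With this greedy scheme, the gap between the most and least loaded rank is never larger than the cost of a single sample, so ranks spend less time waiting on the slowest one.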
G-Core has been deployed in production, training models that support features within WeChat, which serves a massive user base. This deployment demonstrates the framework’s robustness and effectiveness at scale: evaluations were conducted on clusters of up to 64 GPUs, and the system was validated in production environments with over 512 GPUs. For more technical details, you can refer to the full research paper.
In conclusion, G-Core represents a significant advancement in RLHF training, offering a practical and flexible solution for orchestrating complex, multi-model workflows. By addressing critical bottlenecks in controller scalability and resource placement, G-Core paves the way for future research and deployment of large-scale, human-aligned AI models.


