TLDR: This research paper characterizes the system challenges of training Large Language Models (LLMs) with Reinforcement Learning with Verifiable Rewards (RLVR). It identifies issues such as GPU idling, inefficient parallel strategies, and data management bottlenecks caused by diverse and dynamic workloads. The authors introduce PolyTrace, a benchmark suite, to provide realistic workloads for evaluating and optimizing RLVR training systems, highlighting the need for workload-aware scheduling and improved resource management.
Large Language Models (LLMs) are transforming many industries, and Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly used to improve their reasoning and understanding. However, training LLMs with RLVR is a complex process, and the system challenges it raises have been poorly understood until now.
A new research paper, "RL in the Wild: Characterizing RLVR Training in LLM deployment", by Jiecheng Zhou, Qinghao Hu, Yuyang Jin, Zerui Wang, Peng Sun, Yuzhe Gu, Wenwei Zhang, Mingshu Zhai, Xingcheng Zhang, and Weiming Zhang, sheds light on these system challenges. The authors conducted an in-depth study of RLVR tasks in real-world LLM deployments, investigating how workloads are distributed and how they change during training.
The Unique Challenges of RLVR Training
The paper highlights several key differences that make RLVR training particularly demanding compared to other LLM training methods:
- Longer Training Steps: A single RLVR training step can take minutes, or even over an hour, whereas a step in other LLM training tasks typically completes in seconds. This makes recovering from errors much more costly.
- Complex Workflows: RLVR involves multiple models working together across several stages, such as ‘rollout’ (where the model generates responses), ‘inference’ (where rewards are calculated), and ‘training’ (where the model learns). Other LLM tasks usually involve just one model and a single stage.
- Unpredictable Workloads: The length of inputs and outputs in RLVR cannot be known beforehand and can change dramatically. This makes it hard to efficiently divide work among computing resources.
- Intricate Software: RLVR training requires building on top of existing training and inference frameworks, plus integrating software for interacting with environments, like sandboxes.
Key Observations from Real-World Workloads
The researchers analyzed extensive data from large-scale RLVR training jobs and found several critical issues:
- Varied and Long-Tail Sequence Lengths: Different RL tasks (like mathematics, image understanding, video understanding, and tool use) have vastly different input and output lengths. Some tasks generate extremely long outputs, while others have very long inputs. This unevenness often leads to GPUs sitting idle.
- Dynamic Output Lengths: The length of the model’s outputs can change as training progresses, depending on the model size and the specific task. For example, math tasks might produce longer outputs over time as the model improves its reasoning.
- Fluctuating Performance: Training speed can vary dramatically, sometimes by a factor of hundreds, even within the same task. This instability is linked to the diverse sequence lengths and to how work is divided among GPUs.
- Inefficient Data Handling: Many RL frameworks use a single central controller for data transfer, which becomes a bottleneck, especially for large multimodal data. This can lead to slow data movement and even memory errors on the CPU.
- Load Imbalance: The uneven distribution of input and output lengths causes some GPUs to be overloaded while others wait, leading to poor overall efficiency.
- Unstable Tool Latency: For tasks involving external tools (like search APIs), the time it takes for these tools to respond can be highly unpredictable, further impacting training efficiency.
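To make the link between long-tail sequence lengths and idle GPUs concrete, here is a minimal sketch. It is not taken from the paper: the cost model (time proportional to token count), the round-robin assignment, and the Pareto-shaped length distribution are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def idle_fraction(lengths, num_gpus):
    """Fraction of GPU time spent idle in one synchronous step.

    Assumptions: sequences are dealt round-robin, per-sequence cost is
    proportional to its token count, and every GPU waits for the slowest one.
    """
    loads = [0] * num_gpus
    for i, n in enumerate(lengths):
        loads[i % num_gpus] += n
    step_time = max(loads)
    if step_time == 0:
        return 0.0
    busy = sum(loads)
    return 1.0 - busy / (step_time * num_gpus)


def long_tail_lengths(count, seed=0):
    """Synthetic lengths: most are short, a few are very long (assumed shape)."""
    rng = random.Random(seed)
    return [int(200 * rng.paretovariate(1.2)) for _ in range(count)]


if __name__ == "__main__":
    uniform = [1000] * 64
    skewed = long_tail_lengths(64)
    print("uniform lengths, idle fraction:", idle_fraction(uniform, 8))
    print("long-tail lengths, idle fraction:", idle_fraction(skewed, 8))
```

With identical lengths the idle fraction is zero. A single very long sequence is enough to leave the other GPUs waiting for most of the step.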
Introducing PolyTrace: A New Benchmark
To help address these challenges, the authors propose the PolyTrace benchmark suite. This suite provides realistic workloads from seven different RL tasks, allowing researchers to evaluate and optimize RL training systems effectively. PolyTrace captures details like input length, output length, and the number of turns in a dialogue, simulating real-world scenarios without the high cost of actual tool invocations.
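As an illustration of how such a trace could drive a simulator, the sketch below defines a hypothetical record with the three fields mentioned above and estimates a per-sample cost without calling any tool. The actual PolyTrace file format is not described here, so the `TraceRecord` layout, the throughput numbers, and the fixed per-turn tool latency are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical trace entry; field names are illustrative only."""
    task: str
    input_len: int   # prompt tokens
    output_len: int  # generated tokens
    num_turns: int   # dialogue turns


def replay_cost(record, prefill_tok_per_s=10000.0, decode_tok_per_s=100.0,
                tool_latency_s=0.5):
    """Estimate seconds spent on one sample under an assumed cost model.

    A tool call is assumed to happen between consecutive turns and is
    replaced by a fixed latency instead of a real invocation.
    """
    prefill = record.input_len / prefill_tok_per_s
    decode = record.output_len / decode_tok_per_s
    tool_wait = max(record.num_turns - 1, 0) * tool_latency_s
    return prefill + decode + tool_wait


if __name__ == "__main__":
    trace = [
        TraceRecord("math", input_len=1000, output_len=200, num_turns=1),
        TraceRecord("tool_use", input_len=1000, output_len=200, num_turns=3),
    ]
    for rec in trace:
        print(rec.task, round(replay_cost(rec), 3), "s")
```

Replaying recorded lengths and turn counts in this way lets a scheduler or parallelization strategy be compared across tasks without paying for real rollouts or tool calls.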
Implications for System Design
The study emphasizes the need for:
- Workload-Aware Scheduling: Strategies that can adapt to the diverse and dynamic nature of RL workloads are crucial.
- Efficient Data Management: Better mechanisms are needed for transferring and managing data between different stages of the RL pipeline.
- Dynamic Parallelization: Instead of static approaches, systems should dynamically adjust how work is distributed to maximize GPU utilization.
- Optimized Memory Management: More intelligent GPU memory allocation is required to prevent issues like KV-cache recomputation.
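One simple form of workload-aware scheduling is to balance sequences across GPUs by length rather than by count. The sketch below uses the classic longest-processing-time-first greedy heuristic; this is a stand-in technique chosen for illustration, not the scheduler proposed or evaluated in the paper, and `balance_by_length` is a hypothetical name.

```python
import heapq


def balance_by_length(lengths, num_gpus):
    """Assign sequence indices to GPUs with a greedy longest-first heuristic.

    Sequences are visited from longest to shortest and each one goes to the
    GPU with the smallest accumulated token count so far.
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + lengths[idx], gpu))
    return assignment


def gpu_loads(lengths, assignment):
    """Total tokens assigned to each GPU."""
    return [sum(lengths[i] for i in group) for group in assignment]


if __name__ == "__main__":
    lengths = [9, 1, 9, 1]
    plan = balance_by_length(lengths, num_gpus=2)
    print(plan, gpu_loads(lengths, plan))
```

For the lengths `[9, 1, 9, 1]` on two GPUs, dealing sequences round-robin puts both long sequences on the same GPU (loads 18 and 2), while the greedy plan yields loads of 10 and 10.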
The research also explores the impact of hyperparameters like batch size and maximum response length, showing how their optimal settings vary significantly across different RL tasks. Asynchronous training, where different stages of the pipeline can operate with slightly outdated model parameters, is shown to significantly improve throughput without sacrificing performance.
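The throughput benefit of asynchronous training can be seen with a back-of-the-envelope model. The sketch below assumes constant rollout and training durations and a pipeline in which rollout for the next step overlaps with training on the current one, so rollouts use parameters that are one step stale. These assumptions and the example numbers are illustrative and are not measurements from the paper.

```python
def sync_time(rollout_s, train_s, steps):
    """Total time when each step runs rollout and then training, back to back."""
    return steps * (rollout_s + train_s)


def async_time(rollout_s, train_s, steps):
    """Total time for a one-step-stale pipeline.

    Rollout for step k+1 overlaps with training for step k, so after the
    first rollout each step costs max(rollout_s, train_s); the last
    training phase has nothing left to overlap with.
    """
    if steps <= 0:
        return 0
    return rollout_s + (steps - 1) * max(rollout_s, train_s) + train_s


if __name__ == "__main__":
    r, t, n = 60, 40, 10  # assumed seconds per phase and number of steps
    print("synchronous:", sync_time(r, t, n), "s")
    print("asynchronous:", async_time(r, t, n), "s")
```

Under these assumed numbers the synchronous schedule takes 1000 seconds and the overlapped schedule takes 640 seconds; the gain shrinks as one phase comes to dominate the other.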
In conclusion, this paper provides a foundational understanding of the system-level complexities in RLVR training for LLMs. By characterizing real-world workloads and introducing the PolyTrace benchmark, it offers valuable insights and tools for developing more efficient and scalable RL training systems in the future.


