Unpacking GPU Communication: How Machine Learning Workloads Impact Networks

TLDR: This research analyzes the communication patterns of various machine learning models (like DeepSeek, GPT, Llama) on distributed GPU systems. By instrumenting NVIDIA’s NCCL library, the study reveals that collective operations like AllReduce and AllGather create specific traffic patterns, often sparse, which can lead to network congestion and performance issues, especially during anomalies. The findings suggest a need to rethink network topologies and collective communication frameworks to better accommodate these unique ML workload behaviors.

Modern machine learning applications, especially those involving large language models like DeepSeek, GPT, and Llama, rely heavily on distributed systems with many Graphics Processing Units (GPUs). For these powerful systems to work together efficiently, they need to communicate constantly. This communication often involves specific operations known as ‘collective communication operations,’ such as AllReduce, AllGather, and Broadcast. While essential, these operations can create intense, bursty traffic patterns that may lead to network congestion and even data loss, significantly slowing down the entire machine learning job.
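To make the three operations concrete, here is a minimal pure-Python sketch of what each one leaves in every GPU's buffer. It models only the result, not how NCCL moves the data; the function names and the list-of-lists representation (one inner list per rank) are illustrative assumptions, not a library API.

```python
from typing import List

Buffers = List[List[float]]  # one buffer per rank (hypothetical representation)


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank ends with the element-wise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank ends with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def broadcast(buffers: Buffers, root: int = 0) -> Buffers:
    """Every rank ends with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


# Three ranks, each holding two values.
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce(ranks)
gathered = all_gather(ranks)
sent = broadcast(ranks, root=1)
```

In all three cases every rank must receive data that originated on other ranks, which is why the operations tend to put traffic on many links at the same moment.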

Understanding Collective Communication in Machine Learning

The performance of these large-scale machine learning tasks is directly tied to how well the underlying network handles this communication. When the network becomes congested or drops packets, training or inference can slow down severely. This research highlights the need to understand these communication patterns in order to design and provision network resources for different types of machine learning workloads.

The Research Approach: Instrumenting NCCL

To gain deeper insights, the researchers instrumented NVIDIA’s Collective Communication Library (NCCL), a widely used framework for GPU-to-GPU communication. They extended NCCL’s logging to capture detailed information, such as the exact number of bytes exchanged between each pair of GPUs. Their testbed consisted of four servers, each equipped with eight NVIDIA H100 GPUs, interconnected with NVLink within each server and a rail-optimized topology between servers. They ran a variety of popular models, including DeepSeek V3, GPT-2, Llama, BERT, ResNet18, and VGG11, to observe their collective communication behavior.
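As a rough illustration of what such per-pair logging makes possible, the sketch below aggregates transfer records into a GPU-to-GPU traffic matrix and counts operations by type. The `TransferRecord` fields and the sample values are hypothetical; the article does not describe the actual log format produced by the instrumented NCCL.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class TransferRecord(NamedTuple):
    """One logged transfer (hypothetical schema, not NCCL's real log format)."""
    op: str        # collective that triggered the transfer, e.g. "AllReduce"
    src_gpu: int   # global rank of the sending GPU
    dst_gpu: int   # global rank of the receiving GPU
    nbytes: int    # bytes moved in this transfer


def build_traffic_matrix(records: Iterable[TransferRecord]) -> Dict[Tuple[int, int], int]:
    """Sum the bytes sent for each (source GPU, destination GPU) pair."""
    matrix: Dict[Tuple[int, int], int] = defaultdict(int)
    for rec in records:
        matrix[(rec.src_gpu, rec.dst_gpu)] += rec.nbytes
    return dict(matrix)


def count_transfers_by_op(records: Iterable[TransferRecord]) -> Counter:
    """Count logged transfers per collective type."""
    return Counter(rec.op for rec in records)


# Made-up sample data for illustration only.
log: List[TransferRecord] = [
    TransferRecord("AllReduce", 0, 1, 4096),
    TransferRecord("AllReduce", 1, 0, 4096),
    TransferRecord("AllReduce", 0, 1, 2048),
    TransferRecord("AllGather", 2, 3, 1024),
]
traffic = build_traffic_matrix(log)
op_counts = count_transfers_by_op(log)
```

Keeping the matrix as a dictionary keyed by GPU pair means pairs that never communicate simply do not appear, which is convenient when most pairs are idle.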

Key Findings from DeepSeek V3 Analysis

The study focused on DeepSeek V3 inference, with the model distributed across 16 GPUs using model parallelism. A key observation was the dominance of AllReduce operations over AllGather. For instance, with just eight queries, DeepSeek V3 performed over 600,000 AllReduce operations compared to only about 3,000 AllGather operations, a ratio of roughly 200 to 1. Training and fine-tuning workloads also use these operations, but they involve much larger data transfers because of weight updates, whereas inference workloads primarily pass activations.

Interestingly, despite using 16 GPUs, the analysis revealed that only a small number of them actively communicate with each other. Very little traffic actually leaves the individual servers to enter the broader network fabric or spine layer. This sparse communication pattern, also observed in training and fine-tuning, raises questions about whether current rail-optimized network topologies are truly necessary or if more efficient, tailored topologies could be designed.
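Two simple numbers capture this kind of sparsity: the fraction of GPU pairs that exchange any data, and the share of bytes that crosses a server boundary. The sketch below computes both from a per-pair traffic matrix. It assumes GPUs are numbered contiguously with eight per server, as in the testbed described above; the sample matrix is made up and is not data from the study.

```python
from typing import Dict, Tuple

GPUS_PER_SERVER = 8  # matches the testbed; the contiguous rank layout is an assumption


def server_of(gpu: int) -> int:
    """Map a global GPU rank to its server index (assumes contiguous numbering)."""
    return gpu // GPUS_PER_SERVER


def summarize_traffic(matrix: Dict[Tuple[int, int], int], num_gpus: int) -> Dict[str, float]:
    """Return the active-pair fraction and the inter-server byte fraction."""
    possible_pairs = num_gpus * (num_gpus - 1)  # ordered pairs, excluding self
    active_pairs = sum(1 for nbytes in matrix.values() if nbytes > 0)
    total_bytes = sum(matrix.values())
    inter_server_bytes = sum(
        nbytes
        for (src, dst), nbytes in matrix.items()
        if server_of(src) != server_of(dst)
    )
    return {
        "active_pair_fraction": active_pairs / possible_pairs if possible_pairs else 0.0,
        "inter_server_byte_fraction": inter_server_bytes / total_bytes if total_bytes else 0.0,
    }


# Illustrative matrix for 16 GPUs: three intra-server flows and one cross-server flow.
sample_matrix = {(0, 1): 100, (1, 0): 100, (8, 9): 100, (7, 8): 100}
summary = summarize_traffic(sample_matrix, num_gpus=16)
```

A low value for both metrics is the situation the paragraph above describes: few pairs talk, and most of what they send stays inside a server.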

Implications for Network Design and Resilience

The research also delved into the timing of these operations, finding that collective communications typically occur at a microsecond granularity. However, in real-world data centers, network anomalies like optics failures, port flapping, or congestion can cause these operations to extend to tens of seconds. For training and fine-tuning workloads, such delays can even lead to job failures, requiring restarts from previous checkpoints. Current collective communication frameworks like NCCL and RCCL are not designed to gracefully recover from these types of network issues. This suggests a strong need to integrate network-aware mechanisms into these frameworks to ensure more resilient and robust machine learning operations.
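One simple form a network-aware check could take is flagging collectives whose duration is far above the job's typical value. The sketch below is a hypothetical illustration of that idea, not a mechanism from the paper or from NCCL or RCCL; the function name, the median baseline, and the default factor of 1000 are all assumptions chosen for the example.

```python
from statistics import median
from typing import List, Sequence


def find_stalled_collectives(durations_us: Sequence[float], factor: float = 1000.0) -> List[int]:
    """Return indices of collectives that took more than `factor` times the median.

    `durations_us` holds per-collective durations in microseconds. The median
    serves as the baseline so that a few extreme stalls do not shift it.
    """
    if not durations_us:
        return []
    baseline = median(durations_us)
    return [i for i, d in enumerate(durations_us) if d > factor * baseline]


# Made-up durations: four microsecond-scale collectives and one 20-second stall.
samples_us = [40.0, 55.0, 48.0, 20_000_000.0, 52.0]
stalled = find_stalled_collectives(samples_us)
```

Detection alone does not recover the job; it only identifies when a collective has moved from the microsecond regime into the seconds regime, which is the point at which a framework would need to react.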

In conclusion, this work provides valuable insights into how GPU communication impacts networks for various machine learning workloads. It highlights the unique traffic patterns generated by these applications and underscores the importance of adapting network designs and communication frameworks to better handle the demands and potential anomalies of modern AI systems. For the technical details, see the full research paper.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
