Unpacking GPU Communication: How Machine Learning Workloads Impact Networks

TLDR: This research analyzes the communication patterns of various machine learning models (like DeepSeek, GPT, Llama) on distributed GPU systems. By instrumenting NVIDIA’s NCCL library, the study reveals that collective operations like AllReduce and AllGather create specific traffic patterns, often sparse, which can lead to network congestion and performance issues, especially during anomalies. The findings suggest a need to rethink network topologies and collective communication frameworks to better accommodate these unique ML workload behaviors.

Modern machine learning applications, especially those involving large language models like DeepSeek, GPT, and Llama, rely heavily on distributed systems with many Graphics Processing Units (GPUs). For these powerful systems to work together efficiently, they need to communicate constantly. This communication often involves specific operations known as ‘collective communication operations,’ such as AllReduce, AllGather, and Broadcast. While essential, these operations can create intense, bursty traffic patterns that may lead to network congestion and even data loss, significantly slowing down the entire machine learning job.

Understanding Collective Communication in Machine Learning

The performance of these large-scale machine learning tasks is directly tied to how well the underlying network handles this communication. When the network becomes congested or drops packets, training or inference can slow down severely. This research highlights the need to understand these communication patterns in order to design and provision network resources for different types of machine learning workloads.
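To make the semantics of AllReduce, AllGather, and Broadcast concrete, the sketch below simulates them on plain Python lists, one list per rank. It is only an illustration of what each rank holds after the operation. It is not how NCCL implements them, and the function names are chosen here for readability.

```python
from typing import List

Buffers = List[List[float]]  # one buffer per rank (illustrative type alias)


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def broadcast(buffers: Buffers, root: int = 0) -> Buffers:
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three simulated ranks
print(all_reduce(ranks)[0])
print(all_gather(ranks)[0])
print(broadcast(ranks, root=1)[0])
```

In all three cases every rank finishes with the same data, which is why a single slow link or rank can hold up the whole group.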

The Research Approach: Instrumenting NCCL

To gain deeper insights, the researchers instrumented NVIDIA’s Collective Communication Library (NCCL), a widely used framework for GPU-to-GPU communication. They extended NCCL’s logging to capture detailed information, such as the exact number of bytes exchanged between each pair of GPUs. The test setup consisted of four servers, each equipped with eight NVIDIA H100 GPUs, interconnected with NVLink within servers and a rail-optimized topology between them. They ran a variety of popular models, including DeepSeek V3, GPT-2, Llama, BERT, ResNet18, and VGG11, to observe their collective communication behavior.
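Once per-pair byte counts are logged, they can be aggregated into a traffic matrix. The sketch below assumes a hypothetical record format of `op,src_gpu,dst_gpu,bytes`. The article does not describe the actual log format produced by the instrumented NCCL, so both the format and the sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical log excerpt: op,src_gpu,dst_gpu,bytes (not the real NCCL format).
SAMPLE_LOG = """\
AllReduce,0,1,4096
AllReduce,1,0,4096
AllGather,0,2,1024
AllReduce,0,1,2048
"""


def build_traffic_matrix(log_text: str) -> Dict[Tuple[int, int], int]:
    """Sum the bytes sent from each source GPU to each destination GPU."""
    matrix: Dict[Tuple[int, int], int] = defaultdict(int)
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _op, src, dst, nbytes = line.split(",")
        matrix[(int(src), int(dst))] += int(nbytes)
    return dict(matrix)


traffic = build_traffic_matrix(SAMPLE_LOG)
for (src, dst), nbytes in sorted(traffic.items()):
    print(f"GPU {src} -> GPU {dst}: {nbytes} bytes")
```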

Key Findings from DeepSeek V3 Analysis

The study focused on DeepSeek V3 inference, with the model distributed across 16 GPUs using model parallelism. A key observation was the dominance of AllReduce operations over AllGather. For instance, with just eight queries, DeepSeek V3 performed over 600,000 AllReduce operations compared to only about 3,000 AllGather operations. Training and fine-tuning workloads use the same operations, but they move much more data because they exchange weight updates, whereas inference workloads primarily pass activations.
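As a quick back-of-the-envelope check on those counts, the snippet below works out the ratio using the rounded figures quoted above. Because the article says "over 600,000" and "about 3,000", the results are approximate.

```python
# Rounded figures from the article; the AllReduce count is a lower bound.
all_reduce_ops = 600_000
all_gather_ops = 3_000

ratio = all_reduce_ops / all_gather_ops
share = all_reduce_ops / (all_reduce_ops + all_gather_ops)

print(f"AllReduce : AllGather is roughly {ratio:.0f} : 1")
print(f"AllReduce share of these two operations: {share:.1%}")
```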

Interestingly, despite using 16 GPUs, the analysis revealed that only a small number of them actively communicate with each other. Very little traffic actually leaves the individual servers to enter the broader network fabric or spine layer. This sparse communication pattern, also observed in training and fine-tuning, raises questions about whether current rail-optimized network topologies are truly necessary or if more efficient, tailored topologies could be designed.
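One way to quantify this sparsity, given a per-pair traffic matrix, is to measure how many GPU pairs exchange any data and what fraction of bytes crosses a server boundary. The sketch below assumes GPUs are numbered contiguously, so GPUs 0 to 7 sit in the first server and 8 to 15 in the second. The sample traffic values are invented and are not the paper's measurements.

```python
from typing import Dict, Tuple

GPUS_PER_SERVER = 8  # matches the eight-GPU servers described in the article


def server_of(gpu: int) -> int:
    """Map a GPU index to its server, assuming contiguous numbering."""
    return gpu // GPUS_PER_SERVER


def summarize(traffic: Dict[Tuple[int, int], int], num_gpus: int) -> Dict[str, float]:
    """Return the share of communicating pairs and of inter-server bytes."""
    total = sum(traffic.values())
    inter = sum(b for (s, d), b in traffic.items() if server_of(s) != server_of(d))
    active_pairs = sum(1 for b in traffic.values() if b > 0)
    possible_pairs = num_gpus * (num_gpus - 1)  # ordered pairs, no self-traffic
    return {
        "pair_density": active_pairs / possible_pairs,
        "inter_server_fraction": inter / total if total else 0.0,
    }


# Invented example: three intra-server flows and one that crosses servers.
sample = {(0, 1): 6000, (1, 0): 6000, (8, 9): 3000, (7, 8): 1000}
stats = summarize(sample, num_gpus=16)
print(stats)
```

A low pair density together with a low inter-server fraction is the pattern the article describes: few GPUs talk to each other, and most of the traffic stays inside a server.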

Implications for Network Design and Resilience

The research also delved into the timing of these operations, finding that collective communications typically occur at a microsecond granularity. However, in real-world data centers, network anomalies like optics failures, port flapping, or congestion can cause these operations to extend to tens of seconds. For training and fine-tuning workloads, such delays can even lead to job failures, requiring restarts from previous checkpoints. Current collective communication frameworks like NCCL and RCCL are not designed to gracefully recover from these types of network issues. This suggests a strong need to integrate network-aware mechanisms into these frameworks to ensure more resilient and robust machine learning operations.
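A simple form of network awareness is to watch collective durations and flag those far outside the normal microsecond range. The sketch below uses a median-based threshold. The factor of 1000 and the sample durations are arbitrary assumptions, and this is not a mechanism that exists in NCCL or RCCL.

```python
from statistics import median
from typing import List, Tuple


def flag_stalled(durations_us: List[float], factor: float = 1000.0) -> List[Tuple[int, float]]:
    """Return (index, duration) for collectives slower than factor x the median."""
    if not durations_us:
        return []
    threshold = factor * median(durations_us)
    return [(i, d) for i, d in enumerate(durations_us) if d > threshold]


# Invented durations in microseconds; the fourth models a 20-second stall.
durations = [12, 15, 9, 20_000_000, 14]
stalled = flag_stalled(durations)
print(stalled)
```

A framework could use such a signal to reroute traffic or retry an operation before the job fails, instead of restarting from a checkpoint.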

In conclusion, this work shows how GPU communication from different machine learning workloads affects the network. It highlights the distinctive traffic patterns these applications generate and underscores the importance of adapting network designs and communication frameworks to handle the demands and potential anomalies of modern AI systems. For the technical details, see the full research paper.
