TLDR: A study analyzing MLPerf Training v4.1 data found that while increasing GPUs reduces AI model training time, it often decreases efficiency per GPU due to communication overhead. The research identifies an “equilibrium point” where performance and efficiency are optimally balanced across various deep learning models like BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, suggesting that more GPUs aren’t always better, especially for cost and energy efficiency.
Training large-scale artificial intelligence models has become a significant challenge for both the scientific community and industry. While using many Graphics Processing Units (GPUs) can dramatically shorten training times, it often lowers the efficiency obtained from each GPU. A recent study examines this trade-off, analyzing data from MLPerf Training v4.1 to understand how GPU scalability affects the training of various AI models.
The research, titled “Estudio de la eficiencia en la escalabilidad de GPU‘s para el entrenamiento de Inteligencia Artificial” (in English, “Study of the efficiency of GPU scalability for training Artificial Intelligence”) by David Cortes, Carlos Juiz, and Belen Bermejo, identifies specific configurations that best balance performance, GPU usage, and efficiency. The findings suggest an “equilibrium point” where training times can be reduced while maximizing efficiency.
Understanding the Metrics
To evaluate performance, the study uses two key metrics: Speedup and Efficiency. Speedup measures how many times faster a system is than a reference system. Efficiency is the ratio of Speedup to the number of accelerators used, indicating how effectively the available resources are utilized. An ideal Efficiency of 1 means that performance improves in direct proportion to the number of accelerators.
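The two definitions above can be written out as a short Python sketch. The function names, the choice of a single-accelerator reference system, and the training times are illustrative assumptions, not values or code from the study.

```python
def speedup(reference_minutes: float, system_minutes: float) -> float:
    """How many times faster the system trains than the reference system."""
    return reference_minutes / system_minutes


def efficiency(reference_minutes: float, system_minutes: float,
               num_accelerators: int) -> float:
    """Speedup divided by the number of accelerators used (ideal value: 1)."""
    return speedup(reference_minutes, system_minutes) / num_accelerators


# Hypothetical numbers for illustration only (NOT MLPerf results):
# a single-accelerator reference takes 120 minutes,
# an 8-accelerator system takes 20 minutes.
s = speedup(120.0, 20.0)        # 6.0: six times faster than the reference
e = efficiency(120.0, 20.0, 8)  # 0.75: each accelerator delivers 75% of ideal
print(f"speedup={s}, efficiency={e}")
```

In this made-up case the 8-accelerator system is six times faster but falls short of the ideal eightfold gain, which is the kind of gap the Efficiency metric exposes.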
Analyzing Different AI Workloads
The study examined four distinct deep learning workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion. Each model presented unique scalability characteristics:
- BERT: Configurations with a small number of GPUs, typically around 8, showed the highest efficiency, because communication overhead was not yet significant. Beyond 16 GPUs, however, efficiency began to drop more sharply. Interestingly, massive configurations with over 3000 accelerators sometimes showed efficiency comparable to systems with just 2 GPUs, pointing to diminishing returns at extreme scales.
- Llama2 LoRA: This model, known for its high memory and communication demands, experienced a rapid decline in efficiency as the number of GPUs increased. While 4 to 8 accelerators offered superior efficiency, larger systems (e.g., 1024 GPUs) completed training much faster, albeit with lower efficiency per GPU.
- RetinaNet: This object detection model scaled more linearly than Llama2 LoRA, possibly due to less frequent parameter synchronization. Efficiency decreased gradually, remaining moderate to high with up to 8 GPUs before a more pronounced drop.
- Stable Diffusion: This generative model, involving an autoencoder and U-Net with multiple denoising steps, proved particularly demanding in terms of inter-GPU communication. Moderate to high efficiency was observed with 4 to 8 GPUs, likely due to effective parallelism in the latent space and convolutional network. However, with more accelerators (16, 32, or hundreds), efficiency significantly declined, possibly due to bottlenecks in gradient communication and model synchronization.
The Equilibrium Point and Future Implications
A crucial takeaway from this research is that communication and gradient synchronization in large-scale networks significantly penalize the individual performance of each accelerator. This effect is particularly evident in models requiring extensive data exchange, such as Llama2 LoRA. The study emphasizes the importance of carefully planning computing infrastructures, as simply adding more GPUs does not always lead to overall optimization. The “optimal” configuration depends on the primary goal, whether that is absolute training speed or maximum efficiency per accelerator, and on the available resources.
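To make the dependence on the goal concrete, here is a minimal Python sketch of choosing a configuration under three different goals. The article does not say how the study locates its equilibrium point, so the "balanced" rule below (fastest configuration whose efficiency stays above a floor) is an assumed heuristic, and the run data, the function name `pick_config`, and the 0.7 threshold are all hypothetical.

```python
def pick_config(configs, reference_minutes, goal, min_efficiency=0.7):
    """Return (gpus, minutes, efficiency) for the configuration matching the goal.

    Efficiency is computed as (reference_minutes / minutes) / gpus, assuming
    the reference system uses a single GPU.
    """
    scored = [
        (gpus, minutes, (reference_minutes / minutes) / gpus)
        for gpus, minutes in configs
    ]
    if goal == "speed":        # shortest training time, whatever the cost
        return min(scored, key=lambda c: c[1])
    if goal == "efficiency":   # best use of each individual GPU
        return max(scored, key=lambda c: c[2])
    if goal == "balanced":     # assumed heuristic, not the study's method
        eligible = [c for c in scored if c[2] >= min_efficiency]
        if not eligible:
            raise ValueError("no configuration meets the efficiency floor")
        return min(eligible, key=lambda c: c[1])
    raise ValueError(f"unknown goal: {goal!r}")


# Made-up (gpus, minutes) pairs for illustration only; NOT MLPerf results.
HYPOTHETICAL_RUNS = [(1, 240.0), (4, 64.0), (8, 40.0), (16, 30.0), (64, 15.0)]
REFERENCE_MINUTES = 240.0

for goal in ("speed", "efficiency", "balanced"):
    gpus, minutes, eff = pick_config(HYPOTHETICAL_RUNS, REFERENCE_MINUTES, goal)
    print(f"{goal:>10}: {gpus} GPUs, {minutes} min, efficiency {eff:.2f}")
```

With these invented numbers, the "speed" goal selects 64 GPUs, the "efficiency" goal selects the single-GPU reference, and the "balanced" rule lands on 8 GPUs. A different efficiency floor or a cost-weighted score would move that middle choice.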
The findings are particularly relevant for research centers and organizations with budget constraints or energy consumption targets, as establishing a proper balance between computing capacity and communication complexity can lead to better utilization of each GPU. Future research aims to explore parallelism strategies and communication optimizations to mitigate efficiency drops, as well as incorporate energy consumption measurements and sustainability metrics for more holistic decision-making.
For more detailed information, you can refer to the original research paper: Estudio de la eficiencia en la escalabilidad de GPU‘s para el entrenamiento de Inteligencia Artificial.


