
Adaptive Strategies for Scalable Decentralized Deep Learning

TLDR: This research paper addresses the stability and scalability challenges in decentralized deep learning for large-scale DNN training. It introduces DBench, a benchmarking framework, to analyze the correlation between model accuracy and parameter tensor variances across different communication graphs and training scales. Based on these insights, the paper proposes Ada, an adaptive decentralized approach that dynamically adjusts communication graphs during training. Ada achieves superior convergence rates and model accuracy comparable to centralized learning, even when scaling up to 1008 GPUs for complex models like ResNet50 on ImageNet-1K.

Deep learning models are becoming increasingly complex and require significant computational power for training. To handle this, distributed training methods are often used, with data parallelism being a popular choice due to its efficiency and scalability. In data parallelism, multiple copies of a deep neural network (DNN) model are distributed among accelerators like GPUs, each processing a different part of the training data. While centralized data parallel training synchronizes gradients globally, decentralized learning, also known as gossip learning or asynchronous DL training, averages parameters locally based on predefined communication graphs among neighboring accelerators.
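
To make the neighbor-averaging step concrete, here is a minimal NumPy sketch of a single gossip round; the function name `gossip_step`, the uniform averaging weights, and the toy ring graph are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def gossip_step(params, graph):
    """One decentralized averaging step: each replica averages its own
    parameters with those of its neighbors in the communication graph."""
    return [np.mean([params[i]] + [params[j] for j in graph[i]], axis=0)
            for i in range(len(params))]

# Toy example: four replicas connected in a ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
params = [np.random.randn(10) for _ in range(4)]
params = gossip_step(params, ring)
```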

Despite its theoretical advantages in communication efficiency and potential for good model accuracy, decentralized learning has faced challenges in real-world production use. Key issues include a lack of stability, scalability, and generality, especially when training large-scale DNNs. Previous studies have pointed out that fluctuations and significant differences in local training results across accelerators can harm model convergence as the scale increases.

Understanding the Challenges with DBench

To bridge the gap between theory and practical application, researchers introduced a benchmarking framework called DBench. This framework allows for profiling both centralized and decentralized DNN training runs, with configurable parameters such as the communication graph and training scale. DBench collects data on model accuracy for both training and testing, as well as the L2-norm of parameter tensors on each GPU before parameters are averaged. This “white-box” analysis helps to understand the internal workings of decentralized learning.
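
As a rough illustration of this kind of profiling, the sketch below computes per-replica L2-norms of the flattened parameters just before averaging; the function name and in-memory data layout are assumptions, not DBench's actual API.

```python
import numpy as np

def parameter_l2_norms(replicas):
    """L2-norm of each replica's flattened parameters, measured just before
    the averaging step -- the kind of per-GPU signal DBench records."""
    return [float(np.linalg.norm(np.concatenate([t.ravel() for t in tensors])))
            for tensors in replicas]

# Toy example: three replicas, each holding two parameter tensors.
replicas = [[np.random.randn(4, 4), np.random.randn(8)] for _ in range(3)]
print(parameter_l2_norms(replicas))
```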

Experiments conducted using DBench on the Summit supercomputer, utilizing NVIDIA V100 GPUs, revealed several key observations. The study used four applications covering image classification (ResNet20, ResNet50, DenseNet100) and natural language processing (LSTM) with varying model sizes and datasets. The communication graphs explored included ring, torus, exponential, and complete graphs.

The benchmarking results showed that, similar to centralized learning, decentralized data parallel training also struggles with scalability and generality when the training scale increases, leading to a decrease in model accuracy. A significant finding was the positive correlation between model accuracy and the number of connections in a communication graph: more connections generally led to better accuracy. For instance, a complete graph (most connected) typically yielded better results than a ring graph (least connected).

Another crucial observation was the sensitivity of model accuracy to the variance of parameter tensors across model replicas. High variances in parameter tensors were consistently linked to lower model accuracy. Interestingly, these variances were most pronounced at the early stages of training and diminished as training progressed. This suggested that a highly connected graph is beneficial at the beginning for better accuracy, while less connected graphs could be used later to reduce communication costs without sacrificing accuracy.
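
Using the same per-replica layout as the previous sketch, a hypothetical helper for this variance signal might look as follows; it is an assumption for illustration, not code from the paper.

```python
import numpy as np

def cross_replica_variance(replicas):
    """Mean per-element variance of each parameter tensor across replicas.
    Large values (typical early in training) indicate the local models have
    drifted apart, which the benchmarking results link to lower accuracy."""
    n_tensors = len(replicas[0])
    stacked = [np.stack([rep[i] for rep in replicas]) for i in range(n_tensors)]
    return [float(np.var(s, axis=0).mean()) for s in stacked]
```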

The study also highlighted that conventional learning-rate configurations, common in centralized deep learning, do not always perform well in decentralized settings, especially at larger scales or with more connections. Fine-tuning learning rates, sometimes using square root scaling instead of linear scaling, was found to improve convergence in challenging scenarios.
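
As a concrete illustration of the two scaling rules, here is a small sketch; the reference batch size of 256, the base learning rate, and the function name are assumptions rather than values taken from the paper.

```python
def scaled_lr(base_lr, global_batch, ref_batch=256, rule="linear"):
    """Scale the base learning rate with the global batch size.
    'linear' is the common centralized heuristic; 'sqrt' grows the rate
    more conservatively, which helped convergence in the harder
    decentralized settings described above."""
    ratio = global_batch / ref_batch
    return base_lr * ratio if rule == "linear" else base_lr * ratio ** 0.5

# Example: a base LR of 0.1 tuned for batch 256, scaled to a global batch of 4096.
print(scaled_lr(0.1, 4096, rule="linear"))  # 1.6
print(scaled_lr(0.1, 4096, rule="sqrt"))    # 0.4
```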

Introducing Ada: An Adaptive Solution

Based on these insights, the researchers proposed Ada, a decentralized adaptive approach for large-scale DNN training. Ada follows a decentralized Stochastic Gradient Descent (SGD) method but dynamically adapts the communication graph throughout training iterations. Unlike existing methods that use fixed communication graphs, Ada starts with a highly connected graph and gradually reduces the number of connections per node as training progresses.

Ada utilizes a ring lattice as its base for structuring the varying graph. A ring lattice allows for easy adjustment of the “coordination number” (k), which controls the number of neighbors each node connects to. By starting with a large k (e.g., making it a complete graph) and linearly decreasing it over epochs, Ada aims to achieve the high accuracy benefits of highly connected graphs early on, and then transition to less connected graphs later for lower communication costs, without compromising model accuracy.
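
A minimal sketch of this schedule follows, assuming a ring-lattice neighborhood and a linear decay of the coordination number k from a complete graph down to a plain ring; the helper names and exact decay endpoints are illustrative, not the paper's implementation.

```python
def ring_lattice_neighbors(rank, world_size, k):
    """Neighbors of `rank` in a ring lattice: about k/2 nodes on each side
    of the ring (rounded up)."""
    half = max(1, (k + 1) // 2)
    return sorted({(rank + d) % world_size for d in range(-half, half + 1)} - {rank})

def coordination_number(epoch, total_epochs, world_size):
    """Linearly decay k from a complete graph (world_size - 1 neighbors)
    down to 2 (a plain ring) over the course of training."""
    k_max, k_min = world_size - 1, 2
    frac = epoch / max(1, total_epochs - 1)
    return max(k_min, round(k_max - frac * (k_max - k_min)))

# Example: node 0's neighbors among 8 workers at the start and end of 90 epochs.
print(ring_lattice_neighbors(0, 8, coordination_number(0, 90, 8)))   # all 7 others
print(ring_lattice_neighbors(0, 8, coordination_number(89, 90, 8)))  # [1, 7]
```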

Validation and Impact

Ada was validated on the Summit supercomputer across various tasks, including ResNet20 and DenseNet100 on CIFAR10, and LSTM on WikiText2, using 96 GPUs. Crucially, it was also tested on ResNet50 for ImageNet-1K on an unprecedented scale of 1008 GPUs, making it the largest-scale decentralized training ever performed for this model. The results were highly promising.

For ResNet20 and DenseNet100, Ada converged to acceptable accuracy faster than centralized complete-graph, decentralized ring, and decentralized torus methods. For the larger LSTM model, where decentralized ring and torus methods failed to converge at 96 GPUs, Ada converged successfully with a good perplexity score. Most notably, on the 1008-GPU ResNet50 ImageNet-1K training, Ada achieved approximately 73% top-1 classification accuracy. This is significantly better than the 35% and 56% achieved by ring- and torus-based decentralized SGD, respectively, and comparable to state-of-the-art centralized training, even accounting for the accuracy drop typically associated with large-batch training.

This work introduces DBench for white-box analysis and Ada as an adaptive solution, marking significant progress in understanding and improving the scalability and generality of decentralized learning for production use. The findings and the proposed adaptive approach pave the way for more stable and efficient large-scale decentralized deep learning. You can find the full research paper here.

Nikhil Patel
