TLDR: Attention-based Double Compression (ADC) is a novel framework for Split Learning that drastically reduces communication overhead in Vision Transformer (ViT) training. It employs a two-step compression strategy: merging similar samples based on attention scores and then discarding the least meaningful tokens. This dual approach allows for significant data reduction during both forward and backward passes, maintaining high model accuracy even under aggressive compression, and outperforming existing communication-efficient Split Learning methods.
In the rapidly evolving landscape of artificial intelligence, deep neural networks (DNNs) have become indispensable across various fields, from computer vision to medical diagnostics. However, the sheer computational power and memory required to train these complex models pose significant challenges, especially when deploying them on devices with limited resources, often referred to as edge devices. Traditional cloud-based training, while powerful, demands transmitting vast amounts of raw data from these edge devices to central servers, leading to substantial communication overhead and raising critical privacy concerns.
Introducing Split Learning and its Challenges
To address these issues, a promising approach called Split Learning (SL) has emerged. SL works by dividing a neural network between an edge device and a cloud server. The client device handles the initial layers of the network using local data, then sends intermediate features (activations) and labels to the cloud server. The server processes the remaining layers, computes gradients, and sends them back to the client for model updates. This collaborative method reduces the need to send raw data, enhancing privacy and communication efficiency by only transmitting intermediate activations and gradients.
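To make the exchange concrete, here is a minimal sketch of a Split Learning training loop on a toy two-weight model. It is not taken from the ADC paper: the function names, the scalar model, and the squared-error loss are all assumptions chosen for illustration. The point is which values cross the network: the activation (with its label) goes up, and the gradient with respect to that activation comes back.

```python
# Toy split-learning loop for the model y = w_server * (w_client * x).
# All names are hypothetical; a real system exchanges tensors over a network.

def client_forward(w_client, x):
    """Client side: run the first layer on local data."""
    return w_client * x  # intermediate activation, sent to the server


def server_step(w_server, activation, label, lr):
    """Server side: finish the forward pass, compute loss and gradients."""
    pred = w_server * activation
    error = pred - label
    loss = 0.5 * error ** 2
    grad_w_server = error * activation   # dL/dw_server
    grad_activation = error * w_server   # dL/d(activation), sent back to the client
    return w_server - lr * grad_w_server, grad_activation, loss


def client_backward(w_client, x, grad_activation, lr):
    """Client side: chain rule through activation = w_client * x."""
    return w_client - lr * grad_activation * x


def train(x, label, steps=50, lr=0.1):
    w_client, w_server = 0.5, 0.5
    losses = []
    for _ in range(steps):
        activation = client_forward(w_client, x)              # uplink
        w_server, grad_act, loss = server_step(w_server, activation, label, lr)
        w_client = client_backward(w_client, x, grad_act, lr)  # downlink
        losses.append(loss)
    return losses


losses = train(x=1.0, label=2.0)
```

The raw input `x` never leaves the client in this sketch. Only `activation`, `label`, and `grad_act` are exchanged, and these are the quantities a compression scheme has to shrink.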
Despite its advantages, communication bottlenecks remain a significant hurdle in practical SL implementations. Existing solutions often involve autoencoders or compression techniques like sparsification and quantization. However, many of these methods struggle to maintain model accuracy when aggressively compressing data. The core problem is that they apply a uniform compression strategy, treating all data components equally, regardless of their importance to the learning process. This can lead to valuable information being lost, especially under high compression rates.
Attention-based Double Compression (ADC): A Novel Solution
Attention-based Double Compression (ADC) is a framework designed to address this challenge. It significantly reduces the communication overhead in Split Learning, particularly for Vision Transformers (ViTs), while maintaining high performance. Its core innovation is a two-step compression strategy that exploits the attention scores Transformer-based models already compute.
The first step is Batch Compression. Instead of sending every sample in a batch individually, ADC merges the activations of similar samples. Similarity is measured on the average attention scores computed in the last client layer, specifically the CLS-token attention scores. The approach is class-agnostic: it can merge samples from different classes without compromising the model’s ability to generalize or its final accuracy. By clustering these attention scores, ADC reduces the number of samples to be transmitted down to a target batch size.
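The merging step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the article does not name the clustering algorithm, so plain k-means with a deterministic initialization is swapped in here, merged activations are simple averages, and the helper names are hypothetical. Each sample's CLS-token attention vector is assumed to be already averaged over heads. How labels of merged samples are combined is not covered.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vectors(vectors):
    """Element-wise mean of a sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_by_attention(cls_attention, target_size, iterations=10):
    """Assign each sample to one of `target_size` clusters with plain k-means.

    Assumption: centroids are initialized from the first `target_size` samples.
    """
    centroids = [list(v) for v in cls_attention[:target_size]]
    assignment = [0] * len(cls_attention)
    for _ in range(iterations):
        assignment = [
            min(range(target_size), key=lambda c: squared_distance(v, centroids[c]))
            for v in cls_attention
        ]
        for c in range(target_size):
            members = [v for v, a in zip(cls_attention, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vectors(members)
    return assignment


def merge_batch(activations, assignment, target_size):
    """Average the activations of all samples that share a cluster.

    activations[i] is the list of token vectors of sample i.
    """
    merged = []
    for c in range(target_size):
        members = [act for act, a in zip(activations, assignment) if a == c]
        if members:
            # Average position by position across the cluster members.
            merged.append([mean_vectors(tokens) for tokens in zip(*members)])
    return merged


# Four samples, two tokens each, feature width two (made-up values).
cls_attention = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
activations = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[5.0, 5.0], [6.0, 6.0]],
    [[3.0, 3.0], [4.0, 4.0]],
    [[7.0, 7.0], [8.0, 8.0]],
]
assignment = cluster_by_attention(cls_attention, target_size=2)
merged = merge_batch(activations, assignment, target_size=2)
```

In this toy batch, samples 0 and 2 attend to the same token and end up merged, as do samples 1 and 3, so four activations shrink to two.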
Following batch compression, the second step is Token Selection. Even after merging, each combined activation might still contain redundant information. ADC further reduces dimensionality by discarding the least meaningful tokens. It retains only the top-k tokens that correspond to the most important positions, again guided by the attention information used during the merging phase. This ensures that only the most informative parts of the activations are communicated.
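A minimal sketch of the token selection step is shown below. The function name and the way scores are passed in are assumptions; the article only states that the top-k tokens are kept according to the attention information used for merging. Whether the CLS token itself is always retained is not specified, so this sketch treats every position the same way.

```python
def select_top_k_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order.

    tokens: list of token vectors for one (merged) sample
    scores: one importance score per token, e.g. CLS-token attention
    Returns the kept tokens and their original positions.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:k])  # restore the original token order
    return [tokens[i] for i in kept_positions], kept_positions


tokens = [[0.0], [1.0], [2.0], [3.0]]
scores = [0.10, 0.50, 0.05, 0.35]
kept_tokens, kept_positions = select_top_k_tokens(tokens, scores, k=2)
```

Returning the kept positions alongside the tokens is a design choice of this sketch: the receiver may need them to know which positions the surviving tokens came from.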
By combining these two strategies, ADC compresses data along two orthogonal axes: samples and features. This dual approach not only reduces the amount of data sent during the forward pass but also naturally compresses the gradients during the backward pass, allowing the entire model to be trained without additional tuning or approximations of the gradients.
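Because the two steps act on different axes, their savings multiply. The arithmetic can be sketched as below, with hypothetical function names. The gradient with respect to the transmitted tensor has the same shape as that tensor, so the same ratio applies to the backward pass. The sizes in the usage lines are illustrative only: 197 tokens of width 384 correspond to DeiT-S on 224x224 inputs, while the batch size, target batch size, and number of kept tokens are made up.

```python
def payload_elements(batch, tokens, dim):
    """Number of scalar values in a batch of token activations."""
    return batch * tokens * dim


def compression_ratio(batch, tokens, dim, merged_batch, kept_tokens):
    """Fraction of the original elements that is actually transmitted.

    Batch compression shrinks `batch` to `merged_batch`; token selection
    shrinks `tokens` to `kept_tokens`. The feature width is unchanged.
    """
    original = payload_elements(batch, tokens, dim)
    compressed = payload_elements(merged_batch, kept_tokens, dim)
    return compressed / original


# Illustrative sizes only; see the note above.
ratio = compression_ratio(batch=128, tokens=197, dim=384,
                          merged_batch=32, kept_tokens=50)
```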
Performance and Impact
Extensive simulations demonstrate that ADC significantly outperforms state-of-the-art Split Learning frameworks. It achieves superior communication efficiency and consistently preserves high accuracy across two architectures (DeiT-S and DeiT-T) and two datasets (CIFAR-100 and Food-101). Notably, ADC excels in scenarios requiring aggressive compression, where many competing methods experience severe performance degradation. For instance, ADC can achieve near-baseline accuracy even at very low compression ratios (that is, when only a small fraction of the original data is transmitted), while other methods need substantially higher ratios, and therefore more communication, for comparable results.
The framework also exhibits remarkably stable convergence, suggesting that it not only maintains accuracy under aggressive compression but may also act as an implicit regularizer during training. Ablation studies further confirm the robustness of ADC, showing that its performance improves with larger batch sizes due to increased sample diversity, and that using CLS-token attention scores for merging yields the best results due to its alignment with the token selection process. The method also performs better when the split point is deeper in the model, as class tokens in deeper layers are more semantically informative.
Looking Ahead
Attention-based Double Compression represents a significant advancement in communication-efficient Split Learning for Vision Transformers. Its ability to jointly reduce redundancy across both batch and token dimensions, while preserving model accuracy, makes it highly suitable for deployment in environments with stringent communication constraints. Future work aims to extend this framework to more realistic communication environments, including noisy wireless channels, and to multi-client scenarios like Federated Learning, to enhance scalability in large networks of edge devices.


