Optimizing AI Training: A Dual Compression Strategy for Vision Transformers in Split Learning

TLDR: Attention-based Double Compression (ADC) is a novel framework for Split Learning that drastically reduces communication overhead in Vision Transformer (ViT) training. It employs a two-step compression strategy: merging similar samples based on attention scores and then discarding the least meaningful tokens. This dual approach allows for significant data reduction during both forward and backward passes, maintaining high model accuracy even under aggressive compression, and outperforming existing communication-efficient Split Learning methods.

In the rapidly evolving landscape of artificial intelligence, deep neural networks (DNNs) have become indispensable across various fields, from computer vision to medical diagnostics. However, the sheer computational power and memory required to train these complex models pose significant challenges, especially when deploying them on devices with limited resources, often referred to as edge devices. Traditional cloud-based training, while powerful, demands transmitting vast amounts of raw data from these edge devices to central servers, leading to substantial communication overhead and raising critical privacy concerns.

Introducing Split Learning and its Challenges

To address these issues, a promising approach called Split Learning (SL) has emerged. SL works by dividing a neural network between an edge device and a cloud server. The client device handles the initial layers of the network using local data, then sends intermediate features (activations) and labels to the cloud server. The server processes the remaining layers, computes gradients, and sends them back to the client for model updates. This collaborative method reduces the need to send raw data, enhancing privacy and communication efficiency by only transmitting intermediate activations and gradients.
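The exchange described above can be sketched with a toy model. This is a minimal illustration, not the setup used in the research: the "network" is just two scalar weights (`w_client` on the device, `w_server` in the cloud), and all function names, the data, and the learning rate are assumptions made for the example. What it shows is which quantities cross the link: activations and labels go up, activation gradients come back.

```python
# Toy Split Learning loop: the client owns w_client, the server owns w_server.
# Everything here (names, data, learning rate) is hypothetical and illustrative.

def client_forward(w_client, x):
    # Client-side layers: produce the intermediate activations to transmit.
    return [w_client * xi for xi in x]

def server_step(w_server, activations, labels, lr):
    # Server-side layers: finish the forward pass, compute loss and gradients.
    n = len(activations)
    errors = [w_server * h - y for h, y in zip(activations, labels)]
    loss = sum(e * e for e in errors) / (2 * n)
    grad_w_server = sum(e * h for e, h in zip(errors, activations)) / n
    # Gradient w.r.t. the received activations: this is what is sent back.
    grad_activations = [e * w_server / n for e in errors]
    return w_server - lr * grad_w_server, grad_activations, loss

def client_backward(w_client, x, grad_activations, lr):
    # Client finishes backpropagation using only the returned gradients.
    grad_w_client = sum(g * xi for g, xi in zip(grad_activations, x))
    return w_client - lr * grad_w_client

def train(rounds=200, lr=0.05):
    x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # target relation: y = 2x
    w_client, w_server, losses = 0.5, 0.5, []
    for _ in range(rounds):
        activations = client_forward(w_client, x)           # uplink payload
        w_server, grads, loss = server_step(w_server, activations, y, lr)
        w_client = client_backward(w_client, x, grads, lr)  # downlink payload
        losses.append(loss)
    return w_client, w_server, losses

w_client, w_server, losses = train()
```

The raw inputs `x` never leave the client in this sketch; only `activations`, the labels, and `grads` are exchanged, and these are the payloads that a communication-efficient scheme would try to shrink.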

Despite its advantages, communication bottlenecks remain a significant hurdle in practical SL implementations. Existing solutions often involve autoencoders or compression techniques like sparsification and quantization. However, many of these methods struggle to maintain model accuracy when aggressively compressing data. The core problem is that they apply a uniform compression strategy, treating all data components equally, regardless of their importance to the learning process. This can lead to valuable information being lost, especially under high compression rates.
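To make the "uniform strategy" point concrete, here is a small sketch of plain uniform quantization. It is a generic textbook scheme, not the specific baseline evaluated in the research, and the function name and sample values are assumptions. Every value is rounded to the same grid, so a small entry is flattened whether or not it matters for learning.

```python
def quantize_uniform(values, bits):
    # Map every value onto the same evenly spaced grid between min and max.
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return [lo for _ in values]
    step = (hi - lo) / levels
    return [lo + round((v - lo) / step) * step for v in values]

# With 1 bit there are only two levels, so 0.1 and 0.9 lose their identity.
quantized = quantize_uniform([0.0, 0.1, 0.9, 1.0], bits=1)
```

Nothing in this function knows which entries the model relies on, which is the limitation an importance-aware scheme tries to avoid.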

Attention-based Double Compression (ADC): A Novel Solution

A new framework, named Attention-based Double Compression (ADC), offers a novel solution to this challenge. ADC is designed to significantly reduce the communication overhead in Split Learning, particularly for Vision Transformers (ViTs), while maintaining high performance. The core innovation of ADC lies in its two-step, intelligent compression strategy that leverages the inherent properties of Transformer-based models.

The first step is Batch Compression. Instead of sending every sample in a batch individually, ADC merges the activations of similar samples. Similarity is judged by the average attention scores computed in the last client layer, specifically the CLS-token attention scores. The approach is class-agnostic: it can merge samples from different classes without compromising the model’s ability to generalize or its final accuracy. By clustering these attention scores, ADC reduces the batch to a target number of merged samples before transmission.
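The batch compression step can be sketched as follows. Several details are assumptions because the text does not specify them: the clustering algorithm is taken to be a simple k-means with a deterministic start, merged activations are formed by element-wise averaging, and all function names are hypothetical. How labels of merged samples are handled is not covered here.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_vectors(vectors):
    # Element-wise mean of equally sized vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans_assign(points, k, iters=20):
    # Assumed clustering: plain k-means seeded with the first k points.
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean_vectors(members)
    return assign

def merge_batch(activations, cls_attention, target_size):
    # activations[i][t] is the feature vector of token t in sample i.
    # cls_attention[i] holds the CLS-token attention scores of sample i.
    assign = kmeans_assign(cls_attention, target_size)
    merged, groups = [], []
    for c in range(target_size):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue  # an empty cluster contributes nothing
        num_tokens = len(activations[0])
        merged.append([mean_vectors([activations[i][t] for i in members])
                       for t in range(num_tokens)])
        groups.append(members)
    return merged, groups

# Four samples, three tokens, two features; sample i is filled with float(i).
activations = [[[float(i)] * 2 for _ in range(3)] for i in range(4)]
cls_attention = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7],
                 [0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]
merged, groups = merge_batch(activations, cls_attention, target_size=2)
```

In this toy batch, samples 0 and 2 attend to the same region and end up in one group, as do samples 1 and 3, so four activations become two.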

Following batch compression, the second step is Token Selection. Even after merging, each combined activation might still contain redundant information. ADC further reduces dimensionality by discarding the least meaningful tokens. It retains only the top-k tokens that correspond to the most important positions, again guided by the attention information used during the merging phase. This ensures that only the most informative parts of the activations are communicated.
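A minimal sketch of the token selection step, assuming one importance score per token position is already available from the attention information. The function name is hypothetical, and whether ADC treats the CLS token specially is not stated in the text, so no token gets special handling here.

```python
def select_top_k_tokens(tokens, scores, k):
    # Rank positions by score (highest first), keep the best k,
    # then restore the original token order for the kept positions.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:k])
    return [tokens[i] for i in kept_positions], kept_positions

tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
kept_tokens, kept_positions = select_top_k_tokens(tokens, scores, k=2)
```

Returning `kept_positions` alongside the tokens matters in practice: the receiving side needs to know which positions survived in order to interpret the shortened sequence.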

By combining these two strategies, ADC compresses data along two orthogonal axes: samples and features. This dual approach not only reduces the amount of data sent during the forward pass but also naturally compresses the gradients during the backward pass, allowing the entire model to be trained without additional tuning or approximations of the gradients.
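The combined saving is easy to estimate by counting transmitted elements. The sketch below assumes the payload is a dense tensor of shape (samples, tokens, features) and that the returned gradient has the same shape as the compressed activations. The concrete numbers are illustrative only and are not taken from the reported experiments.

```python
def payload_elements(batch, tokens, dim):
    # Number of scalars in a dense (batch, tokens, dim) tensor.
    return batch * tokens * dim

def compression_ratio(batch, tokens, dim, merged_batch, kept_tokens):
    # Fraction of the original payload that is still transmitted.
    original = payload_elements(batch, tokens, dim)
    compressed = payload_elements(merged_batch, kept_tokens, dim)
    return compressed / original

# Hypothetical setting: 64 samples merged into 16, 197 tokens cut to 49.
ratio = compression_ratio(batch=64, tokens=197, dim=384,
                          merged_batch=16, kept_tokens=49)
```

Because the two reductions act on different axes, their effects multiply: keeping a quarter of the samples and about a quarter of the tokens leaves roughly 6% of the original payload, and the same figure applies to the gradients sent back.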

Performance and Impact

Extensive simulations demonstrate that ADC significantly outperforms state-of-the-art Split Learning frameworks. It achieves superior communication efficiency and consistently preserves high accuracy across two architectures (DeiT-S and DeiT-T) and two datasets (CIFAR100 and Food101). Notably, ADC excels in scenarios requiring aggressive compression, where many competing methods experience severe performance degradation. For instance, ADC can achieve near-baseline accuracy even at very low compression ratios (that is, when only a small fraction of the original data is transmitted), while other methods require substantially higher ratios for comparable results.

The framework also exhibits remarkably stable convergence, suggesting that it not only maintains accuracy under aggressive compression but may also act as an implicit regularizer during training. Ablation studies further confirm the robustness of ADC. Its performance improves with larger batch sizes, owing to increased sample diversity, and using CLS-token attention scores for merging yields the best results because they align with the token selection process. The method also performs better when the split point is deeper in the model, as class tokens in deeper layers are more semantically informative.

Looking Ahead

Attention-based Double Compression represents a significant advancement in communication-efficient Split Learning for Vision Transformers. Its ability to jointly reduce redundancy across both batch and token dimensions, while preserving model accuracy, makes it highly suitable for deployment in environments with stringent communication constraints. Future work aims to extend this framework to more realistic communication environments, including noisy wireless channels, and to multi-client scenarios like Federated Learning, to enhance scalability in large networks of edge devices. For more details, you can refer to the full research paper here.
