TLDR: This research introduces an autoregressive-aware split computing framework for deploying Large Language Models (LLMs) on memory- and latency-constrained edge devices. It features One-Point Split Compression (OPSC) for memory efficiency, a two-stage intermediate compression pipeline (Threshold Splitting and Token-Wise Adaptive Bit Quantization) to reduce communication overhead while preserving accuracy, and a unified optimization framework to select optimal settings. The approach significantly improves inference speed and reduces server load, making large-scale LLMs practical for real-time IoT applications.
Large Language Models (LLMs) have transformed many areas, from natural language understanding to code generation, powering applications like ChatGPT. However, their massive size and computational demands make it incredibly challenging to run them on everyday devices with limited resources, such as those found in the Internet of Things (IoT). This often means relying heavily on powerful cloud servers, which can lead to bottlenecks and underutilize the growing capabilities of edge devices.
A new research paper, “Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing”, introduces an innovative solution to this problem. The authors, Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, and Jae-Mo Kang, propose an autoregressive-aware split computing framework designed specifically for deploying LLMs on these resource-constrained edge devices.
The Challenge of LLMs on Edge Devices
Traditionally, deploying LLMs on edge devices faces two major hurdles: immense memory requirements and the iterative nature of how LLMs generate text (autoregressive inference). Existing split computing methods, which divide a model’s workload between an edge device and a cloud server, weren’t built to handle the continuous token generation and the ever-expanding ‘key-value’ (KV) cache that LLMs use. This often results in devices running out of memory or experiencing significant delays.
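To get a feel for why the KV cache becomes a problem, here is a rough back-of-the-envelope calculation (not taken from the paper; the dimensions are Llama2-7B-style defaults used purely for illustration):

```python
# Rough KV-cache growth estimate for an autoregressive LLM.
# Dimensions below are Llama2-7B-style (32 layers, 32 heads, head_dim 128,
# fp16 values); they are illustrative defaults, not figures from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
    # Each layer caches one key and one value vector per head for every token.
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_val
    return seq_len * per_token

print(f"{kv_cache_bytes(1) / 1024:.0f} KB per generated token")    # ~512 KB
print(f"{kv_cache_bytes(4096) / 1024**3:.1f} GB at 4,096 tokens")  # ~2.0 GB
```

On a board with only a few gigabytes of shared memory, that ever-growing cache alone can crowd out the model weights as the sequence lengthens.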
A Novel Split Computing Framework
The new framework tackles these challenges with three key contributions:
First, it introduces **One-Point Split Compression (OPSC)**. This is a clever mixed-precision quantization scheme. Imagine splitting an LLM into two parts: a front-end that runs on the edge device and a back-end that runs on the cloud. OPSC strategically applies different levels of precision (quantization) to these segments. The part on the edge device is compressed more aggressively to save memory, while the cloud part can maintain higher precision. This prevents memory failures on the edge device without sacrificing too much accuracy.
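Conceptually, OPSC amounts to picking one split point and assigning a lower bit-width to the layers before it than to the layers after it. The sketch below illustrates that idea with a naive symmetric per-tensor weight quantizer; the function names, bit-widths, and quantizer are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def fake_quantize(w, n_bits):
    # Naive symmetric per-tensor quantizer, used only to illustrate the idea.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def opsc_style_split(layers, split_point, edge_bits=4, cloud_bits=8):
    # Front-end (edge) layers are compressed aggressively; back-end (cloud)
    # layers keep higher precision. The bit-widths here are placeholders.
    edge, cloud = layers[:split_point], layers[split_point:]
    for segment, bits in ((edge, edge_bits), (cloud, cloud_bits)):
        for layer in segment:
            for p in layer.parameters():
                p.data = fake_quantize(p.data, bits)
    return edge, cloud
```

In the actual framework, the split point and the two precision settings are not fixed by hand but chosen by the optimization step described below.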
Second, the framework includes a **two-stage intermediate compression pipeline** for the data that must be sent from the edge device to the cloud (a rough sketch of both stages follows the list). The pipeline combines two techniques:
- **Threshold Splitting (TS):** LLMs are highly sensitive to a small number of ‘outlier’ values in their intermediate activations. TS identifies and separates these large-magnitude values so they can be encoded with their integrity preserved, keeping accuracy loss minimal.
- **Token-Wise Adaptive Bit Quantization (TAB-Q):** The remaining, less critical values are compressed with an adaptive quantization scheme that adjusts the bit-width per token based on the data’s distribution, keeping transfers small while preserving contextually important information.
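Here is a minimal sketch of how these two stages could fit together. The threshold value, bit choices, and error budget are hypothetical placeholders; the paper's actual selection rules are part of its optimization framework.

```python
import numpy as np

def threshold_split(x, threshold):
    # Stage 1 (TS-style): separate large-magnitude outliers from the rest.
    outlier_mask = np.abs(x) > threshold
    outlier_idx = np.flatnonzero(outlier_mask)
    outliers = x[outlier_mask]                # kept at full precision
    inliers = np.where(outlier_mask, 0.0, x)  # left for quantization
    return outlier_idx, outliers, inliers

def tabq_style_quantize(inliers, bit_choices=(2, 4, 8), target_err=1e-2):
    # Stage 2 (TAB-Q-style): per token, pick the smallest bit-width whose
    # quantization error stays under a budget; the real rule may differ.
    quantized, bits_used = [], []
    for token in inliers:                     # inliers: (seq_len, hidden)
        for b in bit_choices:
            qmax = 2 ** (b - 1) - 1
            scale = max(np.abs(token).max(), 1e-8) / qmax
            q = np.round(token / scale) * scale
            if np.abs(q - token).mean() <= target_err or b == bit_choices[-1]:
                quantized.append(q)
                bits_used.append(b)
                break
    return np.stack(quantized), bits_used
```

The edge device would then transmit the outlier indices and values at full precision alongside the low-bit inlier payload.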
Third, the researchers developed a **unified optimization framework**. This framework intelligently selects the best ‘split points’ (where the model is divided), the optimal quantization settings for both model weights and activations, and the appropriate sequence lengths. This joint optimization ensures that the system meets strict memory and latency constraints while maximizing overall performance and accuracy.
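In spirit, this is a constrained search over a joint configuration space. The toy sketch below brute-forces that search; `est_memory`, `est_latency`, and `est_accuracy` stand in for cost models the system would supply, and the exhaustive loop is simply the easiest way to show the idea, not the paper's solver.

```python
from itertools import product

def select_configuration(split_points, weight_bits, act_bits, seq_lens,
                         mem_limit, latency_limit,
                         est_memory, est_latency, est_accuracy):
    # Pick the feasible (split point, weight bits, activation bits, sequence
    # length) combination with the best predicted accuracy.
    best_cfg, best_acc = None, float("-inf")
    for cfg in product(split_points, weight_bits, act_bits, seq_lens):
        if est_memory(*cfg) > mem_limit or est_latency(*cfg) > latency_limit:
            continue  # violates the device's memory or latency budget
        acc = est_accuracy(*cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg
```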
Demonstrated Performance and Scalability
Extensive evaluations across various LLMs, including Llama2 variants, and different hardware platforms (like a Jetson Xavier NX edge device and an A6000 GPU cloud server) showed impressive results. The framework achieved a 1.49 times inference speedup and significantly reduced communication overhead. Crucially, it maintained or even improved model accuracy compared to state-of-the-art quantization methods like SmoothQuant, OmniQuant, and Atom.
The proposed method also demonstrated superior scalability, handling an increasing number of edge devices with a lower server workload compared to cloud-only approaches. It effectively offloads more inference steps to the edge, reducing the burden on the central server. This breakthrough makes it possible to deploy LLMs with hundreds of gigabytes of memory requirements on severely resource-constrained edge devices, opening up new possibilities for real-time AI applications in IoT environments.


