TLDR: This research introduces an autoregressive-aware split computing framework for deploying Large Language Models (LLMs) on memory- and latency-constrained edge devices. It features One-Point Split Compression (OPSC) for memory efficiency, a two-stage intermediate compression pipeline (Threshold Splitting and Token-Wise Adaptive Bit Quantization) to reduce communication overhead while preserving accuracy, and a unified optimization framework to select optimal settings. The approach significantly improves inference speed and reduces server load, making large-scale LLMs practical for real-time IoT applications.
Large Language Models (LLMs) have transformed many areas, from natural language understanding to code generation, powering applications like ChatGPT. However, their massive size and computational demands make it incredibly challenging to run them on everyday devices with limited resources, such as those found in the Internet of Things (IoT). This often means relying heavily on powerful cloud servers, which can lead to bottlenecks and underutilize the growing capabilities of edge devices.
A new research paper, “Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing”, introduces an innovative solution to this problem. The authors, Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, and Jae-Mo Kang, propose an autoregressive-aware split computing framework designed specifically for deploying LLMs on these resource-constrained edge devices.
The Challenge of LLMs on Edge Devices
Traditionally, deploying LLMs on edge devices faces two major hurdles: immense memory requirements and the iterative nature of how LLMs generate text (autoregressive inference). Existing split computing methods, which divide a model’s workload between an edge device and a cloud server, weren’t built to handle the continuous token generation and the ever-expanding ‘key-value’ (KV) cache that LLMs use. This often results in devices running out of memory or experiencing significant delays.
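To get a feel for why the KV cache becomes a problem, here is a rough back-of-the-envelope calculation (not taken from the paper; the dimensions are Llama2-7B-style defaults used purely for illustration):

```python
# Rough KV-cache growth estimate for an autoregressive LLM.
# Dimensions below are Llama2-7B-style (32 layers, 32 heads, head_dim 128,
# fp16 values); they are illustrative defaults, not figures from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
    # Each layer caches one key and one value vector per head for every token.
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_val
    return seq_len * per_token

print(f"{kv_cache_bytes(1) / 1024:.0f} KB per generated token")    # ~512 KB
print(f"{kv_cache_bytes(4096) / 1024**3:.1f} GB at 4,096 tokens")  # ~2.0 GB
```

On a board with only a few gigabytes of shared memory, that ever-growing cache alone can crowd out the model weights as the sequence lengthens.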
A Novel Split Computing Framework
The new framework tackles these challenges with three key contributions:
First, it introduces **One-Point Split Compression (OPSC)**. This is a clever mixed-precision quantization scheme. Imagine splitting an LLM into two parts: a front-end that runs on the edge device and a back-end that runs on the cloud. OPSC strategically applies different levels of precision (quantization) to these segments. The part on the edge device is compressed more aggressively to save memory, while the cloud part can maintain higher precision. This prevents memory failures on the edge device without sacrificing too much accuracy.
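Conceptually, OPSC amounts to picking one split point and assigning a lower bit-width to the layers before it than to the layers after it. The sketch below illustrates that idea with a naive symmetric per-tensor weight quantizer; the function names, bit-widths, and quantizer are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def fake_quantize(w, n_bits):
    # Naive symmetric per-tensor quantizer, used only to illustrate the idea.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def opsc_style_split(layers, split_point, edge_bits=4, cloud_bits=8):
    # Front-end (edge) layers are compressed aggressively; back-end (cloud)
    # layers keep higher precision. The bit-widths here are placeholders.
    edge, cloud = layers[:split_point], layers[split_point:]
    for segment, bits in ((edge, edge_bits), (cloud, cloud_bits)):
        for layer in segment:
            for p in layer.parameters():
                p.data = fake_quantize(p.data, bits)
    return edge, cloud
```

In the actual framework, the split point and the two precision settings are not fixed by hand but chosen by the optimization step described below.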
Second, the framework includes a **two-stage intermediate compression pipeline** for the data that must be sent from the edge device to the cloud (a rough sketch of both stages follows the list). The pipeline combines two techniques:
- **Threshold Splitting (TS):** LLMs are highly sensitive to a small number of ‘outlier’ values in their intermediate activations. TS identifies and separates these large-magnitude values so they can be encoded with their integrity preserved, keeping accuracy loss minimal.
- **Token-Wise Adaptive Bit Quantization (TAB-Q):** The remaining, less critical values are compressed with an adaptive quantization scheme that adjusts the bit-width per token based on the data’s distribution, keeping transfers small while preserving contextually important information.
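Here is a minimal sketch of how these two stages could fit together. The threshold value, bit choices, and error budget are hypothetical placeholders; the paper's actual selection rules are part of its optimization framework.

```python
import numpy as np

def threshold_split(x, threshold):
    # Stage 1 (TS-style): separate large-magnitude outliers from the rest.
    outlier_mask = np.abs(x) > threshold
    outlier_idx = np.flatnonzero(outlier_mask)
    outliers = x[outlier_mask]                # kept at full precision
    inliers = np.where(outlier_mask, 0.0, x)  # left for quantization
    return outlier_idx, outliers, inliers

def tabq_style_quantize(inliers, bit_choices=(2, 4, 8), target_err=1e-2):
    # Stage 2 (TAB-Q-style): per token, pick the smallest bit-width whose
    # quantization error stays under a budget; the real rule may differ.
    quantized, bits_used = [], []
    for token in inliers:                     # inliers: (seq_len, hidden)
        for b in bit_choices:
            qmax = 2 ** (b - 1) - 1
            scale = max(np.abs(token).max(), 1e-8) / qmax
            q = np.round(token / scale) * scale
            if np.abs(q - token).mean() <= target_err or b == bit_choices[-1]:
                quantized.append(q)
                bits_used.append(b)
                break
    return np.stack(quantized), bits_used
```

The edge device would then transmit the outlier indices and values at full precision alongside the low-bit inlier payload.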
Third, the researchers developed a **unified optimization framework**. This framework intelligently selects the best ‘split points’ (where the model is divided), the optimal quantization settings for both model weights and activations, and the appropriate sequence lengths. This joint optimization ensures that the system meets strict memory and latency constraints while maximizing overall performance and accuracy.
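In spirit, this is a constrained search over a joint configuration space. The toy sketch below brute-forces that search; `est_memory`, `est_latency`, and `est_accuracy` stand in for cost models the system would supply, and the exhaustive loop is simply the easiest way to show the idea, not the paper's solver.

```python
from itertools import product

def select_configuration(split_points, weight_bits, act_bits, seq_lens,
                         mem_limit, latency_limit,
                         est_memory, est_latency, est_accuracy):
    # Pick the feasible (split point, weight bits, activation bits, sequence
    # length) combination with the best predicted accuracy.
    best_cfg, best_acc = None, float("-inf")
    for cfg in product(split_points, weight_bits, act_bits, seq_lens):
        if est_memory(*cfg) > mem_limit or est_latency(*cfg) > latency_limit:
            continue  # violates the device's memory or latency budget
        acc = est_accuracy(*cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg
```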
Demonstrated Performance and Scalability
Extensive evaluations across various LLMs, including Llama2 variants, and different hardware platforms (like a Jetson Xavier NX edge device and an A6000 GPU cloud server) showed impressive results. The framework achieved a 1.49 times inference speedup and significantly reduced communication overhead. Crucially, it maintained or even improved model accuracy compared to state-of-the-art quantization methods like SmoothQuant, OmniQuant, and Atom.
The proposed method also demonstrated superior scalability, handling an increasing number of edge devices with a lower server workload compared to cloud-only approaches. It effectively offloads more inference steps to the edge, reducing the burden on the central server. This breakthrough makes it possible to deploy LLMs with hundreds of gigabytes of memory requirements on severely resource-constrained edge devices, opening up new possibilities for real-time AI applications in IoT environments.


