TLDR: This research introduces an efficient method for deploying large language models (LLMs), especially Mixture-of-Experts (MoE) models, on resource-limited edge devices. It combines Hessian-Aware Quantization (HAQ) for accurate model compression with a CPU-GPU collaborative inference scheme for optimized resource utilization. HAQ uses adaptive activation smoothing and Hessian-based weight quantization to maintain high accuracy at 8-bit precision, while the collaborative scheme dynamically offloads and caches expert modules between the CPU and GPU, cutting GPU memory usage by roughly 60% and improving inference latency and stability. The approach achieves near full-precision performance on models like Mixtral-8x7B, making LLMs practical for edge applications.
Deploying large language models (LLMs) on devices like smartphones, smart cameras, or industrial sensors, often referred to as ‘edge devices,’ presents a significant challenge. These devices have limited computing power and memory compared to powerful cloud servers. While LLMs offer incredible capabilities in natural language processing and other tasks, getting them to run efficiently in these resource-constrained environments is crucial for applications requiring low latency and high data security, such as smart terminals and automotive systems.
One popular architecture for scaling LLMs is the Mixture-of-Experts (MoE) model. MoE models enhance capacity through sparse activation, meaning only a few ‘expert’ modules are activated for any given input, rather than the entire model. However, deploying MoE models on edge devices faces two major hurdles: first, maintaining accuracy when compressing the model (quantization), because activations have unusual, outlier-heavy distributions; and second, efficiently managing and offloading these expert modules between the CPU and GPU to balance speed and memory usage.
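To make sparse activation concrete, here is a minimal sketch of a top-2 routed MoE layer in PyTorch. The layer sizes and the `gate`/`experts` names are illustrative, not taken from the paper; the point is simply that only two of the eight expert FFNs run for any given token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-2 MoE layer: only top_k of num_experts FFNs run per token."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)          # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                     # x: [tokens, d_model]
        scores = self.gate(x)                                 # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)        # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():             # only touched experts run
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out
```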
Overcoming Quantization Challenges with Hessian-Aware Quantization (HAQ)
The first challenge arises from ‘quantization,’ a process that reduces the precision of a model’s data (e.g., from 32-bit to 8-bit) to save memory and speed up computation. In LLMs, especially MoE models, activation data often contains ‘outliers’ – extreme values that can severely degrade accuracy when quantized. To tackle this, researchers have proposed a method called Hessian-Aware Quantization (HAQ).
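The damage a single outlier does is easy to show with a toy symmetric int8 quantizer (a generic sketch, not the paper’s exact scheme): one extreme value inflates the quantization scale and washes out the resolution available to every other value.

```python
import torch

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: scale set by the max magnitude."""
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

# Typical activations vs. the same tensor with one extreme outlier.
acts = torch.randn(1024)
acts_outlier = acts.clone()
acts_outlier[0] = 80.0                       # a single extreme value

for name, t in [("no outlier", acts), ("with outlier", acts_outlier)]:
    q, s = quantize_int8(t)
    err = (t - q.float() * s).abs().mean()   # reconstruction error after dequantization
    print(f"{name}: scale={s:.4f}, mean abs error={err:.4f}")
```

The outlier version uses a scale roughly 25x larger, so every ordinary value is rounded far more coarsely.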
HAQ introduces an ‘adaptive activation smoothing’ technique. Unlike previous methods that used fixed, empirically set parameters, HAQ dynamically determines the optimal smoothing factor for activations. This process makes the activation distribution more concentrated, effectively reducing the negative impact of outliers and improving quantization accuracy. Following this, HAQ employs ‘Hessian-based weight quantization.’ Inspired by methods like GPTQ, this step uses advanced mathematical concepts (the Hessian matrix) to understand how sensitive the model’s output is to changes in its weights. By minimizing the error between the original and quantized outputs, it ensures that accuracy is preserved even after significant compression.
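A rough picture of the smoothing step, in the spirit of SmoothQuant-style scale migration: per-channel scales move outlier magnitude from the activations into the weights before quantization, leaving the layer output mathematically unchanged. The simple error-driven search over candidate factors below is only a stand-in for the paper’s adaptive selection, and the Hessian-based weight quantization step itself is not shown.

```python
import torch

def quant_dequant_int8(t):
    """Symmetric per-tensor int8 fake-quantization (quantize, then dequantize)."""
    scale = t.abs().max() / 127.0 + 1e-8
    return (t / scale).round().clamp(-128, 127) * scale

def smooth_scales(act_absmax, W, alpha):
    """Scale migration: divide X by s and multiply W's rows by s, so X @ W is
    unchanged but the activation's outlier channels are flattened first."""
    w_absmax = W.abs().amax(dim=1)                           # per input channel
    s = (act_absmax ** alpha) / (w_absmax ** (1 - alpha) + 1e-8)
    return s.clamp(min=1e-5)

def pick_alpha(X_calib, W, candidates=(0.3, 0.5, 0.7, 0.9)):
    """Crude adaptive choice: keep the factor with the lowest int8 output error."""
    ref = X_calib @ W
    act_absmax = X_calib.abs().amax(dim=0)
    return min(
        candidates,
        key=lambda a: (
            quant_dequant_int8(X_calib / smooth_scales(act_absmax, W, a))
            @ quant_dequant_int8(W * smooth_scales(act_absmax, W, a).unsqueeze(1))
            - ref
        ).pow(2).mean().item(),
    )

# Example: 64 calibration tokens through a 512x512 projection with outlier channels.
X = torch.randn(64, 512); X[:, :4] *= 30.0
W = torch.randn(512, 512) * 0.02
print("chosen smoothing factor:", pick_alpha(X, W))
```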
Furthermore, HAQ incorporates a ‘device-aware heterogeneous precision adaptation’ strategy. This means it intelligently uses both the CPU and GPU on an edge device. The CPU, with its larger memory, stores model weights in a compressed low-bit format and performs de-quantization once during loading. The GPU, optimized for parallel computation, directly loads these 8-bit weights and uses specialized low-precision operations for fast inference. This division of labor optimizes resource allocation, boosting throughput and overall system performance.
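A minimal sketch of what that division of labor could look like in code. The class name and the just-in-time dequantization on the GPU path are illustrative assumptions; a real deployment would dispatch the resident int8 weights to a fused low-precision kernel instead.

```python
import torch

class DeviceAwareLinear(torch.nn.Module):
    """Sketch of the device-aware precision split:
    - CPU copy: dequantize the int8 weights once at load time, then run
      ordinary fp32 matmuls with no per-step unpacking cost.
    - GPU copy: keep the weights resident in int8 (4x smaller than fp32)
      and hand them to a low-precision kernel at inference time."""
    def __init__(self, q_weight: torch.Tensor, scale: float, device: str = "cpu"):
        super().__init__()
        self.device_kind = device
        self.scale = scale
        if device == "cpu":
            # one-time dequantization while loading the model
            self.weight = q_weight.float() * scale
        else:
            # weights stay int8 on the GPU
            self.weight = q_weight.to(device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.device_kind == "cpu":
            return x @ self.weight.t()
        # stand-in for a fused int8 kernel: dequantize just-in-time here
        return x @ (self.weight.float() * self.scale).t()
```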
Optimizing Performance with CPU-GPU Collaborative Inference
The second major challenge involves efficiently managing the numerous expert modules in MoE models. Since edge devices have limited GPU memory, not all experts can reside on the GPU simultaneously. The proposed solution includes a sophisticated CPU-GPU collaborative inference scheme.
This scheme features a ‘hybrid model offloading’ mechanism. It uses the CPU’s memory as auxiliary storage for experts that don’t fit on the GPU. A ‘predictor-based dynamic decision mechanism’ is key here: it estimates in real-time whether it’s faster to compute an expert’s output directly on the CPU or to transfer the expert’s parameters to the GPU for computation. This dynamic decision-making is especially important during different stages of LLM inference, like the ‘prefill’ stage (processing a long input sequence) versus the ‘decoding’ stage (generating one token at a time).
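The decision itself can be pictured as a small cost model. The constants below are placeholders rather than measured values, but they show why the prefill and decoding stages land on different choices: with thousands of prefill tokens hitting the same expert, paying the transfer cost once is worthwhile, while a single decoding token usually is not.

```python
def choose_placement(tokens_for_expert: int,
                     expert_on_gpu: bool,
                     cpu_ms_per_token: float = 2.0e-3,   # assumed CPU compute cost
                     gpu_ms_per_token: float = 5.0e-5,   # assumed GPU compute cost
                     transfer_ms: float = 6.0            # assumed CPU->GPU expert copy
                     ) -> str:
    """Toy predictor: compare estimated cost of computing on the CPU against
    transferring the expert's weights and computing on the GPU. A real
    predictor would be calibrated on-device."""
    cpu_cost = tokens_for_expert * cpu_ms_per_token
    gpu_cost = tokens_for_expert * gpu_ms_per_token + (0.0 if expert_on_gpu else transfer_ms)
    return "gpu" if gpu_cost < cpu_cost else "cpu"

# Prefill: many tokens hit the same expert, so paying the transfer is worth it.
print(choose_placement(tokens_for_expert=4096, expert_on_gpu=False))  # -> gpu
# Decoding: a single token rarely justifies moving a whole expert.
print(choose_placement(tokens_for_expert=1, expert_on_gpu=False))     # -> cpu
```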
To further enhance efficiency, a ‘GPU expert caching mechanism’ is implemented. Frequently used experts are stored in a dedicated cache on the GPU. When a new expert is needed, the system first checks if it’s already in the cache. If not, it’s transferred from the CPU. A ‘Least Recently Used (LRU)’ policy manages this cache, ensuring that the most relevant experts are readily available, which significantly reduces data transfer overhead and latency.
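A stripped-down version of such an LRU expert cache (tracking expert IDs rather than real GPU tensors, so the names and loader are purely illustrative) might look like this:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weights resident on the GPU (sketch)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()           # expert_id -> weights, most recent last

    def get(self, expert_id, load_from_cpu):
        if expert_id in self._cache:                      # cache hit
            self._cache.move_to_end(expert_id)
            return self._cache[expert_id]
        weights = load_from_cpu(expert_id)                # miss: CPU -> GPU transfer
        if len(self._cache) >= self.capacity:
            self._cache.popitem(last=False)               # evict least recently used
        self._cache[expert_id] = weights
        return weights

# Usage with a dummy loader; real code would copy the expert tensor to the GPU.
cache = ExpertCache(capacity=2)
loader = lambda eid: f"weights_of_expert_{eid}"
for eid in [3, 5, 3, 7, 5]:   # 3 and 5 miss, 3 hits, 7 evicts 5, reloading 5 evicts 3
    cache.get(eid, loader)
```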
Finally, a ‘distributed expert deployment strategy’ is introduced to maximize the ‘expert hit rate’ on the GPU while ensuring stable performance across different layers of the model. This two-stage, layer-wise selection strategy prioritizes experts that are part of high-frequency activation paths, ensuring critical computations are always handled by the faster GPU. It then supplements each layer with additional frequently activated experts, balancing overall hit rate with consistent performance across the model. This approach addresses the issue of uneven load distribution seen in simpler strategies, which can lead to unpredictable latency.
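The two-stage selection can be sketched as follows. The data structures and the notion of a fixed per-layer slot budget are assumptions made for illustration, not the authors’ exact algorithm: stage one pins experts on high-frequency activation paths, stage two tops every layer up with its next most frequently activated experts so each layer has the same number of resident experts.

```python
def place_experts(freqs, slots_per_layer, hot_paths):
    """Two-stage, layer-wise placement sketch.
    freqs: {layer: {expert_id: activation frequency}}
    hot_paths: {layer: [expert_ids on high-frequency activation paths]}
    Returns {layer: set of expert_ids to keep resident on the GPU}."""
    placement = {}
    for layer, f in freqs.items():
        # Stage 1: always pin experts that sit on high-frequency activation paths.
        chosen = list(hot_paths.get(layer, []))[:slots_per_layer]
        # Stage 2: top up the layer with its next most frequently activated
        # experts, so every layer reaches the same slot count (stable latency).
        remaining = sorted((e for e in f if e not in chosen),
                           key=lambda e: f[e], reverse=True)
        chosen += remaining[:slots_per_layer - len(chosen)]
        placement[layer] = set(chosen)
    return placement

# Toy example: 2 GPU slots per layer, 4 experts per layer.
freqs = {0: {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05},
         1: {0: 0.10, 1: 0.20, 2: 0.45, 3: 0.25}}
hot_paths = {0: [0], 1: [2]}          # experts on the dominant routing path
print(place_experts(freqs, slots_per_layer=2, hot_paths=hot_paths))
# -> {0: {0, 1}, 1: {2, 3}}
```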
Real-World Impact and Validation
Extensive experiments were conducted on popular LLMs like the OPT series and Mixtral-8x7B, using datasets such as Wikitext2 and C4. The results are highly promising. The proposed HAQ method, combined with the CPU-GPU collaborative inference, achieved inference accuracy nearly identical to full-precision models, even with 8-bit quantization. Crucially, it reduced GPU memory usage by approximately 60% and significantly improved inference latency. The system also demonstrated higher expert hit rates, lower fluctuations in inference latency, and stronger overall robustness.
This research provides a practical and effective solution for deploying large-scale MoE models in real-world edge environments, making advanced AI capabilities more accessible and efficient on everyday devices. For more technical details, refer to the full research paper.