TLDR: This research introduces a Hybrid Systolic Array (HSA) accelerator designed for efficient Large Language Model (LLM) inference on edge devices. It tackles challenges like limited memory and power by optimizing dataflow for both compute-intensive prefill and memory-bound decode stages. The accelerator uses MXINT4 weight quantization to reduce memory access and features optimized RMSNorm and RoPE units. The solution achieves significant improvements in area efficiency and maintains high energy efficiency for LLMs like RetNet 1.3B, making on-device LLM inference more practical.
Large Language Models (LLMs) are becoming increasingly vital across many applications, but deploying them directly on edge devices like smartphones or smart home gadgets presents significant challenges. Unlike data centers that can handle many user requests at once, edge devices have limited memory, strict power budgets, and often process one user query at a time. This leads to inefficiencies, especially during the ‘decode’ stage where the model generates output tokens, which is heavily reliant on memory access.
A new research paper, “Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference”, introduces an innovative solution to these challenges. The paper highlights that an ideal edge accelerator needs to be highly efficient in terms of area and minimize external memory access during the memory-intensive decode stage, while also being energy-efficient during the compute-heavy ‘prefill’ stage (where the initial input is processed).
Addressing Key Inefficiencies
Current accelerator designs often struggle to balance these demands. Some use ‘vector unit’ architectures, which are flexible but waste energy during prefill due to poor data reuse. Others employ ‘conventional systolic arrays,’ which are efficient for prefill but perform poorly during decode because they can’t effectively handle single-user queries, leading to low hardware utilization.
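The decode-stage utilization problem can be illustrated with a deliberately simplified model. The sketch below assumes each row of processing elements handles one input token, so a pass with fewer tokens than rows leaves rows idle. The function name, the 32-row array size, and the 512-token prompt are hypothetical values chosen for illustration, not figures from the paper.

```python
def array_utilization(tokens: int, rows: int) -> float:
    """Fraction of PE rows doing useful work when each row handles one token.

    Simplified model (an assumption, not the paper's): the array is filled
    in passes of `rows` tokens, and unused rows in a pass sit idle.
    """
    if tokens <= 0 or rows <= 0:
        raise ValueError("tokens and rows must be positive")
    passes = -(-tokens // rows)  # ceiling division
    return tokens / (passes * rows)


prefill = array_utilization(tokens=512, rows=32)  # many prompt tokens at once
decode = array_utilization(tokens=1, rows=32)     # one new token per step
print(f"prefill: {prefill:.1%}, decode: {decode:.1%}")
```

Under these assumptions the array is fully occupied during prefill, while a single-user decode step keeps only one row in 32 busy.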
The researchers propose a novel ‘Hybrid Systolic Array’ (HSA) architecture. This design cleverly combines the strengths of both conventional systolic arrays and vector units. It ensures power-efficient processing during the prefill stage and maintains high hardware utilization during the decode stage, leading to overall high area efficiency.
Optimized Dataflow and Memory Access
To further reduce reliance on external memory, the paper uses MXINT4 weight quantization. This stores the model’s weights in a compressed 4-bit format, roughly halving weight traffic during the decode stage compared with 8-bit storage. The paper also proposes an optimized dataflow tailored to the HSA that dequantizes these weights with negligible overhead and achieves 100% hardware utilization even under the limited memory bandwidth of edge devices, while maintaining high accuracy.
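As a rough illustration of block-wise 4-bit quantization with a shared scale, here is a minimal sketch. It assumes a block of 32 weights sharing one power-of-two scale stored as an 8-bit exponent; the article does not state the paper's exact MXINT4 layout, so the block size, the exponent width, and the function names `quantize_block` and `dequantize_block` are all assumptions.

```python
import math

BLOCK = 32  # assumed block size: weights in a block share one scale


def quantize_block(weights):
    """Quantize one block to signed 4-bit integers plus a shared exponent."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0, [0] * len(weights)
    # Smallest exponent such that max_abs / 2**exp fits within [-7, 7].
    exp = math.ceil(math.log2(max_abs / 7))
    scale = 2.0 ** exp
    # Clamp to the int4 range [-8, 7] as a safety net.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return exp, q


def dequantize_block(exp, q):
    """Recover approximate weights: an integer times a power of two."""
    scale = 2.0 ** exp
    return [v * scale for v in q]


weights = [math.sin(i) * 0.05 for i in range(BLOCK)]  # made-up test data
exp, q = quantize_block(weights)
restored = dequantize_block(exp, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# 4-bit value plus an assumed 8-bit exponent amortized over the block.
bits_per_weight = 4 + 8 / BLOCK
print(exp, max_err, bits_per_weight)
```

Because the scale is a power of two in this sketch, dequantization is just a shift of a small integer, which suggests why the overhead can be kept low in hardware. Under the assumed layout, storage comes to 4.25 bits per weight.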
Streamlining Non-Linear Operations
LLMs also involve complex non-linear operations like Root Mean Square Normalization (RMSNorm) and Rotary Position Embedding (RoPE). These operations can add significant latency, consume more area, and increase memory access. The researchers have optimized these units:
- For RMSNorm, they developed a ‘layer-fused’ approach that eliminates latency overhead and removes the need for a large 32KB buffer, streamlining the processing pipeline.
- For RoPE, instead of loading pre-computed values or using dedicated, expensive hardware, they reuse existing computational units to calculate sine and cosine values on the fly. This cuts DRAM access without adding dedicated hardware.
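The two ideas above can be sketched in software. For RMSNorm, the sketch relies on the fact that the normalization factor is a single scalar per token, so it can be applied after the following matrix multiplication instead of before it. This is one plausible reading of "layer-fused", not a confirmed description of the paper's hardware. For RoPE, the sketch computes the rotation angles from the token position instead of reading a table. It uses the common base of 10000 and calls `math.sin` and `math.cos` directly, so it does not model how the accelerator evaluates these functions. All function names and input values are hypothetical.

```python
import math


def rmsnorm_then_matvec(x, gain, weight_cols, eps=1e-6):
    """Reference: normalize the vector first, then multiply by the weights."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * g for v, g in zip(x, gain)]
    return [sum(n * w for n, w in zip(normed, col)) for col in weight_cols]


def fused_rmsnorm_matvec(x, gain, weight_cols, eps=1e-6):
    """Fused: accumulate sum of squares during the matvec, scale outputs once."""
    sum_sq = 0.0
    acc = [0.0] * len(weight_cols)
    for i, (v, g) in enumerate(zip(x, gain)):
        sum_sq += v * v
        for j, col in enumerate(weight_cols):
            acc[j] += v * g * col[i]
    inv_rms = 1.0 / math.sqrt(sum_sq / len(x) + eps)
    return [a * inv_rms for a in acc]


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs by position-dependent angles computed on the fly."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


x = [0.5, -1.0, 2.0, 0.25]
gain = [1.0, 0.5, 2.0, 1.5]
weight_cols = [[0.1, 0.2, -0.3, 0.4], [0.0, -0.5, 0.25, 1.0]]
reference = rmsnorm_then_matvec(x, gain, weight_cols)
fused = fused_rmsnorm_matvec(x, gain, weight_cols)
rotated = rope(x, position=3)
print(reference, fused, rotated)
```

In the fused version the normalized vector is never stored, which is the kind of saving that would remove an intermediate buffer. The rotation needs only the position and the pair index, so no sine or cosine table has to be fetched from memory.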
Performance Breakthroughs
The accelerator, prototyped in TSMC 28nm CMOS, achieves 247 tokens per second per square millimeter on long-input workloads and 117 on long-output workloads when running a 1.3B LLM. These figures are improvements of more than 2.45 times and 13.5 times, respectively, over existing solutions, and the design also maintains superior energy efficiency during token generation. Such gains matter for running powerful LLMs directly on edge devices, which improves security, reduces latency, and lowers costs across a wide range of applications.


