Boosting Large Language Model Performance on Edge Devices with a Hybrid Accelerator

TLDR: This research introduces a Hybrid Systolic Array (HSA) accelerator designed for efficient Large Language Model (LLM) inference on edge devices. It tackles challenges like limited memory and power by optimizing dataflow for both compute-intensive prefill and memory-bound decode stages. The accelerator uses MXINT4 weight quantization to reduce memory access and features optimized RMSNorm and RoPE units. The solution achieves significant improvements in area efficiency and maintains high energy efficiency for LLMs like RetNet 1.3B, making on-device LLM inference more practical.

Large Language Models (LLMs) are becoming increasingly vital across many applications, but deploying them directly on edge devices like smartphones or smart home gadgets presents significant challenges. Unlike data centers that can handle many user requests at once, edge devices have limited memory, strict power budgets, and often process one user query at a time. This leads to inefficiencies, especially during the ‘decode’ stage, in which the model generates output tokens one at a time and performance is dominated by memory access.

A new research paper, “Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference”, introduces an innovative solution to these challenges. The paper highlights that an ideal edge accelerator needs to be highly efficient in terms of area and minimize external memory access during the memory-intensive decode stage, while also being energy-efficient during the compute-heavy ‘prefill’ stage (where the initial input is processed).

Addressing Key Inefficiencies

Current accelerator designs often struggle to balance these demands. Some use ‘vector unit’ architectures, which are flexible but waste energy during prefill due to poor data reuse. Others employ ‘conventional systolic arrays,’ which are efficient for prefill but perform poorly during decode because they can’t effectively handle single-user queries, leading to low hardware utilization.
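The data-reuse argument can be made concrete with a back-of-the-envelope calculation. The Python sketch below is illustrative only: the function name, the token counts (512 for prefill, 1 for decode), and the 8-bit weight width are assumptions chosen for the example, not figures from the paper. It counts how many multiply-accumulate operations are performed per byte of weights fetched when one weight matrix is applied to several tokens at once.

```python
def macs_per_weight_byte(tokens_in_flight, bits_per_weight):
    """Multiply-accumulates performed per byte of weights fetched.

    When a weight matrix is applied to `tokens_in_flight` tokens at once,
    every fetched weight is reused once per token, so the reuse factor is
    simply the number of tokens divided by the bytes each weight occupies.
    """
    bytes_per_weight = bits_per_weight / 8
    return tokens_in_flight / bytes_per_weight


if __name__ == "__main__":
    # Hypothetical workload sizes, chosen only to show the contrast.
    print("prefill, 512 tokens, 8-bit weights:", macs_per_weight_byte(512, 8))
    print("decode,    1 token,  8-bit weights:", macs_per_weight_byte(1, 8))
```

Under these assumptions, prefill reuses each fetched weight hundreds of times, while single-query decode uses each weight once before it is discarded. That gap is why decode tends to be limited by memory bandwidth rather than by compute.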

The researchers propose a ‘Hybrid Systolic Array’ (HSA) architecture that combines the strengths of conventional systolic arrays and vector units. It delivers power-efficient processing during the prefill stage and keeps hardware utilization high during the decode stage, which yields high overall area efficiency.

Optimized Dataflow and Memory Access

To further reduce the reliance on external memory, the paper uses MXINT4 weight quantization. This technique stores the model’s weights in a compressed 4-bit format, effectively halving memory access during the decode stage. Because 4-bit weights must be dequantized before they are used in computation, the paper also proposes an optimized dataflow tailored to the HSA. This dataflow performs the dequantization with negligible overhead and achieves 100% hardware utilization, even under the limited memory bandwidth of edge devices, while maintaining high accuracy.
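To give a feel for what a block-scaled 4-bit format involves, here is a minimal Python sketch. It is not the paper's implementation: the block size of 32, the 8-bit shared power-of-two scale, the symmetric integer range of -7 to 7, and all function names are assumptions modeled on common descriptions of microscaling (MX) formats.

```python
import math

BLOCK_SIZE = 32   # assumed number of weights sharing one scale
SCALE_BITS = 8    # assumed width of the shared exponent
ELEM_BITS = 4     # 4-bit signed integer per weight


def mx_quantize_block(values, elem_bits=ELEM_BITS):
    """Quantize one block to (shared_exponent, list_of_small_ints)."""
    qmax = 2 ** (elem_bits - 1) - 1            # 7 for 4-bit symmetric
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest value within range.
    exponent = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exponent
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exponent, quantized


def mx_dequantize_block(exponent, quantized):
    """Recover approximate weights: integer times a power of two."""
    scale = 2.0 ** exponent
    return [q * scale for q in quantized]


def storage_bits_per_weight(block_size=BLOCK_SIZE,
                            elem_bits=ELEM_BITS,
                            scale_bits=SCALE_BITS):
    """Average storage cost per weight, including the shared scale."""
    return (block_size * elem_bits + scale_bits) / block_size


if __name__ == "__main__":
    exponent, q = mx_quantize_block([0.5, -1.75, 3.5, 0.0])
    print(exponent, q, mx_dequantize_block(exponent, q))
    print(storage_bits_per_weight())
```

With these assumed parameters the average cost is 4.25 bits per weight, a little over half the size of an 8-bit representation. Because the shared scale is a power of two, dequantization in this sketch amounts to a shift, which is one plausible reason such formats can be dequantized cheaply in hardware.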

Streamlining Non-Linear Operations

LLMs also involve complex non-linear operations like Root Mean Square Normalization (RMSNorm) and Rotary Position Embedding (RoPE). These operations can add significant latency, consume more area, and increase memory access. The researchers have optimized these units:

For RMSNorm, they developed a ‘layer-fused’ approach that eliminates latency overhead and removes the need for a large 32KB buffer, streamlining the processing pipeline.
For RoPE, instead of loading pre-computed values or using dedicated, expensive hardware, they reuse existing computational units to calculate sine and cosine values on the fly. This reduces DRAM access and avoids adding dedicated hardware.
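The article does not spell out how either unit works internally, so the Python sketch below shows only one plausible reading of each idea, with hypothetical function names throughout. For RMSNorm, it relies on the identity that normalizing and then multiplying by a weight matrix gives the same result as multiplying first and scaling by 1/RMS afterwards, so the sum of squares can be accumulated alongside the matrix multiply instead of in a separate buffered pass. For RoPE, it uses the angle-addition identities to step sine and cosine from one position to the next using only multiplies and adds. The interleaved pairing of elements and the base of 10000 are assumed conventions.

```python
import math


def rmsnorm_linear_reference(x, gain, weight, eps=1e-6):
    """Two-pass baseline: normalize the whole vector, then multiply."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * g for v, g in zip(x, gain)]
    return [sum(normed[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def rmsnorm_linear_fused(x, gain, weight, eps=1e-6):
    """Single pass: accumulate sum of squares during the matrix multiply."""
    n_out = len(weight[0])
    acc = [0.0] * n_out
    sum_sq = 0.0
    for i, (v, g) in enumerate(zip(x, gain)):
        sum_sq += v * v                  # statistic gathered on the fly
        scaled = v * g
        for j in range(n_out):
            acc[j] += scaled * weight[i][j]
    inv_rms = 1.0 / math.sqrt(sum_sq / len(x) + eps)
    return [a * inv_rms for a in acc]    # normalization applied at the end


def rope_frequencies(head_dim, base=10000.0):
    """One rotation frequency per pair of elements."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def direct_sin_cos(position, freqs):
    """Reference values computed with library sine and cosine."""
    return ([math.sin(position * f) for f in freqs],
            [math.cos(position * f) for f in freqs])


def next_sin_cos(sin_p, cos_p, sin_1, cos_1):
    """Advance by one position using only multiplies and adds."""
    sin_n = [s * c1 + c * s1
             for s, c, s1, c1 in zip(sin_p, cos_p, sin_1, cos_1)]
    cos_n = [c * c1 - s * s1
             for s, c, s1, c1 in zip(sin_p, cos_p, sin_1, cos_1)]
    return sin_n, cos_n


def apply_rope(vec, sin_p, cos_p):
    """Rotate each (even, odd) pair of elements by its position angle."""
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * cos_p[i] - b * sin_p[i])
        out.append(a * sin_p[i] + b * cos_p[i])
    return out
```

In both cases the point is the same: the extra work maps onto multiply-accumulate operations that a matrix engine already provides, which is consistent with the article's description of reusing existing hardware, though the paper's actual circuits may differ.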

Performance Breakthroughs

The accelerator, prototyped in TSMC 28nm CMOS, was evaluated on a 1.3B LLM. In scenarios with long inputs and long outputs, it achieves 247 and 117 tokens per second per square millimeter (tokens/s/mm²), respectively. That is an improvement in area efficiency of more than 2.45 times to 13.5 times over existing solutions, while also maintaining superior energy efficiency during token generation. These advancements matter for running powerful LLMs directly and efficiently on edge devices, which can enhance security, reduce latency, and lower costs for a wide range of new applications.
