TL;DR: LUT-LLM is a novel FPGA accelerator that enables efficient inference for large language models (1B+ parameters) by replacing traditional arithmetic computations with memory-based table lookups. It introduces activation-weight co-quantization with 2D lookup tables, a bandwidth-aware parallel centroid search, and a spatial-temporal hybrid design. Benchmarked on an AMD V80 FPGA with a Qwen 3 1.7B model, LUT-LLM achieves significantly lower latency and higher energy efficiency compared to high-end GPUs like the AMD MI210 and NVIDIA A100, while maintaining model accuracy.
Large Language Models (LLMs) have transformed many everyday applications, from chatbots to coding assistants. While powerful cloud-based LLM services are common, there’s a growing need for efficient “on-device” intelligence, allowing LLMs to run directly on personal devices like smart home gadgets or robots. This is where specialized hardware accelerators come into play.
Field-Programmable Gate Arrays (FPGAs) have long been recognized for their potential in accelerating single-batch LLM inference, offering advantages in speed and energy efficiency over traditional Graphics Processing Units (GPUs). However, recent advancements in GPU architectures and software optimizations have started to close this performance gap. A key challenge is that FPGAs have far fewer arithmetic compute resources than GPUs, so designs that rely solely on arithmetic-based computation can end up slower.
Researchers have identified a unique strength of FPGAs: their abundant distributed on-chip memory. Unlike GPUs, FPGAs can integrate a significantly larger amount of memory directly alongside their processing units. This insight led to the development of a new approach: shifting LLM inference from complex arithmetic calculations to simpler, memory-based computations, primarily through “table lookups.” Imagine looking up a pre-calculated answer in a table rather than performing a complex sum every time.
However, applying memory-based computation to large language models presents its own set of difficulties. Existing memory-based accelerators are often inefficient or unable to scale to the massive size of modern LLMs, and they are not designed around the execution characteristics of LLM inference. To tackle these challenges, a team of researchers introduced LUT-LLM, a groundbreaking FPGA accelerator designed for LLMs with over a billion parameters, utilizing memory-based computation and a technique called vector quantization.
The Core Innovation: Memory-Based Computation with Vector Quantization
LUT-LLM’s fundamental idea is to replace the intensive calculations in an LLM’s linear layers with efficient table lookups. To make these tables manageable for large models, LUT-LLM employs “vector quantization.” This technique groups multiple values in a matrix into “vectors” and then replaces each vector with a low-bit index pointing to a representative “centroid” in a pre-defined “codebook.” This significantly reduces the amount of data that needs to be stored and processed.
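As a concrete illustration, here is a minimal NumPy sketch of vector quantization. The vector length, codebook size, and function names are illustrative assumptions, not the configuration used in LUT-LLM.

```python
import numpy as np

# Illustrative sizes only (not the paper's actual configuration)
VEC_LEN = 4         # values grouped into one vector
NUM_CENTROIDS = 16  # codebook size -> each vector is stored as a 4-bit index

def quantize(matrix, codebook):
    """Replace each length-VEC_LEN vector with the index of its nearest centroid."""
    vectors = matrix.reshape(-1, VEC_LEN)                              # (N, VEC_LEN)
    # Squared Euclidean distance from every vector to every centroid
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                                        # (N,) low-bit indices

def dequantize(indices, codebook, shape):
    """Reconstruct an approximation of the original matrix from the stored indices."""
    return codebook[indices].reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)
codebook = rng.standard_normal((NUM_CENTROIDS, VEC_LEN)).astype(np.float32)

idx = quantize(W, codebook)                 # what actually gets stored: small indices
W_hat = dequantize(idx, codebook, W.shape)  # approximation of the original matrix
```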
A crucial finding from the research is that quantizing both the “activations” (the inputs to a layer) and the “weights” (the parameters of the model) together, known as “activation-weight co-quantization,” is the most effective strategy. This approach uses “2D lookup tables,” where one dimension corresponds to activation indices and the other to weight indices, allowing for highly efficient data retrieval.
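To make the 2D lookup table idea concrete, the sketch below pre-computes dot products between activation centroids and weight centroids, so that a dot product at inference time reduces to table reads and additions. The codebook sizes and names are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
VEC_LEN, A_CENTROIDS, W_CENTROIDS = 4, 16, 16   # illustrative sizes

act_codebook = rng.standard_normal((A_CENTROIDS, VEC_LEN)).astype(np.float32)
wgt_codebook = rng.standard_normal((W_CENTROIDS, VEC_LEN)).astype(np.float32)

# 2D lookup table: entry (i, j) holds the pre-computed dot product between
# activation centroid i and weight centroid j.
table = act_codebook @ wgt_codebook.T            # (A_CENTROIDS, W_CENTROIDS)

def approx_dot(act_indices, wgt_indices):
    """Approximate dot(activation, weight) by summing table entries,
    one entry per quantized vector group -- no multiplications at inference time."""
    return table[act_indices, wgt_indices].sum()

# Example: a length-16 activation/weight pair split into four vector groups
act_idx = rng.integers(0, A_CENTROIDS, size=4)
wgt_idx = rng.integers(0, W_CENTROIDS, size=4)
print(approx_dot(act_idx, wgt_idx))
```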
Key Architectural Features of LUT-LLM
To support this innovative quantization scheme, LUT-LLM incorporates several specialized features:
- Bandwidth-aware Parallel Centroid Search (BPCSU): When an input vector arrives, LUT-LLM needs to quickly find its closest centroid. Instead of a slow sequential search or a resource-heavy parallel approach, BPCSU uses a hybrid design. It employs multiple parallel pipelines and a smaller “reduction tree” to efficiently find centroids. This design is carefully synchronized with the memory bandwidth to ensure that the centroid search latency is completely hidden, maximizing throughput.
- Efficient 2D Table Lookup Based Prefix-Sum Engine: This engine is specifically designed to handle the 2D lookup tables. It efficiently accesses rows of the tables based on activation centroid indices and then uses weight centroid indices to retrieve and expand the pre-computed dot product results. This process is highly pipelined and uses optimized memory partitioning to achieve high parallelism and reduce memory port requirements. A software sketch of this lookup-and-accumulate pattern appears after this list.
- Spatial-Temporal Hybrid Design: LLMs have different types of operations, some (like linear layers) benefiting from sequential processing, and others (like attention layers) from dataflow processing. LUT-LLM combines these. Linear layers are executed sequentially, allowing for efficient pipelining of centroid searches without repeatedly loading codebooks. The outputs are then streamed to attention and non-linear operations, which run in a dataflow manner. This hybrid approach optimizes memory usage and ensures high throughput across the entire model.
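The following NumPy sketch emulates, in software, the lookup-and-accumulate pattern behind the prefix-sum engine: select a table row by activation centroid index, gather entries by weight centroid indices, and accumulate into output partial sums. The function name, array shapes, and sizes are illustrative assumptions, not the hardware microarchitecture.

```python
import numpy as np

def lut_matvec(act_idx, wgt_idx, table):
    """Software emulation of a 2D-table-lookup matrix-vector product.

    act_idx : (G,)    activation centroid index for each of G vector groups
    wgt_idx : (G, O)  weight centroid index per group and output column
    table   : (A, W)  pre-computed centroid dot products
    """
    G, O = wgt_idx.shape
    acc = np.zeros(O, dtype=np.float32)
    for g in range(G):               # in hardware this loop is pipelined, one row fetch per group
        row = table[act_idx[g]]      # row select by activation centroid index
        acc += row[wgt_idx[g]]       # gather by weight indices and accumulate partial sums
    return acc

rng = np.random.default_rng(2)
table = rng.standard_normal((16, 16)).astype(np.float32)
act_idx = rng.integers(0, 16, size=8)
wgt_idx = rng.integers(0, 16, size=(8, 32))
y = lut_matvec(act_idx, wgt_idx, table)   # approximate slice of a linear-layer output
```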
Impressive Performance and Energy Efficiency
The researchers prototyped LUT-LLM for a customized Qwen 3 1.7B model on an AMD V80 FPGA. The results are compelling:
- LUT-LLM achieved 1.66 times lower end-to-end latency compared to the AMD MI210 GPU.
- It demonstrated 1.72 times higher energy efficiency than the NVIDIA A100 GPU.
- The design is scalable, with projections showing that LUT-LLM could handle a 32B model with 2.16 times better energy efficiency than the A100.
Furthermore, the quantization scheme used in LUT-LLM maintains competitive model quality, with only a modest accuracy drop compared to full-precision models, and it significantly outperforms standard low-bit quantization methods. When compared to other state-of-the-art FPGA accelerators, LUT-LLM showed superior energy efficiency and faster decoding speeds, even when processing more data.
This research marks a significant step towards enabling powerful and energy-efficient LLM inference on edge devices, leveraging the unique architectural advantages of FPGAs. Future work aims to further enhance efficiency through new algorithms and explore multi-FPGA systems to scale memory bandwidth even further.


