TLDR: LiquidGEMM is a new W4A8 (4-bit weight, 8-bit activation) GEMM kernel that significantly accelerates Large Language Model (LLM) inference. It overcomes the bottleneck of inefficient dequantization on GPUs by introducing LiquidQuant, a quantization scheme that enables hardware-efficient, overflow-safe dequantization, and an implicit fine-grained pipeline that seamlessly overlaps weight loading, dequantization, and matrix multiplication. This results in substantial speedups over existing W4A8 kernels and NVIDIA TensorRT-LLM, making LLM serving more efficient.
Large Language Models (LLMs) have revolutionized many applications, but their immense size and computational demands make them challenging to deploy efficiently in real-world production environments. A crucial technique to address these challenges is quantization, which reduces the memory footprint and speeds up computations by converting high-precision numbers (like FP32 or FP16) into lower-precision integer formats (like INT4).
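For intuition, here is a minimal sketch of how symmetric integer quantization works in general (illustrative only, not code from the paper; the scale value and function names are made up): a floating-point weight is divided by a scale factor, rounded, and clamped to the target integer range, and dequantization simply multiplies back.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Minimal symmetric quantization sketch (illustrative only):
// map a float weight to a signed 4-bit integer in [-8, 7] and back.
int8_t quantize4(float w, float scale) {
    int q = (int)std::lround(w / scale);   // scale and round to the nearest integer
    return (int8_t)std::clamp(q, -8, 7);   // clamp to the INT4 range
}

float dequantize4(int8_t q, float scale) {
    return q * scale;                      // approximate reconstruction of the weight
}

int main() {
    float scale = 0.05f;                   // hypothetical per-channel scale factor
    float w = 0.31f;
    int8_t q = quantize4(w, scale);
    std::printf("w=%.3f -> q=%d -> w'=%.3f\n", w, q, dequantize4(q, scale));
}
```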
Among the various quantization schemes, 4-bit weight, 8-bit activation quantization (W4A8) stands out as a promising approach. It strikes a good balance between maintaining model accuracy and achieving high performance. Theoretically, W4A8 should offer significant advantages, especially in scenarios where memory bandwidth is the bottleneck, such as when processing small batches of requests. However, existing W4A8 implementations often fall short of these theoretical expectations.
The core problem lies in the “dequantization” step, where the compressed 4-bit weights are converted back to 8-bit before being processed by the GPU’s powerful Tensor Cores. This dequantization typically happens on CUDA Cores, which are much slower than Tensor Cores. This creates a bottleneck, as the CUDA Cores cannot keep up with the high throughput of the Tensor Cores, leading to performance degradation rather than improvement.
To tackle this fundamental issue, researchers from Shanghai Jiao Tong University and ByteDance Seed have introduced LiquidGEMM, a new hardware-efficient W4A8 GEMM (General Matrix Multiplication) kernel. GEMM operations are the computational backbone of LLM serving, so optimizing them is critical for overall inference efficiency. LiquidGEMM aims to unlock the full potential of W4A8 quantization by making the dequantization process much more efficient and by intelligently orchestrating how different parts of the GPU work together.
LiquidGEMM incorporates two key innovations. The first is LiquidQuant (LQQ), a hardware-efficient quantization method. Traditional W4A8 dequantization methods often face “overflow” issues, requiring many complex instructions to resolve. LiquidQuant addresses this by applying a clever rotation-based transformation that shifts 8-bit integer values into an unsigned 8-bit range before quantizing them to 4-bit. During dequantization, it uses properties of two’s complement representation to recover the original 8-bit values without overflow, requiring only two simple 32-bit hardware instructions (IMAD and XOR) per four elements. This significantly reduces the computational load on the CUDA Cores, making dequantization much faster.
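The exact two-instruction sequence is part of the paper's kernel design, but the underlying offset idea can be illustrated with a small scalar C++ sketch (an analogy at the nibble level with made-up names, not LiquidGEMM's actual code): shift each signed value into an unsigned range before packing, then undo the shift when unpacking, so no intermediate step can overflow.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: the offset idea behind overflow-safe dequantization,
// shown per element in scalar C++. The real kernel works on packed 32-bit
// registers with IMAD/XOR; this is not that code.
constexpr int kOffset = 8;  // shifts w in [-8, 7] into the unsigned range [0, 15]

// Pack four signed 4-bit weights into one 16-bit word as unsigned nibbles.
uint16_t pack4(const int8_t w[4]) {
    uint16_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        uint16_t u = (uint16_t)((w[i] + kOffset) & 0xF);  // offset: no sign bit left
        packed |= (uint16_t)(u << (4 * i));
    }
    return packed;
}

// Recover the four signed INT8 values by undoing the offset.
void unpack4(uint16_t packed, int8_t out[4]) {
    for (int i = 0; i < 4; ++i) {
        uint8_t u = (packed >> (4 * i)) & 0xF;
        out[i] = (int8_t)u - kOffset;  // back to [-8, 7]; no overflow possible
    }
}

int main() {
    int8_t w[4] = {-8, -1, 0, 7}, r[4];
    uint16_t p = pack4(w);
    unpack4(p, r);
    for (int i = 0; i < 4; ++i)
        std::printf("w=%d  nibble=0x%X  recovered=%d\n",
                    w[i], (unsigned)((p >> (4 * i)) & 0xF), r[i]);
}
```

In this scalar form, recovering each signed value costs one subtraction; the paper's kernel fuses the equivalent work for four elements into a single IMAD and a single XOR on packed 32-bit registers.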
The second innovation is an implicit fine-grained pipeline (ImFP). Existing approaches often use a “coarse-grained” pipeline where different groups of processing units (warp groups) are explicitly assigned to loading, dequantization, and matrix multiplication tasks. This can lead to inefficiencies due to data moving back and forth between different memory areas and costly synchronization steps between these groups. ImFP, on the other hand, adopts a more streamlined approach. It uses a single-producer, multiple-consumer model where a dedicated “Load” group fetches weights, and then multiple “Compute” groups dynamically pick up these tasks. Each Compute group handles both dequantization and matrix multiplication directly, eliminating unnecessary data movement. The overlapping of dequantization and matrix multiplication happens naturally across these concurrently executing Compute groups, without the need for explicit software synchronization. This implicit parallelism maximizes hardware utilization and avoids pipeline bubbles.
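The real mechanism relies on GPU warp-group scheduling, but the single-producer, multiple-consumer shape of ImFP can be sketched with ordinary host threads (purely an analogy with made-up names such as Tile, loader, and compute, not the kernel code): one "Load" thread pushes weight tiles into a queue, and several "Compute" threads each take a tile and perform both dequantization and the matrix-multiply step themselves, so there is no separate hand-off between a dequantization stage and a compute stage.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Host-thread analogy of a single-producer, multiple-consumer pipeline.
// Tile, loader, compute, NUM_TILES, and NUM_WORKERS are invented for this sketch.
struct Tile { int id; };

std::queue<Tile> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void loader(int num_tiles) {                 // the single "Load" role
    for (int i = 0; i < num_tiles; ++i) {
        { std::lock_guard<std::mutex> lk(m); q.push({i}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
}

void compute(int worker) {                   // each "Compute" role does both steps
    for (;;) {
        Tile t;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            if (q.empty()) return;           // nothing left to consume
            t = q.front(); q.pop();
        }
        // Dequantize and multiply in the same worker: no cross-group hand-off.
        std::printf("worker %d: dequantize + MMA on tile %d\n", worker, t.id);
    }
}

int main() {
    const int NUM_TILES = 8, NUM_WORKERS = 3;
    std::thread prod(loader, NUM_TILES);
    std::vector<std::thread> workers;
    for (int i = 0; i < NUM_WORKERS; ++i) workers.emplace_back(compute, i);
    prod.join();
    for (auto& w : workers) w.join();
}
```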
Experimental results demonstrate the significant impact of LiquidGEMM. It achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and an impressive 4.94x end-to-end system-level speedup. When compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains and up to 1.63x system-level speedup. These improvements highlight that a hardware-aware design is crucial for making W4A8 GEMM both efficient and scalable for high-performance LLM inference in production environments.
The researchers have deployed LiquidGEMM as the primary GEMM kernel in their production LLM serving infrastructure, underscoring its practical utility. For more technical details, you can refer to the full research paper: LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving.


