TLDR: ClusterFusion is a new execution framework designed to accelerate Large Language Model (LLM) inference, particularly during the decoding phase. It introduces two cluster-level communication primitives, ClusterReduce and ClusterGather, which enable high-speed, on-chip data exchange and reduction between thread blocks within a GPU cluster. By fusing key LLM operations like QKV Projection, Attention, and Output Projection into a single kernel, ClusterFusion significantly reduces off-chip memory traffic and kernel launch overhead. This approach leads to an average 1.61x speedup in end-to-end latency on NVIDIA H100 GPUs compared to existing state-of-the-art frameworks.
Large Language Models (LLMs) are at the heart of many modern AI systems, powering everything from natural language processing to code generation. However, running these models efficiently is hard, especially during the decoding phase, where output tokens are generated one at a time. High latency, execution fragmented across many separate operations, and a heavy reliance on off-chip memory for data exchange are common bottlenecks that slow down LLM inference.
Traditional LLM execution models struggle with operator fusion, the process of combining multiple operations into a single, more efficient unit. This limitation leads to substantial memory traffic and to overhead from launching many small GPU kernels.
Modern GPU architectures, such as NVIDIA Hopper, offer promising solutions with features like distributed shared memory (DSMEM) and low-latency connections within a cluster of processing units. However, these powerful capabilities are often exposed through low-level instructions, making it difficult for developers to create structured and efficient on-chip communication patterns.
To bridge this gap between hardware potential and software implementation, researchers have introduced a new framework called ClusterFusion. This innovative approach aims to expand the scope of operator fusion for LLM inference by introducing two key cluster-level communication primitives: ClusterReduce and ClusterGather.
Cluster-Level Communication Primitives
ClusterReduce and ClusterGather are designed to abstract common communication patterns, such as data reduction (like summing or finding the maximum value) and data aggregation, between different thread blocks within a GPU cluster. These primitives enable high-speed, structured data exchange and reduction directly on-chip, meaning intermediate results can stay within the fast memory of the GPU without needing to be moved to slower, off-chip memory.
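To make the two patterns concrete, here is a minimal CPU-side sketch in plain Python. It only models the data movement, not the GPU implementation: each inner list stands in for the partial result held by one thread block in a cluster. The function names `cluster_reduce` and `cluster_gather` are hypothetical, and the sketch assumes that every block ends up with a copy of the result, which may not match the exact semantics of the paper's primitives.

```python
from functools import reduce
from typing import Callable, List, Sequence


def cluster_reduce(
    block_values: Sequence[Sequence[float]],
    op: Callable[[float, float], float],
) -> List[List[float]]:
    """Combine per-block vectors element-wise with `op`.

    Assumption: every block receives a copy of the reduced vector.
    """
    combined = [reduce(op, column) for column in zip(*block_values)]
    return [list(combined) for _ in block_values]


def cluster_gather(block_values: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate per-block vectors in block order.

    Assumption: every block receives a copy of the concatenated vector.
    """
    gathered = [x for block in block_values for x in block]
    return [list(gathered) for _ in block_values]


if __name__ == "__main__":
    # Four "thread blocks", each holding a two-element partial result.
    partials = [[1.0, 5.0], [2.0, 3.0], [4.0, 0.5], [0.0, 1.5]]

    print(cluster_reduce(partials, lambda a, b: a + b)[0])  # [7.0, 10.0]
    print(cluster_reduce(partials, max)[0])                 # [4.0, 5.0]
    print(cluster_gather(partials)[0])
    # [1.0, 5.0, 2.0, 3.0, 4.0, 0.5, 0.0, 1.5]
```

On real hardware the lists would live in each block's shared memory and the exchange would go over the cluster's on-chip interconnect rather than through Python objects.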
By treating each thread block cluster as a fundamental parallel unit, ClusterFusion uses these primitives to efficiently resolve dependencies between blocks. This cluster-centric dataflow allows for the joint scheduling of communication and computation, significantly expanding the opportunities for operator fusion.
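One familiar example of a dependency between blocks is softmax over attention scores that have been split across several blocks: no block can normalize its slice until it knows the global maximum and the global sum. The sketch below simulates that pattern in plain Python with two reduction steps. It is an illustration of the idea only, not the paper's kernel; `all_reduce` and `clustered_softmax` are hypothetical names, and the even chunking of scores across blocks is an assumption.

```python
import math
from typing import Callable, List, Sequence


def all_reduce(values: Sequence[float], op: Callable[[float, float], float]) -> List[float]:
    """Stand-in for a cluster-level reduction: every block gets the result."""
    result = values[0]
    for v in values[1:]:
        result = op(result, v)
    return [result] * len(values)


def clustered_softmax(scores: Sequence[float], num_blocks: int) -> List[float]:
    """Softmax where each simulated block owns one contiguous slice of scores."""
    chunk = math.ceil(len(scores) / num_blocks)
    blocks = [list(scores[i * chunk:(i + 1) * chunk]) for i in range(num_blocks)]

    # Step 1: local max per block, then a max-reduction across blocks.
    local_max = [max(b) if b else -math.inf for b in blocks]
    global_max = all_reduce(local_max, max)

    # Step 2: local sum of exponentials, then a sum-reduction across blocks.
    local_sum = [sum(math.exp(x - m) for x in b) for b, m in zip(blocks, global_max)]
    global_sum = all_reduce(local_sum, lambda a, b: a + b)

    # Step 3: each block normalizes its own slice with the shared statistics.
    out: List[float] = []
    for b, m, s in zip(blocks, global_max, global_sum):
        out.extend(math.exp(x - m) / s for x in b)
    return out


if __name__ == "__main__":
    probs = clustered_softmax([0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -2.0], num_blocks=4)
    print(probs)
```

In a separate-kernel design, the two shared statistics would typically pass through global memory. Keeping both reductions on-chip is what lets the surrounding computation stay in one kernel.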
Expanded Operator Fusion
ClusterFusion focuses on fusing critical decoding stages of LLMs, such as QKV Projection, Attention, and Output Projection, into a single, highly optimized kernel. In existing systems, these operations often run as separate kernels, requiring intermediate data to be written to and read from global memory, leading to delays and inefficiencies.
With ClusterFusion, the intermediate results from QKV Projection can remain on-chip and be directly used by the Attention module. Similarly, the output of the Attention module stays on-chip for immediate consumption by the Output Projection. This seamless data reuse across multiple modules drastically reduces off-chip memory traffic and the overhead associated with launching multiple kernels.
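As a rough back-of-the-envelope illustration, the sketch below counts only the intermediate activations that an unfused pipeline would write to global memory and read back between the three stages. Everything in it is an assumption for illustration: a hidden size of 4096, 16-bit values, one decoded token per sequence, and Q, K, and V each the size of the hidden state. Weight and KV-cache traffic, which both designs still pay, is deliberately left out, so the numbers are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class DecodeShape:
    batch: int = 1            # sequences decoded in one step (assumed)
    hidden: int = 4096        # model hidden size (assumed)
    bytes_per_elem: int = 2   # 16-bit activations (assumed)


def unfused_intermediate_bytes(s: DecodeShape) -> int:
    """Bytes written to and read back from global memory between stages."""
    qkv = 3 * s.batch * s.hidden * s.bytes_per_elem   # Q, K, V handed to attention
    attn_out = s.batch * s.hidden * s.bytes_per_elem  # attention output handed on
    return 2 * (qkv + attn_out)                       # one write plus one read each


def fused_intermediate_bytes(s: DecodeShape) -> int:
    """In the fused design these intermediates stay on-chip."""
    return 0


if __name__ == "__main__":
    shape = DecodeShape()
    print(unfused_intermediate_bytes(shape))  # 65536
    print(fused_intermediate_bytes(shape))    # 0
```

The same comparison applies to kernel launches: three separate stages need at least three launches per layer, while the fused version needs one.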
Performance and Impact
Evaluations conducted on NVIDIA H100 GPUs demonstrate that ClusterFusion significantly outperforms state-of-the-art inference frameworks. It achieves an average speedup of 1.61 times in end-to-end latency across a range of models and configurations, including Llama2-7B and DeepSeek-V2-Lite. This performance gain is attributed to two main factors: a substantial reduction in global memory transfer size and a significant decrease in GPU kernel launch overhead.
The research also highlights the importance of carefully configuring the cluster size to maximize performance, as the optimal size can vary depending on the workload. While ClusterFusion is bound by the fixed cluster size limit of current hardware (up to 16 thread blocks), it paves the way for future architectural advancements that could support even broader intra-chip communication and fusion strategies.
In essence, ClusterFusion represents a significant step forward in optimizing LLM inference by intelligently leveraging modern GPU architectures. By enabling efficient on-chip communication and expanding operator fusion, it helps overcome critical performance bottlenecks, making LLM decoding faster and more efficient. For more technical details, refer to the full research paper.
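Since the best cluster size depends on the workload, one simple way to choose it is an empirical sweep. The sketch below is a generic tuning loop, not the procedure used in the paper. `pick_cluster_size` and the `measure_latency_ms` callback are hypothetical, and the candidate sizes (powers of two up to 16) are an assumption.

```python
from typing import Callable, Dict, Iterable, Tuple


def pick_cluster_size(
    measure_latency_ms: Callable[[int], float],
    candidates: Iterable[int] = (1, 2, 4, 8, 16),
    repeats: int = 5,
) -> Tuple[int, Dict[int, float]]:
    """Return the candidate with the lowest median latency, plus all medians.

    `measure_latency_ms(size)` is a user-supplied function that runs the
    workload once with the given cluster size and returns its latency.
    """
    medians: Dict[int, float] = {}
    for size in candidates:
        samples = sorted(measure_latency_ms(size) for _ in range(repeats))
        medians[size] = samples[len(samples) // 2]  # median resists outliers
    best = min(medians, key=medians.get)
    return best, medians


if __name__ == "__main__":
    # Made-up latencies standing in for real measurements.
    fake = {1: 3.0, 2: 2.0, 4: 1.5, 8: 1.8, 16: 2.5}
    best, medians = pick_cluster_size(lambda size: fake[size])
    print(best)  # 4
```

In practice the sweep would be repeated for each model and sequence-length configuration of interest, since the optimum can shift between them.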


