Boosting LLM Performance: How Processing-Near-Memory Redefines KV-Cache Management

TLDR: This paper introduces a CXL-enabled Processing-Near-Memory (PNM) architecture for efficient 1M-token LLM inference. It offloads KV-cache management and attention computation to PNM accelerators within CXL memory, eliminating costly GPU recalls and enabling larger batch sizes. A hybrid GPU-PNM execution model (PnG-KV) with “steady-token selection” further optimizes performance by utilizing idle GPU resources. The system achieves significant improvements in throughput, energy efficiency, and cost efficiency for large LLMs with long contexts.

Large Language Models (LLMs) are becoming increasingly powerful, with context windows expanding to process millions of tokens. This capability unlocks advanced features like summarizing long documents, multi-step reasoning, and analyzing extensive codebases. However, this growth introduces significant challenges, particularly in managing the Key-Value (KV) cache, which stores information about past tokens to maintain context during inference. The KV-cache footprint grows linearly with context length and batch size, quickly overwhelming the memory capacity of traditional GPU systems.
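To make the linear growth concrete, here is a rough back-of-the-envelope sketch. The layer count, KV-head count, head dimension, and 16-bit storage below are illustrative assumptions for a large model with grouped-query attention, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, token, and request."""
    # Factor 2 covers one key vector plus one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128

for context_len in (8_000, 128_000, 1_000_000):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_len, batch_size=1)
    print(f"{context_len:>9} tokens -> {size / 1e9:8.1f} GB per request")
```

Under these assumed dimensions, a single 1M-token request needs roughly 516 GB for its KV-cache alone, and the figure scales again with every additional request in the batch.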

While technologies like Compute Express Link (CXL) allow LLM serving systems to offload the entire KV-cache to scalable external memory, a major bottleneck remains: non-resident KV tokens must still be recalled into the limited GPU memory, and these transfers become more costly as context lengths increase. This constant back-and-forth between external memory and the GPU limits the overall performance of long-context LLM inference.

To address these challenges, researchers have proposed a novel approach: Scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference. This CXL-enabled KV-cache management system rethinks how memory and computation are coordinated, moving beyond the traditional limitations of GPUs. The core idea is to integrate a PNM accelerator directly within CXL memory modules. This accelerator takes over the task of token page selection, eliminating the need to recall large amounts of data to the GPU. By doing so, it significantly reduces data transfer overhead and frees up GPU memory, allowing for much larger batch sizes during inference.

The PNM architecture is designed for efficiency and scalability. It features a reconfigurable Vector Processing Unit (VPU) capable of handling various KV-cache management tasks, such as generating compact summaries of token pages (digests), estimating relevance scores, and performing attention computations. This flexibility allows the same hardware resources to be reused across different stages, minimizing overhead. Additionally, a parallel Top-K Sorter quickly identifies the most relevant token pages.
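The digest, relevance-estimation, and Top-K stages can be sketched in software as follows. The article does not specify the digest format, so this sketch assumes a per-dimension min/max summary of each page's keys and scores a page by an upper bound on its query-key dot product, a heuristic known from page-level sparse attention. All function names are hypothetical.

```python
import heapq


def page_digest(page_keys):
    """Summarize a page of key vectors by per-dimension min and max (assumed format)."""
    dims = range(len(page_keys[0]))
    lo = [min(key[d] for key in page_keys) for d in dims]
    hi = [max(key[d] for key in page_keys) for d in dims]
    return lo, hi


def relevance_score(query, digest):
    """Upper bound on q.k over all keys in the page, computed from the digest alone."""
    lo, hi = digest
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_top_k_pages(query, digests, k):
    """Return the indices of the k highest-scoring pages, in ascending index order."""
    scores = [relevance_score(query, d) for d in digests]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


# Toy example: three pages of two 2-dimensional keys each.
pages = [
    [[0.1, 0.2], [0.0, 0.3]],    # page 0
    [[2.0, -1.0], [1.5, -0.5]],  # page 1
    [[-1.0, 1.0], [-2.0, 0.5]],  # page 2
]
digests = [page_digest(p) for p in pages]
print(select_top_k_pages([1.0, 0.0], digests, k=2))  # prints [0, 1]
```

Only the digests are read during selection, which is why this step suits hardware that sits next to the memory: the full keys and values of unselected pages are never touched.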

A key innovation is the workload partitioning strategy. The system offloads the entire attention computation to the CXL-PNM devices, while the GPU focuses on the computationally intensive fully connected (FC) layers. Crucially, the data exchanged between the GPU and PNM (query, key, and value vectors) remains constant in size, regardless of the context length. This prevents CXL link bottlenecks, ensuring scalability even with millions of tokens.
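A small sketch shows why the exchanged data does not depend on context length. It assumes that, per layer and per decode step, the GPU sends the new token's query, key, and value vectors and receives an attention output of the same size as the query. The head counts and the 1% recall fraction used for comparison are hypothetical, not values from the paper.

```python
def pnm_exchange_bytes(batch_size, num_q_heads, num_kv_heads, head_dim,
                       bytes_per_elem=2):
    """Per-layer, per-step traffic when attention runs on the PNM device."""
    q = batch_size * num_q_heads * head_dim          # new query vectors
    kv = 2 * batch_size * num_kv_heads * head_dim    # new key and value vectors
    out = batch_size * num_q_heads * head_dim        # attention output (assumed)
    return (q + kv + out) * bytes_per_elem


def recall_bytes(batch_size, num_kv_heads, head_dim, recalled_tokens,
                 bytes_per_elem=2):
    """Per-layer traffic if selected KV tokens are copied back to the GPU instead."""
    return 2 * batch_size * num_kv_heads * head_dim * recalled_tokens * bytes_per_elem


for context_len in (8_000, 1_000_000):
    offload = pnm_exchange_bytes(batch_size=8, num_q_heads=128,
                                 num_kv_heads=8, head_dim=128)
    # Assumption: 1% of the context is recalled per step in the baseline.
    recall = recall_bytes(batch_size=8, num_kv_heads=8, head_dim=128,
                          recalled_tokens=context_len // 100)
    print(f"context={context_len:>9}: offload={offload} B, recall={recall} B")
```

Note that `pnm_exchange_bytes` has no context-length parameter at all, while the recall estimate grows in step with the number of tokens pulled back.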

For systems with multiple PNM devices, the paper introduces a data parallelism (DP) strategy for attention computation. Instead of dividing the workload by attention heads (tensor parallelism), each PNM device processes different batched requests independently. This eliminates the need for costly cross-device communication and aggregation, further enhancing efficiency and scalability.
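As an illustration of request-level data parallelism, the sketch below places each request's whole KV-cache on a single device. The round-robin policy and the function name are assumptions made for this example; the article does not say how requests are mapped to devices.

```python
def assign_requests(request_ids, num_devices):
    """Round-robin placement: each request's entire KV-cache lives on one device."""
    placement = {dev: [] for dev in range(num_devices)}
    for i, req in enumerate(request_ids):
        placement[i % num_devices].append(req)
    return placement


placement = assign_requests(["r0", "r1", "r2", "r3", "r4"], num_devices=2)
for dev, reqs in placement.items():
    # Each device computes attention over all heads for its own requests,
    # so no partial results need to be combined across devices.
    print(f"PNM device {dev}: {reqs}")
```

With head-wise tensor parallelism, by contrast, every request would touch every device, and the per-head outputs would have to be gathered before the next layer.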

To keep the GPU from sitting idle during PNM attention computation, the paper proposes a hybrid execution model called PnG-KV (PNM-GPU hybrid with steady-token execution). In PnG-KV, the GPU and the PNM devices perform attention collaboratively. A novel “steady-token selection” algorithm identifies tokens whose relevance persists over time. A small subset of these steady tokens is kept on the GPU, allowing it to take part in attention computation for the full batch. This minimizes data movement from CXL memory to the GPU while maintaining high throughput for FC computations.
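The article does not give the details of the selection algorithm or of how the two partial attention results are combined, so the following is only a sketch under stated assumptions. It treats a page as "steady" if it was selected in at least a given fraction of recent decode steps, and it merges GPU-side and PNM-side attention with the standard log-sum-exp rescaling used for split softmax. Class and function names are hypothetical.

```python
import math
from collections import Counter, deque


class SteadyPageTracker:
    """Counts how often each page was selected over the last `window` decode steps."""

    def __init__(self, window):
        self.history = deque(maxlen=window)

    def record(self, selected_pages):
        self.history.append(frozenset(selected_pages))

    def steady_pages(self, budget, min_fraction=0.75):
        """Pages selected in at least `min_fraction` of recent steps, most frequent first."""
        counts = Counter(p for step in self.history for p in step)
        threshold = min_fraction * len(self.history)
        steady = [p for p, c in counts.items() if c >= threshold]
        steady.sort(key=lambda p: (-counts[p], p))
        return steady[:budget]


def partial_attention(query, keys, values):
    """Softmax attention over a subset of tokens; also returns its log-sum-exp."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(len(values[0]))]
    return out, m + math.log(denom)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into attention over the union of their tokens."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


tracker = SteadyPageTracker(window=4)
for step in ([0, 3, 7], [0, 3, 9], [0, 2, 3], [0, 3, 7]):
    tracker.record(step)
print(tracker.steady_pages(budget=2))  # prints [0, 3]
```

In this sketch the GPU would call `partial_attention` on the steady tokens it holds, the PNM device would do the same for the remaining selected tokens, and `merge_partials` would yield the same result as a single softmax over both sets.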

The evaluation of this CXL-enabled multi-PNM system reports large gains. Compared to a CXL-memory-expanded GPU baseline, the PNM-KV (PNM-only offloading) and PnG-KV (hybrid) schemes achieved up to 21.9× higher throughput, up to 60× lower energy consumption per token, and up to 7.3× better total cost efficiency. These gains were consistent across server-level and rack-scale deployments, supporting LLMs with up to 405 billion parameters and 1-million-token contexts.

This research highlights CXL’s potential not just as a memory expander, but as a computational backbone for future AI systems. By intelligently distributing KV-cache management and attention computation to specialized PNM accelerators, this approach sets a new standard for scalable and efficient long-context LLM inference.

Nikhil Patel
