Boosting LLM Performance: How Processing-Near-Memory Redefines KV-Cache Management

TLDR: This paper introduces a CXL-enabled Processing-Near-Memory (PNM) architecture for efficient 1M-token LLM inference. It offloads KV-cache management and attention computation to PNM accelerators within CXL memory, eliminating costly GPU recalls and enabling larger batch sizes. A hybrid GPU-PNM execution model (PnG-KV) with “steady-token selection” further optimizes performance by utilizing idle GPU resources. The system achieves significant improvements in throughput, energy efficiency, and cost efficiency for large LLMs with long contexts.

Large Language Models (LLMs) are becoming increasingly powerful, with context windows expanding to process millions of tokens. This capability unlocks advanced features like summarizing long documents, multi-step reasoning, and analyzing extensive codebases. However, this growth introduces significant challenges, particularly in managing the Key-Value (KV) cache, which stores information about past tokens to maintain context during inference. The KV-cache footprint grows linearly with context length and batch size, quickly overwhelming the memory capacity of traditional GPU systems.
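The linear growth is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a standard transformer KV-cache layout (one key and one value vector per token, per layer, per KV head) and 16-bit elements. The function name and the model dimensions in the usage line are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, bytes_per_elem=2):
    """Estimate total KV-cache size in bytes.

    Each token stores one key and one value vector (the factor of 2)
    for every layer and every KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size


# Hypothetical model shape, chosen only for illustration:
# 126 layers, 8 KV heads, head dimension 128, FP16 elements.
one_request = kv_cache_bytes(126, 8, 128, context_len=1_000_000, batch_size=1)
print(f"{one_request / 1e9:.1f} GB for a single 1M-token request")
```

Under these assumed dimensions a single 1M-token request needs roughly 516 GB of KV-cache, and the figure scales directly with batch size, which is why the cache cannot stay resident in GPU memory.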

While technologies like Compute Express Link (CXL) allow LLM serving systems to offload the entire KV-cache to scalable external memory, a major bottleneck remains: non-resident KV tokens must be recalled into the limited GPU memory, and these transfers grow more costly as context lengths increase. This constant back-and-forth between external memory and the GPU limits the overall performance of long-context LLM inference.

To address these challenges, researchers have proposed a novel approach: Scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference. This CXL-enabled KV-cache management system rethinks how memory and computation are coordinated, moving beyond the traditional limitations of GPUs. The core idea is to integrate a PNM accelerator directly within CXL memory modules. This accelerator takes over the task of token page selection, eliminating the need to recall large amounts of data to the GPU. By doing so, it significantly reduces data transfer overhead and frees up GPU memory, allowing for much larger batch sizes during inference.

The PNM architecture is designed for efficiency and scalability. It features a reconfigurable Vector Processing Unit (VPU) capable of handling various KV-cache management tasks, such as generating compact summaries of token pages (digests), estimating relevance scores, and performing attention computations. This flexibility allows the same hardware resources to be reused across different stages, minimizing overhead. Additionally, a parallel Top-K Sorter quickly identifies the most relevant token pages.
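To make the digest, relevance-score, and Top-K stages concrete, here is a minimal software sketch. The article does not say how the paper builds its digests, so this example assumes a common choice from the sparse-attention literature: a per-dimension (min, max) summary of each page's keys, which gives an upper bound on any query-key dot product within the page. All function names are hypothetical, and the real design runs these steps in hardware on the VPU and Top-K Sorter rather than in Python.

```python
import heapq


def page_digest(keys):
    """Summarize a page of key vectors as per-dimension (mins, maxs)."""
    dims = range(len(keys[0]))
    mins = [min(k[d] for k in keys) for d in dims]
    maxs = [max(k[d] for k in keys) for d in dims]
    return mins, maxs


def relevance_score(query, digest):
    """Upper bound on the dot product between the query and any key in the page."""
    mins, maxs = digest
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_top_k_pages(query, digests, k):
    """Return the indices of the k highest-scoring pages, in ascending order."""
    scored = ((relevance_score(query, d), i) for i, d in enumerate(digests))
    return sorted(i for _, i in heapq.nlargest(k, scored))


# Three tiny pages of 2-dimensional keys (made-up values).
pages = [
    [[1.0, 0.0], [0.5, 0.0]],
    [[0.0, 1.0], [0.0, 2.0]],
    [[-1.0, -1.0], [-2.0, -1.0]],
]
digests = [page_digest(p) for p in pages]
selected = select_top_k_pages([1.0, 0.0], digests, k=2)
```

Only the selected pages then need full attention computation, so the cost of each decode step depends on the number of pages kept rather than on the full context length.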

A key innovation is the workload partitioning strategy. The system offloads the entire attention computation to the CXL-PNM devices, while the GPU focuses on the computationally intensive fully connected (FC) layers. Crucially, the data exchanged between the GPU and PNM (query, key, and value vectors) remains constant in size, regardless of the context length. This prevents CXL link bottlenecks, ensuring scalability even with millions of tokens.
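A rough calculation shows why the exchanged data does not depend on context length. The sketch below counts only the new token's query, key, and value vectors for one layer and one decode step. The function name and the head counts in the usage line are assumptions for illustration, and the actual wire format used by the paper may differ.

```python
def per_step_exchange_bytes(batch_size, num_q_heads, num_kv_heads,
                            head_dim, bytes_per_elem=2):
    """Bytes of new-token q, k, v vectors sent for one layer in one decode step.

    Note that context length is not a parameter: only the vectors for the
    token being generated cross the link, not the cached history.
    """
    elems_per_request = (num_q_heads + 2 * num_kv_heads) * head_dim
    return batch_size * elems_per_request * bytes_per_elem


# Hypothetical shape: 128 query heads, 8 KV heads, head dimension 128, FP16.
per_request = per_step_exchange_bytes(1, 128, 8, 128)
print(f"{per_request} bytes per request, per layer, per step")
```

With these assumed dimensions the exchange is a few tens of kilobytes per request per layer, whether the context holds a thousand tokens or a million.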

For systems with multiple PNM devices, the paper introduces a data parallelism (DP) strategy for attention computation. Instead of dividing the workload by attention heads (tensor parallelism), each PNM device processes different batched requests independently. This eliminates the need for costly cross-device communication and aggregation, further enhancing efficiency and scalability.
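The placement logic behind data parallelism can be sketched in a few lines. This example uses a simple round-robin assignment, which is an assumption made for illustration. The article does not describe the paper's actual scheduling policy, and the function name is hypothetical.

```python
def assign_requests_dp(request_ids, num_devices):
    """Assign each request (with its whole KV-cache) to exactly one device.

    Because a request's keys and values all live on a single device,
    that device can finish the request's attention without exchanging
    partial results with any other device.
    """
    placement = {device: [] for device in range(num_devices)}
    for i, request_id in enumerate(request_ids):
        placement[i % num_devices].append(request_id)
    return placement


placement = assign_requests_dp(["r0", "r1", "r2", "r3", "r4"], num_devices=2)
```

By contrast, splitting by attention heads would spread every request across all devices, so each request's per-head outputs would have to be gathered from every device at each layer.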

To keep the GPU from sitting idle while the PNM devices compute attention, the paper proposes a hybrid execution model called PnG-KV (PNM-GPU hybrid with steady-token execution). In PnG-KV, the GPU and the PNM devices perform attention operations collaboratively. A novel “Steady Token Selection” algorithm identifies tokens whose relevance persists over time. A small subset of these steady tokens is kept on the GPU, so the GPU can take part in attention computation while still serving the full batch size. This minimizes data movement from CXL memory to the GPU while maintaining high throughput for FC computations.
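The article does not spell out how steady tokens are identified, so the following is only one plausible heuristic, not the paper's algorithm: track which token pages were selected in recent decode steps and treat pages selected in most of those steps as steady. The class name, the sliding window, the hit threshold, and the GPU budget are all assumptions introduced for this sketch.

```python
from collections import Counter, deque


class SteadyTokenTracker:
    """Track recently selected pages and report the persistently relevant ones."""

    def __init__(self, window, min_hits, gpu_budget):
        self.history = deque(maxlen=window)  # selections from the last `window` steps
        self.min_hits = min_hits             # hits needed to count as steady
        self.gpu_budget = gpu_budget         # max pages kept resident on the GPU

    def observe(self, selected_pages):
        """Record the set of pages selected in one decode step."""
        self.history.append(frozenset(selected_pages))

    def steady_pages(self):
        """Pages hit at least `min_hits` times in the window, most frequent first."""
        counts = Counter(p for step in self.history for p in step)
        steady = [p for p, c in counts.items() if c >= self.min_hits]
        steady.sort(key=lambda p: (-counts[p], p))
        return steady[: self.gpu_budget]


tracker = SteadyTokenTracker(window=4, min_hits=3, gpu_budget=2)
for step_selection in ([0, 5, 9], [0, 5, 7], [0, 3, 5], [0, 2, 9]):
    tracker.observe(step_selection)
resident = tracker.steady_pages()
```

In this toy run pages 0 and 5 are selected in at least three of the four steps, so they would be the ones pinned in GPU memory, while the remaining pages stay in CXL memory for the PNM devices to handle.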

The evaluation of this CXL-enabled multi-PNM system reports large gains. Compared to a CXL-memory-expanded GPU baseline, the PNM-KV (PNM-only offloading) and PnG-KV (hybrid) schemes achieve up to 21.9x higher throughput, up to 60x lower energy consumption per token, and up to 7.3x better total cost efficiency. These gains were consistent across server-level and rack-scale deployments, supporting LLMs with up to 405 billion parameters and 1-million-token contexts.

This research highlights CXL’s potential not just as a memory expander, but as a computational backbone for future AI systems. By intelligently distributing KV-cache management and attention computation to specialized PNM accelerators, this approach sets a new standard for scalable and efficient long-context LLM inference.
