ShadowServe: Boosting LLM Performance with SmartNIC-Powered KV Cache Management

TLDR: ShadowServe is a novel system that significantly improves Large Language Model (LLM) serving performance by offloading KV cache fetching and decompression to SmartNICs. This approach eliminates interference with host GPU and CPU, leading to up to 2.2x lower time-per-output-token and 1.35x higher throughput, especially in low-bandwidth environments. It achieves this through a chunked pipeline and efficient memory management on the SmartNIC.

Large Language Models, or LLMs, are becoming increasingly powerful, handling longer and more complex conversations. To do this efficiently, they rely on a technique called prefix caching. Imagine you’re having a long chat with an AI; instead of re-processing the entire conversation history every time you send a new message, prefix caching stores the ‘memory’ (known as the KV cache) of the common parts of your conversation. This saves a lot of computational effort.
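To make the idea concrete, here is a minimal sketch of a prefix cache lookup. It is an illustration only, not ShadowServe's implementation: the `PrefixCache` class, the block size, and the string placeholders standing in for real KV tensors are all assumptions made for this example.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache keyed by a hash of each block-aligned token prefix."""

    def __init__(self, block_size=4):
        self.block_size = block_size  # hypothetical block granularity
        self.store = {}               # prefix hash -> placeholder for KV data

    def _key(self, tokens):
        return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

    def insert(self, tokens):
        # Record every full-block prefix of the sequence.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.store[self._key(tokens[:end])] = f"kv[0:{end}]"

    def longest_cached_prefix(self, tokens):
        # Return how many leading tokens already have cached KV data.
        matched = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if self._key(tokens[:end]) not in self.store:
                break
            matched = end
        return matched


cache = PrefixCache(block_size=4)
cache.insert(list(range(10)))                  # earlier conversation turns
follow_up = list(range(8)) + [99, 100, 101, 102]
reused = cache.longest_cached_prefix(follow_up)
print(f"{reused} of {len(follow_up)} tokens can reuse cached KV data")
```

Only the tokens after the matched prefix need to be processed from scratch; the rest can be served from the stored KV cache.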

However, as these LLMs grow and are used in distributed systems, fetching this KV cache data can become a major bottleneck, especially when network bandwidth is limited. Previous attempts to solve this involved compressing the KV cache data to reduce transfer size. But there was a catch: decompressing this data on the same GPU that’s running the LLM often caused significant interference, slowing down both the decompression and the model’s computation. Offloading decompression to the main CPU wasn’t much better, as CPUs are often busy with other tasks and aren’t very efficient at these specific decompression algorithms.
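A back-of-the-envelope calculation shows why bandwidth matters. The sketch below uses the standard size estimate for a KV cache (keys plus values, per layer, per KV head, per token). The model dimensions and the compression ratio are hypothetical values chosen for illustration; they are not taken from the ShadowServe paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # The factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(num_bytes, bandwidth_gbps, compression_ratio=1.0):
    # Time to move the (optionally compressed) payload over the network.
    bits_on_wire = num_bytes * 8 / compression_ratio
    return bits_on_wire / (bandwidth_gbps * 1e9)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
size = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)
raw = transfer_seconds(size, bandwidth_gbps=20)
compressed = transfer_seconds(size, bandwidth_gbps=20, compression_ratio=4)  # assumed ratio
print(f"KV cache size: {size / 2**30:.2f} GiB")
print(f"Transfer at 20 Gbps: {raw:.3f} s raw, {compressed:.3f} s compressed")
```

Under these assumptions, an 8,192-token prefix occupies 1 GiB and takes roughly 0.43 seconds to fetch uncompressed at 20 Gbps, which is why shrinking the payload is attractive, provided the decompression cost lands somewhere harmless.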

This is where a new system called ShadowServe comes in. Developed by researchers from Harvard University, University of Chicago, University of Washington, and UC Davis, ShadowServe introduces a novel approach to tackle this problem. Instead of using the GPU or CPU for KV cache fetching and decompression, it offloads these tasks entirely to a specialized piece of hardware called a SmartNIC. SmartNICs are essentially network interface cards with their own dedicated processors and memory, isolated from the main host CPU and GPU.

The core idea behind ShadowServe is to separate the ‘control plane’ (which manages when and what KV cache to fetch) on the host CPU from the ‘data plane’ (which handles the actual fetching, decompression, and transfer) on the SmartNIC. This clean separation ensures that the host GPU and CPU are free to focus solely on running the LLM, eliminating the performance interference that plagued previous solutions.

SmartNICs, while powerful for their size, do have limited compute and memory resources. To overcome this, ShadowServe employs two key innovations. First, it uses a ‘chunked pipeline’ for the SmartNIC’s data plane. This means that instead of processing an entire KV cache sequentially, the data is split into fixed-size chunks that flow through different stages (network fetching, lossless decompression, dequantization, and direct memory access to the GPU) in parallel. This maximizes the utilization of the SmartNIC’s resources, including its dedicated hardware accelerators for decompression.
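The chunked pipeline can be sketched with one thread per stage and bounded queues between them, so that one chunk is being decompressed while the next is still being fetched. This is a simplified host-side illustration, not the SmartNIC code: `zlib` stands in for the hardware decompression engine, a scalar multiply stands in for dequantization, a Python list stands in for GPU memory, and all function names are hypothetical.

```python
import queue
import threading
import zlib

SENTINEL = None  # marks the end of the chunk stream


def make_chunks(data, chunk_size):
    # Split the payload into fixed-size chunks and compress each one separately.
    return [zlib.compress(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def stage(fn, inbox, outbox):
    # Each stage runs in its own thread and handles one chunk at a time.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def run_pipeline(compressed_chunks, scale=0.5):
    stages = [
        lambda chunk: chunk,                     # stand-in for the network fetch
        zlib.decompress,                         # stand-in for lossless decompression
        lambda raw: [b * scale for b in raw],    # stand-in for dequantization
    ]
    # Bounded queues keep only a few chunks in flight between stages.
    queues = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def feed():
        for chunk in compressed_chunks:
            queues[0].put(chunk)
        queues[0].put(SENTINEL)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    gpu_buffer = []  # stand-in for the destination GPU memory
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        gpu_buffer.extend(item)  # stand-in for the DMA write to the GPU
    for t in threads:
        t.join()
    return gpu_buffer


chunks = make_chunks(bytes(range(10)), chunk_size=4)
print(run_pipeline(chunks))
```

Because every stage is a single FIFO worker, chunks arrive at the destination in their original order, and the bounded queues cap how much memory is in use at once, which matters on a device with limited memory.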

Second, ShadowServe implements a ‘minimal-copy memory management’ scheme. It pre-allocates and ‘pins’ all necessary memory buffers on both the SmartNIC and the GPU. This reduces redundant data copies and avoids costly memory registration during runtime, ensuring a smooth and efficient flow of data through the pipeline.
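The pre-allocation idea can be illustrated with a small fixed-size buffer pool. This is a sketch under stated assumptions: plain `bytearray` objects stand in for pinned, pre-registered memory, and the `BufferPool` class and its methods are hypothetical names, not part of ShadowServe or any SmartNIC SDK.

```python
import queue


class BufferPool:
    """Fixed set of buffers allocated once up front and reused for every chunk."""

    def __init__(self, num_buffers, buffer_size):
        # All allocation happens here, before any request is served.
        self._buffers = [bytearray(buffer_size) for _ in range(num_buffers)]
        self._free = queue.Queue()
        for idx in range(num_buffers):
            self._free.put(idx)

    def acquire(self):
        # Blocks when every buffer is in use, which applies back-pressure upstream.
        idx = self._free.get()
        return idx, memoryview(self._buffers[idx])

    def release(self, idx):
        self._free.put(idx)


pool = BufferPool(num_buffers=2, buffer_size=8)
idx, view = pool.acquire()
payload = b"chunk"
view[:len(payload)] = payload        # write in place, no new allocation
print(bytes(view[:len(payload)]))
pool.release(idx)
```

Writing through a `memoryview` modifies the existing buffer rather than creating a copy, which mirrors the goal of keeping data in buffers that were set up once and never re-registered at runtime.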

The results are impressive. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), which means the model generates subsequent tokens much faster when the system is under load. In low-bandwidth network scenarios (20 Gbps or less), it also lowers the time-to-first-token (TTFT) by up to 1.38x, leading to an overall throughput increase of up to 1.35x. The research also points out a limitation: in very high-bandwidth settings, ShadowServe’s performance can be capped by the SmartNIC’s current memory subsystem, highlighting an area for future hardware improvements.

ShadowServe doesn’t invent new compression algorithms; instead, it offloads existing ones to the SmartNIC, making them interference-free. This makes its design broadly applicable to various compression techniques. The system demonstrates that SmartNICs are a highly promising and currently underutilized resource for enhancing the performance and efficiency of LLM serving infrastructure. The full research paper is titled “ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching.”
