TLDR: Krul is a novel LLM inference system designed to improve the efficiency of multi-turn conversations by optimizing Key-Value (KV) cache management. Unlike static compression methods, Krul dynamically selects compression strategies based on conversation-specific attention patterns. It introduces a preemptive strategy selector to preserve critical context, a token-wise heterogeneous attention similarity estimator to reduce computational overhead, and a bubble-free restoration scheduler to ensure efficient recomputation and loading. This approach leads to significant reductions in time-to-first-token (TTFT) and KV cache storage, while maintaining high generation quality.
Large Language Models (LLMs) are especially useful in multi-turn conversations, where they draw on past interactions to give contextually relevant responses. Keeping this “memory” available for long and frequent conversations is a significant challenge, however. It lives in Key-Value (KV) caches, which hold the attention keys and values computed for earlier tokens. When a conversation becomes inactive, its KV caches are often evicted from GPU memory to free up space. When the conversation resumes, the system has to recompute or reload all of that history, which adds delay and computational cost.
Existing solutions have tried to tackle this by compressing KV caches, often by grouping similar attention patterns across different layers of the LLM. The problem is, these methods typically use a one-size-fits-all compression approach. They apply the same fixed compression scheme to all conversations, regardless of how attention patterns might vary. This static strategy can lead to a noticeable drop in the quality of the generated responses because it doesn’t adapt to the unique dynamics of each conversation.
Enter Krul, a new multi-turn LLM inference system designed to make KV cache restoration both accurate and efficient. Krul stands out by dynamically choosing its compression strategies. Instead of a fixed approach, it assesses the similarity of attention patterns across different layers for each specific conversation. This allows it to create a customized compression plan, ensuring that crucial context is preserved while still achieving significant memory savings.
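To make the idea of a per-conversation plan concrete, here is a minimal sketch. It is not Krul's actual algorithm: the cosine-similarity measure, the 0.9 threshold, and the function name plan_layer_sharing are all assumptions chosen for illustration. Each layer is summarized by one attention-weight vector, and a layer reuses the KV cache of the layer before it only when the two vectors are similar enough.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def plan_layer_sharing(layer_attention, threshold=0.9):
    """Return, for each layer, the index of the layer whose KV cache it uses.

    Hypothetical rule: a layer joins the group of the previous layer when its
    attention pattern is close to that group's first layer; otherwise it
    starts a new group and keeps its own cache.
    """
    owner = [0]
    for i in range(1, len(layer_attention)):
        anchor = owner[i - 1]
        if cosine(layer_attention[i], layer_attention[anchor]) >= threshold:
            owner.append(anchor)
        else:
            owner.append(i)
    return owner


# Two made-up conversations with different attention behaviour per layer.
conversation_a = [[0.70, 0.20, 0.10], [0.68, 0.22, 0.10],
                  [0.10, 0.20, 0.70], [0.12, 0.18, 0.70]]
conversation_b = [[0.70, 0.20, 0.10], [0.10, 0.70, 0.20],
                  [0.12, 0.68, 0.20], [0.10, 0.70, 0.20]]

plan_a = plan_layer_sharing(conversation_a)
plan_b = plan_layer_sharing(conversation_b)
```

The same model ends up with different sharing groups for the two conversations, which is the behaviour a fixed, one-size-fits-all scheme cannot provide.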
Krul introduces three key innovations to achieve this balance. First, a preemptive compression strategy selector intelligently identifies which parts of the model’s memory are sensitive to new user inputs and should not be compressed. For the remaining parts, it selects the most effective compression strategy tailored to that conversation. This ensures that even with compression, the model doesn’t lose vital information needed for future turns.
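The two-step selection described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's selector: the per-layer sensitivity scores, the candidate strategies expressed as (source, target) layer pairs, both thresholds, and the rule "pick the candidate that keeps the most valid pairs" are hypothetical.

```python
def select_strategy(sensitivity, similarity, candidates,
                    max_sensitivity=0.5, min_similarity=0.9):
    """Pick a compression strategy for one conversation.

    sensitivity: hypothetical per-layer score of how strongly the layer
                 reacts to new user input (higher means more sensitive).
    similarity:  square matrix, similarity[a][b] between layers a and b.
    candidates:  dict mapping a strategy name to (source, target) pairs,
                 where the target layer would reuse the source's KV cache.
    """
    # Step 1: sensitive layers are protected and never compressed.
    protected = {i for i, s in enumerate(sensitivity) if s > max_sensitivity}

    # Step 2: among the candidates, keep only pairs that avoid protected
    # layers and are similar enough, then prefer the one that shares most.
    best_name, best_pairs = None, []
    for name, pairs in candidates.items():
        kept = [(a, b) for a, b in pairs
                if a not in protected and b not in protected
                and similarity[a][b] >= min_similarity]
        if len(kept) > len(best_pairs):
            best_name, best_pairs = name, kept
    return protected, best_name, best_pairs


sensitivity = [0.9, 0.1, 0.2, 0.1]          # layer 0 is input-sensitive
similarity = [[1.00, 0.99, 0.20, 0.20],
              [0.99, 1.00, 0.93, 0.97],
              [0.20, 0.93, 1.00, 0.95],
              [0.20, 0.97, 0.95, 1.00]]
candidates = {
    "adjacent": [(0, 1), (2, 3)],
    "share_from_1": [(1, 2), (1, 3)],
}

protected, chosen, pairs = select_strategy(sensitivity, similarity, candidates)
```

Note that the pair (0, 1) is rejected even though its similarity is 0.99, because layer 0 is protected. Deciding what must stay uncompressed before choosing a strategy is what makes the selection preemptive in this sketch.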
Second, Krul features a token-wise heterogeneous attention similarity estimator. Calculating attention similarity can be computationally intensive and memory-heavy, especially for long conversations. Krul addresses this by smartly dividing the workload: it offloads the computation of attention similarities for the initial “prefilling” phase of a prompt to the CPU, while keeping the more frequent, smaller computations for the “decoding” phase on the GPU. This clever division minimizes overhead during model generation.
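The division of work can be sketched in a few lines. This is only a stand-in under loud assumptions: a background thread plays the role of the CPU side, inline calls play the role of the GPU side, and the class name, method names, and the per-token cosine measure are hypothetical rather than taken from Krul.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def _batch_similarity(rows_a, rows_b):
    """Sum of per-token similarities for a whole prompt, plus token count."""
    return sum(cosine(a, b) for a, b in zip(rows_a, rows_b)), len(rows_a)


class SimilarityEstimator:
    """Accumulates token-wise similarity between two layers."""

    def __init__(self):
        # Assumption: one worker thread stands in for the CPU offload path.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []
        self._total = 0.0
        self._count = 0

    def submit_prefill(self, rows_a, rows_b):
        # Prefill covers many tokens at once, so it runs off the main path.
        self._pending.append(
            self._pool.submit(_batch_similarity, rows_a, rows_b))

    def add_decode_token(self, row_a, row_b):
        # Decoding adds one token per step, cheap enough to handle inline.
        self._total += cosine(row_a, row_b)
        self._count += 1

    def result(self):
        # Fold in any finished (or still running) prefill jobs.
        for future in self._pending:
            total, count = future.result()
            self._total += total
            self._count += count
        self._pending.clear()
        return self._total / self._count if self._count else 0.0

    def close(self):
        self._pool.shutdown()


estimator = SimilarityEstimator()
estimator.submit_prefill([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 2.0]])
estimator.add_decode_token([1.0, 0.0], [0.0, 1.0])
average_similarity = estimator.result()
estimator.close()
```

Because the estimate is accumulated token by token, the offloaded prefill portion and the inline decode portion can be merged at any time without recomputing either.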
Finally, a bubble-free restoration scheduler tackles the challenge of efficiently restoring compressed KV caches. Traditional restoration methods can suffer from “bubbles” or idle times when recomputing and loading data don’t perfectly align. Krul’s scheduler dynamically orchestrates these tasks, ensuring a smooth, overlapped pipeline that reduces potential delays caused by the imbalance between recomputing and loading compressed data.
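A small worked example shows why balancing the two tasks matters. It rests on simplifying assumptions that are not from the paper: every layer costs the same fixed time to recompute or to load, the two activities run fully in parallel, and dependencies between layers are ignored. The function name split_restoration is hypothetical.

```python
def split_restoration(n_layers, recompute_ms, load_ms):
    """Choose how many layers to recompute so neither resource sits idle.

    Recomputation (on the GPU) and loading (over I/O) are assumed to overlap,
    so the restoration finishes when the slower of the two finishes.
    Returns (layers_to_recompute, total_time_ms).
    """
    def finish_time(k):
        return max(k * recompute_ms, (n_layers - k) * load_ms)

    best_k = min(range(n_layers + 1), key=finish_time)
    return best_k, finish_time(best_k)


# 32 layers; recomputing a layer is three times as slow as loading it.
recompute_count, balanced_ms = split_restoration(32, 3.0, 1.0)
_, load_only_ms = 0, max(0 * 3.0, 32 * 1.0)
```

With these numbers the balanced split recomputes 8 layers and loads 24, and both finish after 24 ms. Loading everything would take 32 ms while the GPU waits, which is the kind of idle "bubble" the scheduler is meant to avoid.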
In empirical evaluations on real-world tasks, Krul reduces Time-to-First-Token (TTFT) by 1.5x to 2.68x compared to current state-of-the-art methods. It also reduces KV cache storage by 1.33x to 2.35x. Crucially, these gains come at little cost to the quality of the generated responses: the average accuracy loss stays below 1%.
This innovative system represents a substantial step forward in making multi-turn conversations with LLMs more responsive and resource-efficient, paving the way for more seamless and cost-effective real-time AI applications. You can find more details about this research in the full paper: Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing.


