Krul: Smarter Memory Use for Multi-Turn LLM Interactions

TLDR: Krul is a novel LLM inference system designed to improve the efficiency of multi-turn conversations by optimizing Key-Value (KV) cache management. Unlike static compression methods, Krul dynamically selects compression strategies based on conversation-specific attention patterns. It introduces a preemptive strategy selector to preserve critical context, a token-wise heterogeneous attention similarity estimator to reduce computational overhead, and a bubble-free restoration scheduler to ensure efficient recomputation and loading. This approach leads to significant reductions in time-to-first-token (TTFT) and KV cache storage, while maintaining high generation quality.

Large Language Models (LLMs) handle multi-turn conversations by keeping track of past interactions, which lets them give contextually relevant responses. This “memory” is stored as Key-Value (KV) caches, and maintaining it for long and frequent conversations is a significant challenge. When a conversation becomes inactive, its KV cache is often evicted from GPU memory to free up space. When the conversation resumes, the system has to recompute or reload all of that historical data, which adds delay and computational cost.

Existing solutions tackle this by compressing KV caches, often by grouping layers of the LLM that show similar attention patterns. The problem is that these methods typically take a one-size-fits-all approach: they apply the same fixed compression scheme to every conversation, regardless of how its attention patterns vary. Because a static strategy does not adapt to the dynamics of each conversation, it can noticeably lower the quality of the generated responses.

Enter Krul, a new multi-turn LLM inference system designed to make KV cache restoration both accurate and efficient. Krul stands out by dynamically choosing its compression strategies. Instead of a fixed approach, it assesses the similarity of attention patterns across different layers for each specific conversation. This allows it to create a customized compression plan, ensuring that crucial context is preserved while still achieving significant memory savings.

Krul introduces three key innovations to achieve this balance. First, a preemptive compression strategy selector identifies which parts of the model’s memory (its KV cache) are sensitive to new user inputs and should therefore not be compressed. For the remaining parts, it selects the compression strategy that best fits that conversation. This way, even with compression, the model keeps the information it needs for future turns.
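To make the idea of a per-conversation compression plan more concrete, here is a minimal sketch. It is not Krul’s actual selector: the function names, the cosine-similarity measure, the 0.9 threshold, and the toy attention vectors are all assumptions made for illustration. Layers marked as sensitive keep their own KV cache, and any other layer reuses an earlier layer’s cache when their attention patterns look alike.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def plan_sharing(attn_by_layer, sensitive_layers, threshold=0.9):
    """Hypothetical planner: returns {layer: donor_layer}.

    A layer that maps to itself keeps its own KV cache. A layer that maps to
    an earlier layer would reuse (share) that layer's KV cache instead.
    """
    plan = {}
    for layer, attn in enumerate(attn_by_layer):
        plan[layer] = layer  # default: keep this layer's own cache
        if layer == 0 or layer in sensitive_layers:
            continue  # sensitive layers are never compressed
        donor = plan[layer - 1]  # the layer that actually holds the previous cache
        if cosine(attn, attn_by_layer[donor]) >= threshold:
            plan[layer] = donor
    return plan


# Toy per-layer attention summaries for one conversation (made-up numbers).
attn = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.10, 0.20, 0.70],
    [0.12, 0.18, 0.70],
]

print(plan_sharing(attn, sensitive_layers={3}))    # layer 3 is protected
print(plan_sharing(attn, sensitive_layers=set()))  # layer 3 may share with layer 2
```

With layer 3 marked as sensitive, only layer 1 shares a cache (with layer 0). Without that protection, layer 3 would also share with layer 2. A different conversation with different attention patterns would produce a different plan, which is the point of choosing the strategy dynamically.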

Second, Krul features a token-wise heterogeneous attention similarity estimator. Calculating attention similarity can be computationally intensive and memory-heavy, especially for long conversations. Krul addresses this by dividing the workload: it offloads the attention similarity computation for the initial “prefilling” phase of a prompt to the CPU, while keeping the more frequent, smaller computations of the “decoding” phase on the GPU. This division minimizes overhead during model generation.
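One way to see why a token-wise estimate can be split across devices: if the similarity between two layers is an average over per-token scores, the prefill tokens and the decode tokens can be scored separately and merged afterwards. The sketch below illustrates only that decomposition. The class, the cosine measure, and the toy attention rows are assumptions, and the "CPU" and "GPU" accumulators are plain Python stand-ins rather than real device code.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SimilarityAccumulator:
    """Running sum of per-token similarities between two layers (hypothetical)."""
    total: float = 0.0
    tokens: int = 0

    def add_token(self, row_a, row_b):
        self.total += cosine(row_a, row_b)
        self.tokens += 1

    def merge(self, other):
        return SimilarityAccumulator(self.total + other.total,
                                     self.tokens + other.tokens)

    def mean(self):
        return self.total / self.tokens if self.tokens else 0.0


# Toy attention rows for two layers, one row per token (made-up numbers).
layer_a = [[1.0], [0.6, 0.4], [0.5, 0.3, 0.2], [0.4, 0.3, 0.2, 0.1]]
layer_b = [[1.0], [0.5, 0.5], [0.2, 0.3, 0.5], [0.4, 0.3, 0.2, 0.1]]
prefill_len = 3  # the first three tokens belong to the prompt

# Stand-in for the CPU side: the whole prefill batch is scored in one pass.
cpu_acc = SimilarityAccumulator()
for row_a, row_b in zip(layer_a[:prefill_len], layer_b[:prefill_len]):
    cpu_acc.add_token(row_a, row_b)

# Stand-in for the GPU side: each decoded token is scored as it is generated.
gpu_acc = SimilarityAccumulator()
for row_a, row_b in zip(layer_a[prefill_len:], layer_b[prefill_len:]):
    gpu_acc.add_token(row_a, row_b)

merged = cpu_acc.merge(gpu_acc)

# Reference: score every token in a single accumulator.
single = SimilarityAccumulator()
for row_a, row_b in zip(layer_a, layer_b):
    single.add_token(row_a, row_b)

print(merged.tokens, abs(merged.mean() - single.mean()) < 1e-12)
```

The merged estimate matches the single-pass one, so the large prefill batch can be handled off the GPU without changing the result, and only the small per-token updates stay on the generation path.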

Finally, a bubble-free restoration scheduler tackles the challenge of efficiently restoring compressed KV caches. Traditional restoration methods can suffer from “bubbles” or idle times when recomputing and loading data don’t perfectly align. Krul’s scheduler dynamically orchestrates these tasks, ensuring a smooth, overlapped pipeline that reduces potential delays caused by the imbalance between recomputing and loading compressed data.
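The balancing problem can be sketched as follows. This is a generic greedy heuristic, not Krul’s scheduler: the function name, the per-layer timings, and the assumption that layers can be restored independently (ignoring ordering dependencies between layers) are all illustrative. Each layer’s cache is either recomputed on the GPU or loaded from storage, and the two paths run in parallel, so the goal is for both to finish at about the same time.

```python
def schedule(items):
    """Hypothetical greedy balancer.

    items: list of (name, recompute_ms, load_ms).
    Returns (plan, gpu_busy_ms, io_busy_ms), where plan maps each name to
    "recompute" or "load".
    """
    gpu_busy = 0.0  # total time queued on the recompute path
    io_busy = 0.0   # total time queued on the load path
    plan = {}
    for name, recompute_ms, load_ms in items:
        # Put the item on whichever path would finish earlier after taking it.
        if gpu_busy + recompute_ms <= io_busy + load_ms:
            plan[name] = "recompute"
            gpu_busy += recompute_ms
        else:
            plan[name] = "load"
            io_busy += load_ms
    return plan, gpu_busy, io_busy


# Made-up timings: recomputing a layer takes 4 ms, loading it takes 6 ms.
layers = [(f"L{i}", 4, 6) for i in range(5)]
plan, gpu_busy, io_busy = schedule(layers)

print(plan)
print("restore time:", max(gpu_busy, io_busy), "ms")
print("idle gap (bubble):", abs(gpu_busy - io_busy), "ms")
```

In this toy case both paths end up with 12 ms of work, so neither sits idle waiting for the other. Loading everything would take 30 ms and recomputing everything would take 20 ms, which shows why overlapping the two paths matters.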

Empirical evaluations of Krul on real-world tasks show strong results. Compared with current state-of-the-art methods, it reduces Time-to-First-Token (TTFT) by a factor of 1.5 to 2.68 and KV cache storage by a factor of 1.33 to 2.35. Krul achieves these gains while keeping generation quality close to the baseline, with an average accuracy loss of less than 1%.
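The reported factors can also be read as percentage reductions. The short conversion below assumes each factor is the ratio of the baseline value to Krul’s value, which is an interpretation of the figures above rather than something stated in the summary; the helper name is hypothetical.

```python
def reduction_from_factor(factor):
    """Fractional reduction implied by an 'N times smaller/faster' factor."""
    return 1.0 - 1.0 / factor


reported = [
    ("TTFT, low end", 1.5),
    ("TTFT, high end", 2.68),
    ("KV cache storage, low end", 1.33),
    ("KV cache storage, high end", 2.35),
]

for label, factor in reported:
    print(f"{label}: {factor}x -> {reduction_from_factor(factor):.1%} lower")
```

Under that assumption, TTFT drops by roughly 33% to 63% and KV cache storage by roughly 25% to 57%.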

This innovative system represents a substantial step forward in making multi-turn conversations with LLMs more responsive and resource-efficient, paving the way for more seamless and cost-effective real-time AI applications. You can find more details about this research in the full paper: Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing.
