TLDR: MoSKA (Mixture of Shared KV Attention) is a novel architecture designed to overcome the performance bottlenecks of Key-Value (KV) cache in long-sequence Large Language Models (LLMs). It differentiates between unique and shared context data, transforming memory-bound operations on shared data into compute-bound operations through batched ‘Shared KV Attention’. Combined with MoE-inspired sparse attention for efficient routing and a disaggregated infrastructure with specialized hardware, MoSKA achieves up to 538.7 times higher throughput compared to baselines, offering a scalable solution for LLM inference.
Large Language Models (LLMs) are becoming increasingly powerful, handling longer and more complex conversations. However, this progress comes with a significant challenge: managing the ‘Key-Value (KV) cache’ during inference. This cache stores the keys and values the LLM has already computed for earlier parts of a conversation or document, so they do not have to be recomputed for every new token. As context lengths grow, the KV cache consumes more memory and, more critically, more memory bandwidth. The GPU ends up waiting on data rather than computing, which leaves it underutilized and slows inference.
A new architecture called Mixture of Shared KV Attention (MoSKA) has been introduced to tackle this problem. MoSKA’s core idea is to recognize that not all data in an LLM’s context is the same. It distinguishes between ‘unique’ data, which is specific to a single request, and ‘shared’ data, which can be reused across many requests, like common system prompts or domain-specific documents.
The Innovation: Shared KV Attention
The key innovation in MoSKA is its ‘Shared KV Attention’ mechanism. Traditionally, even when multiple requests access the same shared data, each request processes it individually, leading to many small, memory-bound matrix-vector multiplications (GEMV). MoSKA instead batches the concurrent requests that access identical shared data, turning those many small operations into a single large, compute-bound matrix-matrix multiplication (GEMM). Because the shared keys and values are read from memory once and reused by every query in the batch, the bottleneck moves from memory bandwidth to computation, significantly boosting GPU utilization and overall system speed.
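To make the GEMV-to-GEMM shift concrete, here is a minimal sketch in plain Python. It is not MoSKA's implementation: the function names are hypothetical, only the query-key score step is modeled (softmax, the value multiply, and output writes are left out), and the byte count assumes 2-byte elements. It shows that batching returns the same scores as per-request processing while reading the shared keys only once.

```python
def gemv_scores(query, keys):
    """Scores for one request: a full pass over every shared key row (GEMV)."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


def gemm_scores(queries, keys):
    """Scores for a batch of requests: each key row is fetched once and reused (GEMM)."""
    scores = [[0] * len(keys) for _ in queries]
    for j, key in enumerate(keys):            # key row loaded once...
        for i, query in enumerate(queries):   # ...then reused by every query in the batch
            scores[i][j] = sum(q * k for q, k in zip(query, key))
    return scores


def arithmetic_intensity(batch, n_tokens, head_dim, bytes_per_elem=2, shared=True):
    """FLOPs per byte moved for the score computation (simplified roofline-style estimate).

    Assumption: with shared=True the keys are read once for the whole batch;
    with shared=False every request re-reads them.
    """
    flops = 2 * batch * n_tokens * head_dim          # one multiply-add per (query, key, dim)
    key_reads = 1 if shared else batch
    bytes_moved = bytes_per_elem * (key_reads * n_tokens * head_dim + batch * head_dim)
    return flops / bytes_moved


if __name__ == "__main__":
    queries = [[1, 2], [3, 4]]
    keys = [[1, 0], [0, 1], [1, 1]]
    print(gemm_scores(queries, keys) == [gemv_scores(q, keys) for q in queries])
    print(arithmetic_intensity(64, 1000, 128, shared=False))
    print(arithmetic_intensity(64, 1000, 128, shared=True))
```

Under these assumptions, 64 requests over a 1,000-token shared context go from roughly 1 FLOP per byte when processed separately to roughly 60 when batched, which is the sense in which the work becomes compute-bound.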
Smart Routing with Sparse Attention
Even with Shared KV Attention, processing a massive shared KV cache (potentially millions of tokens long) would still be too demanding. To manage this, MoSKA incorporates a smart routing layer, inspired by Mixture-of-Experts (MoE) models. It divides the vast shared KV space into smaller, manageable ‘chunks’ or ‘experts’. When a query comes in, a lightweight routing mechanism quickly identifies and selects only the most relevant chunks. This ‘sparse attention’ approach drastically reduces the amount of data the LLM needs to consider, making the process computationally efficient while still benefiting from the batched Shared KV Attention on the selected chunks.
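The routing step can be sketched as follows. The article does not say how MoSKA scores chunks, so this example makes its own assumptions: each chunk is summarized by the mean of its key vectors, relevance is a dot product with the query, and the top k chunks are kept. All function names are hypothetical. The last helper groups queries by selected chunk, which is what lets each chunk still be processed as one batch.

```python
def chunk_summaries(keys, chunk_size):
    """Summarize each chunk by the mean of its key vectors (assumed scoring scheme)."""
    summaries = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        dim = len(chunk[0])
        summaries.append([sum(row[d] for row in chunk) / len(chunk) for d in range(dim)])
    return summaries


def route_top_k(query, summaries, k):
    """Return the indices of the k chunks whose summaries best match the query."""
    scores = [sum(q * s for q, s in zip(query, summary)) for summary in summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def group_by_chunk(queries, summaries, k):
    """Map each selected chunk to the queries routed to it, so each chunk runs as one batch."""
    groups = {}
    for query_index, query in enumerate(queries):
        for chunk_index in route_top_k(query, summaries, k):
            groups.setdefault(chunk_index, []).append(query_index)
    return groups


if __name__ == "__main__":
    shared_keys = [[1, 0], [1, 0], [0, 1], [0, 1]]   # two chunks of two tokens each
    summaries = chunk_summaries(shared_keys, chunk_size=2)
    queries = [[1, 0], [0, 1], [2, 0]]
    print(group_by_chunk(queries, summaries, k=1))
```

In this toy example the first and third queries are routed to chunk 0 and the second to chunk 1, so each chunk is read once for all the queries that selected it.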
Specialized Hardware: Disaggregated Infrastructure
To fully leverage these innovations, MoSKA proposes a ‘Disaggregated Infrastructure’. This means separating the hardware into specialized nodes. ‘Unique KV Nodes’ are optimized for the latency-sensitive, memory-bound operations on unique data. They are designed to hide memory latency by co-locating other computations. In contrast, ‘Shared KV Nodes’ are built for the throughput-oriented, compute-bound Shared KV Attention tasks. These nodes are equipped with powerful compute units to efficiently process large batches of shared data. This specialization allows resources to be scaled independently, ensuring optimal performance for both types of data.
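Splitting attention across two kinds of nodes raises the question of how their results are put back together. The article does not describe MoSKA's mechanism, so the sketch below shows one standard option as an assumption: each node returns its partial attention output together with the log-sum-exp of its scores, and the two are merged into exactly the result full attention over both segments would give. Function names are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def partial_attention(query, keys, values):
    """Softmax attention over one KV segment; returns (output, log-sum-exp of scores)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                                   # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return output, peak + math.log(total)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial results, weighting each by its share of the softmax mass."""
    peak = max(lse_a, lse_b)
    weight_a = math.exp(lse_a - peak)
    weight_b = math.exp(lse_b - peak)
    return [(weight_a * a + weight_b * b) / (weight_a + weight_b)
            for a, b in zip(out_a, out_b)]


if __name__ == "__main__":
    query = [1.0, 0.5]
    unique_k, unique_v = [[0.2, 0.1], [0.4, 0.3]], [[1.0, 0.0], [0.0, 1.0]]
    shared_k, shared_v = [[0.9, 0.8], [0.1, 0.7]], [[2.0, 2.0], [3.0, 1.0]]

    out_u, lse_u = partial_attention(query, unique_k, unique_v)   # would run on a Unique KV Node
    out_s, lse_s = partial_attention(query, shared_k, shared_v)   # would run on a Shared KV Node
    merged = merge_partials(out_u, lse_u, out_s, lse_s)

    full, _ = partial_attention(query, unique_k + shared_k, unique_v + shared_v)
    print(all(abs(m - f) < 1e-9 for m, f in zip(merged, full)))
```

Because the merge only needs a small output vector and one scalar per request from each node, the two node types can work independently and exchange very little data.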
The Future: Universal MoSKA
The long-term vision for this architecture is ‘Universal MoSKA’. This concept relies on advancements in ‘Position-Independent KV Caching’, which would allow KV chunks to be completely detached from their original context. This would enable a distributed network of nodes, each hosting different domain-specific knowledge. A complex user query could then dynamically pull and compose relevant knowledge chunks from this universal library on demand, leading to highly flexible and powerful AI systems.
Performance Highlights
Evaluations show that MoSKA delivers large performance gains. In workloads with high context sharing, it achieved up to 538.7 times the throughput of existing baselines such as FlashAttention, SGLang, and ChunkAttention. The gain comes from combining Shared KV Attention with MoE-inspired sparse attention, which together address both the memory capacity and the memory bandwidth scaling issues in long-sequence LLM inference. For more technical details, see the full paper, ‘MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference’.


