TLDR: Judge Q is a training method that uses trainable “soft tokens” to improve Key-Value (KV) cache eviction in large language models (LLMs). By training only the embeddings of these soft tokens, Judge Q lets the eviction step account for global information during pre-filling, so more of the crucial key-value pairs are retained and performance degrades less than with existing methods under the same memory budget, at minimal training cost.
Large language models (LLMs) are powerful, but their efficiency is limited by the Key-Value (KV) cache. This cache stores the keys and values of every previously processed token, so its size grows in proportion to the input length, which increases memory usage and slows processing on long sequences. Current methods for shrinking the KV cache often decide what to keep based on local information, which can cause important global details to be discarded and ultimately lowers the quality of the model’s output.
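To give a sense of scale, here is a rough back-of-the-envelope estimate of KV cache size. The default values (32 layers, 8 key-value heads, head dimension 128, 16-bit storage) are assumptions chosen to resemble an 8B-parameter model with grouped-query attention; they are not taken from the paper, and the helper name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence.

    Each layer stores two tensors (keys and values) of shape
    [n_kv_heads, seq_len, head_dim]. All defaults are assumed values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


for n_tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(n_tokens) / 2**30
    print(f"{n_tokens:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a 131,072-token prompt needs about 16 GiB for the cache alone, which is why eviction under a fixed memory budget matters.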
To address this challenge, researchers have introduced a training method called Judge Q. It aims to improve which information an LLM retains in its KV cache when entries are evicted. Instead of relying solely on local information, Judge Q adds a “soft token list” that is trained specifically to capture crucial global information.
The core idea behind Judge Q is simple. It adds a small set of learnable “soft tokens” to the model’s vocabulary. During training, only the embeddings of these soft tokens are fine-tuned; the rest of the model’s weights stay frozen, which keeps training cost low. The soft tokens are trained so that their attention patterns match those of the actual decoded tokens, whose attention is a good indicator of which key-value pairs matter. As a result, the soft tokens learn to judge the importance of keys and values across the entire input sequence.
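The alignment objective can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: the function names, the averaging over queries, and the choice of KL divergence as the alignment loss are all assumptions made for the example. In real training, the gradient of such a loss would update only the soft-token embedding rows, with every other weight frozen.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mean_attention(queries, keys):
    """Average attention distribution over context keys for a group of queries."""
    d = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # [n_queries, n_ctx]
    return softmax(scores, axis=-1).mean(axis=0)    # [n_ctx]


def alignment_loss(soft_queries, decoded_queries, keys, eps=1e-12):
    """KL(target || soft). The specific divergence is an assumption."""
    target = mean_attention(decoded_queries, keys)  # what decoded tokens attend to
    pred = mean_attention(soft_queries, keys)       # what soft tokens attend to
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))


rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))        # toy context keys
decoded_q = rng.normal(size=(8, 16))    # toy queries of decoded (response) tokens
soft_q = rng.normal(size=(4, 16))       # toy queries of untrained soft tokens

loss_random = alignment_loss(soft_q, decoded_q, keys)
loss_matched = alignment_loss(decoded_q, decoded_q, keys)
print(loss_random, loss_matched)
```

The loss is close to zero when the soft tokens attend to the context the same way the decoded tokens do, and positive otherwise, which is the behaviour the training signal needs.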
During inference, the trained soft tokens are appended to the input sequence. Their queries are then used to compute importance scores for the key-value pairs in the KV cache. Based on these scores, the most important pairs are retained and the less critical ones are discarded. After this pruning step, the soft tokens are removed, and the model proceeds with decoding using the pruned KV cache. This helps preserve essential information and maintain decoding quality even with a reduced cache size.
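The inference-time pruning step described above might look like the sketch below. It is a simplified single-head version under stated assumptions: the function name `evict_kv` is hypothetical, scores are averaged across soft-token queries, and real systems would apply this per layer and per head, possibly with extra smoothing that is not shown here.

```python
import numpy as np


def evict_kv(keys, values, soft_queries, budget):
    """Keep the `budget` most important key-value pairs, scored by soft-token queries.

    keys, values: [n_ctx, d]; soft_queries: [n_soft, d].
    Returns the kept keys, kept values, and their original positions.
    """
    d = keys.shape[-1]
    scores = soft_queries @ keys.T / np.sqrt(d)           # [n_soft, n_ctx]
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over context
    importance = attn.mean(axis=0)                        # one score per cached pair
    keep = np.sort(np.argsort(importance)[-budget:])      # top-`budget`, original order
    return keys[keep], values[keep], keep


# Toy cache: positions 3 and 7 are the ones the soft-token queries attend to.
keys = np.zeros((10, 4))
keys[3] = [4.0, 0.0, 0.0, 0.0]
keys[7] = [0.0, 4.0, 0.0, 0.0]
values = np.arange(40, dtype=float).reshape(10, 4)
soft_queries = np.array([[4.0, 0.0, 0.0, 0.0],
                         [0.0, 4.0, 0.0, 0.0]])

kept_k, kept_v, kept_idx = evict_kv(keys, values, soft_queries, budget=2)
print(kept_idx)
```

Sorting the kept indices preserves the original token order in the pruned cache; the soft tokens themselves never enter the cache that decoding uses.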
Experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using the LongBench, RULER, and Needle-in-a-Haystack benchmarks, support this. Judge Q consistently shows less performance degradation than existing eviction methods under the same memory budget. For instance, it showed an improvement of approximately 1 point on LongBench and over 3 points on RULER. The method also significantly outperforms baselines in retrieval scenarios like Needle-in-a-Haystack.
One of the key advantages of Judge Q is a higher “critical key-value hit rate,” meaning it is better at retaining the key-value pairs that matter most for accurate decoding. This translates directly into better quality of generated content, even in challenging scenarios where the question is not at the end of the prompt or in long-output text continuation tasks. The research also explored the impact of training data quality, finding that model-generated responses and comprehensive data content lead to better performance. The number of soft tokens also plays a role, with 32 tokens found to strike a good balance between training cost and generalization.
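One plausible way to compute such a hit rate is shown below. The paper's exact definition is not given here, so this sketch assumes the "critical" set is a list of cache positions identified some other way (for example, those receiving the most attention during full-cache decoding), and the function name `critical_hit_rate` is hypothetical.

```python
def critical_hit_rate(kept_indices, critical_indices):
    """Fraction of critical cache positions that survived eviction."""
    critical = set(critical_indices)
    if not critical:
        raise ValueError("critical_indices must not be empty")
    return len(critical & set(kept_indices)) / len(critical)


# Positions 3 and 7 were kept, position 8 was evicted: 2 of 3 critical pairs survive.
hit = critical_hit_rate(kept_indices=[0, 3, 7, 9], critical_indices=[3, 7, 8])
print(f"hit rate: {hit:.3f}")
```

A hit rate of 1.0 would mean every critical pair was retained, so comparing this number across eviction methods at the same budget gives a direct view of how much important information each one keeps.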
This methodology can be seamlessly integrated into existing open-source models with minimal training overhead, offering a practical solution to enhance performance in KV cache eviction scenarios. For more technical details, you can refer to the full research paper: Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction.


