TLDR: KITE is a novel, information theory-driven framework for selecting optimal examples for In-Context Learning (ICL) in large language models. It addresses limitations of previous methods by modeling LLMs as linear functions, framing example selection as a query-specific optimization problem, and leveraging an approximately submodular objective. KITE enhances this by incorporating the kernel trick to handle non-linear relationships and an optimal design-based regularizer to encourage diversity among selected examples. Empirically, KITE consistently outperforms existing baselines across various classification tasks and LLMs, demonstrating significant improvements in performance.
In-context learning (ICL) has become a powerful method for adapting large language models (LLMs) to new tasks, especially when data is scarce. This approach involves providing the LLM with a few carefully chosen examples directly within the prompt. However, a critical challenge arises due to the limited context size of LLMs: how do we select the most effective examples to maximize performance for a given user query?
Traditional nearest-neighbor-based methods such as KATE often fall short in high-dimensional embedding spaces. They can suffer from poor generalization and a lack of diversity among the selected examples. To address these limitations, a new research paper introduces KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning.
KITE tackles the example selection problem from a principled, information theory-driven perspective. The researchers model an LLM as a linear function over input embeddings and frame the example selection as an optimization problem. The goal is to choose a subset of examples from a larger bank that minimizes the prediction error for a specific query. This differs from traditional approaches that focus on generalization across a distribution of test points; KITE targets accurate prediction for a single, specific query instance.
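To make the query-specific framing concrete, here is a minimal sketch. It is not the paper's exact objective, which this article does not spell out. It assumes a ridge-regression view of the linear model, in which the uncertainty of a prediction at a query embedding is proportional to x_q^T (X_S^T X_S + reg * I)^(-1) x_q, where X_S holds the embeddings of the selected examples. The function name `query_variance` and the parameter `reg` are illustrative choices, not names from the paper.

```python
import numpy as np

def query_variance(X_sel: np.ndarray, x_query: np.ndarray, reg: float = 1.0) -> float:
    """Ridge-regression predictive variance at x_query, up to the noise scale.

    X_sel   : (m, d) array of embeddings of the selected examples (m may be 0).
    x_query : (d,) embedding of the query.
    reg     : ridge regularization strength (an assumed hyperparameter).
    """
    d = x_query.shape[0]
    # Regularized information matrix built from the selected examples only.
    A = X_sel.T @ X_sel + reg * np.eye(d)
    # Solve A z = x_query instead of forming the inverse explicitly.
    return float(x_query @ np.linalg.solve(A, x_query))

x_q = np.array([1.0, 0.0])
no_examples = query_variance(np.empty((0, 2)), x_q)
aligned = query_variance(np.array([[1.0, 0.0]]), x_q)
orthogonal = query_variance(np.array([[0.0, 1.0]]), x_q)
```

In this toy case an example aligned with the query halves the value (from 1.0 to 0.5), while an orthogonal example leaves it unchanged at 1.0. That is the intuition behind selecting examples for one specific query rather than for a whole test distribution.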
The framework derives a surrogate objective that is approximately submodular, which allows for the use of a greedy algorithm with a strong approximation guarantee. KITE further enhances its method through two key innovations:
Kernel Trick for Non-Linearity
First, KITE incorporates the well-known kernel trick. This allows the method to operate effectively in high-dimensional feature spaces without needing to explicitly map data into those spaces. Instead, it computes inner products via kernels, enabling the model to capture complex, non-linear relationships between data points. This is crucial because real-world data often exhibits intricate patterns that linear models cannot fully capture.
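The sketch below illustrates the kernel trick in this setting, under the same ridge-regression assumption used for illustration only. The uncertainty at the query is computed purely from kernel evaluations, k(q, q) - k_q^T (K + reg * I)^(-1) k_q, without ever building the feature map. The RBF kernel and the names `rbf_kernel`, `linear_kernel`, and `kernel_query_variance` are assumptions chosen for the example; the article does not say which kernels the paper uses.

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def linear_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Plain inner-product kernel, which recovers the linear model."""
    return A @ B.T

def kernel_query_variance(X_sel, x_query, kernel, reg: float = 1.0) -> float:
    """Kernelized predictive variance at x_query given the selected examples."""
    q = x_query[None, :]
    k_qq = kernel(q, q)[0, 0]
    if X_sel.shape[0] == 0:
        return float(k_qq)
    K = kernel(X_sel, X_sel)        # (m, m) Gram matrix of the selected examples
    k_q = kernel(X_sel, q)[:, 0]    # (m,) similarities between examples and the query
    return float(k_qq - k_q @ np.linalg.solve(K + reg * np.eye(len(K)), k_q))

X = np.array([[1.0, 0.0], [1.0, 1.0]])
q = np.array([1.0, 0.5])
var_linear = kernel_query_variance(X, q, linear_kernel, reg=0.5)
var_rbf = kernel_query_variance(X, q, rbf_kernel, reg=0.5)
```

With the linear kernel, this quantity equals `reg` times the explicit-feature form x_q^T (X^T X + reg * I)^(-1) x_q, by the Woodbury identity. Swapping in a non-linear kernel changes only the `kernel` argument.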
Optimal Design for Diversity
Second, KITE introduces an optimal design-based regularizer to actively encourage diversity among the selected examples. Inspired by maximum information gain theory, this component ensures that the chosen examples are not only relevant to the query but also sufficiently varied. Promoting diversity is vital for improving the generalizability of the model and enhancing the quality of LLM responses, especially in scenarios where many examples might be semantically similar and lead to redundancy.
The combined objective in KITE balances both relevance (how similar an example is to the input query) and diversity (how varied the selected examples are from each other). The algorithm, called LITE (Linear Information Theoretic Exemplars) when using a linear kernel, efficiently selects examples by maximizing this combined score at each step.
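The following is a rough sketch of how such a combined score could drive a step-by-step selection. It is an illustration under stated assumptions, not the paper's scoring rule. Relevance is taken to be the kernel similarity to the query, diversity is taken to be the gain in log det(I + K_S) (a common information-gain quantity), and `lam` plays the role of the diversity weight. The function name `select_examples` and the default linear kernel are hypothetical choices.

```python
import numpy as np

def select_examples(X_bank, x_query, k, lam=1.0, kernel=None):
    """Greedily pick k example indices, trading off relevance and diversity.

    score(i | S) = kernel(x_i, x_query) + lam * [logdet(I + K_{S+i}) - logdet(I + K_S)]
    """
    if kernel is None:
        kernel = lambda A, B: A @ B.T            # linear kernel by default
    K = kernel(X_bank, X_bank)                   # pairwise similarities in the bank
    relevance = kernel(X_bank, x_query[None, :])[:, 0]

    def info_gain(idx):
        # log det(I + K_S); defined as 0 for the empty set.
        if not idx:
            return 0.0
        sub = K[np.ix_(idx, idx)]
        return np.linalg.slogdet(np.eye(len(idx)) + sub)[1]

    selected = []
    for _ in range(min(k, len(X_bank))):
        base = info_gain(selected)
        best, best_score = None, -np.inf
        for i in range(len(X_bank)):
            if i in selected:
                continue
            score = relevance[i] + lam * (info_gain(selected + [i]) - base)
            if score > best_score:               # ties keep the earlier index
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy bank: rows 0 and 1 are duplicates; row 2 is less similar to the query but different.
bank = np.array([[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]])
query = np.array([1.0, 0.2])
relevance_only = select_examples(bank, query, k=2, lam=0.0)
with_diversity = select_examples(bank, query, k=2, lam=5.0)
```

In the toy bank, setting `lam=0.0` selects the two duplicate rows, whereas `lam=5.0` replaces the redundant duplicate with the third, more distinct example.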
Empirical Validation
The researchers conducted extensive experiments across multiple classification datasets, including SST-5, CMSQA, MRPC, QNLI, and HellaSwag, using LLMs such as GPT-Neo-2.7B, Qwen 2.5-1.5B, and Llama-3.2-3B. KITE consistently outperformed retrieval baselines including random selection, BM25, dense-embedding retrieval, and DPP-based retrieval. In particular, it surpassed the strongest baseline, DPP, by notable margins on several datasets.
Ablation studies confirmed that the choice of kernel function is a critical hyperparameter, with no single kernel being universally optimal, highlighting the importance of the kernel trick for capturing non-linear relationships. The studies also demonstrated that incorporating diversity (controlled by a parameter λ) is crucial, especially for large and varied example banks, and that KITE maintains its superior performance even in low-resource settings with fewer in-context examples.
The empirical validation of the objective function’s approximate submodularity further justifies the use of a greedy algorithm, ensuring near-optimal results in practice.
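For readers unfamiliar with the greedy procedure mentioned here, the sketch below shows the generic pattern: repeatedly add the candidate with the largest marginal gain. For monotone submodular objectives under a size limit, this classic strategy is known to achieve at least a (1 - 1/e) fraction of the optimum; for objectives that are only approximately submodular, the guarantee weakens accordingly. The helper `greedy_select` and the toy coverage objective are illustrative assumptions, not code from the paper.

```python
def greedy_select(candidates, objective, k):
    """Pick up to k candidates, each time adding the one with the largest marginal gain.

    objective : function mapping a list of chosen candidates to a number.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(selected)
        # Marginal gain of adding candidate c to the current selection.
        best = max(remaining, key=lambda c: objective(selected + [c]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy submodular objective: number of distinct items covered by the chosen sets.
sets = {"a": {1, 2, 3}, "b": {3, 4, 5}, "c": {1, 2}, "d": {6}}

def coverage(chosen):
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)

picked = greedy_select(sets.keys(), coverage, k=2)
```

Here the first pick is "a" (covers three items) and the second is "b" (adds two new items), while "c" is never chosen because it adds nothing once "a" is selected.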
In conclusion, KITE offers a robust and effective framework for in-context example selection by combining a principled, information-theoretic approach with kernel methods and diversity regularization. Its consistent outperformance across various benchmarks underscores its potential to significantly enhance the efficacy of in-context learning for LLMs. Future work aims to extend KITE to generative tasks.


