TLDR: A new method called UDS (Utility-Diversity Sampling) improves large language model (LLM) fine-tuning by selecting the most valuable and diverse data samples during training. Unlike previous methods, UDS considers both how useful a sample is for learning and how diverse it is (both internally and compared to other samples), without needing external resources such as reference models or validation sets. In experiments, UDS achieves the highest accuracy among the online batch selection methods tested and can train faster than fine-tuning on the full dataset, making LLM fine-tuning more efficient and effective.
Supervised fine-tuning (SFT) is a critical process for adapting large language models (LLMs) to perform specific tasks effectively. However, training these powerful models on vast datasets comes with significant challenges. It can be computationally expensive, demanding substantial resources, and sometimes leads to issues like overfitting or the amplification of biases present in the training data. These hurdles have spurred the development of data curation techniques, where the focus shifts to identifying and prioritizing only the most valuable data for training.
This new research delves into a specific area of data curation known as online batch selection. This approach dynamically evaluates and filters data samples as the training progresses, allowing the model to make real-time decisions about which data points are most beneficial for its current learning state. While promising, existing online batch selection methods often face several limitations. Many tend to focus solely on the ‘utility’ of data – for instance, selecting samples that generate high loss or large gradients – without adequately considering ‘diversity,’ both within individual data samples and across the entire batch. Furthermore, some methods rely on external resources like reference models or separate validation sets, which can be impractical or unavailable in real-world scenarios. Critically, some even introduce additional computational overhead, making the training process slower than using the full dataset.
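To make the idea of online batch selection concrete, here is a minimal sketch of a utility-only heuristic in the spirit of ‘MaxLoss’: score each candidate by its current loss and keep the top k. The function name `select_top_k`, the toy losses, and the tuple layout are illustrative assumptions, not code from the paper.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_k(candidates: Sequence[T], score_fn: Callable[[T], float], k: int) -> List[T]:
    """Keep the k candidates with the highest score (ties keep input order)."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return list(ranked[:k])


# Toy candidate batch: (sample_id, per-sample loss from a forward pass).
# The loss values are made up for illustration.
candidate_batch = [("a", 0.2), ("b", 1.7), ("c", 0.9), ("d", 1.1)]

# MaxLoss-style heuristic: the score is simply the current loss.
selected = select_top_k(candidate_batch, score_fn=lambda s: s[1], k=2)
selected_ids = [sample_id for sample_id, _ in selected]
print(selected_ids)
```

Only the selected samples would go on to the training update. Note that nothing in this heuristic stops the two chosen samples from being near-duplicates, which is the gap that diversity-aware selection targets.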
To address these shortcomings, researchers have introduced a novel framework called UDS, or Utility-Diversity Sampling. UDS is designed to be an efficient and comprehensive solution for online batch selection in SFT, aiming to improve both the effectiveness and efficiency of LLM training.
How UDS Works: A Dual Approach to Data Selection
UDS operates on two complementary principles to score and select data samples:
1. Intra-sample Importance (Utility and Diversity within a Sample): UDS leverages the ‘nuclear norm’ (the sum of a matrix’s singular values) of the logits matrix, which holds the raw output scores from the LLM before they are converted into probabilities. This nuclear norm serves a dual purpose:
- Optimization Utility: A higher nuclear norm indicates that a sample has a greater potential to reduce the training loss, meaning it’s particularly informative for guiding the model’s learning.
- Intra-sample Diversity: It also reflects the richness and variety of information within a single training example. A larger nuclear norm suggests that the model is predicting a diverse range of vocabulary tokens, rather than repetitive or low-information outputs, thus capturing more semantic content.
2. Inter-sample Importance (Diversity Across Samples): To prevent the model from repeatedly training on similar or redundant content, UDS estimates ‘inter-sample diversity.’ It maintains a lightweight memory buffer holding compressed representations of recently selected samples. When a new candidate sample arrives, its low-dimensional embedding is compared against those in the buffer. A greater average distance from the historical samples indicates higher inter-sample diversity, suggesting the new sample offers unique information not recently encountered.
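The two scores described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: the logits are taken to be a (sequence length, vocabulary size) matrix, the low-dimensional embeddings are assumed to be given, and Euclidean distance, the FIFO buffer, and the helper names are hypothetical choices.

```python
from collections import deque

import numpy as np


def nuclear_norm(logits: np.ndarray) -> float:
    """Sum of the singular values of a (sequence_length, vocab_size) matrix."""
    singular_values = np.linalg.svd(logits, compute_uv=False)
    return float(singular_values.sum())


def mean_distance_to_buffer(embedding: np.ndarray, buffer) -> float:
    """Average Euclidean distance from a candidate to recently selected samples."""
    if not buffer:
        return 0.0  # assumption: with no history there is no diversity signal
    return float(np.mean([np.linalg.norm(embedding - past) for past in buffer]))


# Intra-sample score: two toy logits matrices with the same Frobenius norm.
# Repetitive: every position produces the same row, so the matrix has rank 1.
repetitive = np.tile(np.array([2.0, 0.0, 0.0, 0.0]), (4, 1))
# Varied: each position favours a different token, so the matrix has rank 4.
varied = 2.0 * np.eye(4)
repetitive_score = nuclear_norm(repetitive)  # 4.0
varied_score = nuclear_norm(varied)          # 8.0

# Inter-sample score: a small FIFO buffer of embeddings of past selections.
buffer = deque(maxlen=3)
buffer.append(np.array([0.0, 0.0]))
buffer.append(np.array([0.0, 2.0]))
near = mean_distance_to_buffer(np.array([0.0, 1.0]), buffer)  # 1.0
far = mean_distance_to_buffer(np.array([3.0, 4.0]), buffer)   # larger than near
```

The toy matrices show why the nuclear norm can act as a diversity signal: both have the same overall magnitude, but the one whose rows point in different directions scores twice as high.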
This design needs no external resources and avoids a costly backpropagation step for every candidate sample, which keeps the selection overhead low. By combining these two scores (intra-sample utility/diversity and inter-sample diversity), UDS creates a comprehensive importance score for each sample. The top-K samples with the highest combined scores are then selected for the actual training update.
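As a rough sketch of this final step, the snippet below combines two per-sample scores and keeps the top K. The article does not say how the scores are combined, so the min-max normalization, the weight `alpha`, and the function names are assumptions made purely for illustration.

```python
import numpy as np


def min_max(values) -> np.ndarray:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def select_top_k(utility, diversity, k: int, alpha: float = 0.5):
    """Return indices of the k samples with the highest combined score.

    alpha is a hypothetical weight: 1.0 uses only the intra-sample score,
    0.0 uses only the inter-sample diversity score.
    """
    combined = alpha * min_max(utility) + (1.0 - alpha) * min_max(diversity)
    order = np.argsort(-combined, kind="stable")  # descending, ties keep order
    return order[:k].tolist(), combined


# Toy scores for four candidates (made-up numbers).
utility = [4.0, 8.0, 6.0, 5.0]    # e.g. nuclear norms of the logits
diversity = [1.0, 0.5, 3.0, 2.5]  # e.g. mean distances to the buffer

selected, combined = select_top_k(utility, diversity, k=2)
print(selected)
```

In this toy batch, candidate 1 has the highest utility but the lowest diversity, so it narrowly loses its place to candidate 3. A utility-only heuristic would have kept it.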
Impressive Results and Efficiency Gains
Extensive experiments were conducted across various benchmarks, including MMLU (general knowledge), ScienceQA (scientific question answering), GSM8K (mathematical reasoning), and HumanEval (code generation). UDS was tested with popular LLM backbones like Llama-3.1-8B and Qwen-2.5-7B, consistently demonstrating superior performance.
UDS not only achieved the highest accuracy among all tested online batch selection methods but also proved to be more efficient than training on the full dataset. For instance, with Qwen-2.5-7B, UDS achieved higher throughput (samples processed per second) on MMLU and HumanEval compared to full-dataset training, all while yielding better accuracy. Simple heuristics like ‘MaxLoss’ were fast but offered minimal accuracy gains, while ‘MaxGrad’ significantly slowed down training without substantial benefits. Even state-of-the-art methods like ‘GREATS,’ while accurate, were consistently slower than UDS.
Ablation studies further confirmed that both the nuclear norm component (for utility and intra-sample diversity) and the diversity distance component (for inter-sample diversity) are crucial for UDS’s strong performance. The framework also showed robust performance across different data selection budgets, often achieving peak accuracy with a smaller, carefully curated subset of data, even surpassing the performance of full-dataset fine-tuning.
In conclusion, UDS presents a compelling and practical solution for enhancing the supervised fine-tuning of large language models. By balancing data utility and diversity, it offers a path to more efficient and effective LLM training. For more technical details, refer to the full research paper.


