TLDR: ADMIRE-BayesOpt is a new framework that uses Bayesian Optimization to efficiently find the best data mixtures for training large language models. It treats data mixture selection as a black-box optimization problem, significantly speeding up the process and improving performance compared to existing methods, especially when using its multi-fidelity variant that intelligently leverages smaller, cheaper models. The research also introduces a large dataset of training runs to facilitate further research.
Training large language models (LLMs) effectively hinges on selecting the right mix of training data. This seemingly straightforward task is, in reality, a complex challenge that significantly impacts a model’s final performance. Traditionally, developers have relied on trial-and-error or heuristic methods, which are often inefficient and can lead to suboptimal results.
A new research paper introduces ADMIRE-BayesOpt, a novel framework that tackles this problem by treating data mixture selection as a ‘black-box’ hyperparameter optimization challenge. This approach leverages Bayesian Optimization, a powerful class of algorithms well-suited for optimizing expensive, complex functions without needing to know their internal workings.
How ADMIRE-BayesOpt Works
The core idea behind ADMIRE-BayesOpt is to view data mixture learning as a sequential decision-making process. Instead of exhaustively trying every possible data combination, the system intelligently decides which data mixture to experiment with next. It does this by building a predictive model of how different data mixtures affect model performance, along with an estimate of the uncertainty in those predictions. This allows it to balance exploration (trying new, uncertain mixtures) with exploitation (focusing on mixtures predicted to perform well).
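The loop described above can be sketched in a few dozen lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the surrogate is a small hand-written Gaussian process with a fixed RBF kernel, the next mixture is picked by an upper-confidence-bound rule (mean plus `beta` times standard deviation), and `toy_score` is a made-up stand-in for "train a model on this mixture and evaluate it". All function names and parameter values are hypothetical.

```python
import math
import random

def rbf(a, b, length=0.3):
    """Squared-exponential similarity between two mixture-weight vectors."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular L with L @ L.T == A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i, row in enumerate(L):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x

def solve_upper(L, b):
    """Back substitution: solve L.T x = b."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(X, y, noise=1e-4):
    """Fit a zero-mean GP to standardized scores; returns what predict() needs."""
    mu = sum(y) / len(y)
    sd = math.sqrt(sum((v - mu) ** 2 for v in y) / len(y)) or 1.0
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, [(v - mu) / sd for v in y]))
    return L, alpha, mu, sd

def predict(X, model, x_new):
    """Posterior mean and standard deviation of the score at x_new."""
    L, alpha, mu, sd = model
    k = [rbf(a, x_new) for a in X]
    mean = mu + sd * sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    var = max(1.0 - sum(vi * vi for vi in v), 0.0)
    return mean, sd * math.sqrt(var)

def random_mixture(rng, n_domains):
    """Uniform sample from the simplex: weights are non-negative and sum to 1."""
    w = [rng.expovariate(1.0) for _ in range(n_domains)]
    total = sum(w)
    return tuple(v / total for v in w)

def bayes_opt(objective, n_domains=3, n_init=3, n_iter=10, n_cand=200,
              beta=2.0, seed=0):
    """Sequentially pick mixtures by UCB; returns (best mixture, best score, all scores)."""
    rng = random.Random(seed)
    X = [random_mixture(rng, n_domains) for _ in range(n_init)]
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        model = fit_gp(X, y)  # refit the surrogate on everything seen so far
        cands = [random_mixture(rng, n_domains) for _ in range(n_cand)]

        def ucb(c):
            mean, std = predict(X, model, c)
            return mean + beta * std  # exploitation term + exploration term

        x_next = max(cands, key=ucb)
        X.append(x_next)
        y.append(objective(x_next))  # the expensive "training run"
    best = max(range(len(y)), key=lambda i: y[i])
    return X[best], y[best], y

def toy_score(w, target=(0.6, 0.3, 0.1)):
    """Hypothetical objective: higher is better, peaks at the target mixture."""
    return -sum((a - b) ** 2 for a, b in zip(w, target))

if __name__ == "__main__":
    best_w, best_score, history = bayes_opt(toy_score)
    print("best mixture:", [round(v, 3) for v in best_w], "score:", round(best_score, 4))
```

A larger `beta` favors exploration of uncertain mixtures, while a smaller one concentrates runs near the current best prediction. In practice the objective call would be a full training-and-evaluation run, which is why the number of iterations matters so much.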
A key innovation is the use of Multi-fidelity Bayesian Optimization (MFBO). LLM training is computationally expensive, especially for very large models. MFBO addresses this by allowing the system to learn from experiments conducted at different scales – for instance, training smaller, cheaper proxy models first, and then transferring that knowledge to larger, more expensive models. This strategy helps to find a suitable trade-off between the computational cost of training exploratory models and achieving high performance in the final, large model. It minimizes the number of costly high-fidelity experiments while avoiding the risk of overfitting to insights gained only from small-scale tests.
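One simple way to express the cost trade-off in code is to divide an acquisition score by the cost of the run it would trigger. The sketch below uses expected improvement per unit cost, which is a common heuristic and an assumption here, not necessarily the acquisition function used in the paper. The candidate list, the `costs` table, and every number in it are made-up placeholders, and the sketch ignores how well small-model results correlate with large-model results, which a real multi-fidelity surrogate would have to model.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """Expected amount by which a run beats best_so_far, given a Gaussian prediction."""
    if std <= 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf

def pick_next_run(candidates, costs, best_so_far):
    """Choose the (mixture, fidelity) candidate with the highest improvement per unit cost."""
    def score(c):
        ei = expected_improvement(c["mean"], c["std"], best_so_far)
        return ei / costs[c["fidelity"]]
    return max(candidates, key=score)

if __name__ == "__main__":
    # Hypothetical relative costs: one large-model run costs as much as ten small ones.
    costs = {"small": 1.0, "large": 10.0}
    # Hypothetical surrogate predictions for two candidate runs of the same mixture.
    candidates = [
        {"mixture": (0.6, 0.3, 0.1), "fidelity": "small", "mean": 0.50, "std": 0.20},
        {"mixture": (0.6, 0.3, 0.1), "fidelity": "large", "mean": 0.50, "std": 0.20},
    ]
    choice = pick_next_run(candidates, costs, best_so_far=0.50)
    print("next run:", choice["fidelity"], choice["mixture"])
```

With equal predicted improvement, the cheaper run wins. The expensive run is selected only once the cheap one has little left to teach, that is, when its predicted uncertainty has shrunk enough that its improvement per unit cost falls below that of the large run.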
Significant Performance Gains and Efficiency
The researchers demonstrated ADMIRE-BayesOpt’s effectiveness across a wide range of models, from 1 million to 7 billion parameters, covering both pre-training and instruction finetuning. The results were consistently strong, showing speed-ups of over 500% in identifying the best data mixture compared to recent baseline methods in their largest experiments.
In zero-shot transfer scenarios, where mixtures learned on smaller models are applied to larger ones, ADMIRE-BayesOpt proved superior in both transferability and computational efficiency. For example, when transferring from 0.5 billion to 7 billion parameter models, it was 19x faster than a prominent baseline method.
The multi-fidelity variant, ADMIRE-MFBO, showed even greater improvements. It schedules experiments adaptively, starting with cheaper, low-fidelity evaluations and gradually incorporating more expensive, high-fidelity ones. This progressive sampling strategy led to rapid convergence to optimal solutions at significantly reduced computational cost.
Opening Up Research with a New Dataset
To further accelerate research in this area, the team has released ADMIRE IFT Runs, a comprehensive dataset comprising 460 full training and evaluation runs across various model sizes. This dataset, which represents over 13,000 GPU hours of computation, allows researchers to study data mixing techniques without needing to run their own costly LLM training experiments. This significantly lowers the barrier to entry for new research in data-centric AI.
Analysis of this dataset revealed several patterns: the relationship between data mixture and performance is complex, models need to be evaluated on both in-distribution and out-of-distribution tasks, and certain training domains disproportionately affect specific evaluation benchmarks. It also showed that larger models tend to be more robust to variations in training data composition.
ADMIRE-BayesOpt represents a significant step forward in optimizing data mixtures for language models, offering a principled and efficient framework that outperforms traditional methods. The research also highlights rich opportunities for future exploration in understanding the broader effects of training data on model generalization. The full paper is titled "ADMIRE-BayesOpt: Accelerated Data Mixture RE-weighting for Language Models with Bayesian Optimization."


