TLDR: InfiAlign is a new framework that significantly enhances the reasoning capabilities of Large Language Models (LLMs) while drastically reducing the amount of training data and computational resources required. It achieves this through a sophisticated data selection pipeline that curates high-quality, diverse, and difficult examples, combined with a two-stage training process involving Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The framework demonstrates competitive performance against models trained on much larger datasets, particularly in mathematical and general reasoning tasks.
Large Language Models, or LLMs, have shown incredible abilities in tackling complex reasoning tasks, from mathematics to programming. However, making these models even smarter after their initial training, a process often called ‘alignment,’ usually demands a huge amount of data and computing power. This can be a major hurdle for researchers and developers.
A new research paper introduces InfiAlign, a clever framework designed to make this alignment process much more efficient. The core idea behind InfiAlign is to achieve high performance in reasoning tasks while using significantly less training data. This is a big step towards making advanced LLM development more accessible and less resource-intensive.
Smart Data Selection is Key
At the heart of InfiAlign is an intelligent data selection system. Instead of using vast amounts of data indiscriminately, InfiAlign automatically sifts through open-source reasoning datasets to pick out only the highest-quality examples. It does this by looking at several factors: the diversity of the topics, the difficulty of the problems, and the overall quality of the answers.
For instance, to gauge difficulty, the researchers found that longer responses often correspond to more complex reasoning problems, so the pipeline prioritizes these longer, more intricate examples. It also ensures diversity by categorizing questions by domain (algebra or geometry for math, arrays or strings for coding) and by analyzing the semantic content of the questions so that a broad range of concepts is covered.
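To make this concrete, here is a minimal Python sketch of what such a selection heuristic could look like. The data format, function name, and per-domain quota are illustrative assumptions, not details taken from the paper:

```python
import random
from collections import defaultdict

def select_examples(examples, per_domain=1000):
    """Toy selection heuristic (illustrative, not the paper's exact method):
    use response length as a difficulty proxy and sample a quota from
    each domain for diversity."""
    by_domain = defaultdict(list)
    for ex in examples:  # assume each ex is {"question", "answer", "domain"}
        by_domain[ex["domain"]].append(ex)

    selected = []
    for domain, group in by_domain.items():
        # Prefer longer responses, which tend to reflect harder problems.
        group.sort(key=lambda ex: len(ex["answer"]), reverse=True)
        selected.extend(group[:per_domain])
    random.shuffle(selected)
    return selected
```

The key design idea is that difficulty and diversity are scored cheaply, without running a model over every candidate, which is what keeps the pipeline scalable.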
After selecting the data, InfiAlign has a rigorous quality control step. It checks for incomplete or poorly formatted answers and even uses other LLMs to regenerate incorrect solutions until they pass verification. This meticulous process ensures that the models learn from only the best examples, preventing the introduction of noise or errors.
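A simplified version of such a verify-and-regenerate loop might look like the following, where `verify` and `teacher_generate` are hypothetical hooks standing in for an answer checker and a stronger LLM:

```python
def ensure_verified(example, teacher_generate, verify, max_retries=3):
    """Keep only answers that pass verification; if an answer fails,
    ask a stronger 'teacher' model to regenerate it. Both hooks are
    placeholders for illustration, not the paper's actual API."""
    answer = example["answer"]
    for _ in range(max_retries):
        if verify(example["question"], answer):
            example["answer"] = answer
            return example
        answer = teacher_generate(example["question"])
    return None  # drop examples that never pass verification
```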
A Two-Stage Training Approach
InfiAlign uses a two-stage training strategy. First, it employs Supervised Fine-Tuning (SFT), where the LLM learns from these carefully curated high-quality question-and-answer pairs. The training starts with simpler, structured problems, gradually moving to more diverse and complex tasks. This ‘curriculum learning’ approach helps the model build foundational reasoning skills before tackling more challenging scenarios.
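As a rough illustration of curriculum ordering, the sketch below stages the data from shorter to longer responses, reusing response length as the complexity signal (consistent with the difficulty proxy above; the staging itself is a simplifying assumption, not the paper's exact schedule):

```python
def curriculum_schedule(examples, stages=3):
    """Order SFT data from simpler, structured problems to more
    diverse, complex ones, using response length as a stand-in
    for complexity."""
    ordered = sorted(examples, key=lambda ex: len(ex["answer"]))
    stage_size = len(ordered) // stages
    for s in range(stages):
        end = len(ordered) if s == stages - 1 else (s + 1) * stage_size
        yield ordered[s * stage_size : end]

# Usage: fine-tune on each stage in order, e.g.
# for stage_data in curriculum_schedule(sft_examples):
#     trainer.train(stage_data)  # `trainer` is whatever SFT loop you use
```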
Following SFT, InfiAlign applies Direct Preference Optimization (DPO). This stage further refines the model’s reasoning by teaching it to prefer correct answers over incorrect ones. By pairing a correct solution (often generated by a very powerful ‘teacher’ model) with an incorrect one produced by the InfiAlign model itself, DPO helps the model learn subtle distinctions and improve its decision-making, especially in areas like mathematical reasoning.
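The standard DPO objective (Rafailov et al., 2023) that this stage builds on can be written compactly in PyTorch. InfiAlign may use its own variant, so treat this as background on the technique rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to assign higher likelihood
    to the correct ("chosen") answer than to the incorrect ("rejected")
    one, relative to a frozen reference model. Inputs are per-sequence
    log-probabilities; beta is an illustrative default."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In InfiAlign's setup, the "chosen" side is the teacher's verified solution and the "rejected" side is the model's own incorrect attempt, so the loss directly targets the model's failure cases.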
Impressive Results with Less Data
The results are quite compelling. When applied to the Qwen2.5-Math-7B-Base model, InfiAlign’s SFT model achieved performance comparable to DeepSeek-R1-Distill-Qwen-7B, a strong baseline model. What’s remarkable is that InfiAlign accomplished this using only about 12% of the training data (92,000 examples compared to 800,000). This demonstrates significant data efficiency.
Further improvements were seen with the DPO stage, particularly in mathematical reasoning tasks, where the model showed an average improvement of 3.89% on AIME 24/25 benchmarks. The framework also proved scalable, showing consistent gains when the training data was increased from 92,000 to 165,000 examples.
Ablation studies within the paper confirmed the importance of InfiAlign’s data sampling strategies, showing that the combination of response length as a difficulty proxy and dual-granularity diversity sampling is highly effective. The research also highlighted that using high-quality ‘teacher’ models to generate correct solutions is crucial for distilling strong reasoning capabilities into smaller models.
InfiAlign offers a practical and generalizable solution for aligning large reasoning models in a scalable and data-efficient manner. While the framework’s metrics for data selection are currently manually defined and might need tuning for entirely new domains, this work provides a robust foundation for future advancements in making LLMs smarter with fewer resources. You can read the full research paper for more technical details and experimental results: InfiAlign Research Paper.


