TLDR: This research paper reveals that Supervised Fine-Tuning (SFT) on curated data is fundamentally a form of Reinforcement Learning (RL), optimizing a lower bound of the RL objective. It introduces importance-weighted SFT (iw-SFT), a simple modification that tightens this bound, leading to improved performance in LLM reasoning tasks (e.g., AIME 2024, GPQA Diamond) and competitive results in offline continuous control, without requiring complex RL algorithms or additional inference-time techniques.
A new research paper sheds light on the fundamental connection between two prominent machine learning paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Although the two are traditionally viewed as distinct approaches for training large language models (LLMs) and control policies, this work argues that SFT, especially when applied to carefully selected or ‘curated’ data, can be understood as a form of RL.
The paper, titled “Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)” by Chongli Qin and Jost Tobias Springenberg, delves into the theoretical underpinnings of this relationship. It clarifies that when models are fine-tuned using SFT on data that has been filtered for quality or success (a common practice in modern LLM training), they are, in essence, trying to maximize a lower bound of an RL objective. This means SFT is implicitly trying to learn a policy that achieves desired outcomes, much like an RL agent.
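To see why, here is a sketch of the standard argument, written in my own notation (the paper’s exact formulation may differ in detail). Let $\pi_\theta$ be the model being fine-tuned, $\mu$ the policy that generated the data, and $R(\tau) \geq 0$ the reward used for curation. By importance sampling and Jensen’s inequality,

$$
\log \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]
= \log \mathbb{E}_{\tau \sim \mu}\!\left[\frac{\pi_\theta(\tau)}{\mu(\tau)}\, R(\tau)\right]
\;\geq\; \mathbb{E}_{\tau \sim \mu_R}\!\left[\log \pi_\theta(\tau)\right] + \text{const},
$$

where $\mu_R(\tau) \propto R(\tau)\,\mu(\tau)$ is the curated data distribution and the constant does not depend on $\theta$. With a binary reward (keep successful trajectories, discard the rest), the right-hand side is exactly the SFT log-likelihood on the filtered dataset, so maximizing it maximizes a lower bound on the (log) RL objective.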
While standard SFT on curated data has shown surprising effectiveness, the researchers identified a key limitation: the lower bound it optimizes becomes looser as the model learns and drifts away from the original data distribution. To address this, they propose a variant called importance-weighted supervised fine-tuning (iw-SFT). This modification introduces a re-weighting scheme for the training data: data points that are better aligned with the desired behavior receive more weight, allowing the model to learn more efficiently from the most relevant examples.
The core idea behind iw-SFT is to tighten the connection to the true RL objective. By adaptively re-weighting data, it behaves more like a direct RL training process. This adaptive re-weighting can also be thought of as a dynamic ‘adaptive-filtering’ mechanism, where the model itself learns to prioritize data points that are more ‘preferred’ or higher quality.
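As a rough illustration of what such a re-weighting could look like in practice, here is a PyTorch-style sketch for a Hugging Face-style causal language model. Everything in it — the per-sequence weights computed against a frozen reference model, the clipping constant, and the function names — is an assumption of this write-up rather than the authors’ code; the paper’s exact scheme may differ (for example, in how weights are normalized or whether they are applied per token).

```python
# Hedged sketch of an importance-weighted SFT step; names and hyperparameters
# are illustrative, not taken from the paper's implementation.
import torch
import torch.nn.functional as F

def iw_sft_loss(model, ref_model, input_ids, attention_mask, clip=10.0):
    # Token-level log-probabilities of the targets under the current policy.
    logits = model(input_ids, attention_mask=attention_mask).logits[:, :-1]
    targets = input_ids[:, 1:]
    logp = -F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).view(targets.shape)

    mask = attention_mask[:, 1:].float()

    with torch.no_grad():
        # Log-probabilities under a frozen reference policy (e.g. the model
        # before fine-tuning); used only to form importance weights.
        ref_logits = ref_model(input_ids, attention_mask=attention_mask).logits[:, :-1]
        ref_logp = -F.cross_entropy(
            ref_logits.reshape(-1, ref_logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        ).view(targets.shape)

        # Sequence-level importance weight pi_theta / pi_ref, clipped for stability.
        log_ratio = ((logp.detach() - ref_logp) * mask).sum(-1)
        weights = log_ratio.exp().clamp(max=clip)

    # Standard SFT negative log-likelihood, re-weighted per sequence.
    nll = -(logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return (weights * nll).mean()
```

In words: the loss is ordinary SFT cross-entropy, but each curated example is scaled by how likely the current policy finds it relative to the reference, which is one concrete way the ‘adaptive-filtering’ behavior described above could be realized.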
The practical implications of iw-SFT are significant. The authors demonstrated its effectiveness across two major domains:
- Improving LLM Reasoning
- Enhancing Offline Reinforcement Learning for Control
When applied to large language models for complex reasoning tasks, iw-SFT showed impressive gains. Using a curated dataset of high-quality reasoning traces, the method outperformed standard SFT on benchmarks like AIME 2024 and GPQA Diamond, while matching performance on MATH 500. Notably, iw-SFT achieved these results without the need for additional inference-time techniques like ‘budget forcing,’ which are sometimes used to encourage LLMs to ‘think’ longer. This suggests that iw-SFT inherently helps the model extract more valuable information from the training data.
The versatility of iw-SFT was further showcased in continuous control tasks using offline RL datasets (D4RL). Here, the method, particularly its quality-sampled variant (iw-SFT(Q)), proved competitive with advanced RL algorithms. This indicates that the principles of importance weighting and quality-proportional sampling can significantly boost performance even in complex robotic control scenarios, especially when dealing with data that includes explicit reward information.
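As a rough sketch of the quality-proportional sampling idea behind iw-SFT(Q), the snippet below draws trajectories from an offline buffer with probability increasing in their return. The softmax form, the temperature, and the made-up returns are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: sample offline trajectories in proportion to their quality.
import numpy as np

def quality_sampling_probs(returns, temperature=50.0):
    # Convert per-trajectory returns into sampling probabilities via a softmax,
    # so higher-return trajectories are drawn more often when building batches.
    scores = np.asarray(returns, dtype=np.float64) / temperature
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Illustrative usage with a D4RL-style buffer of four trajectories.
returns = np.array([120.0, 45.0, 300.0, 10.0])   # made-up per-trajectory returns
probs = quality_sampling_probs(returns)
batch_idx = np.random.choice(len(returns), size=2, p=probs)
```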
The researchers emphasize that iw-SFT is straightforward to implement, requiring only minor adjustments to existing SFT pipelines. This ease of implementation, combined with its strong performance, makes it a promising approach for future model training. The paper also explores how iw-SFT can be generalized to scenarios where data comes with varying ‘quality scores,’ further broadening its applicability.
This work provides a fresh perspective on supervised fine-tuning, reframing it not just as a data-driven learning process, but as a form of reinforcement learning. By understanding and improving this underlying connection, the authors pave the way for more effective and robust training of advanced AI models. For the technical details, see the full research paper.


