TLDR: This research paper introduces the Unified Policy Gradient Estimator (UPGE), a theoretical framework that unifies Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) as instances of a single optimization process for Large Language Model (LLM) post-training. Building on this, the authors propose Hybrid Post-Training (HPT), an algorithm that dynamically switches between SFT and RL based on the model’s real-time performance. HPT consistently outperforms existing baselines across model families and mathematical reasoning benchmarks, balancing exploitation of demonstrations with stable exploration and yielding clear gains in capability and generalization.
Large Language Models (LLMs) have become incredibly powerful, but getting them to perform at their best often requires a crucial step called post-training. This process typically involves two main approaches: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). SFT uses human-annotated data to teach the model specific behaviors, while RL allows the model to explore and learn from feedback in an environment. Traditionally, these methods have been seen as distinct, or used in a multi-stage pipeline like SFT-then-RL, which can be resource-intensive and complex to tune.
A new research paper, “Towards a Unified View of Large Language Model Post-Training” by Xingtai Lv, Yuxin Zuo, Youbang Sun, and their colleagues from Tsinghua University, Shanghai AI Laboratory, and WeChat AI, proposes a groundbreaking perspective. They argue that SFT and RL are not contradictory but are, in fact, different facets of a single, unified optimization process. This insight leads to a novel theoretical framework and a practical new algorithm.
Unifying the Post-Training Landscape
The core of their theoretical contribution is the Unified Policy Gradient Estimator (UPGE). This framework demonstrates that the gradient calculations for a wide range of post-training methods can be expressed in a single, generalized form. The UPGE is built from four interchangeable components, combined as in the sketch that follows this list:
- Stabilization Mask: This component, inspired by techniques like PPO clipping, helps manage instability during RL training by selectively turning off gradients when updates are deemed unsafe. Different algorithms employ various modifications to this mask, each with its own empirical motivations.
- Reference Policy Denominator: This term acts as a reweighting coefficient, typically an inverse probability, that assigns more weight to tokens with smaller probabilities. For SFT it is the current policy; PPO-style RL commonly uses an older rollout policy; offline RL methods often simplify it to a constant.
- Advantage Estimate: In the context of LLMs, this usually measures the quality of a generated sequence rather than individual tokens. It helps the model maximize the likelihood of generating positive sequences and minimize negative ones. Methods like GRPO use group-wise normalization to structure this estimate.
- Likelihood Gradient: This is the fundamental term responsible for back-propagating objective signals to the model’s parameters, remaining consistent across all gradient calculations.
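To make the decomposition concrete, here is a minimal PyTorch sketch of how these four pieces might combine into one surrogate loss whose gradient has the unified form. All function and variable names (and the GRPO-style advantage helper) are our own illustration rather than code from the paper, and it assumes the per-token log-probabilities have already been gathered as tensors.

```python
import torch

def unified_pg_surrogate(logp_theta, logp_ref, advantage, mask):
    """Surrogate loss whose gradient matches the unified form
    -mask * (A / pi_ref) * grad(pi_theta), via the identity
    grad(pi_theta) = pi_theta * grad(log pi_theta).

    logp_theta: log pi_theta(y_t | ...) per sampled token (requires grad)
    logp_ref:   log of the reference-policy denominator (held constant)
    advantage:  sequence-level advantage broadcast to tokens (held constant)
    mask:       stabilization mask; 1 keeps a token's update, 0 drops it
    """
    ratio = torch.exp(logp_theta - logp_ref.detach())  # pi_theta / pi_ref
    return -(mask * advantage.detach() * ratio).mean()

def grpo_style_advantage(rewards, eps=1e-6):
    """Group-wise normalized advantage, as in GRPO-style methods."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# SFT falls out as a special case: pi_ref = current policy (detached),
# A = 1, mask = 1. Then ratio == 1 at the current parameters and the
# gradient reduces to the ordinary negative log-likelihood gradient on
# demonstration tokens. PPO-style RL instead sets pi_ref to the old
# rollout policy and uses the mask to implement clipping.
```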
By showing that these components can be combined to represent various algorithms, the researchers reveal that SFT and RL are not in conflict but rather provide complementary learning signals that can guide the optimization process jointly. However, they also highlight that different choices for these components introduce various bias-variance tradeoffs.
Introducing Hybrid Post-Training (HPT)
Motivated by their unified theoretical framework, the team developed Hybrid Post-Training (HPT). This algorithm dynamically selects between SFT and RL training signals based on the model’s real-time performance. HPT uses a mixed loss function, where the weights for RL loss and SFT loss are adjusted based on how well the model is performing on a given question. If the model’s performance is strong, HPT emphasizes RL to encourage exploration. If the model’s competence is limited, SFT takes precedence to provide direct guidance and exploit existing demonstrations.
In their implementation, HPT employs a simple switch mechanism: if the model’s performance (measured by the mean of verification scores from multiple on-policy trajectories) exceeds a certain threshold, it uses RL; otherwise, it uses SFT. This adaptive approach is designed to yield both effective exploitation of demonstration data and stable exploration without compromising the model’s learned reasoning patterns.
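As a rough illustration of that gate, here is a minimal Python sketch. The function name, arguments, and scoring convention are our assumptions, not the paper's code; it treats the mixed loss L = w_rl · L_RL + w_sft · L_SFT with binary weights chosen per question.

```python
def hpt_loss(rl_loss, sft_loss, scores, gamma=0.0):
    """Performance-gated switch between RL and SFT signals (hypothetical names).

    rl_loss:  RL objective on k on-policy rollouts (e.g., a GRPO-style loss)
    sft_loss: supervised (negative log-likelihood) loss on the demonstration
    scores:   verification scores in [0, 1], one per on-policy rollout
    gamma:    gate threshold on the mean verification score
    """
    mean_score = sum(scores) / len(scores)
    if mean_score > gamma:
        return rl_loss   # competent on this question: explore via RL
    return sft_loss      # struggling: exploit the demonstration via SFT

# Example: 8 rollouts with 3 verified correct -> mean score 0.375 > 0,
# so this question contributes an RL update rather than an SFT update.
```

Because the weights are recomputed per question from fresh rollouts, a single training batch can mix SFT updates on problems the model fails with RL updates on problems it already solves.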
Experimental Validation and Key Findings
The researchers conducted extensive experiments across various LLM scales and families, including Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and LLaMA-3.1-8B. They evaluated HPT on six mathematical reasoning benchmarks (AIME 2024, AIME 2025, AMC, MATH-500, Minerva, OlympiadBench) and two out-of-distribution suites (GPQA-Diamond, ARC-c).
HPT consistently outperformed strong baselines, including individual SFT and GRPO, sequential SFT→GRPO, and other mixed-policy approaches like LUFFY and SRFT. For instance, HPT achieved a notable 7-point gain over the strongest baseline on AIME 2024 with Qwen2.5-Math-7B.
Empirical analysis provided deeper insights:
- Enhanced Exploration and Exploitation: HPT achieved the highest large-k Pass@k scores, indicating that it not only improves immediate performance but also significantly enhances the model’s exploratory capacity and generalization. It also demonstrated the ability to solve new, challenging problems while preventing catastrophic forgetting of previously learned skills.
- Adaptive Training Dynamics: Visualizations showed that HPT effectively switches between SFT and RL. Initially, SFT-driven updates are more prevalent, especially for weaker models or harder problems. As the model improves, the contribution of RL grows, eventually stabilizing. HPT maintained higher output diversity (entropy) and preserved long-form reasoning patterns, suggesting that SFT helps internalize these routines, which RL then refines.
- Role of Offline Data: Experiments comparing HPT (SFT/On-policy) with Off-policy/On-policy and Mix-policy/On-policy methods indicated that SFT is highly effective for learning from offline data, and dedicated off-policy RL might not be strictly necessary when SFT is dynamically integrated.
- Gate Threshold Importance: An ablation study on the performance-based switch gate (γ) revealed that a carefully chosen threshold is crucial. The optimal setting (γ=0, meaning SFT is used only when the model completely fails a question) achieved the best performance, underscoring the importance of maintaining a dynamic balance between exploration and exploitation.
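In terms of the hypothetical gate sketched earlier, γ = 0 makes SFT a pure fallback:

```python
# With verification scores in [0, 1], gamma = 0 routes a question to SFT
# only when mean_score == 0, i.e., every sampled rollout fails verification.
use_rl = mean_score > 0.0
```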
This research provides a valuable theoretical foundation for understanding LLM post-training and offers a practical, adaptive algorithm that effectively balances different learning signals to boost model capabilities across various tasks and scales.