TL;DR: A new research paper challenges the common belief that high Supervised Fine-Tuning (SFT) scores predict better performance after Reinforcement Learning (RL) in Large Language Models (LLMs). The authors found that high SFT scores can be misleading, are often biased towards simpler data, and don't reliably indicate future RL gains. They propose two new metrics, generalization loss on held-out examples and Pass@large k performance, which significantly improve the prediction of post-RL outcomes, helping to optimize LLM training and save computational resources.
In the rapidly evolving world of Large Language Models (LLMs), particularly those designed for complex reasoning tasks, the training process typically involves two main stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with Verifiable Rewards (RLVR), often shortened to RL. The conventional wisdom has been that models performing well in the SFT stage would naturally lead to even better outcomes after the subsequent RL phase. However, a recent research paper titled “Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead” challenges this long-held assumption, revealing significant instances where high SFT scores can be deceptive.
Authored by Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, and Newsha Ardalani, this paper highlights a critical disconnect in current LLM post-training practices. The researchers found numerous counter-examples where models with excellent SFT performance actually yielded substantially worse results after RL training compared to models that started with lower SFT scores. This phenomenon, which they term a “quagmire,” arises because high SFT scores can be biased towards simpler or more homogeneous data, making them unreliable predictors of how well a model will generalize or improve during the more exploratory RL stage.
The implications of this finding are substantial. In industrial settings, SFT and RL are often handled by different teams, each optimizing for its own metrics. When the SFT team delivers a model with seemingly strong performance, only for it to underperform after the expensive RL stage, the result is friction, wasted resources, and delays in model development. RL training is computationally costly, with runs often spanning days and consuming substantial GPU time, so it is crucial to identify promising SFT candidates early on.
Identifying More Reliable Predictors
To address this predictability problem, the researchers investigated alternative metrics that could more accurately forecast post-RL success. They identified two key indicators:
1. Generalization Loss on Held-Out Reasoning Examples: The study observed that as SFT training progresses, especially with overtraining, the validation loss on held-out examples tends to increase significantly. This “flaring up” of generalization loss strongly correlates with a decreased potential for performance gains during the subsequent RL stage. By monitoring this loss, practitioners can identify models that are overfitting during SFT, even if their SFT performance metrics are high, and avoid committing them to expensive RL training.
2. Pass@large k Performance: The RL objective, particularly with methods like GRPO, aims to maximize Pass@1 accuracy. The paper suggests that Pass@k accuracy (the chance that at least one of k sampled answers is correct), especially for a large 'k', provides a more granular measure of a model's inherent capability to generate correct solutions. This metric is less sensitive to shifts in training data distribution and can effectively rank different SFT models by their potential for RL success without needing to run actual RL experiments for calibration.
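The two indicators above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names and the 5% relative tolerance are hypothetical. The Pass@k formula is the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) from code-generation evaluation, and it is an assumption that the paper computes Pass@k this way.

```python
from math import comb
from typing import Optional, Sequence


def loss_flare_up(val_losses: Sequence[float],
                  rel_tolerance: float = 0.05) -> Optional[int]:
    """Return the index of the first checkpoint whose held-out loss exceeds
    the best loss seen so far by more than rel_tolerance, or None.

    Assumes positive loss values. The 5% default is an illustrative
    threshold, not a value taken from the paper.
    """
    best = float("inf")
    for i, loss in enumerate(val_losses):
        if loss > best * (1.0 + rel_tolerance):
            return i
        best = min(best, loss)
    return None


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: budget (k <= n).
    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset of the n samples contains no correct one.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a benchmark, the per-problem values of `pass_at_k` would be averaged. A checkpoint for which `loss_flare_up` returns an index is one where held-out loss has started climbing, which the article associates with reduced headroom for RL.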
The research involved training hundreds of models, including Llama3, Mistral-Nemo, and Qwen3, at sizes up to 12 billion parameters, using various SFT and RL datasets. Extensive evaluations across seven math benchmarks, involving over a million GPU hours, empirically validated the effectiveness of these new metrics. The proposed predictors significantly improved the accuracy of predicting RL outcomes, boosting the R² coefficient and Spearman's rank correlation coefficient by up to 0.5 (a two-fold improvement) compared to relying solely on pre-RL performance.
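To make the evaluation criterion concrete, the sketch below shows how Spearman's rank correlation between a pre-RL predictor and post-RL accuracy could be computed. It is a plain-Python illustration under stated assumptions: the helper names are hypothetical, and the numbers in the usage section are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder numbers for four hypothetical SFT checkpoints (NOT paper data).
sft_accuracy = [0.62, 0.58, 0.55, 0.50]
pass_at_large_k = [0.70, 0.78, 0.74, 0.81]
post_rl_accuracy = [0.66, 0.75, 0.71, 0.78]

print("SFT accuracy vs post-RL:", spearman(sft_accuracy, post_rl_accuracy))
print("Pass@large k vs post-RL:", spearman(pass_at_large_k, post_rl_accuracy))
```

A predictor with a rank correlation near 1 orders SFT checkpoints the same way their post-RL results do, which is the property the article says the proposed metrics improve.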
Practical Applications and Future Directions
In practice, these metrics offer powerful tools for optimizing the LLM post-training pipeline. For instance, SFT on a full set of unique examples for one epoch might underperform SFT on half as many examples for two epochs, both directly after SFT and after SFT followed by RL. Similarly, training on only short examples might lead to better SFT performance but worse outcomes after RL. The new predictors can capture these nuances, guiding decisions on data selection and training paradigms.
The authors plan to open-source an enhanced evaluation tool to facilitate broader adoption of these insights. While this work primarily focuses on mathematical reasoning and the GRPO-based online RL paradigm, future research could explore these dynamics in other reasoning tasks (like coding or science) and with different RL algorithms or offline RL/DPO methods. The paper also notes the computational expense of directly evaluating Pass@large k and suggests exploring methods to estimate it from smaller ‘k’ values for greater efficiency.
This research marks a significant step towards de-risking the expensive RL stage in LLM development, enabling practitioners to make more informed decisions and streamline the entire post-training workflow. You can read the full paper here: Quagmires in SFT-RL Post-Training.


