Predicting LLM Reasoning Power: Why SFT Scores Aren't Enough

TLDR: A new research paper challenges the common belief that high Supervised Fine-Tuning (SFT) scores predict better performance after Reinforcement Learning (RL) in Large Language Models (LLMs). The authors found that high SFT scores can be misleading, often biased towards simpler data, and don’t reliably indicate future RL gains. They propose two new metrics—generalization loss on held-out examples and Pass@large k performance—which significantly improve the prediction of post-RL outcomes, helping to optimize LLM training and save computational resources.

In the rapidly evolving world of Large Language Models (LLMs), particularly those designed for complex reasoning tasks, the training process typically involves two main stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with Verifiable Rewards (RLVR), often shortened to RL. The conventional wisdom has been that models performing well in the SFT stage would naturally lead to even better outcomes after the subsequent RL phase. However, a recent research paper titled “Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead” challenges this long-held assumption, revealing significant instances where high SFT scores can be deceptive.

Authored by Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, and Newsha Ardalani, this paper highlights a critical disconnect in current LLM post-training practices. The researchers found numerous counter-examples where models with excellent SFT performance actually yielded substantially worse results after RL training compared to models that started with lower SFT scores. This phenomenon, which they term a “quagmire,” arises because high SFT scores can be biased towards simpler or more homogeneous data, making them unreliable predictors of how well a model will generalize or improve during the more exploratory RL stage.

The implications of this finding are substantial. In industrial settings, SFT and RL are often handled by different teams, each optimizing for their own metrics. When the SFT team delivers a model with seemingly strong performance, only for it to underperform after the expensive RL stage, it creates friction, wasted resources, and delays in model development. The high computational cost of RL training, often spanning days and consuming millions of GPU hours, makes it crucial to identify promising SFT candidates early on.

Identifying More Reliable Predictors

To address this predictability problem, the researchers investigated alternative metrics that could more accurately forecast post-RL success. They identified two key indicators:

1. Generalization Loss on Held-Out Reasoning Examples: The study observed that as SFT training progresses, especially with overtraining, the validation loss on held-out examples tends to increase significantly. This “flaring up” of generalization loss strongly correlates with a decreased potential for performance gains during the subsequent RL stage. By monitoring this loss, practitioners can identify models that are overfitting during SFT, even if their SFT performance metrics are high, and avoid committing them to expensive RL training.

2. Pass@large k Performance: The RL objective, particularly with methods like GRPO, aims to maximize Pass@1 accuracy. The paper suggests that Pass@k accuracy, especially for a large ‘k’, provides a more granular measure of a model’s inherent capability to generate correct solutions. This metric is less sensitive to shifts in training data distribution and can effectively rank different SFT models based on their potential for RL success without needing to run actual RL experiments for calibration.
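To make the Pass@k metric concrete, here is a minimal sketch of the widely used unbiased Pass@k estimator from the code-generation evaluation literature. It is an assumption that this matches the paper's exact evaluation procedure; the function names `pass_at_k` and `mean_pass_at_k` and the sample counts are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n sampled answers, c of them correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k): the
    probability that a random subset of k samples contains at least one
    correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average Pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: 64 samples per problem, with 0, 2, and 40 correct.
toy_results = [(64, 0), (64, 2), (64, 40)]
print(mean_pass_at_k(toy_results, k=1))
print(mean_pass_at_k(toy_results, k=32))
```

Because the estimator needs at least k samples per problem, a large k directly multiplies the number of generations required, which is why the metric is costly to evaluate.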

The research involved training hundreds of models of up to 12 billion parameters, including Llama3, Mistral-Nemo, and Qwen3, on various SFT and RL datasets. Extensive evaluations across seven math benchmarks, involving over a million GPU hours, empirically validated the effectiveness of these new metrics. Compared with relying solely on pre-RL performance, the proposed predictors improved the accuracy of predicting RL outcomes, raising the R² coefficient and Spearman’s rank correlation coefficient by up to 0.5 (a two-fold improvement).
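For readers unfamiliar with the two agreement measures mentioned above, the sketch below computes Spearman's rank correlation and the R² of a simple linear fit between a pre-RL predictor and post-RL accuracy. The numbers are made-up placeholders, not results from the paper, and the helper names are hypothetical. It relies on the standard fact that, for a one-variable least-squares line, R² equals the squared Pearson correlation.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def r_squared_linear(x, y):
    """R² of a one-variable least-squares line, i.e. squared Pearson r."""
    return pearson(x, y) ** 2


# Placeholder values for five hypothetical SFT checkpoints.
predictor = [0.42, 0.55, 0.61, 0.48, 0.70]    # e.g. a pre-RL metric
post_rl_acc = [0.50, 0.58, 0.66, 0.57, 0.71]  # accuracy after RL
print(spearman(predictor, post_rl_acc))
print(r_squared_linear(predictor, post_rl_acc))
```

Spearman's coefficient only asks whether a predictor ranks candidates in the right order, which is the practical question when choosing which SFT model to send to RL.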

Practical Applications and Future Directions

In practice, these metrics offer powerful tools for optimizing the LLM post-training pipeline. For instance, SFT on a set of unique examples for one epoch might underperform SFT on half as many examples for two epochs, both immediately after SFT and after SFT-then-RL. Similarly, training on only short examples might lead to better SFT performance but worse outcomes after RL. The new predictors can capture these nuances, guiding decisions on data selection and training paradigms.
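One way the generalization-loss predictor could be used when screening SFT checkpoints is sketched below. This is an illustrative heuristic rather than the paper's procedure: the function name `flare_up_step`, the 5% relative tolerance, and the loss values are all assumptions made for the example.

```python
def flare_up_step(val_losses, tolerance=0.05):
    """Return the index of the first checkpoint where held-out loss flares up.

    A flare-up is defined here (an assumption for this sketch) as the held-out
    loss exceeding the best loss seen so far by more than `tolerance`, taken
    as a relative fraction. Returns None if no checkpoint crosses it.
    """
    best = float("inf")
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
        elif loss > best * (1 + tolerance):
            return step
    return None


# Hypothetical held-out losses logged at successive SFT checkpoints.
held_out_losses = [1.00, 0.80, 0.70, 0.72, 0.90, 1.10]
step = flare_up_step(held_out_losses)
if step is None:
    print("No flare-up detected; all checkpoints remain candidates for RL.")
else:
    print(f"Held-out loss flares up at checkpoint {step}; prefer earlier ones.")
```

A checkpoint flagged this way may still post a high SFT benchmark score, which is exactly the situation in which the held-out loss adds information before any RL compute is committed.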

The authors plan to open-source an enhanced evaluation tool to facilitate broader adoption of these insights. While this work primarily focuses on mathematical reasoning and the GRPO-based online RL paradigm, future research could explore these dynamics in other reasoning tasks (like coding or science) and with different RL algorithms or offline RL/DPO methods. The paper also notes the computational expense of directly evaluating Pass@large k and suggests exploring methods to estimate it from smaller ‘k’ values for greater efficiency.

This research marks a significant step towards de-risking the expensive RL stage in LLM development, enabling practitioners to make more informed decisions and streamline the entire post-training workflow. You can read the full paper here: Quagmires in SFT-RL Post-Training.
