
Unpacking LLM Fine-Tuning: How Reinforcement Learning Restores Lost Reasoning Abilities

TLDR: This research paper investigates the two-stage fine-tuning process of Large Language Models (LLMs): Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). It reveals that SFT can lead to ‘OOD forgetting,’ where out-of-distribution reasoning performance declines after an early peak. The study finds that RL doesn’t create new OOD capabilities but rather restores, within certain boundaries, the reasoning ability lost during SFT. Through SVD analysis, the authors discover that this forgetting and restoration correlate with the rotation of singular vectors in parameter matrices, rather than changes in singular values, suggesting SFT performs a hard alignment while RL softly re-aligns for robustness.

Large Language Models (LLMs) have become incredibly powerful, and a common way to make them even better for specific tasks is through a two-stage fine-tuning process. This typically involves Supervised Fine-Tuning (SFT) first, followed by Reinforcement Learning (RL). While this approach has shown great success in improving reasoning abilities, the exact mechanisms behind how SFT and RL work together have been a bit of a mystery.

A recent research paper, titled “RL Fine-Tuning Heals OOD Forgetting in SFT,” delves deep into this synergy, challenging some long-held beliefs and uncovering new insights into how LLMs learn and generalize. The authors, Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa, conducted a detailed analysis using models like LLaMA-3.2-11B and Qwen-2.5-7B.

Challenging the Old Adage: “SFT Memorizes, RL Generalizes”

The popular saying that “SFT memorizes, RL generalizes” has been a simplified view of the fine-tuning process. This paper reveals a more nuanced picture. The researchers found that during the SFT stage, the model’s ability to reason on Out-Of-Distribution (OOD) tasks – meaning tasks slightly different from what it was explicitly trained on – actually peaks early on and then starts to decline. This phenomenon is termed “OOD forgetting.” What’s more, this decline isn’t easily spotted by looking at traditional training or test loss metrics, which continue to decrease.

This means that if you stop SFT too late, the model might have already lost some of its valuable OOD reasoning capacity. The best SFT checkpoint for OOD performance is often missed if only in-distribution metrics are monitored.
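The practical fix is to evaluate a held-out OOD set at every SFT checkpoint instead of trusting the loss curve. Here is a minimal sketch of that checkpoint sweep; the `id_eval` and `ood_eval` callables are hypothetical stand-ins for your own evaluation harness, not code from the paper.

```python
# Minimal sketch: track in-distribution loss AND held-out OOD accuracy per SFT
# checkpoint, so the early OOD peak isn't silently missed. `id_eval` and
# `ood_eval` are assumed evaluation callables, not the paper's tooling.

def select_best_ood_checkpoint(checkpoints, id_eval, ood_eval):
    """Pick the checkpoint with the best OOD score, not the lowest ID loss."""
    best_ckpt, best_ood = None, float("-inf")
    for ckpt in checkpoints:
        id_loss = id_eval(ckpt)   # keeps decreasing, so SFT looks "better"
        ood_acc = ood_eval(ckpt)  # peaks early, then declines (OOD forgetting)
        print(f"{ckpt}: id_loss={id_loss:.4f} ood_acc={ood_acc:.3f}")
        if ood_acc > best_ood:
            best_ckpt, best_ood = ckpt, ood_acc
    return best_ckpt
```

The point of the sketch is simply that the selection criterion is the OOD score, which moves in a different direction from the loss the trainer reports.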

RL’s Role: Restoration, Not Creation

The subsequent RL stage, often seen as the magic bullet for generalization, doesn’t actually generate fundamentally new OOD capabilities. Instead, the paper highlights that RL plays an “OOD restoration” role. It helps recover the reasoning ability that was lost during the later stages of SFT. This recovery, however, isn’t limitless. There’s a clear boundary: if SFT is either too short or too long, RL cannot effectively bring back the lost OOD ability.

Essentially, RL acts as an automatic way to mitigate OOD forgetting, saving researchers from having to manually find the perfect SFT stopping point. It fine-tunes the model to a more robust configuration, healing the forgetting and learning downstream tasks simultaneously.
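To make the second stage concrete, here is a deliberately simplified, REINFORCE-style sketch of RL fine-tuning applied on top of a chosen SFT checkpoint. The `policy`, `sample_fn`, and `reward_fn` names are illustrative assumptions (e.g., a verifiable answer-correctness reward), not the paper's training setup.

```python
# REINFORCE-style sketch of the RL stage on a PyTorch-style policy/optimizer.
# All callables are hypothetical placeholders, not the paper's code.
def rl_finetune(policy, optimizer, prompts, sample_fn, reward_fn, steps=1000):
    for _ in range(steps):
        for prompt in prompts:
            # Roll out the current policy, keeping the log-prob's autograd graph.
            completion, logprob = sample_fn(policy, prompt)
            reward = reward_fn(prompt, completion)  # scalar, e.g. 1.0 if correct
            loss = -reward * logprob                # policy-gradient objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```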

The Underlying Mechanism: Singular Vector Rotation

To understand *why* this forgetting and restoration happens, the researchers employed Singular Value Decomposition (SVD) analysis on the parameter matrices of the LLMs. Contrary to a common belief that changes in model capacity are mainly due to shifts in singular values, this study found that singular values remain quite stable throughout the fine-tuning process.

Instead, the key factor correlating with OOD behavior is the “rotation of singular vectors.” SFT performs a “hard alignment” of crucial parameter directions to the target tasks, leading to rapid but sometimes greedy adjustments and quick forgetting. RL, on the other hand, “conditionally re-aligns singular vectors softly and slowly” towards a more robust configuration. This soft re-alignment is what helps heal the OOD forgetting.
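This distinction is easy to probe numerically. The sketch below is a reconstruction under my own assumptions, not the authors' released code: it compares a weight matrix before and after a fine-tuning stage, measuring singular-value drift (spectrum change) separately from principal angles between the top-k left singular subspaces (rotation).

```python
# Sketch of an SVD diagnostic separating singular-value drift from
# singular-vector rotation for a weight matrix W0 (before) vs W1 (after).
import torch

def svd_drift(W0: torch.Tensor, W1: torch.Tensor, k: int = 32):
    U0, S0, _ = torch.linalg.svd(W0, full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W1, full_matrices=False)

    # Singular-value drift: relative change in the top-k spectrum.
    value_drift = ((S1[:k] - S0[:k]).norm() / S0[:k].norm()).item()

    # Singular-vector rotation: principal angles between top-k left subspaces.
    # cos(theta_i) are the singular values of U0[:, :k]^T @ U1[:, :k].
    cosines = torch.linalg.svdvals(U0[:, :k].T @ U1[:, :k]).clamp(max=1.0)
    mean_angle = torch.rad2deg(torch.acos(cosines)).mean().item()
    return value_drift, mean_angle
```

Under the paper's account, a diagnostic along these lines would show the first number staying small throughout fine-tuning while the rotation angle grows during late SFT and is partially unwound by RL.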

The paper provides a fresh perspective on the roles of SFT and RL, identifying the rotation of singular vectors as a critical mechanism in how LLMs evolve during fine-tuning. This understanding could lead to more effective and robust fine-tuning strategies in the future.

For more technical details and experimental results, see the full research paper, “RL Fine-Tuning Heals OOD Forgetting in SFT.”

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
