TLDR: This research paper investigates the impact of Supervised Fine-Tuning (SFT) and Reinforcement Learning Fine-Tuning (RL-FT) on Large Language Models’ (LLMs) out-of-distribution (OOD) generalization. It reveals that SFT initially improves OOD performance but then causes it to degrade due to overfitting. RL-FT primarily acts as a “memory restorer,” recovering most of this lost OOD generalization by re-aligning the model’s internal representations. The study’s key finding is that changes in the *directions* of singular vectors (rotations) within the model’s weight matrices are far more critical for performance than changes in their *magnitudes* (singular values). The paper also identifies inexpensive recovery methods, such as low-rank or shallow-layer resets, as effective alternatives to costly RL-FT.
Large Language Models (LLMs) have become ubiquitous, but training them from scratch is a monumental task. This makes post-training methods such as Supervised Fine-Tuning (SFT) and Reinforcement Learning Fine-Tuning (RL-FT), for example with Proximal Policy Optimization (PPO), central to modern AI development. A recent research paper, titled “RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs,” delves into how these two stages reshape a model’s internal representations and its ability to handle tasks it has not explicitly seen during training, known as out-of-distribution (OOD) generalization.
Authored by Hangzhan Jin, Sicheng Lv, Sifan Wu, and Mohammad Hamdaqa from Polytechnique Montreal, Mila, McGill, and UDeM, this study revisits the interplay between SFT and RL-FT. Unlike previous observational studies, the researchers aimed for a deeper, mechanistic understanding of the parameter-level dynamics.
The Core Problem: SFT’s Double-Edged Sword
The paper highlights a consistent pattern across two popular open models, Llama-3.2-11B and Qwen-2.5-7B. Supervised Fine-Tuning, while excellent for specializing a model on in-distribution (ID) tasks, often leads to a phenomenon called “SFT forgetting”: the model’s OOD generalization peaks early in training and then degrades as SFT continues. The model overfits to its specific training data and loses part of its broader reasoning ability. For instance, Llama-3.2-11B saw its OOD performance drop by 48% after full SFT compared to its early-stage peak.
RL’s Role: Restoration, Not Creation
The research reveals that Reinforcement Learning Fine-Tuning primarily acts as a powerful corrective step. It doesn’t necessarily endow the LLM with fundamentally new capabilities but rather restores the OOD generalization lost during aggressive SFT. For Qwen-2.5-7B, RL recovered up to 99% of the lost OOD performance, and for Llama-3.2-11B, it recovered up to 85%. This restoration, however, comes with a slight trade-off: a small reduction in the model’s highly specialized ID accuracy. Importantly, this recovery has limits; if SFT pushes the model into severe overfitting, RL-FT can no longer fully restore its OOD performance.
The Mechanism: Singular Vector Rotations
To understand the underlying mechanisms, the researchers employed spectral analysis, specifically Singular Value Decomposition (SVD), on the model’s weight matrices. Contrary to some prior beliefs that emphasized the absolute size of singular values, this study found that the *directions* of singular vectors (how they rotate in the high-dimensional weight space) have a much larger impact on LLM performance than the singular values themselves. The singular values, which represent the importance of different representational modes, remained remarkably stable throughout both SFT and RL.
The shifts in singular vectors concentrate on the directions corresponding to the largest and smallest singular values, leaving the bulk of the spectrum almost intact. This suggests that the model’s intrinsic capacity remains largely unchanged, but its orientation in the feature space is redefined. Both SFT and RL adapt the network by rotating its singular vectors in similar ways, preserving core, low-index features while progressively redefining higher-index, fine-grained directions.
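To make this kind of analysis concrete, here is a minimal sketch of how such a spectral comparison could be set up. The names and mechanics are illustrative assumptions, not the authors’ code: it assumes you have one layer’s weight matrix from before and after fine-tuning, and it matches singular directions by index, which is a simplification.

```python
# Hypothetical sketch (not the paper's code) of the spectral comparison described
# above: decompose a weight matrix before and after fine-tuning, then measure how
# much the singular values drift versus how much the singular vectors rotate.
import numpy as np

def spectral_drift(W_before: np.ndarray, W_after: np.ndarray):
    U0, S0, Vt0 = np.linalg.svd(W_before, full_matrices=False)
    U1, S1, Vt1 = np.linalg.svd(W_after, full_matrices=False)

    # Relative change in singular values (magnitudes); the paper reports these stay stable.
    sv_drift = np.abs(S1 - S0) / (np.abs(S0) + 1e-8)

    # Per-index alignment of left singular vectors (directions).
    # |cos| near 1 -> direction preserved; near 0 -> rotated away.
    # abs() absorbs the sign ambiguity of SVD.
    direction_alignment = np.abs(np.sum(U0 * U1, axis=0))
    return sv_drift, direction_alignment

# Toy demo with a random "base" matrix and a perturbed "fine-tuned" copy.
rng = np.random.default_rng(0)
W_base = rng.standard_normal((128, 128))
W_tuned = W_base + 0.05 * rng.standard_normal((128, 128))
sv_drift, alignment = spectral_drift(W_base, W_tuned)
print("mean relative singular value drift:", sv_drift.mean())
print("alignment of the top 5 directions:", alignment[:5])
```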
Targeted Recovery: Low-Rank and Shallow Layers
The study also uncovered surprisingly effective and inexpensive recovery methods. Restoring the directions of singular vectors corresponding to the top 20% of singular values or the first 25% of layers can recover 70% to 80% of a model’s OOD performance. This suggests that generalizable, foundational knowledge is primarily encoded in these top-rank singular directions and shallower layers, while intermediate layers tend to specialize during SFT.
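As a rough illustration of what a low-rank “UV restore” of this kind could look like, the sketch below keeps the fine-tuned singular values but splices back the top-fraction of singular vector directions from a reference checkpoint. The function name, the choice of reference, and the per-matrix mechanics are assumptions for illustration, not the paper’s implementation.

```python
# Hypothetical sketch of a low-rank direction restore (assumed mechanics,
# not the paper's code): keep the fine-tuned singular values but restore the
# top-k singular vector directions from a reference checkpoint, e.g. an
# early-SFT or base model.
import numpy as np

def restore_top_directions(W_ref: np.ndarray, W_ft: np.ndarray, frac: float = 0.2) -> np.ndarray:
    U_ref, _, Vt_ref = np.linalg.svd(W_ref, full_matrices=False)
    U_ft, S_ft, Vt_ft = np.linalg.svd(W_ft, full_matrices=False)

    k = int(frac * len(S_ft))          # e.g. the top 20% of singular directions
    U_mix, Vt_mix = U_ft.copy(), Vt_ft.copy()
    U_mix[:, :k] = U_ref[:, :k]        # restore reference left directions
    Vt_mix[:k, :] = Vt_ref[:k, :]      # restore reference right directions

    # Rebuild the weights with mixed directions and fine-tuned magnitudes.
    return (U_mix * S_ft) @ Vt_mix
```

A shallow-layer variant of the same idea would apply this restore only to the weight matrices in roughly the first quarter of the network’s layers.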
A causal validation experiment further solidified these findings: forcing a high-performing RL-tuned model to adopt the geometric orientation of a poorly-generalizing SFT model caused a significant drop in OOD accuracy. This unequivocally demonstrates that the specific vector directions found by RL are essential for its success and are fundamentally different from those settled upon by SFT.
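Under the same assumptions, this causal probe can be expressed as a usage of the helper sketched above: rebuild a layer of the RL-tuned model with the SFT model’s directions and re-run the OOD evaluation. Here `W_sft` and `W_rl` are hypothetical stand-ins for the corresponding checkpoint weight matrices.

```python
# Hypothetical usage of restore_top_directions from the previous sketch:
# force the RL-tuned weights to adopt the SFT model's singular vector
# orientation while keeping the RL-tuned singular values, then re-evaluate
# OOD accuracy on the resulting model.
W_probe = restore_top_directions(W_ref=W_sft, W_ft=W_rl, frac=1.0)
```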
Practical Implications
The findings reconcile prior reports of RL’s superior OOD performance, clarifying that RL primarily counteracts SFT-induced directional drift to reduce catastrophic forgetting rather than discovering fundamentally new solutions. This spectrum-aware analysis highlights inexpensive recovery knobs, such as low-rank UV merging and shallow-layer resets, that practitioners can employ before resorting to costly RL fine-tuning. For more in-depth details, you can read the full research paper here.


