TLDR: This research paper investigates the impact of Supervised Fine-Tuning (SFT) and Reinforcement Learning Fine-Tuning (RL-FT) on Large Language Models’ (LLMs) out-of-distribution (OOD) generalization. It reveals that SFT initially improves OOD performance but then causes it to degrade due to overfitting. RL-FT primarily acts as a “memory restorer,” recovering most of this lost OOD generalization by re-aligning the model’s internal representations. The study’s key finding is that changes in the *directions* of singular vectors (rotations) within the model’s weight matrices are far more critical for performance than changes in their *magnitudes* (singular values). The paper also identifies inexpensive recovery methods, such as low-rank or shallow-layer resets, as effective alternatives to costly RL-FT.
Large Language Models (LLMs) have become ubiquitous, but training them from scratch is a monumental task. This makes post-training methods such as Supervised Fine-Tuning (SFT) and Reinforcement Learning Fine-Tuning (RL-FT), for example with Proximal Policy Optimization (PPO), central to modern AI development. A recent research paper, titled “RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs,” delves into how these two stages reshape a model’s internal representations and its ability to handle tasks it has not explicitly seen during training, known as out-of-distribution (OOD) generalization.
Authored by Hangzhan Jin, Sicheng Lv, Sifan Wu, and Mohammad Hamdaqa from Polytechnique Montreal, Mila, McGill, and UDeM, this study revisits the interplay between SFT and RL-FT. Unlike previous observational studies, the researchers aimed for a deeper, mechanistic understanding of the parameter-level dynamics.
The Core Problem: SFT’s Double-Edged Sword
The paper highlights a consistent pattern across two popular open models, Llama-3.2-11B and Qwen-2.5-7B. Supervised Fine-Tuning, while excellent for specializing a model on in-distribution (ID) tasks, often leads to a phenomenon called “SFT forgetting”: the model’s OOD generalization peaks early in training and then degrades as SFT continues. The model overfits to its specific training data and loses part of its broader reasoning ability. For instance, Llama-3.2-11B saw its OOD performance drop by 48% after full SFT compared to its early-stage peak.
RL’s Role: Restoration, Not Creation
The research reveals that Reinforcement Learning Fine-Tuning primarily acts as a powerful corrective step. It doesn’t necessarily endow the LLM with fundamentally new capabilities but rather restores the OOD generalization lost during aggressive SFT. For Qwen-2.5-7B, RL recovered up to 99% of the lost OOD performance, and for Llama-3.2-11B, it recovered up to 85%. This restoration, however, comes with a slight trade-off: a small reduction in the model’s highly specialized ID accuracy. Importantly, this recovery has limits; if SFT pushes the model into severe overfitting, RL-FT can no longer fully restore its OOD performance.
The Mechanism: Singular Vector Rotations
To understand the underlying mechanisms, the researchers employed spectral analysis, specifically Singular Value Decomposition (SVD), on the model’s weight matrices. Contrary to some prior beliefs that emphasized the absolute size of singular values, this study found that the *directions* of singular vectors (how they rotate in the high-dimensional weight space) have a much larger impact on LLM performance than the singular values themselves. The singular values, which represent the importance of different representational modes, remained remarkably stable throughout both SFT and RL.
The shifts in singular vectors concentrate on the directions corresponding to the largest and smallest singular values, leaving the bulk of the spectrum almost intact. This suggests that the model’s intrinsic capacity remains largely unchanged, but its orientation in the feature space is redefined. Both SFT and RL adapt the network by rotating its singular vectors in similar ways, preserving core, low-index features while progressively redefining higher-index, fine-grained directions.
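To make this kind of analysis concrete, here is a minimal sketch of how such a spectral comparison could be set up. The names and mechanics are illustrative assumptions, not the authors’ code: it assumes you have one layer’s weight matrix from before and after fine-tuning, and it matches singular directions by index, which is a simplification.

```python
# Hypothetical sketch (not the paper's code) of the spectral comparison described
# above: decompose a weight matrix before and after fine-tuning, then measure how
# much the singular values drift versus how much the singular vectors rotate.
import numpy as np

def spectral_drift(W_before: np.ndarray, W_after: np.ndarray):
    U0, S0, Vt0 = np.linalg.svd(W_before, full_matrices=False)
    U1, S1, Vt1 = np.linalg.svd(W_after, full_matrices=False)

    # Relative change in singular values (magnitudes); the paper reports these stay stable.
    sv_drift = np.abs(S1 - S0) / (np.abs(S0) + 1e-8)

    # Per-index alignment of left singular vectors (directions).
    # |cos| near 1 -> direction preserved; near 0 -> rotated away.
    # abs() absorbs the sign ambiguity of SVD.
    direction_alignment = np.abs(np.sum(U0 * U1, axis=0))
    return sv_drift, direction_alignment

# Toy demo with a random "base" matrix and a perturbed "fine-tuned" copy.
rng = np.random.default_rng(0)
W_base = rng.standard_normal((128, 128))
W_tuned = W_base + 0.05 * rng.standard_normal((128, 128))
sv_drift, alignment = spectral_drift(W_base, W_tuned)
print("mean relative singular value drift:", sv_drift.mean())
print("alignment of the top 5 directions:", alignment[:5])
```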
Targeted Recovery: Low-Rank and Shallow Layers
The study also uncovered surprisingly effective and inexpensive recovery methods. Restoring the directions of singular vectors corresponding to the top 20% of singular values or the first 25% of layers can recover 70% to 80% of a model’s OOD performance. This suggests that generalizable, foundational knowledge is primarily encoded in these top-rank singular directions and shallower layers, while intermediate layers tend to specialize during SFT.
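As a rough illustration of what a low-rank “UV restore” of this kind could look like, the sketch below keeps the fine-tuned singular values but splices back the top-fraction of singular vector directions from a reference checkpoint. The function name, the choice of reference, and the per-matrix mechanics are assumptions for illustration, not the paper’s implementation.

```python
# Hypothetical sketch of a low-rank direction restore (assumed mechanics,
# not the paper's code): keep the fine-tuned singular values but restore the
# top-k singular vector directions from a reference checkpoint, e.g. an
# early-SFT or base model.
import numpy as np

def restore_top_directions(W_ref: np.ndarray, W_ft: np.ndarray, frac: float = 0.2) -> np.ndarray:
    U_ref, _, Vt_ref = np.linalg.svd(W_ref, full_matrices=False)
    U_ft, S_ft, Vt_ft = np.linalg.svd(W_ft, full_matrices=False)

    k = int(frac * len(S_ft))          # e.g. the top 20% of singular directions
    U_mix, Vt_mix = U_ft.copy(), Vt_ft.copy()
    U_mix[:, :k] = U_ref[:, :k]        # restore reference left directions
    Vt_mix[:k, :] = Vt_ref[:k, :]      # restore reference right directions

    # Rebuild the weights with mixed directions and fine-tuned magnitudes.
    return (U_mix * S_ft) @ Vt_mix
```

A shallow-layer variant of the same idea would apply this restore only to the weight matrices in roughly the first quarter of the network’s layers.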
A causal validation experiment further solidified these findings: forcing a high-performing RL-tuned model to adopt the geometric orientation of a poorly-generalizing SFT model caused a significant drop in OOD accuracy. This unequivocally demonstrates that the specific vector directions found by RL are essential for its success and are fundamentally different from those settled upon by SFT.
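Under the same assumptions, this causal probe can be expressed as a usage of the helper sketched above: rebuild a layer of the RL-tuned model with the SFT model’s directions and re-run the OOD evaluation. Here `W_sft` and `W_rl` are hypothetical stand-ins for the corresponding checkpoint weight matrices.

```python
# Hypothetical usage of restore_top_directions from the previous sketch:
# force the RL-tuned weights to adopt the SFT model's singular vector
# orientation while keeping the RL-tuned singular values, then re-evaluate
# OOD accuracy on the resulting model.
W_probe = restore_top_directions(W_ref=W_sft, W_ft=W_rl, frac=1.0)
```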
Practical Implications
The findings reconcile prior reports of RL’s superior OOD performance, clarifying that RL primarily counteracts SFT-induced directional drift to reduce catastrophic forgetting rather than discovering fundamentally new solutions. This spectrum-aware analysis highlights inexpensive recovery knobs, such as low-rank UV merging and shallow-layer resets, that practitioners can employ before resorting to costly RL fine-tuning. For more in-depth details, you can read the full research paper here.


