
Unpacking LLM Fine-Tuning: How Reinforcement Learning Restores Lost Reasoning Abilities

TLDR: This research paper investigates the two-stage fine-tuning process of Large Language Models (LLMs): Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). It reveals that SFT can lead to ‘OOD forgetting,’ where out-of-distribution reasoning performance declines after an early peak. The study finds that RL doesn’t create new OOD capabilities but rather restores, within certain boundaries, the reasoning ability lost during SFT. Through SVD analysis, the authors discover that this forgetting and restoration correlate with the rotation of singular vectors in parameter matrices, rather than changes in singular values, suggesting SFT performs a hard alignment while RL softly re-aligns for robustness.

Large Language Models (LLMs) have become incredibly powerful, and a common way to make them even better for specific tasks is through a two-stage fine-tuning process. This typically involves Supervised Fine-Tuning (SFT) first, followed by Reinforcement Learning (RL). While this approach has shown great success in improving reasoning abilities, the exact mechanisms behind how SFT and RL work together have been a bit of a mystery.

A recent research paper, titled “RL Fine-Tuning Heals OOD Forgetting in SFT,” delves deep into this synergy, challenging some long-held beliefs and uncovering new insights into how LLMs learn and generalize. The authors, Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa, conducted a detailed analysis using models like LLaMA-3.2-11B and Qwen-2.5-7B.

Challenging the Old Adage: “SFT Memorizes, RL Generalizes”

The popular saying that “SFT memorizes, RL generalizes” has been a simplified view of the fine-tuning process. This paper reveals a more nuanced picture. The researchers found that during the SFT stage, the model’s ability to reason on Out-Of-Distribution (OOD) tasks – meaning tasks slightly different from what it was explicitly trained on – actually peaks early on and then starts to decline. This phenomenon is termed “OOD forgetting.” What’s more, this decline isn’t easily spotted by looking at traditional training or test loss metrics, which continue to decrease.

This means that if you stop SFT too late, the model might have already lost some of its valuable OOD reasoning capacity. The best SFT checkpoint for OOD performance is often missed if only in-distribution metrics are monitored.
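The practical fix is to evaluate a held-out OOD set at every SFT checkpoint instead of trusting the loss curve. Here is a minimal sketch of that checkpoint sweep; the `id_eval` and `ood_eval` callables are hypothetical stand-ins for your own evaluation harness, not code from the paper.

```python
# Minimal sketch: track in-distribution loss AND held-out OOD accuracy per SFT
# checkpoint, so the early OOD peak isn't silently missed. `id_eval` and
# `ood_eval` are assumed evaluation callables, not the paper's tooling.

def select_best_ood_checkpoint(checkpoints, id_eval, ood_eval):
    """Pick the checkpoint with the best OOD score, not the lowest ID loss."""
    best_ckpt, best_ood = None, float("-inf")
    for ckpt in checkpoints:
        id_loss = id_eval(ckpt)   # keeps decreasing, so SFT looks "better"
        ood_acc = ood_eval(ckpt)  # peaks early, then declines (OOD forgetting)
        print(f"{ckpt}: id_loss={id_loss:.4f} ood_acc={ood_acc:.3f}")
        if ood_acc > best_ood:
            best_ckpt, best_ood = ckpt, ood_acc
    return best_ckpt
```

The point of the sketch is simply that the selection criterion is the OOD score, which moves in a different direction from the loss the trainer reports.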

RL’s Role: Restoration, Not Creation

The subsequent RL stage, often seen as the magic bullet for generalization, doesn’t actually generate fundamentally new OOD capabilities. Instead, the paper highlights that RL plays an “OOD restoration” role. It helps recover the reasoning ability that was lost during the later stages of SFT. This recovery, however, isn’t limitless. There’s a clear boundary: if SFT is either too short or too long, RL cannot effectively bring back the lost OOD ability.

Essentially, RL acts as an automatic way to mitigate OOD forgetting, saving researchers from having to manually find the perfect SFT stopping point. It fine-tunes the model to a more robust configuration, healing the forgetting and learning downstream tasks simultaneously.
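To make the second stage concrete, here is a deliberately simplified, REINFORCE-style sketch of RL fine-tuning applied on top of a chosen SFT checkpoint. The `policy`, `sample_fn`, and `reward_fn` names are illustrative assumptions (e.g., a verifiable answer-correctness reward), not the paper's training setup.

```python
# REINFORCE-style sketch of the RL stage on a PyTorch-style policy/optimizer.
# All callables are hypothetical placeholders, not the paper's code.
def rl_finetune(policy, optimizer, prompts, sample_fn, reward_fn, steps=1000):
    for _ in range(steps):
        for prompt in prompts:
            # Roll out the current policy, keeping the log-prob's autograd graph.
            completion, logprob = sample_fn(policy, prompt)
            reward = reward_fn(prompt, completion)  # scalar, e.g. 1.0 if correct
            loss = -reward * logprob                # policy-gradient objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```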

The Underlying Mechanism: Singular Vector Rotation

To understand *why* this forgetting and restoration happens, the researchers employed Singular Value Decomposition (SVD) analysis on the parameter matrices of the LLMs. Contrary to a common belief that changes in model capacity are mainly due to shifts in singular values, this study found that singular values remain quite stable throughout the fine-tuning process.

Instead, the key factor correlating with OOD behavior is the “rotation of singular vectors.” SFT performs a “hard alignment” of crucial parameter directions to the target tasks, leading to rapid but sometimes greedy adjustments and quick forgetting. RL, on the other hand, “conditionally re-aligns singular vectors softly and slowly” towards a more robust configuration. This soft re-alignment is what helps heal the OOD forgetting.
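This distinction is easy to probe numerically. The sketch below is a reconstruction under my own assumptions, not the authors' released code: it compares a weight matrix before and after a fine-tuning stage, measuring singular-value drift (spectrum change) separately from principal angles between the top-k left singular subspaces (rotation).

```python
# Sketch of an SVD diagnostic separating singular-value drift from
# singular-vector rotation for a weight matrix W0 (before) vs W1 (after).
import torch

def svd_drift(W0: torch.Tensor, W1: torch.Tensor, k: int = 32):
    U0, S0, _ = torch.linalg.svd(W0, full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W1, full_matrices=False)

    # Singular-value drift: relative change in the top-k spectrum.
    value_drift = ((S1[:k] - S0[:k]).norm() / S0[:k].norm()).item()

    # Singular-vector rotation: principal angles between top-k left subspaces.
    # cos(theta_i) are the singular values of U0[:, :k]^T @ U1[:, :k].
    cosines = torch.linalg.svdvals(U0[:, :k].T @ U1[:, :k]).clamp(max=1.0)
    mean_angle = torch.rad2deg(torch.acos(cosines)).mean().item()
    return value_drift, mean_angle
```

Under the paper's account, a diagnostic along these lines would show the first number staying small throughout fine-tuning while the rotation angle grows during late SFT and is partially unwound by RL.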

The paper provides a fresh perspective on the roles of SFT and RL, identifying the rotation of singular vectors as a critical mechanism in how LLMs evolve during fine-tuning. This understanding could lead to more effective and robust fine-tuning strategies in the future.

For more technical details and experimental results, see the full research paper, “RL Fine-Tuning Heals OOD Forgetting in SFT.”

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
