TLDR: This research compares GRPO (Reinforcement Learning) and Supervised Fine-Tuning (SFT) for training large language models on reasoning tasks. It finds that GRPO modestly improves existing capabilities with less out-of-domain impact, while SFT yields stronger in-domain gains but significantly degrades performance on other tasks. The study also reveals that SFT causes more substantial internal model changes, particularly in mid-layer components, which might explain its out-of-domain performance drops. Attempts to mitigate this degradation by freezing model parts were inconclusive.
Training large language models (LLMs) to excel at complex reasoning tasks, particularly in mathematics and coding, has become a significant area of focus in AI research. Two prominent methods for this post-training phase are reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), and supervised fine-tuning (SFT). While both aim to enhance reasoning, their internal dynamics and effects on model capabilities have remained largely unexplored until now.
A recent study, titled “Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them,” delves into a comparative analysis of GRPO and SFT. The researchers, Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, and Ivan Titov, meticulously designed experiments to minimize confounding variables, using the same base model (OLMo-2-1124-7B-Instruct), identical maths problems, and similar hyperparameters for both training approaches.
The findings reveal a distinct trade-off between the two methods. GRPO, while computationally expensive and sometimes unstable to train, resulted in modest improvements on in-domain maths problems. Crucially, it caused only slight degradation in performance on knowledge-intensive benchmarks like MMLU. This suggests that GRPO primarily amplifies the existing capabilities of the base model, refining its ability to produce correct outputs that it was already somewhat capable of generating.
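For intuition, here is a minimal PyTorch sketch of the group-relative advantage and clipped policy loss that give GRPO its name: several completions are sampled per prompt, and each completion's reward is normalised against its own group before a PPO-style update. This is an illustrative simplification (sequence-level log-probabilities, no KL penalty), not the authors' training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalise each completion's reward against
    the mean/std of the other completions sampled for the same prompt.

    rewards: (num_prompts, group_size) tensor of scalar rewards.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO-style surrogate at the sequence level.

    logp_new / logp_old: (num_prompts, group_size) summed log-probabilities of
    each completion under the current and sampling policies.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example: 2 prompts, 4 sampled completions each, binary correctness reward.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Because the advantage is centred within each group, GRPO only nudges the policy towards completions the model can already produce, which matches the paper's "amplifies existing capabilities" framing.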
In contrast, SFT proved to be much more stable and cost-effective during training. However, its impact on model capabilities was more pronounced and double-edged. SFT led to greater gains on in-domain maths tasks but also caused more significant degradation on out-of-domain, knowledge-intensive benchmarks. The researchers hypothesize that SFT tends to replace old skills with new ones, leading to a trade-off where specialized performance comes at the cost of broader knowledge retention.
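By contrast, the SFT objective is plain next-token cross-entropy on worked solutions. The sketch below assumes the common setup in which prompt tokens are masked out of the loss with an ignore index; the paper's exact masking and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
    """Supervised fine-tuning loss: next-token cross-entropy on the
    demonstration solution, with prompt positions masked via ignore_index.

    logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) token ids, prompt positions set to -100.
    """
    # Shift so that position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```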
To understand these differences, the study examined how the model's parameters changed across training checkpoints. Both GRPO and SFT modified the query and key weights within the attention heads the most, but SFT consistently produced much larger updates to these parameters than GRPO. SFT also substantially altered the mid-layer multi-layer perceptrons (MLPs), which are known to be crucial for storing factual associations and memorized knowledge. The researchers therefore hypothesize that these larger mid-layer MLP updates during SFT may be responsible for the observed degradation on knowledge-intensive tasks.
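A simple way to reproduce this kind of parameter-level comparison is to diff the weight matrices of two checkpoints and average the relative change per parameter type. The sketch below assumes checkpoints saved as plain state_dicts with Hugging Face-style parameter names (q_proj, up_proj, and so on); it illustrates the idea rather than the authors' exact analysis pipeline.

```python
import torch
from collections import defaultdict

def checkpoint_deltas(base_ckpt: str, tuned_ckpt: str):
    """Relative L2 change of each weight matrix between two checkpoints,
    grouped by parameter type (attention projections, MLP matrices, ...).

    base_ckpt / tuned_ckpt are hypothetical paths to state_dict files; the
    name patterns below assume the usual Hugging Face naming scheme.
    """
    base = torch.load(base_ckpt, map_location="cpu")
    tuned = torch.load(tuned_ckpt, map_location="cpu")

    groups = defaultdict(list)
    for name, w0 in base.items():
        if name not in tuned or w0.ndim < 2:
            continue  # skip biases, norms, and anything missing from the tuned model
        rel_change = (tuned[name].float() - w0.float()).norm() / (w0.float().norm() + 1e-12)
        for key in ("q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"):
            if key in name:
                groups[key].append(rel_change.item())
    # Average relative change per parameter type.
    return {k: sum(v) / len(v) for k, v in groups.items() if v}
```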
Inspired by these insights, the researchers explored whether freezing certain parts of the model during SFT could mitigate the loss of factual knowledge. They experimented with freezing MLP matrices and, separately, training only the query and key matrices. The results were largely inconclusive. While freezing MLPs showed some benefits, such as improved performance on GPQA:Diamond, it underperformed on other benchmarks. Training only query and key matrices led to a general degradation across most benchmarks. This indicates that while parameter-level analysis provides valuable insights, directly applying these insights through freezing mechanisms is complex and requires further research.
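In practice, such ablations are usually implemented by toggling requires_grad on parameter groups before training. A minimal sketch, again assuming Hugging Face-style parameter names rather than the authors' exact configuration:

```python
def apply_freezing(model, mode: str):
    """Select which parameter groups receive gradients during SFT.

    mode='freeze_mlp' -> train everything except the MLP matrices.
    mode='qk_only'    -> train only the query and key projections.
    """
    mlp_keys = ("up_proj", "down_proj", "gate_proj")
    qk_keys = ("q_proj", "k_proj")
    for name, param in model.named_parameters():
        if mode == "freeze_mlp":
            param.requires_grad = not any(k in name for k in mlp_keys)
        elif mode == "qk_only":
            param.requires_grad = any(k in name for k in qk_keys)
        else:
            raise ValueError(f"unknown mode: {mode}")
```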
In conclusion, this research provides a preliminary yet significant understanding of how GRPO and SFT differentially impact large language models. GRPO appears to act like a ‘scalpel,’ subtly amplifying existing skills, while SFT behaves more like a ‘hammer,’ making more drastic changes that replace old capabilities with new ones. The study highlights the need for further investigation into these training dynamics to better balance specialized reasoning capabilities with general knowledge retention in future LLM development. You can read the full research paper here: Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them.