TLDR: This research compares GRPO (Reinforcement Learning) and Supervised Fine-Tuning (SFT) for training large language models on reasoning tasks. It finds that GRPO modestly improves existing capabilities with less out-of-domain impact, while SFT yields stronger in-domain gains but significantly degrades performance on other tasks. The study also reveals that SFT causes more substantial internal model changes, particularly in mid-layer components, which might explain its out-of-domain performance drops. Attempts to mitigate this degradation by freezing model parts were inconclusive.
Training large language models (LLMs) to excel at complex reasoning tasks, particularly in mathematics and coding, has become a significant area of focus in AI research. Two prominent methods for this post-training phase are reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), and supervised fine-tuning (SFT). While both aim to enhance reasoning, their internal dynamics and effects on model capabilities have remained largely unexplored until now.
A recent study, titled “Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them,” delves into a comparative analysis of GRPO and SFT. The researchers, Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, and Ivan Titov, meticulously designed experiments to minimize confounding variables, using the same base model (OLMo-2-1124-7B-Instruct), identical maths problems, and similar hyperparameters for both training approaches.
The findings reveal a distinct trade-off between the two methods. GRPO, while computationally expensive and sometimes unstable to train, resulted in modest improvements on in-domain maths problems. Crucially, it caused only slight degradation in performance on knowledge-intensive benchmarks like MMLU. This suggests that GRPO primarily amplifies the existing capabilities of the base model, refining its ability to produce correct outputs that it was already somewhat capable of generating.
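For intuition, here is a minimal PyTorch sketch of the group-relative advantage and clipped policy loss that give GRPO its name: several completions are sampled per prompt, and each completion's reward is normalised against its own group before a PPO-style update. This is an illustrative simplification (sequence-level log-probabilities, no KL penalty), not the authors' training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalise each completion's reward against
    the mean/std of the other completions sampled for the same prompt.

    rewards: (num_prompts, group_size) tensor of scalar rewards.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO-style surrogate at the sequence level.

    logp_new / logp_old: (num_prompts, group_size) summed log-probabilities of
    each completion under the current and sampling policies.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example: 2 prompts, 4 sampled completions each, binary correctness reward.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Because the advantage is centred within each group, GRPO only nudges the policy towards completions the model can already produce, which matches the paper's "amplifies existing capabilities" framing.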
In contrast, SFT proved to be much more stable and cost-effective during training. However, its impact on model capabilities was more pronounced and double-edged. SFT led to greater gains on in-domain maths tasks but also caused more significant degradation on out-of-domain, knowledge-intensive benchmarks. The researchers hypothesize that SFT tends to replace old skills with new ones, leading to a trade-off where specialized performance comes at the cost of broader knowledge retention.
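By contrast, the SFT objective is plain next-token cross-entropy on worked solutions. The sketch below assumes the common setup in which prompt tokens are masked out of the loss with an ignore index; the paper's exact masking and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
    """Supervised fine-tuning loss: next-token cross-entropy on the
    demonstration solution, with prompt positions masked via ignore_index.

    logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) token ids, prompt positions set to -100.
    """
    # Shift so that position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```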
To understand these differences, the study examined how the model's parameters changed across training checkpoints. Both GRPO and SFT modified the query and key weights within the attention heads the most, but SFT consistently produced much larger updates to these parameters than GRPO. SFT also substantially altered the mid-layer multi-layer perceptrons (MLPs), which are known to be crucial for storing factual associations and memorized knowledge. The researchers therefore hypothesize that these larger mid-layer MLP updates during SFT may be responsible for the observed degradation on knowledge-intensive tasks.
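A simple way to reproduce this kind of parameter-level comparison is to diff the weight matrices of two checkpoints and average the relative change per parameter type. The sketch below assumes checkpoints saved as plain state_dicts with Hugging Face-style parameter names (q_proj, up_proj, and so on); it illustrates the idea rather than the authors' exact analysis pipeline.

```python
import torch
from collections import defaultdict

def checkpoint_deltas(base_ckpt: str, tuned_ckpt: str):
    """Relative L2 change of each weight matrix between two checkpoints,
    grouped by parameter type (attention projections, MLP matrices, ...).

    base_ckpt / tuned_ckpt are hypothetical paths to state_dict files; the
    name patterns below assume the usual Hugging Face naming scheme.
    """
    base = torch.load(base_ckpt, map_location="cpu")
    tuned = torch.load(tuned_ckpt, map_location="cpu")

    groups = defaultdict(list)
    for name, w0 in base.items():
        if name not in tuned or w0.ndim < 2:
            continue  # skip biases, norms, and anything missing from the tuned model
        rel_change = (tuned[name].float() - w0.float()).norm() / (w0.float().norm() + 1e-12)
        for key in ("q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"):
            if key in name:
                groups[key].append(rel_change.item())
    # Average relative change per parameter type.
    return {k: sum(v) / len(v) for k, v in groups.items() if v}
```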
Inspired by these insights, the researchers explored whether freezing certain parts of the model during SFT could mitigate the loss of factual knowledge. They experimented with freezing MLP matrices and, separately, training only the query and key matrices. The results were largely inconclusive. While freezing MLPs showed some benefits, such as improved performance on GPQA:Diamond, it underperformed on other benchmarks. Training only query and key matrices led to a general degradation across most benchmarks. This indicates that while parameter-level analysis provides valuable insights, directly applying these insights through freezing mechanisms is complex and requires further research.
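In practice, such ablations are usually implemented by toggling requires_grad on parameter groups before training. A minimal sketch, again assuming Hugging Face-style parameter names rather than the authors' exact configuration:

```python
def apply_freezing(model, mode: str):
    """Select which parameter groups receive gradients during SFT.

    mode='freeze_mlp' -> train everything except the MLP matrices.
    mode='qk_only'    -> train only the query and key projections.
    """
    mlp_keys = ("up_proj", "down_proj", "gate_proj")
    qk_keys = ("q_proj", "k_proj")
    for name, param in model.named_parameters():
        if mode == "freeze_mlp":
            param.requires_grad = not any(k in name for k in mlp_keys)
        elif mode == "qk_only":
            param.requires_grad = any(k in name for k in qk_keys)
        else:
            raise ValueError(f"unknown mode: {mode}")
```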
In conclusion, this research provides a preliminary yet significant understanding of how GRPO and SFT differentially impact large language models. GRPO appears to act like a ‘scalpel,’ subtly amplifying existing skills, while SFT behaves more like a ‘hammer,’ making more drastic changes that replace old capabilities with new ones. The study highlights the need for further investigation into these training dynamics to better balance specialized reasoning capabilities with general knowledge retention in future LLM development. You can read the full research paper here: Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them.