TLDR: Proximal Supervised Fine-Tuning (PSFT) is a novel method for fine-tuning large language models (LLMs) that addresses the common issues of poor generalization and reduced exploration capacity associated with standard Supervised Fine-Tuning (SFT). Inspired by reinforcement learning techniques, PSFT introduces a ‘trust-region’ mechanism to constrain policy updates, preventing overfitting and entropy collapse. Experiments demonstrate that PSFT maintains competitive in-domain performance while significantly improving out-of-domain generalization and providing a more robust foundation for subsequent optimization stages like Reinforcement Learning and Direct Preference Optimization.
Supervised Fine-Tuning (SFT) is a common technique used to adapt large foundation models for specific tasks or domains. While efficient and straightforward, SFT often faces challenges such as poor generalization, where models lose their broader capabilities after being fine-tuned on new data. This limitation can also lead to a reduced ability for the model to explore new solutions, a phenomenon sometimes referred to as ‘entropy collapse’.
Inspired by advanced reinforcement learning (RL) algorithms like Trust-Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), researchers have introduced a new approach called Proximal Supervised Fine-Tuning (PSFT). This method aims to overcome the shortcomings of traditional SFT by incorporating a ‘trust-region’ mechanism, similar to those used in RL, to carefully control how much the model’s ‘policy’ (its decision-making process) changes during fine-tuning. By doing so, PSFT seeks to stabilize the optimization process, improve generalization, and maintain the model’s capacity for exploration.
How PSFT Works
At its core, PSFT reinterprets SFT as a specific type of policy gradient method in which the model learns from a fixed dataset of ‘correct’ actions. Building on this view, PSFT introduces a ‘clipped surrogate objective’, a mathematical formula that limits how drastically the model’s predictions can change from one training step to the next. The clipping acts like a soft boundary that keeps each update within an acceptable range, preventing the model from making overly confident or destructive changes to its internal workings and thereby preserving its existing knowledge and general capabilities.
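To make the idea concrete, here is a minimal PyTorch sketch of what such a clipped objective over SFT target tokens could look like. The function name psft_loss, the clip_eps value of 0.2, and the per-token log-probability inputs are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a PSFT-style clipped objective (illustrative, not the
# paper's reference implementation).
import torch

def psft_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss over the target tokens of the SFT dataset.

    logp_new: log-probabilities of the target tokens under the current policy.
    logp_old: log-probabilities of the same tokens under the policy before
              the update (treated as constants, i.e. detached).
    """
    # Probability ratio between the current and the old policy, as in PPO.
    ratio = torch.exp(logp_new - logp_old.detach())

    # In SFT every demonstrated token is treated as a 'correct' action,
    # which corresponds to a fixed positive advantage of 1.
    unclipped = ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Maximizing the pessimistic (minimum) surrogate keeps each update
    # inside the trust region; its negation is returned as a loss.
    return -torch.min(unclipped, clipped).mean()
```

While a token’s ratio stays inside the clip range, the gradient closely matches an ordinary likelihood (SFT) update on that token; once the ratio exceeds the upper bound, the clipped term takes over and that token’s gradient is cut off, which is what keeps individual updates bounded.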
The paper also suggests that an initial ‘warm-up’ phase using standard SFT can further enhance PSFT’s performance, helping the model to better align with the training data before the trust-region constraints are fully applied.
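The exact schedule is not spelled out here, but one plausible way to wire the warm-up and the clipped objective together is sketched below. The names model.token_logprobs, warmup_steps, and inner_epochs are hypothetical helpers for illustration; the clipped loss mirrors the psft_loss sketch above.

```python
# Illustrative training-loop skeleton: plain SFT warm-up, then PSFT-style
# clipped updates. Helper names and hyperparameter values are assumptions.
import torch

def train(model, batches, optimizer,
          warmup_steps: int = 500, inner_epochs: int = 2, clip_eps: float = 0.2):
    for step, batch in enumerate(batches):
        if step < warmup_steps:
            # Warm-up: ordinary SFT, i.e. maximize likelihood of the demonstrations.
            loss = -model.token_logprobs(batch).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            continue

        # Snapshot the 'old' policy's log-probs on this batch before updating.
        with torch.no_grad():
            logp_old = model.token_logprobs(batch)

        # A few passes over the same batch, PPO-style; the clip keeps each
        # update close to the snapshot taken above.
        for _ in range(inner_epochs):
            logp_new = model.token_logprobs(batch)
            ratio = torch.exp(logp_new - logp_old)
            clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
            loss = -torch.min(ratio, clipped).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```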
Key Advantages and Findings
Experiments conducted across diverse domains, including mathematical reasoning and human-value alignment, highlight several significant benefits of PSFT:
- Improved Generalization: PSFT consistently outperforms standard SFT in out-of-domain generalization, meaning models fine-tuned with PSFT are better at handling tasks they haven’t been specifically trained on.
- Stable Training: Unlike SFT, which can show sharp declines in entropy (indicating overfitting), PSFT maintains a smoother entropy curve throughout training. This stability prevents ‘entropy collapse’ and allows for more prolonged, effective fine-tuning (a simple way of monitoring this entropy curve is sketched after this list).
- Better Foundation for Post-Training: PSFT-tuned models serve as a superior starting point for subsequent optimization stages, such as Reinforcement Learning (RL) or Direct Preference Optimization (DPO). Models initialized with PSFT show greater potential for exploration and achieve better ultimate performance in these later stages.
- Reduced Alignment Tax: In human alignment tasks, PSFT effectively reduces the ‘alignment tax’ – the trade-off where models lose general capabilities when aligned to specific human values. PSFT helps models maintain their broad abilities while still achieving alignment.
- Robustness Across Models: PSFT demonstrates consistent improvements across different base models, showcasing its broad applicability.
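For the entropy curves mentioned above, the quantity typically tracked is the average entropy of the model’s next-token distribution. A diagnostic sketch of how one might log it during training follows; the helper name and tensor shapes are assumptions for illustration.

```python
# Diagnostic sketch for tracking policy entropy during fine-tuning; a curve
# that drops sharply is the 'entropy collapse' symptom described above.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Average per-token entropy of the next-token distribution.

    logits: tensor of shape (batch, seq_len, vocab_size) from a forward pass.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)   # shape: (batch, seq_len)
    return entropy.mean().item()
```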
The research also provides insights into the types of tokens (words or sub-words) that are most affected by PSFT’s clipping mechanism. These often include uncertain words or phrases that represent ‘long thinking patterns,’ which are crucial for complex reasoning and are gradually and smoothly learned by PSFT without disrupting the model’s general knowledge.
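To see which tokens are actually being constrained, a purely diagnostic sketch along the following lines could flag the positions whose probability ratio leaves the trust region; the function name and the clip_eps value are again illustrative assumptions.

```python
# Diagnostic sketch: mark target tokens whose probability ratio left the
# trust region, using the same ratio definition as the loss sketches above.
import torch

def clipped_token_mask(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    """Boolean mask of tokens whose ratio falls outside [1 - eps, 1 + eps]."""
    ratio = torch.exp(logp_new - logp_old)
    return (ratio < 1.0 - clip_eps) | (ratio > 1.0 + clip_eps)
```

Decoding the masked positions back to text is one way to inspect whether the clipped tokens correspond to the uncertain, ‘long thinking’ phrases described above.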
In conclusion, Proximal Supervised Fine-Tuning offers a promising alternative to traditional SFT, providing a more stable, generalizable, and robust method for adapting foundation models. By drawing inspiration from reinforcement learning, PSFT ensures that models not only perform well on target tasks but also retain and enhance their broader reasoning and exploration capabilities. You can read the full research paper here: Proximal Supervised Fine-Tuning.


