TLDR: Probability Smoothing Policy Optimisation (PSPO) is a new method for training large language models (LLMs) with reinforcement learning. It replaces traditional ‘clipping’ by smoothing the current policy’s probabilities towards an older policy, creating a ‘soft trust region’ and preserving gradient information that clipping discards. This approach yields significant performance improvements (over 20 percentage points on GSM8K) and produces clearer, more coherent responses than standard clipping methods, without adding computational overhead.
Training large language models (LLMs) with reinforcement learning (RL) methods like PPO and GRPO is common practice, but it faces a significant challenge: keeping policy updates stable. The standard remedy is ‘ratio clipping’, which does keep training stable but has downsides: it discards valuable gradient information and introduces abrupt changes in the learning signal, known as gradient discontinuities.
A new research paper, “It’s Not You, It’s Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL”, introduces an innovative alternative called Probability Smoothing Policy Optimisation (PSPO). This method aims to overcome the limitations of clipping by smoothing the current policy’s probabilities towards the old (behavior) policy before calculating the importance ratio. This approach is similar to ‘label smoothing’ used in supervised learning.
How Probability Smoothing Works
Unlike clipping, PSPO preserves the gradient signal, so no learning information is thrown away when the policy ratio grows large. By interpolating towards the old policy, PSPO creates a ‘soft trust region’ that discourages large, potentially destabilizing updates and comes with formal stability guarantees. The core idea is to reduce overconfidence in any single action, which is particularly relevant in language generation, where many different words can convey the same meaning. A sketch of the mechanism follows.
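Here is a minimal PyTorch sketch of the idea as described above; the function name and the smoothing coefficient `alpha` are illustrative assumptions, not the paper’s notation:

```python
import torch

def smoothed_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   alpha: float = 0.1) -> torch.Tensor:
    # Interpolate the current policy's probabilities towards the old
    # (behaviour) policy before forming the importance ratio.
    p_new = logp_new.exp()
    p_old = logp_old.exp()
    p_smooth = (1.0 - alpha) * p_new + alpha * p_old
    # Algebraically, p_smooth / p_old = (1 - alpha) * (p_new / p_old) + alpha:
    # the ratio is pulled towards 1 (a soft trust region), yet its gradient
    # with respect to the new policy is scaled by (1 - alpha) everywhere,
    # so it never vanishes the way a hard-clipped ratio does.
    return p_smooth / p_old
```

Note how the linear interpolation shrinks extreme ratios towards 1 without ever flattening the gradient to zero, which is exactly the property clipping lacks.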
The researchers instantiated PSPO within GRPO, creating GR-PSPO. They then fine-tuned Qwen2.5-0.5B and Qwen2.5-1.5B models on mathematical reasoning tasks, specifically using the GSM8K dataset for training and evaluating on GSM8K, SVAMP, ASDiv, and MATH-500 datasets. The reward function was based on numeric correctness and a bonus for correct formatting.
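As a rough illustration of that reward, the sketch below checks a GSM8K-style final-answer marker; the marker, the weights, and the parsing logic are assumptions for illustration, not the paper’s exact values:

```python
import re

def reward(response: str, gold: str) -> float:
    # Small bonus when the final answer appears in the expected format,
    # plus a larger reward when the extracted number matches the reference.
    score = 0.0
    match = re.search(r"####\s*(-?[\d.,]+)", response)  # assumed answer marker
    if match:
        score += 0.1  # formatting bonus (assumed weight)
        try:
            if float(match.group(1).replace(",", "")) == float(gold):
                score += 1.0  # numeric correctness (assumed weight)
        except ValueError:
            pass
    return score
```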
Key Findings and Performance
The empirical results demonstrated significant improvements. Relative to clipped GRPO, GR-PSPO substantially boosted performance for both the 0.5B and 1.5B models. On GSM8K, the 0.5B model improved from 17.6% to 39.7% accuracy, and the 1.5B model from 37.8% to 59.4%, a gain of more than 20 percentage points in each case.
While GR-PSPO achieved accuracy similar to unclipped GRPO (which uses a single iteration and no data reuse), it produced notably better responses. In an LLM-as-Judge evaluation, GR-PSPO’s outputs were rated higher on overall quality, constraint adherence, logical coherence, mathematical soundness, and clarity.
A notable advantage of GR-PSPO is its ability to reduce issues like ‘instruction leakage’ and verbosity, which were observed in responses from GRPO-clip and GRPO-noclip. This means the model adheres better to the given instructions and provides more focused answers.
Advantages and Future Outlook
PSPO offers stability without truncating the learning objective, and it requires no additional computation or memory. It is a drop-in replacement for ratio clipping in any clipped-ratio objective, as the sketch below illustrates. This makes it a practical option, especially where multi-epoch updates or mini-batches are used, settings that can be challenging for unclipped methods in larger or sparser models.
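To make the drop-in claim concrete, here is a hedged side-by-side comparison of a standard clipped surrogate and a PSPO-style smoothed one; the hyperparameter values and objective details are assumptions, not the paper’s exact specification:

```python
import torch

def clipped_loss(logp_new, logp_old, adv, eps=0.2):
    # Standard clipped-ratio surrogate: the gradient vanishes once the
    # ratio leaves [1 - eps, 1 + eps] and the clipped term is active.
    r = (logp_new - logp_old).exp()
    return -torch.min(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv).mean()

def pspo_style_loss(logp_new, logp_old, adv, alpha=0.1):
    # Same surrogate with clipping swapped for probability smoothing:
    # no min/clamp, so the learning signal is never truncated.
    r = (1.0 - alpha) * (logp_new - logp_old).exp() + alpha
    return -(r * adv).mean()
```

The only change is how the ratio is formed, which is why the method slots into existing clipped-ratio training loops without extra cost.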
The research acknowledges limitations, primarily that the evaluation focused on mathematical reasoning with objective reward signals. Future work will explore its effectiveness in domains with more subjective or continuous rewards and with larger model sizes and different architectures. However, the current findings strongly suggest that Probability Smoothing Policy Optimisation is a promising advancement for stabilizing and enhancing reinforcement learning in large language models.


