TLDR: Probability Smoothing Policy Optimisation (PSPO) is a new method for training large language models (LLMs) with reinforcement learning. It replaces traditional ‘clipping’ by smoothing the current policy’s probabilities towards an older policy, creating a ‘soft trust region’ and preserving gradient information that clipping discards. This approach yields significant performance improvements (over 20 percentage points on GSM8K) and produces clearer, more coherent responses than standard clipping methods, without adding computational overhead.
Training large language models (LLMs) with reinforcement learning (RL) methods like PPO and GRPO is common practice, but it faces a significant challenge: keeping policy updates stable. The standard remedy is ‘ratio clipping’, which does keep training stable but has downsides: it discards valuable gradient information and introduces abrupt changes in the learning signal, known as gradient discontinuities.
A new research paper, “It’s Not You, It’s Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL”, introduces an innovative alternative called Probability Smoothing Policy Optimisation (PSPO). This method aims to overcome the limitations of clipping by smoothing the current policy’s probabilities towards the old (behavior) policy before calculating the importance ratio. This approach is similar to ‘label smoothing’ used in supervised learning.
How Probability Smoothing Works
Unlike clipping, PSPO preserves the gradient signal, so no learning information is thrown away when the policy ratio grows large. By interpolating towards the old policy, PSPO creates a ‘soft trust region’ that discourages large, potentially destabilizing updates and comes with formal stability guarantees. The core idea is to reduce overconfidence in any single action, which is particularly relevant in language generation, where many different words can convey the same meaning. A sketch of the mechanism follows.
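Here is a minimal PyTorch sketch of the idea as described above; the function name and the smoothing coefficient `alpha` are illustrative assumptions, not the paper’s notation:

```python
import torch

def smoothed_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   alpha: float = 0.1) -> torch.Tensor:
    # Interpolate the current policy's probabilities towards the old
    # (behaviour) policy before forming the importance ratio.
    p_new = logp_new.exp()
    p_old = logp_old.exp()
    p_smooth = (1.0 - alpha) * p_new + alpha * p_old
    # Algebraically, p_smooth / p_old = (1 - alpha) * (p_new / p_old) + alpha:
    # the ratio is pulled towards 1 (a soft trust region), yet its gradient
    # with respect to the new policy is scaled by (1 - alpha) everywhere,
    # so it never vanishes the way a hard-clipped ratio does.
    return p_smooth / p_old
```

Note how the linear interpolation shrinks extreme ratios towards 1 without ever flattening the gradient to zero, which is exactly the property clipping lacks.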
The researchers instantiated PSPO within GRPO, creating GR-PSPO. They then fine-tuned Qwen2.5-0.5B and Qwen2.5-1.5B models on mathematical reasoning tasks, specifically using the GSM8K dataset for training and evaluating on GSM8K, SVAMP, ASDiv, and MATH-500 datasets. The reward function was based on numeric correctness and a bonus for correct formatting.
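As a rough illustration of that reward, the sketch below checks a GSM8K-style final-answer marker; the marker, the weights, and the parsing logic are assumptions for illustration, not the paper’s exact values:

```python
import re

def reward(response: str, gold: str) -> float:
    # Small bonus when the final answer appears in the expected format,
    # plus a larger reward when the extracted number matches the reference.
    score = 0.0
    match = re.search(r"####\s*(-?[\d.,]+)", response)  # assumed answer marker
    if match:
        score += 0.1  # formatting bonus (assumed weight)
        try:
            if float(match.group(1).replace(",", "")) == float(gold):
                score += 1.0  # numeric correctness (assumed weight)
        except ValueError:
            pass
    return score
```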
Key Findings and Performance
The empirical results demonstrated significant improvements. Relative to clipped GRPO, GR-PSPO substantially boosted performance for both the 0.5B and 1.5B models. On GSM8K, the 0.5B model improved from 17.6% to 39.7% accuracy, and the 1.5B model from 37.8% to 59.4%, a gain of more than 20 percentage points in each case.
While GR-PSPO achieved accuracy similar to unclipped GRPO (which uses a single iteration and no data reuse), it produced notably better responses. In an LLM-as-Judge evaluation, GR-PSPO’s outputs were rated higher on overall quality, constraint adherence, logical coherence, mathematical soundness, and clarity.
A notable advantage of GR-PSPO is its ability to reduce issues like ‘instruction leakage’ and verbosity, which were observed in responses from GRPO-clip and GRPO-noclip. This means the model adheres better to the given instructions and provides more focused answers.
Advantages and Future Outlook
PSPO offers stability without truncating the learning objective, and it requires no additional computation or memory. It is a drop-in replacement for ratio clipping in any clipped-ratio objective, as the sketch below illustrates. This makes it a practical option, especially where multi-epoch updates or mini-batches are used, settings that can be challenging for unclipped methods in larger or sparser models.
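To make the drop-in claim concrete, here is a hedged side-by-side comparison of a standard clipped surrogate and a PSPO-style smoothed one; the hyperparameter values and objective details are assumptions, not the paper’s exact specification:

```python
import torch

def clipped_loss(logp_new, logp_old, adv, eps=0.2):
    # Standard clipped-ratio surrogate: the gradient vanishes once the
    # ratio leaves [1 - eps, 1 + eps] and the clipped term is active.
    r = (logp_new - logp_old).exp()
    return -torch.min(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv).mean()

def pspo_style_loss(logp_new, logp_old, adv, alpha=0.1):
    # Same surrogate with clipping swapped for probability smoothing:
    # no min/clamp, so the learning signal is never truncated.
    r = (1.0 - alpha) * (logp_new - logp_old).exp() + alpha
    return -(r * adv).mean()
```

The only change is how the ratio is formed, which is why the method slots into existing clipped-ratio training loops without extra cost.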
The research acknowledges limitations, primarily that the evaluation focused on mathematical reasoning with objective reward signals. Future work will explore its effectiveness in domains with more subjective or continuous rewards and with larger model sizes and different architectures. However, the current findings strongly suggest that Probability Smoothing Policy Optimisation is a promising advancement for stabilizing and enhancing reinforcement learning in large language models.


