Improving LLM Performance and Clarity with Probability Smoothing Policy Optimisation

TLDR: Probability Smoothing Policy Optimisation (PSPO) is a new method for training large language models (LLMs) with reinforcement learning. Instead of traditional ratio ‘clipping’, it smooths the current policy’s probabilities towards an older policy, creating a ‘soft trust region’ that preserves important gradient information. This approach leads to significant performance improvements (over 20 percentage points on GSM8K) and produces clearer, more coherent responses than standard clipping methods, without adding computational overhead.

Training large language models (LLMs) with reinforcement learning (RL) methods like PPO and GRPO is a common practice, but it often faces a significant challenge: maintaining stability during updates. Traditionally, a technique called ‘ratio clipping’ is used to prevent instability. While effective at keeping things stable, clipping has its downsides. It can discard valuable information and introduce abrupt changes in the learning process, known as gradient discontinuities.

A new research paper, “It’s Not You, It’s Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL”, introduces an innovative alternative called Probability Smoothing Policy Optimisation (PSPO). This method aims to overcome the limitations of clipping by smoothing the current policy’s probabilities towards the old (behavior) policy before calculating the importance ratio. This approach is similar to ‘label smoothing’ used in supervised learning.

How Probability Smoothing Works

Unlike clipping, PSPO is designed to preserve the gradient signal, meaning it doesn’t lose important learning information. By interpolating towards the old policy, PSPO effectively creates a ‘soft trust region’. This soft trust region discourages large, potentially destabilizing updates, providing formal guarantees for stability. The core idea is to reduce overconfidence in any single action, which is particularly relevant in language generation tasks where multiple words can convey the same meaning.
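In outline, the mechanism can be sketched as follows. This is a minimal illustration, not the paper’s exact formulation: the function names and the smoothing weight `alpha` (and its value) are assumptions made for the example.

```python
def smoothed_ratio(p_new, p_old, alpha=0.1):
    """PSPO-style ratio: interpolate the current policy's token probability
    towards the old (behaviour) policy, then form the importance ratio.
    Algebraically this equals (1 - alpha) * (p_new / p_old) + alpha, so the
    gradient w.r.t. p_new is scaled by (1 - alpha) but never vanishes."""
    p_smooth = (1 - alpha) * p_new + alpha * p_old
    return p_smooth / p_old

def clipped_ratio(p_new, p_old, eps=0.2):
    """Standard PPO-style ratio clipping, shown for comparison."""
    r = p_new / p_old
    return max(1 - eps, min(1 + eps, r))

# A large policy shift: the raw ratio is 3.0.
p_new, p_old = 0.6, 0.2
print(clipped_ratio(p_new, p_old))   # saturates at 1.2; gradient w.r.t. p_new is zero
print(smoothed_ratio(p_new, p_old))  # roughly 2.8: pulled towards 1, but still responsive
```

The comparison shows the ‘soft trust region’ idea: smoothing shrinks the ratio towards 1 without the hard cutoff that makes clipped gradients vanish.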

The researchers instantiated PSPO within GRPO, creating GR-PSPO. They then fine-tuned Qwen2.5-0.5B and Qwen2.5-1.5B models on mathematical reasoning tasks, specifically using the GSM8K dataset for training and evaluating on GSM8K, SVAMP, ASDiv, and MATH-500 datasets. The reward function was based on numeric correctness and a bonus for correct formatting.
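A reward of that shape might look like the sketch below. The `#### <number>` answer format, the weight values, and the function names are illustrative assumptions, not the paper’s exact specification:

```python
import re

def reward(response: str, gold_answer: str,
           correct_r=1.0, format_bonus=0.1):
    """Illustrative reward: numeric correctness plus a small bonus for a
    well-formatted final answer (weights are assumed, not the paper's)."""
    m = re.search(r"####\s*(-?\d+(?:\.\d+)?)", response)
    r = 0.0
    if m:
        r += format_bonus             # response ends with a parseable answer
        if m.group(1) == gold_answer:
            r += correct_r            # the extracted number matches the gold label
    return r

print(reward("...so the total is 42.\n#### 42", "42"))  # 1.1
print(reward("the answer is 42", "42"))                 # 0.0: no parseable format
```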

Key Findings and Performance

The empirical results demonstrated significant improvements. Relative to clipped GRPO, GR-PSPO substantially boosted performance in both the 0.5B and 1.5B models. For instance, on GSM8K, the 0.5B model saw an increase from 17.6% to 39.7% accuracy, and the 1.5B model improved from 37.8% to 59.4%. In both cases, this is a gain of more than 20 percentage points in accuracy.

While GR-PSPO achieved similar quantitative accuracy to unclipped GRPO (which uses a single iteration and no data reuse), it produced notably better response quality. Using an LLM-as-Judge evaluation, GR-PSPO’s responses were rated as clearer, more concise, and more logically coherent across various metrics including overall quality, constraint adherence, logical coherence, mathematical soundness, and clarity.

A notable advantage of GR-PSPO is its ability to reduce issues like ‘instruction leakage’ and verbosity, which were observed in responses from GRPO-clip and GRPO-noclip. This means the model adheres better to the given instructions and provides more focused answers.

Advantages and Future Outlook

PSPO offers stability without the need to truncate the learning objective, and it does so without requiring additional computation or memory. It’s a direct, straightforward replacement for ratio clipping in any clipped-ratio objective. This makes it a practical option, especially in scenarios where multi-epoch updates or mini-batches are used, which can be challenging for unclipped methods in larger or sparser models.
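To illustrate what ‘drop-in replacement’ means in practice, the sketch below swaps one ratio-shaping function for another inside an otherwise unchanged surrogate loss. Function names and hyperparameter values are assumptions, and the PPO-style pessimistic min between clipped and unclipped terms is omitted for brevity:

```python
def clip(r, eps=0.2):
    """Hard trust region: gradient vanishes once |r - 1| exceeds eps."""
    return max(1.0 - eps, min(1.0 + eps, r))

def smooth(r, alpha=0.1):
    """Soft trust region: the ratio is pulled towards 1, but its gradient
    (1 - alpha) never vanishes."""
    return (1.0 - alpha) * r + alpha

def surrogate_loss(ratios, advantages, shape=clip):
    """Mean per-token policy loss. Passing shape=smooth swaps clipping for
    probability smoothing without touching anything else in the objective."""
    terms = [shape(r) * a for r, a in zip(ratios, advantages)]
    return -sum(terms) / len(terms)

ratios, advs = [0.5, 1.0, 3.0], [1.0, -1.0, 1.0]
print(surrogate_loss(ratios, advs, shape=clip))    # clipped objective
print(surrogate_loss(ratios, advs, shape=smooth))  # smoothed objective
```

Because both shaping functions take the same inputs and return a ratio, the rest of the training loop, mini-batching, and multi-epoch reuse are unaffected, which matches the no-extra-compute claim.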

The research acknowledges limitations, primarily that the evaluation focused on mathematical reasoning with objective reward signals. Future work will explore its effectiveness in domains with more subjective or continuous rewards and with larger model sizes and different architectures. However, the current findings strongly suggest that Probability Smoothing Policy Optimisation is a promising advancement for stabilizing and enhancing reinforcement learning in large language models.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
