Enhancing LLM Reasoning with Consistency-Aware Policy Optimization

TLDR: COPO (Consistency-Aware Policy Optimization) is a novel reinforcement learning framework designed to improve Large Language Models’ (LLMs) reasoning abilities. It tackles the ‘vanishing gradient’ problem prevalent in existing Group-relative Policy Optimization (GRPO) methods, which occurs when LLM responses become too consistent (all correct or all incorrect) for a given prompt, leading to ineffective learning signals. COPO introduces a structured global reward mechanism and an entropy-based soft blending strategy that adaptively combines local and global optimization objectives. This ensures continuous and meaningful learning, even from challenging data points that would otherwise be wasted, resulting in significant performance gains on mathematical reasoning benchmarks.

Large Language Models (LLMs) have shown remarkable progress in complex problem-solving, especially in areas like mathematical reasoning and code generation. A key driver behind this advancement is Reinforcement Learning (RL), which helps LLMs refine their reasoning capabilities.

Recently, the introduction of models like DeepSeek R1 has sparked interest in using rule-based rewards as a cost-effective way to guide policy optimization in RL. These methods often rely on Group-relative Policy Optimization (GRPO), an algorithm in which the model learns by comparing the rewards of multiple responses generated for a single prompt.

The Challenge of Vanishing Gradients

However, a significant challenge has emerged with GRPO-based methods. The ‘group-based advantage’ that drives learning scores each response relative to the other responses sampled for the same prompt, so when all of those responses converge to the same outcome, whether correct or incorrect, every advantage degenerates to zero. This leads to a problem known as ‘vanishing gradients,’ which makes those samples useless for learning and limits training efficiency and performance. The issue is particularly problematic when a task is either too easy (all responses are correct and identical) or too challenging (all responses are incorrect and identical), as the model receives no clear signal to improve.
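The degenerate case can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation) and binary rule-based rewards; the function name and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Fully consistent outcomes: every reward equals the group mean,
# so every advantage is zero and the prompt contributes no gradient.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

In both consistent groups the numerator is zero for every response, which is why too-easy and too-hard prompts alike stop contributing to training.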

Introducing COPO: Consistency-Aware Policy Optimization

To address this critical limitation, researchers have proposed a novel framework called COPO: Consistency-Aware Policy Optimization. COPO introduces several key innovations in both reward design and optimization strategy to ensure that the training process continues to receive meaningful learning signals, even when model outputs show high consistency within a group.

COPO’s core idea is to incorporate a structured global reward based on outcome consistency. This global reward works at the batch level, providing an ‘inter-group’ loss that complements the traditional ‘intra-group’ local optimization of GRPO. This means that even if all responses to a single prompt are the same (and thus have zero local advantage), the model can still learn from how well it performs across different prompts in a batch.

How COPO Works

The framework combines two main components:

Intra-group Local Optimization: This part largely follows the principles of GRPO, where rewards and advantages are computed by comparing responses to the same prompt. It encourages the model to shift its output distribution towards higher-rewarding responses within a group.
Inter-group Global Optimization: This is COPO’s novel contribution. When local learning signals disappear due to high consistency, COPO leverages cross-prompt reward variability. It calculates a prompt-level reward (average reward of all responses for that prompt) and then computes a ‘global advantage’ by comparing these prompt-level rewards across the entire mini-batch. This allows the model to continue learning even from prompts where all responses were incorrect, as long as there’s variability in performance across different prompts in the batch.
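The inter-group idea described above can be sketched as follows. The article only says that prompt-level rewards are compared across the mini-batch; the z-score normalization, the function name, and the `eps` term below are assumptions made for illustration and may differ from the paper's exact formulation.

```python
from statistics import mean, pstdev


def global_advantages(batch_rewards, eps=1e-6):
    """Return one advantage per prompt from prompt-level mean rewards.

    batch_rewards: list of groups, each a list of rewards for one prompt.
    The normalization across prompts is an assumed form, not the paper's.
    """
    prompt_rewards = [mean(group) for group in batch_rewards]
    mu = mean(prompt_rewards)
    sigma = pstdev(prompt_rewards)
    return [(p - mu) / (sigma + eps) for p in prompt_rewards]


batch = [
    [0.0, 0.0, 0.0, 0.0],  # all incorrect: zero local advantage
    [1.0, 1.0, 1.0, 1.0],  # all correct: zero local advantage
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes
]
adv = global_advantages(batch)
print(adv)
```

Here the all-incorrect prompt receives a negative global advantage and the all-correct prompt a positive one, even though both would contribute nothing under the intra-group comparison alone.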

Adaptive Blending with Consistency Entropy

A crucial aspect of COPO is its entropy-based soft blending mechanism. While global optimization helps mitigate vanishing gradients, it can dilute the precision of credit assignment, because it gives the same advantage to every response for a prompt, including lower-quality ones. To balance this, COPO adaptively weights the local and global optimization objectives based on the ‘consistency entropy’ of the current policy’s responses, which measures the diversity of outcomes for a given prompt.

If the consistency entropy is high (meaning diverse responses), local optimization dominates, encouraging the model to differentiate and reinforce higher-quality responses. If entropy is low (meaning uniform responses), global optimization dominates, pushing the model toward maintaining correctness and consistency across prompts. This adaptive blending ensures that all samples contribute to learning without being discarded, addressing the ‘sample wastage’ problem seen in other methods like DAPO.
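A rough sketch of this blending follows. It assumes consistency entropy is the Shannon entropy over the distinct final answers in a group, and that the local weight is a sigmoid of that entropy; the `threshold` and `temperature` parameters and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def consistency_entropy(answers):
    """Shannon entropy (in nats) of the outcome distribution for one prompt."""
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def local_weight(entropy, threshold=0.5, temperature=0.1):
    """Soft gate in (0, 1): near 1 for diverse groups, near 0 for uniform ones.

    threshold and temperature are illustrative values, not the paper's.
    """
    return 1.0 / (1.0 + math.exp(-(entropy - threshold) / temperature))


def blend(local_term, global_term, entropy):
    """Convex combination of the local and global learning signals."""
    w = local_weight(entropy)
    return w * local_term + (1.0 - w) * global_term


uniform = consistency_entropy(["42", "42", "42", "42"])  # identical outcomes
diverse = consistency_entropy(["42", "7", "13", "5"])    # all different

print(uniform, local_weight(uniform))
print(diverse, local_weight(diverse))
```

With these assumed settings, a fully uniform group is driven almost entirely by the global term, while a fully diverse group is driven almost entirely by the local term, and intermediate groups receive a mixture of both.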

Performance and Impact

The effectiveness of COPO has been validated through substantial performance gains on multiple mathematical reasoning benchmarks, including MATH-500 and AIME 2024. Experiments with Qwen2.5-Instruct 7B and 3B models showed that COPO consistently achieved superior inference accuracy compared to GRPO and DAPO, especially maintaining stable performance in later training stages where GRPO often suffers a drop. This demonstrates COPO’s ability to extract meaningful learning signals from data that would otherwise lead to vanishing gradients.

The research also includes ablation studies confirming that data with zero in-group advantage (often discarded by other methods) still holds significant learning value when global optimization is applied. The code for COPO has been released and is available on GitHub.

While COPO shows strong results, the paper notes a limitation: it may not offer the same advantages when applied to smaller, already math-tuned models, possibly due to conflicts between the composite loss function and the model’s pre-trained task-specific objectives.
