TLDR: COPO (Consistency-Aware Policy Optimization) is a reinforcement learning framework designed to improve the reasoning abilities of Large Language Models (LLMs). It tackles the ‘vanishing gradient’ problem in Group Relative Policy Optimization (GRPO) methods, which occurs when all sampled responses to a prompt reach the same outcome (all correct or all incorrect), leaving no learning signal. COPO introduces a structured global reward mechanism and an entropy-based soft blending strategy that adaptively combines local and global optimization objectives. This keeps the learning signal alive even for prompts that would otherwise contribute nothing, and yields performance gains on mathematical reasoning benchmarks such as MATH-500 and AIME 2024.
Large Language Models (LLMs) have shown remarkable progress in complex problem-solving, especially in areas like mathematical reasoning and code generation. A key driver behind this advancement is Reinforcement Learning (RL), which helps LLMs refine their reasoning capabilities.
Recently, models like DeepSeek R1 have sparked interest in rule-based rewards as a cost-effective way to guide policy optimization in RL. These methods often build on Group Relative Policy Optimization (GRPO), in which the model learns by comparing the rewards of multiple responses sampled for the same prompt.
The Challenge of Vanishing Gradients
However, a significant challenge has emerged with GRPO-based methods: when the sampled responses to a single prompt all converge to the same outcome, whether correct or incorrect, every reward equals the group mean, so the ‘group-based advantage’ (which drives learning) degenerates to zero. This leads to ‘vanishing gradients’: those samples contribute nothing to the update, which limits training efficiency and final performance. The issue arises when a task is either too easy (all responses are correct) or too hard (all responses are incorrect), because in both cases the model receives no signal to improve.
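The degeneration described above can be seen in a minimal sketch. It assumes binary rule-based rewards and a mean/standard-deviation normalization within each group; the function name `group_advantages` and the `eps` stabilizer are illustrative choices, not taken from the COPO code.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get a positive advantage, incorrect ones negative.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Fully consistent outcomes: every reward equals the mean, so every advantage is zero.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

Because the policy gradient is weighted by these advantages, the two fully consistent groups contribute nothing to the update.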
Introducing COPO: Consistency-Aware Policy Optimization
To address this critical limitation, researchers have proposed a novel framework called COPO: Consistency-Aware Policy Optimization. COPO introduces several key innovations in both reward design and optimization strategy to ensure that the training process continues to receive meaningful learning signals, even when model outputs show high consistency within a group.
COPO’s core idea is to incorporate a structured global reward based on outcome consistency. This global reward works at the batch level, providing an ‘inter-group’ loss that complements the traditional ‘intra-group’ local optimization of GRPO. This means that even if all responses to a single prompt are the same (and thus have zero local advantage), the model can still learn from how well it performs across different prompts in a batch.
How COPO Works
The framework combines two main components:
- Intra-group Local Optimization: This part largely follows the principles of GRPO, where rewards and advantages are computed by comparing responses to the same prompt. It encourages the model to shift its output distribution towards higher-rewarding responses within a group.
- Inter-group Global Optimization: This is COPO’s novel contribution. When local learning signals disappear due to high consistency, COPO leverages cross-prompt reward variability. It calculates a prompt-level reward (average reward of all responses for that prompt) and then computes a ‘global advantage’ by comparing these prompt-level rewards across the entire mini-batch. This allows the model to continue learning even from prompts where all responses were incorrect, as long as there’s variability in performance across different prompts in the batch.
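The inter-group step can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the prompt-level reward is the mean reward of a prompt's responses (as described above), while the mean/standard-deviation normalization across the mini-batch and the name `global_advantages` are assumptions made for this example.

```python
from statistics import mean, pstdev


def global_advantages(batch_rewards, eps=1e-6):
    """Return one advantage per prompt, shared by all responses to that prompt.

    batch_rewards: list of groups, one list of response rewards per prompt.
    Assumption: prompt-level rewards are standardized across the mini-batch.
    """
    prompt_rewards = [mean(group) for group in batch_rewards]
    mu = mean(prompt_rewards)
    sigma = pstdev(prompt_rewards)
    return [(p - mu) / (sigma + eps) for p in prompt_rewards]


batch = [
    [0.0, 0.0, 0.0, 0.0],  # all incorrect: zero local advantage
    [1.0, 1.0, 1.0, 1.0],  # all correct: zero local advantage
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes
]
adv = global_advantages(batch)
print(adv)
```

The first two prompts have zero local advantage, yet they receive a negative and a positive global advantage respectively, so they still shape the update as long as prompt-level rewards differ across the batch.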
Adaptive Blending with Consistency Entropy
A crucial aspect of COPO is its entropy-based soft blending mechanism. While global optimization helps mitigate vanishing gradients, it can dilute the precision of credit assignment, because it gives the same advantage to every response for a prompt, including lower-quality ones. To balance this, COPO adaptively weights the local and global objectives based on the ‘consistency entropy’ of the current policy’s responses. Consistency entropy measures the diversity of outcomes for a given prompt.
If the consistency entropy is high (meaning diverse outcomes), local optimization dominates, encouraging the model to differentiate and reinforce higher-quality responses. If entropy is low (meaning uniform outcomes), global optimization dominates, pushing the model toward maintaining correctness and consistency across prompts. This adaptive blending ensures that all samples contribute to learning rather than being discarded, addressing the ‘sample wastage’ problem seen in methods like DAPO.
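A minimal sketch of the blending idea follows. The source does not give the exact weighting function, so this example assumes a simple one: the Shannon entropy of a prompt's final answers, normalized by its maximum, is used directly as the weight on the local advantage. The names `consistency_entropy`, `local_weight`, and `blended_advantage` are hypothetical.

```python
import math
from collections import Counter


def consistency_entropy(answers):
    """Shannon entropy (in nats) of the final answers sampled for one prompt."""
    n = len(answers)
    probs = [count / n for count in Counter(answers).values()]
    return sum(-p * math.log(p) for p in probs)


def local_weight(answers):
    """Hypothetical blend weight in [0, 1]: entropy divided by its maximum, log(n)."""
    n = len(answers)
    if n < 2:
        return 0.0
    return consistency_entropy(answers) / math.log(n)


def blended_advantage(local_adv, global_adv, w):
    """Soft blend: high entropy favors the local term, low entropy the global term."""
    return w * local_adv + (1.0 - w) * global_adv


uniform = ["42", "42", "42", "42"]  # identical outcomes
diverse = ["42", "41", "40", "7"]   # every outcome differs

w_uniform = local_weight(uniform)
w_diverse = local_weight(diverse)
print(w_uniform, w_diverse)
print(blended_advantage(0.0, -1.2, w_uniform))
```

With identical outcomes the weight on the local term is zero, so the prompt is driven entirely by its global advantage; with fully diverse outcomes the local term takes over. A sigmoid or thresholded weight would fit the same description, so the linear normalization here is only one possible choice.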
Performance and Impact
The effectiveness of COPO has been validated through substantial performance gains on multiple mathematical reasoning benchmarks, including MATH-500 and AIME 2024. Experiments with Qwen2.5-Instruct 7B and 3B models showed that COPO consistently achieved superior inference accuracy compared to GRPO and DAPO, especially maintaining stable performance in later training stages where GRPO often suffers a drop. This demonstrates COPO’s ability to extract meaningful learning signals from data that would otherwise lead to vanishing gradients.
The research also includes ablation studies confirming that data with zero in-group advantage (often discarded by other methods) still holds significant learning value when global optimization is applied. The code for COPO has been released and is available on GitHub.
While COPO shows strong results, the paper notes a limitation: it may not offer the same advantages when applied to smaller, already math-tuned models, possibly due to conflicts between the composite loss function and the model’s pre-trained task-specific objectives.


