TLDR: This research paper challenges the common belief that reward miscalibration causes autonomous agents to fail in long-horizon tasks. Instead, it identifies “gradient coupling” between similar training samples as the primary issue, where beneficial updates for correct actions inadvertently strengthen similar, flawed actions. To address this, the authors propose Generative Classification Disentanglement (GCD), a method that trains the agent to classify actions as good or bad, thereby separating their internal representations and reducing harmful gradient interference. Experiments show GCD significantly improves agent performance, especially on new tasks.
Building autonomous agents capable of tackling complex, real-world tasks over many steps has become a major focus in artificial intelligence research. These agents, often powered by large language models, interact with environments to achieve long-term goals. However, a persistent challenge has been their tendency to take repetitive or unproductive actions, which leads to task failures.
The prevailing theory for these failures pointed to ‘reward miscalibration.’ This idea suggested that in long sequences of actions, a flawed intermediate step might still accidentally lead to a successful overall outcome, thus mistakenly receiving a positive reward and being reinforced during training. Many researchers have tried to fix this by introducing more detailed, step-level rewards to provide finer feedback.
Challenging the Conventional Wisdom
A new research paper, “Rethinking Reward Miscalibration of GRPO in Agentic RL”, challenges this widely accepted view. The authors, Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, and Yong Liu, argue that outcome-based reinforcement learning methods such as GRPO (Group Relative Policy Optimization) already penalize detrimental actions in principle. A flawed action may occasionally appear in a successful trajectory, but averaged over many trajectories it should yield a negative expected advantage, meaning it should be discouraged during training. Even when accounting for phenomena such as the ‘squeezing effect,’ in which updates shift probability mass across actions, the authors argue that the likelihood of good actions should rise while that of bad actions gradually falls.
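To make the expected-advantage argument concrete, here is a minimal sketch of a group-relative advantage in the style of GRPO, where each rollout's reward is normalized by the mean and standard deviation of its group. The rewards, the `contains_flawed` flags, and the use of the population standard deviation are illustrative assumptions, not values or choices taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one task: 1.0 = success, 0.0 = failure.
rewards = [1.0, 0.0, 0.0, 1.0]
# Hypothetical flags: did the rollout contain the flawed action?
# Rollout 0 succeeded despite the flawed action (the "miscalibrated" case).
contains_flawed = [True, True, True, False]

advantages = group_relative_advantages(rewards)

# Average advantage credited to the flawed action across the group.
flawed_adv = mean(a for a, f in zip(advantages, contains_flawed) if f)
print(advantages, flawed_adv)
```

In this toy group, one rollout rewards the flawed action by accident, yet its average advantage is still negative (about -0.33) because it appears more often in failures than in successes.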
The Real Culprit: Gradient Coupling
If reward miscalibration isn’t the primary issue, then what is? The researchers identify ‘gradient coupling’ between similar samples as the true root cause. In agentic tasks, the data generated during training is often highly similar. For instance, successive steps in an agent’s interaction might only differ by a minor observation, and the range of possible actions can be limited. This high degree of similarity means that when the model learns from a well-performing action, the gradients (the signals that guide learning) can inadvertently strengthen other, similar-looking actions that are actually suboptimal or incorrect. This ‘coupling’ causes flawed behaviors to persist or even increase, despite the theoretical negative feedback they should receive.
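The coupling effect can be illustrated with a toy model. The sketch below assumes a two-action linear softmax policy rather than a language model, and the feature vectors `x_good` and `x_bad` are hypothetical stand-ins for two nearly identical agent states in which the same action is correct in the first and flawed in the second.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def policy(W, x):
    """Action probabilities of a linear softmax policy with weights W."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(logits)

def policy_gradient_step(W, x, action, advantage, lr=0.5):
    """One ascent step on advantage * log pi(action | x).

    For a linear softmax policy, d log pi(action|x) / d W[a]
    equals (1[a == action] - pi[a]) * x.
    """
    p = policy(W, x)
    return [
        [w + lr * advantage * ((1.0 if a == action else 0.0) - p[a]) * xi
         for w, xi in zip(row, x)]
        for a, row in enumerate(W)
    ]

# Two nearly identical states (hypothetical features).
x_good = [1.0, 0.0, 1.0]  # action 0 is correct here
x_bad = [1.0, 0.1, 1.0]   # action 0 is a flawed repeat here

W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
before = policy(W, x_bad)[0]

# Reinforce the correct action in the good state only.
W = policy_gradient_step(W, x_good, action=0, advantage=1.0)
after = policy(W, x_bad)[0]
print(before, after)
```

Although the update only ever sees the good state, the probability of the flawed action in the similar state rises from 0.5 to roughly 0.73, because the two inputs share almost all of their features.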
Generative Classification Disentanglement (GCD)
To tackle this problem, the paper proposes a novel solution called Generative Classification Disentanglement (GCD). The core idea is to train the agent’s actor model to simultaneously act as a classifier. This means the model not only decides what action to take but also learns to classify whether a given action is ‘good’ or ‘bad.’ By adding this auxiliary classification objective, the model is compelled to learn distinct internal representations (embeddings) for good and bad actions. This separation in the model’s ‘mind’ effectively decouples the harmful gradient influences between similar samples.
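One way to picture the auxiliary task is as a text-generation problem: the actor is shown an observation and a candidate action, generates a verdict, and is rewarded when the verdict matches a label. The sketch below is only an illustration under that assumption. The prompt wording, the helper names, and the binary reward are hypothetical, and how the paper obtains its good/bad labels is not covered here.

```python
def build_classification_prompt(observation, action):
    """Hypothetical prompt asking the actor to judge a candidate action."""
    return (
        f"Observation: {observation}\n"
        f"Proposed action: {action}\n"
        "Is this action good or bad for completing the task? "
        "Answer 'good' or 'bad'."
    )

def parse_verdict(generated_text):
    """Return the first 'good' or 'bad' found in the text, else None."""
    for word in generated_text.lower().split():
        token = word.strip(".,!?'\"")
        if token in ("good", "bad"):
            return token
    return None

def classification_reward(generated_text, label):
    """Binary reward: 1.0 if the generated verdict matches the label."""
    return 1.0 if parse_verdict(generated_text) == label else 0.0

prompt = build_classification_prompt(
    "You are in the kitchen. The drawer is already open.",
    "open drawer",
)
reward = classification_reward("Bad. The drawer is already open.", "bad")
print(prompt)
print(reward)
```

A reward of this kind could then be fed into the same group-relative update used for the acting task, which is one reading of the "GRPO-style" classification loss.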
The overall training objective combines the standard reinforcement learning loss with a GRPO-style loss for the classification task. This approach helps to mitigate the interference where a beneficial update for a correct action might accidentally boost a similar, flawed one. Additionally, the researchers introduce a ‘prompt-based correction’ strategy. During training, the model’s own critiques of its mistakes are summarized and injected as explicit instructions into future prompts. This helps to actively pull the probability of specific flawed actions out of a ‘danger zone’ (where self-correction is weak) and into a ‘safe regime’ (where self-correction is strong).
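The prompt-based correction step can be sketched as simple prompt assembly. This is a hypothetical illustration: the function name, the instruction wording, and the `max_items` cap are assumptions, and the summarization of critiques described above is replaced here by plain deduplication and truncation.

```python
def inject_corrections(base_prompt, critiques, max_items=3):
    """Append previously collected critiques to a prompt as instructions.

    Deduplication and truncation stand in for summarization in this sketch.
    """
    unique = []
    for critique in critiques:
        critique = critique.strip()
        if critique and critique not in unique:
            unique.append(critique)
    if not unique:
        return base_prompt
    recent = unique[-max_items:]  # keep only the most recent distinct items
    bullet_list = "\n".join(f"- {c}" for c in recent)
    return (
        f"{base_prompt}\n\n"
        f"Avoid these previously identified mistakes:\n{bullet_list}"
    )

# Hypothetical critiques gathered from earlier classification outputs.
critiques = [
    "Do not reopen a container that is already open.",
    "Do not reopen a container that is already open.",
    "Do not revisit a location that was just searched.",
]
new_prompt = inject_corrections("Task: put a clean mug on the desk.", critiques)
print(new_prompt)
```

Stating the mistake explicitly in the prompt lowers the probability of the flawed action directly, rather than relying on gradient updates alone to do so.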
Experimental Validation and Impact
The authors evaluated GCD on two complex agent environments, ALFWorld and ScienceWorld. Using Qwen2.5-1.5B and Qwen2.5-7B models, they report that integrating GCD with existing reinforcement learning algorithms such as GRPO, GiGPO, and RLVMR significantly improved performance. The improvement was particularly noticeable on ‘out-of-domain’ tasks, which the agent does not see during training. This suggests that GCD helps agents generalize better by ensuring that bad actions are appropriately penalized and that the unintended reinforcement they receive from similar good actions is reduced.
The findings of this paper offer a fresh perspective on the challenges of training autonomous agents. By pinpointing gradient coupling as a critical issue and providing an effective solution in Generative Classification Disentanglement, this research paves the way for more robust and capable AI agents that can reliably solve long-horizon tasks in the real world.


