TLDR: This research paper unifies two distinct approaches to training large language models (LLMs) for Pass@K tasks (where a problem counts as solved if at least one of K attempts is correct): direct policy gradient optimization and advantage shaping. It demonstrates that advantage shaping implicitly optimizes ‘surrogate rewards’ and that practical ‘hard-example up-weighting’ can be interpreted as reward-level regularization. This framework provides a clearer understanding of existing methods and a recipe for designing new learning algorithms that balance exploitation and exploration.
When large language models (LLMs) tackle complex tasks like solving math problems or writing code, they often generate multiple solutions. A common way to evaluate their performance is ‘Pass@K,’ which checks whether at least one of K generated solutions is correct. However, standard policy gradient methods are designed to optimize the success of a single attempt, creating a mismatch between how models are trained and how they are evaluated.
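To make the metric concrete, here is a minimal Python sketch. It assumes each attempt succeeds independently with the same probability p, and the function names are illustrative rather than taken from the paper. The second function is the combinatorial estimator widely used in code-generation benchmarks to estimate Pass@K from n sampled attempts, c of which are correct.

```python
from math import comb


def pass_at_k_exact(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds,
    when each attempt succeeds with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled attempts, c of which are correct.

    Computes one minus the fraction of size-k subsets of the n attempts
    that contain no correct attempt. Requires n >= k.
    """
    if k > n:
        raise ValueError("need at least k sampled attempts")
    # comb(n - c, k) is 0 whenever fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k_exact(0.5, 2))      # single-attempt rate 0.5, two tries
    print(pass_at_k_estimate(4, 1, 2))  # 1 correct out of 4 samples, K = 2
```

Note how Pass@K can be high even when the single-attempt success rate is modest, which is exactly why training for one attempt and evaluating on K attempts can pull in different directions.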
Recent research has approached this challenge from two seemingly different angles. One set of methods directly calculates policy gradients to maximize the Pass@K reward. These ‘direct optimization’ techniques, often inspired by REINFORCE-style algorithms, reweight the learning signals to focus on examples where success is less common, effectively amplifying the importance of rare correct responses.
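The reweighting can be seen from the chain rule. Under the simplifying assumption that attempts are independent with success probability p, Pass@K equals 1 - (1 - p)^K, so its gradient is the ordinary single-attempt gradient multiplied by K(1 - p)^(K-1). The sketch below evaluates that population-level factor; it is an illustration of the idea, not the finite-sample estimator of any specific method, and the function name is hypothetical.

```python
def pass_at_k_weight(p: float, k: int) -> float:
    """Chain-rule factor d/dp [1 - (1 - p)^k] = k * (1 - p)^(k - 1).

    Multiplying the single-attempt policy gradient by this factor gives
    the gradient of the (population-level) Pass@K objective.
    """
    return k * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    k = 4
    for p in (0.1, 0.5, 0.9):
        # The factor shrinks quickly as p grows: prompts the model rarely
        # solves get large weights, prompts it usually solves get tiny ones.
        print(f"p={p:.1f}  weight={pass_at_k_weight(p, k):.4f}")
```

With K = 1 the factor is always 1, which recovers the usual single-attempt objective.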
The second approach involves ‘advantage shaping.’ This technique modifies the ‘advantage scores’ within existing policy gradient algorithms, such as GRPO, to specifically account for the Pass@K objective. Advantage scores are essentially weights that tell the AI how much to adjust its behavior based on the outcome of an action.
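As a point of reference, the sketch below computes group-normalized advantages in the style commonly associated with GRPO: each response's reward is centered by the group mean and divided by the group standard deviation. The use of the population standard deviation and the small `eps` constant are assumptions made for this illustration, since implementations differ on both. Advantage shaping for Pass@K would further transform these scores; that transformation is not shown here.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Center rewards by the group mean and scale by the group std.

    rewards: rewards for the responses sampled for ONE prompt
             (e.g. 1.0 for correct, 0.0 for incorrect).
    eps:     small constant (an assumption of this sketch) that avoids
             division by zero when all rewards in the group are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One correct response among four: the rare success gets a large
    # positive advantage, the failures get small negative ones.
    print(group_normalized_advantages([1.0, 0.0, 0.0, 0.0]))
```

When every response in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal.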
This new research paper, titled “Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients,” reveals that these two distinct approaches are, in fact, two sides of the same coin. By ‘reverse-engineering’ existing advantage-shaping algorithms, the authors show that these algorithms implicitly optimize what are called ‘surrogate rewards.’ A surrogate reward is a mathematical transformation of the actual reward that is easier to optimize, but still guides the AI towards the desired outcome.
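A small worked example shows what reverse-engineering can look like. For binary rewards with success probability p, the reward's standard deviation is sqrt(p(1 - p)), so dividing the learning signal by it amounts, at the population level, to weighting the single-attempt gradient by 1 / sqrt(p(1 - p)). Integrating that weight gives F(p) = 2 arcsin(sqrt(p)), a function whose gradient matches the reweighted update. The sketch below checks this numerically. It is an illustration under these stated assumptions, and the paper's own derivations may differ in detail.

```python
from math import asin, sqrt


def surrogate(p: float) -> float:
    """Candidate surrogate reward F(p) = 2 * arcsin(sqrt(p))."""
    return 2.0 * asin(sqrt(p))


def std_normalized_weight(p: float) -> float:
    """Weight implied by dividing by the std of a Bernoulli(p) reward."""
    return 1.0 / sqrt(p * (1.0 - p))


def numerical_derivative(f, p: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(p)."""
    return (f(p + h) - f(p - h)) / (2.0 * h)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        lhs = numerical_derivative(surrogate, p)
        rhs = std_normalized_weight(p)
        # The two columns agree: the weight is the derivative of F.
        print(f"p={p:.1f}  F'(p)={lhs:.6f}  weight={rhs:.6f}")
```

The same procedure works in the other direction: pick a surrogate F, differentiate it, and use F'(p) as the weight.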
Conversely, the paper shows how to ‘forward-engineer’ new advantage-shaping methods by starting with a surrogate reward objective. This means researchers can now design new ways to guide AI learning by first defining a suitable surrogate reward, then deriving the corresponding advantage-shaping rules.
A key insight from this work is the concept of ‘reward-level regularization.’ The paper interprets practical modifications, such as ‘hard-example up-weighting’ (giving more importance to problems the AI struggles with), as a form of regularization applied directly to the reward function. Unlike traditional regularization methods, which typically add a penalty on the model’s parameters, this approach influences learning by adjusting the value placed on different types of outcomes. This helps balance ‘exploitation’ (improving performance on already easy tasks) with ‘exploration’ (focusing on harder, unsolved problems to find new solutions).
For instance, the paper shows that a simple gradient scaling technique, dubbed ‘skew-R,’ which downweights contributions from examples already solved with high probability, can be interpreted as optimizing a regularized surrogate reward. This provides a theoretical justification for empirically motivated strategies, such as the ‘prioritized sampling’ used in advanced LLMs like Kimi 1.5, which reweights examples to make harder ones appear more frequently during training.
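To illustrate how a down-weighting rule can be read as a regularized reward, the sketch below uses a hypothetical weight w(p) = 1 - p. This is not the paper's skew-R definition, which is not reproduced in this article; it is only a simple stand-in with the same qualitative behavior. Integrating the weight from 0 to p gives the surrogate F(p) = p - p²/2: the plain success rate p plus a penalty term -p²/2 that grows as a problem becomes easy.

```python
def surrogate_from_weight(weight, p: float, steps: int = 10_000) -> float:
    """Recover the surrogate F(p) by integrating weight(q) over [0, p]
    with the midpoint rule."""
    h = p / steps
    return sum(weight((i + 0.5) * h) for i in range(steps)) * h


def down_weight(q: float) -> float:
    """Hypothetical rule (an assumption of this sketch): scale the
    gradient by the failure probability 1 - q."""
    return 1.0 - q


def closed_form(p: float) -> float:
    """F(p) = p - p^2 / 2: success rate plus a penalty on high p."""
    return p - 0.5 * p * p


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        numeric = surrogate_from_weight(down_weight, p)
        print(f"p={p:.1f}  numeric={numeric:.6f}  closed_form={closed_form(p):.6f}")
```

Any other weight function can be passed to `surrogate_from_weight` to see which reward it implicitly optimizes.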
The research also delves into practical considerations, discussing the trade-offs between biased and unbiased gradient estimates and the role of normalization factors. It highlights that while unbiasedness is often desirable, biased scalings can be beneficial in certain scenarios, especially when computational resources are limited or when only a small number of responses is generated per problem.
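The paper's discussion concerns gradient estimators, but the same bias question already appears when simply estimating Pass@K from a handful of samples, which makes for an easy illustration. The sketch below compares a naive plug-in estimate, which substitutes the observed success fraction into 1 - (1 - p)^K, with the combinatorial estimator, computing each one's exact expected value over all possible sample outcomes. The setup (independent attempts, the chosen p and K, and the function names) is an assumption of this example.

```python
from math import comb


def binom_pmf(n: int, c: int, p: float) -> float:
    """Probability of exactly c successes in n independent attempts."""
    return comb(n, c) * p**c * (1.0 - p) ** (n - c)


def plug_in(n: int, c: int, k: int) -> float:
    """Biased: substitutes the observed success fraction c / n for p."""
    return 1.0 - (1.0 - c / n) ** k


def combinatorial(n: int, c: int, k: int) -> float:
    """Unbiased: fraction of size-k subsets with at least one success."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def bias(estimator, n: int, k: int, p: float) -> float:
    """Exact expected estimate minus the true Pass@K value."""
    expected = sum(binom_pmf(n, c, p) * estimator(n, c, k) for c in range(n + 1))
    return expected - (1.0 - (1.0 - p) ** k)


if __name__ == "__main__":
    p, k = 0.2, 4
    for n in (4, 8, 32):
        print(f"n={n:2d}  plug-in bias={bias(plug_in, n, k, p):+.4f}  "
              f"combinatorial bias={bias(combinatorial, n, k, p):+.1e}")
```

The plug-in estimate systematically underestimates Pass@K, and the gap is largest when few responses are sampled, which is the regime where the choice between biased and unbiased scalings matters most.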
In conclusion, this paper offers a unified framework for understanding and developing policy gradient methods for reinforcement learning with verifiable rewards. It establishes a clear equivalence between advantage shaping and surrogate reward maximization, providing a powerful new lens for designing more effective and stable AI training algorithms.


