TLDR: This research introduces Posterior-GRPO (P-GRPO), a reinforcement learning framework for large language models (LLMs) that rewards the quality of the intermediate reasoning process in addition to the final code outcome. To enable this, the authors developed LCB-RB, a new benchmark for evaluating how well reward models judge reasoning, and an Optimized-Degraded (OD-based) method for training a reasoning-specific reward model. P-GRPO mitigates ‘reward hacking’ by applying reasoning rewards only when the final code is correct, keeping the model’s reasoning aligned with functional correctness. The approach improves code generation performance, outperforming outcome-only baselines and generalizing to mathematical tasks.
Large Language Models (LLMs) have made significant strides in generating code, largely thanks to advancements in reinforcement learning (RL). However, a common limitation of current approaches is that they rely solely on the final outcome, such as whether the generated code passes all tests. This overlooks the quality of the intermediate reasoning process that leads to the code.
A new research paper, Posterior-GRPO: Rewarding Reasoning Processes in Code Generation, introduces a unified framework designed to integrate the quality of the reasoning process into the reinforcement learning paradigm. This aims to ensure that LLMs not only produce correct code but also arrive at it through sound and logical thinking.
Addressing Key Challenges in LLM Training
The researchers identified three primary challenges in incorporating reasoning quality into RL for code generation. Firstly, there was a lack of suitable benchmarks to evaluate how well reward models could distinguish between good and bad reasoning processes. Existing benchmarks often focused on the final solution rather than the thought process.
Secondly, reliable reward models specifically designed for evaluating reasoning were missing. While some models could assess code quality, the semantic difference between natural language reasoning and code structure meant direct application was suboptimal.
Finally, a significant hurdle was ‘reward hacking,’ where policy models learn to exploit the reward signal for reasoning without actually improving the final code outcomes. In other words, a model might generate reasoning that scores highly but still leads to incorrect or suboptimal code.
Introducing LCB-RB and the OD-based Method
To tackle the first two challenges, the paper introduces LCB-RB, a new benchmark derived from LiveCodeBench. This benchmark consists of preference pairs, each containing a superior and an inferior reasoning process. To train a reward model that can accurately score reasoning quality, the authors developed the Optimized-Degraded (OD-based) method.
The OD-based method involves using a powerful LLM to generate an initial reasoning process. This initial reasoning is then systematically optimized and degraded along specific dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. By training on these inherently contrasting pairs, the reward model learns to effectively differentiate between high-quality and low-quality reasoning patterns. A 7B parameter reward model trained with this method achieved state-of-the-art performance on LCB-RB and showed strong generalization to other benchmarks.
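To make the idea of preference pairs concrete, here is a minimal sketch of how a reward model could be checked against such pairs. The `PreferencePair` class, the `pairwise_accuracy` metric, and the `toy_score` scorer are all illustrative assumptions, not the paper's data format or evaluation code; in practice the scorer would be the trained reward model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one benchmark item."""
    problem: str
    chosen: str    # higher-quality (optimized) reasoning
    rejected: str  # lower-quality (degraded) reasoning


def pairwise_accuracy(
    pairs: Sequence[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer ranks `chosen` strictly above `rejected`."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    wins = sum(
        1 for p in pairs
        if score(p.problem, p.chosen) > score(p.problem, p.rejected)
    )
    return wins / len(pairs)


def toy_score(problem: str, reasoning: str) -> float:
    # Placeholder scorer (assumption): counts mentions of "edge case".
    # A real setup would call the trained reward model here.
    return float(reasoning.lower().count("edge case"))
```

A scorer that reliably prefers the optimized reasoning over its degraded counterpart would approach an accuracy of 1.0 under this metric.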
Posterior-GRPO: A Novel RL Algorithm
To combat reward hacking, the researchers propose Posterior-GRPO (P-GRPO), a novel reinforcement learning algorithm. P-GRPO conditions process-based rewards on task success. This means that the model is only incentivized for superior reasoning paths when its final code outcome is correct (i.e., passes all test cases). If the code is incorrect, the thinking reward is set to zero, preventing the model from exploiting the reasoning reward signal without achieving functional correctness.
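The gating rule described above can be sketched in a few lines. The function name `gated_thinking_reward` and its parameters are hypothetical, chosen for illustration; the only behavior taken from the text is that the thinking reward counts when all test cases pass and is zero otherwise.

```python
def gated_thinking_reward(thinking_score: float, passed: int, total: int) -> float:
    """Return the reasoning reward only when every test case passes.

    `thinking_score` is assumed to come from a separate reward model;
    `passed` and `total` are test-case counts for the generated code.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    # Incorrect code earns no credit for its reasoning, however well it scores.
    return thinking_score if passed == total else 0.0
```

Under this rule a model cannot raise its reward by producing persuasive reasoning alone, because the reasoning score is discarded whenever the code fails.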
P-GRPO integrates three types of rewards: a format reward (ensuring output structure), a rule-based reward (based on test case pass rates), and the thinking reward from the newly trained reward model. Because the thinking reward is gated on correctness, the model’s optimization stays aligned with both reasoning quality and final code correctness. The design also improves data utilization: under outcome-only rewards, a batch in which every sample is functionally correct receives identical rewards and therefore provides no relative learning signal, whereas with P-GRPO those samples can still differ in reasoning quality and so still yield meaningful gradient signals.
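The data-utilization point can be illustrated with a small sketch. It assumes the three rewards are simply summed with equal weight (the article does not give the exact combination) and uses the standard GRPO-style group normalization, where each sample's advantage is its reward minus the group mean, divided by the group standard deviation. The names `total_reward` and `group_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def total_reward(format_ok: bool, passed: int, total: int, thinking_score: float) -> float:
    """Sum of format, rule-based, and gated thinking rewards (equal weights assumed)."""
    r_format = 1.0 if format_ok else 0.0
    r_rule = passed / total                              # test-case pass rate
    r_think = thinking_score if passed == total else 0.0  # gated on correctness
    return r_format + r_rule + r_think


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three samples for one prompt, all well-formatted and passing 3/3 tests.
outcome_only = [total_reward(True, 3, 3, 0.0) for _ in range(3)]
with_thinking = [total_reward(True, 3, 3, s) for s in (0.2, 0.5, 0.8)]

flat = group_advantages(outcome_only)     # identical rewards: every advantage is zero
ranked = group_advantages(with_thinking)  # rewards differ: advantages follow reasoning quality
```

With identical outcome rewards the advantages collapse to zero, so the group contributes nothing to the update; once the thinking scores are included, the same group produces a ranking the policy can learn from.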
Impressive Results Across Domains
The effectiveness of P-GRPO was demonstrated across various code generation benchmarks, including HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench. A 7B parameter model using P-GRPO showed a significant average improvement of 13.9% over the base model and surpassed outcome-only reward baselines by 4.5%, achieving performance comparable to GPT-4-Turbo.
The research also highlighted P-GRPO’s generalizability by extending it to mathematical tasks. On mathematical benchmarks like MATH500, Minerva Math, and AIME 2024, P-GRPO achieved a 7.3% relative improvement over outcome-only reward baselines, further validating its ability to enhance reasoning capabilities across different domains.
In essence, Posterior-GRPO represents a significant step forward in training LLMs for code generation and mathematical reasoning. By explicitly rewarding the quality of the thinking process, conditioned on successful outcomes, it fosters models that not only produce correct answers but also derive them through robust and logical reasoning.


