TLDR: TGPR (Tree-Guided Policy Refinement) is a new framework that improves how Large Language Models (LLMs) debug their own code. It combines the GRPO reinforcement learning algorithm with a Thompson Sampling-guided tree search used during training. This strategic exploration lets LLMs learn from both successful and failed code refinement paths, yielding more robust debugging policies. TGPR posts substantial gains on HumanEval, MBPP, and APPS, while reducing error rates across several categories of code errors.
Large Language Models (LLMs) have shown incredible potential in generating code, but they often struggle with complex programming tasks, leading to bugs and errors in their initial outputs. This highlights the crucial need for iterative refinement, where LLMs can progressively debug and improve their code based on feedback. However, navigating the vast possibilities of code repairs efficiently has been a significant challenge for existing methods.
Traditional approaches often rely on fixed rules or standard reinforcement learning algorithms, which tend to explore too narrowly and miss better solutions. At heart, they struggle with the exploration-exploitation dilemma: how to balance trying new, uncertain fixes against refining known, partially working ones.
Introducing Tree-Guided Policy Refinement (TGPR)
A new framework called Tree-Guided Policy Refinement (TGPR) has emerged to tackle this problem. Developed by Daria Ozerova and Ekaterina Trofimova, TGPR introduces a novel way to enhance LLMs’ self-debugging capabilities. It combines a powerful reinforcement learning algorithm called Group Relative Policy Optimization (GRPO) with a clever search strategy known as Thompson Sampling-guided tree search.
The core innovation of TGPR is that this tree search isn’t used during the final deployment of the LLM (which would be too slow). Instead, it acts as a sophisticated data augmentation engine during the *training* phase. By strategically exploring both successful and failed code refinement paths, TGPR generates a much richer and more diverse set of learning experiences for the LLM.
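To make that concrete, here is a minimal sketch of what training-time exploration could look like. Everything here is illustrative, not the paper's actual API: `propose` stands in for the LLM proposing a repair, and `score` for the reward function described below.

```python
import random

def explore_refinements(prompt, buggy_code, propose, score, depth=4, width=3):
    """Collect diverse refinement trajectories for training.

    propose(prompt, code) -> a candidate fix (the LLM policy)
    score(code)           -> a reward in [0, 1]
    Both are placeholder interfaces, not the paper's API.
    """
    trajectories, frontier = [], [buggy_code]
    for _ in range(depth):
        # Stand-in for the Thompson Sampling choice sketched below:
        code = random.choice(frontier)
        for _ in range(width):
            fix = propose(prompt, code)
            # Keep failed paths too: they are training signal, not waste.
            trajectories.append((fix, score(fix)))
            frontier.append(fix)
    return trajectories
```

The key design point is that low-reward fixes are appended to the dataset rather than discarded, so the policy also learns what dead ends look like.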
How TGPR Works
Imagine the process of debugging code as navigating a vast tree, where each branch represents a different possible code fix. TGPR uses Thompson Sampling, a principled statistical method, to guide this exploration. Each potential code fix (a 'node' in the tree) carries a posterior estimate of how promising it is; by sampling from those estimates, the search decides whether to explore a less-visited but potentially high-reward path or to exploit a known good solution.
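A minimal sketch of that decision rule, assuming each node keeps a Beta posterior over its chance of eventually leading to passing code (the Beta parameterization is the common choice for Thompson Sampling, not necessarily the paper's exact statistics):

```python
import random

class RefinementNode:
    """One candidate code fix in the refinement tree (illustrative)."""
    def __init__(self, code):
        self.code = code
        self.alpha = 1.0   # Beta prior: pseudo-count of successes
        self.beta = 1.0    # Beta prior: pseudo-count of failures
        self.children = []

def select_child(node):
    # Thompson Sampling: draw a plausible success rate for each child
    # from its Beta posterior, then follow the most optimistic draw.
    return max(node.children,
               key=lambda c: random.betavariate(c.alpha, c.beta))

def update(node, reward):
    # Treat a reward in [0, 1] as soft evidence of success or failure.
    node.alpha += reward
    node.beta += 1.0 - reward
```

Rarely visited nodes keep wide posteriors, so they occasionally win the draw and get explored; well-tested nodes concentrate around their observed value and get exploited. That is the exploration-exploitation balance in a few lines.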
To provide meaningful feedback, TGPR uses a custom reward function. This function doesn’t just give a simple pass/fail signal; it combines a measure of how similar the generated code is to a correct reference solution (CodeBLEU) with the number of unit tests the code successfully passes. This ‘dense’ reward signal helps the LLM understand its progress even when the code isn’t perfectly correct yet.
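A sketch of such a hybrid reward, under assumptions: `similarity_fn` stands in for a CodeBLEU scorer (e.g. a library implementation), and the 50/50 weighting is illustrative rather than the paper's actual mix.

```python
def passes(code: str, test: str) -> bool:
    """Run one unit test against the candidate; any exception = failure."""
    try:
        scope = {}
        exec(code, scope)   # define the candidate function(s)
        exec(test, scope)   # e.g. "assert add(2, 3) == 5"
        return True
    except Exception:
        return False

def hybrid_reward(code, reference, tests, similarity_fn, w=0.5):
    """Blend code similarity with the unit-test pass rate to get a
    dense signal: partially correct code still earns partial credit."""
    sim = similarity_fn(code, reference)                  # in [0, 1]
    rate = sum(passes(code, t) for t in tests) / max(len(tests), 1)
    return w * sim + (1 - w) * rate
```

Because the similarity term moves smoothly even when every test fails, the agent gets a learning signal long before it produces its first fully correct fix.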
By learning from these strategically explored paths, the GRPO agent develops a more robust and generalizable debugging strategy. It internalizes the lessons from the structured search process, allowing it to make more confident and effective decisions during actual test-time inference.
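The 'group relative' part of GRPO is what makes this practical: instead of training a separate value critic, each sampled refinement is scored against the other samples in its own group. A minimal sketch of that advantage computation (following GRPO's standard formulation, not code from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each completion's reward
    against the mean and spread of its own sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled fixes for the same bug, scored by a hybrid reward.
print(group_relative_advantages([0.2, 0.5, 0.5, 0.8]))
# Fixes above the group mean get positive advantages, below it negative.
```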
Impressive Results
The researchers evaluated TGPR on challenging code generation benchmarks like HumanEval, MBPP, and APPS. The results were significant. Compared to a strong GRPO baseline, TGPR achieved substantial improvements:
- On MBPP, TGPR boosted both pass@1 (the first attempt passes) and pass@10 (at least one of ten sampled attempts passes; pass@k is sketched after this list) by 4.2 percentage points.
- On HumanEval, pass@1 improved by 2.7 percentage points and pass@10 by 6.2 percentage points.
- On APPS, which features more complex test cases, TGPR saw a pass@1 improvement of 3.8 percentage points and an impressive pass@10 gain of 12.51 percentage points.
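For completeness, pass@k is conventionally computed with the unbiased estimator from the Codex evaluation protocol (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate the probability that a random subset of k contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).

    n = samples generated per problem, c = samples that passed.
    """
    if n - c < k:   # every size-k subset contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples, 3 correct -> pass@1 = 0.15, pass@10 ≈ 0.89
print(pass_at_k(20, 3, 1), pass_at_k(20, 3, 10))
```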
Furthermore, an error analysis showed that TGPR consistently achieved the lowest error rates across various categories, including algorithmic design flaws, semantic errors, and performance errors. This indicates that TGPR enables LLMs to produce solutions that are not only more functionally correct but also more robust and capable of handling stricter test cases.
Conclusion
TGPR represents a significant step forward in enabling LLMs to self-debug code more effectively. By integrating a Thompson Sampling-guided tree search for strategic exploration during training, coupled with a custom hybrid reward design, TGPR empowers LLMs to learn from a diverse set of refinement experiences. This principled approach paves the way for more autonomous and proficient LLM-powered code generation, tackling complex reasoning tasks with greater resilience. You can read the full research paper here: TGPR: Tree-Guided Policy Refinement for Robust Self-Debugging of LLMs.