TLDR: TGPO (Tree-Guided Preference Optimization) is an offline reinforcement learning framework designed to improve Web Agent training. It addresses common challenges like credit assignment misallocation, high annotation costs, and reward sparsity by using a tree-structured trajectory representation to merge semantically identical states, an automated Process Reward Model for fine-grained rewards, and a dynamic weighting mechanism to prioritize critical decision points. Experiments show TGPO significantly outperforms existing methods, achieving higher success rates and fewer redundant actions on various web interaction tasks.
The rapid evolution of large language models (LLMs) and vision-language models (VLMs) has made them indispensable for creating automated Web Agents. These agents are designed to interact with websites, translating natural language instructions into actions like clicking and typing, all while understanding web semantics and making decisions in dynamic online environments.
However, training these Web Agents using reinforcement learning (RL) presents significant hurdles. Key challenges include misallocating credit for actions (where good actions in a failed sequence are penalized), the prohibitively high cost of manually annotating data for training, and the sparsity of reward signals, which can lead agents to learn inefficient behaviors with many redundant steps.
To tackle these issues, researchers have introduced Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework. TGPO represents agent trajectories as a tree that merges semantically identical states across different interaction paths, which eliminates conflicts in how actions are labeled. For example, if the same action is taken in an identical web state but leads to a success in one attempt and a failure in another, labeling each trajectory separately would mark that action as both good and bad. Because the tree places both attempts under the same node, this ambiguity can be resolved.
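To make the merging idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each web state can be reduced to a comparable key (the strings below are made up), and it only merges states that share a common prefix of actions. The names `Node` and `build_tree` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    state: str                                    # canonical key for a web state (hypothetical)
    children: dict = field(default_factory=dict)  # (action, next_state) -> Node
    visits: int = 0                               # trajectories passing through this node
    successes: int = 0                            # how many of those ended in success


def build_tree(trajectories):
    """Merge trajectories that share a start state into one tree.

    Each trajectory is (states, actions, success) with
    len(states) == len(actions) + 1. Two states count as identical
    when their keys are equal, which is an assumption of this sketch.
    """
    root = Node(state=trajectories[0][0][0])
    for states, actions, success in trajectories:
        node = root
        node.visits += 1
        node.successes += int(success)
        for action, next_state in zip(actions, states[1:]):
            key = (action, next_state)
            # Reuse the existing child if this (action, state) was seen before.
            node = node.children.setdefault(key, Node(state=next_state))
            node.visits += 1
            node.successes += int(success)
    return root


# Two attempts share the first step, then diverge.
trajectories = [
    (["home", "results", "cart"], ["search shoes", "add to cart"], True),
    (["home", "results", "help"], ["search shoes", "open help"], False),
]
root = build_tree(trajectories)
shared = root.children[("search shoes", "results")]
print(shared.visits, shared.successes, len(shared.children))
```

In this toy example the shared "search shoes" step is stored once, with two visits and one success, instead of being labeled positive in one trajectory and negative in the other. The two diverging children are where a preference between actions can actually be read off.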
A core component of TGPO is its Process Reward Model (PRM), which automatically generates detailed, fine-grained rewards. This model evaluates progress towards subgoals, detects and penalizes redundant actions or cycles, verifies the effectiveness of actions, and ensures actions comply with syntax requirements. By combining these reward dimensions, TGPO provides a much richer feedback signal than traditional methods, reducing the need for costly manual annotation.
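One simple way to picture how the four reward dimensions could be combined is a weighted sum per step. The sketch below is an illustration only: the `StepSignals` fields, the weights, and the additive form are all assumptions, and the article does not state how TGPO actually combines them.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    subgoal_progress: float  # assumed to lie in [0, 1]
    is_redundant: bool       # repeated action or cycle detected
    action_effective: bool   # the action changed the page as intended
    syntax_valid: bool       # the action string parsed correctly


def step_reward(sig, w_progress=1.0, w_redundant=0.5, w_effective=0.5, w_syntax=0.5):
    """Combine the four signals into one scalar (weights are illustrative)."""
    reward = w_progress * sig.subgoal_progress
    if sig.is_redundant:
        reward -= w_redundant
    reward += w_effective if sig.action_effective else -w_effective
    reward += w_syntax if sig.syntax_valid else -w_syntax
    return reward


useful_step = StepSignals(0.5, is_redundant=False, action_effective=True, syntax_valid=True)
wasted_step = StepSignals(0.0, is_redundant=True, action_effective=False, syntax_valid=True)
print(step_reward(useful_step), step_reward(wasted_step))
```

Even with these made-up weights, a step that advances a subgoal scores well above a well-formed but redundant one, which is the kind of per-step distinction a sparse end-of-task reward cannot provide.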
Furthermore, TGPO incorporates a dynamic weighting mechanism during training. Unlike standard preference optimization methods that treat all decision points equally, TGPO prioritizes high-impact decision points where the difference in potential rewards between chosen and rejected actions is significant. This allows the agent to focus its learning on the most critical choices, leading to more efficient and robust policies.
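The weighting idea can be sketched on top of the standard DPO pairwise loss. The weight function below, `1 + alpha * |reward gap|`, is an assumption chosen for illustration, since the article does not give TGPO's formula. The dictionary keys and function names are hypothetical as well.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def pair_weight(reward_chosen, reward_rejected, alpha=1.0):
    """Larger reward gap means a larger weight (illustrative form, not from the paper)."""
    return 1.0 + alpha * abs(reward_chosen - reward_rejected)


def weighted_preference_loss(pairs, alpha=1.0, beta=0.1):
    """Weighted average of per-pair DPO losses."""
    total = 0.0
    weight_sum = 0.0
    for p in pairs:
        w = pair_weight(p["reward_chosen"], p["reward_rejected"], alpha)
        loss = dpo_pair_loss(p["logp_chosen"], p["logp_rejected"],
                             p["ref_chosen"], p["ref_rejected"], beta)
        total += w * loss
        weight_sum += w
    return total / weight_sum


pairs = [
    # Critical decision: large reward gap, and the policy currently prefers the wrong action.
    {"logp_chosen": -2.0, "logp_rejected": -1.0, "ref_chosen": 0.0, "ref_rejected": 0.0,
     "reward_chosen": 2.0, "reward_rejected": 0.0},
    # Minor decision: no reward gap, and the policy already prefers the chosen action.
    {"logp_chosen": -1.0, "logp_rejected": -2.0, "ref_chosen": 0.0, "ref_rejected": 0.0,
     "reward_chosen": 0.5, "reward_rejected": 0.5},
]
weighted = weighted_preference_loss(pairs, alpha=1.0, beta=1.0)
uniform = weighted_preference_loss(pairs, alpha=0.0, beta=1.0)
print(weighted, uniform)
```

Setting `alpha=0.0` recovers the uniform treatment of standard preference optimization. With `alpha=1.0`, the mis-ranked high-gap pair contributes three times as much as the low-stakes one, so it dominates the averaged loss.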
The effectiveness of TGPO was rigorously tested on two benchmarks: Online-Mind2Web and a newly constructed C-WebShop dataset. Using models like Qwen3-14B and Qwen2.5-VL-72B, TGPO consistently outperformed existing methods, including SFT, KTO, and DPO. On the Online-Mind2Web benchmark, TGPO achieved a 38.4% success rate and the shortest average trajectory length of 10.71 steps, even surpassing the performance of the closed-source model GPT-4o. It also significantly reduced redundant steps, demonstrating its ability to optimize action sequences.
Similarly, on the C-WebShop dataset, TGPO maintained its superior performance with a 78.6% success rate, drastically cutting down average steps and nearly eliminating redundant actions. These results highlight the framework’s robustness across diverse web interaction scenarios.
An ablation study further confirmed the importance of TGPO’s key components. The tree structure proved highly effective in resolving label conflicts, which were present in a substantial percentage of raw trajectories. The fine-grained reward system and dynamic weighting mechanism also contributed significantly to the improved success rates and reduced redundant actions, enabling the model to learn more efficient execution paths.
In conclusion, TGPO offers a powerful solution to the long-standing challenges in Web Agent training. By integrating a tree-structured trajectory representation, an automated process reward model, and adaptive weighting, it achieves higher success rates and greater efficiency. This approach holds promise for broader applications beyond web agents, including GUI interactions and gaming environments. For more details, refer to the full research paper.


