TLDR: Graph-Enhanced Policy Optimization (GEPO) is a new framework that helps LLM agents overcome “structural blindness” in complex environments with sparse rewards. It dynamically builds a state-transition graph, using graph centrality to identify important states. This information is used to create structured intrinsic rewards, a graph-enhanced advantage function, and a dynamic discount factor, leading to more efficient exploration, precise credit assignment, and farsighted planning. GEPO achieved significant success rate improvements on benchmarks like ALFWorld, WebShop, and Workbench.
Large Language Models (LLMs) are becoming increasingly capable, moving beyond simple text tasks to act as interactive agents in dynamic environments. Imagine an AI agent navigating a virtual home, browsing the web to find a product, or operating a business dashboard. While these agents are impressive, training them is tough, especially when rewards are sparse, meaning feedback arrives only after a long sequence of actions. Traditional reinforcement learning (RL) methods struggle here, often remaining "structurally blind" to the environment's layout.
This “structural blindness” means agents can’t understand the underlying connections or topology of their environment. For example, in a virtual home, a hallway might be a critical bottleneck connecting many rooms, but a standard agent might not recognize its strategic importance. This leads to three main problems: inefficient exploration (the agent doesn’t know which states are important), imprecise credit assignment (it’s hard to tell which actions led to a success when feedback is rare), and myopic planning (the agent undervalues future rewards, even in crucial states).
To tackle these challenges, researchers from JD.com have introduced a novel framework called Graph-Enhanced Policy Optimization (GEPO). GEPO dynamically builds a state-transition graph from the agent’s experiences. Think of this as the agent creating a mental map of its environment as it explores. This map isn’t static; it grows and adapts online, providing an increasingly comprehensive view of the environment’s structure.
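To make the idea concrete, here is a minimal sketch (not the authors' implementation) of maintaining such a map online with networkx; the string state identifiers and the trajectory format are illustrative assumptions:

```python
# Minimal sketch: growing a state-transition graph online from trajectories.
# State names and the trajectory format are illustrative, not from the paper.
import networkx as nx

def update_transition_graph(graph: nx.DiGraph, trajectory: list[str]) -> nx.DiGraph:
    """Record each observed (state -> next_state) transition, with visit
    counts on edges, so the map densifies as the agent explores."""
    for state, next_state in zip(trajectory, trajectory[1:]):
        if graph.has_edge(state, next_state):
            graph[state][next_state]["count"] += 1
        else:
            graph.add_edge(state, next_state, count=1)
    return graph

# Two exploration episodes in a virtual home:
G = nx.DiGraph()
update_transition_graph(G, ["kitchen", "hallway", "bedroom"])
update_transition_graph(G, ["kitchen", "hallway", "bathroom"])
```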
The core of GEPO lies in using graph-theoretic centrality measures, specifically betweenness centrality, to understand the strategic importance of different states and transitions. Betweenness centrality helps identify “bottleneck” states – those that lie on many shortest paths between other states. These are often the pivotal points an agent must traverse to achieve its goals.
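On such a graph, betweenness centrality can be computed directly. In the toy map below, the hallway, which every room-to-room shortest path passes through, scores highest; this is a sketch of the idea, not the paper's exact computation:

```python
# Sketch: betweenness centrality flags "bottleneck" states on a toy map.
import networkx as nx

G = nx.DiGraph([
    ("kitchen", "hallway"), ("hallway", "kitchen"),
    ("hallway", "bedroom"), ("bedroom", "hallway"),
    ("hallway", "bathroom"), ("bathroom", "hallway"),
])
centrality = nx.betweenness_centrality(G)  # state -> normalized score
print(max(centrality, key=centrality.get))  # -> "hallway"
```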
Structured Intrinsic Rewards
Instead of waiting for a sparse external reward, GEPO provides an “intrinsic reward” for visiting high-impact states or traversing critical transitions. This transforms the sparse reward problem into a denser one, guiding the agent to explore strategically important areas faster.
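As a rough sketch, the bonus might scale with the visited state's centrality score; the coefficient `beta` and the linear shaping rule below are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch: densifying sparse rewards with a centrality-based bonus.
# `beta` and the linear rule are assumptions for illustration.
def intrinsic_reward(state: str, centrality: dict[str, float], beta: float = 0.1) -> float:
    """Bonus proportional to the state's betweenness centrality."""
    return beta * centrality.get(state, 0.0)

def total_step_reward(extrinsic: float, state: str, centrality: dict[str, float]) -> float:
    # Sparse task reward + dense structural guidance.
    return extrinsic + intrinsic_reward(state, centrality)
```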
Graph-Enhanced Advantage Function
GEPO refines how credit is assigned to actions. It doesn’t just look at the final outcome of a trajectory but also considers how many pivotal states were visited. This means an agent gets partial credit for visiting crucial regions, even if it doesn’t fully succeed, making credit assignment more precise and topology-aware.
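One plausible way to realize this, sketched below, is to blend the outcome-based advantage with the trajectory's coverage of high-centrality states; `threshold` and `lam` are hypothetical knobs, not values from the paper:

```python
# Sketch: topology-aware advantage. A trajectory earns partial credit for
# visiting pivotal (high-centrality) states even when the episode fails.
# `threshold` and `lam` are illustrative hyperparameters.
def graph_enhanced_advantage(base_advantage: float,
                             trajectory: list[str],
                             centrality: dict[str, float],
                             threshold: float = 0.2,
                             lam: float = 0.5) -> float:
    pivotal = sum(1 for s in trajectory if centrality.get(s, 0.0) >= threshold)
    coverage = pivotal / max(len(trajectory), 1)
    return base_advantage + lam * coverage
```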
Dynamic Discount Factor
Normally, future rewards are discounted at a fixed rate. However, GEPO introduces a dynamic discount factor that adapts based on the strategic value of the current state. If an agent enters a highly central state, the discount factor increases, making the agent more “farsighted” and encouraging it to plan further ahead for long-term goals.
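A minimal sketch of such a state-dependent discount follows; the linear interpolation rule and the bounds `gamma_min`/`gamma_max` are assumptions for illustration:

```python
# Sketch: discount factor that rises with the current state's centrality,
# so the agent weighs long-term rewards more heavily at strategic junctions.
# The linear interpolation and its bounds are illustrative assumptions.
def dynamic_gamma(state: str, centrality: dict[str, float],
                  gamma_min: float = 0.95, gamma_max: float = 0.999) -> float:
    c = centrality.get(state, 0.0)  # betweenness score in [0, 1]
    return gamma_min + (gamma_max - gamma_min) * c

def discounted_return(rewards: list[float], states: list[str],
                      centrality: dict[str, float]) -> float:
    total, discount = 0.0, 1.0
    for r, s in zip(rewards, states):
        total += discount * r
        discount *= dynamic_gamma(s, centrality)
    return total
```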
The researchers put GEPO to the test on three challenging benchmarks: ALFWorld (navigating virtual homes), WebShop (browsing e-commerce websites), and a proprietary Workbench dataset (operating a business dashboard). The results were impressive. GEPO consistently outperformed competitive baselines, achieving absolute success rate gains of +4.1% on ALFWorld, +5.3% on WebShop, and a significant +10.9% on the Workbench benchmark. These gains were observed across different LLM sizes (1.5B and 7B models), highlighting GEPO’s robustness and generalizability.
An ablation study further confirmed the synergistic nature of GEPO’s components. Removing any single component led to a performance drop, but removing two components simultaneously resulted in a much larger, “super-additive” decline, indicating that these mechanisms work together in a powerful way.
The choice of centrality measure also proved important, with betweenness centrality yielding the best performance. This makes sense, as betweenness directly identifies the critical “bridge” nodes that are essential for navigating complex environments.
In conclusion, GEPO offers a robust and generalizable strategy for advancing LLM agent training by explicitly modeling environmental structure. By dynamically building a state-transition graph and leveraging graph-theoretic insights, GEPO helps agents overcome structural blindness, leading to more efficient exploration, precise credit assignment, and farsighted planning in complex, sparse-reward tasks. For more details, you can read the full research paper here: Graph-Enhanced Policy Optimization in LLM Agent Training.