TLDR: This research introduces SUPO, a new reinforcement learning framework that enables large language models (LLMs) to handle complex, multi-turn tasks beyond their fixed context limits. It achieves this by teaching LLMs to generate intelligent summaries of past interactions, keeping the context compact and relevant. Experiments show SUPO significantly improves task success rates on function calling and searching tasks, demonstrating a scalable approach for long-horizon agent training.
Large Language Models (LLMs) have shown incredible potential as problem-solvers, capable of understanding natural language, generating structured outputs, and interacting with external tools. However, when these powerful AI agents are tasked with complex, multi-turn problems that require many steps or interactions, they often hit a fundamental roadblock: their limited context window.
Imagine an LLM agent trying to solve a long-running puzzle. As it makes more moves and gathers more information, the “history” of its actions and observations grows. This ever-expanding history quickly fills up its working memory, leading to several challenges. Firstly, the LLM’s ability to follow instructions and reason effectively can degrade when dealing with very long contexts. Secondly, processing these extensive histories becomes computationally expensive, slowing down the learning process. Most critically, the fixed size of an LLM’s context window fundamentally limits how far into a task it can go, preventing it from tackling problems that require more interactions than can fit into its memory.
To overcome this scalability barrier, researchers have introduced a novel approach called summarization-based context management. This method allows LLM agents to scale their operations beyond a fixed context length by periodically compressing their past interactions into concise, LLM-generated summaries. Instead of letting the context grow indefinitely, the agent’s working memory is regularly refreshed with a compact, yet informative, summary of what has happened so far. Crucially, these summaries are not pre-defined or based on rigid rules; instead, the LLM agent learns how to generate them as part of its training, optimizing what information to keep, how to abstract it, and what details to discard as irrelevant.
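The loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the callables `act`, `step`, and `summarize`, the character-based length budget, and the `[summary]` marker are all hypothetical stand-ins. In the actual method the same trained LLM presumably plays both the acting and the summarizing role, and length would be measured in tokens.

```python
from typing import Callable, List, Tuple


def rollout_with_summaries(
    task_prompt: str,
    act: Callable[[str], str],                # hypothetical: working context -> action
    step: Callable[[str], Tuple[str, bool]],  # hypothetical: action -> (observation, done)
    summarize: Callable[[str], str],          # hypothetical: working context -> summary
    max_context_chars: int,
    max_steps: int,
) -> List[str]:
    """Run one episode and return the context of each segment.

    Whenever the working context exceeds the budget, it is replaced by
    the task prompt plus a summary of everything seen so far.
    """
    context = task_prompt
    segments: List[str] = []
    for _ in range(max_steps):
        action = act(context)
        observation, done = step(action)
        context += f"\n{action}\n{observation}"
        if done:
            break
        if len(context) > max_context_chars:
            # Close the current segment and start a fresh, compact one.
            segments.append(context)
            summary = summarize(context)
            context = f"{task_prompt}\n[summary] {summary}"
    segments.append(context)
    return segments
```

Each returned segment starts from a short context, so the number of environment steps is no longer bounded by what fits in a single context window, only by how many summarization rounds are allowed.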
This innovative idea is formalized through a “summarization-augmented Markov Decision Process” (MDP), which integrates summarization steps directly into the agent’s decision-making process. This framework allows for a policy gradient representation, meaning that existing reinforcement learning (RL) systems can be seamlessly adapted to train these agents. The result is an end-to-end optimization process that improves both the agent’s ability to use tools and its strategy for summarizing information.
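To make the policy gradient claim concrete, here is a sketch in notation that is assumed for illustration and not taken from the paper. Suppose a rollout \tau splits into I segments, where segment i starts from the prompt plus the previous summary and ends when the agent writes the next summary (or the final answer). Let a_t^{(i)} be the t-th action in segment i, including the summary itself, let s_t^{(i)} be the working context at that point, and let R(\tau) be the task reward. Because the environment transitions do not depend on the policy parameters \theta, the standard REINFORCE identity gives a sum over segments:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      R(\tau) \sum_{i=1}^{I} \sum_{t=1}^{T_i}
      \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \,\middle|\, s_t^{(i)}\right)
    \right]
```

Each inner sum looks like the gradient term of an ordinary, short rollout, which is why an existing RL training pipeline could plausibly treat every segment as its own training sample that shares the outcome of the full rollout.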
The researchers instantiated this framework with an algorithm named SUmmarization augmented Policy Optimization (SUPO). SUPO jointly optimizes the agent’s tool-use behaviors and its summarization strategies. Its key design elements include a scheme for managing trajectories (the sequences of actions and observations), an advantage estimation method that helps stabilize learning, and a mechanism that masks “overlong” trajectories so that the model is not penalized for attempting longer, more complex tasks.
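The article does not spell out the advantage estimator or the masking rule, so the following is only a sketch under stated assumptions: a group-relative baseline (rewards normalized within a group of rollouts for the same task, as in GRPO-style methods) and a boolean flag marking rollouts that ran out of budget before finishing. The function name and the choice to compute the baseline over unmasked rollouts only are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(
    rewards: List[float],
    overlong: List[bool],
    eps: float = 1e-6,
) -> List[Optional[float]]:
    """Return one advantage per rollout; None marks a masked rollout.

    Assumption: the baseline uses only rollouts that were not cut off,
    and masked rollouts contribute no gradient at all.
    """
    kept = [r for r, o in zip(rewards, overlong) if not o]
    if not kept:
        return [None] * len(rewards)
    mu = mean(kept)
    sigma = pstdev(kept)  # population standard deviation of the kept rewards
    return [
        None if o else (r - mu) / (sigma + eps)
        for r, o in zip(rewards, overlong)
    ]
```

In a training loop, every segment of a rollout would reuse that rollout's single advantage value, and segments whose advantage is `None` would simply be dropped from the loss rather than treated as failures.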
Experiments were conducted on two challenging multi-turn tool-use tasks: CodeGym, a synthetic environment for interactive function calling, and BrowseComp-Plus, a complex searching task. SUPO raised success rates on both, improving them by +3.2% on CodeGym and +14.0% on BrowseComp-Plus while keeping the working context length the same as or lower than that of traditional baselines. SUPO's performance also continued to scale when the number of summarization rounds at test time exceeded the number used during training, which suggests that the learned summarization strategies are robust and generalize.
This work establishes summarization-based context management as a principled and scalable approach for training RL agents to operate effectively beyond the limitations of a fixed context window. It opens the door for LLM agents to tackle more complex, longer-horizon tasks, potentially leading to more reliable and autonomous AI systems. For more technical details, refer to the full research paper.


