TLDR: This research introduces a new method for training Large Language Model (LLM) agents to dynamically decide when to plan, rather than always planning or never planning. Through a two-stage training process involving supervised fine-tuning and reinforcement learning, agents learn to efficiently allocate computational resources for planning, leading to improved performance on complex tasks like the Minecraft-inspired Crafter environment. The study shows that there’s an optimal “Goldilocks” frequency for planning, and that these dynamically planning agents can be effectively guided by human input.
Large Language Models (LLMs) have shown impressive problem-solving capabilities, especially when prompted to ‘think step-by-step’ or ‘plan’ before taking action. This process, often referred to as reasoning or planning, helps LLMs tackle complex tasks more effectively. However, a new research paper from a team at University College London, the University of Oxford, and other institutions highlights a critical challenge: always planning is not always the best approach.
The paper, titled Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents, argues that while planning is beneficial, doing it constantly is computationally expensive and can even hinder performance on longer, more intricate tasks. Conversely, never planning also limits an agent’s potential. This raises a dilemma: how can LLM agents know when to invest their computational resources in planning and when to act directly?
The ‘Goldilocks’ Zone for Planning
The researchers introduce a conceptual framework that formalizes dynamic planning for LLM agents, allowing them to flexibly decide when to allocate ‘test-time compute’ to planning. Their experiments in environments like Crafter (a Minecraft-inspired game) and POGS (a custom graph-search environment) revealed a striking ‘Goldilocks’ effect: there is an optimal planning frequency that outperforms both the ‘always plan’ and ‘never plan’ strategies. Planning at every step, as in existing methods like ReAct, can lead to instability and inefficient behavior, while planning too rarely leaves the agent without strategic guidance.
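To make the idea concrete, here is a minimal sketch of such an agent loop with a tunable planning frequency. The `env`, `llm_plan`, and `llm_act` interfaces are hypothetical placeholders for illustration, not the paper’s implementation:

```python
import random

def run_episode(env, llm_plan, llm_act, plan_prob=0.2, max_steps=100):
    """Roll out one episode, invoking the expensive planner only on a
    fraction of steps controlled by plan_prob.

    plan_prob=1.0 approximates 'always plan' (ReAct-style) and
    plan_prob=0.0 approximates 'never plan'; intermediate values probe
    the 'Goldilocks' zone between the two extremes.
    """
    obs, plan, total_reward = env.reset(), None, 0.0
    for _ in range(max_steps):
        if plan is None or random.random() < plan_prob:
            plan = llm_plan(obs)      # costly: spend test-time compute on a plan
        action = llm_act(obs, plan)   # cheap: act under the current plan
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Sweeping `plan_prob` and averaging episode returns would trace the inverted-U curve the paper describes: performance peaks at an intermediate planning frequency. The trained agents go further, replacing this random coin flip with a learned decision about when a plan is actually needed.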
A Two-Stage Training Approach
To teach LLM agents this crucial meta-cognitive skill, the team developed a simple yet effective two-stage training pipeline (a code sketch follows the list):
- Supervised Fine-Tuning (SFT): In this initial stage, models are ‘primed’ using diverse synthetic data. This data includes explicit natural language plans alongside actions, helping the models learn the structure and rationale behind planning.
- Reinforcement Learning (RL): After SFT, the models undergo RL fine-tuning in long-horizon environments. This stage refines their ability to strategically decide when to plan, execute those plans, and replan only when necessary.
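Under assumed helper interfaces (neither stage function comes from the paper’s codebase), the pipeline’s skeleton might look like this:

```python
def train_dynamic_planner(base_model, sft_data, env,
                          supervised_finetune, rl_finetune):
    """Illustrative two-stage pipeline; the stage functions are passed
    in as hypothetical stand-ins for real training code."""
    # Stage 1 (SFT): prime the model on synthetic trajectories whose
    # actions are interleaved with explicit natural-language plans.
    model = supervised_finetune(base_model, sft_data)

    # Stage 2 (RL): fine-tune in a long-horizon environment. Because
    # emitting a plan is part of the policy's output, the return signal
    # shapes *when* to plan and replan, not just *what* to do.
    model = rl_finetune(model, env)
    return model
```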
A key finding from the SFT stage was that training with explicit natural language plans significantly improved the learning process itself, even when compared to training on identical action sequences without plans. This suggests that plans provide valuable context and a form of ‘cognitive grounding’ for the model.
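To illustrate, a plan-annotated training example might look like the record below; the schema is an assumption for illustration, not the paper’s actual data format. The plan-free baseline would keep the same actions but drop the plan field:

```python
# Hypothetical plan-annotated SFT example (schema assumed, not the
# paper's). The plan-free control keeps "actions" and drops "plan".
example_with_plan = {
    "observation": "You see a tree and exposed stone. Inventory: empty.",
    "plan": ("Collect wood from the tree, craft a wooden pickaxe, "
             "then mine the stone to unlock stone tools."),
    "actions": ["move_to_tree", "collect_wood",
                "craft_wood_pickaxe", "mine_stone"],
}
```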
Smarter, More Efficient Agents
The RL-trained agents, especially those primed via SFT with dynamic-planning data, showed remarkable improvements: they were more sample-efficient and consistently achieved more complex objectives in the Crafter environment. These agents learned to generate and execute plans at various levels of abstraction, adapting their planning frequency to the demands of the situation.
Human-AI Collaboration
Perhaps one of the most exciting outcomes of this research is the enhanced ability for human-agent collaboration. The RL-trained planning agents could be effectively steered by human-written plans, achieving feats they couldn’t accomplish independently. For instance, a human-guided agent successfully completed Crafter by collecting a diamond, an achievement not observed in autonomous agents. This demonstrates a significant step towards more controllable and collaborative LLM agentic systems.
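One natural way to implement this steering, sketched here under the same hypothetical interfaces as above, is to inject the human-written plan into the slot the agent’s own plan would otherwise occupy:

```python
def run_with_human_plan(env, llm_act, human_plan, max_steps=500):
    """Execute a human-authored plan with a trained agent. `env` and
    `llm_act` are the same hypothetical stand-ins used earlier."""
    obs, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        # e.g. human_plan = "Mine iron, craft an iron pickaxe, then dig
        # deep and search for a diamond."
        action = llm_act(obs, human_plan)
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```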
Paving the Way for Adaptive AI
This work marks a crucial advancement in the field of LLM agents. By enabling agents to dynamically allocate test-time compute for planning, the researchers are paving the way for more efficient, adaptive, and controllable AI systems. While limitations remain, such as the specific model scales and environments tested, the findings suggest a promising future for LLMs that can intelligently decide when to ‘think’ and when to ‘act’, leading to more capable and safer AI agents.