TLDR: A systematic study investigated 16 algorithms integrating Gemini 2.5 Flash into Task and Motion Planning (TAMP) systems for robotics. Across 4,950 problems, LLM-based planners showed lower success rates and longer planning times than engineered TAMP methods. Key findings include that faster, ‘DIRECT’ LLM variants often outperform ‘THINKING’ ones, and providing geometric details can increase LLM task-planning errors. The research suggests that LLMs are most effective when quickly generating candidate solutions, with the TAMP system handling complex geometric reasoning and error correction.
Autonomous robots face significant challenges in navigating and performing tasks in complex, unstructured environments. A key area of robotics research, Task and Motion Planning (TAMP), aims to break down these long-horizon problems into manageable steps. Recently, there’s been growing interest in leveraging Large Language Models (LLMs) like Gemini 2.5 Flash to enhance TAMP systems, given their impressive semantic knowledge and generalization capabilities. However, the best way to integrate LLMs into these sophisticated planning frameworks remains a complex question.
A recent study, “A Systematic Study of Large Language Models for Task and Motion Planning With PDDLStream,” by Jorge Mendez-Mendez from Stony Brook University, addresses this question. The research examines how LLMs can substitute for critical components within TAMP systems, with the goal of characterizing their planning capabilities and limitations across a wide range of robotics tasks.
Exploring LLM Integration in TAMP
The study developed 16 distinct algorithms, all utilizing Gemini 2.5 Flash, to replace key parts of existing TAMP systems. These algorithms were built upon two foundational TAMP methods: ADAPTIVE and BILEVEL. The core idea was to see if an LLM could effectively act as a planner or a stream evaluator within a verification loop, where the TAMP system would check the LLM’s outputs and guide it to correct mistakes.
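The verification loop described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the function names (`verified_planning_loop`, `toy_propose`, `toy_verify`), the plan format, and the attempt limit are all hypothetical, and the "LLM" is replaced by a hard-coded stand-in so the example runs without any API access.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]


def verified_planning_loop(
    propose: Callable[[List[str]], Plan],
    verify: Callable[[Plan], Optional[str]],
    max_attempts: int = 5,
) -> Tuple[Optional[Plan], List[str]]:
    """Ask a proposer for plans until the verifier accepts one.

    `propose` stands in for the LLM call and receives all error messages
    collected so far. `verify` stands in for the formal planner's check and
    returns None on success or an error message on failure.
    """
    feedback: List[str] = []
    for _ in range(max_attempts):
        plan = propose(feedback)
        error = verify(plan)
        if error is None:
            return plan, feedback
        feedback.append(error)  # returned to the proposer on the next attempt
    return None, feedback


def toy_propose(feedback: List[str]) -> Plan:
    """Hypothetical stand-in for an LLM: wrong at first, right after feedback."""
    if not feedback:
        return ["place block_a table"]  # forgets to pick the block up first
    return ["pick block_a", "place block_a table"]


def toy_verify(plan: Plan) -> Optional[str]:
    """Hypothetical precondition check: 'place' requires a prior 'pick'."""
    holding = False
    for step in plan:
        action = step.split()[0]
        if action == "pick":
            holding = True
        elif action == "place":
            if not holding:
                return f"precondition failed: '{step}' requires holding an object"
            holding = False
    return None


plan, feedback = verified_planning_loop(toy_propose, toy_verify)
print(plan)
print(feedback)
```

The design point is that the proposer never has to be right on the first try; it only has to respond usefully to the verifier's error messages.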
The researchers investigated several ways LLMs could contribute:
- PDDL Planning: Here, the LLM was tasked with generating high-level symbolic plans (PDDL plans) without geometric details, as early tests showed that including such details often led to more errors.
- POSES Stream Evaluation: This involved the LLM generating continuous values, such as stable object placements or robot base configurations, that satisfy geometric constraints.
- INTEGRATED PDDL Planning and Stream Evaluation: This ambitious approach aimed for the LLM to simultaneously produce action sequences and geometric samples, allowing it to reason about both logical and geometric constraints at once.
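To make the stream-evaluation role more concrete, the sketch below checks candidate object placements against two simple geometric constraints: the object must lie fully on the table and must not overlap existing objects. Everything here is an assumption for illustration: the circular-footprint model, the names `is_stable_placement` and `first_valid_pose`, and the fixed candidate list that stands in for LLM-generated samples. The study's actual constraints and samplers are not reproduced.

```python
import math
from typing import Iterable, List, Optional, Tuple

Pose = Tuple[float, float]                   # (x, y) center of the placed object
Table = Tuple[float, float, float, float]    # (xmin, ymin, xmax, ymax)
Obstacle = Tuple[float, float, float]        # (x, y, radius)


def is_stable_placement(
    pose: Pose, radius: float, table: Table, obstacles: List[Obstacle]
) -> bool:
    """Return True if a circular object at `pose` is on the table and collision-free."""
    x, y = pose
    xmin, ymin, xmax, ymax = table
    inside = (xmin + radius <= x <= xmax - radius
              and ymin + radius <= y <= ymax - radius)
    clear = all(
        math.hypot(x - ox, y - oy) >= radius + orad
        for ox, oy, orad in obstacles
    )
    return inside and clear


def first_valid_pose(
    candidates: Iterable[Pose], radius: float, table: Table, obstacles: List[Obstacle]
) -> Optional[Pose]:
    """Return the first candidate that passes the geometric check, if any."""
    for pose in candidates:
        if is_stable_placement(pose, radius, table, obstacles):
            return pose
    return None


# Hypothetical candidates, standing in for values an LLM might propose.
candidates = [(0.05, 0.5), (0.5, 0.5), (0.8, 0.3)]
table = (0.0, 0.0, 1.0, 1.0)
obstacles = [(0.5, 0.5, 0.1)]

pose = first_valid_pose(candidates, radius=0.1, table=table, obstacles=obstacles)
print(pose)
```

In this toy setup the first candidate hangs over the table edge and the second collides with the existing object, so only the third is accepted.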
Another crucial aspect explored was the LLM’s “thinking budget.” Gemini 2.5 Flash can be configured to “think” before producing an output (THINKING variants) or to generate responses directly without an explicit thinking phase (DIRECT variants). The study hypothesized that while reasoning might seem beneficial, a faster, less-reasoning model might be more effective if the TAMP system could quickly correct its errors.
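A back-of-the-envelope calculation shows why this hypothesis is plausible. If each attempt succeeds independently with probability p and takes t seconds, the expected time to the first verified success is t / p. The numbers below are invented purely for illustration and are not measurements from the study; the independence assumption is also a simplification, since feedback from the verifier changes later attempts.

```python
def expected_seconds_to_success(
    seconds_per_attempt: float, success_probability: float
) -> float:
    """Expected time until the first success for independent, identical attempts.

    The number of attempts follows a geometric distribution with mean 1 / p,
    so the expected total time is t / p.
    """
    if not 0.0 < success_probability <= 1.0:
        raise ValueError("success_probability must be in (0, 1]")
    return seconds_per_attempt / success_probability


# Hypothetical values, NOT taken from the study.
direct_expected = expected_seconds_to_success(
    seconds_per_attempt=2.0, success_probability=0.2
)
thinking_expected = expected_seconds_to_success(
    seconds_per_attempt=20.0, success_probability=0.6
)

print(f"DIRECT (hypothetical):   {direct_expected:.1f} s")
print(f"THINKING (hypothetical): {thinking_expected:.1f} s")
```

Under these made-up numbers, the fast variant reaches a verified answer sooner on average despite being wrong more often per attempt, which is the trade-off the hypothesis describes.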
Extensive Evaluation and Key Findings
The study conducted an extensive evaluation across 4,950 problems in three diverse TAMP domains: Blocked, Packing, and Rovers. Each problem had a time limit of 300 seconds. The results provided significant insights into the performance of LLM-based planners:
- Lower Success Rates and Higher Planning Times: Generally, the Gemini-based planners exhibited lower success rates and significantly higher planning times compared to their traditional, engineered TAMP counterparts.
- “DIRECT” Outperforms “THINKING”: Surprisingly, the faster, non-reasoning “DIRECT” LLM variants outperformed their “THINKING” counterparts in most cases. This suggests that the TAMP system’s ability to quickly verify and correct LLM mistakes is more valuable than the LLM spending more time on internal reasoning.
- Geometric Details Can Harm Performance: Providing geometric details to the LLM for task planning actually increased the number of task-planning errors, indicating that LLMs might struggle to integrate this information effectively into their PDDL planning.
- Integrated Approaches Struggle: The “INTEGRATED” LLM approaches, which attempted to reason about both PDDL and geometry simultaneously, generally performed the worst. This reinforces the idea that complex, geometrically aware reasoning is currently better handled by the formal TAMP system itself.
- Challenges in Complex Domains: All LLM-based approaches performed poorly in the “Rovers” domain, which is the most difficult due to its long horizon and intricate geometric constraints.
- Failure Analysis: Most failures were due to time-outs, but a significant portion also occurred because LLMs “gave up” (asserting a solvable problem was unsolvable) or hit API token limits.
Implications for Robotics and AI
The research concludes that while LLMs can solve many novel TAMP problems, they currently cannot match the performance of dedicated, engineered TAMP methods. The study finds that LLM-Modulo TAMP systems are most efficient when the LLM quickly generates many candidate solutions, even flawed ones, and leaves the complex, geometrically aware reasoning and error correction to the formal TAMP system.
This systematic study provides valuable empirical evidence for the strengths and weaknesses of integrating LLMs into TAMP frameworks, guiding future research toward more effective hybrid planning solutions for autonomous robots.


