Evaluating Large Language Models for Robot Task and Motion Planning

TLDR: A systematic study investigated 16 algorithms integrating Gemini 2.5 Flash into Task and Motion Planning (TAMP) systems for robotics. Across 4,950 problems, LLM-based planners showed lower success rates and longer planning times than engineered TAMP methods. Key findings include that faster, ‘DIRECT’ LLM variants often outperform ‘THINKING’ ones, and providing geometric details can increase LLM task-planning errors. The research suggests that LLMs are most effective when quickly generating candidate solutions, with the TAMP system handling complex geometric reasoning and error correction.

Autonomous robots face significant challenges in navigating and performing tasks in complex, unstructured environments. A key area of robotics research, Task and Motion Planning (TAMP), aims to break down these long-horizon problems into manageable steps. Recently, there has been growing interest in using Large Language Models (LLMs) such as Gemini 2.5 Flash to enhance TAMP systems, given their semantic knowledge and generalization capabilities. However, the best way to integrate LLMs into these planning frameworks remains an open question.

A recent systematic study, “A Systematic Study of Large Language Models for Task and Motion Planning With PDDLStream” by Jorge Mendez-Mendez of Stony Brook University, takes up this question. The research examines how LLMs can substitute for critical components within TAMP systems, with the aim of characterizing their planning capabilities and limitations across a wide range of robotics tasks.

Exploring LLM Integration in TAMP

The study developed 16 distinct algorithms, all utilizing Gemini 2.5 Flash, to replace key parts of existing TAMP systems. These algorithms were built upon two foundational TAMP methods: ADAPTIVE and BILEVEL. The core idea was to see if an LLM could effectively act as a planner or a stream evaluator within a verification loop, where the TAMP system would check the LLM’s outputs and guide it to correct mistakes.
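To make the verification loop concrete, here is a minimal Python sketch. It is not the study's implementation: the one-block toy domain, the function names (verify_plan, plan_with_verification, scripted_proposer), and the scripted stand-in for the LLM call are all assumptions made for illustration. The point is only the control flow, in which a proposer suggests a plan, a verifier checks it, and any error message is fed back for the next attempt.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]


def verify_plan(plan: Plan) -> Optional[str]:
    """Toy verifier (hypothetical): simulate the plan on a one-block world.

    Returns None if the plan reaches the goal, otherwise an error message.
    """
    state = "start"  # the block is at its start location
    for index, action in enumerate(plan):
        if action == "pick":
            if state == "held":
                return f"step {index}: cannot pick while already holding the block"
            state = "held"
        elif action == "place":
            if state != "held":
                return f"step {index}: cannot place with an empty gripper"
            state = "goal"
        else:
            return f"step {index}: unknown action {action!r}"
    return None if state == "goal" else "goal not reached: the block is not at the goal"


def plan_with_verification(
    propose: Callable[[List[str]], Plan],
    verify: Callable[[Plan], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[Optional[Plan], int]:
    """Ask the proposer for plans until one verifies or the round budget runs out."""
    feedback: List[str] = []
    for round_number in range(1, max_rounds + 1):
        candidate = propose(feedback)
        error = verify(candidate)
        if error is None:
            return candidate, round_number
        feedback.append(error)  # the next proposal can condition on this
    return None, max_rounds


def scripted_proposer(feedback: List[str]) -> Plan:
    """Stand-in for an LLM call: wrong at first, corrected once feedback arrives."""
    return ["pick", "place"] if feedback else ["place"]


plan, rounds = plan_with_verification(scripted_proposer, verify_plan)
print(plan, rounds)
```

In a real system the proposer would wrap a model call and the verifier would be the TAMP system's own plan validation; the loop structure is the part this sketch is meant to convey.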

The researchers investigated several ways LLMs could contribute:

PDDL Planning: Here, the LLM was tasked with generating high-level symbolic plans (PDDL plans) without geometric details, as early tests showed that including such details often led to more errors.
POSES Stream Evaluation: This involved the LLM generating continuous values, such as stable object placements or robot base configurations, that satisfy geometric constraints.
INTEGRATED PDDL Planning and Stream Evaluation: This ambitious approach aimed for the LLM to simultaneously produce action sequences and geometric samples, allowing it to reason about both logical and geometric constraints at once.
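As a rough illustration of what checking a POSES-style sample might involve, the Python sketch below tests candidate placements against two simple geometric constraints: the object's footprint must lie on the surface and must not overlap other objects. The axis-aligned rectangle model, the Rect class, and the function names are assumptions chosen for brevity, not details taken from the study, and the candidate list stands in for values an LLM might propose.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in the plane of the support surface."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, other: "Rect") -> bool:
        return (self.x_min <= other.x_min and self.y_min <= other.y_min
                and other.x_max <= self.x_max and other.y_max <= self.y_max)

    def overlaps(self, other: "Rect") -> bool:
        return (self.x_min < other.x_max and other.x_min < self.x_max
                and self.y_min < other.y_max and other.y_min < self.y_max)


def footprint(center: Point, size: Tuple[float, float]) -> Rect:
    """Footprint of an object of the given (width, depth) centered at `center`."""
    (cx, cy), (w, d) = center, size
    return Rect(cx - w / 2, cy - d / 2, cx + w / 2, cy + d / 2)


def first_valid_placement(
    candidates: Iterable[Point],
    size: Tuple[float, float],
    surface: Rect,
    obstacles: Sequence[Rect],
) -> Optional[Point]:
    """Return the first candidate that is on the surface and collision-free."""
    for center in candidates:
        fp = footprint(center, size)
        if surface.contains(fp) and not any(fp.overlaps(o) for o in obstacles):
            return center
    return None


# Hypothetical scene: a unit-square table with one object already in the middle.
table = Rect(0.0, 0.0, 1.0, 1.0)
placed_objects = [Rect(0.4, 0.4, 0.6, 0.6)]
proposed = [(0.95, 0.5), (0.5, 0.5), (0.2, 0.2)]  # e.g. values suggested by an LLM
print(first_valid_placement(proposed, (0.2, 0.2), table, placed_objects))
```

The first candidate hangs over the table edge and the second collides with the existing object, so only the third passes. Rejected candidates are cheap to discard, which is why a fast but imperfect sampler can still be useful inside such a check.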

Another aspect explored was the LLM’s “thinking budget.” Gemini 2.5 Flash can be configured to “think” before producing an output (THINKING variants) or to respond directly without an explicit thinking phase (DIRECT variants). The study hypothesized that, although explicit reasoning might seem beneficial, a faster model that reasons less could be more effective if the TAMP system can quickly catch and correct its errors.
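One way to see why that hypothesis is plausible is a back-of-the-envelope model, sketched below in Python. It assumes each proposal is valid independently with a fixed probability and costs a fixed number of seconds, which is a strong simplification (real retries depend on feedback). The latency and validity numbers are invented for illustration and are not measurements from the study; only the 300-second budget comes from the article.

```python
import math


def expected_time_to_valid(latency_s: float, p_valid: float) -> float:
    """Mean time until the first valid proposal under independent retries.

    The number of attempts is geometric with mean 1 / p_valid.
    """
    return latency_s / p_valid


def success_within_budget(latency_s: float, p_valid: float, budget_s: float) -> float:
    """Probability that at least one valid proposal arrives within the budget."""
    attempts = math.floor(budget_s / latency_s)
    return 1.0 - (1.0 - p_valid) ** attempts


BUDGET_S = 300.0  # per-problem time limit reported in the article

# Hypothetical variants: (seconds per proposal, chance a proposal is valid).
variants = {
    "DIRECT (fast, less accurate)": (5.0, 0.2),
    "THINKING (slow, more accurate)": (60.0, 0.6),
}

for name, (latency_s, p_valid) in variants.items():
    mean_s = expected_time_to_valid(latency_s, p_valid)
    p_ok = success_within_budget(latency_s, p_valid, BUDGET_S)
    print(f"{name}: mean {mean_s:.0f} s to first valid plan, "
          f"P(success within budget) = {p_ok:.3f}")
```

Under these made-up numbers the fast variant reaches a valid proposal sooner on average despite being wrong more often per attempt. With different numbers the comparison can flip, so the model shows only that the trade-off exists, not which side wins.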

Extensive Evaluation and Key Findings

The study conducted an extensive evaluation across 4,950 problems in three diverse TAMP domains: Blocked, Packing, and Rovers. Each problem had a time limit of 300 seconds. The results provided significant insights into the performance of LLM-based planners:

Lower Success Rates and Higher Planning Times: Generally, the Gemini-based planners exhibited lower success rates and significantly higher planning times compared to their traditional, engineered TAMP counterparts.
“DIRECT” Outperforms “THINKING”: Surprisingly, the faster, non-reasoning “DIRECT” LLM variants outperformed their “THINKING” counterparts in most cases. This suggests that the TAMP system’s ability to quickly verify and correct LLM mistakes is more valuable than the LLM spending more time on internal reasoning.
Geometric Details Can Harm Performance: Providing geometric details to the LLM for task planning actually increased the number of task-planning errors, indicating that LLMs might struggle to integrate this information effectively into their PDDL planning.
Integrated Approaches Struggle: The “INTEGRATED” LLM approaches, which attempted to reason about both PDDL and geometry simultaneously, generally performed the worst. This reinforces the idea that complex, geometrically aware reasoning is currently better handled by the formal TAMP system itself.
Challenges in Complex Domains: All LLM-based approaches performed poorly in the “Rovers” domain, which is the most difficult due to its long horizon and intricate geometric constraints.
Failure Analysis: Most failures were due to time-outs, but a significant portion also occurred because LLMs “gave up” (asserting a solvable problem was unsolvable) or hit API token limits.
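The failure categories above can be tallied with a small amount of bookkeeping. The Python sketch below is one possible way to do it: the RunRecord fields, the classification order, and the example records are assumptions for illustration and do not reproduce the study's logs or results. Only the 300-second limit and the three failure categories come from the article.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple

TIME_LIMIT_S = 300.0  # per-problem limit reported in the article


class Outcome(Enum):
    SOLVED = "solved"
    TIMEOUT = "timeout"
    GAVE_UP = "gave_up"          # LLM declared a solvable problem unsolvable
    TOKEN_LIMIT = "token_limit"  # API token limit reached


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-problem log entry."""
    solved: bool
    elapsed_s: float
    declared_unsolvable: bool = False
    hit_token_limit: bool = False


def classify(run: RunRecord) -> Outcome:
    """Map one run to an outcome; unexplained failures are assumed to be time-outs."""
    if run.solved and run.elapsed_s <= TIME_LIMIT_S:
        return Outcome.SOLVED
    if run.declared_unsolvable:
        return Outcome.GAVE_UP
    if run.hit_token_limit:
        return Outcome.TOKEN_LIMIT
    return Outcome.TIMEOUT


def summarize(runs: Iterable[RunRecord]) -> Tuple[float, Counter]:
    """Return the success rate and a count of each outcome."""
    outcomes = Counter(classify(run) for run in runs)
    total = sum(outcomes.values())
    success_rate = outcomes[Outcome.SOLVED] / total if total else 0.0
    return success_rate, outcomes


# Made-up records, purely to exercise the functions.
example_runs = [
    RunRecord(solved=True, elapsed_s=12.0),
    RunRecord(solved=True, elapsed_s=250.0),
    RunRecord(solved=False, elapsed_s=40.0, declared_unsolvable=True),
    RunRecord(solved=False, elapsed_s=300.0),
    RunRecord(solved=False, elapsed_s=90.0, hit_token_limit=True),
]
rate, counts = summarize(example_runs)
print(f"success rate: {rate:.2f}")
for outcome, count in counts.items():
    print(f"{outcome.value}: {count}")
```

Separating "gave up" and token-limit failures from time-outs matters for diagnosis, since the first two end a run early and point to the LLM or the API rather than to the search running out of time.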

Implications for Robotics and AI

The research concludes that while LLMs show promise in solving many novel TAMP problems, they currently cannot match the performance of dedicated, engineered TAMP methods. The study highlights that LLM-Modulo TAMP systems are most efficient when the LLM quickly generates many candidate solutions, even if flawed, and leaves the complex, geometrically aware reasoning and error correction to the formal TAMP system.

This systematic study provides valuable empirical evidence for the strengths and weaknesses of integrating LLMs into TAMP frameworks, guiding future research toward more effective hybrid planning solutions for autonomous robots.
