Frontier AI Models Show Advanced Planning Skills, Rivaling Specialized Planners in 2025

TLDR: A 2025 research paper evaluates the planning performance of DeepSeek R1, Gemini 2.5 Pro, and GPT-5 against the classical planner LAMA. GPT-5 demonstrates performance competitive with LAMA on standard planning tasks, showcasing significant progress in LLM reasoning. While obfuscated tasks reveal LLMs’ reliance on semantic information, Gemini 2.5 Pro exhibits strong robustness. The study highlights improved symbolic reasoning but also points out the substantial computational cost of LLMs compared to specialized planners.

A recent research paper, “The 2025 Planning Performance of Frontier Large Language Models,” examines how well advanced AI models solve complex planning problems. Authored by Augusto B. Corrêa from the University of Oxford, André G. Pereira from the Federal University of Rio Grande do Sul, and Jendrik Seipp from Linköping University, the study provides an updated evaluation of how frontier Large Language Models (LLMs) perform in end-to-end planning tasks.

Planning, in the context of AI, involves finding a sequence of actions to transform an initial state into a desired goal state. This is a crucial benchmark for assessing the reasoning abilities of LLMs. The researchers evaluated three cutting-edge LLMs from 2025: DeepSeek R1, Gemini 2.5 Pro, and GPT-5. For comparison, they also included LAMA, a strong classical planner, as a reference point. The evaluation focused on a subset of domains from the most recent Learning Track of the International Planning Competition (IPC).
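To make the definition concrete, here is a minimal sketch of planning as search over states, together with a plan validator of the kind needed to check a plan proposed by an LLM. It assumes a simplified STRIPS-style encoding in which a state is a set of facts. The toy actions and fact names are hypothetical and are not taken from the paper's benchmark. The breadth-first search is only the simplest possible planner and is not how LAMA works internally.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add_effects: frozenset     # facts made true by the action
    delete_effects: frozenset  # facts made false by the action


def apply(state, action):
    """Return the successor state after applying an applicable action."""
    return (state - action.delete_effects) | action.add_effects


def validate_plan(initial, goal, actions, plan):
    """Check that a sequence of action names leads from initial to goal."""
    by_name = {a.name: a for a in actions}
    state = frozenset(initial)
    for step in plan:
        action = by_name.get(step)
        if action is None or not action.preconditions <= state:
            return False  # unknown action or unmet precondition
        state = apply(state, action)
    return frozenset(goal) <= state


def bfs_plan(initial, goal, actions):
    """Breadth-first search for a shortest plan; returns None if none exists."""
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for action in actions:
            if action.preconditions <= state:
                successor = apply(state, action)
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, plan + [action.name]))
    return None


# Hypothetical toy task: fetch a spanner from location b and return to a.
ACTIONS = [
    Action("move-a-b", frozenset({"at-a"}), frozenset({"at-b"}), frozenset({"at-a"})),
    Action("move-b-a", frozenset({"at-b"}), frozenset({"at-a"}), frozenset({"at-b"})),
    Action("pick-up-spanner", frozenset({"at-b", "spanner-at-b"}),
           frozenset({"holding-spanner"}), frozenset({"spanner-at-b"})),
]
INITIAL = {"at-a", "spanner-at-b"}
GOAL = {"at-a", "holding-spanner"}

if __name__ == "__main__":
    plan = bfs_plan(INITIAL, GOAL, ACTIONS)
    print(plan, validate_plan(INITIAL, GOAL, ACTIONS, plan))
```

In an end-to-end setup like the one described, the LLM plays the role of `bfs_plan`, while a validator in the spirit of `validate_plan` decides whether the returned plan counts as a solution.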

The methodology involved prompting the LLMs to generate plans from PDDL (Planning Domain Definition Language) domain and task descriptions. To test whether the models were reasoning rather than relying on memorized semantic information, the tasks were presented in two formats: standard and “obfuscated.” In the obfuscated tasks, all symbolic names for actions, predicates, and objects were replaced with random strings. This makes the tasks harder for LLMs that depend on token semantics but has no impact on symbolic planners like LAMA.
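The obfuscation step can be sketched roughly as follows. This is an illustrative assumption about how such renaming might be done, not the authors' actual tooling: the helper names `make_mapping` and `obfuscate`, the eight-letter random names, and the sample PDDL fragment are all hypothetical. The key property is that every occurrence of a symbol is renamed consistently, so the task's structure is unchanged even though its names carry no meaning.

```python
import random
import re
import string

# Match PDDL identifiers, skipping keywords (":init") and variables ("?x").
TOKEN = re.compile(r"(?<![\w:?-])[A-Za-z][\w-]*")


def make_mapping(symbols, seed=0, length=8):
    """Assign each symbol a distinct random lowercase string (hypothetical scheme)."""
    rng = random.Random(seed)
    originals = {s.lower() for s in symbols}  # PDDL names are case-insensitive
    mapping, used = {}, set()
    for symbol in sorted(originals):
        while True:
            new_name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if new_name not in used and new_name not in originals:
                break
        used.add(new_name)
        mapping[symbol] = new_name
    return mapping


def obfuscate(pddl_text, mapping):
    """Replace every mapped identifier; leave all other tokens untouched."""
    return TOKEN.sub(lambda m: mapping.get(m.group(0).lower(), m.group(0)), pddl_text)


# Hypothetical task fragment; not taken from the benchmark.
SAMPLE = "(:init (at bob shed) (at spanner1 shed) (link shed gate))"
SYMBOLS = {"at", "link", "bob", "shed", "spanner1", "gate"}

if __name__ == "__main__":
    mapping = make_mapping(SYMBOLS)
    print(obfuscate(SAMPLE, mapping))
```

Because the renaming is a one-to-one substitution, a symbolic planner sees an equivalent task, and a plan found for the obfuscated version can be translated back by inverting the mapping.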

The findings reveal significant progress in LLM planning capabilities. On standard PDDL domains, GPT-5 demonstrated performance competitive with LAMA, solving 205 out of 360 tasks, compared to LAMA’s 204. DeepSeek R1 and Gemini 2.5 Pro also showed strong results, solving 157 and 155 tasks respectively. Interestingly, LLMs even outperformed LAMA in specific domains like Childsnack and Spanner, with GPT-5 solving all 45 tasks in the Spanner domain.

However, the picture changed with the obfuscated tasks. The performance of all LLMs degraded, indicating their continued reliance on semantic information. GPT-5 solved 152 tasks (down from 205), still leading among the LLMs. Gemini 2.5 Pro was the most robust, dropping only from 155 to 146 tasks. DeepSeek R1 experienced the largest decline, solving only 93 obfuscated tasks (down from 157). Despite the degradation, the fact that LLMs could solve a non-trivial number of these purely symbolic tasks highlights a substantial improvement over previous generations.
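The size of each drop follows directly from the task counts reported above. The short sketch below only redoes that arithmetic; the variable and function names are illustrative, and the numbers are the ones quoted in this article.

```python
# Solved-task counts quoted in the article (out of 360 tasks per setting).
STANDARD = {"GPT-5": 205, "DeepSeek R1": 157, "Gemini 2.5 Pro": 155}
OBFUSCATED = {"GPT-5": 152, "DeepSeek R1": 93, "Gemini 2.5 Pro": 146}


def coverage_drop(standard, obfuscated):
    """Return, per model, the absolute drop and the fraction of coverage retained."""
    return {
        model: (standard[model] - obfuscated[model], obfuscated[model] / standard[model])
        for model in standard
    }


if __name__ == "__main__":
    for model, (drop, retained) in coverage_drop(STANDARD, OBFUSCATED).items():
        print(f"{model}: {drop} fewer tasks solved, {retained:.1%} of coverage retained")
```

By this measure Gemini 2.5 Pro loses 9 tasks, GPT-5 loses 53, and DeepSeek R1 loses 64.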

The study also examined the length of the generated plans and the computational effort involved. LLMs were capable of generating very long, valid plans, some exceeding 500 steps, suggesting improved reliability in maintaining long sequences of correct reasoning. For Gemini 2.5 Pro, solving obfuscated tasks required a substantially higher number of reasoning tokens, implying greater computational effort to compensate for the lack of semantic clues.

A critical consideration highlighted by the researchers is the computational cost. While LAMA operates efficiently on minimal hardware, LLMs demand vastly more resources, typically massive-scale GPU clusters. This stark difference underscores a key trade-off: while LLMs are becoming more capable at planning, they remain orders of magnitude less efficient than specialized planners, a crucial factor for practical applications.

In conclusion, the research indicates that the 2025 generation of frontier LLMs has made substantial strides in automated planning, with GPT-5 performing on par with the classical planner LAMA on the standard tasks evaluated. While their dependence on token semantics persists, their improved symbolic reasoning is evident. This progress, however, comes with a significant computational cost, posing a challenge for their widespread practical deployment. For more details, see the full paper, “The 2025 Planning Performance of Frontier Large Language Models.”
