TLDR: A 2025 research paper evaluates the planning performance of DeepSeek R1, Gemini 2.5 Pro, and GPT-5 against the classical planner LAMA. GPT-5 demonstrates performance competitive with LAMA on standard planning tasks, showcasing significant progress in LLM reasoning. While obfuscated tasks reveal LLMs’ reliance on semantic information, Gemini 2.5 Pro exhibits strong robustness. The study highlights improved symbolic reasoning but also points out the substantial computational cost of LLMs compared to specialized planners.
A recent research paper, “The 2025 Planning Performance of Frontier Large Language Models,” examines how well advanced AI models solve complex planning problems. Authored by Augusto B. Corrêa from the University of Oxford, André G. Pereira from the Federal University of Rio Grande do Sul, and Jendrik Seipp from Linköping University, the study provides an updated evaluation of how frontier Large Language Models (LLMs) perform in end-to-end planning tasks.
Planning, in the context of AI, involves finding a sequence of actions to transform an initial state into a desired goal state. This is a crucial benchmark for assessing the reasoning abilities of LLMs. The researchers evaluated three cutting-edge LLMs from 2025: DeepSeek R1, Gemini 2.5 Pro, and GPT-5. For comparison, they also included LAMA, a strong classical planner, as a reference point. The evaluation focused on a subset of domains from the most recent Learning Track of the International Planning Competition (IPC).
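To make the definition of planning concrete, here is a minimal sketch of how a candidate plan can be checked against an initial state and a goal. It assumes a simplified STRIPS-style model in which states are sets of facts and each action has preconditions, add effects, and delete effects. The `Action` class, the `validate_plan` function, and the toy spanner-like facts are all hypothetical illustrations, not the paper's evaluation code or the actual IPC domain encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground action in a simplified STRIPS-style model (illustrative only)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Return True if applying `plan` in order from `initial_state` reaches `goal`."""
    state = set(initial_state)
    for action in plan:
        # An action is applicable only if all of its preconditions hold.
        if not action.preconditions <= state:
            return False
        # Apply the action: remove deleted facts, then add new facts.
        state = (state - action.delete_effects) | action.add_effects
    # The plan succeeds if every goal fact holds in the final state.
    return set(goal) <= state


# Hypothetical toy task: pick up a spanner, walk to the gate, tighten a nut.
pickup = Action(
    "pickup-spanner",
    frozenset({"at-shed", "spanner-in-shed"}),
    frozenset({"carrying-spanner"}),
    frozenset({"spanner-in-shed"}),
)
walk = Action(
    "walk-to-gate",
    frozenset({"at-shed"}),
    frozenset({"at-gate"}),
    frozenset({"at-shed"}),
)
tighten = Action(
    "tighten-nut",
    frozenset({"at-gate", "carrying-spanner", "nut-loose"}),
    frozenset({"nut-tight"}),
    frozenset({"nut-loose"}),
)

initial_state = {"at-shed", "spanner-in-shed", "nut-loose"}
goal = {"nut-tight"}

valid = validate_plan(initial_state, goal, [pickup, walk, tighten])
invalid = validate_plan(initial_state, goal, [walk, tighten])
```

In this toy task the second plan fails because the spanner is never picked up, so the precondition of `tighten-nut` does not hold. An end-to-end evaluation of LLM planning needs a check of this kind, since a generated plan only counts as a solution if every step is applicable and the final state satisfies the goal.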
The methodology involved prompting the LLMs to generate plans based on PDDL (Planning Domain Definition Language) domain and task descriptions. To ensure the models were truly reasoning and not just relying on memorized semantic information, the tasks were presented in two formats: standard problems and an “obfuscated” version. In the obfuscated tasks, all symbolic names for actions, predicates, and objects were replaced with random strings, making it challenging for LLMs that depend on token semantics, while having no impact on symbolic planners like LAMA.
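The obfuscation step described above can be sketched as a consistent renaming of symbols. The snippet below is only an illustration under stated assumptions: the caller supplies the set of action, predicate, and object names to rename, the names are lowercase, and the PDDL fragment is a made-up toy example. The helper names `make_mapping` and `obfuscate` are hypothetical, and the paper's actual obfuscation procedure may differ.

```python
import random
import re
import string


def make_mapping(symbols, seed=0, length=8):
    """Map each symbol to a distinct random lowercase string (seeded for repeatability)."""
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for symbol in sorted(symbols):  # sorted so the result does not depend on set order
        while True:
            candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if candidate not in used and candidate not in symbols:
                break
        used.add(candidate)
        mapping[symbol] = candidate
    return mapping


def obfuscate(pddl_text, mapping):
    """Replace whole tokens found in `mapping`, leaving keywords and variables alone."""
    if not mapping:
        return pddl_text
    alternatives = "|".join(
        re.escape(symbol) for symbol in sorted(mapping, key=len, reverse=True)
    )
    # A token must not be preceded by a word character, '-', '?' (variable) or
    # ':' (keyword), and must not be followed by a word character or '-'.
    pattern = re.compile(r"(?<![\w\-?:])(" + alternatives + r")(?![\w\-])")
    return pattern.sub(lambda match: mapping[match.group(1)], pddl_text)


# Hypothetical toy fragment, not taken from the benchmark.
task = "(:init (at bob shed) (loose nut1)) (:goal (tightened nut1))"
symbols = {"at", "bob", "shed", "loose", "nut1", "tightened"}

mapping = make_mapping(symbols)
obfuscated_task = obfuscate(task, mapping)
```

Because every occurrence of a symbol is mapped to the same random string, the logical structure of the task is unchanged. A symbolic planner sees an equivalent problem, while a model that leans on the meaning of names such as `tightened` loses that hint.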
The findings reveal significant progress in LLM planning capabilities. On standard PDDL domains, GPT-5 demonstrated performance competitive with LAMA, solving 205 out of 360 tasks, compared to LAMA’s 204. DeepSeek R1 and Gemini 2.5 Pro also showed strong results, solving 157 and 155 tasks respectively. Interestingly, LLMs even outperformed LAMA in specific domains like Childsnack and Spanner, with GPT-5 solving all 45 tasks in the Spanner domain.
However, the picture changed with the obfuscated tasks. The performance of all LLMs degraded, indicating their continued reliance on semantic information. GPT-5 solved 152 tasks, down from 205, still leading among the LLMs. Gemini 2.5 Pro proved the most robust, dropping only from 155 to 146 tasks. DeepSeek R1 experienced the largest decline, falling from 157 to 93 solved tasks. Despite the degradation, the fact that LLMs could solve a non-trivial number of these purely symbolic tasks highlights a substantial improvement over previous generations.
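The size of each drop can be worked out directly from the task counts reported in this article. The short sketch below does that arithmetic. The function name `coverage_drop` is a hypothetical helper, and the numbers are the ones quoted above rather than values re-derived from the paper's data.

```python
# Solved-task counts as quoted in this article.
standard = {"GPT-5": 205, "Gemini 2.5 Pro": 155, "DeepSeek R1": 157}
obfuscated = {"GPT-5": 152, "Gemini 2.5 Pro": 146, "DeepSeek R1": 93}


def coverage_drop(solved_standard, solved_obfuscated):
    """Return (absolute drop, drop relative to the standard-task count)."""
    absolute = solved_standard - solved_obfuscated
    return absolute, absolute / solved_standard


drops = {
    model: coverage_drop(standard[model], obfuscated[model]) for model in standard
}

for model, (absolute, relative) in drops.items():
    print(f"{model}: {absolute} fewer tasks solved ({relative:.1%} relative drop)")
```

In relative terms this works out to roughly 26% for GPT-5, 6% for Gemini 2.5 Pro, and 41% for DeepSeek R1, which matches the ordering described in the text.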
The study also examined the length of the generated plans and the computational effort involved. LLMs were capable of generating very long, valid plans, some exceeding 500 steps, suggesting improved reliability in maintaining long sequences of correct reasoning. For Gemini 2.5 Pro, solving obfuscated tasks required a substantially higher number of reasoning tokens, implying greater computational effort to compensate for the lack of semantic clues.
A critical consideration highlighted by the researchers is the computational cost. While LAMA operates efficiently on minimal hardware, LLMs demand vastly more resources, typically massive-scale GPU clusters. This stark difference underscores a key trade-off: while LLMs are becoming more capable at planning, they remain orders of magnitude less efficient than specialized planners, a crucial factor for practical applications.
In conclusion, the research indicates that the 2025 generation of frontier LLMs has made substantial strides in automated planning, with GPT-5 performing on par with the classical planner LAMA on the standard tasks evaluated. While their dependence on token semantics persists, their improved symbolic reasoning is evident. This progress, however, comes with a significant computational cost, posing a challenge for their widespread practical deployment.


