Bridging the Gap: Evaluating Large Language Models in AI Planning Through an NLP Lens

TLDR: This research paper introduces a new pipeline to evaluate and improve plans generated by Large Language Models (LLMs) for AI planning tasks. By treating plan generation as a Natural Language Processing (NLP) problem, the authors propose a recovery mechanism that combines NLP manipulations with symbolic planning. The findings suggest that while this pipeline can improve plan quality and success rates, LLMs still lack true reasoning abilities for planning and remain less reliable than traditional symbolic planners, often requiring significant completion by classical methods.

Large Language Models (LLMs) have revolutionized many areas of artificial intelligence, particularly in their ability to process and generate human-like text. Naturally, their potential in AI planning—the process of finding a sequence of actions to achieve a goal—has garnered significant interest. However, despite their impressive linguistic capabilities, LLMs often fall short when it comes to generating reliable and accurate plans, frequently producing plans with errors or fabricated actions.

A recent research paper titled “How Far Are LLMs from Symbolic Planners? An NLP-Based Perspective” by Ma’ayan Armony, Albert Meroño-Peñuela, and Gerard Canal from King’s College London delves into this challenge. The authors approach the problem of planning with LLMs not just as a planning task, but fundamentally as a Natural Language Processing (NLP) task, given that LLMs are inherently NLP models. This perspective is crucial because LLMs generate outputs based on statistical patterns from vast text corpora, rather than through deep semantic understanding or logical reasoning over constraints, which is the bedrock of traditional symbolic planners.

The paper introduces a novel recovery pipeline designed to evaluate and improve LLM-generated plans. This pipeline consists of several stages: an NLP-based evaluation of the initial plans, followed by three stages of NLP manipulation to refine these plans, and finally, a completion stage using a symbolic planner. This comprehensive approach allows for a more nuanced understanding of the quality of even invalid plans, going beyond simple success rates.

The Evaluation and Recovery Pipeline

The pipeline begins by comparing an LLM-generated plan (π0) to a ground-truth (GT) plan (πGT) produced by a symbolic planner. Actions in the LLM’s plan are assigned quality labels (e.g., correct, misplaced, same action but wrong parameters, different action, redundant) based on their similarity to actions in the GT plan. This similarity is quantified using semantic measures and parameter alignment. The evaluation also considers the Longest Common Subset (LCS) and Longest Common Subsequence (LCS) to measure structural overlap.
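
To make the structural-overlap idea concrete, here is a minimal sketch of a longest common subsequence computation between an LLM plan and a ground-truth plan. It is an illustration only, not the authors' implementation: the function name `longest_common_subsequence`, the choice to represent each action as a plain string, and the use of exact string equality in place of the paper's semantic similarity measures are all assumptions.

```python
def longest_common_subsequence(plan, gt_plan):
    """Return one longest common subsequence of two action lists.

    Assumption: actions are compared by exact equality; the paper
    describes semantic similarity and parameter alignment instead.
    """
    rows, cols = len(plan), len(gt_plan)
    # table[i][j] = LCS length of plan[:i] and gt_plan[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if plan[i] == gt_plan[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the actions themselves.
    common = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if plan[i - 1] == gt_plan[j - 1]:
            common.append(plan[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return common[::-1]


# Hypothetical Blocksworld plans, written as strings for illustration.
gt_plan = ["unstack b a", "putdown b", "pickup a", "stack a b"]
llm_plan = ["unstack b a", "pickup a", "putdown b", "stack a b"]
shared = longest_common_subsequence(llm_plan, gt_plan)
overlap_ratio = len(shared) / len(gt_plan)
```

In this example two actions are swapped in the LLM plan, so three of the four ground-truth actions survive in order. A ratio like `overlap_ratio` gives a graded signal even when the plan as a whole is invalid.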

A key aspect of the evaluation is determining the “plan potential”—how difficult it would be to transform an invalid LLM plan into a valid and optimal one. This involves applying basic NLP transformations like circular shifts (reordering actions) and consistent parameter remapping (swapping objects in the plan). The plan variant with the highest score after these transformations is selected as π1. The pipeline also identifies the “Last Executable Action” (LEA), which is the last action in the plan that doesn’t violate domain constraints, providing insight into how far the LLM got before failing.
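
The two transformations described above can be sketched as follows. This is a hedged illustration under stated assumptions: actions are modeled as `(name, parameters)` tuples, the score is a simple count of position-wise matches against the ground truth (the paper's actual scoring is richer), and every function name here is hypothetical. The search tries all object permutations, so it is only practical for small object sets.

```python
from itertools import permutations


def circular_shifts(plan):
    """All rotations of the plan, including the unshifted one."""
    return [plan[i:] + plan[:i] for i in range(len(plan))]


def remap_parameters(plan, mapping):
    """Apply one consistent object renaming to every action."""
    return [
        (name, tuple(mapping.get(p, p) for p in params))
        for name, params in plan
    ]


def positional_score(plan, gt_plan):
    """Assumed score: number of positions where the actions match."""
    return sum(a == b for a, b in zip(plan, gt_plan))


def best_variant(plan, gt_plan):
    """Search shifts x remappings and keep the highest-scoring variant."""
    objects = sorted({p for _, params in plan for p in params})
    best, best_score = plan, positional_score(plan, gt_plan)
    for perm in permutations(objects):
        remapped = remap_parameters(plan, dict(zip(objects, perm)))
        for variant in circular_shifts(remapped):
            score = positional_score(variant, gt_plan)
            if score > best_score:
                best, best_score = variant, score
    return best, best_score


# Hypothetical example: the LLM plan is the ground truth with the
# objects a and b swapped and the last action rotated to the front.
gt_plan = [
    ("unstack", ("b", "a")),
    ("putdown", ("b",)),
    ("pickup", ("a",)),
    ("stack", ("a", "b")),
]
llm_plan = [
    ("stack", ("b", "a")),
    ("unstack", ("a", "b")),
    ("putdown", ("a",)),
    ("pickup", ("b",)),
]
variant, score = best_variant(llm_plan, gt_plan)
```

Here the untransformed plan matches the ground truth at no position, while one shift combined with swapping `a` and `b` recovers it exactly. That gap between the raw and transformed scores is the kind of signal the "plan potential" idea is meant to capture.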

For plan recovery, the pipeline identifies the best sub-plan (π2 and π3) from the initial and NLP-transformed plans. The final recovery step involves “replanning.” Here, the last successfully executed state of the LLM’s plan is used as the starting point for a symbolic planner to complete the rest of the plan (πcomp), resulting in the final recovered plan (π4).
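
The replanning step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the domain is a simplified, gripper-free Blocksworld with a single `move(block, destination)` action, a breadth-first search stands in for the symbolic planner, and the executable prefix of the LLM plan stands in for the best sub-plan that the pipeline selects. None of the function names come from the paper.

```python
from collections import deque


def is_clear(state, block):
    """A block is clear when nothing rests on it."""
    return all(below != block for below in state.values())


def applicable(state, action):
    """Check the preconditions of move(block, dest)."""
    block, dest = action
    if block not in state or block == dest or state[block] == dest:
        return False
    if not is_clear(state, block):
        return False
    return dest == "table" or (dest in state and is_clear(state, dest))


def apply_action(state, action):
    block, dest = action
    new_state = dict(state)
    new_state[block] = dest
    return new_state


def executable_prefix(state, plan):
    """Execute actions until one violates the domain constraints."""
    executed = []
    for action in plan:
        if not applicable(state, action):
            break
        state = apply_action(state, action)
        executed.append(action)
    return executed, state


def complete_plan(state, goal):
    """Breadth-first search, used here in place of a symbolic planner."""
    queue = deque([(state, [])])
    seen = {tuple(sorted(state.items()))}
    while queue:
        current, path = queue.popleft()
        if all(current[b] == on for b, on in goal.items()):
            return path
        for block in current:
            for dest in list(current) + ["table"]:
                action = (block, dest)
                if applicable(current, action):
                    nxt = apply_action(current, action)
                    key = tuple(sorted(nxt.items()))
                    if key not in seen:
                        seen.add(key)
                        queue.append((nxt, path + [action]))
    return None


def recover(initial, goal, llm_plan):
    """Keep the executable prefix, then let the planner finish the job."""
    prefix, reached = executable_prefix(initial, llm_plan)
    completion = complete_plan(reached, goal)
    return None if completion is None else prefix + completion


# state maps each block to what it sits on
initial = {"A": "table", "B": "A", "C": "table"}
goal = {"A": "B", "B": "C"}
llm_plan = [("B", "C"), ("A", "C"), ("A", "B")]  # second action is invalid
recovered = recover(initial, goal, llm_plan)
```

In this toy case the LLM plan fails at its second action because `C` is no longer clear, so only the first move is kept and the search supplies the rest. Comparing the length of the kept prefix with the length of the completion shows how much of the final plan actually came from the LLM.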

Key Findings and Implications

The research evaluated plans generated by various LLMs (including Qwen, Llama, Gemma, Claude, Gemini, and GPT-3.5 Turbo) across different prompt types (zero-shot, one-shot, state-tracking, PDDL) in Blocksworld and Logistics domains. The results are insightful:

The recovery pipeline significantly improved the overall success rate of LLM-generated plans, increasing it from 21.9% to 27.5%.
Natural Language (NL) prompts showed more improvement through recovery than PDDL prompts, suggesting that NL plans are often closer to being recoverable.
Models with lower initial success rates often saw more substantial improvements through the recovery process.
A critical finding is the lack of clear evidence of underlying reasoning during plan generation by LLMs. Often, the first action in an LLM-generated plan is non-executable, indicating that LLMs may not reason over the search space but rather generate superficially plausible actions.
The correlation between a plan’s executability and its validity was found to be inconsistent, especially for high-success-rate models. This suggests that success rate alone is an insufficient indicator of true plan quality or usability.
The study also revealed that LLMs often do not consistently generate nearly-correct solutions that simply require refinement. In many cases, the symbolic planner had to complete the vast majority of the plan, with the LLM’s contribution to the final plan being minimal.

In conclusion, while LLMs show promise in understanding and generating language related to planning, they are still far from being reliable, independent planners. Their outputs are more akin to statistical pattern recognition than logical derivation. The proposed NLP-based evaluation and recovery pipeline offers a valuable framework for understanding LLM capabilities and improving their planning outputs, but it also reinforces the notion that classical symbolic planners remain essential for robust and reliable AI planning. For more details, refer to the full research paper.
