
Bridging the Gap: Evaluating Large Language Models in AI Planning Through an NLP Lens

TLDR: This research paper introduces a new pipeline to evaluate and improve plans generated by Large Language Models (LLMs) for AI planning tasks. By treating plan generation as a Natural Language Processing (NLP) problem, the authors propose a recovery mechanism that combines NLP manipulations with symbolic planning. The findings suggest that while this pipeline can improve plan quality and success rates, LLMs still lack true reasoning abilities for planning and remain less reliable than traditional symbolic planners, often requiring significant completion by classical methods.

Large Language Models (LLMs) have revolutionized many areas of artificial intelligence, particularly in their ability to process and generate human-like text. Naturally, their potential in AI planning—the process of finding a sequence of actions to achieve a goal—has garnered significant interest. However, despite their impressive linguistic capabilities, LLMs often fall short when it comes to generating reliable and accurate plans, frequently producing plans with errors or fabricated actions.

A recent research paper titled “How Far Are LLMs from Symbolic Planners? An NLP-Based Perspective,” by Ma’ayan Armony, Albert Meroño-Peñuela, and Gerard Canal of King’s College London, delves into this challenge. The authors approach planning with LLMs not just as a planning task but fundamentally as a Natural Language Processing (NLP) task, given that LLMs are inherently NLP models. This perspective is crucial because LLMs generate outputs based on statistical patterns from vast text corpora, rather than through deep semantic understanding or logical reasoning over constraints, which is the bedrock of traditional symbolic planners.

The paper introduces a novel recovery pipeline designed to evaluate and improve LLM-generated plans. This pipeline consists of several stages: an NLP-based evaluation of the initial plans, followed by three stages of NLP manipulation to refine these plans, and finally, a completion stage using a symbolic planner. This comprehensive approach allows for a more nuanced understanding of the quality of even invalid plans, going beyond simple success rates.

The Evaluation and Recovery Pipeline

The pipeline begins by comparing an LLM-generated plan (π0) to a ground-truth (GT) plan (πGT) produced by a symbolic planner. Actions in the LLM’s plan are assigned quality labels (e.g., correct, misplaced, same action but wrong parameters, different action, redundant) based on their similarity to actions in the GT plan. This similarity is quantified using semantic measures and parameter alignment. The evaluation also considers the Longest Common Subset (LCS) and Longest Common Subsequence (LCS) to measure structural overlap.
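The comparison step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes actions are plain tuples such as ("stack", "a", "b"), replaces the paper's semantic similarity with exact matching, and omits the "redundant" label. The function names lcs_length and label_action are hypothetical.

```python
from typing import List, Tuple

Action = Tuple[str, ...]  # (name, param1, param2, ...), an assumed encoding


def lcs_length(plan: List[Action], gt: List[Action]) -> int:
    """Length of the longest common subsequence of two action lists."""
    rows, cols = len(plan) + 1, len(gt) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if plan[i - 1] == gt[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def label_action(index: int, action: Action, gt: List[Action]) -> str:
    """Simplified labelling by exact match (assumption, not the paper's measure)."""
    if index < len(gt) and gt[index] == action:
        return "correct"            # same action at the same position
    if action in gt:
        return "misplaced"          # same action, different position
    if any(g[0] == action[0] for g in gt):
        return "wrong_parameters"   # same action name, other arguments
    return "different_action"       # name never appears in the GT plan


gt_plan = [("unstack", "b", "a"), ("put-down", "b"),
           ("pick-up", "a"), ("stack", "a", "b")]
llm_plan = [("unstack", "b", "a"), ("pick-up", "a"),
            ("put-down", "a"), ("fly", "a")]

labels = [label_action(i, a, gt_plan) for i, a in enumerate(llm_plan)]
overlap = lcs_length(llm_plan, gt_plan)
```

Here the fabricated action "fly" ends up labelled as a different action, and the subsequence overlap counts only the two actions that appear in the same relative order in both plans.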

A key aspect of the evaluation is determining the “plan potential”: how difficult it would be to transform an invalid LLM plan into a valid and optimal one. This involves applying basic NLP transformations like circular shifts (reordering actions) and consistent parameter remapping (swapping objects in the plan). The plan variant with the highest score after these transformations is selected as π1. The pipeline also identifies the “Last Executable Action” (LEA), which is the last action in the plan that doesn’t violate domain constraints, providing insight into how far the LLM got before failing.
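The two transformations can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's method: actions are tuples, the score is simply the number of positions that match the GT plan (the paper's scoring is richer), and every permutation of the plan's objects is tried, which is only practical for a handful of objects. All function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterator, List, Tuple

Action = Tuple[str, ...]  # (name, param1, param2, ...), an assumed encoding


def circular_shifts(plan: List[Action]) -> List[List[Action]]:
    """All rotations of the plan (the plan itself is rotation 0)."""
    if not plan:
        return [plan]
    return [plan[k:] + plan[:k] for k in range(len(plan))]


def remap(plan: List[Action], mapping: Dict[str, str]) -> List[Action]:
    """Rename objects consistently across every action in the plan."""
    return [(a[0],) + tuple(mapping.get(p, p) for p in a[1:]) for a in plan]


def parameter_remappings(plan: List[Action]) -> Iterator[List[Action]]:
    """Yield the plan under every permutation of its own objects."""
    objects = sorted({p for a in plan for p in a[1:]})
    for perm in permutations(objects):
        yield remap(plan, dict(zip(objects, perm)))


def score(plan: List[Action], gt: List[Action]) -> int:
    """Toy score (assumption): number of positions equal to the GT plan."""
    return sum(1 for a, g in zip(plan, gt) if a == g)


def best_variant(plan: List[Action], gt: List[Action]) -> List[Action]:
    """Pick the highest-scoring shifted and remapped variant."""
    candidates = [remapped
                  for shifted in circular_shifts(plan)
                  for remapped in parameter_remappings(shifted)]
    return max(candidates, key=lambda c: score(c, gt))


gt_plan = [("pick-up", "a"), ("stack", "a", "b")]
llm_plan = [("stack", "b", "a"), ("pick-up", "b")]  # wrong order, swapped objects
pi_1 = best_variant(llm_plan, gt_plan)
```

In this example the LLM plan has the right shape but the wrong order and swapped block names, so one rotation plus the a/b swap recovers the GT plan exactly.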

For plan recovery, the pipeline identifies the best sub-plan (π2 and π3) from the initial and NLP-transformed plans. The final recovery step involves “replanning.” Here, the last successfully executed state of the LLM’s plan is used as the starting point for a symbolic planner to complete the rest of the plan (πcomp), resulting in the final recovered plan (π4).
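The replanning idea can be sketched on a toy Blocksworld. The code below is an illustration under assumptions, not the paper's setup: states are sets of facts, the four standard Blocksworld actions are hard-coded instead of being read from a PDDL domain, and a breadth-first search stands in for the symbolic planner. The names effects, apply, executable_prefix, and complete are hypothetical.

```python
from collections import deque
from typing import FrozenSet, List, Optional, Set, Tuple

Fact = Tuple[str, ...]
Action = Tuple[str, ...]
State = FrozenSet[Fact]
HAND: Fact = ("handempty",)


def effects(action: Action):
    """Return (preconditions, add set, delete set), or None if unknown."""
    name, args = action[0], action[1:]
    if name in ("pick-up", "put-down") and len(args) == 1:
        x = args[0]
        table = {("clear", x), ("ontable", x), HAND}
        held = {("holding", x)}
        return (table, held, table) if name == "pick-up" else (held, table, held)
    if name in ("stack", "unstack") and len(args) == 2:
        x, y = args
        stacked = {("on", x, y), ("clear", x), HAND}
        lifted = {("holding", x), ("clear", y)}
        return (lifted, stacked, lifted) if name == "stack" else (stacked, lifted, stacked)
    return None  # fabricated or malformed action


def apply(state: State, action: Action) -> Optional[State]:
    """Next state, or None when the action is not executable here."""
    spec = effects(action)
    if spec is None or not spec[0] <= state:
        return None
    _, add, delete = spec
    return frozenset((state - delete) | add)


def executable_prefix(state: State, plan: List[Action]):
    """Run the plan up to its last executable action; return (prefix, state)."""
    prefix: List[Action] = []
    for action in plan:
        nxt = apply(state, action)
        if nxt is None:
            break
        prefix.append(action)
        state = nxt
    return prefix, state


def complete(state: State, goal: Set[Fact], blocks: List[str]) -> Optional[List[Action]]:
    """Breadth-first search to the goal (stand-in for a symbolic planner)."""
    moves: List[Action] = [(n, x) for n in ("pick-up", "put-down") for x in blocks]
    moves += [(n, x, y) for n in ("stack", "unstack")
              for x in blocks for y in blocks if x != y]
    queue, seen = deque([(state, [])]), {state}
    while queue:
        current, path = queue.popleft()
        if goal <= current:
            return path
        for move in moves:
            nxt = apply(current, move)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [move]))
    return None


init: State = frozenset({("on", "b", "a"), ("ontable", "a"), ("clear", "b"), HAND})
goal = {("on", "a", "b")}
llm_plan = [("unstack", "b", "a"), ("stack", "a", "b")]  # second action fails

prefix, last_state = executable_prefix(init, llm_plan)
pi_comp = complete(last_state, goal, ["a", "b"])
pi_4 = prefix + pi_comp
```

In this example the LLM contributes one executable action and the search supplies the remaining three, which mirrors the article's observation that the symbolic planner often ends up completing most of the plan.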


Key Findings and Implications

The research evaluated plans generated by various LLMs (including Qwen, Llama, Gemma, Claude, Gemini, and GPT-3.5 Turbo) across different prompt types (zero-shot, one-shot, state-tracking, PDDL) in Blocksworld and Logistics domains. The results are insightful:

  • The recovery pipeline significantly improved the overall success rate of LLM-generated plans, increasing it from 21.9% to 27.5% (a gain of 5.6 percentage points).
  • Natural Language (NL) prompts showed more improvement through recovery than PDDL prompts, suggesting that NL plans are often closer to being recoverable.
  • Models with lower initial success rates often saw more substantial improvements through the recovery process.
  • A critical finding is the lack of clear evidence of underlying reasoning during plan generation by LLMs. Often, the first action in an LLM-generated plan is non-executable, indicating that LLMs may not reason over the search space but rather generate superficially plausible actions.
  • The correlation between a plan’s executability and its validity was found to be inconsistent, especially for high-success-rate models. This suggests that success rate alone is an insufficient indicator of true plan quality or usability.
  • The study also revealed that LLMs often do not consistently generate nearly-correct solutions that simply require refinement. In many cases, the symbolic planner had to complete the vast majority of the plan, with the LLM’s contribution to the final plan being minimal.

In conclusion, while LLMs show promise in understanding and generating language related to planning, they are still far from being reliable, independent planners. Their outputs are more akin to statistical pattern recognition than logical derivation. The proposed NLP-based evaluation and recovery pipeline offers a valuable framework for understanding LLM capabilities and improving their planning outputs, but it also reinforces the notion that classical symbolic planners remain essential for robust and reliable AI planning. For more details, you can refer to the full research paper here.

Nikhil Patel
