A New Benchmark Reveals LLMs Struggle with Constrained Planning

TLDR: LEXICON is a novel benchmark designed to evaluate large language models (LLMs) on planning tasks that incorporate temporal constraints described in natural language. It automatically generates complex, solvable problems across various environments and includes an automated system to verify LLM-generated plans. Experiments show that the performance of state-of-the-art LLMs, even those with advanced reasoning capabilities, significantly deteriorates as the number of constraints in a planning problem increases, highlighting a critical limitation in their current planning abilities.

Large language models (LLMs) have shown impressive reasoning abilities, which has prompted their evaluation on a range of planning tasks. However, their performance falls short when these tasks involve real-world constraints, especially safety-critical ones. To address this, researchers have introduced a new benchmark called LEXICON.

LEXICON, which stands for natural language-based (LEXI) constrained (CON) planning benchmark, aims to rigorously evaluate how well LLMs can handle planning problems with temporal constraints. Developed by researchers Periklis Mantenoglou, Rishi Hazra, Pedro Zuidberg Dos Martires, and Luc De Raedt from Örebro University and KU Leuven, LEXICON takes existing planning environments and automatically imposes temporal constraints on their states. These constrained problems are then translated into natural language and presented to LLMs for solving.

A core strength of LEXICON is its extensibility. It can incorporate new, unconstrained environment generators, for which temporal constraints are automatically constructed. This design makes the benchmark future-proof, allowing the difficulty of planning problems to increase as LLM capabilities advance. The benchmark currently supports five diverse environments: BabyAI, Blocksworld, Logistics, Sokoban, and AlfWorld, each presenting unique planning challenges with added constraints.

The LEXICON architecture includes two main components: a symbolic reasoning engine and a translator. The reasoning engine generates constrained planning problems, ensuring they remain solvable while increasing complexity compared to their unconstrained versions. It also guarantees that constraints are meaningful and non-redundant. The translator then converts these formal planning problems and their constraints into clear natural language descriptions, making them accessible for LLM evaluation.
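To make the translator's role concrete, here is a minimal sketch of how formal temporal constraints could be rendered as natural language with fixed templates. It is an illustration only, not LEXICON's implementation: the constraint kinds (`always`, `sometime`, `sometime_before`), the `Constraint` class, and the template wording are all assumptions, since the article does not list the benchmark's actual operators or phrasing.

```python
from dataclasses import dataclass

# Hypothetical constraint kinds and wording; the article does not specify
# which temporal operators LEXICON uses or how it phrases them.
TEMPLATES = {
    "always": "The following must hold in every state: {a}.",
    "sometime": "The following must hold in at least one state: {a}.",
    "sometime_before": (
        "If {a} ever holds, then {b} must have held in an earlier state."
    ),
}


@dataclass(frozen=True)
class Constraint:
    kind: str    # key into TEMPLATES
    a: str       # first condition, already phrased in English
    b: str = ""  # second condition, only used by two-argument kinds


def describe(constraint: Constraint) -> str:
    """Render one formal constraint as an English sentence."""
    return TEMPLATES[constraint.kind].format(a=constraint.a, b=constraint.b)


def describe_problem(goal: str, constraints: list) -> str:
    """Render the goal plus a numbered list of constraints."""
    lines = [f"Goal: {goal}.", "Constraints:"]
    lines += [f"{i}. {describe(c)}" for i, c in enumerate(constraints, 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_problem(
        "block A is on block B",
        [
            Constraint("sometime", "block B is on the table"),
            Constraint("sometime_before", "block A is on block B",
                       "block C is on the table"),
        ],
    ))
```

Template-based rendering keeps the mapping from formal constraint to sentence deterministic, so any ambiguity an LLM encounters comes from the problem rather than from the phrasing.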

Once an LLM generates a plan, LEXICON’s automated verifier module steps in. It maps the LLM’s natural language plan to formal actions, validates its correctness against the compiled constrained problem, and checks for optimality by comparing its length to the known optimal cost. This rigorous verification process ensures accurate assessment of LLM performance.
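The verification steps described above can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not LEXICON's verifier: it checks constraints directly on the state trace rather than against a compiled problem, it skips the step of mapping natural language to formal actions, and the `Action` class, the three constraint kinds, and the three-room domain are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset     # facts that must hold before the action
    add: frozenset     # facts the action makes true
    delete: frozenset  # facts the action makes false


def simulate(init, plan, actions):
    """Return the list of visited states, or None if the plan is not executable."""
    state = frozenset(init)
    trace = [state]
    for name in plan:
        act = actions.get(name)
        if act is None or not act.pre <= state:
            return None  # unknown action or precondition violation
        state = (state - act.delete) | act.add
        trace.append(state)
    return trace


def holds(constraint, trace):
    """Check one (kind, fact) constraint over the whole state trace."""
    kind, fact = constraint
    if kind == "always":
        return all(fact in s for s in trace)
    if kind == "never":
        return all(fact not in s for s in trace)
    if kind == "sometime":
        return any(fact in s for s in trace)
    raise ValueError(f"unknown constraint kind: {kind}")


def verify(init, goal, constraints, plan, actions, optimal_cost):
    """Classify a plan as 'invalid', 'suboptimal', or 'optimal'."""
    trace = simulate(init, plan, actions)
    if trace is None or not frozenset(goal) <= trace[-1]:
        return "invalid"
    if not all(holds(c, trace) for c in constraints):
        return "invalid"
    return "optimal" if len(plan) == optimal_cost else "suboptimal"


def move(src, dst):
    """Hypothetical toy action: move a robot from room src to room dst."""
    return Action(f"move_{src}_{dst}", frozenset({f"at_{src}"}),
                  frozenset({f"at_{dst}"}), frozenset({f"at_{src}"}))


ACTIONS = {a.name: a for a in (move("a", "b"), move("b", "c"),
                               move("a", "c"), move("c", "b"))}
INIT = {"at_a"}
GOAL = {"at_c"}
CONSTRAINTS = [("sometime", "at_b")]  # the robot must pass through room b
OPTIMAL_COST = 2                      # shortest plan that visits b, then c

if __name__ == "__main__":
    for plan in (["move_a_c"],
                 ["move_a_b", "move_b_c"],
                 ["move_a_c", "move_c_b", "move_b_c"]):
        print(plan, verify(INIT, GOAL, CONSTRAINTS, plan, ACTIONS, OPTIMAL_COST))
```

In this toy problem the direct move to room c reaches the goal but skips room b, so it is rejected. This shows how a constraint can make the shortest unconstrained plan invalid and raise the optimal cost.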

Experiments conducted with state-of-the-art LLMs, including reasoning models like GPT-5, OpenAI o3, DeepSeek R1, Gemini 2.5 Pro, and Claude 3.7 Sonnet, revealed a consistent trend: LLM performance declines significantly as the number of constraints in a planning task increases. While reasoning models generally outperformed those without explicit thinking capabilities, they still struggled with highly constrained problems, often failing to produce any valid plan, even a suboptimal one, when faced with 10 constraints.

Interestingly, the research observed that the number of “thinking tokens” generated by reasoning models increased with the optimal plan length, suggesting deeper reasoning for more complex tasks. However, this deeper reasoning did not always translate to sound plans, with models exhibiting errors such as precondition violations, state hallucination, misinterpreted constraints, and loss of state tracking.

Despite these challenges, LEXICON itself proves to be highly efficient. It can generate and verify new constrained planning problems roughly an order of magnitude faster than LLMs can solve them. This efficiency enables real-time, adaptive evaluation of LLM planners, allowing for on-the-fly assessment of their capabilities across varying complexities.

The introduction of LEXICON provides a principled platform for evaluating LLMs on increasingly complex planning tasks with compositional constraints. As LLMs continue to evolve, this benchmark will be crucial for understanding and improving their ability to perform robust and safe planning in real-world applications. For more details, you can refer to the full research paper.
