A New Benchmark Reveals LLMs Struggle with Constrained Planning

TLDR: LEXICON is a novel benchmark designed to evaluate large language models (LLMs) on planning tasks that incorporate temporal constraints described in natural language. It automatically generates complex, solvable problems across various environments and includes an automated system to verify LLM-generated plans. Experiments show that the performance of state-of-the-art LLMs, even those with advanced reasoning capabilities, significantly deteriorates as the number of constraints in a planning problem increases, highlighting a critical limitation in their current planning abilities.

Large language models (LLMs) have shown impressive reasoning abilities, leading to their evaluation in various planning tasks. However, a significant gap exists in their performance when these planning tasks involve real-world constraints, especially safety-critical ones. To address this, a new benchmark called LEXICON has been introduced.

LEXICON, which stands for natural language-based (LEXI) constrained (CON) planning benchmark, aims to rigorously evaluate how well LLMs can handle planning problems with temporal constraints. Developed by researchers Periklis Mantenoglou, Rishi Hazra, Pedro Zuidberg Dos Martires, and Luc De Raedt from Örebro University and KU Leuven, LEXICON takes existing planning environments and automatically imposes temporal constraints on their states. These constrained problems are then translated into natural language and presented to LLMs for solving.
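To make "temporal constraints on states" concrete, here is a minimal sketch of how such constraints can be checked over the sequence of states a plan visits. The article does not list LEXICON's constraint types, so the three operators below are borrowed from PDDL3-style trajectory constraints purely as an illustration. The state encoding (a set of fact strings) and all function names are assumptions, not LEXICON's API.

```python
from typing import Callable, FrozenSet, List

# Assumption: a state is the set of facts that are true in it.
State = FrozenSet[str]
Pred = Callable[[State], bool]


def always(trace: List[State], p: Pred) -> bool:
    """p must hold in every state of the trace."""
    return all(p(s) for s in trace)


def sometime(trace: List[State], p: Pred) -> bool:
    """p must hold in at least one state of the trace."""
    return any(p(s) for s in trace)


def sometime_before(trace: List[State], p: Pred, q: Pred) -> bool:
    """Whenever p holds, q must have held in a strictly earlier state."""
    seen_q = False
    for s in trace:
        if p(s) and not seen_q:
            return False
        if q(s):
            seen_q = True
    return True


def holds(fact: str) -> Pred:
    """Build a predicate that tests for a single fact (hypothetical helper)."""
    return lambda s: fact in s


# Hypothetical three-state trace: pick up a key, then open a door.
trace = [
    frozenset({"door_locked"}),
    frozenset({"door_locked", "has_key"}),
    frozenset({"has_key", "door_open"}),
]

never_in_lava = always(trace, lambda s: "in_lava" not in s)
door_opened = sometime(trace, holds("door_open"))
key_before_door = sometime_before(trace, holds("door_open"), holds("has_key"))
door_before_key = sometime_before(trace, holds("has_key"), holds("door_open"))
```

In this toy trace the first three checks succeed and the last one fails, because the key is obtained before the door is ever open. Constraints of this kind restrict the whole trajectory, not just the final goal state.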

A core strength of LEXICON is its extensibility. It can incorporate new, unconstrained environment generators, for which temporal constraints are automatically constructed. This design makes the benchmark future-proof, allowing the difficulty of planning problems to increase as LLM capabilities advance. The benchmark currently supports five diverse environments: BabyAI, Blocksworld, Logistics, Sokoban, and AlfWorld, each presenting unique planning challenges with added constraints.

The LEXICON architecture includes two main components: a symbolic reasoning engine and a translator. The reasoning engine generates constrained planning problems, ensuring they remain solvable while increasing complexity compared to their unconstrained versions. It also guarantees that constraints are meaningful and non-redundant. The translator then converts these formal planning problems and their constraints into clear natural language descriptions, making them accessible for LLM evaluation.
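The translator step can be pictured as template filling. The sketch below is only an assumption about how such a translation could work: the article does not describe the actual mechanism, and the template wording, fact names, and function names are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical sentence templates, one per constraint type.
CONSTRAINT_TEMPLATES: Dict[str, str] = {
    "always": "In every state, it must be the case that {0}.",
    "sometime": "In at least one state, it must be the case that {0}.",
    "sometime-before": (
        "If it is ever the case that {0}, "
        "then it must have been the case earlier that {1}."
    ),
}

# Hypothetical templates for individual facts (predicate name -> phrase).
FACT_TEMPLATES: Dict[str, str] = {
    "in": "{0} is in {1}",
    "at": "{0} is at {1}",
}

Fact = Tuple[str, ...]  # e.g. ("at", "truck1", "depot")


def fact_to_text(fact: Fact) -> str:
    """Render one formal fact as an English phrase."""
    name, *args = fact
    return FACT_TEMPLATES[name].format(*args)


def constraint_to_text(kind: str, *facts: Fact) -> str:
    """Render a formal constraint as an English sentence."""
    phrases = [fact_to_text(f) for f in facts]
    return CONSTRAINT_TEMPLATES[kind].format(*phrases)


sentence = constraint_to_text(
    "sometime-before",
    ("at", "truck1", "depot"),
    ("in", "package1", "truck1"),
)
print(sentence)
```

A fixed-template approach like this keeps the natural language description faithful to the formal problem, which matters when the same formal problem is later used to verify the model's answer.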

Once an LLM generates a plan, LEXICON’s automated verifier module steps in. It maps the LLM’s natural language plan to formal actions, validates its correctness against the compiled constrained problem, and checks for optimality by comparing its length to the known optimal cost. This rigorous verification process ensures accurate assessment of LLM performance.
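The validation and optimality check described above can be sketched as a small plan simulator. This is a simplified illustration, not LEXICON's verifier: it assumes the plan has already been mapped to formal action names, that actions are STRIPS-style (preconditions, add effects, delete effects), and that any constraints are already encoded in the problem being checked. All names and the toy Blocksworld-like problem are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Action:
    pre: FrozenSet[str]     # facts that must hold before the action
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def verify(plan: List[str], actions: Dict[str, Action],
           init: FrozenSet[str], goal: FrozenSet[str],
           optimal_cost: int) -> str:
    """Return 'invalid', 'suboptimal', or 'optimal' for a plan."""
    state = set(init)
    for name in plan:
        act = actions.get(name)
        # Unknown action or unmet precondition makes the plan invalid.
        if act is None or not act.pre <= state:
            return "invalid"
        state = (state - act.delete) | act.add
    if not goal <= state:
        return "invalid"
    return "optimal" if len(plan) == optimal_cost else "suboptimal"


fs = frozenset
ACTIONS = {
    "pickup_a": Action(fs({"clear_a", "ontable_a", "handempty"}),
                       fs({"holding_a"}),
                       fs({"clear_a", "ontable_a", "handempty"})),
    "putdown_a": Action(fs({"holding_a"}),
                        fs({"clear_a", "ontable_a", "handempty"}),
                        fs({"holding_a"})),
    "stack_a_b": Action(fs({"holding_a", "clear_b"}),
                        fs({"on_a_b", "clear_a", "handempty"}),
                        fs({"holding_a", "clear_b"})),
}
INIT = fs({"clear_a", "clear_b", "ontable_a", "ontable_b", "handempty"})
GOAL = fs({"on_a_b"})

short_plan = verify(["pickup_a", "stack_a_b"], ACTIONS, INIT, GOAL, 2)
long_plan = verify(["pickup_a", "putdown_a", "pickup_a", "stack_a_b"],
                   ACTIONS, INIT, GOAL, 2)
broken_plan = verify(["stack_a_b"], ACTIONS, INIT, GOAL, 2)
```

Here the two-step plan is classified as optimal, the four-step plan as suboptimal, and the one-step plan as invalid because block A is not being held when the stack action is attempted. That last case is an instance of the precondition violations mentioned in the experiments.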

Experiments conducted with state-of-the-art LLMs, including reasoning models like GPT-5, OpenAI o3, DeepSeek R1, Gemini 2.5 Pro, and Claude 3.7 Sonnet, revealed a consistent trend: LLM performance significantly declines as the number of constraints in a planning task increases. While reasoning models generally outperformed those without explicit thinking capabilities, they still struggled with highly constrained problems, often failing to produce even suboptimal plans when faced with 10 constraints.

Interestingly, the researchers observed that the number of “thinking tokens” generated by reasoning models increased with the optimal plan length, suggesting deeper reasoning for more complex tasks. However, this deeper reasoning did not always yield sound plans: models made errors such as precondition violations, state hallucination, misinterpreted constraints, and loss of state tracking.

Despite these challenges, LEXICON itself proves to be highly efficient. It can generate and verify new constrained planning problems roughly an order of magnitude faster than LLMs can solve them. This efficiency enables real-time, adaptive evaluation of LLM planners, allowing for on-the-fly assessment of their capabilities across varying complexities.

The introduction of LEXICON provides a principled platform for evaluating LLMs on increasingly complex planning tasks with compositional constraints. As LLMs continue to evolve, this benchmark will be crucial for understanding and improving their ability to perform robust and safe planning in real-world applications. For more details, you can refer to the full research paper.

Ananya Rao (https://blogs.edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
