Language Models Crafting Sound Policies for Automated Planning

TLDR: This research introduces LMPLAN, a system in which language models (LMs) generate Python programs that serve as sound policies or value functions for generalised planning problems specified in PDDL. Experiments show that LM-generated policies can outperform traditional planners in coverage, especially when combined with value functions. Surprisingly, LMs sometimes perform better when the names in their input are replaced with meaningless symbols, suggesting a capacity for symbolic reasoning beyond word semantics.

Automated planning (AP) is the branch of artificial intelligence concerned with automating complex decision-making. Within AP, a particularly challenging area is generalised planning (GP), which aims to create programs that solve entire families of related planning problems, not just individual instances. Imagine a logistics company that needs a system able to plan package deliveries for any number of trucks and packages, rather than one that must be reprogrammed for each new scenario. This is the problem generalised planning addresses.

Recent advancements in large language models (LMs) have opened new avenues for tackling these complex problems. Researchers Dillon Z. Chen, Johannes Zenn, Tristan Cinquin, and Sheila A. McIlraith from the Vector Institute, LAAS-CNRS, and the University of Toronto have explored how LMs can be used to generate Python programs for generalised PDDL planning. Their work, detailed in the paper “Language Models For Generalised PDDL Planning: Synthesising Sound and Programmatic Policies”, introduces a novel approach that leverages LMs to create provably sound policies.

The Core Idea: LMs as Programmers for Planning

The central idea behind this research is to prompt LMs to write Python programs that act as “generalised policies” or “value functions” for solving planning problems defined in PDDL (Planning Domain Definition Language). PDDL is a standard language for describing planning domains, including object types, predicates, and actions. Instead of the LM directly outputting a plan, it generates code that can then be used by a planner to find solutions.

The researchers developed a planner called LMPLAN. This system takes a PDDL domain, along with a few example problems and a natural language prompt, and asks the LM to generate a Python program. This program can either be a “value function” (which estimates how good a state is, guiding a search algorithm) or a “policy” (which directly suggests an action to take in a given state).
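To make the two kinds of program concrete, here is a minimal sketch of what a value function and a policy could look like for a toy delivery domain. The state encoding (a frozenset of ground atoms), the `at` predicate, the `move` action, and both function signatures are assumptions made for illustration; they are not taken from the paper or from LMPLAN.

```python
from typing import FrozenSet, Optional, Tuple

Atom = Tuple[str, ...]     # hypothetical encoding, e.g. ("at", "pkg1", "depot")
State = FrozenSet[Atom]
Action = Tuple[str, ...]   # hypothetical encoding, e.g. ("move", "pkg1", "depot", "shop")


def value_function(state: State, goal: FrozenSet[Atom]) -> int:
    """Estimate how far a state is from the goal: count goal atoms not yet true."""
    return len(goal - state)


def policy(state: State, goal: FrozenSet[Atom]) -> Optional[Action]:
    """Suggest a single action: move the first misplaced package to its goal."""
    location = {atom[1]: atom[2] for atom in state if atom[0] == "at"}
    for atom in sorted(goal - state):
        if atom[0] == "at" and atom[1] in location:
            pkg, dest = atom[1], atom[2]
            return ("move", pkg, location[pkg], dest)
    return None  # nothing to suggest (for example, the goal already holds)


state = frozenset({("at", "pkg1", "depot"), ("at", "pkg2", "shop")})
goal = frozenset({("at", "pkg1", "shop"), ("at", "pkg2", "shop")})

estimate = value_function(state, goal)
suggestion = policy(state, goal)
print(estimate, suggestion)
```

The value function only scores states and leaves the choice of action to a search algorithm, whereas the policy commits to one action per state.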

Ensuring Reliability: Soundness in LM-Generated Policies

A significant contribution of this work is the guarantee of “soundness” for the synthesised policies. Soundness means that any solution generated by the policy is guaranteed to be correct and valid according to the PDDL domain rules. This is achieved by restricting the LM-generated policies to only predict actions that are applicable in the current state. If the LM makes an error and suggests an invalid action, the system defaults to choosing a random applicable action, thus maintaining soundness without needing external verifiers.
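The fallback mechanism described above can be sketched as a short execution loop. This is an illustration under stated assumptions, not LMPLAN's implementation: the STRIPS-style `Action` class, the helper names, and the step limit are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

State = FrozenSet[str]


@dataclass(frozen=True)
class Action:
    # Hypothetical STRIPS-style action: preconditions, add effects, delete effects.
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def applicable(state: State, actions: List[Action]) -> List[Action]:
    """Return the actions whose preconditions hold in the state."""
    return [a for a in actions if a.pre <= state]


def apply_action(state: State, action: Action) -> State:
    """Apply the delete effects, then the add effects."""
    return (state - action.delete) | action.add


def run_sound_policy(
    state: State,
    goal: FrozenSet[str],
    actions: List[Action],
    policy: Callable[[State, FrozenSet[str]], Optional[Action]],
    max_steps: int = 100,
    seed: int = 0,
) -> Optional[List[Action]]:
    """Execute a policy, only ever taking actions that are applicable."""
    rng = random.Random(seed)
    plan: List[Action] = []
    for _ in range(max_steps):
        if goal <= state:
            return plan
        options = applicable(state, actions)
        if not options:
            return None  # dead end: no applicable action
        choice = policy(state, goal)
        if choice not in options:
            # The suggestion is invalid: fall back to a random applicable action.
            choice = rng.choice(options)
        plan.append(choice)
        state = apply_action(state, choice)
    return plan if goal <= state else None


step1 = Action("step1", frozenset({"s0"}), frozenset({"s1"}), frozenset({"s0"}))
step2 = Action("step2", frozenset({"s1"}), frozenset({"s2"}), frozenset({"s1"}))


def faulty_policy(state: State, goal: FrozenSet[str]) -> Optional[Action]:
    return step2  # always suggests step2, even where it is not applicable


plan = run_sound_policy(frozenset({"s0"}), frozenset({"s2"}), [step1, step2], faulty_policy)
print([a.name for a in plan])
```

Because every executed action is checked against the applicable set first, any plan the loop returns is valid by construction. The loop may still fail to return a plan, which is why soundness here does not imply completeness.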

Experimental Insights: Policies, Value Functions, and Symbolic Reasoning

The LMPLAN planner was tested on 10 standard PDDL planning domains, including Blocksworld, Ferry, and Satellite, with problems ranging in difficulty. The experiments aimed to answer several key questions:

First, the study compared LM-generated value functions (used in heuristic search) with policies (used as reactive controllers). Policies (πLM) were remarkably effective for simpler problems, solving all problems in 6 out of 10 domains and even outperforming a state-of-the-art planner (LAMA) in overall coverage. Value functions (VLM), while more consistent across domains, had lower overall coverage. The best performance came from a portfolio approach (πLM ⊗ VLM) that chose between policies and value functions based on validation scores, achieving the highest total coverage.
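A portfolio of this kind can be sketched as a simple selection step. The sketch assumes the choice is made once per domain and that each candidate has a single validation score; the article does not spell out either detail. The domain names and scores below are invented placeholders, not results from the paper.

```python
from typing import Dict


def select_per_domain(validation_scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each domain, keep the candidate with the highest validation score."""
    chosen = {}
    for domain, scores in validation_scores.items():
        # On a tie, max() keeps the candidate listed first.
        chosen[domain] = max(scores, key=scores.get)
    return chosen


# Placeholder scores for illustration only (fraction of validation problems solved).
scores = {
    "domain_a": {"policy": 1.0, "value_function": 0.8},
    "domain_b": {"policy": 0.3, "value_function": 0.9},
}
portfolio = select_per_domain(scores)
print(portfolio)
```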

Second, the research investigated the importance of soundness and completeness. While all approaches in this study were sound, policies (πLM) are not inherently complete (meaning they might not find a solution even if one exists). However, the results showed that soundness is crucial, and incomplete but sound policies can still achieve excellent performance. Combining policies with complete search algorithms (πLM ⊕ VLM) consistently improved performance over pure search with value functions alone.
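For reference, the "pure search with value functions" baseline can be pictured as a greedy best-first search that always expands the state with the lowest estimated value. The sketch below uses a toy numeric domain and hypothetical names. It does not show how the paper combines policies with search, since the article does not describe that mechanism.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def greedy_best_first_search(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    successors: Callable[[Hashable], Iterable[Tuple[str, Hashable]]],
    value: Callable[[Hashable], float],
) -> Optional[List[str]]:
    """Expand the lowest-valued state first; complete on finite state spaces."""
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(value(start), next(tie), start, [])]
    seen = {start}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (value(nxt), next(tie), nxt, plan + [action]))
    return None  # every reachable state was explored without reaching the goal


# Toy domain: reach a target number from 0 using "inc" (+1) or "jump" (+3), capped at 10.
def toy_successors(n: int):
    return [(name, n + step) for name, step in (("inc", 1), ("jump", 3)) if n + step <= 10]


solution = greedy_best_first_search(0, lambda n: n == 7, toy_successors, lambda n: abs(7 - n))
unreachable = greedy_best_first_search(0, lambda n: n == 11, toy_successors, lambda n: abs(11 - n))
print(solution, unreachable)
```

Unlike a standalone policy, this search keeps a frontier of alternatives, so on a finite state space it either finds a plan or reports that none exists.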

Perhaps the most surprising finding relates to how LMs process information. The researchers conducted an ablation study in which all meaningful names in the PDDL files (like “dog” or “at”) were replaced with meaningless symbols (e.g., “o1” for an object, “p2” for a predicate). Contrary to previous hypotheses that LMs rely on word semantics, the LM-generated programs sometimes performed better with these symbolic representations. This suggests that LMs in this framework might be learning a form of symbolic planning or reasoning, rather than just memorising solutions based on natural language understanding.
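The renaming step of such an ablation can be illustrated with a few lines of Python. This is a sketch only: the `obfuscate` helper, its numbering scheme, and the whitespace-and-parenthesis tokenisation are assumptions, not the authors' procedure, and it ignores details such as PDDL's case insensitivity.

```python
import re
from typing import Dict, List, Tuple


def obfuscate(pddl_text: str, predicates: List[str], objects: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace listed predicate and object names with symbols such as p1 and o1."""
    mapping = {name: f"p{i}" for i, name in enumerate(predicates, start=1)}
    mapping.update({name: f"o{i}" for i, name in enumerate(objects, start=1)})
    # Split on parentheses and whitespace, keeping the delimiters so the
    # layout of the text is preserved when the tokens are joined again.
    tokens = re.split(r"([()\s]+)", pddl_text)
    return "".join(mapping.get(tok, tok) for tok in tokens), mapping


text = "(:init (at dog park) (at cat home))"
renamed, mapping = obfuscate(text, ["at"], ["dog", "cat", "park", "home"])
print(renamed)
```

Only whole tokens are replaced, so a name that appears inside a longer identifier is left untouched.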

Limitations and Future Directions

Despite its strengths, the LMPLAN approach has limitations. While policies excel in coverage, their plan quality can sometimes be significantly worse than traditional planners, occasionally taking many unnecessary actions. Also, standalone policies lack completeness and termination guarantees. The observation that LMs sometimes perform better with symbolic names is provocative but warrants further investigation to fully understand the underlying mechanisms.

This research highlights the potential of LMs to generate robust and sound programmatic solutions for generalised planning, pushing the boundaries of what LMs can achieve in complex reasoning tasks.
