Bridging Vision and Formal Logic for Autonomous AI Planning

TLDR: VLMFP is a new AI framework that uses two specialized Vision Language Models (VLMs) to autonomously generate planning rules (PDDL domain and problem files) from visual inputs. It combines a SimVLM for accurate visual perception and action simulation with a GenVLM for PDDL generation and refinement. This allows AI to perform complex visual planning without human intervention, showing strong generalization across different scenarios, visual styles, and even modified game rules, significantly outperforming previous methods.

Planning how to achieve a goal, especially in complex visual environments, is a fundamental challenge for artificial intelligence. While humans effortlessly navigate scenarios like assembling furniture or driving, current AI systems, particularly Vision Language Models (VLMs), often struggle with the precision required for spatial reasoning and planning over many steps. On the other hand, formal planning systems, using languages like Planning Domain Definition Language (PDDL), excel at long-term, structured problem-solving but cannot directly interpret visual information.

A new framework called VLMFP (VLM-Guided Formal Planning) addresses this gap by combining the strengths of both approaches. Developed by researchers including Yilun Hao, Yongchao Chen, Chuchu Fan, and Yang Zhang, VLMFP is a dual-VLM system designed to autonomously generate the necessary planning files from visual inputs, eliminating the need for human experts or constant environmental interaction.

Previous attempts to merge VLMs and PDDL planners often hit a roadblock: while VLMs could generate PDDL “problem” files (describing a specific scenario’s initial state and goal), they struggled with the more complex “domain” files, which define all the general rules and actions of a planning environment. VLMFP overcomes this by introducing two specialized VLMs working in tandem.
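To make the distinction between the two file types concrete, here is a minimal sketch of what a domain file and a problem file can look like, wrapped in Python with two basic sanity checks. The `grid-walk` domain, its predicates, and the helper functions are hypothetical illustrations and are not taken from the VLMFP paper.

```python
import re

# Hypothetical domain file: general rules shared by every instance.
DOMAIN = """
(define (domain grid-walk)
  (:requirements :strips)
  (:predicates (at ?cell) (adjacent ?from ?to))
  (:action move
    :parameters (?from ?to)
    :precondition (and (at ?from) (adjacent ?from ?to))
    :effect (and (not (at ?from)) (at ?to))))
"""

# Hypothetical problem file: one specific scenario (initial state and goal).
PROBLEM = """
(define (problem two-cells)
  (:domain grid-walk)
  (:objects c1 c2)
  (:init (at c1) (adjacent c1 c2))
  (:goal (at c2)))
"""


def parens_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def domain_name(text):
    """Extract the name declared by a domain file, or None."""
    match = re.search(r"\(define\s+\(domain\s+([\w-]+)\)", text)
    return match.group(1) if match else None


def problem_domain(text):
    """Extract the domain a problem file refers to, or None."""
    match = re.search(r"\(:domain\s+([\w-]+)\)", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(parens_balanced(DOMAIN), parens_balanced(PROBLEM))
    print(domain_name(DOMAIN) == problem_domain(PROBLEM))
```

The domain file holds the reusable part (predicates and the `move` action), while the problem file only lists objects, the initial state, and the goal. That split is why one domain file can serve many problem instances.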

How VLMFP Works: A Dual-VLM Approach

VLMFP employs two distinct Vision Language Models, each with a specific role:

SimVLM (Simulation VLM): This model is fine-tuned to excel at visual perception and simulating action outcomes. Given a visual scene and a set of rules, SimVLM can accurately describe the spatial relationships in an image, predict the consequences of proposed actions (e.g., whether moving right leads to hitting a wall or falling into a hole), and determine if a sequence of actions achieves the overall goal. It’s particularly strong in understanding visual-spatial details.

GenVLM (Generation VLM): This is a larger, more general-purpose VLM (like GPT-4o) with extensive knowledge of PDDL. Its role is to generate and iteratively refine both the PDDL problem and domain files. GenVLM leverages SimVLM’s precise visual understanding to create initial PDDL files and then uses feedback from SimVLM’s simulations to correct any discrepancies.

The process unfolds in several iterative steps: First, SimVLM analyzes the visual input and provides a natural language description of the scenario. GenVLM then uses this description to generate initial PDDL problem and domain files. These files undergo a “prescreening” for syntactic and semantic correctness. Next, random action sequences are executed in both the PDDL environment (based on GenVLM’s files) and SimVLM’s simulation. Any inconsistencies between these two executions provide crucial feedback to GenVLM, which then refines the PDDL files. This cycle continues until the PDDL files are consistent and a valid plan can be found by a PDDL planner.
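The consistency check at the heart of this cycle can be sketched with a toy example. In the sketch below, `reference_step` is a stand-in for SimVLM's simulation and `candidate_step` is a stand-in for executing the generated PDDL files; no real VLM or planner is called, and the grid, function names, and the single "missing hole rule" bug are all assumptions made for illustration.

```python
import random

# Toy 3x3 grid: S = start, G = goal, H = hole, . = free cell.
GRID = ["S.H",
        "...",
        "H.G"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def reference_step(pos, action):
    """Stand-in for SimVLM: the trusted outcome of one action."""
    r, c = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if not (0 <= r < 3 and 0 <= c < 3):
        return pos, "blocked"            # walking off the map does nothing
    if GRID[r][c] == "H":
        return (r, c), "failed"          # fell into a hole
    return (r, c), ("goal" if GRID[r][c] == "G" else "ok")


def candidate_step(pos, action, knows_holes):
    """Stand-in for executing the generated PDDL files."""
    new_pos, outcome = reference_step(pos, action)
    if outcome == "failed" and not knows_holes:
        return new_pos, "ok"             # buggy draft: holes are not modeled
    return new_pos, outcome


def find_inconsistency(knows_holes, rng, walks=200, length=6):
    """Run random action sequences in both models; return the first mismatch."""
    for _ in range(walks):
        pos = (0, 0)
        for _ in range(length):
            action = rng.choice(list(MOVES))
            ref = reference_step(pos, action)
            cand = candidate_step(pos, action, knows_holes)
            if ref != cand:
                return {"state": pos, "action": action,
                        "expected": ref[1], "got": cand[1]}
            pos = ref[0]
            if ref[1] in ("failed", "goal"):
                break                    # episode ends on failure or success
    return None


def refine(rng, max_rounds=5):
    """Repeat check-and-fix until no mismatch remains."""
    knows_holes = False                  # initial draft is missing the hole rule
    for round_no in range(1, max_rounds + 1):
        feedback = find_inconsistency(knows_holes, rng)
        if feedback is None:
            return round_no, knows_holes
        knows_holes = True               # stand-in for GenVLM revising the files
    return max_rounds, knows_holes


if __name__ == "__main__":
    print(refine(random.Random(0)))
```

In the real framework the "fix" step is GenVLM rewriting the PDDL files from the reported mismatch; here it is reduced to flipping one flag so that the compare-and-refine structure stays visible.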

Achieving Broad Generalization

One of VLMFP’s significant strengths is its ability to generalize across various aspects of planning problems. The same generated PDDL domain file can be reused for all different instances within the same problem type (e.g., different maps of the same game). Furthermore, the VLMs themselves can adapt to entirely different problems, even with varied visual appearances and altered game rules.

The researchers evaluated VLMFP across six grid-world domains, including Frozenlake, Maze, and Sokoban, testing its ability to generalize to unseen instances, appearances, and even modified game rules. SimVLM demonstrated high accuracy in describing scenarios and simulating actions, even with novel visual styles. With SimVLM’s guidance, VLMFP successfully generated PDDL files that led to valid plans for a significant percentage of unseen problems, outperforming other baseline methods by a considerable margin.

For instance, VLMFP achieved an average success rate of 70.0% for unseen instances in seen appearances and 54.1% for unseen instances in unseen appearances, compared with 30.7% and 32.3%, respectively, for the best baseline. The framework proved particularly effective in more complex domains where reasoning about multiple object types and intricate actions is required.

This work marks a crucial step towards making formal planning more accessible and robust, allowing AI systems to interpret visual information and generate comprehensive planning rules without human intervention.
