Bridging Vision and Formal Logic for Autonomous AI Planning

TLDR: VLMFP is a new AI framework that uses two specialized Vision Language Models (VLMs) to autonomously generate planning rules (PDDL domain and problem files) from visual inputs. It combines a SimVLM for accurate visual perception and action simulation with a GenVLM for PDDL generation and refinement. This allows AI to perform complex visual planning without human intervention, showing strong generalization across different scenarios, visual styles, and even modified game rules, significantly outperforming previous methods.

Planning how to achieve a goal, especially in complex visual environments, is a fundamental challenge for artificial intelligence. While humans effortlessly navigate scenarios like assembling furniture or driving, current AI systems, particularly Vision Language Models (VLMs), often struggle with the precision required for spatial reasoning and planning over many steps. On the other hand, formal planning systems, using languages like Planning Domain Definition Language (PDDL), excel at long-term, structured problem-solving but cannot directly interpret visual information.

A new framework called VLMFP (VLM-Guided Formal Planning) addresses this gap by combining the strengths of both approaches. Developed by researchers including Yilun Hao, Yongchao Chen, Chuchu Fan, and Yang Zhang, VLMFP is a dual-VLM system designed to autonomously generate the necessary planning files from visual inputs, eliminating the need for human experts or constant environmental interaction.

Previous attempts to merge VLMs and PDDL planners often hit a roadblock: while VLMs could generate PDDL “problem” files (describing a specific scenario’s initial state and goal), they struggled with the more complex “domain” files, which define all the general rules and actions of a planning environment. VLMFP overcomes this by introducing two specialized VLMs working in tandem.
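
To make the domain/problem split concrete, here is a minimal Python sketch. It is not PDDL syntax and not code from the paper: the Action and Problem classes, the corridor example, and every name in it are hypothetical, and the actions are written out already grounded (one per cell) rather than parameterized as a real PDDL domain would be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (hypothetical toy encoding)."""
    name: str
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts the action makes true
    delete: frozenset   # facts the action makes false


@dataclass(frozen=True)
class Problem:
    """One specific scenario: the initial state and the goal."""
    init: frozenset
    goal: frozenset


def apply_action(state, action):
    """Return the successor state, or raise if the preconditions fail."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def corridor_domain(length):
    """The 'domain': general rules for moving right along a corridor."""
    return [
        Action(
            name=f"move-right-{i}",
            pre=frozenset({("at", i)}),
            add=frozenset({("at", i + 1)}),
            delete=frozenset({("at", i)}),
        )
        for i in range(length - 1)
    ]


# The 'problem': start in cell 0, goal is to be in cell 2.
domain = corridor_domain(3)
problem = Problem(init=frozenset({("at", 0)}), goal=frozenset({("at", 2)}))

state = problem.init
for action in domain:
    state = apply_action(state, action)
goal_reached = problem.goal <= state
```

The Problem object is short and specific to one scene, while the rules in corridor_domain have to be right for every scene. That asymmetry is one way to see why domain files are the harder thing to generate.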

How VLMFP Works: A Dual-VLM Approach

VLMFP employs two distinct Vision Language Models, each with a specific role:

SimVLM (Simulation VLM): This model is fine-tuned to excel at visual perception and simulating action outcomes. Given a visual scene and a set of rules, SimVLM can accurately describe the spatial relationships in an image, predict the consequences of proposed actions (e.g., whether moving right leads to hitting a wall or falling into a hole), and determine if a sequence of actions achieves the overall goal. It’s particularly strong in understanding visual-spatial details.

GenVLM (Generation VLM): This is a larger, more general-purpose VLM (like GPT-4o) with extensive knowledge of PDDL. Its role is to generate and iteratively refine both the PDDL problem and domain files. GenVLM leverages SimVLM’s precise visual understanding to create initial PDDL files and then uses feedback from SimVLM’s simulations to correct any discrepancies.

The process unfolds in several iterative steps: First, SimVLM analyzes the visual input and provides a natural language description of the scenario. GenVLM then uses this description to generate initial PDDL problem and domain files. These files undergo a “prescreening” for syntactic and semantic correctness. Next, random action sequences are executed in both the PDDL environment (based on GenVLM’s files) and SimVLM’s simulation. Any inconsistencies between these two executions provide crucial feedback to GenVLM, which then refines the PDDL files. This cycle continues until the PDDL files are consistent and a valid plan can be found by a PDDL planner.
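
The consistency check at the heart of this loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: a hand-written grid simulator stands in for SimVLM, a second step function with a deliberate bug (it walks through walls) stands in for the rules encoded by a first-draft PDDL domain, and all names (make_step, find_discrepancy, GRID) are hypothetical.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_step(grid, respect_walls):
    """Build a step function over (row, col) positions; '#' marks a wall.

    With respect_walls=False the model wrongly lets the agent enter walls.
    """
    rows, cols = len(grid), len(grid[0])

    def step(pos, action):
        dr, dc = ACTIONS[action]
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < rows and 0 <= c < cols):
            return pos  # moving off the map leaves the agent in place
        if respect_walls and grid[r][c] == "#":
            return pos  # bumping into a wall leaves the agent in place
        return (r, c)

    return step


def find_discrepancy(reference_step, candidate_step, start, rng,
                     n_walks=200, walk_len=6):
    """Run the same random walks in both models; report the first mismatch."""
    for _ in range(n_walks):
        ref = cand = start
        prefix = []
        for _ in range(walk_len):
            action = rng.choice(list(ACTIONS))
            prefix.append(action)
            ref = reference_step(ref, action)
            cand = candidate_step(cand, action)
            if ref != cand:
                return {"actions": prefix, "expected": ref, "got": cand}
    return None  # no inconsistency observed on these walks


GRID = ["S.#",
        "..#",
        "..G"]
reference = make_step(GRID, respect_walls=True)   # stand-in for SimVLM
draft = make_step(GRID, respect_walls=False)      # stand-in for a flawed domain
revised = make_step(GRID, respect_walls=True)     # stand-in for a refined domain

feedback = find_discrepancy(reference, draft, (0, 0), random.Random(0))
after_fix = find_discrepancy(reference, revised, (0, 0), random.Random(0))
```

Here feedback carries the action prefix that exposed the mismatch along with the expected and observed positions, which is the kind of signal the article describes being passed back for refinement. In the real framework the revision step is performed by GenVLM; the sketch simply swaps in a corrected model to show the check passing.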

Achieving Broad Generalization

One of VLMFP’s significant strengths is its ability to generalize across various aspects of planning problems. The same generated PDDL domain file can be reused for all different instances within the same problem type (e.g., different maps of the same game). Furthermore, the VLMs themselves can adapt to entirely different problems, even with varied visual appearances and altered game rules.
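
The reuse of one domain across many instances can be illustrated with a small Python sketch. It is an assumption-laden toy rather than anything from the paper: a single successors function plays the role of the shared rules, a breadth-first search plays the role of the PDDL planner, and the three maps are made-up instances. Holes ('H') are simply treated as cells the planner must avoid.

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def successors(grid, pos):
    """Shared rules: move one cell, never into a wall '#' or a hole 'H'."""
    for name, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]) and grid[r][c] not in "#H":
            yield name, (r, c)


def find(grid, symbol):
    """Locate the first cell containing `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"map has no {symbol!r} cell")


def plan(grid):
    """Breadth-first search from 'S' to 'G'; returns action names or None."""
    start, goal = find(grid, "S"), find(grid, "G")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        pos, actions = queue.popleft()
        if pos == goal:
            return actions
        for name, nxt in successors(grid, pos):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [name]))
    return None


# Three instances, one set of rules.
map_a = ["S.",
         ".G"]
map_b = ["S#.",
         ".H.",
         "..G"]
map_c = ["S#G"]  # the wall separates start and goal

plan_a, plan_b, plan_c = plan(map_a), plan(map_b), plan(map_c)
```

Only the map changes between calls; successors and plan stay untouched. This mirrors the article's point that a new instance needs a new problem description but not a new domain.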

The researchers evaluated VLMFP across six grid-world domains, including Frozenlake, Maze, and Sokoban, testing its ability to generalize to unseen instances, appearances, and even modified game rules. SimVLM demonstrated high accuracy in describing scenarios and simulating actions, even with novel visual styles. With SimVLM’s guidance, VLMFP successfully generated PDDL files that led to valid plans for a significant percentage of unseen problems, outperforming other baseline methods by a considerable margin.

For instance, VLMFP achieved an average success rate of 70.0% for unseen instances in seen appearances and 54.1% for unseen instances in unseen appearances, significantly surpassing the best baseline’s 30.7% and 32.3% respectively. The framework proved particularly effective in more complex domains where reasoning about multiple object types and intricate actions is required.

This work marks a crucial step towards making formal planning more accessible and robust, allowing AI systems to interpret visual information and generate comprehensive planning rules without human intervention. The full research paper describes the framework and its evaluation in more detail.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.
