
New Benchmark Challenges AI’s Understanding of Physical Spaces

TLDR: Blueprint-Bench is a new benchmark evaluating AI models’ spatial reasoning by tasking them with converting apartment photographs into 2D floor plans. It tests leading LLMs, image generation models, and agent systems. Results show current AI models perform at or below a random baseline, significantly lagging human performance, highlighting a major gap in their spatial intelligence despite input modalities being within their training distribution.

A new research paper introduces Blueprint-Bench, a novel benchmark designed to rigorously evaluate the spatial reasoning capabilities of various artificial intelligence models. The core task involves converting a series of apartment photographs into accurate 2D floor plans, a challenge that requires genuine spatial intelligence rather than just pattern recognition.

Authored by Lukas Petersson, Axel Backlund, Hanna Petersson, Axel Wennström, Callum Sharrock, and Arash Dabiri from Andon Labs, the paper highlights a significant blind spot in current AI capabilities. While modern multimodal models are well-acquainted with photographic inputs, the task of spatial reconstruction (inferring room layouts, understanding connectivity, and maintaining consistent scale) proves to be a formidable hurdle.

The Blueprint-Bench Challenge

The benchmark dataset comprises 50 apartments, each accompanied by approximately 20 interior images. Models are tasked with generating a 2D floor plan that adheres to nine specific rules, ensuring consistency and robust scoring. These rules dictate elements like black walls, green doors (without swings), a pure white background, straight lines, and red dots marking the center of each enclosed room. The goal is to create a minimalistic map, ignoring details like furniture or windows.
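
Rendering rules like these lend themselves to a simple automated pre-check before scoring. The sketch below is illustrative only: the exact RGB values, the tolerance, and the idea of a palette check are assumptions, since the paper's validation code is not described here.

```python
# Hedged sketch: sanity-check a generated floor plan against the benchmark's
# color conventions (black walls, green doors, white background, red room
# dots). The RGB values and tolerance are assumptions, not the paper's spec.

ALLOWED = {
    "background": (255, 255, 255),
    "walls": (0, 0, 0),
    "doors": (0, 255, 0),
    "room_dots": (255, 0, 0),
}

def check_palette(pixels, tolerance=30):
    """pixels: iterable of (r, g, b) tuples.

    Returns (set of rule elements found, count of off-palette pixels).
    """
    found, off_palette = set(), 0
    for px in pixels:
        name = next(
            (n for n, rgb in ALLOWED.items()
             if sum(abs(a - b) for a, b in zip(px, rgb)) <= tolerance),
            None,
        )
        if name is None:
            off_palette += 1
        else:
            found.add(name)
    return found, off_palette

# A compliant plan should contain all four elements and few stray colors
# (stray pixels would suggest furniture, windows, or other ignored detail).
found, stray = check_palette([(255, 255, 255), (0, 0, 0), (0, 250, 5), (250, 2, 2)])
print(sorted(found), stray)  # → ['background', 'doors', 'room_dots', 'walls'] 0
```

In practice such a check would run over the pixels of the generated image; here a short list of sample pixels stands in for one.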

The evaluation process is model-agnostic, meaning any system capable of generating an image from a sequence of images can participate. This includes leading large language models (LLMs) such as GPT-5, Claude 4 Opus, Gemini 2.5 Pro, and Grok-4. Image generation models like GPT-Image and NanoBanana were also tested. Additionally, agent systems, which can iteratively refine their outputs in a simulated environment, were evaluated, specifically Codex CLI and Claude Code.

Measuring Spatial Intelligence

To quantify performance, Blueprint-Bench employs a sophisticated scoring algorithm. This algorithm measures the similarity between a generated floor plan and the ground-truth plan based on two primary factors: room connectivity graphs and room size rankings. It extracts spatial structures from the standardized floor plan images, identifying room locations, boundaries, and door connections. A composite similarity score is then calculated, considering factors like Jaccard similarity for edge overlap, degree correlation for connectivity patterns, graph density matching, and accuracy in room and door counts.
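
As a rough illustration of how such a composite score could combine connectivity and size-ranking agreement, consider the sketch below. The data representation, metric choices, and equal weighting are assumptions for illustration; they are not the authors' actual algorithm, which also incorporates degree correlation, graph density matching, and room/door counts.

```python
# Hedged sketch of a composite floor-plan similarity score combining two of
# the factors the paper lists: Jaccard similarity over door-connection edges
# and agreement on room-size rankings. Weights and representation are assumed.

def edge_jaccard(edges_a, edges_b):
    """Jaccard similarity over undirected door connections between rooms."""
    a, b = set(map(frozenset, edges_a)), set(map(frozenset, edges_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def size_rank_agreement(sizes_a, sizes_b):
    """Fraction of room pairs ordered the same way by size in both plans."""
    rooms = sorted(set(sizes_a) & set(sizes_b))
    pairs = [(r, s) for i, r in enumerate(rooms) for s in rooms[i + 1:]]
    if not pairs:
        return 1.0
    agree = sum(
        (sizes_a[r] > sizes_a[s]) == (sizes_b[r] > sizes_b[s])
        for r, s in pairs
    )
    return agree / len(pairs)

def composite_score(pred, truth, w_conn=0.5, w_size=0.5):
    """Weighted blend of connectivity and size-ranking similarity."""
    return (w_conn * edge_jaccard(pred["edges"], truth["edges"])
            + w_size * size_rank_agreement(pred["sizes"], truth["sizes"]))

# Example: the predicted plan gets one door connection and one size
# ordering wrong, so both sub-scores are penalized.
truth = {"edges": [("hall", "kitchen"), ("hall", "bedroom")],
         "sizes": {"hall": 8.0, "kitchen": 12.0, "bedroom": 14.0}}
pred = {"edges": [("hall", "kitchen"), ("kitchen", "bedroom")],
        "sizes": {"hall": 8.0, "kitchen": 15.0, "bedroom": 14.0}}
print(round(composite_score(pred, truth), 3))  # → 0.5
```

The appeal of a graph-based score like this is that it rewards getting the apartment's structure right while tolerating cosmetic differences in how the plan is drawn.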

Key Findings: A Spatial Blind Spot

The results from Blueprint-Bench reveal a stark reality: most current AI models perform at or even below a random baseline. This indicates a profound lack of spatial intelligence when compared to human performance, which remains substantially superior. Even with the challenging setup of only viewing images rather than physically navigating an apartment, humans consistently produced floor plans with correct room connectivity, though they sometimes struggled with precise size rankings.

Image generation models, particularly GPT-4o and NanoBanana, struggled significantly with instruction following, often failing to adhere to the specified rules (e.g., including furniture or not placing red dots correctly). While other image models like GPT-Image followed instructions better, their spatial intelligence scores were still on par with the random baseline.

Intriguingly, agent-based approaches, which allow for iterative refinement and multiple attempts, showed no meaningful improvement over single-pass generation. For instance, the Codex GPT-5 agent simply viewed all images and then generated a script to create a floor plan without reviewing its output. Claude Code, using Claude 4 Opus, did attempt iterative refinement but still produced outputs with errors, suggesting that the ability to refine doesn’t automatically translate to better spatial understanding.


The Path Forward

Blueprint-Bench serves as the first numerical framework for directly comparing spatial intelligence across diverse model architectures, including LLMs and image generation models. The paper emphasizes the importance of such benchmarks for AI safety and for tracking progress in a fundamental aspect of intelligence that current models have yet to master.

The authors plan to continuously evaluate new models as they are released and welcome community submissions to their public leaderboard. This open-source approach aims to foster validation and accelerate the emergence of genuine spatial intelligence in generalist AI systems. For more details, you can read the full research paper here: Blueprint-Bench: Comparing Spatial Intelligence of LLMs, Agents and Image Models.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
