
New Benchmark Challenges AI’s Understanding of Physical Spaces

TLDR: Blueprint-Bench is a new benchmark evaluating AI models’ spatial reasoning by tasking them with converting apartment photographs into 2D floor plans. It tests leading LLMs, image generation models, and agent systems. Results show current AI models perform at or below a random baseline, significantly lagging human performance, highlighting a major gap in their spatial intelligence despite input modalities being within their training distribution.

A new research paper introduces Blueprint-Bench, a novel benchmark designed to rigorously evaluate the spatial reasoning capabilities of various artificial intelligence models. The core task involves converting a series of apartment photographs into accurate 2D floor plans, a challenge that requires genuine spatial intelligence rather than just pattern recognition.

Authored by Lukas Petersson, Axel Backlund, Hanna Petersson, Axel Wennström, Callum Sharrock, and Arash Dabiri from Andon Labs, the paper highlights a significant blind spot in current AI capabilities. While modern multimodal models are well-acquainted with photographic inputs, the task of spatial reconstruction (inferring room layouts, understanding connectivity, and maintaining consistent scale) proves to be a formidable hurdle.

The Blueprint-Bench Challenge

The benchmark dataset comprises 50 apartments, each accompanied by approximately 20 interior images. Models are tasked with generating a 2D floor plan that adheres to nine specific rules, ensuring consistency and robust scoring. These rules dictate elements like black walls, green doors (without swings), a pure white background, straight lines, and red dots marking the center of each enclosed room. The goal is to create a minimalistic map, ignoring details like furniture or windows.
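
Rendering rules like these lend themselves to a simple automated pre-check before scoring. The sketch below is illustrative only: the exact RGB values, the tolerance, and the idea of a palette check are assumptions, since the paper's validation code is not described here.

```python
# Hedged sketch: sanity-check a generated floor plan against the benchmark's
# color conventions (black walls, green doors, white background, red room
# dots). The RGB values and tolerance are assumptions, not the paper's spec.

ALLOWED = {
    "background": (255, 255, 255),
    "walls": (0, 0, 0),
    "doors": (0, 255, 0),
    "room_dots": (255, 0, 0),
}

def check_palette(pixels, tolerance=30):
    """pixels: iterable of (r, g, b) tuples.

    Returns (set of rule elements found, count of off-palette pixels).
    """
    found, off_palette = set(), 0
    for px in pixels:
        name = next(
            (n for n, rgb in ALLOWED.items()
             if sum(abs(a - b) for a, b in zip(px, rgb)) <= tolerance),
            None,
        )
        if name is None:
            off_palette += 1
        else:
            found.add(name)
    return found, off_palette

# A compliant plan should contain all four elements and few stray colors
# (stray pixels would suggest furniture, windows, or other ignored detail).
found, stray = check_palette([(255, 255, 255), (0, 0, 0), (0, 250, 5), (250, 2, 2)])
print(sorted(found), stray)  # → ['background', 'doors', 'room_dots', 'walls'] 0
```

In practice such a check would run over the pixels of the generated image; here a short list of sample pixels stands in for one.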

The evaluation process is model-agnostic, meaning any system capable of generating an image from a sequence of images can participate. This includes leading large language models (LLMs) such as GPT-5, Claude 4 Opus, Gemini 2.5 Pro, and Grok-4. Image generation models like GPT-Image and NanoBanana were also tested. Additionally, agent systems, which can iteratively refine their outputs in a simulated environment, were evaluated, specifically Codex CLI and Claude Code.

Measuring Spatial Intelligence

To quantify performance, Blueprint-Bench employs a sophisticated scoring algorithm. This algorithm measures the similarity between a generated floor plan and the ground-truth plan based on two primary factors: room connectivity graphs and room size rankings. It extracts spatial structures from the standardized floor plan images, identifying room locations, boundaries, and door connections. A composite similarity score is then calculated, considering factors like Jaccard similarity for edge overlap, degree correlation for connectivity patterns, graph density matching, and accuracy in room and door counts.
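
As a rough illustration of how such a composite score could combine connectivity and size-ranking agreement, consider the sketch below. The data representation, metric choices, and equal weighting are assumptions for illustration; they are not the authors' actual algorithm, which also incorporates degree correlation, graph density matching, and room/door counts.

```python
# Hedged sketch of a composite floor-plan similarity score combining two of
# the factors the paper lists: Jaccard similarity over door-connection edges
# and agreement on room-size rankings. Weights and representation are assumed.

def edge_jaccard(edges_a, edges_b):
    """Jaccard similarity over undirected door connections between rooms."""
    a, b = set(map(frozenset, edges_a)), set(map(frozenset, edges_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def size_rank_agreement(sizes_a, sizes_b):
    """Fraction of room pairs ordered the same way by size in both plans."""
    rooms = sorted(set(sizes_a) & set(sizes_b))
    pairs = [(r, s) for i, r in enumerate(rooms) for s in rooms[i + 1:]]
    if not pairs:
        return 1.0
    agree = sum(
        (sizes_a[r] > sizes_a[s]) == (sizes_b[r] > sizes_b[s])
        for r, s in pairs
    )
    return agree / len(pairs)

def composite_score(pred, truth, w_conn=0.5, w_size=0.5):
    """Weighted blend of connectivity and size-ranking similarity."""
    return (w_conn * edge_jaccard(pred["edges"], truth["edges"])
            + w_size * size_rank_agreement(pred["sizes"], truth["sizes"]))

# Example: the predicted plan gets one door connection and one size
# ordering wrong, so both sub-scores are penalized.
truth = {"edges": [("hall", "kitchen"), ("hall", "bedroom")],
         "sizes": {"hall": 8.0, "kitchen": 12.0, "bedroom": 14.0}}
pred = {"edges": [("hall", "kitchen"), ("kitchen", "bedroom")],
        "sizes": {"hall": 8.0, "kitchen": 15.0, "bedroom": 14.0}}
print(round(composite_score(pred, truth), 3))  # → 0.5
```

The appeal of a graph-based score like this is that it rewards getting the apartment's structure right while tolerating cosmetic differences in how the plan is drawn.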

Key Findings: A Spatial Blind Spot

The results from Blueprint-Bench reveal a stark reality: most current AI models perform at or even below a random baseline. This indicates a profound lack of spatial intelligence when compared to human performance, which remains substantially superior. Even with the challenging setup of only viewing images rather than physically navigating an apartment, humans consistently produced floor plans with correct room connectivity, though they sometimes struggled with precise size rankings.

Image generation models, particularly GPT-4o and NanoBanana, struggled significantly with instruction following, often failing to adhere to the specified rules (e.g., including furniture or not placing red dots correctly). While other image models like GPT-Image followed instructions better, their spatial intelligence scores were still on par with the random baseline.

Intriguingly, agent-based approaches, which allow for iterative refinement and multiple attempts, showed no meaningful improvement over single-pass generation. For instance, the Codex GPT-5 agent simply viewed all images and then generated a script to create a floor plan without reviewing its output. Claude Code, using Claude 4 Opus, did attempt iterative refinement but still produced outputs with errors, suggesting that the ability to refine doesn’t automatically translate to better spatial understanding.


The Path Forward

Blueprint-Bench serves as the first numerical framework for directly comparing spatial intelligence across diverse model architectures, including LLMs and image generation models. The paper emphasizes the importance of such benchmarks for AI safety and for tracking progress in a fundamental aspect of intelligence that current models have yet to master.

The authors plan to continuously evaluate new models as they are released and welcome community submissions to their public leaderboard. This open-source approach aims to foster validation and accelerate the emergence of genuine spatial intelligence in generalist AI systems. For more details, you can read the full research paper here: Blueprint-Bench: Comparing Spatial Intelligence of LLMs, Agents and Image Models.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
