TLDR: AddressVLM is a new AI model that significantly improves the ability of large vision-language models (LVLMs) to pinpoint exact street-level addresses from street-view images. It achieves this by using a novel “cross-view alignment tuning” method, which combines street-view images with satellite maps to give the AI a better global understanding of urban layouts. The model was trained on new datasets and shows substantial accuracy gains over existing methods, demonstrating its potential for flexible address-related question answering and scalability across different cities.
Large Vision-Language Models (LVLMs) have shown impressive capabilities in understanding images and language, particularly in broad geographic localization, like identifying a country or city. However, these advanced AI models often struggle with the more precise task of pinpointing exact street-level addresses within urban environments. This limitation makes it difficult for them to answer specific questions related to addresses using street-view images.
A new research paper introduces AddressVLM, a novel approach designed to integrate fine-grained, city-wide address localization into LVLMs. The core challenge addressed by AddressVLM is that street-view images, while detailed, provide only a microscopic view, making it hard for models to grasp the overall layout of a city’s streets. To overcome this, AddressVLM incorporates perspective-invariant satellite images as “macro cues” to provide a broader, global understanding.
The key innovation is the “cross-view alignment tuning” mechanism, which has two components. The first is a satellite-view and street-view image grafting mechanism that combines each street-view image with its corresponding regional satellite image: the street-view image is scaled down and pasted into the upper-right corner of the satellite image, on which street names are marked. This input format lets the model focus on the overall street distribution from the map while still incorporating street-level details.
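As a rough illustration, the grafting step amounts to a few lines of image manipulation. The sketch below uses PIL; the scale factor and exact placement are assumptions, since the paper only specifies that the street view is scaled down and placed in the upper-right corner.

```python
from PIL import Image

def graft_views(street_path: str, satellite_path: str, scale: float = 0.33) -> Image.Image:
    """Overlay a downscaled street-view image onto the upper-right corner
    of its regional satellite image. The scale factor is an assumption;
    the paper only states that the street view is scaled down."""
    satellite = Image.open(satellite_path).convert("RGB")
    street = Image.open(street_path).convert("RGB")

    # Downscale the street view relative to the satellite image width,
    # preserving its aspect ratio.
    new_w = int(satellite.width * scale)
    new_h = int(street.height * new_w / street.width)
    street_small = street.resize((new_w, new_h))

    # Paste into the upper-right corner, as described in the paper.
    satellite.paste(street_small, (satellite.width - new_w, 0))
    return satellite
```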
The second component is an “automatic alignment label generation mechanism.” Rather than creating labels manually, the authors have a well-trained LVLM automatically generate an explanation of why a street-view image matches a specific address on the satellite map, citing visual clues such as building colors, shapes, and the surrounding environment. This yields flexible and diverse training labels.
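In spirit, the label-generation loop might look like the sketch below. The `query_lvlm` callable and the prompt wording are assumptions standing in for whichever teacher model and prompt the authors actually used.

```python
# Hypothetical sketch of automatic alignment-label generation. The prompt
# wording and the query_lvlm interface are assumptions for illustration.
ALIGNMENT_PROMPT = (
    "The small image in the upper-right corner is a street view taken at "
    "{address}, which is marked on the surrounding satellite map. Explain, "
    "using visual clues such as building colors, shapes, and the "
    "surrounding environment, why the street view matches this address."
)

def generate_alignment_label(grafted_image, address: str, query_lvlm) -> dict:
    """Ask a teacher LVLM to justify the street-view/address match,
    producing a free-form explanation used as a training label."""
    explanation = query_lvlm(image=grafted_image,
                             prompt=ALIGNMENT_PROMPT.format(address=address))
    return {"address": address, "label": explanation}
```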
AddressVLM employs a two-stage training protocol. The first stage, cross-view alignment tuning, teaches the model to align street-view images with street addresses on satellite maps, thereby integrating a global understanding of urban street layouts. The second stage, address localization tuning, then refines this knowledge using street-view images alone to infer fine-grained address information.
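A minimal sketch of that schedule, assuming a generic `finetune` entry point; the stage names follow the paper, while everything else is illustrative:

```python
# Illustrative two-stage training schedule. The dataset arguments and the
# finetune() entry point are assumptions, not the authors' actual code.
def train_addressvlm(lvlm, grafted_vqa, street_vqa, finetune):
    # Stage 1: cross-view alignment tuning on grafted satellite+street
    # inputs, so the model learns the city's global street layout.
    finetune(lvlm, grafted_vqa)
    # Stage 2: address localization tuning on street-view-only inputs,
    # refining fine-grained address inference.
    finetune(lvlm, street_vqa)
    return lvlm
```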
To facilitate this research, the team constructed two new street-view Visual Question Answering (VQA) datasets: Pitts-VQA (based on Pittsburgh) and SF-Base-VQA (based on San Francisco). Both include three question types (generation, judgment, and multiple-choice) to evaluate the model’s capabilities thoroughly.
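Hypothetical examples of what entries of each type might look like; the datasets’ actual prompt wording and street names are assumptions:

```python
# Illustrative samples of the three question types.
samples = [
    {"type": "generation",
     "question": "On which street was this street-view image taken?"},
    {"type": "judgment",
     "question": "Was this image taken on Forbes Avenue? Answer yes or no."},
    {"type": "multiple-choice",
     "question": "Which street is shown? (A) Forbes Ave (B) Fifth Ave "
                 "(C) Baum Blvd (D) Penn Ave"},
]
```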
Qualitative and quantitative evaluations show that AddressVLM significantly outperforms both existing LVLMs and state-of-the-art methods such as GeoReasoner. Compared with baseline LVLMs, it improves average address localization accuracy by over 9% on Pitts-VQA and over 12% on SF-Base-VQA, and it surpasses GeoReasoner by 11% and 14% on the same two datasets.
The research also highlights AddressVLM’s scalability. When trained on a merged dataset covering both Pittsburgh and San Francisco, the unified model performed even better, suggesting the approach could extend to more cities or even an entire country. Further tests on the Tokyo dataset confirmed its adaptability to a different urban address system. And with only 4 billion parameters, the model is compact enough to be feasible for future on-device deployment.
This work marks a significant step forward in enabling AI models to understand and answer complex, flexible questions about precise street-level locations from images. For more technical details, you can refer to the full research paper: AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models.


