TLDR: AddressVLM is a new AI model that significantly improves the ability of large vision-language models (LVLMs) to pinpoint exact street-level addresses from street-view images. It achieves this by using a novel “cross-view alignment tuning” method, which combines street-view images with satellite maps to give the AI a better global understanding of urban layouts. The model was trained on new datasets and shows substantial accuracy gains over existing methods, demonstrating its potential for flexible address-related question answering and scalability across different cities.
Large Vision-Language Models (LVLMs) have shown impressive capabilities in understanding images and language, particularly in broad geographic localization, like identifying a country or city. However, these advanced AI models often struggle with the more precise task of pinpointing exact street-level addresses within urban environments. This limitation makes it difficult for them to answer specific questions related to addresses using street-view images.
A new research paper introduces AddressVLM, a novel approach designed to integrate fine-grained, city-wide address localization into LVLMs. The core challenge addressed by AddressVLM is that street-view images, while detailed, provide only a microscopic view, making it hard for models to grasp the overall layout of a city’s streets. To overcome this, AddressVLM incorporates perspective-invariant satellite images as “macro cues” to provide a broader, global understanding.
The key innovation is the “cross-view alignment tuning” mechanism, which has two components. The first is a satellite-view and street-view image grafting mechanism that combines each street-view image with its corresponding regional satellite image: the street-view image is scaled down and pasted into the upper-right corner of the satellite image, on which street names are marked. This input format lets the model focus on the overall street distribution from the map while still incorporating street-level details.
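As a rough illustration, the grafting step amounts to a few lines of image manipulation. The sketch below uses PIL; the scale factor and exact placement are assumptions, since the paper only specifies that the street view is scaled down and placed in the upper-right corner.

```python
from PIL import Image

def graft_views(street_path: str, satellite_path: str, scale: float = 0.33) -> Image.Image:
    """Overlay a downscaled street-view image onto the upper-right corner
    of its regional satellite image. The scale factor is an assumption;
    the paper only states that the street view is scaled down."""
    satellite = Image.open(satellite_path).convert("RGB")
    street = Image.open(street_path).convert("RGB")

    # Downscale the street view relative to the satellite image width,
    # preserving its aspect ratio.
    new_w = int(satellite.width * scale)
    new_h = int(street.height * new_w / street.width)
    street_small = street.resize((new_w, new_h))

    # Paste into the upper-right corner, as described in the paper.
    satellite.paste(street_small, (satellite.width - new_w, 0))
    return satellite
```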
The second component is an “automatic alignment label generation mechanism.” Rather than creating labels manually, the authors have a well-trained LVLM automatically generate an explanation of why a street-view image matches a specific address on the satellite map, citing visual clues such as building colors, shapes, and the surrounding environment. This yields flexible and diverse training labels.
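In spirit, the label-generation loop might look like the sketch below. The `query_lvlm` callable and the prompt wording are assumptions standing in for whichever teacher model and prompt the authors actually used.

```python
# Hypothetical sketch of automatic alignment-label generation. The prompt
# wording and the query_lvlm interface are assumptions for illustration.
ALIGNMENT_PROMPT = (
    "The small image in the upper-right corner is a street view taken at "
    "{address}, which is marked on the surrounding satellite map. Explain, "
    "using visual clues such as building colors, shapes, and the "
    "surrounding environment, why the street view matches this address."
)

def generate_alignment_label(grafted_image, address: str, query_lvlm) -> dict:
    """Ask a teacher LVLM to justify the street-view/address match,
    producing a free-form explanation used as a training label."""
    explanation = query_lvlm(image=grafted_image,
                             prompt=ALIGNMENT_PROMPT.format(address=address))
    return {"address": address, "label": explanation}
```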
AddressVLM employs a two-stage training protocol. The first stage, cross-view alignment tuning, teaches the model to align street-view images with street addresses on satellite maps, thereby integrating a global understanding of urban street layouts. The second stage, address localization tuning, then refines this knowledge using street-view images alone to infer fine-grained address information.
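A minimal sketch of that schedule, assuming a generic `finetune` entry point; the stage names follow the paper, while everything else is illustrative:

```python
# Illustrative two-stage training schedule. The dataset arguments and the
# finetune() entry point are assumptions, not the authors' actual code.
def train_addressvlm(lvlm, grafted_vqa, street_vqa, finetune):
    # Stage 1: cross-view alignment tuning on grafted satellite+street
    # inputs, so the model learns the city's global street layout.
    finetune(lvlm, grafted_vqa)
    # Stage 2: address localization tuning on street-view-only inputs,
    # refining fine-grained address inference.
    finetune(lvlm, street_vqa)
    return lvlm
```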
To facilitate this research, the team constructed two new street-view Visual Question Answering (VQA) datasets: Pitts-VQA (based on Pittsburgh) and SF-Base-VQA (based on San Francisco). Both include three question types (generation, judgment, and multiple-choice) to evaluate the model’s capabilities thoroughly.
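Hypothetical examples of what entries of each type might look like; the datasets’ actual prompt wording and street names are assumptions:

```python
# Illustrative samples of the three question types.
samples = [
    {"type": "generation",
     "question": "On which street was this street-view image taken?"},
    {"type": "judgment",
     "question": "Was this image taken on Forbes Avenue? Answer yes or no."},
    {"type": "multiple-choice",
     "question": "Which street is shown? (A) Forbes Ave (B) Fifth Ave "
                 "(C) Baum Blvd (D) Penn Ave"},
]
```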
Qualitative and quantitative evaluations show that AddressVLM significantly outperforms both existing LVLMs and state-of-the-art methods such as GeoReasoner. Compared with baseline LVLMs, it improves average address localization accuracy by over 9% on Pitts-VQA and over 12% on SF-Base-VQA, and it surpasses GeoReasoner by 11% and 14% on the same two datasets.
The research also highlights AddressVLM’s scalability. When trained on a merged dataset covering both Pittsburgh and San Francisco, the unified model performed even better, suggesting the approach could extend to more cities or even an entire country. Further tests on the Tokyo dataset confirmed its adaptability to a different urban address system. And with only 4 billion parameters, the model is compact enough to be feasible for future on-device deployment.
This work marks a significant step forward in enabling AI models to understand and answer complex, flexible questions about precise street-level locations from images. For more technical details, you can refer to the full research paper: AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models.


