TLDR: UrbanVLA is a route-conditioned Vision-Language-Action (VLA) model designed for reliable and scalable urban micromobility navigation. It handles dynamic city environments by aligning noisy route instructions with visual observations and planning trajectories from them. Trained in two stages (Supervised Fine-Tuning followed by Reinforcement Fine-Tuning), UrbanVLA outperforms baselines in simulation, generalizes robustly to the real world, and improves obstacle avoidance, social compliance, and long-horizon navigation for platforms such as delivery robots.
Urban micromobility, which spans platforms such as delivery robots and assistive wheelchairs, is a rapidly growing application area for embodied AI. Navigating dynamic, unstructured urban environments over long distances, however, remains a significant challenge. Traditional navigation methods often rely on detailed maps, which are costly to build and difficult to keep current in ever-changing cityscapes. Learning-based approaches, while promising, struggle with the geometric inaccuracies and noise inherent in route information from consumer navigation tools like Google Maps.
To address these issues, researchers have introduced UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation, enabling robots to reliably follow long-horizon route instructions in real-world city areas. UrbanVLA explicitly aligns noisy route waypoints with visual observations during execution and then plans precise trajectories for the robot to follow. A minimal sketch of this interface appears below.
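To make that interface concrete, the sketch below shows what a route-conditioned policy might consume and produce: an RGB frame plus a window of route waypoints transformed into the robot's egocentric frame, returning a short-horizon trajectory. All names here are illustrative assumptions, and the trajectory decoder is a placeholder for the actual VLA backbone described in the paper.

```python
import numpy as np

def waypoints_to_local(waypoints_xy: np.ndarray,
                       robot_xy: np.ndarray,
                       robot_yaw: float) -> np.ndarray:
    """Transform global route waypoints into the robot's egocentric frame."""
    c, s = np.cos(-robot_yaw), np.sin(-robot_yaw)
    rot = np.array([[c, -s], [s, c]])  # rotate by -yaw
    return (waypoints_xy - robot_xy) @ rot.T

class RouteConditionedPolicy:
    """Hypothetical interface; not the authors' API."""

    def act(self, rgb: np.ndarray, local_route: np.ndarray) -> np.ndarray:
        # A real VLA would tokenize the image and route, run a
        # language-model backbone, and decode waypoints. As a stand-in,
        # head toward the first route point in five evenly spaced steps.
        direction = local_route[0] / (np.linalg.norm(local_route[0]) + 1e-6)
        steps = np.linspace(0.2, 1.0, 5)[:, None]
        return steps * direction  # (5, 2) array of local (x, y) offsets
```

The key design point this captures is that the route enters the policy as data in the robot's own frame, alongside vision, rather than as a fixed path to be tracked blindly.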
UrbanVLA is trained with a two-stage pipeline. The first stage, Supervised Fine-Tuning (SFT), uses simulated environments and trajectories derived from web videos to teach the model fundamental navigation skills such as goal-reaching, collision avoidance, and social compliance. The model then undergoes Reinforcement Fine-Tuning (RFT) on a mixture of simulated and real-world data. This second stage improves the model's safety and adaptability, allowing it to interpret noisy routes, adhere to navigation norms, and cope with the complexities of real urban settings, including dynamic obstacles and varied terrain. The sketch below shows one plausible shape for the two objectives.
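The paper's exact objectives are not reproduced here, but the two stages can be sketched as follows, assuming SFT regresses expert waypoints and RFT optimizes a scalar navigation reward on top of the SFT checkpoint. Every function name, penalty term, and weight below is an illustrative assumption.

```python
import numpy as np

def sft_loss(pred_traj: np.ndarray, expert_traj: np.ndarray) -> float:
    # Stage 1 (SFT): plain waypoint regression against expert trajectories
    # from simulation and web videos; the actual objective may differ.
    return float(((pred_traj - expert_traj) ** 2).mean())

def rft_reward(progress_m: float,
               collided: bool,
               min_ped_dist_m: float,
               comfort_m: float = 1.0) -> float:
    # Stage 2 (RFT): one plausible reward combining the behaviors the
    # paper names: route progress, collision avoidance, social compliance.
    reward = progress_m                      # meters gained along the route
    if collided:
        reward -= 10.0                       # hard penalty for any collision
    if min_ped_dist_m < comfort_m:
        # shaping term that grows as the robot intrudes on pedestrians'
        # personal space
        reward -= comfort_m - min_ped_dist_m
    return reward
```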
A key innovation within UrbanVLA is the Heuristic Trajectory Lifting (HTL) algorithm. This algorithm is vital for processing real-world navigation data, which often provides only ground-truth trajectories without explicit ‘roadbooks’. HTL extracts high-level route information from raw trajectories, encouraging the model to learn from visual cues rather than over-relying on idealized route inputs. This makes UrbanVLA more robust to the inherent noise and ambiguity of real-world navigation instructions.
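As a rough illustration of the idea (not the paper's actual heuristic), one way to lift a dense ground-truth trajectory into a coarse, noisy route is to resample it at a fixed arc-length spacing and then perturb the waypoints, so the result resembles a consumer-map polyline. The function name and parameters are assumptions.

```python
import numpy as np

def heuristic_trajectory_lift(traj_xy: np.ndarray,
                              spacing_m: float = 10.0,
                              noise_std_m: float = 2.0,
                              seed: int = 0) -> np.ndarray:
    """Lift a dense (N, 2) trajectory to sparse, noisy route waypoints."""
    rng = np.random.default_rng(seed)
    # cumulative arc length along the trajectory
    seg = np.linalg.norm(np.diff(traj_xy, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # resample one waypoint every `spacing_m` meters of travel
    samples = np.arange(0.0, s[-1], spacing_m)
    xs = np.interp(samples, s, traj_xy[:, 0])
    ys = np.interp(samples, s, traj_xy[:, 1])
    route = np.stack([xs, ys], axis=1)
    # inject Gaussian noise so the route is deliberately imprecise,
    # pushing the policy to rely on visual cues rather than the route alone
    return route + rng.normal(scale=noise_std_m, size=route.shape)
```

Training against such deliberately degraded routes is what lets the model treat the route as a coarse prior rather than ground truth.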
Experiments show that UrbanVLA substantially outperforms existing baselines. On the SocialNav task in MetaUrban, it achieves more than a 55% improvement over prior methods. In real-world deployments, UrbanVLA navigates reliably across diverse environments, including unseen layouts, dynamic obstacles, and varying illumination, and handles long-horizon tasks spanning more than 500 meters, successfully managing scenarios such as overpass crossings, pedestrian interactions, street turning, and obstacle avoidance. That it does so while relying solely on RGB camera inputs highlights its strong inherent capability to avoid obstacles and navigate socially.
In essence, UrbanVLA represents a significant step forward in urban micromobility, integrating high-level navigation tool guidance with on-board vision to create a scalable and reliable solution for autonomous agents operating in dynamic pedestrian environments. For more technical details, you can refer to the full research paper.