Navigating New Paths: How Off-the-Shelf AI Models Learn to Follow Instructions

TLDR: This paper explores whether pre-trained Large Vision-Language Models (LVLMs) like Qwen2.5-VL can effectively perform Vision-and-Language Navigation (VLN) by following natural language instructions in unfamiliar environments. Researchers fine-tuned Qwen2.5-VL on the Room-to-Room (R2R) dataset and compared its performance using two types of action spaces: low-level (atomic actions like “turn left”) and panoramic (selecting navigable directions from a 360-degree view). The study found that while off-the-shelf LVLMs can learn VLN, they still trail specialized models. Crucially, the panoramic action space significantly outperformed the low-level one, achieving a 41% success rate compared to 26%, largely due to shorter, less error-prone navigation sequences.

In the rapidly evolving field of artificial intelligence, enabling robots to understand and act upon human instructions in real-world environments is a significant challenge. This is the core of Vision-and-Language Navigation (VLN), a task where autonomous robots learn to navigate unfamiliar spaces by following natural language commands, such as “Walk down the hallway and take the last door to your left.”

Traditionally, VLN systems have relied on models specifically designed and optimized for navigation. Recent advances in Large Vision-Language Models (LVLMs) open up an alternative: because these models process both visual and textual input, they are natural candidates for VLN. A recent research paper investigates whether ‘off-the-shelf’ LVLMs, without extensive architectural modifications or simulator-based training, can effectively perform VLN, and how the choice of action space influences their performance.

The paper, titled “Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces,” was authored by Vebjørn Haug Kåsene from the University of Oslo and Pierre Lison from the Norwegian Computing Center. Their work delves into two primary questions: can readily available LVLMs be adapted for VLN, and how do low-level versus panoramic action spaces impact their navigation capabilities?

The researchers focused on fine-tuning an open-source LVLM, Qwen2.5-VL-3B-Instruct, using the Room-to-Room (R2R) dataset. The R2R dataset is a widely used benchmark that provides thousands of trajectory-instruction pairs within simulated indoor environments, allowing robots to learn how to follow instructions to reach specific locations.

Understanding Action Spaces

The concept of ‘action space’ refers to the set of possible movements or decisions a robot can make. The paper compares two main types:

Low-level Action Space: In this setup, the robot perceives its environment through an egocentric (first-person) view, similar to how a human sees. It then selects from a discrete set of atomic actions like “Move Forward,” “Turn Left” (by 30 degrees), “Turn Right” (by 30 degrees), or “Stop.” This approach is intuitive but can lead to longer, more complex sequences of actions to reach a destination.
Panoramic Action Space: Here, the robot is provided with a 360-degree panoramic image of its surroundings. Instead of atomic movements, it chooses from a set of navigable candidate directions, each corresponding to an adjacent location in the environment’s navigation graph. This effectively reduces the task to a visually guided search, often leading to more direct paths.
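To make the contrast concrete, here is a minimal sketch of how the two action spaces could be represented. It is not the authors' code: the `Candidate` structure, the function name, and the convention that headings are measured clockwise in degrees are assumptions made for illustration. Only the four atomic actions and the 30-degree turn increment come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum

TURN_DEGREES = 30  # turn increment described in the article


class LowLevelAction(Enum):
    MOVE_FORWARD = "Move Forward"
    TURN_LEFT = "Turn Left"
    TURN_RIGHT = "Turn Right"
    STOP = "Stop"


@dataclass(frozen=True)
class Candidate:
    """A navigable direction in the panoramic setting (hypothetical structure)."""
    node_id: str        # adjacent node in the navigation graph
    heading_deg: float  # absolute heading towards that node (assumed clockwise)


def low_level_actions_to_reach(current_heading: float,
                               candidate: Candidate) -> list[LowLevelAction]:
    """Atomic actions needed to face a candidate and step towards it."""
    # Signed shortest rotation, in the range [-180, 180).
    diff = (candidate.heading_deg - current_heading + 180) % 360 - 180
    turns = round(abs(diff) / TURN_DEGREES)
    turn = LowLevelAction.TURN_RIGHT if diff > 0 else LowLevelAction.TURN_LEFT
    return [turn] * turns + [LowLevelAction.MOVE_FORWARD]


# A door 90 degrees to the right: one panoramic choice, four low-level actions.
door = Candidate(node_id="node_b", heading_deg=90.0)
print(len(low_level_actions_to_reach(0.0, door)))  # 4
```

In the panoramic setting the model would pick `door` in a single decision, while the low-level agent has to emit every turn and the forward move separately.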

The study’s findings revealed that while off-the-shelf LVLMs like Qwen2.5-VL can indeed learn to perform Vision-and-Language Navigation, their performance still lags behind models specifically designed and heavily optimized for this task. The best resulting model achieved a 41% success rate on the R2R test set.
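For readers unfamiliar with the metric, the sketch below shows how a success rate of this kind is typically computed. The 3-metre threshold is the usual R2R convention and is an assumption here, since the article does not state it; the distances in the example are made up.

```python
SUCCESS_THRESHOLD_M = 3.0  # common R2R convention; assumed, not stated in the article


def success_rate(final_distances_m: list[float]) -> float:
    """Fraction of episodes in which the agent stopped close enough to the goal."""
    if not final_distances_m:
        raise ValueError("need at least one episode")
    successes = sum(d < SUCCESS_THRESHOLD_M for d in final_distances_m)
    return successes / len(final_distances_m)


# Four hypothetical episodes: two end within 3 m of the goal.
print(success_rate([1.2, 4.5, 2.9, 7.0]))  # 0.5
```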

A significant discovery was the impact of the action space. The panoramic action space consistently outperformed the low-level one. The model fine-tuned for panoramic actions achieved a 41% success rate, whereas the low-level model reached only 26%. This performance gap, which is even larger than observed in previous studies with older model architectures, suggests that panoramic views provide crucial information that simplifies the navigation task for LVLMs. One key reason for this difference is that low-level action sequences are, on average, twice as long as panoramic ones, increasing the opportunities for errors to accumulate.
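A toy calculation shows why sequence length matters. Assume, purely for illustration, that every action is chosen correctly with the same independent probability and that a single mistake ruins the episode. Real agents can sometimes recover from errors, so this is a simplification and not the paper's analysis; the numbers are hypothetical.

```python
def flawless_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability of making no mistake over num_steps independent decisions."""
    return per_step_accuracy ** num_steps


# Hypothetical figures: 90% per-step accuracy, 6 panoramic steps
# versus a low-level sequence that is twice as long.
print(round(flawless_probability(0.9, 6), 3))   # 0.531
print(round(flawless_probability(0.9, 12), 3))  # 0.282
```

Under these assumptions, doubling the number of decisions squares the chance of an error-free run, which is why longer low-level sequences are penalized so heavily.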

The researchers also explored variations within the low-level action space, such as disabling an automatic reorientation step before moving forward. Interestingly, removing this adjustment led to a noticeable performance gain, suggesting that explicitly aligning the robot’s heading with its movement direction might not always be necessary for effective navigation.

This research highlights that while fine-tuning off-the-shelf LVLMs for VLN is a promising direction, there’s still a performance gap compared to specialized architectures. Future work could involve evaluating a broader range of LVLMs and further investigating the panoramic action space through detailed studies to understand its full impact on navigation performance. For more technical details, you can refer to the full research paper, “Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces.”
