Navigating New Paths: How Off-the-Shelf AI Models Learn to Follow Instructions

TLDR: This paper explores whether pre-trained Large Vision-Language Models (LVLMs) like Qwen2.5-VL can effectively perform Vision-and-Language Navigation (VLN) by following natural language instructions in unfamiliar environments. Researchers fine-tuned Qwen2.5-VL on the Room-to-Room (R2R) dataset and compared its performance using two types of action spaces: low-level (atomic actions like “turn left”) and panoramic (selecting navigable directions from a 360-degree view). The study found that while off-the-shelf LVLMs can learn VLN, they still trail specialized models. Crucially, the panoramic action space significantly outperformed the low-level one, achieving a 41% success rate compared to 26%, largely due to shorter, less error-prone navigation sequences.

In the rapidly evolving field of artificial intelligence, enabling robots to understand and act upon human instructions in real-world environments is a significant challenge. This is the core of Vision-and-Language Navigation (VLN), a task where autonomous robots learn to navigate unfamiliar spaces by following natural language commands, such as “Walk down the hallway and take the last door to your left.”

Traditionally, VLN systems have relied on models specifically designed and optimized for navigation. Recent advances in Large Vision-Language Models (LVLMs) open up an alternative: because these models process both visual and textual input, they are natural candidates for VLN. A recent research paper investigates whether ‘off-the-shelf’ LVLMs, without extensive architectural modifications or simulator-based training, can effectively perform VLN, and how the choice of action space influences their performance.

The paper, titled “Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces,” was authored by Vebjørn Haug Kåsene from the University of Oslo and Pierre Lison from the Norwegian Computing Center. Their work delves into two primary questions: can readily available LVLMs be adapted for VLN, and how do low-level versus panoramic action spaces impact their navigation capabilities?

The researchers focused on fine-tuning an open-source LVLM, Qwen2.5-VL-3B-Instruct, using the Room-to-Room (R2R) dataset. The R2R dataset is a widely used benchmark that provides thousands of trajectory-instruction pairs within simulated indoor environments, allowing robots to learn how to follow instructions to reach specific locations.

Understanding Action Spaces

The concept of ‘action space’ refers to the set of possible movements or decisions a robot can make. The paper compares two main types:

  • Low-level Action Space: In this setup, the robot perceives its environment through an egocentric (first-person) view, similar to how a human sees. It then selects from a discrete set of atomic actions like “Move Forward,” “Turn Left” (by 30 degrees), “Turn Right” (by 30 degrees), or “Stop.” This approach is intuitive but can lead to longer, more complex sequences of actions to reach a destination.

  • Panoramic Action Space: Here, the robot is provided with a 360-degree panoramic image of its surroundings. Instead of atomic movements, it chooses from a set of navigable candidate directions, each corresponding to an adjacent location in the environment’s navigation graph. This effectively reduces the task to a visually guided search, often leading to more direct paths.
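To make the contrast concrete, here is a minimal Python sketch of the two action spaces. It is an illustration only, not the paper's implementation: the names `LowLevelAction`, `Candidate`, and `low_level_steps_to_reach` are hypothetical, and the assumption that a single "Move Forward" reaches the chosen neighbour once the robot faces it is a simplification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

TURN_DEGREES = 30  # turn increment described in the article


class LowLevelAction(Enum):
    """Atomic actions available in the low-level action space."""
    MOVE_FORWARD = "move_forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"


@dataclass(frozen=True)
class Candidate:
    """One navigable direction in the panoramic action space (hypothetical type)."""
    viewpoint_id: str   # adjacent node in the navigation graph
    heading_deg: float  # direction relative to the robot's current heading


def low_level_steps_to_reach(candidate: Candidate) -> List[LowLevelAction]:
    """Expand one panoramic choice into the atomic actions it stands for.

    Assumes the robot turns in 30-degree increments until it faces the
    candidate and that one forward move then reaches it.
    """
    # Normalize the heading offset to the range [-180, 180).
    offset = (candidate.heading_deg + 180) % 360 - 180
    turns = round(abs(offset) / TURN_DEGREES)
    turn = LowLevelAction.TURN_RIGHT if offset > 0 else LowLevelAction.TURN_LEFT
    return [turn] * turns + [LowLevelAction.MOVE_FORWARD]
```

Under these assumptions, a candidate 90 degrees to the right is a single decision in the panoramic space but four decisions (three right turns and one forward move) in the low-level space, which is one way to see why low-level sequences tend to be longer.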

The study’s findings revealed that while off-the-shelf LVLMs like Qwen2.5-VL can indeed learn to perform Vision-and-Language Navigation, their performance still lags behind models specifically designed and heavily optimized for this task. The best resulting model achieved a 41% success rate on the R2R test set.

A significant discovery was the impact of the action space. The panoramic action space consistently outperformed the low-level one. The model fine-tuned for panoramic actions achieved a 41% success rate, whereas the low-level model reached only 26%. This performance gap, which is even larger than observed in previous studies with older model architectures, suggests that panoramic views provide crucial information that simplifies the navigation task for LVLMs. One key reason for this difference is that low-level action sequences are, on average, twice as long as panoramic ones, increasing the opportunities for errors to accumulate.
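The effect of sequence length on error accumulation can be sketched with a toy calculation. This is not an analysis from the paper: it assumes, purely for illustration, that every step is chosen correctly with the same independent probability and that a single wrong step fails the episode. Real agents can sometimes recover from mistakes, so the numbers are only indicative. The function name `episode_success_probability` and the 0.9 accuracy used in the usage note are hypothetical.

```python
def episode_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct under an independence assumption."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # Each step must succeed, so the probabilities multiply.
    return per_step_accuracy ** num_steps


# Hypothetical comparison: the same per-step accuracy, but one sequence
# is twice as long as the other.
short_episode = episode_success_probability(0.9, 6)
long_episode = episode_success_probability(0.9, 12)
```

Under this toy model, doubling the number of steps squares the episode-level success probability, so `long_episode` equals `short_episode ** 2` and is necessarily lower whenever per-step accuracy is below 1.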

The researchers also explored variations within the low-level action space, such as disabling an automatic reorientation step before moving forward. Interestingly, removing this adjustment led to a noticeable performance gain, suggesting that explicitly aligning the robot’s heading with its movement direction might not always be necessary for effective navigation.

This research highlights that while fine-tuning off-the-shelf LVLMs for VLN is a promising direction, there’s still a performance gap compared to specialized architectures. Future work could involve evaluating a broader range of LVLMs and further investigating the panoramic action space through detailed studies to understand its full impact on navigation performance. For more technical details, you can refer to the full research paper: “Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces.”

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
