TLDR: This paper benchmarks Large Language Models (LLMs) and Vision-Language Models (VLMs) for exploration in reinforcement learning (RL). It reveals that while VLMs can infer high-level objectives, they consistently fail at precise low-level control, a phenomenon termed the “knowing-doing gap.” The research also explores a hybrid framework where VLM guidance significantly improves early-stage sample efficiency for RL agents, suggesting a promising direction for combining the semantic understanding of foundation models with the precise control of traditional RL.
Reinforcement Learning (RL) is a powerful framework that allows artificial intelligence agents to learn by interacting with an environment, much like how humans learn through trial and error. A core challenge in RL, especially in scenarios where rewards are rare (known as sparse-reward settings), is exploration. This refers to the agent’s ability to efficiently discover valuable strategies rather than getting stuck exploiting suboptimal ones too early.
Traditional exploration methods can be very inefficient, often requiring millions of interactions to find meaningful solutions. However, a new direction in AI research involves leveraging the vast knowledge embedded in large foundation models, such as Large Language Models (LLMs) and Vision-Language Models (VLMs), to improve this exploration process.
A recent paper, “Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches” by Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, and Paulo Rauber, examines this direction. The authors systematically benchmark LLMs and VLMs on classic RL tasks to measure their zero-shot exploration capabilities, meaning how well they perform without any task-specific training.
Benchmarking Foundation Models in RL Exploration
The paper makes several key contributions. First, it provides a comprehensive benchmark of foundation models across a range of classic exploration tasks. These tasks progress in complexity, starting from simple Multi-Armed Bandits (MABs), moving to spatial reasoning in Gridworlds, and finally to high-dimensional, sparse-reward Atari games.
In Multi-Armed Bandits, which isolate the exploration-exploitation trade-off, the researchers found that LLMs perform significantly better when given explicit instructions to explore, rather than having to infer the need for exploration themselves. While models like GPT-4 performed competitively with classical exploration algorithms when reward differences were clear, they struggled with subtle statistical distinctions.
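To make the bandit setting concrete, here is a minimal sketch of UCB1, one well-known classical exploration algorithm. The article does not say which classical baselines the paper used, so the choice of UCB1, the Bernoulli reward model, and the arm means below are all illustrative assumptions. The two example runs contrast a clear reward gap with a subtle one, the case the article says the LLMs struggled with.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled.

    arm_means are hypothetical success probabilities chosen for illustration.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # number of pulls per arm
    totals = [0.0] * n_arms  # summed reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so each has an initial estimate.
            arm = t - 1
        else:
            # Choose the arm with the highest mean estimate plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Clear gap between arms versus a subtle gap (assumed values).
clear = ucb1([0.2, 0.8], horizon=2000)
subtle = ucb1([0.50, 0.55], horizon=2000)
print("clear gap pulls: ", clear)
print("subtle gap pulls:", subtle)
```

The exploration bonus shrinks as an arm is pulled more often, so the algorithm keeps sampling a near-optimal arm for much longer when the gap is small. Telling such arms apart requires many samples, which is a plausible reason why subtle statistical distinctions are hard to resolve from a short interaction history.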
Moving to Gridworlds, which require both short-term action planning and long-term memory, LLMs performed well in deterministic settings where the reward location was fixed and known. However, in stochastic Gridworlds, where the reward location was random and unknown, LLMs struggled with systematic exploration and often revisited already explored areas, even with explicit planning prompts. This highlights a limitation in their ability to effectively leverage memory over multiple interactions for long-horizon tasks.
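One simple way to quantify the revisiting behaviour described above is to count how many steps land on a cell the agent has already seen. The sketch below is an illustration only: the `revisit_rate` metric, the grid size, and the two example trajectories are assumptions, not the paper's evaluation protocol.

```python
import random


def revisit_rate(path):
    """Fraction of steps that land on a cell already visited earlier in the path."""
    seen = set()
    revisits = 0
    for cell in path:
        if cell in seen:
            revisits += 1
        seen.add(cell)
    return revisits / len(path) if path else 0.0


def snake_sweep(size):
    """Systematic row-by-row sweep that visits every cell exactly once."""
    path = []
    for row in range(size):
        cols = range(size) if row % 2 == 0 else range(size - 1, -1, -1)
        path.extend((row, col) for col in cols)
    return path


def random_walk(size, steps, seed=0):
    """Unstructured walk; moves into a wall leave the agent in place."""
    rng = random.Random(seed)
    row = col = 0
    path = [(row, col)]
    for _ in range(steps - 1):
        d_row, d_col = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        row = min(max(row + d_row, 0), size - 1)
        col = min(max(col + d_col, 0), size - 1)
        path.append((row, col))
    return path


print("systematic sweep:", revisit_rate(snake_sweep(5)))
print("random walk:     ", revisit_rate(random_walk(5, 200)))
```

A systematic sweep has a revisit rate of zero by construction, so a high rate on a logged trajectory is a direct signal that the agent is not using its memory of past positions.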
The “Knowing-Doing Gap” in Vision-Language Models
Perhaps the most significant finding from the paper is the characterization of a persistent “knowing-doing gap” in VLMs when applied to hard-exploration Atari games. The researchers evaluated GPT-4o on seven challenging Atari games, including Montezuma’s Revenge and Pitfall, which are known for their sparse rewards.
The qualitative analysis revealed that VLMs possess an impressive high-level understanding. For instance, in games like Freeway and Gravitar, GPT-4o could infer objectives directly from visual input, recognizing characters, enemies, and the overall goal. It successfully identified that a character needed to cross a road or that a ship should fire at an enemy.
However, this high-level understanding often broke down when precise, low-level control was required. In games like Montezuma’s Revenge, the VLM might correctly identify the goal (e.g., “retrieve the key”) but consistently fail at the precise timing and momentum needed to execute actions like jumping over a pit. Furthermore, VLMs sometimes struggled with basic self-recognition, failing to identify the player’s avatar in games like Venture. This gap means that while VLMs “know” what to do, they often lack the fine-grained procedural “doing” required for execution.
Hybrid Approaches: Bridging the Gap
Recognizing this “knowing-doing gap,” the paper investigates a simple on-policy hybrid framework. The idea is not to replace traditional RL agents entirely but to leverage the semantic guidance of VLMs to assist them. In this framework, a VLM acts as a temporary, exploratory guide for a standard RL agent (specifically, a PPO agent).
To test this, the researchers used the Freeway environment, where the VLM’s high-level strategy is known to be correct and the required control is simple. The results showed that the PPO-VLM hybrid agent learned significantly faster than both a vanilla PPO agent and a PPO agent augmented with Random Network Distillation (a strong exploration baseline). This suggests that VLM guidance can act as a powerful “semantic accelerator” for an RL policy, especially in the early stages of learning.
While this comes at the cost of increased computation due to VLM queries, it provides a clear quantitative data point demonstrating the potential synergy under ideal conditions. The authors emphasize that this is an upper-bound analysis and not a general solution, but it highlights a promising path forward.
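The idea of a VLM acting as a temporary guide can be sketched as an action-selection rule in which the guide's suggestion is followed with a probability that decays to zero, after which the RL policy acts alone. The article does not describe the paper's actual mechanism, so everything below is a hypothetical illustration: the linear decay schedule, the `start_prob` and `guide_steps` parameters, and the stub functions standing in for the PPO policy and the VLM query.

```python
import random


def guidance_probability(step, guide_steps, start_prob=0.5):
    """Chance of following the guide; decays linearly to zero (assumed schedule)."""
    if step >= guide_steps:
        return 0.0
    return start_prob * (1.0 - step / guide_steps)


def select_action(step, policy_action, guide_action, guide_steps, rng):
    """Return (action, source): the guide's suggestion early on, the policy's otherwise."""
    if rng.random() < guidance_probability(step, guide_steps):
        return guide_action(), "guide"
    return policy_action(), "policy"


# Hypothetical stand-ins: a real system would sample from a PPO policy
# and query a VLM with the current frame.
def stub_policy():
    return "NOOP"


def stub_guide():
    return "UP"  # in Freeway, moving up crosses the road


rng = random.Random(0)
sources = [
    select_action(step, stub_policy, stub_guide, guide_steps=100, rng=rng)[1]
    for step in range(200)
]
print("guided steps in first 100:", sources[:100].count("guide"))
print("guided steps after 100:   ", sources[100:].count("guide"))
```

Limiting guidance to an early window keeps the number of VLM queries bounded, which matters given the computational cost noted above, and leaves the learned policy fully in control once it has seen some rewarding trajectories.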
Conclusion
The research provides a clear picture of the current capabilities and limitations of foundation models in RL exploration. While they show strong semantic understanding and can benefit from explicit instruction, they struggle with precise low-level control and long-term memory in complex environments. The findings strongly suggest that designing hybrid systems, where foundation models provide high-level semantic guidance to more robust, traditional RL policies, is a promising direction for future research. Such systems could strategically leverage the strengths of both paradigms to tackle challenging exploration problems in AI.


