TLDR: ROVER is a new framework that enhances Vision-Language Models’ (VLMs) ability to understand long video sequences for embodied robotic tasks. It recursively breaks down complex tasks into smaller subtasks, allowing VLMs to focus on short, relevant video segments. This approach significantly improves reasoning accuracy, reduces hallucinations, and offers linear scalability with video length, outperforming existing methods in task progress estimation, natural language reasoning, and video question answering.
Vision-language models, or VLMs, have shown remarkable abilities in understanding images, but they often struggle when it comes to processing long sequences of camera frames from videos, especially in real-world robotic tasks. These embodied tasks require continuous reasoning over visual input, which can be challenging for current VLM approaches.
To address this limitation, researchers have introduced a new framework called ROVER, which stands for Reasoning Over VidEo Recursively. This innovative approach allows VLMs to break down long video trajectories into smaller, more manageable segments, each corresponding to a shorter subtask within the overall task. By doing so, ROVER enables more focused and accurate reasoning over these localized video segments without losing sight of the broader task context.
How ROVER Works
ROVER operates by recursively decomposing a task shown in a video. Instead of trying to process an entire, lengthy video sequence at once, it generates a separate line of reasoning for each subtask. For example, if a robot is tasked with ‘opening a door,’ ROVER might first focus on the subtask of ‘grasping the door handle.’ Once that subtask is complete, its reasoning for that segment concludes, and a new line of reasoning begins for the next subtask, such as ‘pulling the door open.’
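The control flow described above can be sketched in a few lines. This is a minimal, runnable illustration only: `vlm_decompose` and `vlm_reason` are hypothetical stand-ins for the actual VLM calls, and the even frame split is a simplification (the real system decides segment boundaries during reasoning).

```python
def vlm_decompose(task):
    # Stub for a VLM call: a fixed decomposition table for the door example.
    table = {"open the door": ["grasp the door handle", "pull the door open"]}
    return table.get(task, [])  # leaf subtasks decompose no further

def vlm_reason(task, frames):
    # Stub for a VLM call: one line of reasoning for a short, focused segment.
    return f"{task}: reasoned over {len(frames)} frames"

def rover(task, frames):
    """Recursively split `frames` among subtasks, one reasoning thread each."""
    subtasks = vlm_decompose(task)
    if not subtasks:                      # base case: atomic subtask
        return [vlm_reason(task, frames)]
    traces = []
    chunk = len(frames) // len(subtasks)  # naive even split for this sketch
    for i, sub in enumerate(subtasks):
        start = i * chunk
        end = start + chunk if i < len(subtasks) - 1 else len(frames)
        traces.extend(rover(sub, frames[start:end]))
    return traces

traces = rover("open the door", list(range(100)))
```

The key property is that each call to `vlm_reason` sees only its own short segment, never the full 100-frame trajectory.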
This decomposition strategy offers several key advantages. Firstly, it significantly improves accuracy by allowing the VLM to concentrate on the most relevant temporal segments of the video. Secondly, it enables the use of a subtask-specific ‘sliding context window,’ which further reduces the number of frames the model needs to process at any given moment. As a result, ROVER’s processing time scales linearly with video length, a significant improvement over approaches that re-process the full frame history at every step and therefore scale quadratically.
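The scaling difference can be made concrete by counting how many frames the model touches over a whole trajectory. The sketch below is illustrative, not from the paper: the fixed window size and per-step accounting are assumptions chosen to show why a bounded window gives linear total work while re-reading the full history gives quadratic work.

```python
def frames_processed(total_frames, window):
    """Total frames touched when each step sees only the last `window`
    frames of the current segment (a sliding-context-window scheme)."""
    return sum(min(window, t + 1) for t in range(total_frames))

def frames_processed_full_history(total_frames):
    """Total frames touched when every step re-reads the entire history
    (the quadratic behaviour of naive long-context reasoning)."""
    return sum(t + 1 for t in range(total_frames))
```

For a 100-frame video with an 8-frame window, the sliding scheme touches 772 frames in total versus 5050 for full-history re-reading, and doubling the video roughly doubles the sliding scheme's cost while quadrupling the naive one's.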
Performance and Benefits
The ROVER framework was evaluated using an in-context learning approach on a variety of OpenX Embodiment videos and a new dataset derived from RoboCasa. This new dataset includes 543 videos across 27 robotic manipulation tasks, featuring both expert and intentionally perturbed non-expert trajectories to test the model’s robustness in diverse scenarios.
ROVER consistently outperformed strong baseline methods across three main video reasoning tasks: estimating task progress, performing frame-level natural language reasoning, and answering questions about video content. A notable finding was ROVER’s ability to mitigate ‘hallucinations’ – instances where the VLM incorrectly states that an event occurred or misinterprets the situation. This improvement was particularly evident during unexpected or non-optimal moments in a trajectory, where other models struggled when reasoning over long sequences of frames.
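To give a feel for the task-progress-estimation setting, here is a hypothetical baseline: if the subtask boundaries of a trajectory are known, a frame index maps to a coarse progress fraction by counting completed subtasks. This is an illustrative simplification, not ROVER's actual estimator.

```python
def frame_progress(boundaries, t):
    """Coarse progress at frame `t`, given `boundaries` marking the
    frame index where each subtask ends (hypothetical setup)."""
    for i, end in enumerate(boundaries):
        if t < end:
            return i / len(boundaries)  # i subtasks fully completed so far
    return 1.0  # past the last boundary: task finished
```

With boundaries `[30, 70, 100]`, frame 45 sits in the second subtask, so one of three subtasks is complete and progress is 1/3.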
The research also demonstrated ROVER’s robustness to various factors, including different video lengths, frame rates, camera views, and even different underlying Vision-Language Models (such as Gemini-1.5-Pro, GPT-4o, and Qwen-2.5-VL-32B-Instruct). This indicates its potential for broad applicability in real-world robotic systems.
Future Directions
While ROVER marks a significant step forward, the researchers acknowledge some limitations. If the decomposition process itself fails (e.g., by identifying unnecessary or incorrect subtasks), the reasoning might become fragmented. The current implementation relies on an in-context learning approach, and future work could explore fine-tuning methods to further enhance its performance.
Overall, ROVER provides a robust and scalable foundation for more precise and efficient VLM reasoning over video sequences in embodied tasks. For more technical details, you can refer to the full research paper available at arXiv:2508.01943.


