
Computer Agents Learn New Tricks: Watching Online Videos for On-the-Fly Guidance

TLDR: This research introduces a framework for computer-use agents to learn from online video tutorials during inference. The system retrieves and filters relevant videos, processes them into structured ‘demonstration trajectories’ with inferred UI actions and objectives, and then dynamically selects the most helpful trajectory as in-context guidance at each step. Experiments show this approach significantly improves agent performance on desktop and web tasks compared to baselines, highlighting the importance of video segmentation, dynamic selection, action filtering, and visual information.

Computer-use agents are designed to automate tasks on computers, from simple digital workflows to complex multi-step operations. While these agents have made significant strides, they often struggle with tasks requiring specific procedural knowledge, such as how to use a particular application or navigate a unique user interface. Humans, on the other hand, readily overcome such challenges by watching video tutorials, selectively imitating relevant segments to achieve their goals.

A new research paper, titled “Learning from Online Videos at Inference Time for Computer-Use Agents,” explores how to empower these agents to effectively learn from online video tutorials during their operation. The authors, Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu, Jialian Wu, Jiang Liu, Emad Barsoum, Zicheng Liu, and Shiyu Chang, propose a novel framework that allows computer-use agents to leverage the vast amount of online video content as dynamic, in-context guidance.

The Challenge of Learning from Videos

The core challenge lies in bridging the gap between continuous video streams and the discrete actions an agent performs. Videos show implicit actions and continuous screen changes, while agents need explicit UI actions and observe sparse screenshots. Furthermore, raw videos can be long and contain irrelevant information, making it difficult for an agent to pinpoint the exact, short snippet needed for its current subgoal.

A Three-Step Framework for Video-Powered Agents

The proposed framework addresses these issues through three main components:

1. Video Retrieval: When faced with a task, the agent first generates search queries to find relevant online video tutorials. These videos are then filtered to ensure they are genuinely helpful, demonstrating computer operations pertinent to the task and running on the correct operating system (e.g., Ubuntu for desktop tasks). Filtering involves both coarse selection based on titles and descriptions and more detailed content verification, in which a Vision Language Model (VLM) examines transcripts and sampled frames (a sketch of this stage appears after the list).

2. Video Processing: Once relevant videos are identified, they are converted into a structured format called “demonstration trajectories.” A VLM infers the underlying UI actions (such as clicks, types, or drags) from screen changes within the video, and these actions are then filtered to remove irrelevant movements. Crucially, the videos are segmented into short subsequences of actions, and each subsequence is assigned a concise textual objective. This transforms a continuous video into a series of actionable, goal-oriented segments, each with an objective, observations (screenshots), and a sequence of actions (also sketched after the list).

3. Video Application: During the agent’s execution, a dynamic two-stage selection mechanism comes into play. At each step, the agent first performs a coarse ranking of demonstration trajectories based on their objectives to create a candidate pool, then inspects the initial observation and action sequence of each candidate to select the single most helpful trajectory for its next decision. The selected trajectory is provided as in-context guidance. To maintain coherence, the agent first checks whether the previously selected trajectory is still relevant before searching for a new one, mimicking how humans stick to a plan until a deviation occurs (the selection loop is the third sketch below).
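
To make the retrieval stage concrete, here is a minimal Python sketch of the two-pass filter described above. All names here (search_videos, vlm_judge, sample_frames, the query templates) are hypothetical stand-ins, not the paper’s actual API.

```python
# Minimal sketch of the retrieve-and-filter stage. Assumed helpers:
# `search_videos` (a video search API), `vlm_judge` (a yes/no VLM call),
# and video objects exposing title, description, transcript, sample_frames().

def retrieve_tutorials(task, search_videos, vlm_judge, max_videos=5):
    """Search for tutorial videos, then filter them in two passes."""
    # The agent generates its own search queries for the task.
    queries = [f"how to {task}", f"{task} tutorial ubuntu"]
    candidates = [v for q in queries for v in search_videos(q)]

    kept = []
    for video in candidates:
        # Pass 1: coarse filter on cheap metadata (title + description).
        if not vlm_judge(
            f"Task: {task}\nTitle: {video.title}\n"
            f"Description: {video.description}\n"
            "Does this video demonstrate relevant computer operations on Ubuntu?"
        ):
            continue
        # Pass 2: verify actual content via the transcript and sampled frames.
        if vlm_judge(
            f"Task: {task}\nTranscript: {video.transcript}\n"
            "Is this tutorial genuinely helpful for the task?",
            images=video.sample_frames(n=8),
        ):
            kept.append(video)
        if len(kept) == max_videos:
            break
    return kept
```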
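
The processing stage can be sketched similarly. The helpers below (infer_action, is_relevant, segment_and_label) stand in for the VLM-based steps the paper describes; their names and signatures are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    objective: str      # concise textual subgoal for this segment
    observations: list  # screenshots taken before each action
    actions: list       # inferred UI actions (click, type, drag, ...)

def process_video(frames, infer_action, is_relevant, segment_and_label):
    """Convert a raw frame sequence into goal-oriented demonstration trajectories."""
    # 1. Infer a discrete UI action for each screen change between frames.
    steps = []
    for before, after in zip(frames, frames[1:]):
        action = infer_action(before, after)  # VLM call; may return None
        # 2. Filter out irrelevant movements (idle cursor motion, etc.).
        if action is not None and is_relevant(action):
            steps.append((before, action))

    # 3. Split the action stream into short subsequences and have the VLM
    #    phrase a concise objective for each one.
    return [
        Trajectory(objective=obj,
                   observations=[obs for obs, _ in seg],
                   actions=[act for _, act in seg])
        for obj, seg in segment_and_label(steps)
    ]
```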
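
Finally, a sketch of the dynamic two-stage selection with the coherence check. Again, still_helps, rank_by_objective, and pick_best are illustrative placeholders for the paper’s VLM-based components, not confirmed names.

```python
def select_guidance(state, trajectories, previous, still_helps,
                    rank_by_objective, pick_best, pool_size=5):
    """Choose the single most helpful demonstration trajectory for the next step."""
    # Coherence check: keep following the previously chosen trajectory
    # until it stops being relevant, as a human sticks to a plan.
    if previous is not None and still_helps(state, previous):
        return previous
    # Stage 1: coarse ranking of all trajectories by textual objective.
    pool = rank_by_objective(state, trajectories)[:pool_size]
    # Stage 2: inspect each candidate's initial observation and action
    # sequence, and let the VLM pick the most helpful one.
    return pick_best(state, pool)
```

In an agent loop, a function like select_guidance would run at every step, and the returned trajectory (objective, screenshots, actions) would be serialized into the agent’s context as in-context guidance.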


Empirical Success and Key Insights

The researchers evaluated their method on two widely used benchmarks: OSWorld-Verified for desktop tasks and WebArena for web-based tasks. The results consistently showed that their framework outperforms strong baseline agents that lack video access, as well as variants that only use textual tutorials or video transcripts. For instance, on OSWorld-Verified, the method improved success rates by 2.1% over the state-of-the-art Jedi framework, and on WebArena, it achieved a 4.2% improvement over AgentOccam.

Further analyses highlighted several critical factors contributing to this success:

  • More Videos, Better Performance: Access to a larger pool of relevant videos led to improved performance, suggesting scalability.
  • Dynamic Trajectory Selection: Splitting videos into short, goal-oriented trajectories and dynamically selecting them at each step was crucial, significantly outperforming methods that use entire videos.
  • Action Filtering: Removing irrelevant actions during video processing improved the quality of the demonstration trajectories.
  • Visual Information: Providing visual information (screenshots) alongside textual objectives and actions was more effective than text-only summaries, underscoring the importance of visual context.

This research demonstrates a powerful approach to enabling computer-use agents to learn on the fly from the abundant resource of online video tutorials. By systematically distilling videos into actionable, visually grounded guidance, agents can acquire domain-specific procedural knowledge more effectively, bringing them closer to human-like computer interaction. The full research paper, “Learning from Online Videos at Inference Time for Computer-Use Agents,” is available online.

Meera Iyer