
Computer Agents Learn New Tricks: Watching Online Videos for On-the-Fly Guidance

TLDR: This research introduces a framework for computer-use agents to learn from online video tutorials during inference. The system retrieves and filters relevant videos, processes them into structured ‘demonstration trajectories’ with inferred UI actions and objectives, and then dynamically selects the most helpful trajectory as in-context guidance at each step. Experiments show this approach significantly improves agent performance on desktop and web tasks compared to baselines, highlighting the importance of video segmentation, dynamic selection, action filtering, and visual information.

Computer-use agents are designed to automate tasks on computers, from simple digital workflows to complex multi-step operations. While these agents have made significant strides, they often struggle with tasks requiring specific procedural knowledge, such as how to use a particular application or navigate a unique user interface. Humans, on the other hand, readily overcome such challenges by watching video tutorials, selectively imitating relevant segments to achieve their goals.

A new research paper, titled “Learning from Online Videos at Inference Time for Computer-Use Agents,” explores how to empower these agents to effectively learn from online video tutorials during their operation. The authors, Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu, Jialian Wu, Jiang Liu, Emad Barsoum, Zicheng Liu, and Shiyu Chang, propose a novel framework that allows computer-use agents to leverage the vast amount of online video content as dynamic, in-context guidance.

The Challenge of Learning from Videos

The core challenge lies in bridging the gap between continuous video streams and the discrete actions an agent performs. Videos show implicit actions and continuous screen changes, while agents need explicit UI actions and observe sparse screenshots. Furthermore, raw videos can be long and contain irrelevant information, making it difficult for an agent to pinpoint the exact, short snippet needed for its current subgoal.

A Three-Step Framework for Video-Powered Agents

The proposed framework addresses these issues through three main components:

1. Video Retrieval: When faced with a task, the agent first generates search queries to find relevant online video tutorials. These videos are then filtered to ensure they are genuinely helpful, demonstrating computer operations pertinent to the task and running on the correct operating system (e.g., Ubuntu for desktop tasks). Filtering involves both coarse selection based on titles and descriptions and more detailed content verification, in which a Vision Language Model (VLM) examines transcripts and sampled frames (a sketch of this stage appears after the list).

2. Video Processing: Once relevant videos are identified, they are converted into a structured format called “demonstration trajectories.” A VLM infers the underlying UI actions (such as clicks, types, or drags) from screen changes within the video, and these actions are then filtered to remove irrelevant movements. Crucially, the videos are segmented into short subsequences of actions, and each subsequence is assigned a concise textual objective. This transforms a continuous video into a series of actionable, goal-oriented segments, each with an objective, observations (screenshots), and a sequence of actions (also sketched after the list).

3. Video Application: During the agent’s execution, a dynamic two-stage selection mechanism comes into play. At each step, the agent first performs a coarse ranking of demonstration trajectories based on their objectives to create a candidate pool, then inspects the initial observation and action sequence of each candidate to select the single most helpful trajectory for its next decision. The selected trajectory is provided as in-context guidance. To maintain coherence, the agent first checks whether the previously selected trajectory is still relevant before searching for a new one, mimicking how humans stick to a plan until a deviation occurs (the selection loop is the third sketch below).
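
To make the retrieval stage concrete, here is a minimal Python sketch of the two-pass filter described above. All names here (search_videos, vlm_judge, sample_frames, the query templates) are hypothetical stand-ins, not the paper’s actual API.

```python
# Minimal sketch of the retrieve-and-filter stage. Assumed helpers:
# `search_videos` (a video search API), `vlm_judge` (a yes/no VLM call),
# and video objects exposing title, description, transcript, sample_frames().

def retrieve_tutorials(task, search_videos, vlm_judge, max_videos=5):
    """Search for tutorial videos, then filter them in two passes."""
    # The agent generates its own search queries for the task.
    queries = [f"how to {task}", f"{task} tutorial ubuntu"]
    candidates = [v for q in queries for v in search_videos(q)]

    kept = []
    for video in candidates:
        # Pass 1: coarse filter on cheap metadata (title + description).
        if not vlm_judge(
            f"Task: {task}\nTitle: {video.title}\n"
            f"Description: {video.description}\n"
            "Does this video demonstrate relevant computer operations on Ubuntu?"
        ):
            continue
        # Pass 2: verify actual content via the transcript and sampled frames.
        if vlm_judge(
            f"Task: {task}\nTranscript: {video.transcript}\n"
            "Is this tutorial genuinely helpful for the task?",
            images=video.sample_frames(n=8),
        ):
            kept.append(video)
        if len(kept) == max_videos:
            break
    return kept
```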
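
The processing stage can be sketched similarly. The helpers below (infer_action, is_relevant, segment_and_label) stand in for the VLM-based steps the paper describes; their names and signatures are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    objective: str      # concise textual subgoal for this segment
    observations: list  # screenshots taken before each action
    actions: list       # inferred UI actions (click, type, drag, ...)

def process_video(frames, infer_action, is_relevant, segment_and_label):
    """Convert a raw frame sequence into goal-oriented demonstration trajectories."""
    # 1. Infer a discrete UI action for each screen change between frames.
    steps = []
    for before, after in zip(frames, frames[1:]):
        action = infer_action(before, after)  # VLM call; may return None
        # 2. Filter out irrelevant movements (idle cursor motion, etc.).
        if action is not None and is_relevant(action):
            steps.append((before, action))

    # 3. Split the action stream into short subsequences and have the VLM
    #    phrase a concise objective for each one.
    return [
        Trajectory(objective=obj,
                   observations=[obs for obs, _ in seg],
                   actions=[act for _, act in seg])
        for obj, seg in segment_and_label(steps)
    ]
```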
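
Finally, a sketch of the dynamic two-stage selection with the coherence check. Again, still_helps, rank_by_objective, and pick_best are illustrative placeholders for the paper’s VLM-based components, not confirmed names.

```python
def select_guidance(state, trajectories, previous, still_helps,
                    rank_by_objective, pick_best, pool_size=5):
    """Choose the single most helpful demonstration trajectory for the next step."""
    # Coherence check: keep following the previously chosen trajectory
    # until it stops being relevant, as a human sticks to a plan.
    if previous is not None and still_helps(state, previous):
        return previous
    # Stage 1: coarse ranking of all trajectories by textual objective.
    pool = rank_by_objective(state, trajectories)[:pool_size]
    # Stage 2: inspect each candidate's initial observation and action
    # sequence, and let the VLM pick the most helpful one.
    return pick_best(state, pool)
```

In an agent loop, a function like select_guidance would run at every step, and the returned trajectory (objective, screenshots, actions) would be serialized into the agent’s context as in-context guidance.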


Empirical Success and Key Insights

The researchers evaluated their method on two widely used benchmarks: OSWorld-Verified for desktop tasks and WebArena for web-based tasks. The results consistently showed that their framework outperforms strong baseline agents that lack video access, as well as variants that only use textual tutorials or video transcripts. For instance, on OSWorld-Verified, the method improved success rates by 2.1% over the state-of-the-art Jedi framework, and on WebArena, it achieved a 4.2% improvement over AgentOccam.

Further analyses highlighted several critical factors contributing to this success:

  • More Videos, Better Performance: Access to a larger pool of relevant videos led to improved performance, suggesting scalability.
  • Dynamic Trajectory Selection: Splitting videos into short, goal-oriented trajectories and dynamically selecting them at each step was crucial, significantly outperforming methods that use entire videos.
  • Action Filtering: Removing irrelevant actions during video processing improved the quality of the demonstration trajectories.
  • Visual Information: Providing visual information (screenshots) alongside textual objectives and actions was more effective than text-only summaries, underscoring the importance of visual context.

This research demonstrates a powerful approach to enabling computer-use agents to learn on the fly from the abundant resource of online video tutorials. By systematically distilling videos into actionable, visually grounded guidance, agents can acquire domain-specific procedural knowledge more effectively, bringing them closer to human-like computer interaction. The full research paper, “Learning from Online Videos at Inference Time for Computer-Use Agents,” is available online.

Meera Iyer