TLDR: Watch & Learn (W&L) is a framework that enables Computer Use Agents (CUAs) to learn how to operate computers by converting readily available online human demonstration videos into executable UI trajectories. By framing the problem as an inverse dynamics objective, W&L generates over 53,000 high-quality trajectories without manual annotation, significantly improving CUA performance in both in-context learning and supervised training on challenging benchmarks like OSWorld.
Imagine an artificial intelligence that can learn to use any computer application just by watching online tutorial videos, much like a human would. This is the idea behind “Watch and Learn” (W&L), a new framework developed by researchers from Google Cloud AI Research, Google DeepMind, and The Ohio State University. The approach tackles a major hurdle in developing Computer Use Agents (CUAs): the scarcity of high-quality training data.
CUAs are AI systems designed to interact with software and the web, performing tasks from everyday productivity to complex enterprise automation. For these agents to be truly effective, they need to understand how to plan multi-step workflows and translate those plans into concrete actions within diverse and constantly changing applications. Traditionally, gathering the necessary annotated data for training these agents has been incredibly expensive and time-consuming.
The web, however, is a treasure trove of human demonstration videos, such as YouTube tutorials and screencasts, which naturally showcase complex workflows across countless applications. The W&L framework taps into this vast resource, converting these raw human demonstration videos into executable UI trajectories at scale. Instead of trying to directly generate these trajectories or relying on complicated, multi-stage reasoning methods, W&L redefines the problem as an “inverse dynamics objective.” This means the system learns to predict the user’s action by observing consecutive screen states – essentially, figuring out what action caused the screen to change from one moment to the next.
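To make the inverse dynamics idea concrete, here is a minimal sketch in PyTorch of how such an objective could be set up. The backbone, action vocabulary, and tensor shapes are illustrative assumptions, not the paper's implementation: the model simply takes two consecutive screenshots and is trained to predict the action that caused the transition between them.

```python
# Minimal sketch of an inverse dynamics objective (illustrative only, not the paper's code).
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, num_action_types: int, vision_dim: int = 512):
        super().__init__()
        # Hypothetical vision-only encoder; the actual backbone in the paper may differ.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, vision_dim),
        )
        # Predict the action type (e.g. click, scroll, type, wait, move) from the
        # concatenated embeddings of the two consecutive screens.
        self.action_head = nn.Linear(2 * vision_dim, num_action_types)

    def forward(self, screen_t: torch.Tensor, screen_t1: torch.Tensor) -> torch.Tensor:
        z_t = self.encoder(screen_t)    # screen before the action
        z_t1 = self.encoder(screen_t1)  # screen after the action
        return self.action_head(torch.cat([z_t, z_t1], dim=-1))

# One training step on a batch of (state, next_state, action) transitions.
idm = InverseDynamicsModel(num_action_types=5)
loss_fn = nn.CrossEntropyLoss()
screens_t = torch.randn(8, 3, 224, 224)   # screenshots before each action
screens_t1 = torch.randn(8, 3, 224, 224)  # screenshots after each action
actions = torch.randint(0, 5, (8,))       # ground-truth action labels
loss = loss_fn(idm(screens_t, screens_t1), actions)
loss.backward()
```

In the actual framework this kind of model is trained on a large corpus of labeled state transitions, as described in the next section.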
How Watch & Learn Works
The W&L framework operates in three main stages:
1. Training an Inverse Dynamics Model (IDM): The core of W&L is an IDM that predicts user actions from two consecutive screen observations. To train this model, the researchers built a massive dataset of over 630,000 state transitions. This corpus was created by synthesizing interactions with live web pages and incorporating existing human-annotated data. The IDM uses a vision-only architecture, similar to how humans visually perceive an interface, and predicts actions like clicking, scrolling, typing, waiting, and moving the cursor.
2. Generating Data from Videos: Once the IDM is trained, W&L retrieves suitable tutorial videos from platforms like YouTube. A smart retrieval system, enhanced by Gemini 2.5 Flash, refines search queries to find relevant instructional content. These videos are then filtered to ensure they are high-quality screencasts, removing irrelevant segments like talking-head introductions or blurred transitions. The trained IDM is then applied to these filtered videos, transforming raw human demonstrations into structured, executable UI trajectories without any manual annotation.
3. Applications of Trajectories: The automatically labeled trajectories serve two crucial purposes. First, they act as “in-context exemplars” during inference. This means that when a CUA is given a new task, it can refer to these video-derived demonstrations to understand planning and grounding priors, as well as application-specific knowledge. Second, these trajectories are used as “supervised training data” to fine-tune existing CUA models, significantly improving their general knowledge and performance. (The pseudo-labeling step and the exemplar formatting are both illustrated in the code sketch after this list.)
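As a rough illustration of stages 2 and 3, the sketch below shows how a trained IDM could be run over consecutive frames of a filtered tutorial video to produce a trajectory, and how that trajectory could then be rendered as an in-context exemplar. The helper names (`label_video`, `to_exemplar`) and data layout are hypothetical, not the paper's pipeline.

```python
# Illustrative sketch of stages 2 and 3 (hypothetical helpers; not the paper's code).
from typing import Dict, List

import torch

ACTION_NAMES = ["click", "scroll", "type", "wait", "move"]  # action types named in the article

def label_video(frames: List[torch.Tensor], idm) -> List[Dict]:
    """Pseudo-label consecutive screencast frames with a trained IDM (stage 2)."""
    trajectory = []
    for t in range(len(frames) - 1):
        # The IDM predicts which action caused the transition frame_t -> frame_{t+1}.
        logits = idm(frames[t].unsqueeze(0), frames[t + 1].unsqueeze(0))
        trajectory.append({"step": t, "action": ACTION_NAMES[int(logits.argmax(dim=-1))]})
    return trajectory

def to_exemplar(task: str, trajectory: List[Dict]) -> str:
    """Render a video-derived trajectory as an in-context demonstration (stage 3)."""
    steps = "\n".join(f"{s['step'] + 1}. {s['action']}" for s in trajectory)
    return f"Task: {task}\nDemonstration:\n{steps}"

# The same trajectories can instead be written out as (state, action) pairs and used
# as supervised fine-tuning data for an open-source CUA.
```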
Impressive Results on OSWorld
The effectiveness of W&L was rigorously tested on OSWorld, a challenging benchmark that evaluates agents in real desktop and operating system environments. The results were consistently positive across various model categories:
- General-purpose multimodal models (like Gemini 2.5 Flash, OpenAI o3, and Claude 4 Sonnet) showed performance improvements of +1.6 to +3.0 percentage points when provided with W&L exemplars.
- The Jedi agentic framework, a state-of-the-art vision-only agent, saw a +2.2 point gain.
- Open-source CUAs, such as UI-TARS-7B and Qwen 2.5-VL 7B, benefited even more from supervised fine-tuning with the 53,000 video-derived trajectories. Qwen 2.5-VL, a general-purpose multimodal model not originally tailored for computer use, experienced a remarkable jump from 1.9% to 13.0% success rate (+11.1 points), demonstrating the significant value of this task-specific supervision.
The research also highlighted that the accuracy of action labels is paramount. W&L’s dedicated IDM significantly outperformed other labeling methods, leading directly to better downstream performance for the agents. Furthermore, the study found that while targeted retrieval of videos is beneficial, even randomly selected exemplars didn’t actively harm performance, indicating the robustness of the underlying action labels.
Looking Ahead
The “Watch and Learn” framework represents a significant step forward in making Computer Use Agents more capable and adaptable. By leveraging the vast amount of human demonstration videos available online, it provides a scalable and practical foundation for advancing CUAs towards real-world deployment. Future work aims to expand the IDM to support more complex actions like drag-and-drop, combine or split tutorials for longer tasks, and explore reinforcement learning applications. You can read the full research paper here: Watch and Learn: Learning to Use Computers from Online Videos.