TLDR: Watch & Learn (W&L) is a framework that enables Computer Use Agents (CUAs) to learn how to operate computers by converting readily available online human demonstration videos into executable UI trajectories. By framing the problem as an inverse dynamics objective, W&L generates over 53,000 high-quality trajectories without manual annotation, significantly improving CUA performance in both in-context learning and supervised training on challenging benchmarks like OSWorld.
Imagine an artificial intelligence that can learn to use any computer application just by watching online tutorial videos, much like a human would. This is the idea behind “Watch and Learn” (W&L), a new framework developed by researchers from Google Cloud AI Research, Google DeepMind, and The Ohio State University. The approach tackles a major hurdle in developing Computer Use Agents (CUAs): the scarcity of high-quality training data.
CUAs are AI systems designed to interact with software and the web, performing tasks from everyday productivity to complex enterprise automation. For these agents to be truly effective, they need to understand how to plan multi-step workflows and translate those plans into concrete actions within diverse and constantly changing applications. Traditionally, gathering the necessary annotated data for training these agents has been incredibly expensive and time-consuming.
The web, however, is a treasure trove of human demonstration videos, such as YouTube tutorials and screencasts, which naturally showcase complex workflows across countless applications. The W&L framework taps into this vast resource, converting these raw human demonstration videos into executable UI trajectories at scale. Instead of trying to directly generate these trajectories or relying on complicated, multi-stage reasoning methods, W&L redefines the problem as an “inverse dynamics objective.” This means the system learns to predict the user’s action by observing consecutive screen states – essentially, figuring out what action caused the screen to change from one moment to the next.
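To make the inverse dynamics idea concrete, here is a minimal sketch in PyTorch of how such an objective could be set up. The backbone, action vocabulary, and tensor shapes are illustrative assumptions, not the paper's implementation: the model simply takes two consecutive screenshots and is trained to predict the action that caused the transition between them.

```python
# Minimal sketch of an inverse dynamics objective (illustrative only, not the paper's code).
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, num_action_types: int, vision_dim: int = 512):
        super().__init__()
        # Hypothetical vision-only encoder; the actual backbone in the paper may differ.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, vision_dim),
        )
        # Predict the action type (e.g. click, scroll, type, wait, move) from the
        # concatenated embeddings of the two consecutive screens.
        self.action_head = nn.Linear(2 * vision_dim, num_action_types)

    def forward(self, screen_t: torch.Tensor, screen_t1: torch.Tensor) -> torch.Tensor:
        z_t = self.encoder(screen_t)    # screen before the action
        z_t1 = self.encoder(screen_t1)  # screen after the action
        return self.action_head(torch.cat([z_t, z_t1], dim=-1))

# One training step on a batch of (state, next_state, action) transitions.
idm = InverseDynamicsModel(num_action_types=5)
loss_fn = nn.CrossEntropyLoss()
screens_t = torch.randn(8, 3, 224, 224)   # screenshots before each action
screens_t1 = torch.randn(8, 3, 224, 224)  # screenshots after each action
actions = torch.randint(0, 5, (8,))       # ground-truth action labels
loss = loss_fn(idm(screens_t, screens_t1), actions)
loss.backward()
```

In the actual framework this kind of model is trained on a large corpus of labeled state transitions, as described in the next section.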
How Watch & Learn Works
The W&L framework operates in three main stages:
1. Training an Inverse Dynamics Model (IDM): The core of W&L is an IDM that predicts user actions from two consecutive screen observations. To train this model, the researchers built a massive dataset of over 630,000 state transitions. This corpus was created by synthesizing interactions with live web pages and incorporating existing human-annotated data. The IDM uses a vision-only architecture, similar to how humans visually perceive an interface, and predicts actions like clicking, scrolling, typing, waiting, and moving the cursor.
2. Generating Data from Videos: Once the IDM is trained, W&L retrieves suitable tutorial videos from platforms like YouTube. A smart retrieval system, enhanced by Gemini 2.5 Flash, refines search queries to find relevant instructional content. These videos are then filtered to ensure they are high-quality screencasts, removing irrelevant segments like talking-head introductions or blurred transitions. The trained IDM is then applied to these filtered videos, transforming raw human demonstrations into structured, executable UI trajectories without any manual annotation.
3. Applications of Trajectories: The automatically labeled trajectories serve two crucial purposes. First, they act as “in-context exemplars” during inference. This means that when a CUA is given a new task, it can refer to these video-derived demonstrations to understand planning and grounding priors, as well as application-specific knowledge. Second, these trajectories are used as “supervised training data” to fine-tune existing CUA models, significantly improving their general knowledge and performance. (The pseudo-labeling step and the exemplar formatting are both illustrated in the code sketch after this list.)
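As a rough illustration of stages 2 and 3, the sketch below shows how a trained IDM could be run over consecutive frames of a filtered tutorial video to produce a trajectory, and how that trajectory could then be rendered as an in-context exemplar. The helper names (`label_video`, `to_exemplar`) and data layout are hypothetical, not the paper's pipeline.

```python
# Illustrative sketch of stages 2 and 3 (hypothetical helpers; not the paper's code).
from typing import Dict, List

import torch

ACTION_NAMES = ["click", "scroll", "type", "wait", "move"]  # action types named in the article

def label_video(frames: List[torch.Tensor], idm) -> List[Dict]:
    """Pseudo-label consecutive screencast frames with a trained IDM (stage 2)."""
    trajectory = []
    for t in range(len(frames) - 1):
        # The IDM predicts which action caused the transition frame_t -> frame_{t+1}.
        logits = idm(frames[t].unsqueeze(0), frames[t + 1].unsqueeze(0))
        trajectory.append({"step": t, "action": ACTION_NAMES[int(logits.argmax(dim=-1))]})
    return trajectory

def to_exemplar(task: str, trajectory: List[Dict]) -> str:
    """Render a video-derived trajectory as an in-context demonstration (stage 3)."""
    steps = "\n".join(f"{s['step'] + 1}. {s['action']}" for s in trajectory)
    return f"Task: {task}\nDemonstration:\n{steps}"

# The same trajectories can instead be written out as (state, action) pairs and used
# as supervised fine-tuning data for an open-source CUA.
```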
Impressive Results on OSWorld
The effectiveness of W&L was rigorously tested on OSWorld, a challenging benchmark that evaluates agents in real desktop and operating system environments. The results were consistently positive across various model categories:
- General-purpose multimodal models (like Gemini 2.5 Flash, OpenAI o3, and Claude 4 Sonnet) showed performance improvements of +1.6 to +3.0 percentage points when provided with W&L exemplars.
- The Jedi agentic framework, a state-of-the-art vision-only agent, saw a +2.2 point gain.
- Open-source CUAs, such as UI-TARS-7B and Qwen 2.5-VL 7B, benefited even more from supervised fine-tuning with the 53,000 video-derived trajectories. Qwen 2.5-VL, a general-purpose multimodal model not originally tailored for computer use, experienced a remarkable jump from 1.9% to 13.0% success rate (+11.1 points), demonstrating the significant value of this task-specific supervision.
The research also highlighted that the accuracy of action labels is paramount. W&L’s dedicated IDM significantly outperformed other labeling methods, leading directly to better downstream performance for the agents. Furthermore, the study found that while targeted retrieval of videos is beneficial, even randomly selected exemplars didn’t actively harm performance, indicating the robustness of the underlying action labels.
Looking Ahead
The “Watch and Learn” framework represents a significant step forward in making Computer Use Agents more capable and adaptable. By leveraging the vast amount of human demonstration videos available online, it provides a scalable and practical foundation for advancing CUAs towards real-world deployment. Future work aims to expand the IDM to support more complex actions like drag-and-drop, combine or split tutorials for longer tasks, and explore reinforcement learning applications. You can read the full research paper here: Watch and Learn: Learning to Use Computers from Online Videos.