Unlocking Zero-Shot Vision: How Video Models Are Becoming Generalist Learners

TLDR: The research paper demonstrates that large generative video models, specifically Veo 3, exhibit emergent zero-shot capabilities across a wide range of visual tasks, from perception and modeling to manipulation and reasoning. This suggests that video models are on a trajectory to become unified, general-purpose foundation models for machine vision, akin to how Large Language Models transformed natural language processing.

The world of artificial intelligence has seen a remarkable transformation in natural language processing (NLP) with the advent of Large Language Models (LLMs). These models moved NLP from specialized tools for single tasks to unified, general-purpose systems capable of understanding and generating human language across a vast array of applications. Now, a new research paper from Google DeepMind suggests that video models are poised for a similar revolution in machine vision.

The Dawn of Generalist Vision Models

The paper, titled “Video models are zero-shot learners and reasoners” and authored by Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos, explores the emergent capabilities of generative video models. The core question is whether these models can develop general-purpose visual understanding, much as LLMs did for language. The authors argue that they can, and they support the claim with the performance of Veo 3, Google DeepMind’s video model.

The key concept here is “zero-shot learning.” In simple terms, this means the model can perform tasks it was never explicitly trained for, simply by being prompted with an instruction. Imagine asking a model to segment objects in an image, detect edges, or even solve a maze, without any task-specific training or fine-tuning for those tasks. This is what the paper reports Veo 3 doing.

Veo 3’s Multifaceted Visual Intelligence

The researchers tested Veo 3 on 62 qualitative and 7 quantitative tasks. The tasks are grouped into four capabilities, each building on the one before:

Perception: This foundational ability involves understanding visual information. Veo 3 demonstrated zero-shot capabilities in tasks like edge detection, object segmentation, keypoint localization, low-light enhancement, deblurring, denoising, and even the interpretation of ambiguous images like the classic dalmatian illusion. These are tasks traditionally handled by highly specialized computer vision models.

Modeling: Building on perception, modeling involves forming an understanding of the visual world and its governing principles, such as physics. Veo 3 showed an intuitive grasp of rigid and soft body dynamics, flammability, air resistance, buoyancy, and optical phenomena like refraction and reflection. It could also categorize objects, recognize patterns, and maintain a memory of world states across video frames.

Manipulation: With the ability to perceive and model, Veo 3 can meaningfully alter the visual world. This includes a wide range of image editing tasks like background removal, style transfer, colorization, inpainting (filling in missing parts), outpainting (extending an image), and even editing images based on simple doodles. Beyond static images, it could compose scenes, generate novel views of objects, and simulate dexterous object manipulation, such as opening a jar or throwing a ball.

Reasoning: This is where perception, modeling, and manipulation integrate to tackle complex visual problems. The paper introduces the concept of “chain-of-frames” (CoF), analogous to LLMs’ “chain-of-thought.” By generating videos frame-by-frame, Veo 3 can perform step-by-step visual reasoning. Examples include solving mazes, navigating a robot, completing visual sequences, sorting numbers, and even using tools to accomplish a task.
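To make the step-by-step idea concrete, here is a toy sketch in Python. It is not the paper's evaluation code and does not call Veo 3. It assumes, purely for illustration, that each generated frame has already been reduced to the grid cell the agent occupies, and it checks that the chain of positions is a legal maze solution. The names `is_valid_chain`, `MAZE`, `good`, and `bad` are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) of the agent in one frame


def is_valid_chain(maze: List[str], frames: List[Cell]) -> bool:
    """Return True if the per-frame positions walk from 'S' to 'G'
    one adjacent cell at a time without entering a wall ('#')."""
    rows, cols = len(maze), len(maze[0])

    def char_at(cell: Cell) -> str:
        r, c = cell
        # Treat anything outside the grid as a wall.
        return maze[r][c] if 0 <= r < rows and 0 <= c < cols else "#"

    if not frames:
        return False
    if char_at(frames[0]) != "S" or char_at(frames[-1]) != "G":
        return False

    # Each consecutive pair of frames must be a single up/down/left/right move.
    for prev, cur in zip(frames, frames[1:]):
        step = abs(prev[0] - cur[0]) + abs(prev[1] - cur[1])
        if step != 1 or char_at(cur) == "#":
            return False
    return True


# Hypothetical 3x3 maze: S = start, G = goal, # = wall, . = free cell.
MAZE = [
    "S.#",
    ".##",
    "..G",
]

good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  # stays on free cells
bad = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]   # walks through a wall

print(is_valid_chain(MAZE, good))
print(is_valid_chain(MAZE, bad))
```

The point of the sketch is that a chain of frames, like a chain of thought, can be judged on every intermediate step as well as on the final state.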

Rapid Progress and Future Outlook

The paper highlights a significant performance leap from Veo 2 to Veo 3, indicating rapid advancements in video model capabilities. While Veo 3’s zero-shot performance might not always match the state-of-the-art of highly specialized models, this mirrors the early days of LLMs. The consistent improvement and the potential for inference-time scaling methods suggest that these models will continue to close the gap.

The researchers acknowledge that video generation is currently more expensive than running bespoke models, but they draw parallels to the rapidly falling costs of LLM inference. They foresee a future where general-purpose video models become the foundation for machine vision, replacing many task-specific models due to their versatility and emergent intelligence.

This research marks an exciting moment for vision AI, suggesting that video models are on the path to becoming unified, generalist foundation models, ushering in a “GPT-3 moment for vision.” The full research paper is “Video models are zero-shot learners and reasoners.”
