TLDR: The research paper demonstrates that large generative video models, specifically Veo 3, exhibit emergent zero-shot capabilities across a wide range of visual tasks, from perception and modeling to manipulation and reasoning. This suggests that video models are on a trajectory to become unified, general-purpose foundation models for machine vision, akin to how Large Language Models transformed natural language processing.
The world of artificial intelligence has seen a remarkable transformation in natural language processing (NLP) with the advent of Large Language Models (LLMs). These models moved NLP from specialized tools for single tasks to unified, general-purpose systems capable of understanding and generating human language across a vast array of applications. Now, a new research paper from Google DeepMind suggests that video models are poised for a similar revolution in machine vision.
The Dawn of Generalist Vision Models
Titled “Video models are zero-shot learners and reasoners”, the paper, authored by Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos, explores the emergent capabilities of generative video models. The core question posed by the researchers is whether these models can develop general-purpose visual understanding, much like LLMs did for language. Their answer, based on experiments with the video model Veo 3, is that the early signs point to yes.
The key concept here is “zero-shot learning.” In simple terms, this means the model can perform tasks it was never explicitly trained for, simply by being prompted with an instruction. Imagine asking a model to segment objects in an image, detect edges, or even solve a maze, even though it was never shown task-specific examples during training. This is what the paper reports Veo 3 can do.
Veo 3’s Multifaceted Visual Intelligence
The researchers tested Veo 3 on 62 qualitative and 7 quantitative tasks. These tasks cover a broad range of visual understanding and are grouped into four capabilities, each building on the one before it:
Perception: This foundational ability involves understanding visual information. Veo 3 demonstrated zero-shot capabilities in tasks like edge detection, segmenting objects, localizing keypoints, enhancing low-light images, deblurring, denoising, and even interpreting ambiguous images like the classic dalmatian illusion. These are tasks traditionally handled by highly specialized computer vision models.
Modeling: Building on perception, modeling involves forming an understanding of the visual world and its governing principles, such as physics. Veo 3 showed an intuitive grasp of rigid and soft body dynamics, flammability, air resistance, buoyancy, and optical phenomena like refraction and reflection. It could also categorize objects, recognize patterns, and maintain a memory of world states across video frames.
Manipulation: With the ability to perceive and model, Veo 3 can meaningfully alter the visual world. This includes a wide range of image editing tasks like background removal, style transfer, colorization, inpainting (filling in missing parts), outpainting (extending an image), and even editing images based on simple doodles. Beyond static images, it could compose scenes, generate novel views of objects, and simulate dexterous object manipulation, such as opening a jar or throwing a ball.
Reasoning: This is where perception, modeling, and manipulation integrate to tackle complex visual problems. The paper introduces the concept of “chain-of-frames” (CoF), analogous to LLMs’ “chain-of-thought.” By generating videos frame-by-frame, Veo 3 can perform step-by-step visual reasoning. Examples include solving mazes, navigating a robot, completing visual sequences, sorting numbers, and even using tools to accomplish a task.
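To make the chain-of-frames idea more concrete, here is a toy analogy in Python. It is not the paper's method and says nothing about how Veo 3 works internally: a classical breadth-first search finds a path through a small grid maze, and each step of that path is rendered as one text “frame”, so the solution unfolds step by step. The maze layout and the names `find`, `solve_maze`, and `render_frames` are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical toy maze: '#' is a wall, '.' is free, 'S' is start, 'G' is goal.
MAZE = [
    "S.#",
    ".##",
    "..G",
]


def find(maze, ch):
    """Return the (row, col) of the first occurrence of ch in the maze."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")


def solve_maze(maze):
    """Breadth-first search from 'S' to 'G'; returns the path as a list of cells."""
    start, goal = find(maze, "S"), find(maze, "G")
    parents = {start: None}  # maps each visited cell to the cell it was reached from
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(maze) and 0 <= nc < len(maze[nr])
            if inside and maze[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    if goal not in parents:
        return []  # no route exists
    path, cell = [], goal
    while cell is not None:  # walk back from goal to start
        path.append(cell)
        cell = parents[cell]
    return path[::-1]


def render_frames(maze, path):
    """Render one text frame per path step, marking the current position with '@'."""
    frames = []
    for r, c in path:
        rows = [list(row) for row in maze]
        rows[r][c] = "@"
        frames.append("\n".join("".join(row) for row in rows))
    return frames


if __name__ == "__main__":
    for i, frame in enumerate(render_frames(MAZE, solve_maze(MAZE))):
        print(f"frame {i}\n{frame}\n")
```

The difference from the paper's setting is the point of the analogy: here an explicit search algorithm produces the intermediate states, whereas the paper describes a video model producing them as generated frames from a prompt.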
Rapid Progress and Future Outlook
The paper highlights a significant performance leap from Veo 2 to Veo 3, indicating rapid advancements in video model capabilities. While Veo 3’s zero-shot performance might not always match the state-of-the-art of highly specialized models, this mirrors the early days of LLMs. The consistent improvement and the potential for inference-time scaling methods suggest that these models will continue to close the gap.
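One simple way to see why inference-time scaling could help is to sample several attempts and keep any that succeeds. If each attempt succeeds independently with probability p, the chance that at least one of k attempts succeeds is 1 - (1 - p)^k. The sketch below assumes independent attempts and uses a made-up per-attempt success rate of 0.2; the function name `pass_at_k` and the numbers are illustrative and are not taken from the paper.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    p is the assumed per-attempt success probability, k the number of attempts.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Hypothetical per-attempt success rate, chosen only for illustration.
    for k in (1, 5, 10):
        print(f"k={k}: {pass_at_k(0.2, k):.3f}")
```

Under these assumptions, repeated sampling raises the success rate quickly at first and then flattens out. In practice it only pays off when there is some way to recognize the successful attempt, such as an automatic checker.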
The researchers acknowledge that video generation is currently more expensive than running bespoke models, but they draw parallels to the rapidly falling costs of LLM inference. They foresee a future where general-purpose video models become the foundation for machine vision, replacing many task-specific models due to their versatility and emergent intelligence.
This research marks an exciting moment for vision AI, suggesting that video models are on the path to becoming unified, generalist foundation models, ushering in a “GPT-3 moment for vision.” The full research paper is titled “Video models are zero-shot learners and reasoners.”


