VESSA: Adapting Vision Models with Object-Centric Videos

TLDR: VESSA is a novel self-supervised fine-tuning method for vision foundation models that adapts them to new domains without needing labeled data. It leverages short, multi-view object-centric videos and employs a self-distillation paradigm with parameter-efficient techniques like LoRA and careful training adjustments to prevent catastrophic forgetting. Experiments show VESSA consistently improves classification performance on domain-specific datasets, learning more robust and object-focused representations by utilizing temporal information from videos.

In the rapidly evolving field of artificial intelligence, foundation models have emerged as powerful tools, capable of performing a wide array of tasks across different domains. However, these sophisticated models often face challenges when applied to specialized areas with unique data characteristics or when labeled data for fine-tuning is scarce. This is particularly true for vision-centric models, where adapting them to new visual domains without extensive manual annotations has remained a significant hurdle.

To address this challenge, a new method called VESSA, which stands for Video-based objEct-centric Self-Supervised Adaptation, has been introduced. Developed by Jesimon Barreto, Carlos Caetano, André Araujo, and William Robson Schwartz, VESSA offers a novel approach to fine-tuning visual foundation models without any human-provided labels. Instead, it leverages the rich information contained in short, multi-view object-centric videos.

The core idea behind VESSA is to adapt a pre-trained vision model to a new domain by continuously learning from unlabeled video data. Unlike traditional methods that rely on supervised fine-tuning with labeled examples, VESSA employs a self-supervised learning strategy. This means the model learns by generating its own supervisory signals from the data itself, specifically from the temporal and multi-view consistency found in videos.

VESSA’s training technique is built upon a self-distillation paradigm, where a ‘student’ network learns to match the outputs of a ‘teacher’ network. The two networks are shown different augmented views of the same object, taken from different frames of a video. A critical aspect of VESSA’s success lies in its careful optimization strategies. The researchers found that a naive application of self-distillation could lead to the model ‘forgetting’ its pre-trained knowledge. To mitigate this, VESSA incorporates several key adjustments.
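The student-teacher setup can be illustrated with a small NumPy sketch. It assumes a DINO-style formulation: the teacher output is sharpened with a lower temperature, the student is trained with a cross-entropy loss against it, and the teacher weights follow the student through an exponential moving average. The article does not spell out these details, so the temperatures, the momentum value, and all function names here are illustrative assumptions rather than VESSA's actual implementation.

```python
import numpy as np

def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over the last axis (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student distributions, averaged over the batch.

    The temperatures are assumed values borrowed from DINO-style training.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, t_teacher))  # treated as a fixed target
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the matching student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: two "frames" of the same objects, represented as random feature vectors.
rng = np.random.default_rng(0)
frame_a = rng.normal(size=(4, 16))                   # view given to the student
frame_b = frame_a + 0.1 * rng.normal(size=(4, 16))   # nearby view given to the teacher
student_w = 0.1 * rng.normal(size=(16, 8))           # stand-in for the student network
teacher_w = student_w.copy()                         # teacher starts as a copy

loss = distillation_loss(frame_a @ student_w, frame_b @ teacher_w)
(teacher_w,) = ema_update([teacher_w], [student_w])
print(f"distillation loss: {loss:.4f}")
```

In a real training loop, gradients would flow only through the student, and the teacher would be updated with `ema_update` after each optimizer step.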

Firstly, it carefully tunes the prediction heads of the model and deploys parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA). LoRA allows for efficient fine-tuning by only updating a small number of parameters, preserving the model’s original knowledge while adapting it to the new domain. Secondly, VESSA uses a staged unfreezing strategy, initially training only the projection head and then gradually unfreezing different parts of the model’s backbone. This controlled adaptation helps maintain stability and efficiency during the fine-tuning process.
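As a rough illustration of these two ideas, the sketch below shows a LoRA-augmented linear layer and a staged unfreezing schedule in NumPy. The rank, scaling factor, group names, and the order in which backbone groups are unfrozen are assumptions made for the example; the article does not state the values or the order VESSA uses.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x @ W + scale * (x @ A) @ B, with W frozen and only A, B trainable."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                                      # frozen pre-trained weight
        self.lora_a = rng.normal(scale=0.01, size=(d_in, rank))   # trainable down-projection
        self.lora_b = np.zeros((rank, d_out))                     # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight + self.scale * (x @ self.lora_a) @ self.lora_b

    def trainable_parameter_count(self):
        return self.lora_a.size + self.lora_b.size

def trainable_groups(epoch, head_only_epochs=2, epochs_per_stage=1,
                     backbone_groups=("block_3", "block_2", "block_1")):
    """Hypothetical schedule: train the head first, then unfreeze one backbone group per stage."""
    if epoch < head_only_epochs:
        return ["head"]
    stages = 1 + (epoch - head_only_epochs) // epochs_per_stage
    return ["head"] + list(backbone_groups[:min(stages, len(backbone_groups))])

pretrained = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(pretrained)
x = np.ones((2, 64))

# Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly.
print(np.allclose(layer(x), x @ pretrained))
print(layer.trainable_parameter_count(), "trainable vs", pretrained.size, "frozen")
for epoch in range(5):
    print(epoch, trainable_groups(epoch))
```

The zero initialization of the second low-rank matrix is what lets adaptation start from the pre-trained behavior, which is one reason LoRA is a natural fit when the goal is to limit forgetting.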

A unique feature of VESSA is its use of Uncertainty-Weighted Self-Distillation (UWSD) loss. This mechanism prioritizes harder training examples by modulating their contribution to the learning process based on the teacher network’s prediction uncertainty. This ensures that the model focuses its learning efforts where it’s most needed.
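One plausible way to realize such a weighting is sketched below. It measures teacher uncertainty as the normalized entropy of the teacher's output distribution and gives higher-entropy samples a larger weight in the distillation loss. The exact UWSD formula is not given in this article, so the weighting rule `1 + uncertainty` and the function names are assumptions chosen only to illustrate the idea.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def uncertainty_weighted_loss(student_logits, teacher_logits):
    """Distillation loss in which uncertain teacher predictions receive larger weights.

    Returns the scalar loss and the per-sample weights (assumed weighting rule).
    """
    p_teacher = np.exp(log_softmax(teacher_logits))
    per_sample = -(p_teacher * log_softmax(student_logits)).sum(axis=-1)

    # Normalized teacher entropy in [0, 1]: 0 = fully confident, 1 = uniform.
    entropy = -(p_teacher * np.log(p_teacher + 1e-12)).sum(axis=-1)
    uncertainty = entropy / np.log(p_teacher.shape[-1])

    weights = 1.0 + uncertainty            # hypothetical rule: harder samples count more
    weights = weights / weights.mean()     # keep the overall loss scale unchanged
    return float((weights * per_sample).mean()), weights

teacher_logits = np.array([[10.0, 0.0, 0.0],    # confident teacher prediction
                           [0.0, 0.0, 0.0]])    # maximally uncertain teacher prediction
student_logits = np.array([[2.0, 0.5, 0.1],
                           [0.3, 0.2, 0.1]])

loss, weights = uncertainty_weighted_loss(student_logits, teacher_logits)
print("weights:", weights)   # the second (uncertain) sample gets the larger weight
print("loss:", loss)
```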

The benefits of using multi-view object observations from videos are significant. Videos naturally provide diverse perspectives and capture conditions of an object, allowing VESSA to learn representations that are robust to changes in viewpoint, lighting, and other environmental factors, all without requiring any annotations. This temporal diversity is a key differentiator from image-based self-supervised methods.
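To make the contrast with image-based methods concrete, the sketch below draws a pair of distinct frames from one video so that the two views differ in viewpoint rather than only in synthetic augmentation. The sampling rule and the `max_gap` parameter are hypothetical; the article does not describe how VESSA selects frames.

```python
import random

def sample_frame_pair(num_frames, max_gap, rng):
    """Pick two distinct frame indices from one video, at most max_gap frames apart."""
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least two frames and max_gap >= 1")
    first = rng.randrange(num_frames)
    low = max(0, first - max_gap)
    high = min(num_frames - 1, first + max_gap)
    candidates = [i for i in range(low, high + 1) if i != first]
    return first, rng.choice(candidates)

rng = random.Random(0)
pairs = [sample_frame_pair(num_frames=30, max_gap=5, rng=rng) for _ in range(3)]
print(pairs)  # each pair would feed one student view and one teacher view
```

A larger `max_gap` yields more viewpoint change between the two views, at the cost of pairs that may look less alike.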

The effectiveness of VESSA was rigorously tested across three different vision foundation models (DINO, DINOv2, and TIPS) and two large-scale video datasets (MVImageNet and CO3D). The results consistently demonstrated that VESSA leads to notable improvements in downstream classification tasks, outperforming both the base pre-trained models and other existing adaptation methods. For instance, on the CO3D dataset, VESSA applied to DINOv2 achieved a top-1 accuracy of 91.85%, a statistically significant improvement over other approaches.

Qualitative analyses further highlighted VESSA’s ability to learn more semantically meaningful and object-centric representations. While baseline models often focused on background similarities during object retrieval, VESSA consistently attended to the object of interest, even when its texture or color varied from the query image. This indicates a deeper understanding of the object itself, rather than just its surrounding context.

While VESSA marks a significant step forward, the researchers acknowledge certain limitations. Like many fine-tuning methods, it can exhibit a tendency to ‘forget’ previously acquired general knowledge. Additionally, its reliance on object-centric video data with multiple viewpoints might limit its applicability in scenarios where such structured data is not readily available. Nevertheless, VESSA opens up exciting new avenues for adapting foundation models to diverse visual contexts without the prohibitive cost and effort of manual labeling.

The code for VESSA is publicly available, allowing other researchers and practitioners to explore and build upon this innovative approach. For more technical details, you can refer to the full research paper: VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models.
