VESSA: Adapting Vision Models with Object-Centric Videos

TLDR: VESSA is a novel self-supervised fine-tuning method for vision foundation models that adapts them to new domains without needing labeled data. It leverages short, multi-view object-centric videos and employs a self-distillation paradigm with parameter-efficient techniques like LoRA and careful training adjustments to limit catastrophic forgetting. Experiments show VESSA consistently improves classification performance on domain-specific datasets, learning more robust and object-focused representations by utilizing temporal information from videos.

In the rapidly evolving field of artificial intelligence, foundation models have emerged as powerful tools, capable of performing a wide array of tasks across different domains. However, these sophisticated models often face challenges when applied to specialized areas with unique data characteristics or when labeled data for fine-tuning is scarce. This is particularly true for vision-centric models, where adapting them to new visual domains without extensive manual annotations has remained a significant hurdle.

Addressing this challenge, a new method called VESSA, which stands for Video-based objEct-centric Self-Supervised Adaptation, has been introduced. Developed by Jesimon Barreto, Carlos Caetano, André Araujo, and William Robson Schwartz, VESSA offers a novel approach to fine-tune visual foundation models without the need for any human-provided labels. Instead, it leverages the rich information contained within short, multi-view object-centric videos.

The core idea behind VESSA is to adapt a pre-trained vision model to a new domain by continuously learning from unlabeled video data. Unlike traditional methods that rely on supervised fine-tuning with labeled examples, VESSA employs a self-supervised learning strategy. This means the model learns by generating its own supervisory signals from the data itself, specifically from the temporal and multi-view consistency found in videos.

VESSA’s training technique is built upon a self-distillation paradigm, where a ‘student’ network learns to match the outputs of a ‘teacher’ network. Both networks are exposed to different augmented views of the same object from various frames within a video. A critical aspect of VESSA’s success lies in its careful optimization strategies. The researchers found that a naive application of self-distillation could lead to the model ‘forgetting’ its pre-trained knowledge. To prevent this, VESSA incorporates several key adjustments.
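To make the student and teacher setup concrete, here is a minimal sketch of a DINO-style self-distillation step in plain Python. It is an illustration under stated assumptions, not VESSA's released code: the temperatures, the momentum value, the moving-average teacher update, and all function names are assumptions that the article does not specify.

```python
import math

def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy of the student distribution against the teacher's.

    The temperatures are assumed values; a lower teacher temperature
    gives a sharper target distribution.
    """
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed teacher update: move each teacher parameter a small step
    toward the corresponding student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# The two sets of logits would come from different frames of the same video.
teacher_logits = [2.0, 0.0, -1.0]
print(distillation_loss([2.0, 0.0, -1.0], teacher_logits))  # student agrees
print(distillation_loss([-1.0, 0.0, 2.0], teacher_logits))  # student disagrees
```

The loss is small when the student ranks the outputs the same way as the teacher and large when it does not, which is the signal that drives adaptation without labels.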

Firstly, it carefully tunes the prediction heads of the model and deploys parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA). LoRA allows for efficient fine-tuning by only updating a small number of parameters, preserving the model’s original knowledge while adapting it to the new domain. Secondly, VESSA uses a staged unfreezing strategy, initially training only the projection head and then gradually unfreezing different parts of the model’s backbone. This controlled adaptation helps maintain stability and efficiency during the fine-tuning process.
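The two ideas in this paragraph can be sketched in a few lines of plain Python. The LoRA part follows the standard formulation (a frozen weight plus a scaled low-rank product). The unfreezing schedule is purely hypothetical: the article does not say how many stages VESSA uses or in which order backbone blocks are unfrozen, so the top-down order, the epoch counts, and the group names below are assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted
        # layer equals the pre-trained layer before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)  # out x in, rank at most `rank`
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector x."""
        return [sum(w * xi for w, xi in zip(row, x))
                for row in self.effective_weight()]

    def trainable_count(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

def trainable_groups(epoch, head_only_epochs=2, epochs_per_stage=1,
                     num_blocks=4):
    """Hypothetical schedule: head only at first, then one more backbone
    block per stage, starting from the block closest to the output."""
    groups = ["projection_head"]
    if epoch >= head_only_epochs:
        stages = (epoch - head_only_epochs) // epochs_per_stage + 1
        n = min(stages, num_blocks)
        groups += [f"block_{i}"
                   for i in range(num_blocks - 1, num_blocks - 1 - n, -1)]
    return groups

layer = LoRALinear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], rank=1)
print(layer.forward([3.0, 4.0, 5.0]), layer.trainable_count())
print(trainable_groups(0), trainable_groups(3))
```

Because B is initialized to zero, the layer reproduces the pre-trained output at the start of fine-tuning, and only the small A and B matrices would receive gradient updates.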

A unique feature of VESSA is its use of Uncertainty-Weighted Self-Distillation (UWSD) loss. This mechanism prioritizes harder training examples by modulating their contribution to the learning process based on the teacher network’s prediction uncertainty. This ensures that the model focuses its learning efforts where it’s most needed.
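The article does not give the UWSD formula, so the following is only a sketch of the general idea: weight each example's distillation loss by how uncertain the teacher is about it. Using the teacher's normalized entropy as the uncertainty measure, and the particular weight range of 1 to 2, are assumptions made for illustration and may differ from the paper's definition.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cross_entropy(p_teacher, p_student, eps=1e-12):
    """Per-example distillation loss between two distributions."""
    return -sum(pt * math.log(ps + eps)
                for pt, ps in zip(p_teacher, p_student))

def uncertainty_weights(teacher_batch):
    """Assumed weighting: 1 for a fully confident teacher,
    2 for a teacher that predicts a uniform distribution."""
    max_entropy = math.log(len(teacher_batch[0]))
    return [1.0 + entropy(p) / max_entropy for p in teacher_batch]

def weighted_distillation_loss(teacher_batch, student_batch):
    """Weighted mean of per-example losses, normalized by total weight."""
    weights = uncertainty_weights(teacher_batch)
    losses = [cross_entropy(t, s)
              for t, s in zip(teacher_batch, student_batch)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

teacher = [[0.98, 0.01, 0.01],   # confident teacher: lower weight
           [0.40, 0.35, 0.25]]   # uncertain teacher: higher weight
student = [[0.90, 0.05, 0.05],
           [0.20, 0.50, 0.30]]
print(uncertainty_weights(teacher))
print(weighted_distillation_loss(teacher, student))
```

Normalizing by the sum of the weights keeps the loss on the same scale as an unweighted mean, so the weighting changes which examples dominate rather than the overall magnitude.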

The benefits of using multi-view object observations from videos are significant. Videos naturally provide diverse perspectives and capture conditions of an object, allowing VESSA to learn representations that are robust to changes in viewpoint, lighting, and other environmental factors, all without requiring any annotations. This temporal diversity is a key differentiator from image-based self-supervised methods.
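One simple way to turn a video into a pair of views is to pick two frames that are some distance apart in time, so that the viewpoint has visibly changed between them. The sampler below is a hypothetical sketch: the `min_gap` parameter and the random assignment of frames to teacher and student are assumptions, and the article does not describe VESSA's actual frame selection.

```python
import random

def sample_frame_pair(num_frames, min_gap=1, rng=random):
    """Pick two frame indices at least `min_gap` frames apart and return
    them as (teacher_index, student_index) in random order."""
    if min_gap < 1 or num_frames <= min_gap:
        raise ValueError("video is too short for the requested gap")
    first = rng.randrange(num_frames - min_gap)
    second = rng.randrange(first + min_gap, num_frames)
    pair = [first, second]
    rng.shuffle(pair)  # either frame may go to the teacher
    return pair[0], pair[1]

rng = random.Random(0)
teacher_idx, student_idx = sample_frame_pair(30, min_gap=5, rng=rng)
print(teacher_idx, student_idx)
```

Each selected frame would then pass through the usual image augmentations before being fed to its network, so the pair differs in both viewpoint and appearance.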

The effectiveness of VESSA was rigorously tested across three different vision foundation models (DINO, DINOv2, and TIPS) and two large-scale video datasets (MVImageNet and CO3D). The results consistently demonstrated that VESSA leads to notable improvements in downstream classification tasks, outperforming both the base pre-trained models and other existing adaptation methods. For instance, on the CO3D dataset, VESSA applied to DINOv2 achieved a top-1 accuracy of 91.85%, a statistically significant improvement over other approaches.

Qualitative analyses further highlighted VESSA’s ability to learn more semantically meaningful and object-centric representations. While baseline models often focused on background similarities during object retrieval, VESSA consistently attended to the object of interest, even when its texture or color varied from the query image. This indicates a deeper understanding of the object itself, rather than just its surrounding context.
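Retrieval comparisons of this kind are usually run by embedding every image with the model and ranking gallery images by their similarity to the query embedding. The sketch below assumes cosine similarity as the ranking score, which is a common choice but is not confirmed by the article; the toy two-dimensional embeddings are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def retrieve(query, gallery, top_k=3):
    """Return the indices of the top_k gallery embeddings most similar
    to the query, best match first."""
    scores = [(cosine_similarity(query, g), i) for i, g in enumerate(gallery)]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scores[:top_k]]

gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(retrieve([1.0, 0.0], gallery, top_k=2))
```

With an object-centric embedding, images of the same object under different textures or backgrounds should land near the top of this ranking, whereas a background-driven embedding would rank scene look-alikes first.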

While VESSA marks a significant step forward, the researchers acknowledge certain limitations. Like many fine-tuning methods, it can exhibit a tendency to ‘forget’ previously acquired general knowledge. Additionally, its reliance on object-centric video data with multiple viewpoints might limit its applicability in scenarios where such structured data is not readily available. Nevertheless, VESSA opens up exciting new avenues for adapting foundation models to diverse visual contexts without the prohibitive cost and effort of manual labeling.

The code for VESSA is publicly available, allowing other researchers and practitioners to explore and build upon this innovative approach. For more technical details, you can refer to the full research paper: VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
