Improving Multi-modal Video AI Fine-Tuning with Oracle Ranking

TLDR: Oracle-RLAIF is a new framework for fine-tuning multi-modal video models. It uses an “Oracle ranker” to provide quality rankings of model responses instead of a traditional reward model that assigns scores. Coupled with a novel rank-based loss function called GRPO rank, this approach makes fine-tuning more flexible and data-efficient. Experiments show Oracle-RLAIF significantly improves video comprehension, especially in temporal and action-based tasks, outperforming existing state-of-the-art methods.

Recent advances in artificial intelligence have produced sophisticated multi-modal video models that can understand and respond to video content. These models can generate captions, answer questions about videos, and reason about visual events. However, improving them further through fine-tuning often demands substantial effort and resources, especially when it depends on gathering human feedback.

Traditionally, fine-tuning involves two main steps: first, supervised fine-tuning (SFT) uses human-annotated videos to teach the model to produce correct and relevant answers. Second, a reinforcement learning phase, often called Reinforcement Learning from Human Feedback (RLHF), uses human preferences over different model outputs to further improve video comprehension. The challenge with RLHF is its high cost and inefficiency in collecting human labels, especially as AI models grow larger.

To address this, researchers have explored Reinforcement Learning from AI Feedback (RLAIF), where an AI acts as a “judge” instead of a human. Current RLAIF methods typically rely on a specialized reward model that assigns a score to each response, indicating its quality. Training such a reward model can be complex, and depending on a calibrated numerical score restricts which sources of feedback can be used.

A new framework, called Oracle-RLAIF, offers a more flexible and cost-effective solution. Instead of a reward model that scores responses, Oracle-RLAIF uses a more general “Oracle ranker.” This Oracle ranker simply orders candidate model responses from best to worst, rather than assigning a specific numerical score. This approach eliminates the need for a perfectly calibrated reward model, making the fine-tuning process much more adaptable to various scenarios, such as using feedback from general-purpose AI models or distilling knowledge from larger models.
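As a rough illustration of why ranking is a weaker requirement than calibrated scoring, the Python sketch below wraps an arbitrary scoring function as a ranker and shows that the resulting order is unchanged when the scores are rescaled monotonically. The name `ranker_from_scores` and the toy length-based scorer are hypothetical stand-ins, not part of the paper; a real Oracle ranker would be a multi-modal model that also sees the video.

```python
from typing import Callable, List, Sequence


def ranker_from_scores(score_fn: Callable[[str], float]) -> Callable[[Sequence[str]], List[int]]:
    """Wrap a scorer (hypothetical) as a ranker that returns 1-based ranks, 1 = best."""

    def rank(candidates: Sequence[str]) -> List[int]:
        scores = [score_fn(c) for c in candidates]
        # Indices sorted from highest score to lowest.
        order = sorted(range(len(candidates)), key=lambda i: -scores[i])
        ranks = [0] * len(candidates)
        for position, index in enumerate(order, start=1):
            ranks[index] = position
        return ranks

    return rank


# Toy scorer used only for illustration: longer answers count as "better".
by_length = ranker_from_scores(len)
# The same preference on a completely different numerical scale.
rescaled = ranker_from_scores(lambda text: 100 * len(text) + 7)

candidates = ["a", "abc", "ab"]
print(by_length(candidates))  # [3, 1, 2]
print(rescaled(candidates))   # [3, 1, 2], same order despite different scores
```

Only the order reaches the fine-tuning step, so two judges that agree on the order are interchangeable even if their raw scores are not comparable.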

Alongside Oracle-RLAIF, the researchers introduced GRPO rank, a novel loss function. This function is based on Group Relative Policy Optimization (GRPO) but is specifically designed to directly optimize the model using ordinal (rank-based) feedback. GRPO rank applies non-linear penalties for rank errors, giving larger penalties when the model’s predicted rank deviates significantly from the Oracle’s ground-truth rank. It also prioritizes correctly ranking high-quality responses, which are most important for user experience.
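The article does not give the exact GRPO rank formula, so the following Python sketch is only one plausible way to obtain the two properties described above. It assumes a DCG-style logarithmic discount (an assumption, not confirmed by this article) so that larger rank deviations cost more and mistakes near the top of the list cost more than mistakes near the bottom. The function names are hypothetical, and the predicted ranks are passed in directly rather than derived from a policy.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def rank_weight(rank: int) -> float:
    """Assumed DCG-style discount: rank 1 has weight 1.0, later ranks decay logarithmically."""
    return 1.0 / math.log2(rank + 1)


def rank_penalty(predicted_rank: int, oracle_rank: int) -> float:
    """Assumed penalty: gap between the discounted weights of the two ranks."""
    return abs(rank_weight(predicted_rank) - rank_weight(oracle_rank))


def group_relative_advantages(predicted: Sequence[int], oracle: Sequence[int]) -> List[float]:
    """Turn per-response penalties into advantages normalized within the group."""
    rewards = [-rank_penalty(p, o) for p, o in zip(predicted, oracle)]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        # Every response was penalized equally, so there is no relative signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Swapping the top two responses costs more than swapping ranks 4 and 5.
print(rank_penalty(1, 2) > rank_penalty(4, 5))
# Responses 1 and 2 are swapped; responses 3 and 4 are ranked correctly.
print(group_relative_advantages([2, 1, 3, 4], [1, 2, 3, 4]))
```

Normalizing within the group is what makes the signal "group relative": a response is rewarded or penalized only in comparison with the other responses to the same prompt.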

The Oracle-RLAIF framework works by having an initial, pre-trained video language model generate multiple responses to a single prompt. The multi-modal Oracle ranker then evaluates and ranks these candidate responses based on their quality and relevance to the video. Using these rankings, the GRPO rank algorithm fine-tunes the initial model, guiding it to produce responses that align better with the Oracle’s preferences. This entire process does not require a separate reward model or a value model to calculate expected rewards, simplifying the pipeline significantly.
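The loop described above can be sketched with a toy example. This is a minimal sketch under strong assumptions, not the authors' implementation: the "policy" is a softmax over four canned responses instead of a video language model, the hypothetical `oracle_rank` function orders candidates by a hidden quality value, and a plain REINFORCE-style update with centered ranks as advantages is swapped in for the GRPO rank loss (the clipping and KL terms used in GRPO are omitted). What it does preserve is the structure: sample a group, rank it, update the policy, with no reward model and no value model.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def oracle_rank(group: Sequence[int], quality: Sequence[float]) -> List[int]:
    """Hypothetical oracle: rank = 1 + number of strictly better candidates (ties share a rank)."""
    return [1 + sum(quality[other] > quality[cand] for other in group) for cand in group]


def train(num_responses: int = 4, group_size: int = 4, steps: int = 500,
          lr: float = 0.2, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    # Hidden quality known only to the oracle: a higher response id is better.
    quality = [float(i) for i in range(num_responses)]
    logits = [0.0] * num_responses  # toy policy parameters, initially uniform

    for _ in range(steps):
        probs = softmax(logits)
        # 1. The policy samples a group of candidate responses for one prompt.
        group = rng.choices(range(num_responses), weights=probs, k=group_size)
        # 2. The oracle returns ranks only; no scores are exposed.
        ranks = oracle_rank(group, quality)
        # 3. Group-relative advantage: better than the group's mean rank is positive.
        mean_rank = sum(ranks) / len(ranks)
        advantages = [mean_rank - r for r in ranks]
        # 4. REINFORCE-style update using the gradient of log softmax.
        for cand, adv in zip(group, advantages):
            for j in range(num_responses):
                grad = (1.0 if j == cand else 0.0) - probs[j]
                logits[j] += lr * adv * grad / group_size

    return softmax(logits)


final_probs = train()
print(final_probs)  # probability mass should concentrate on the best response
```

Because advantages are computed relative to the other samples in the same group, the group itself serves as the baseline, which is why no separate value model appears anywhere in the loop.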

In empirical evaluations, Oracle-RLAIF consistently outperformed leading video language models fine-tuned with existing methods. It was tested on the video comprehension benchmarks MSVD, MSRVTT, and ActivityNet, as well as the more recent Video-MME benchmark. On Video-MME, for instance, it achieved a +6.2% improvement in overall accuracy, with notable gains in tasks requiring Temporal Perception (+21.2%), Action Recognition (+11.7%), and Object Reasoning (+11.2%). These results highlight the framework’s strength in aligning models with temporally and causally grounded responses.

However, the framework showed some performance declines in categories like Spatial Perception, Spatial Reasoning, and Information Synopsis. The researchers hypothesize that these tasks involve higher ambiguity or abstraction, which might be less effectively optimized through relative ranking alone. Such tasks might benefit more from architectural changes to the model rather than just fine-tuning techniques.

In conclusion, Oracle-RLAIF represents a significant step forward in fine-tuning large multi-modal video models. By leveraging a flexible Oracle ranker and the innovative GRPO rank algorithm, it provides a data-efficient and robust method for improving video understanding. This work paves the way for creating more adaptable AI systems that can learn effectively from ranked feedback. You can read the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]