Pinpointing Events in Videos: A New Approach to Weakly-Supervised Audio-Visual Localization

TLDR: CLASP is a novel method for Dense Audio-Visual Event Localization (DAVEL) under a challenging weakly-supervised setting (W-DAVEL), where only video-level event labels are provided. It addresses this by identifying ‘cross-modal salient anchors’ – reliable timestamps with consistent event semantics across audio and visual modalities. The method uses a Mutual Event Agreement Evaluation module to find these anchors, a Cross-modal Salient Anchor Identification module to select them globally and locally, and an Anchor-based Temporal Propagation module to enhance event semantic encoding. CLASP establishes new benchmarks on UnAV-100 and ActivityNet1.3 datasets, achieving state-of-the-art performance.

Understanding what’s happening in long videos, especially when events involve both sounds and visuals, is a complex challenge for artificial intelligence. Imagine a video of a concert: you hear the music and see the musicians. The task of identifying exactly when a specific instrument starts playing or when the crowd cheers is known as Dense Audio-Visual Event Localization (DAVEL).

Traditionally, DAVEL systems require very detailed labels, where human annotators mark the precise start and end times of every event in a video. This process is incredibly time-consuming and expensive, making it difficult to scale to the vast amounts of video content available today. This is where the new challenge of Weakly-supervised Dense Audio-Visual Event Localization (W-DAVEL) comes in: can we achieve accurate event localization using only broad, video-level labels, like simply knowing that ‘people cheering’ or ‘a car passing by’ occurs somewhere in the video, without knowing exactly when?

Introducing CLASP: A Novel Approach to W-DAVEL

A recent research paper, titled “CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization”, introduces a method to tackle this W-DAVEL problem. The core idea behind CLASP is to identify what the researchers call ‘cross-modal salient anchors’: reliable moments in a video where the audio and visual information strongly agree on the presence of a particular event, even without precise temporal labels. Think of them as timestamps where the model can infer the event category with high confidence from both modalities.

The CLASP framework operates through three key modules:

  • Mutual Event Agreement Evaluation (MEAE): This module acts like a consistency checker. It independently predicts event probabilities for both the audio and visual streams. By comparing these predictions, it generates an ‘agreement score’ for each moment in the video. A high agreement score indicates that both modalities are strongly suggesting the same event is happening, making that moment a potential salient anchor.

  • Cross-modal Salient Anchor Identification (CSAI): Once the agreement scores are calculated, this module identifies the actual salient anchors. It does this in two ways: a ‘Global Anchor Identification’ that picks the most confident moments across the entire video, and a ‘Local Anchor Identification’ that finds confident moments within smaller, specific time windows. These identified audio and visual anchor features are then combined to form a robust multimodal representation of the salient events.

  • Anchor-based Temporal Propagation (ATP): This module spreads what the anchors reveal to the rest of the video. The semantic information carried by the identified salient anchors is used to enhance the audio and visual features of the entire video. By propagating this event knowledge from the confident anchor points to other temporal segments, the model can better pinpoint the start and end times of events across the whole timeline, even without direct supervision.
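
To make the three steps concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: this article does not give the exact formulas, so the agreement score (the largest per-class product of audio and visual probabilities), the top-k and per-window anchor selection, the averaging of audio and visual anchor features, and the attention-style propagation are all assumptions chosen for illustration. The function names and the parameters `k_global` and `window` are hypothetical, and the toy example reuses the probability vectors as stand-ins for learned features.

```python
import math

def agreement_scores(p_audio, p_visual):
    # MEAE-like step (assumed form): for each timestamp, take the largest
    # per-class product of the audio and visual event probabilities.
    return [max(a * v for a, v in zip(pa, pv))
            for pa, pv in zip(p_audio, p_visual)]

def select_anchors(scores, k_global=2, window=4):
    # CSAI-like step (assumed form): global top-k timestamps plus the
    # best timestamp inside each non-overlapping local window.
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    global_idx = order[:k_global]
    local_idx = []
    for start in range(0, len(scores), window):
        end = min(start + window, len(scores))
        local_idx.append(max(range(start, end), key=lambda t: scores[t]))
    return sorted(set(global_idx) | set(local_idx))

def fuse_anchor_features(audio_feats, visual_feats, anchors):
    # Combine audio and visual anchor features by averaging (assumption).
    return [[(a + v) / 2 for a, v in zip(audio_feats[t], visual_feats[t])]
            for t in anchors]

def propagate(features, anchor_feats):
    # ATP-like step (assumed form): every timestamp attends over the anchor
    # features with a softmax and adds the weighted sum back as a residual.
    dim = len(features[0])
    out = []
    for f in features:
        logits = [sum(x * y for x, y in zip(f, a)) / math.sqrt(dim)
                  for a in anchor_feats]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        context = [sum(w * a[d] for w, a in zip(weights, anchor_feats))
                   for d in range(dim)]
        out.append([x + c for x, c in zip(f, context)])
    return out

# Toy video: 8 timestamps, 2 event classes (made-up numbers).
p_audio = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.1], [0.9, 0.2],
           [0.1, 0.1], [0.1, 0.3], [0.2, 0.7], [0.1, 0.2]]
p_visual = [[0.2, 0.1], [0.1, 0.1], [0.9, 0.1], [0.8, 0.1],
            [0.1, 0.2], [0.1, 0.4], [0.1, 0.6], [0.1, 0.1]]

scores = agreement_scores(p_audio, p_visual)
anchors = select_anchors(scores, k_global=2, window=4)
print(anchors)  # [2, 3, 6]

# Probability vectors stand in for learned features in this toy example.
fused = fuse_anchor_features(p_audio, p_visual, anchors)
enhanced_audio = propagate(p_audio, fused)
enhanced_visual = propagate(p_visual, fused)
```

In this toy run, the global pass picks timestamps 2 and 3, where the first event is strongest, while the local pass adds timestamp 6, a weaker second event that a purely global ranking would have missed. That is one plausible reading of why the article describes both a global and a local selection.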

Achieving State-of-the-Art Performance

The researchers rigorously tested CLASP on two widely recognized datasets: UnAV-100 and ActivityNet1.3. These datasets contain long, untrimmed videos with a wide variety of audio-visual events. The results were impressive, demonstrating that CLASP significantly outperforms existing methods for weakly-supervised audio-visual event localization. For instance, on the UnAV-100 dataset, CLASP surpassed the previous state-of-the-art method by a notable margin, showcasing its effectiveness and generalizability.

The paper also includes detailed ablation studies, which are experiments designed to understand the contribution of each component of the CLASP system. These studies confirmed that each module, especially the dual global and local anchor identification mechanisms and the anchor-based propagation, plays a crucial role in the overall superior performance.

This work represents a significant step forward in making audio-visual event localization more practical and scalable by reducing the reliance on expensive, fine-grained annotations. For more technical details, see the full research paper.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
