TLDR: Hear-Your-Click is a new interactive video-to-audio (V2A) framework that allows users to generate specific sounds for objects in a video by simply clicking on them. It uses Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) and novel data augmentation techniques (RVS, MLM) to achieve precise object-level audio-visual alignment. The system also introduces a new evaluation metric, the CAV score, and demonstrates superior performance in generating accurate and synchronized audio for specific visual elements.
Current video-to-audio (V2A) generation methods often struggle with complex video scenes because they rely on global video information. This means they can’t generate sounds specifically for individual objects or regions within a video, limiting their practical use in areas like film production and interactive media.
To overcome these limitations, researchers have introduced “Hear-Your-Click,” an innovative interactive V2A framework. This system allows users to generate sounds for specific objects in a video simply by clicking on them within a frame. This provides a much finer level of control over the audio generation process, addressing the challenge of customizing audio for specific visual elements.
The core of Hear-Your-Click is a technique called Object-aware Contrastive Audio-Visual Fine-tuning (OCAV). This method leverages a Mask-guided Visual Encoder (MVE) to extract visual features specifically related to the selected object. These object-level visual features are then aligned with corresponding audio segments through contrastive learning, ensuring a strong connection between what is seen and what is heard.
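The article doesn't spell out OCAV's exact formulation, but the description maps onto a familiar recipe: pool visual features inside the object mask, then align the pooled features with paired audio embeddings via a symmetric contrastive loss. The PyTorch sketch below illustrates that recipe; the masked average pooling, InfoNCE-style loss, and temperature value are assumptions for illustration, not the paper's published details.

```python
import torch
import torch.nn.functional as F

def masked_visual_features(frame_feats, obj_masks):
    """Pool feature maps over the object mask (assumed MVE pooling step).

    frame_feats: (T, C, H, W) per-frame visual feature maps
    obj_masks:   (T, 1, H, W) binary masks for the clicked object
    returns:     (T, C) object-level feature per frame
    """
    masked = frame_feats * obj_masks                        # zero out background
    area = obj_masks.sum(dim=(2, 3)).clamp(min=1.0)         # avoid divide-by-zero
    return masked.sum(dim=(2, 3)) / area                    # mask-weighted average

def contrastive_loss(vis_emb, aud_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning object visuals with audio segments.

    vis_emb, aud_emb: (B, D) embeddings from paired video/audio clips.
    """
    vis = F.normalize(vis_emb, dim=-1)
    aud = F.normalize(aud_emb, dim=-1)
    logits = vis @ aud.t() / temperature                    # (B, B) similarities
    targets = torch.arange(vis.size(0), device=vis.device)  # matches on diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```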
To make the model more sensitive to segmented objects and improve its robustness, two new data augmentation strategies were developed: Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM). RVS helps the model handle complex scenes with multiple objects by combining segments from different videos. MLM ensures that sound loudness tracks an object's distance from the camera and its presence in the frame, mitigating issues such as sounds continuing after their sources move out of view.
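Here is a minimal sketch of how these two augmentations could work, based only on the descriptions above. Whether RVS stitches clips temporally or spatially isn't stated, so this version splices along time; the area-proportional gain in MLM, and the sample rate and frame rate used to align audio with video frames, are likewise illustrative assumptions.

```python
import torch

def random_video_stitching(clip_a, clip_b, split=None):
    """Assumed RVS: splice two clips into one sequence so the model must
    localize the sounding object among distractor content.

    clip_a, clip_b: (T, C, H, W) frame tensors of equal length.
    """
    t = clip_a.shape[0]
    if split is None:
        split = torch.randint(1, t, (1,)).item()            # random cut point
    return torch.cat([clip_a[:split], clip_b[split:]], dim=0)

def mask_guided_loudness(waveform, mask_areas, sr=16000, fps=4):
    """Assumed MLM: scale loudness by the object's visible mask area, so the
    sound fades as the object shrinks and goes silent when it leaves frame.

    waveform:   (num_samples,) mono audio tensor
    mask_areas: (T,) fraction of the frame covered by the object, per frame
    """
    gains = mask_areas.clamp(0.0, 1.0)
    gain = gains.repeat_interleave(sr // fps)[: waveform.shape[0]]
    out = waveform.clone()
    out[: gain.shape[0]] *= gain                            # per-sample gain
    return out
```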
To evaluate the accuracy of audio-visual correspondence, a new metric called the CAV score was designed. It measures how well the generated audio matches the selected object in the original video, providing a more precise assessment than traditional audio generation metrics alone.
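The article doesn't give the CAV score's formula. A plausible reading, consistent with the contrastive fine-tuning above, is the mean cosine similarity between object-level visual embeddings and generated-audio embeddings over an evaluation set; the sketch below implements that assumed definition only.

```python
import torch.nn.functional as F

def cav_score(vis_embs, aud_embs):
    """Assumed CAV score: mean cosine similarity between object-level visual
    embeddings and generated-audio embeddings from the OCAV-tuned encoders.

    vis_embs, aud_embs: (N, D) stacked embeddings for N evaluation pairs.
    """
    return F.cosine_similarity(vis_embs, aud_embs, dim=-1).mean().item()
```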
The interactive inference process of Hear-Your-Click is designed for user-friendliness. Users upload a silent video, select a single frame, and refine a mask around their target object using promptable segmentation from models such as the Segment Anything Model (SAM). This mask is then propagated through the entire video sequence by the Track Anything Model (TAM). Finally, a trained Latent Diffusion Model (LDM), conditioned on the extracted visual features, generates the final audio.
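Putting those steps together, the end-to-end flow looks roughly like the sketch below. The `sam`, `tam`, `mve`, and `ldm` callables stand in for the Segment Anything Model, Track Anything Model, Mask-guided Visual Encoder, and latent diffusion model; their signatures here are hypothetical wrappers, not the real APIs of those projects.

```python
def hear_your_click(video_frames, frame_idx, click_xy, sam, tam, mve, ldm):
    # 1. Promptable segmentation: turn the user's click into an object mask
    #    on the selected frame (hypothetical SAM wrapper).
    seed_mask = sam(video_frames[frame_idx], point=click_xy)

    # 2. Propagate the mask through the whole clip so the object stays
    #    selected as it moves (hypothetical TAM wrapper).
    masks = tam(video_frames, seed_mask, start=frame_idx)

    # 3. Extract object-level visual features with the mask-guided encoder.
    condition = mve(video_frames, masks)

    # 4. Condition the latent diffusion model on those features to
    #    synthesize audio for the selected object.
    return ldm.sample(condition=condition)
```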
Extensive experiments show that Hear-Your-Click offers more precise control and significantly improves audio generation performance across various metrics, including the newly introduced CAV score. This demonstrates its effectiveness in capturing object-specific local details within videos and producing accurate, synchronized sounds. More details are available in the Hear-Your-Click research paper.


