TLDR: Hear-Your-Click is a new interactive video-to-audio (V2A) framework that allows users to generate specific sounds for objects in a video by simply clicking on them. It uses Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) and novel data augmentation techniques (RVS, MLM) to achieve precise object-level audio-visual alignment. The system also introduces a new evaluation metric, the CAV score, and demonstrates superior performance in generating accurate and synchronized audio for specific visual elements.
Current video-to-audio (V2A) generation methods often struggle with complex video scenes because they rely on global video information. This means they can’t generate sounds specifically for individual objects or regions within a video, limiting their practical use in areas like film production and interactive media.
To overcome these limitations, researchers have introduced “Hear-Your-Click,” an innovative interactive V2A framework. This system allows users to generate sounds for specific objects in a video simply by clicking on them within a frame. This provides a much finer level of control over the audio generation process, addressing the challenge of customizing audio for specific visual elements.
The core of Hear-Your-Click is a technique called Object-aware Contrastive Audio-Visual Fine-tuning (OCAV). This method leverages a Mask-guided Visual Encoder (MVE) to extract visual features specifically related to the selected object. These object-level visual features are then aligned with corresponding audio segments through contrastive learning, ensuring a strong connection between what is seen and what is heard.
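The article doesn't spell out OCAV's exact formulation, but the description maps onto a familiar recipe: pool visual features inside the object mask, then align the pooled features with paired audio embeddings via a symmetric contrastive loss. The PyTorch sketch below illustrates that recipe; the masked average pooling, InfoNCE-style loss, and temperature value are assumptions for illustration, not the paper's published details.

```python
import torch
import torch.nn.functional as F

def masked_visual_features(frame_feats, obj_masks):
    """Pool feature maps over the object mask (assumed MVE pooling step).

    frame_feats: (T, C, H, W) per-frame visual feature maps
    obj_masks:   (T, 1, H, W) binary masks for the clicked object
    returns:     (T, C) object-level feature per frame
    """
    masked = frame_feats * obj_masks                        # zero out background
    area = obj_masks.sum(dim=(2, 3)).clamp(min=1.0)         # avoid divide-by-zero
    return masked.sum(dim=(2, 3)) / area                    # mask-weighted average

def contrastive_loss(vis_emb, aud_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning object visuals with audio segments.

    vis_emb, aud_emb: (B, D) embeddings from paired video/audio clips.
    """
    vis = F.normalize(vis_emb, dim=-1)
    aud = F.normalize(aud_emb, dim=-1)
    logits = vis @ aud.t() / temperature                    # (B, B) similarities
    targets = torch.arange(vis.size(0), device=vis.device)  # matches on diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```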
To make the model more sensitive to segmented objects and improve its robustness, two new data augmentation strategies were developed: Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM). RVS helps the model handle complex scenes with multiple objects by combining segments from different videos. MLM ensures that sound loudness tracks an object's distance from the camera and its presence in the frame, mitigating issues such as sounds continuing after their sources move out of view.
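Here is a minimal sketch of how these two augmentations could work, based only on the descriptions above. Whether RVS stitches clips temporally or spatially isn't stated, so this version splices along time; the area-proportional gain in MLM, and the sample rate and frame rate used to align audio with video frames, are likewise illustrative assumptions.

```python
import torch

def random_video_stitching(clip_a, clip_b, split=None):
    """Assumed RVS: splice two clips into one sequence so the model must
    localize the sounding object among distractor content.

    clip_a, clip_b: (T, C, H, W) frame tensors of equal length.
    """
    t = clip_a.shape[0]
    if split is None:
        split = torch.randint(1, t, (1,)).item()            # random cut point
    return torch.cat([clip_a[:split], clip_b[split:]], dim=0)

def mask_guided_loudness(waveform, mask_areas, sr=16000, fps=4):
    """Assumed MLM: scale loudness by the object's visible mask area, so the
    sound fades as the object shrinks and goes silent when it leaves frame.

    waveform:   (num_samples,) mono audio tensor
    mask_areas: (T,) fraction of the frame covered by the object, per frame
    """
    gains = mask_areas.clamp(0.0, 1.0)
    gain = gains.repeat_interleave(sr // fps)[: waveform.shape[0]]
    out = waveform.clone()
    out[: gain.shape[0]] *= gain                            # per-sample gain
    return out
```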
To evaluate the accuracy of audio-visual correspondence, a new metric called the CAV score was designed. It measures how well the generated audio matches the selected object in the original video, providing a more precise assessment than traditional audio generation metrics alone.
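The article doesn't give the CAV score's formula. A plausible reading, consistent with the contrastive fine-tuning above, is the mean cosine similarity between object-level visual embeddings and generated-audio embeddings over an evaluation set; the sketch below implements that assumed definition only.

```python
import torch.nn.functional as F

def cav_score(vis_embs, aud_embs):
    """Assumed CAV score: mean cosine similarity between object-level visual
    embeddings and generated-audio embeddings from the OCAV-tuned encoders.

    vis_embs, aud_embs: (N, D) stacked embeddings for N evaluation pairs.
    """
    return F.cosine_similarity(vis_embs, aud_embs, dim=-1).mean().item()
```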
The interactive inference process of Hear-Your-Click is designed for user-friendliness. Users upload a silent video, select a single frame, and refine a mask around their target object using promptable segmentation from models such as the Segment Anything Model (SAM). This mask is then propagated through the entire video sequence by the Track Anything Model (TAM). Finally, a trained Latent Diffusion Model (LDM), conditioned on the extracted visual features, generates the final audio.
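Putting those steps together, the end-to-end flow looks roughly like the sketch below. The `sam`, `tam`, `mve`, and `ldm` callables stand in for the Segment Anything Model, Track Anything Model, Mask-guided Visual Encoder, and latent diffusion model; their signatures here are hypothetical wrappers, not the real APIs of those projects.

```python
def hear_your_click(video_frames, frame_idx, click_xy, sam, tam, mve, ldm):
    # 1. Promptable segmentation: turn the user's click into an object mask
    #    on the selected frame (hypothetical SAM wrapper).
    seed_mask = sam(video_frames[frame_idx], point=click_xy)

    # 2. Propagate the mask through the whole clip so the object stays
    #    selected as it moves (hypothetical TAM wrapper).
    masks = tam(video_frames, seed_mask, start=frame_idx)

    # 3. Extract object-level visual features with the mask-guided encoder.
    condition = mve(video_frames, masks)

    # 4. Condition the latent diffusion model on those features to
    #    synthesize audio for the selected object.
    return ldm.sample(condition=condition)
```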
Extensive experiments show that Hear-Your-Click offers more precise control and significantly improves audio generation performance across various metrics, including the newly introduced CAV score. This demonstrates its effectiveness in capturing object-specific local details within videos and producing accurate, synchronized sounds. More details are available in the Hear-Your-Click research paper.


