Advancing Video-to-Audio Synthesis for Extended Content with LD-LAudio-V1

TLDR: LD-LAudio-V1 is a new model that uses dual lightweight adapters to generate high-quality, temporally synchronized audio for long-form videos, overcoming limitations of existing short-form methods. It is supported by LPSE-1, a new clean, human-annotated dataset of pure sound effects for long videos. The model significantly reduces audio inconsistencies and splicing artifacts, demonstrating substantial performance improvements across various metrics with minimal computational overhead.

Generating high-quality audio that perfectly matches video content, often known as Foley sound generation, is a crucial aspect of video editing and post-production. It allows for the creation of rich, semantically aligned soundscapes for videos that might otherwise be silent.

While significant progress has been made in this field, most existing methods excel at generating audio for short video segments, typically under 10 seconds. The challenge arises when attempting to extend these capabilities to long-form videos. Current approaches often struggle with maintaining consistent audio quality and temporal synchronization over extended periods, leading to noticeable issues like “splicing artifacts” and general temporal inconsistencies. Furthermore, there has been a notable scarcity of clean, high-quality datasets specifically designed for long-form video-to-audio synthesis, with many existing ones containing unwanted noise like speech or music.
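
To make the splicing problem concrete, here is a minimal sketch of what happens when a long video is cut into short windows and audio is generated for each window independently. The 10-second window length follows the figure quoted above; the function names `window_boundaries` and `seam_times` are hypothetical and only illustrate where seams would fall, not how any particular model handles them.

```python
from typing import List, Tuple


def window_boundaries(duration_s: float, window_s: float = 10.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of at most window_s seconds."""
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def seam_times(windows: List[Tuple[float, float]]) -> List[float]:
    """Return the timestamps where two independently generated segments would be joined."""
    return [end for _, end in windows[:-1]]


# Hypothetical 65-second video handled by a model limited to 10-second windows.
windows = window_boundaries(65.0)
seams = seam_times(windows)
```

Each timestamp in `seams` is a point where naive concatenation can produce an audible discontinuity, so a longer video means more potential artifacts.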

Addressing these critical limitations, researchers have introduced a new approach called LD-LAudio-V1, an extension of state-of-the-art video-to-audio models. This innovative system incorporates dual lightweight adapters, specifically designed to enable the generation of long-form audio. These adapters work at both the frame level and the clip level, allowing the model to understand and maintain coherence across extended video durations. This dual-adapter system helps to significantly reduce the splicing artifacts and temporal inconsistencies that plague previous methods, all while maintaining computational efficiency.
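
The article does not describe the internal design of the adapters, so the following is only a sketch of the general idea, assuming the common residual bottleneck adapter pattern (a small down-projection, a nonlinearity, and an up-projection added back to the input). The class `BottleneckAdapter`, the helper `adapt_clip`, and the mean-pooling step are all hypothetical illustrations and should not be read as the LD-LAudio-V1 architecture.

```python
import random
from typing import List, Tuple

Vector = List[float]


class BottleneckAdapter:
    """Generic residual bottleneck adapter: x + up(relu(down(x))). Assumed pattern, not the paper's."""

    def __init__(self, dim: int, bottleneck: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection with small random weights, shape (bottleneck, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        # Up-projection initialised to zero, shape (dim, bottleneck),
        # so the adapter starts out as an identity map on its input.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x: Vector) -> Vector:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.down]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.up]
        return [v + d for v, d in zip(x, delta)]

    def num_params(self) -> int:
        # Two weight matrices of size dim x bottleneck (biases omitted in this sketch).
        return 2 * self.dim * self.bottleneck


def mean_pool(frames: List[Vector]) -> Vector:
    """Average per-frame features into a single clip-level vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def adapt_clip(
    frames: List[Vector],
    frame_adapter: BottleneckAdapter,
    clip_adapter: BottleneckAdapter,
) -> Tuple[List[Vector], Vector]:
    """Apply one adapter to every frame and a second adapter to the pooled clip summary."""
    adapted_frames = [frame_adapter(f) for f in frames]
    clip_summary = clip_adapter(mean_pool(adapted_frames))
    return adapted_frames, clip_summary


# Toy features: two frames, two dimensions each (made-up values).
frames = [[1.0, 2.0], [3.0, 4.0]]
frame_adapter = BottleneckAdapter(dim=2, bottleneck=1)
clip_adapter = BottleneckAdapter(dim=2, bottleneck=1, seed=1)
adapted_frames, clip_summary = adapt_clip(frames, frame_adapter, clip_adapter)
```

Because the up-projection starts at zero, both adapters initially pass features through unchanged, which is one common way to add trainable capacity without disturbing a pretrained backbone. The bottleneck width keeps the added parameter count small relative to the feature dimension.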

Alongside the LD-LAudio-V1 model, the team has also released a new, meticulously curated dataset called LPSE-1 (Long-form Pure Sound Effects – 1). This dataset is a significant contribution to the field, comprising over 6,000 videos, each longer than 60 seconds, and containing more than 20,000 human-annotated audio-visual events across 120 different categories. A key feature of LPSE-1 is its focus on pure sound effects, meaning it is entirely free from noise, voice-overs, music, or other irrelevant audio types, making it an exceptionally clean resource for training and evaluating long-form V2A models.

The performance of LD-LAudio-V1 demonstrates remarkable improvements over existing methods. When compared to direct fine-tuning with short training videos, LD-LAudio-V1 shows substantial gains across various evaluation metrics. For instance, it achieved a 27.27% improvement in FD_passt, a 34.98% improvement in FD_panns, and a striking 65.87% improvement in FD_vgg. Other metrics like KL_panns, KL_passt, IS_panns, IB-score, Energy ∆10ms, Energy ∆10ms (vs. GT), and Semantic Relevance also saw significant positive changes, indicating higher quality audio generation, better semantic alignment, and improved temporal consistency. These advancements are achieved with only a minimal increase in model parameters, approximately 4%, highlighting the efficiency of the dual lightweight adapter design.
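
The article reports only percentages, not raw scores. As a rough sketch of how such figures are typically computed, assuming the improvements are relative reductions of a lower-is-better Fréchet distance and that the parameter overhead is measured against the base model size, the arithmetic looks like this. All input numbers below are hypothetical placeholders chosen to reproduce the quoted percentages, not values from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage reduction of a lower-is-better metric relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


def parameter_overhead(base_params: int, added_params: int) -> float:
    """Added parameters as a percentage of the base model's parameter count."""
    if base_params <= 0:
        raise ValueError("base_params must be positive")
    return 100.0 * added_params / base_params


# Hypothetical scores: a baseline distance of 100.0 falling to 72.73.
fd_gain = relative_improvement(baseline=100.0, new=72.73)

# Hypothetical sizes: 40M adapter parameters on top of a 1B-parameter base model.
overhead = parameter_overhead(base_params=1_000_000_000, added_params=40_000_000)
```

With these placeholder inputs, `fd_gain` comes out at roughly 27.27 and `overhead` at 4.0, matching the scale of the figures quoted above.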

This research marks a significant step forward in the field of video-to-audio generation, particularly for long-form content. By providing both a novel model architecture and a high-quality, clean dataset, LD-LAudio-V1 and LPSE-1 are poised to facilitate further research and development in creating more realistic and synchronized audio for extended video experiences. You can find more details about this research in the paper: LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]