TLDR: The research paper introduces MF-MJT, a MeanFlow-accelerated model for multimodal video-to-audio (VTA) and text-to-audio (TTA) synthesis. It addresses the trade-off between synthesis quality and inference efficiency by enabling one-step generation using average velocity, significantly speeding up the process (up to 500x faster than some baselines) while maintaining high audio quality, semantic alignment, and temporal synchronization. It also includes a scalar rescaling mechanism for classifier-free guidance to prevent distortions in one-step generation.
Creating audio for silent videos has always presented a challenge: how do you achieve high-quality sound without making the process incredibly slow? Traditional methods, especially those based on flow matching, often require many steps to generate audio, leading to sluggish performance. This is a significant hurdle for applications like video dubbing and content creation.
A new research paper, MeanFlow-accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation, introduces a model called MF-MJT for short. It tackles the efficiency bottleneck by generating audio in a single step, which speeds up inference considerably while maintaining audio quality, keeping the sound semantically aligned with the video, and preserving temporal synchronization.
The core innovation lies in its use of ‘MeanFlow’. Unlike previous flow matching models that focus on instantaneous velocity (like a snapshot of speed at a single moment), MeanFlow models the average velocity of the flow fields. Think of it like this: instead of calculating every tiny movement along a path, it calculates the overall direction and speed, allowing it to jump directly to the end result in a single step. This is what makes the generation process so much faster.
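To make the difference concrete, here is a minimal sketch on a toy one-dimensional flow. The dynamics `dz/dt = A * z`, the constant `A`, and all function names are assumptions chosen for illustration because this flow has a closed-form average velocity; in MF-MJT the average velocity would be predicted by a trained network instead. Sampling runs from t = 1 (noise) to t = 0 (data).

```python
import math

A = 2.0  # toy dynamics dz/dt = A * z (hypothetical, chosen for a closed-form solution)


def instantaneous_velocity(z, t):
    """Velocity at a single moment, as in standard flow matching."""
    return A * z


def average_velocity(z_t, r, t):
    """Average velocity over [r, t]: displacement divided by elapsed time.

    Closed form for the toy flow; a MeanFlow-style model would learn this.
    """
    z_r = z_t * math.exp(A * (r - t))
    return (z_t - z_r) / (t - r)


def euler_sample(z1, steps):
    """Integrate from t=1 down to t=0 using many small instantaneous steps."""
    z, t = z1, 1.0
    dt = 1.0 / steps
    for _ in range(steps):
        z = z - dt * instantaneous_velocity(z, t)
        t -= dt
    return z


def one_step_sample(z1):
    """Jump from t=1 to t=0 in a single step using the average velocity."""
    return z1 - (1.0 - 0.0) * average_velocity(z1, 0.0, 1.0)


z1 = 1.0
exact = z1 * math.exp(-A)  # true endpoint of the toy flow at t=0
print("exact endpoint:          ", exact)
print("Euler, 1 step:           ", euler_sample(z1, 1))
print("Euler, 100 steps:        ", euler_sample(z1, 100))
print("average velocity, 1 step:", one_step_sample(z1))
```

In this toy setting a single Euler step with the instantaneous velocity lands far from the true endpoint, many small steps get close, and a single step with the average velocity lands on it directly. That is the intuition behind one-step generation, although a learned average velocity will only approximate the true one.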
Beyond speed, the MF-MJT model also introduces a 'scalar rescaling mechanism'. Classifier-free guidance (CFG) combines a conditional prediction with an unconditional one to steer generation more strongly toward the input condition. When the two predictions are poorly balanced, the combined result can be distorted, and the risk is higher in one-step generation, where there are no later steps to correct the error. The rescaling mechanism balances the conditional and unconditional predictions to mitigate these distortions and keep the generated audio high in quality.
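The article does not give the formula for the rescaling, so the sketch below should be read as an assumption, not as the paper's method. It shows standard CFG next to one plausible scalar rescaling, in which the unconditional prediction is multiplied by a single projection-based scalar before the two predictions are combined. The function names and the choice of scalar are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cfg(cond, uncond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def rescaled_cfg(cond, uncond, w, eps=1e-8):
    """CFG with a scalar rescaling of the unconditional branch.

    ASSUMPTION: the scalar s is the projection coefficient of cond onto
    uncond. This is one plausible choice for illustration only; the
    paper's actual scalar may be defined differently.
    """
    s = dot(cond, uncond) / (dot(uncond, uncond) + eps)
    scaled_uncond = [s * u for u in uncond]
    return cfg(cond, scaled_uncond, w)


# Toy predictions: the unconditional output points the same way as the
# conditional one but has twice the magnitude.
cond = [1.0, 2.0]
uncond = [2.0, 4.0]

print("plain CFG, w=3:   ", cfg(cond, uncond, 3.0))
print("rescaled CFG, w=3:", rescaled_cfg(cond, uncond, 3.0))
```

In this toy case the magnitude mismatch makes plain CFG overshoot and flip the sign of the prediction, while the rescaled version first brings the unconditional branch to a comparable scale and leaves the prediction essentially unchanged. With several sampling steps such an overshoot can be partly corrected later; with one step it goes straight into the output.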
What’s particularly impressive is the model’s versatility. While primarily designed for video-to-audio (VTA) synthesis, the underlying network is jointly trained with multimodal conditions, meaning it learns from video, audio, and text simultaneously. This allows it to also perform exceptionally well on text-to-audio (TTA) synthesis tasks, generating audio from written descriptions.
The architecture of MF-MJT builds upon a multimodal joint training backbone, integrating video, audio, and text inputs into a unified framework. It uses advanced components like CLIP encoders for visual and text features, a Variational Autoencoder (VAE) for audio, and a Synchformer visual encoder to enhance audio-visual synchrony. These elements work together to create a shared understanding across different types of data.
The reported experiments support these claims. On video-to-audio synthesis, MF-MJT outperforms existing methods in inference speed, achieving a real-time factor (RTF) of 0.007 for one-step generation on an NVIDIA H800 GPU. RTF is the time needed to generate the audio divided by the duration of that audio, so an RTF of 0.007 means generation runs over 140 times faster than real time (1 / 0.007 ≈ 143). The paper reports a speedup of 2x to 500x compared to some baselines. This acceleration does not come at the cost of quality: the model maintains comparable perceptual quality, semantic alignment, and temporal synchronization.
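The RTF arithmetic can be checked with a few lines of Python. The helper names and the 10-second clip length are assumptions for illustration; only the RTF value of 0.007 comes from the reported results.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return generation_seconds / audio_seconds


def speed_vs_real_time(rtf):
    """How many times faster than real time a given RTF corresponds to."""
    return 1.0 / rtf


def generation_time(rtf, audio_seconds):
    """Expected generation time for a clip of the given length."""
    return rtf * audio_seconds


reported_rtf = 0.007          # value reported for one-step generation
clip_seconds = 10.0           # hypothetical clip length

print("times faster than real time:", speed_vs_real_time(reported_rtf))
print("seconds to generate the clip:", generation_time(reported_rtf, clip_seconds))
```

At this RTF, the hypothetical 10-second clip would take roughly 0.07 seconds to generate on the reported hardware.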
Similarly, for text-to-audio synthesis, MF-MJT demonstrates strong performance, surpassing other efficient TTA models in key metrics while maintaining its incredibly fast inference speed. The research also explored the impact of the scalar rescaling mechanism and found it consistently improved perceptual quality across different guidance strengths.
In conclusion, the MF-MJT framework is a significant step forward in multimodal audio synthesis. By using MeanFlow for one-step generation and adding a scalar rescaling mechanism for CFG, it delivers large efficiency gains without compromising the quality or accuracy of the synthesized audio. This makes it a strong candidate for applications that need fast, high-fidelity audio generation from video or text inputs.


