TLDR: RFM-Editing is a framework for text-guided audio editing that applies rectified flow matching within a latent diffusion architecture. It modifies audio content from a short text instruction, without requiring full captions or masks. The model localizes and edits specific audio events while preserving unedited regions, including in scenes with overlapping sounds. The authors also introduce a new dataset for training and benchmarking, and report competitive performance and efficiency.
Text-guided audio editing, the ability to modify existing audio using natural language instructions, is a rapidly evolving field with significant potential for sound design, post-production, and personalized audio generation. While text-to-audio generation has seen remarkable advancements, precise text-guided audio editing, especially in complex scenarios, has remained a challenge.
Existing methods often struggle with accurately localizing the content to be edited while preserving the rest of the audio. Some approaches require full captions or costly optimization during inference, making them less practical for real-world applications. The scarcity of large-scale datasets for instruction-guided audio editing has also limited the progress of training-based models.
Introducing RFM-Editing
A new research paper, "RFM-Editing: Rectified Flow Matching for Text-Guided Audio Editing", proposes a solution to these challenges. Developed by Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, and Wenwu Wang, RFM-Editing is an efficient, end-to-end framework based on rectified flow matching (RFM) for text-guided audio editing.
The core innovation of RFM-Editing lies in its ability to learn localized velocity fields directly from instructions, eliminating the need for explicit masks or full captions. This makes the editing process more intuitive and practical for users who only have raw audio and a specific editing instruction.
How RFM-Editing Works
RFM-Editing is built upon the foundation of Latent Diffusion Models (LDM). It integrates several key components:
- An audio feature extractor to process input audio.
- A low-rank adaptation (LoRA)-tuned Flan-T5 text encoder to accurately understand editing instructions.
- A UNet model specifically designed for text-guided latent editing.
- A HiFi-GAN decoder for high-fidelity reconstruction of the edited audio waveform.
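The data flow between these four components can be sketched as a simple function. This is only an illustrative outline, not the authors' code: the name `edit_audio`, its parameter names, and the toy stand-in callables are all assumptions made for this example, and each real component (feature extractor, Flan-T5 encoder, UNet, HiFi-GAN) is replaced by a trivial placeholder.

```python
from typing import Callable, List

Vector = List[float]


def edit_audio(
    waveform: Vector,
    instruction: str,
    extract_features: Callable[[Vector], Vector],
    encode_text: Callable[[str], Vector],
    edit_latent: Callable[[Vector, Vector], Vector],
    decode: Callable[[Vector], Vector],
) -> Vector:
    """Wire the four stages together (hypothetical interface)."""
    source_latent = extract_features(waveform)   # audio feature extractor
    text_embedding = encode_text(instruction)    # text encoder (LoRA-tuned Flan-T5 in the paper)
    edited_latent = edit_latent(source_latent, text_embedding)  # text-guided latent editor (UNet)
    return decode(edited_latent)                 # waveform decoder (HiFi-GAN in the paper)


# Toy stand-ins so the sketch runs end to end; none of these are real models.
def toy_extract(w: Vector) -> Vector:
    return [0.5 * s for s in w]


def toy_encode(text: str) -> Vector:
    return [float(len(text))]


def toy_identity_edit(latent: Vector, text_embedding: Vector) -> Vector:
    return list(latent)  # leaves the latent unchanged


def toy_decode(latent: Vector) -> Vector:
    return [2.0 * z for z in latent]


pipeline_output = edit_audio(
    [0.5, -0.25, 1.0], "Remove the dog barking",
    toy_extract, toy_encode, toy_identity_edit, toy_decode,
)
```

With the identity editor, the toy decoder undoes the toy extractor, so the output equals the input waveform. In a real system only the third stage changes the content.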
A crucial aspect of RFM-Editing is its use of Rectified Flow Matching. Standard diffusion models are typically formulated with stochastic differential equations (SDEs); RFM instead defines a deterministic ordinary differential equation (ODE) whose trajectory is a straight line from noise to the target edited audio, which leads to more stable and efficient training. To preserve unedited regions, the model concatenates the original audio features with the noisy latent features and uses a flexible initialization strategy during editing.
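To make the straight-line idea concrete, here is a minimal sketch of the generic rectified flow recipe on plain Python lists: a point on the path is a linear interpolation between noise and target, the regression target is the constant velocity `target - noise`, and editing integrates the ODE with Euler steps. The function names, the toy "oracle" velocity, and the way the source latent is concatenated to the model input are assumptions for illustration. The paper's actual network, loss weighting, and initialization strategy are not reproduced here.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, target: Vector, t: float) -> Vector:
    """Point on the straight line from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def target_velocity(noise: Vector, target: Vector) -> Vector:
    """Velocity of the straight path; it is constant in t."""
    return [x - n for n, x in zip(noise, target)]


def flow_matching_loss(predicted: Vector, noise: Vector, target: Vector) -> float:
    """Mean squared error between predicted and straight-line velocity."""
    v = target_velocity(noise, target)
    return sum((p - q) ** 2 for p, q in zip(predicted, v)) / len(v)


def euler_edit(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Vector,
    source_latent: Vector,
    num_steps: int = 8,
) -> Vector:
    """Integrate dx/dt = velocity_fn([x ; source_latent], t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        model_input = x + source_latent  # concatenate noisy latent with original features
        v = velocity_fn(model_input, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: an oracle that already knows the true velocity stands in for a trained network.
rng = random.Random(0)
target = [0.5, -1.0, 2.0]
source_latent = [0.5, -1.0, 0.0]  # pretend only the last entry should change
noise = [rng.gauss(0.0, 1.0) for _ in target]


def oracle(model_input: Vector, t: float) -> Vector:
    return target_velocity(noise, target)


edited = euler_edit(oracle, noise, source_latent, num_steps=4)
```

Because the oracle velocity is constant along the path, even a handful of Euler steps lands on the target up to floating-point error. That property is the usual argument for why straight trajectories allow few-step sampling.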
Training and Performance
To support its development, the researchers constructed a new, large-scale audio editing dataset. This dataset features overlapping multi-event audio, derived from AudioCaps2, and includes instruction-conditioned triplets for ‘add’, ‘remove’, and ‘replace’ tasks. This allows RFM-Editing to be trained and benchmarked in complex scenarios.
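One way to picture an instruction-conditioned triplet is as a record holding an instruction, a source clip, and a target clip. The sketch below is an assumption-heavy illustration, not the authors' construction pipeline: the `EditTriplet` class, the `mix` helper, the sample-wise summation used for overlap, and the instruction wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

Waveform = List[float]


@dataclass
class EditTriplet:
    task: str          # "add", "remove", or "replace"
    instruction: str   # short text instruction (wording is hypothetical)
    source: Waveform   # audio before the edit
    target: Waveform   # audio after the edit


def mix(*events: Waveform) -> Waveform:
    """Overlap events by summing samples; shorter events are zero-padded."""
    length = max(len(e) for e in events)
    return [sum(e[i] if i < len(e) else 0.0 for e in events) for i in range(length)]


def make_triplet(
    task: str,
    background: Waveform,
    event: Waveform,
    event_label: str,
    new_event: Optional[Waveform] = None,
    new_label: Optional[str] = None,
) -> EditTriplet:
    """Build one triplet from a background clip and one or two foreground events."""
    if task == "add":
        return EditTriplet(task, f"Add {event_label}", background, mix(background, event))
    if task == "remove":
        return EditTriplet(task, f"Remove {event_label}", mix(background, event), background)
    if task == "replace":
        if new_event is None or new_label is None:
            raise ValueError("replace needs new_event and new_label")
        return EditTriplet(
            task,
            f"Replace {event_label} with {new_label}",
            mix(background, event),
            mix(background, new_event),
        )
    raise ValueError(f"unknown task: {task}")


rain = [0.25, 0.25, 0.25, 0.25]
dog = [0.0, 0.5, 0.5]
cat = [0.5, 0.0]

add_example = make_triplet("add", rain, dog, "a dog barking")
remove_example = make_triplet("remove", rain, dog, "a dog barking")
replace_example = make_triplet("replace", rain, dog, "a dog barking", cat, "a cat meowing")
```

Note that in this toy construction "remove" is simply "add" with source and target swapped, so the same event pairs can serve both tasks.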
Experiments demonstrate that RFM-Editing achieves competitive performance compared to existing methods like AudioEditor, AUDIT, and Zero-Shot. It shows strong semantic alignment with instructions and maintains high editing quality across various metrics. Notably, RFM-Editing can automatically and accurately localize audio events based on the instruction, without requiring time-aligned masks. This is a significant advantage, as it simplifies the user experience and improves the precision of edits.
Furthermore, RFM-Editing offers a clear advantage in efficiency, performing edits significantly faster than some previous methods that rely on costly inference-time optimization. The experiments also highlight the critical role of well-formulated prompts in achieving high-quality editing outcomes.
Conclusion
RFM-Editing represents a significant step forward in text-guided audio editing. By leveraging rectified flow matching and an instruction-driven approach, it offers a practical and efficient solution for modifying audio content with precision and semantic faithfulness, even in challenging multi-event scenarios. This work paves the way for future advancements in leveraging language prompting capabilities for sophisticated audio manipulation.


