
EDITGEN: Revolutionizing Audio Editing with Instruction-Based Auto-Regressive Models

TLDR: EDITGEN is a novel audio editing framework that leverages cross-attention control within auto-regressive models, inspired by image editing techniques. It introduces three mechanisms (Replace, Reweight, Refine) for instruction-based audio manipulation. By integrating MUSICGEN and employing a soft-blending technique, EDITGEN significantly outperforms diffusion-based methods (like Auffusion) in terms of melody, dynamics, tempo, and overall audio realism, as confirmed by both automated metrics and human evaluations. This marks the first successful application of prompt-to-prompt guidance in auto-regressive audio editing.

Audio manipulation has traditionally been a complex and resource-intensive task, often demanding extensive datasets with detailed annotations and specialized expertise to craft effective model architectures. Fine-tuning existing models also proves to be a costly endeavor. A new research paper introduces EDITGEN, an innovative approach that simplifies audio editing by leveraging cross-attention control within auto-regressive models.

Inspired by successful image editing techniques like Prompt-to-Prompt, EDITGEN guides audio edits through sophisticated cross and self-attention mechanisms. The study explores two primary strategies: a diffusion-based approach, influenced by Auffusion, which extends the model’s capabilities for refinement edits, and an alternative method that integrates MUSICGEN, a pre-trained auto-regressive model.

The researchers propose three distinct editing mechanisms based on manipulating attention scores: Replace, Refine, and Reweight. These mechanisms allow precise, instruction-based audio modifications without retraining the underlying model.

How EDITGEN Works

EDITGEN adapts the Prompt-to-Prompt technique, previously effective in image manipulation, to the audio domain. This adaptation enables fine-grained, semantically meaningful audio editing. The core idea involves injecting attention maps from an original audio generation into a new generation with a modified prompt, preserving the original structure while applying the desired edit.
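
To make the injection idea concrete, here is a minimal sketch of a controller that caches cross-attention maps on the source pass and replays them on the edit pass. The class and hook interface are illustrative assumptions, not the paper's actual code:

```python
import torch

class AttentionController:
    """Caches cross-attention maps during the source generation and
    replays them during the edit generation, preserving structure."""

    def __init__(self):
        self.cache = {}        # (layer_name, step) -> attention map
        self.injecting = False # False: record; True: replay

    def __call__(self, layer_name: str, step: int, attn: torch.Tensor) -> torch.Tensor:
        key = (layer_name, step)
        if self.injecting and key in self.cache:
            return self.cache[key]         # edit pass: reuse the source map
        self.cache[key] = attn.detach()    # source pass: record the map
        return attn
```

A wrapper would first run generation with the original prompt to fill the cache, then flip `injecting` to True and rerun with the modified prompt.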

  • Replace: This mechanism allows users to swap specific elements or ‘tokens’ in the original audio with new ones. For instance, changing an acoustic guitar to an electric guitar. The attention maps from the source audio are injected into the generation process with the modified prompt, controlling the edit up to a specified timestamp.
  • Refine: When new tokens are added to the prompt, the attention injection is applied only to the common tokens shared between the original and modified prompts. This ensures that new elements are seamlessly integrated while maintaining the integrity of existing audio features.
  • Reweight: Users can strengthen or weaken the influence of specific tokens on the final audio output. This is achieved by scaling the attention map of the target token with a parameter, allowing for subtle or dramatic adjustments to its effect. (A sketch of all three operations follows this list.)
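
Under the same assumptions as above, the three edits reduce to simple manipulations of a cross-attention map of shape `[heads, audio_steps, text_tokens]`. The function below is an illustrative sketch, not the paper's implementation:

```python
import torch

def edit_attention(attn_src, attn_edit, mode, token_idx=None, scale=1.0, common_mask=None):
    """Apply one of the three EDITGEN-style edits to a cross-attention map.

    attn_src, attn_edit: [heads, audio_steps, text_tokens] maps from the
    source and edit passes, assumed aligned to the same token positions.
    mode: 'replace' | 'refine' | 'reweight' (illustrative names only).
    """
    if mode == "replace":
        # Swapped token: reuse the source map wholesale so the original
        # structure (melody, rhythm) carries over to the new prompt.
        return attn_src
    if mode == "refine":
        # Added tokens: inject source maps only at positions shared by
        # both prompts; new tokens keep their freshly computed attention.
        out = attn_edit.clone()
        out[..., common_mask] = attn_src[..., common_mask]
        return out
    if mode == "reweight":
        # Scale the target token's attention column up or down.
        out = attn_edit.clone()
        out[..., token_idx] = out[..., token_idx] * scale
        return out
    raise ValueError(f"unknown mode: {mode}")
```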

A key innovation for auto-regressive models like MUSICGEN is the application of attention injection at all timesteps, ensuring edits affect the entire generated audio. Additionally, the paper introduces ‘soft-blending,’ a technique that merges feature maps from the forward process with injected ones using a weighted average. This dynamic blending factor adjusts based on the attention layer’s position, mimicking the iterative editing approach of diffusion models.
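
A minimal sketch of the soft-blending step is shown below. The linear ramp over layer depth is an assumption for illustration; the paper ties the blending factor to the attention layer's position without specifying this exact schedule here:

```python
import torch

def soft_blend(feat_forward: torch.Tensor, feat_injected: torch.Tensor,
               layer_idx: int, num_layers: int) -> torch.Tensor:
    """Weighted average of the forward-pass features and the injected ones.

    The blending factor alpha depends on the layer's position; a linear
    ramp (0 at the first layer, 1 at the last) is used here purely as an
    illustrative schedule.
    """
    alpha = layer_idx / max(num_layers - 1, 1)
    return alpha * feat_injected + (1.0 - alpha) * feat_forward
```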

Evaluation and Results

To evaluate EDITGEN, a diverse dataset of prompt pairs was created, covering various audio editing axes such as Instrument Change, Mood/Tonal Change, Genre Shift, Melodic Transformation, Harmonic Modification, and Form/Structure Variation. The study generated 660 audio samples across both the diffusion-based Auffusion model and the MUSICGEN-based approach.

Automated metrics were used to assess musical characteristics, including Melody Accuracy, Dynamics Correlation, Rhythm F1 Score, and CLAP Score (for text-audio adherence). The results consistently showed that the MUSICGEN-based approach significantly outperformed the diffusion-based Auffusion across all metrics: it achieved higher melody accuracy, closer similarity to both the original audio and the target text prompt, and stronger dynamics correlation and rhythm F1 scores.
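
For readers who want to compute comparable numbers, here is a rough sketch of two of these metrics using librosa. These are plausible proxies, not necessarily the paper's exact implementations; the 32 kHz default matches MUSICGEN's output sample rate:

```python
import librosa
import numpy as np
from scipy.stats import pearsonr

def dynamics_correlation(y_src, y_edit):
    """Pearson correlation between the two clips' RMS loudness envelopes."""
    n = min(len(y_src), len(y_edit))
    rms_src = librosa.feature.rms(y=y_src[:n])[0]
    rms_edit = librosa.feature.rms(y=y_edit[:n])[0]
    r, _ = pearsonr(rms_src, rms_edit)
    return r

def rhythm_f1(y_src, y_edit, sr=32000, tol=0.05):
    """F1 score over onsets matched within a +/- tol-second window."""
    on_src = librosa.onset.onset_detect(y=y_src, sr=sr, units="time")
    on_edit = librosa.onset.onset_detect(y=y_edit, sr=sr, units="time")
    hits = sum(1 for t in on_edit if on_src.size and np.abs(on_src - t).min() <= tol)
    precision = hits / max(len(on_edit), 1)
    recall = hits / max(len(on_src), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```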

A human study involving 24 evaluators further validated these findings. Participants judged audio clips generated by EDITGEN with MUSICGEN to be more natural and faithful to the original content, considering elements like melody, tempo, and dynamics, as well as textual alignment. This marks the first successful application of the Prompt-to-Prompt technique in the context of auto-regressive audio editing.

This research, detailed in the paper EditGen: Harnessing Cross Attention Control for Instruction-Based Auto-Regressive Audio Editing, paves the way for more intuitive and controllable text-based audio editing, with future work focusing on creating comprehensive audio editing datasets and conducting broader user studies across diverse musical and cultural contexts.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
