
EDITGEN: Revolutionizing Audio Editing with Instruction-Based Auto-Regressive Models

TLDR: EDITGEN is a novel audio editing framework that leverages cross-attention control within auto-regressive models, inspired by image editing techniques. It introduces three mechanisms (Replace, Reweight, Refine) for instruction-based audio manipulation. By integrating MUSICGEN and employing a soft-blending technique, EDITGEN significantly outperforms diffusion-based methods (like Auffusion) in terms of melody, dynamics, tempo, and overall audio realism, as confirmed by both automated metrics and human evaluations. This marks the first successful application of prompt-to-prompt guidance in auto-regressive audio editing.

Audio manipulation has traditionally been a complex and resource-intensive task, often demanding extensive datasets with detailed annotations and specialized expertise to craft effective model architectures. Fine-tuning existing models also proves to be a costly endeavor. A new research paper introduces EDITGEN, an innovative approach that simplifies audio editing by leveraging cross-attention control within auto-regressive models.

Inspired by successful image editing techniques like Prompt-to-Prompt, EDITGEN guides audio edits through sophisticated cross and self-attention mechanisms. The study explores two primary strategies: a diffusion-based approach, influenced by Auffusion, which extends the model’s capabilities for refinement edits, and an alternative method that integrates MUSICGEN, a pre-trained auto-regressive model.

The researchers propose three distinct editing mechanisms based on manipulating attention scores: Replace, Refine, and Reweight. These mechanisms allow precise, instruction-based audio modifications without retraining the underlying model.

How EDITGEN Works

EDITGEN adapts the Prompt-to-Prompt technique, previously effective in image manipulation, to the audio domain. This adaptation enables fine-grained, semantically meaningful audio editing. The core idea involves injecting attention maps from an original audio generation into a new generation with a modified prompt, preserving the original structure while applying the desired edit.
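
To make the injection idea concrete, here is a minimal sketch of a controller that caches cross-attention maps on the source pass and replays them on the edit pass. The class and hook interface are illustrative assumptions, not the paper's actual code:

```python
import torch

class AttentionController:
    """Caches cross-attention maps during the source generation and
    replays them during the edit generation, preserving structure."""

    def __init__(self):
        self.cache = {}        # (layer_name, step) -> attention map
        self.injecting = False # False: record; True: replay

    def __call__(self, layer_name: str, step: int, attn: torch.Tensor) -> torch.Tensor:
        key = (layer_name, step)
        if self.injecting and key in self.cache:
            return self.cache[key]         # edit pass: reuse the source map
        self.cache[key] = attn.detach()    # source pass: record the map
        return attn
```

A wrapper would first run generation with the original prompt to fill the cache, then flip `injecting` to True and rerun with the modified prompt.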

  • Replace: This mechanism allows users to swap specific elements or ‘tokens’ in the original audio with new ones. For instance, changing an acoustic guitar to an electric guitar. The attention maps from the source audio are injected into the generation process with the modified prompt, controlling the edit up to a specified timestamp.
  • Refine: When new tokens are added to the prompt, the attention injection is applied only to the common tokens shared between the original and modified prompts. This ensures that new elements are seamlessly integrated while maintaining the integrity of existing audio features.
  • Reweight: Users can strengthen or weaken the influence of specific tokens on the final audio output. This is achieved by scaling the attention map of the target token with a parameter, allowing for subtle or dramatic adjustments to its effect. (A sketch of all three operations follows this list.)
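
Under the same assumptions as above, the three edits reduce to simple manipulations of a cross-attention map of shape `[heads, audio_steps, text_tokens]`. The function below is an illustrative sketch, not the paper's implementation:

```python
import torch

def edit_attention(attn_src, attn_edit, mode, token_idx=None, scale=1.0, common_mask=None):
    """Apply one of the three EDITGEN-style edits to a cross-attention map.

    attn_src, attn_edit: [heads, audio_steps, text_tokens] maps from the
    source and edit passes, assumed aligned to the same token positions.
    mode: 'replace' | 'refine' | 'reweight' (illustrative names only).
    """
    if mode == "replace":
        # Swapped token: reuse the source map wholesale so the original
        # structure (melody, rhythm) carries over to the new prompt.
        return attn_src
    if mode == "refine":
        # Added tokens: inject source maps only at positions shared by
        # both prompts; new tokens keep their freshly computed attention.
        out = attn_edit.clone()
        out[..., common_mask] = attn_src[..., common_mask]
        return out
    if mode == "reweight":
        # Scale the target token's attention column up or down.
        out = attn_edit.clone()
        out[..., token_idx] = out[..., token_idx] * scale
        return out
    raise ValueError(f"unknown mode: {mode}")
```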

A key innovation for auto-regressive models like MUSICGEN is the application of attention injection at all timesteps, ensuring edits affect the entire generated audio. Additionally, the paper introduces ‘soft-blending,’ a technique that merges feature maps from the forward process with injected ones using a weighted average. This dynamic blending factor adjusts based on the attention layer’s position, mimicking the iterative editing approach of diffusion models.
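
A minimal sketch of the soft-blending step is shown below. The linear ramp over layer depth is an assumption for illustration; the paper ties the blending factor to the attention layer's position without specifying this exact schedule here:

```python
import torch

def soft_blend(feat_forward: torch.Tensor, feat_injected: torch.Tensor,
               layer_idx: int, num_layers: int) -> torch.Tensor:
    """Weighted average of the forward-pass features and the injected ones.

    The blending factor alpha depends on the layer's position; a linear
    ramp (0 at the first layer, 1 at the last) is used here purely as an
    illustrative schedule.
    """
    alpha = layer_idx / max(num_layers - 1, 1)
    return alpha * feat_injected + (1.0 - alpha) * feat_forward
```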

Evaluation and Results

To evaluate EDITGEN, a diverse dataset of prompt pairs was created, covering various audio editing axes such as Instrument Change, Mood/Tonal Change, Genre Shift, Melodic Transformation, Harmonic Modification, and Form/Structure Variation. The study generated 660 audio samples across both the diffusion-based Auffusion model and the MUSICGEN-based approach.

Automated metrics were used to assess musical characteristics, including Melody Accuracy, Dynamics Correlation, Rhythm F1 Score, and CLAP Score (for text-audio adherence). The results consistently showed that the MUSICGEN-based approach significantly outperformed the diffusion-based Auffusion across all metrics: it achieved higher melody accuracy, closer similarity to both the original audio and the target text prompt, and stronger dynamics correlation and rhythm F1 scores.
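
For readers who want to compute comparable numbers, here is a rough sketch of two of these metrics using librosa. These are plausible proxies, not necessarily the paper's exact implementations; the 32 kHz default matches MUSICGEN's output sample rate:

```python
import librosa
import numpy as np
from scipy.stats import pearsonr

def dynamics_correlation(y_src, y_edit):
    """Pearson correlation between the two clips' RMS loudness envelopes."""
    n = min(len(y_src), len(y_edit))
    rms_src = librosa.feature.rms(y=y_src[:n])[0]
    rms_edit = librosa.feature.rms(y=y_edit[:n])[0]
    r, _ = pearsonr(rms_src, rms_edit)
    return r

def rhythm_f1(y_src, y_edit, sr=32000, tol=0.05):
    """F1 score over onsets matched within a +/- tol-second window."""
    on_src = librosa.onset.onset_detect(y=y_src, sr=sr, units="time")
    on_edit = librosa.onset.onset_detect(y=y_edit, sr=sr, units="time")
    hits = sum(1 for t in on_edit if on_src.size and np.abs(on_src - t).min() <= tol)
    precision = hits / max(len(on_edit), 1)
    recall = hits / max(len(on_src), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```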

A human study involving 24 evaluators further validated these findings. Participants judged audio clips generated by EDITGEN with MUSICGEN to be more natural and faithful to the original content, considering elements like melody, tempo, and dynamics, as well as textual alignment. This marks the first successful application of the Prompt-to-Prompt technique in the context of auto-regressive audio editing.

This research, detailed in the paper EditGen: Harnessing Cross Attention Control for Instruction-Based Auto-Regressive Audio Editing, paves the way for more intuitive and controllable text-based audio editing, with future work focusing on creating comprehensive audio editing datasets and conducting broader user studies across diverse musical and cultural contexts.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
