Recomposer: Precise Audio Editing with Event-Roll Guidance

TLDR: Recomposer is a new generative audio editing system that allows users to delete, insert, and enhance individual sound events within complex audio scenes. It uses an encoder-decoder transformer model guided by text descriptions and a visual “event roll” for precise temporal control. The system is trained on synthetically mixed audio data and demonstrates significant improvements in targeted audio modifications, highlighting the importance of action, class, and timing information for effective editing.

Imagine being able to precisely edit individual sounds within a complex audio recording, like removing a dog bark or making a doorbell more prominent, even when multiple sounds overlap. This challenging task is now being tackled by a new system called Recomposer, developed by researchers at Google DeepMind. Recomposer offers a novel approach to generative audio editing, allowing users to delete, insert, and enhance specific sound events with remarkable control.

Traditional audio editing software often struggles with intricate soundscapes where different audio events occur simultaneously. Generative models, with their ability to infer and fill in missing details, present a powerful alternative. Recomposer leverages this by treating a sound scene as a collection of individual events, enabling targeted modifications that would be difficult or impossible with conventional tools.

The core of Recomposer’s innovation lies in its “event roll” guided interface. This graphical representation displays the timing of individual sound events, allowing users to specify edits with textual descriptions like “enhance Door” alongside precise time extents. This combination of action, class, and timing information forms what the researchers call an “activity roll,” providing millisecond-level control over the editing process.
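
One way to picture such a roll is as a grid with a row per (action, class) pair and a column per time frame. The sketch below is only an illustration under stated assumptions: the `EditEvent` and `build_activity_roll` names, the 50 Hz frame rate, and the dictionary layout are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical names and values throughout; the article does not specify them.
ACTIONS = ["delete", "insert", "enhance"]
FRAME_RATE_HZ = 50  # assumed frame rate of the roll


@dataclass
class EditEvent:
    action: str      # one of ACTIONS
    label: str       # sound event class, e.g. "Door"
    start_s: float   # edit start time in seconds
    end_s: float     # edit end time in seconds


def build_activity_roll(events, duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Return {(action, label): [0/1 per frame]} marking where each edit is active."""
    n_frames = round(duration_s * frame_rate_hz)
    roll = {}
    for ev in events:
        if ev.action not in ACTIONS:
            raise ValueError(f"unknown action: {ev.action}")
        row = roll.setdefault((ev.action, ev.label), [0] * n_frames)
        first = max(0, round(ev.start_s * frame_rate_hz))
        last = min(n_frames, round(ev.end_s * frame_rate_hz))
        for i in range(first, last):
            row[i] = 1
    return roll


roll = build_activity_roll([EditEvent("enhance", "Door", 1.0, 2.5)], duration_s=4.0)
active = roll[("enhance", "Door")]
print(len(active), sum(active))  # 200 75
```

Here the request “enhance Door” from 1.0 s to 2.5 s becomes a single row that is active for 75 of the 200 frames, which is the action, class, and timing information in one structure.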

At its technical heart, Recomposer is an encoder-decoder transformer model that operates on SoundStream representations of audio. SoundStream is an efficient neural audio codec. The system takes the original audio’s encoding and combines it with time-aligned embeddings of the edit instructions. These embeddings come from a pretrained Sentence-T5 network, which converts each text description into a vector; the vectors are then aligned in time with the activity roll. The transformer generates a sequence of SoundStream tokens, which are finally decoded back into the edited audio waveform.
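
The time alignment can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: `toy_text_embedding` is a stand-in for a real sentence encoder, the dimensions and frame rate are made up, and joining the conditioning to the audio features by concatenation is an assumption, since the article does not say how the two are combined.

```python
import hashlib
import numpy as np

EMBED_DIM = 8  # assumed size; real sentence embeddings are much larger


def toy_text_embedding(text, dim=EMBED_DIM):
    """Stand-in for a sentence encoder: a deterministic unit vector per string."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


def time_aligned_conditioning(edits, n_frames, frame_rate_hz):
    """edits: list of (text, start_s, end_s). Returns an (n_frames, dim) array."""
    cond = np.zeros((n_frames, EMBED_DIM))
    for text, start_s, end_s in edits:
        first = max(0, round(start_s * frame_rate_hz))
        last = min(n_frames, round(end_s * frame_rate_hz))
        # The embedding is present only on frames inside the edit window.
        cond[first:last] += toy_text_embedding(text)
    return cond


n_frames, frame_rate_hz = 200, 50
audio_features = np.zeros((n_frames, 16))  # placeholder for the encoded input audio
cond = time_aligned_conditioning([("enhance Door", 1.0, 2.5)], n_frames, frame_rate_hz)
model_input = np.concatenate([audio_features, cond], axis=-1)  # assumed combination
print(model_input.shape)  # (200, 24)
```

Frames outside the requested window carry an all-zero conditioning vector, so the model sees both what to do and when to do it on a per-frame basis.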

Training such a model requires a large amount of paired data, so the Google DeepMind team created synthetic training examples. They mixed isolated “target” sound events, sourced from Freesound, with dense, real-world “background” sound scenes from AudioSet. Because the components of each mixture are known, this synthetic mixing yields exactly matched pairs of input audio and desired output audio for each edit operation: deletion, insertion, and enhancement. For instance, to train for deletion, they would mix a target sound with a background, and the desired output would be just the background, effectively teaching the model to remove the target.
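
A minimal sketch of this pairing scheme is shown below. Only the deletion case is described explicitly in the article; the insertion and enhancement pairs here are inferred by analogy, and the function names, gain values, and synthetic stand-in signals are all hypothetical.

```python
import numpy as np


def place_event(event, n_samples, start, gain=1.0):
    """Return a length-n_samples track with `event` scaled by `gain` at `start`."""
    track = np.zeros(n_samples)
    end = min(n_samples, start + len(event))
    track[start:end] = gain * event[: end - start]
    return track


def make_training_pair(background, event, start, action,
                       weak_gain=0.2, strong_gain=1.0):
    """Build (model_input, desired_output) for one edit operation."""
    n = len(background)
    if action == "delete":
        # Input contains the target; the desired output is the background alone.
        return background + place_event(event, n, start), background.copy()
    if action == "insert":
        # Assumed mirror image of deletion: the target appears only in the output.
        return background.copy(), background + place_event(event, n, start)
    if action == "enhance":
        # Assumed: same event in both, quiet in the input and louder in the output.
        weak = background + place_event(event, n, start, weak_gain)
        strong = background + place_event(event, n, start, strong_gain)
        return weak, strong
    raise ValueError(f"unknown action: {action}")


rng = np.random.default_rng(0)
background = 0.1 * rng.standard_normal(16000)               # stand-in background scene
event = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)   # stand-in target event

x, y = make_training_pair(background, event, start=8000, action="delete")
```

Because the background and the event are known separately, the desired output is exact by construction, with no manual annotation or cleanup needed.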

The research focused on three primary editing operations:

Delete

This involves removing a specific sound event while maintaining a coherent and natural-sounding background. The model learns to reconstruct the audio as if the target sound had never been there.

Insert

Here, the model generates and places a new sound event of a specified class into the audio at a designated time. This is akin to a conditional text-to-audio generation task.

Enhance

This operation identifies a weak or subtle audio event and regenerates it at a higher amplitude, making it more prominent. It combines elements of source separation and conditional generation.

Evaluation of Recomposer involved two key metrics: Multiscale Spectral Distortion (MSD), which measures signal-level differences in spectrograms, and Classifier KL Divergence (KLD), which assesses how well the generated events match the intended class using a sound event classifier like YAMNet. The results consistently showed that Recomposer significantly improved the audio in the target regions compared to unprocessed inputs. Ablation studies further highlighted the critical role of each conditioning component – the requested action, the target event class, and the precise timing – in achieving successful edits.
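
To make the two kinds of measurement concrete, here is a rough sketch. The exact formulas used in the paper are not given in this article, so both functions are assumptions: the spectral measure is a generic mean absolute log-spectrogram difference over several window sizes, and the KL divergence takes class-probability vectors that would, in practice, come from a sound event classifier.

```python
import numpy as np


def log_magnitude_spectrogram(signal, win, eps=1e-5):
    """Log-magnitude STFT with a Hann window and 75% overlap."""
    hop = win // 4
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.log(np.abs(np.fft.rfft(np.stack(frames), axis=-1)) + eps)


def multiscale_spectral_distance(reference, estimate, window_sizes=(256, 512, 1024)):
    """Mean absolute log-spectrogram difference, averaged over window sizes."""
    per_scale = []
    for win in window_sizes:
        ref_spec = log_magnitude_spectrogram(reference, win)
        est_spec = log_magnitude_spectrogram(estimate, win)
        per_scale.append(np.mean(np.abs(ref_spec - est_spec)))
    return float(np.mean(per_scale))


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two class-probability vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
estimate = reference + 0.5 * rng.standard_normal(16000)
distance = multiscale_spectral_distance(reference, estimate)
divergence = kl_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

Both values are zero when the edited audio matches the reference exactly and grow as the two diverge, which is why lower numbers in the target region indicate a more successful edit.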

While Recomposer demonstrates a powerful proof-of-concept, the researchers acknowledge areas for future development. These include providing more control over the properties of generated events (beyond just class and timing), expanding the vocabulary of event descriptions beyond fixed AudioSet labels, and potentially integrating video-derived conditioning for soundtracks that accompany visuals. The work underscores the feasibility of event-oriented editing in complex sound scenes and paves the way for more intuitive and powerful audio manipulation tools.

For more technical details, you can read the full research paper: Recomposer: Event-roll-guided generative audio editing.
