TLDR: Recomposer is a new generative audio editing system that allows users to delete, insert, and enhance individual sound events within complex audio scenes. It uses an encoder-decoder transformer model guided by text descriptions and a visual “event roll” for precise temporal control. The system is trained on synthetically mixed audio data and demonstrates significant improvements in targeted audio modifications, highlighting the importance of action, class, and timing information for effective editing.
Imagine being able to precisely edit individual sounds within a complex audio recording, like removing a dog bark or making a doorbell more prominent, even when multiple sounds overlap. This challenging task is now being tackled by a new system called Recomposer, developed by researchers at Google DeepMind. Recomposer offers a novel approach to generative audio editing, allowing users to delete, insert, and enhance specific sound events with remarkable control.
Traditional audio editing software often struggles with intricate soundscapes where different audio events occur simultaneously. Generative models, with their ability to infer and fill in missing details, present a powerful alternative. Recomposer leverages this by treating a sound scene as a collection of individual events, enabling targeted modifications that would be difficult or impossible with conventional tools.
The core of Recomposer’s innovation lies in its “event roll” guided interface. This graphical representation displays the timing of individual sound events, allowing users to specify edits with textual descriptions like “enhance Door” alongside precise time extents. This combination of action, class, and timing information forms what the researchers call an “activity roll,” providing millisecond-level control over the editing process.
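To make the idea of combining action, class, and timing concrete, here is a minimal sketch of how an edit request could be rasterized into a frame-level roll. The `Edit` dataclass, the `activity_roll` function, and the 50 Hz frame rate are all assumptions made for illustration; the article does not give the representation used in the actual system.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Edit:
    """One requested edit (hypothetical structure, not from the paper)."""
    action: str       # "delete", "insert", or "enhance"
    event_class: str  # e.g. "Door"
    start_s: float    # start time in seconds
    end_s: float      # end time in seconds


def activity_roll(edits, duration_s, frame_rate_hz=50):
    """Return a (num_edits, num_frames) array with 1.0 where each edit is active.

    frame_rate_hz is an assumed value chosen only for this example.
    """
    num_frames = int(round(duration_s * frame_rate_hz))
    roll = np.zeros((len(edits), num_frames), dtype=np.float32)
    for i, edit in enumerate(edits):
        lo = max(0, int(round(edit.start_s * frame_rate_hz)))
        hi = min(num_frames, int(round(edit.end_s * frame_rate_hz)))
        roll[i, lo:hi] = 1.0
    return roll


# "enhance Door" between 1.0 s and 2.5 s of a 10 s clip.
edits = [Edit("enhance", "Door", 1.0, 2.5)]
roll = activity_roll(edits, duration_s=10.0)
```

Each row of `roll` marks when one edit applies, while the action and class stay attached to the row as text. A finer frame rate would give tighter temporal control at the cost of a longer sequence.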
At its technical heart, Recomposer is an encoder-decoder transformer model that operates on SoundStream representations of audio. SoundStream is an efficient neural audio codec. The system takes the original audio’s encoding and combines it with time-aligned embeddings of the edit instructions. These embeddings are produced by a pretrained Sentence-T5 network, which converts the text descriptions into vectors that are then aligned in time with the activity roll. The transformer then generates a sequence of SoundStream tokens, which are finally converted back into the edited audio waveform.
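The time alignment of the instruction embeddings can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_text_embedding` is a hypothetical stand-in for a sentence encoder such as Sentence-T5 (it just produces a deterministic pseudo-random unit vector per string), the feature dimensions are arbitrary, and concatenation along the feature axis is only one plausible way to "combine" the two streams.

```python
import hashlib

import numpy as np


def toy_text_embedding(text, dim=16):
    """Hypothetical stand-in for a sentence encoder: one fixed unit vector per string."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return (vec / np.linalg.norm(vec)).astype(np.float32)


def time_aligned_conditioning(text, start_frame, end_frame, num_frames, dim=16):
    """Place the text embedding on the frames where the edit is active, zeros elsewhere."""
    cond = np.zeros((num_frames, dim), dtype=np.float32)
    cond[start_frame:end_frame] = toy_text_embedding(text, dim)
    return cond


num_frames, audio_dim = 500, 32
# Placeholder for the per-frame encoding of the original audio (assumed shape).
audio_features = np.zeros((num_frames, audio_dim), dtype=np.float32)

cond = time_aligned_conditioning("enhance Door", 50, 125, num_frames)

# One plausible combination: concatenate audio and conditioning per frame.
encoder_input = np.concatenate([audio_features, cond], axis=-1)
```

With this layout the model sees, at every frame, both what the audio sounds like and whether an edit instruction applies there.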
Training a model like this requires a large amount of paired data, so the Google DeepMind team created synthetic training examples. They mixed isolated “target” sound events, sourced from Freesound, with dense, real-world “background” sound scenes from AudioSet. Because the components of each mixture are known, this synthetic mixing yields exactly matched pairs of input audio and desired output audio for each edit operation: deletion, insertion, and enhancement. For instance, to train for deletion, the input is a target sound mixed with a background and the desired output is the background alone, which teaches the model to remove the target.
The research focused on three primary editing operations:
Delete
This involves removing a specific sound event while maintaining a coherent and natural-sounding background. The model learns to reconstruct the audio as if the target sound had never been there.
Insert
Here, the model generates and places a new sound event of a specified class into the audio at a designated time. This is akin to a conditional text-to-audio generation task.
Enhance
This operation identifies a weak or subtle audio event and regenerates it at a higher amplitude, making it more prominent. It combines elements of source separation and conditional generation.
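The three operations map naturally onto the synthetic mixing used for training. The sketch below builds an (input, desired output) pair for each one from a background clip and an isolated target event. Only the deletion pairing is stated explicitly in this article; the insertion and enhancement pairings, the gain values, and the helper names `place` and `make_pair` are assumptions inferred from the descriptions above.

```python
import numpy as np


def place(target, total_len, offset):
    """Return a silent clip of total_len samples with target written at offset."""
    out = np.zeros(total_len, dtype=np.float32)
    end = min(total_len, offset + len(target))
    out[offset:end] = target[: end - offset]
    return out


def make_pair(background, target, offset, action, weak_gain=0.25, strong_gain=1.0):
    """Build (model_input, desired_output) for one edit action.

    weak_gain and strong_gain are illustrative values, not taken from the paper.
    """
    event = place(target, len(background), offset)
    if action == "delete":
        # Input contains the event; the desired output is the background alone.
        return background + strong_gain * event, background.copy()
    if action == "insert":
        # Input is the background alone; the desired output contains the event.
        return background.copy(), background + strong_gain * event
    if action == "enhance":
        # Input has a quiet event; the desired output has it at a higher level.
        return background + weak_gain * event, background + strong_gain * event
    raise ValueError(f"unknown action: {action}")


# Toy signals standing in for an AudioSet background and a Freesound event.
sample_rate = 16000
rng = np.random.default_rng(0)
background = (0.1 * rng.standard_normal(sample_rate)).astype(np.float32)
t = np.arange(4000, dtype=np.float32) / sample_rate
target = np.sin(2.0 * np.pi * 440.0 * t).astype(np.float32)

pairs = {a: make_pair(background, target, 8000, a) for a in ("delete", "insert", "enhance")}
```

Because the background and the event are known separately, every pair is exact by construction, which is what makes this kind of supervised training possible without hand-edited recordings.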
Evaluation of Recomposer involved two key metrics: Multiscale Spectral Distortion (MSD), which measures signal-level differences in spectrograms, and Classifier KL Divergence (KLD), which assesses how well the generated events match the intended class using a sound event classifier such as YAMNet. The results consistently showed that Recomposer significantly improved the audio in the target regions compared with the unprocessed inputs. Ablation studies further highlighted the critical role of each conditioning component (the requested action, the target event class, and the precise timing) in achieving successful edits.
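For intuition, here is a rough sketch of the two kinds of measurement. The exact definitions used in the paper are not given in this article, so both functions are assumptions: `multiscale_spectral_distance` averages an L1 distance between log-magnitude spectrograms over several window sizes, and `kl_divergence` compares two class-probability vectors of the kind a classifier might output for the reference and the generated audio.

```python
import numpy as np


def stft_mag(x, win):
    """Magnitude spectrogram with a Hann window and a hop of win // 4."""
    hop = win // 4
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))


def multiscale_spectral_distance(ref, est, wins=(256, 512, 1024), eps=1e-5):
    """Mean L1 distance between log-magnitude spectrograms at several resolutions."""
    dists = []
    for win in wins:
        a, b = stft_mag(ref, win), stft_mag(est, win)
        dists.append(np.mean(np.abs(np.log(a + eps) - np.log(b + eps))))
    return float(np.mean(dists))


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two class distributions, with clipping for stability."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, None)
    q = np.clip(np.asarray(q, dtype=np.float64), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Toy signals: a clean tone and a noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
ref = np.sin(2.0 * np.pi * 440.0 * t)
est = ref + 0.1 * rng.standard_normal(len(ref))

msd_same = multiscale_spectral_distance(ref, ref)
msd_noisy = multiscale_spectral_distance(ref, est)
kld = kl_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

Lower is better for both: a spectral distance of zero means the spectrograms match, and a divergence of zero means the classifier sees the same class distribution in both signals.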
While Recomposer demonstrates a powerful proof-of-concept, the researchers acknowledge areas for future development. These include providing more control over the properties of generated events (beyond just class and timing), expanding the vocabulary of event descriptions beyond fixed AudioSet labels, and potentially integrating video-derived conditioning for soundtracks that accompany visuals. The work underscores the feasibility of event-oriented editing in complex sound scenes and paves the way for more intuitive and powerful audio manipulation tools.
For more technical details, you can read the full research paper: Recomposer: Event-roll-guided generative audio editing.


