Phoneme-Level Energy for Expressive AI Singing: A New Approach to Dynamic Control

TLDR: This research introduces a method for controllable Singing Voice Synthesis (SVS) that lets users control the dynamics (loudness variation) of generated singing voices. By explicitly conditioning the SVS model on phoneme-level energy sequences, extracted from spectrograms without manual annotation, the system improves dynamic control over baseline models and also scores higher on perceived audio quality. The approach gives users a manageable way to shape musical expressiveness, a step towards more controllable and natural AI-generated singing.

Singing Voice Synthesis (SVS) has made remarkable strides in generating high-quality audio, but a persistent challenge has been the lack of precise user control over expressive attributes. Most existing SVS systems tend to produce expressive singing in a probabilistic manner, leaving little room for users to dictate specific musical intentions. This new research from Korea University addresses this gap by focusing on a crucial aspect of musical expressiveness: dynamics, which refers to the temporal variation of loudness in a singing voice.

The paper, titled “Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence,” introduces a novel approach to enable explicit and user-friendly dynamic control in SVS. Traditionally, controlling dynamics has been difficult, often relying on implicit modeling or extensive manual annotations. The researchers, Yerin Ryu, Inseop Shin, and Chanwoo Kim, propose a method that conditions the SVS model directly on energy sequences extracted from ground-truth spectrograms. This innovative step significantly reduces the need for costly manual annotations.

A key contribution of this work is the introduction of a phoneme-level energy sequence. While frame-level energy sequences offer high precision, they are impractical for users due to their length and complexity (hundreds or thousands of values for a short song). By aggregating this energy information to the phoneme level, the system provides a more intuitive and manageable interface for users to control the loudness of each individual phoneme, making it the first attempt to enable user-driven dynamics control in SVS at this level.
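To make the aggregation step concrete, here is a minimal sketch in plain Python. It rests on assumptions the article does not confirm: frame energy is taken as the L2 norm of each spectrogram frame, and phoneme-level energy as the mean over the frames a phoneme spans. The function names `frame_energy` and `phoneme_level_energy` and the toy data are illustrative only; the paper may define both quantities differently.

```python
import math
from typing import List


def frame_energy(spectrogram: List[List[float]]) -> List[float]:
    """Per-frame energy, assumed here to be the L2 norm of each frame's bins."""
    return [math.sqrt(sum(v * v for v in frame)) for frame in spectrogram]


def phoneme_level_energy(frame_energies: List[float], durations: List[int]) -> List[float]:
    """Average the frame energies over each phoneme's span (assumed aggregation)."""
    if sum(durations) != len(frame_energies):
        raise ValueError("durations must sum to the number of frames")
    result, start = [], 0
    for d in durations:
        segment = frame_energies[start:start + d]
        # A zero-length phoneme has no frames, so its energy is set to 0.0.
        result.append(sum(segment) / d if d > 0 else 0.0)
        start += d
    return result


# Toy spectrogram: 6 frames with 2 bins each, split across 3 phonemes.
spec = [[3.0, 4.0], [3.0, 4.0], [0.0, 0.0], [6.0, 8.0], [6.0, 8.0], [6.0, 8.0]]
durations = [2, 1, 3]  # frames per phoneme

energies = frame_energy(spec)                                # 6 values, one per frame
phoneme_energy = phoneme_level_energy(energies, durations)   # 3 values, one per phoneme
print(phoneme_energy)  # [5.0, 0.0, 10.0]
```

Six frame-level values collapse to three phoneme-level values here. For a real song the same step shrinks hundreds or thousands of numbers to one per phoneme, which is what makes the sequence practical to edit by hand.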

The model architecture, inspired by diffusion-based frameworks like DiffSinger, uses a Denoising Diffusion Probabilistic Model (DDPM) for mel-spectrogram decoding. It takes lyric, note, and duration sequences as inputs, along with the newly proposed phoneme-level energy sequence. These inputs pass through an FFT block (in this context a feed-forward Transformer block, not a fast Fourier transform) and a length regulator, which expands them to the frame level so they line up before being fed to the decoder. The energy sequence is simply summed with the other input embeddings, so dynamic control is added without a separate conditioning pathway.
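The input path described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the FFT block and the diffusion decoder are omitted, the energy "embedding" is a scalar added to every dimension (a real model would presumably use a learned projection), and summing before length regulation is an assumed ordering. The names `sum_embeddings` and `length_regulate` are hypothetical.

```python
from typing import List

Vector = List[float]


def sum_embeddings(lyric: List[Vector], note: List[Vector], energy: List[float]) -> List[Vector]:
    """Element-wise sum of phoneme-level embeddings.

    The scalar energy is broadcast across the embedding dimension; this
    stands in for a learned energy embedding (an assumption).
    """
    combined = []
    for lyric_vec, note_vec, e in zip(lyric, note, energy):
        combined.append([l + n + e for l, n in zip(lyric_vec, note_vec)])
    return combined


def length_regulate(phoneme_vectors: List[Vector], durations: List[int]) -> List[Vector]:
    """Repeat each phoneme-level vector once per frame of its duration."""
    frames: List[Vector] = []
    for vec, d in zip(phoneme_vectors, durations):
        frames.extend(list(vec) for _ in range(d))
    return frames


# Two phonemes with 2-dimensional toy embeddings.
lyric = [[1.0, 0.0], [0.0, 1.0]]
note = [[0.5, 0.5], [0.5, 0.5]]
energy = [0.25, 0.5]   # one user-supplied energy value per phoneme
durations = [2, 3]     # frames per phoneme

combined = sum_embeddings(lyric, note, energy)
frames = length_regulate(combined, durations)  # 5 frame-level vectors for the decoder
print(len(frames), frames[0], frames[-1])  # 5 [1.75, 0.75] [1.0, 2.0]
```

Because the energy enters as one more term in the sum, changing a single phoneme's value changes the conditioning for every frame of that phoneme and for no others.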

Experimental results highlight the effectiveness of this approach. The proposed method achieved a significant reduction in the Mean Absolute Error (MAE) of energy sequences for phoneme-level inputs, outperforming both baseline models and those relying on implicit energy predictors. Specifically, the phoneme-level model reduced energy MAE from 0.33 (baseline) to 0.14, while the frame-level model achieved an even lower 0.03, demonstrating superior fidelity in replicating energy patterns. This indicates that explicitly providing energy as an input is far more effective for dynamic control than implicit methods.
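For readers who want to see how such a metric is computed, the sketch below shows a plain Mean Absolute Error between two energy sequences, plus the relative reduction implied by the figures quoted above. The helper names and the toy sequences are illustrative; the article does not state how energies are normalized or whether the MAE is taken per frame or per phoneme.

```python
from typing import List


def energy_mae(predicted: List[float], target: List[float]) -> float:
    """Mean absolute error between two equal-length energy sequences."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which an error dropped relative to the baseline."""
    return (baseline - improved) / baseline


# Toy sequences (made up for illustration, not data from the paper).
toy_mae = energy_mae([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])  # (0.5 + 0.0 + 1.0) / 3

# Relative reductions derived from the MAE values reported in the article.
phoneme_gain = relative_reduction(0.33, 0.14)
frame_gain = relative_reduction(0.33, 0.03)
print(toy_mae, round(phoneme_gain, 2), round(frame_gain, 2))  # 0.5 0.58 0.91
```

Read this way, the reported numbers correspond to a cut of roughly 58% in energy error for the phoneme-level model and roughly 91% for the frame-level model, relative to the baseline.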

Beyond objective metrics, subjective evaluations using Mean Opinion Scores (MOS) also showed promising results. The phoneme-level model achieved a MOS of 3.78, higher than the baseline’s 3.43. This suggests that the added control does not come at the expense of the listening experience and may even improve perceived audio quality.

This research paves the way for more natural, expressive, and user-controllable singing voice synthesis. While the current work focuses on dynamic control, the authors suggest that the same energy-sequence input can be integrated with more advanced SVS architectures to enhance other expressive attributes in the future. For more technical details, refer to the full research paper.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
