EGGCodec: Advancing F0 Extraction Through Robust EGG Signal Reconstruction

TLDR: EGGCodec is a new neural framework designed for precise electroglottography (EGG) signal reconstruction and fundamental frequency (F0) extraction. It improves upon existing models by simplifying its architecture (removing the GAN discriminator) and introducing specialized loss functions for both frequency and time domains. By focusing on reconstructing EGG signals, which more accurately reflect vocal fold vibrations, EGGCodec achieves superior F0 extraction accuracy and robustness compared to current state-of-the-art methods, significantly reducing errors in F0 estimation and voicing decisions.

Human speech is complex to analyze, and a crucial element of it is the fundamental frequency, or F0. F0 is the rate at which the vocal folds vibrate, and it carries vital information about prosody and speaker characteristics. Accurate F0 extraction is essential for a wide range of applications, from speech recognition and synthesis to speaker identification and even music research.

Traditionally, F0 has been extracted from microphone-captured speech signals. This is difficult because vocal fold vibration is intricate and recording conditions vary widely. Electroglottography (EGG) offers a more reliable alternative: it measures vocal fold contact through electrodes placed on the neck, so the signal reflects the periodic vibration of the vocal folds more directly than a microphone recording does. That makes EGG signals more accurate, more stable, and well suited to F0 extraction.

In this context, a new framework called EGGCodec has been introduced. EGGCodec is a robust neural codec framework, built on Encodec, that is designed specifically for EGG signal reconstruction and F0 extraction. Encodec is known for compressing and reconstructing speech signals effectively, but applying it directly to EGG reconstruction has been challenging because of its structural complexity, limitations in its loss function, and the inherent instability of training with its Generative Adversarial Network (GAN) discriminator.

EGGCodec addresses these challenges with several key innovations. One significant change is the removal of the conventional GAN discriminator, which streamlines the training process without compromising efficiency. Instead of extracting F0 directly from features, EGGCodec leverages reconstructed EGG signals, which have a closer correspondence to F0. To ensure high fidelity between the reconstructed and target EGG signals, EGGCodec employs a multi-scale frequency-domain loss function that captures the subtle relationships across different frequencies. This is complemented by a time-domain correlation loss, which improves the model’s ability to generalize and maintain accuracy over time.
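
The two loss terms described above can be sketched in a few lines. The article does not give the exact formulas, so this is only an illustration under stated assumptions: the frequency-domain term is written as an L1 distance between magnitude spectrograms at several window sizes, and the time-domain term as one minus the Pearson correlation. The function names, window sizes, and hop lengths are hypothetical, and NumPy is used for readability where a real training loop would use an autodiff framework.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram from Hann-windowed frames (frames x bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_scale_spectral_loss(pred, target, fft_sizes=(64, 128, 256, 512)):
    """Mean L1 spectrogram distance, averaged over several window sizes (assumed form)."""
    losses = []
    for n_fft in fft_sizes:
        hop = n_fft // 4
        diff = stft_magnitude(pred, n_fft, hop) - stft_magnitude(target, n_fft, hop)
        losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))

def correlation_loss(pred, target, eps=1e-8):
    """1 - Pearson correlation: near 0 for matching waveforms, near 2 for inverted ones."""
    p = pred - pred.mean()
    t = target - target.mean()
    corr = np.sum(p * t) / (np.sqrt(np.sum(p**2) * np.sum(t**2)) + eps)
    return float(1.0 - corr)

# Toy signals standing in for a target EGG trace and a noisy reconstruction.
fs = 16000
t = np.arange(fs // 4) / fs
target = np.sin(2 * np.pi * 120 * t)
pred = 0.9 * target + 0.01 * np.random.default_rng(0).standard_normal(t.size)

spectral = multi_scale_spectral_loss(pred, target)
temporal = correlation_loss(pred, target)
```

Short windows resolve timing and long windows resolve frequency, which is the usual reason for combining several scales. The correlation term ignores overall amplitude and penalizes only shape mismatch.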

The process within EGGCodec involves encoding speech signals into compact representations, quantizing them, and then reconstructing them into waveforms. Crucially, EGGCodec shifts its reconstruction target from speech signals to EGG signals. The model therefore learns to generate outputs that closely match EGG signals, so it focuses on reconstructing the vocal fold vibration signal from speech input and captures the fine details of F0 more accurately. This approach not only enhances F0 extraction accuracy but also improves EGGCodec’s ability to characterize the dynamics of vocal fold opening and closing.
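
To make the quantization step more concrete, here is a toy sketch of residual vector quantization, the scheme Encodec uses between its encoder and decoder. Whether EGGCodec keeps the same quantizer settings is not stated in this article, so the codebook sizes, the number of stages, and the random codebooks below are purely illustrative assumptions.

```python
import numpy as np

def rvq_encode(latent, codebooks):
    """Quantize one latent vector with a stack of codebooks; return one index per stage."""
    residual = latent.copy()
    indices = []
    for codebook in codebooks:               # codebook shape: (n_codes, dim)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))          # nearest code to the current residual
        indices.append(idx)
        residual = residual - codebook[idx]  # later stages encode what is left over
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected code vectors to rebuild the quantized latent."""
    return sum(codebook[idx] for idx, codebook in zip(indices, codebooks))

rng = np.random.default_rng(0)
dim, n_codes, n_stages = 8, 32, 4
# Hypothetical random codebooks with shrinking scale; a trained codec learns these.
codebooks = [rng.standard_normal((n_codes, dim)) * 0.5**s for s in range(n_stages)]

latent = rng.standard_normal(dim)            # stands in for one encoder output frame
indices = rvq_encode(latent, codebooks)
quantized = rvq_decode(indices, codebooks)   # what the decoder would receive
```

In EGGCodec the decoder that consumes this quantized representation is trained to output an EGG waveform rather than the input speech.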

For F0 extraction, EGGCodec differentiates the reconstructed EGG signal to create a differential EGG (dEGG) signal. The peaks in the dEGG signal correspond to vocal fold closure instants. A peak detection algorithm identifies these peaks as period markers; the intervals between consecutive peaks give the vibration periods, and F0 is their reciprocal. An important preprocessing step for EGG signals, especially from datasets like PTDB-TUG, is a 50 Hz high-pass filter. This filter removes low-frequency components that originate from throat muscle artifacts rather than vocal fold vibrations, which prevents them from interfering with model training and gives cleaner, more reliable F0 estimation.
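
The filtering and peak-picking steps above might look roughly like the following with SciPy. This is a sketch, not the authors' code: the Butterworth order, the relative peak-height threshold, the 500 Hz upper F0 bound, and the assumption that closure instants show up as positive dEGG peaks are all illustrative choices, and the test signal is synthetic.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

def highpass_50hz(egg, fs, order=4):
    """Zero-phase Butterworth high-pass with a 50 Hz cutoff."""
    sos = butter(order, 50.0, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, egg)

def f0_from_egg(egg, fs, f0_max=500.0, rel_height=0.3):
    """Estimate one F0 value per glottal cycle from dEGG peaks."""
    degg = np.diff(egg) * fs                    # first difference approximates the derivative
    peaks, _ = find_peaks(
        degg,
        height=rel_height * np.max(degg),       # ignore small ripples
        distance=max(1, int(fs / f0_max)),      # at most one peak per shortest allowed period
    )
    periods = np.diff(peaks) / fs               # seconds between successive closure instants
    return peaks, 1.0 / periods

fs = 16000
t = np.arange(int(0.5 * fs)) / fs
theta = 2 * np.pi * 120.0 * t
# Synthetic stand-in for an EGG trace: two harmonics at 120 Hz plus slow baseline drift.
egg = np.sin(theta) + 0.5 * np.sin(2 * theta) + 2.0 * np.sin(2 * np.pi * 5.0 * t)

peaks, f0 = f0_from_egg(highpass_50hz(egg, fs), fs)
median_f0 = float(np.median(f0))                # close to the 120 Hz used to build the signal
```

Because peak positions are rounded to whole samples, per-cycle F0 values jitter slightly around the true value; taking a median or smoothing over a few cycles reduces that.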

Extensive evaluations have demonstrated EGGCodec’s superior performance compared to state-of-the-art F0 extraction schemes. For instance, it reduces the mean absolute error (MAE) from 14.14 Hz to 13.69 Hz and reduces the voicing decision error (VDE) by 38.2%. The model was trained on the PTDB-TUG corpus, which includes synchronized speech and EGG recordings, and evaluated on the CSTR-FDA dataset, a gold-standard pitch determination corpus. The results show that EGGCodec’s reconstructed EGG signals closely match the original signals, especially in voiced regions where the vocal folds are vibrating. Noise augmentation during training also proved crucial, yielding markedly cleaner reconstructed EGG signals and enabling accurate vocal fold cycle detection.
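
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from frame-level F0 tracks. The article does not spell out the definitions used in the paper, so treat these as assumptions: a frame with 0 Hz is taken to be unvoiced, VDE is the fraction of frames whose voicing decision disagrees, and MAE is averaged over frames that both tracks mark as voiced. The toy tracks are made up for illustration.

```python
import numpy as np

def voicing_decision_error(f0_ref, f0_est):
    """Fraction of frames whose voiced/unvoiced decision differs (0 Hz marks unvoiced)."""
    ref_voiced = np.asarray(f0_ref) > 0
    est_voiced = np.asarray(f0_est) > 0
    return float(np.mean(ref_voiced != est_voiced))

def mean_absolute_error(f0_ref, f0_est):
    """Mean |F0 difference| in Hz over frames that both tracks mark as voiced."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    both = (f0_ref > 0) & (f0_est > 0)
    return float(np.mean(np.abs(f0_ref[both] - f0_est[both])))

ref = [0, 100, 110, 120, 0, 0]    # hypothetical reference track, Hz per frame
est = [0, 102, 107, 0, 0, 90]     # hypothetical estimated track

vde = voicing_decision_error(ref, est)   # 2 of 6 frames disagree
mae = mean_absolute_error(ref, est)      # (2 + 3) / 2 = 2.5 Hz

# Relative MAE reduction implied by the figures quoted above.
relative_mae_reduction = (14.14 - 13.69) / 14.14   # roughly 0.032, i.e. about 3.2 %
```

Put side by side, the reported MAE gain is a few percent while the VDE gain is 38.2%, so most of the improvement lies in deciding which frames are voiced.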

Ablation studies, which systematically evaluate each component’s contribution, further validate EGGCodec’s design. The optimal configuration, integrating various loss functions and noise augmentation, achieved the best balance between accuracy and robustness. Even without the GAN discriminator, EGGCodec maintained strong performance, confirming the effectiveness of its simplified training process. This innovative framework not only enhances the accuracy of EGG reconstruction but also significantly contributes to the stability and reliability of F0 extraction, paving the way for more precise speech analysis. For more in-depth technical details, you can refer to the original research paper.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
