TLDR: A new research paper introduces “speaker identity unlearning” for Zero-Shot Text-to-Speech (ZS-TTS) systems. Its authors propose Teacher-Guided Unlearning (TGU), a method that teaches AI models to forget specific voices while retaining the ability to generate high-quality speech for other speakers. This matters for voice privacy, because it lets individuals opt out of having their voices replicated by AI. The paper also introduces a new metric, spk-ZRF, which measures how random the generated voices are for forgotten identities; the more random they are, the harder the forgotten voices are to reconstruct.
In an era where Zero-Shot Text-to-Speech (ZS-TTS) technology is rapidly advancing, enabling highly realistic voice synthesis from just a few seconds of audio, significant privacy and ethical concerns have emerged. Imagine an AI system that can perfectly mimic anyone’s voice with minimal input – while impressive, this capability also poses a threat to individual voice privacy. Until now, there hasn’t been a clear method to selectively remove the ability to replicate unwanted individual voices from these powerful pre-trained models.
A new research paper, titled “Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech,” addresses this critical challenge head-on. Authored by TaeSoo Kim, Jinju Kim, Dongchan Kim, Jong Hwan Ko, and Gyeong-Moon Park, this work introduces the novel concept of speaker identity unlearning for ZS-TTS systems. The core idea is to make these AI models ‘forget’ specific speaker identities while still maintaining their high-quality speech generation capabilities for other voices.
The Challenge of Forgetting in AI
Traditional machine unlearning (MU) techniques often focus on removing the influence of specific training data. However, ZS-TTS models can replicate voices they’ve never been explicitly trained on, making conventional unlearning insufficient. The goal isn’t just to prevent mimicry, but to ensure the generated speech avoids any fixed style that could be traced back to the forgotten speaker. This requires the model to generate speech in a random, variable voice when prompted with a forgotten identity.
Introducing Guided Unlearning: SGU and TGU
The researchers propose the first machine unlearning frameworks for ZS-TTS, called Guided Unlearning. This includes two novel approaches: Sample-Guided Unlearning (SGU) and the more advanced Teacher-Guided Unlearning (TGU).
SGU attempts to guide the model by concatenating a forgotten speaker’s audio with a random speaker’s audio and masking parts of the result. However, this method has a limitation: the model struggles to use both the preceding and the succeeding audio context when infilling, which can lead to unnatural speech patterns caused by mismatches in tempo and rhythm.
TGU, the paper’s primary contribution, overcomes these limitations. It uses the pre-trained ZS-TTS model itself as a ‘teacher.’ For a training example that pairs a forgotten speaker’s voice prompt with text, the teacher generates speech conditioned only on the text, without the voice prompt, which results in a random voice style. This randomly generated speech then becomes the target that the unlearning model is trained to produce when it sees the forgotten speaker’s prompt. As a result, the model learns to produce varying voice styles for forgotten speakers, preventing any consistent or identifiable pattern from emerging. Crucially, TGU also maintains the model’s original performance for speakers it’s supposed to retain.
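The teacher-guided idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `toy_teacher`, `tgu_target`, and `mse` are hypothetical names, speech is represented as a short list of numbers, and plain mean squared error stands in for whatever training loss the real model uses.

```python
import random


def toy_teacher(text_feats, prompt_feats, rng):
    """Hypothetical frozen teacher (a stand-in, not a real ZS-TTS API)."""
    if prompt_feats is None:
        # No speaker prompt: the "voice" component is sampled at random.
        voice = [rng.gauss(0.0, 1.0) for _ in text_feats]
    else:
        # With a prompt, the toy teacher simply copies the prompt's voice.
        voice = prompt_feats
    return [t + v for t, v in zip(text_feats, voice)]


def tgu_target(text_feats, real_speech, is_forget, rng):
    """Pick the training target for one example."""
    if not is_forget:
        # Retained speaker: keep the real recording as the target.
        return real_speech
    # Forgotten speaker: use text-only teacher output, so the target
    # carries a randomly sampled voice instead of the forgotten one.
    return toy_teacher(text_feats, None, rng)


def mse(output, target):
    """Mean squared error, assumed here as a simplified training loss."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)
```

In this sketch the student would still receive the forgotten speaker's prompt as input; only the target changes. Because a fresh random voice is drawn each time, repeated training on the same forgotten prompt does not push the student toward any single replacement voice.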
A New Metric for True Forgetting
To properly evaluate the effectiveness of unlearning, the researchers introduced a new metric: speaker-Zero Retrain Forgetting (spk-ZRF). Unlike standard metrics that only compare performance between forgotten and retained sets, spk-ZRF specifically measures the degree of randomness in the generated speaker identities for forgotten voices. A high spk-ZRF score indicates that the model has truly unlearned, making it difficult to reconstruct or manipulate the unlearned voices, thereby enhancing privacy.
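The article does not give the formula for spk-ZRF, so the following is a sketch of one way a randomness score of this kind could be computed, not the paper's definition. It assumes a ZRF-style construction: speaker embeddings of the unlearned model's output and of randomly voiced reference speech are turned into probability vectors with a softmax, compared with the Jensen-Shannon divergence (base 2, so it lies between 0 and 1), and the score is one minus the average divergence. The function names and the softmax step are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Turn an embedding into a probability vector."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def zrf_style_score(unlearned_embs, random_embs):
    """One minus the mean JS divergence over paired embeddings.

    A score near 1 means the unlearned model's voices are distributed
    like the randomly voiced reference speech.
    """
    divs = [
        js_divergence(softmax(u), softmax(r))
        for u, r in zip(unlearned_embs, random_embs)
    ]
    return 1.0 - sum(divs) / len(divs)
```

Under these assumptions, a higher score means the voices produced for a forgotten speaker look more like random voices, which matches the article's reading that a high spk-ZRF indicates stronger forgetting.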
Promising Results and Future Implications
Experiments conducted on a state-of-the-art ZS-TTS model, VoiceBox, demonstrated TGU’s superior performance. TGU effectively prevented the model from replicating forgotten speakers’ voices while maintaining high quality for other speakers. It achieved a speaker similarity (SIM) score for forgotten voices that closely matched the similarity between actual audio samples from different speakers, indicating effective unlearning. For retained speakers, TGU maintained a high SIM score, showing minimal performance degradation compared to the original model.
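The article does not say how SIM is computed. A common choice in speech synthesis evaluation is the cosine similarity between speaker-verification embeddings of two utterances, and the sketch below assumes that definition. The embeddings here are short hand-written lists and the function names are hypothetical; a real evaluation would obtain embeddings from a speaker-verification model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_sim(pairs):
    """Average similarity over (prompt_embedding, generated_embedding) pairs."""
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

With such a measure, the comparison described above amounts to checking that `mean_sim` over forgotten-speaker pairs falls near the value obtained for real recordings of two different speakers, while `mean_sim` over retained-speaker pairs stays close to the original model's value.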
Furthermore, TGU showed strong scalability, performing consistently well even when unlearning multiple speakers. It also proved effective in out-of-domain scenarios, successfully unlearning voices that were not part of the original training dataset. Human subjective evaluations corroborated these quantitative findings, confirming TGU’s ability to generate distinct voices for forgotten speakers while preserving overall speech quality.
This pioneering work marks a significant step towards ensuring safety and privacy in the use of ZS-TTS models. By enabling individuals to opt out of voice replication, it addresses critical ethical concerns and paves the way for broader, more responsible availability of these powerful AI technologies. The paper can be accessed here: Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech.


