
ArtiFree: Enhancing Speech by Tackling AI-Generated Artifacts

TLDR: The research paper “ARTIFREE: DETECTING AND REDUCING GENERATIVE ARTIFACTS IN DIFFUSION-BASED SPEECH ENHANCEMENT” introduces a system called ArtiFree to address generative artifacts and high inference latency in diffusion-based speech enhancement. It proposes using variance in speech embeddings to predict phonetic errors and an ensemble inference method guided by semantic consistency to select artifact-free outputs. Additionally, the paper explores adaptive diffusion steps to balance artifact suppression and latency, demonstrating significant Word Error Rate (WER) reduction and improved phonetic accuracy.

Speech enhancement (SE) technology plays a crucial role in improving audio quality by suppressing unwanted noise, making applications like telephony, transcription, and assistive hearing devices more robust. Recently, generative models, particularly those based on diffusion, have shown great promise in SE. These models can produce remarkably natural-sounding speech and adapt well to new noise conditions by learning the distribution of clean speech from noisy inputs.

However, these advanced diffusion-based SE models face a couple of significant challenges. One major issue is the introduction of “generative artifacts.” These aren’t just simple distortions; they can include phoneme insertions or substitutions (where the AI hallucinates words or sounds that weren’t there), hiss, breathing artifacts, and high-frequency distortions. These artifacts can severely degrade the performance of downstream tasks like automatic speech recognition (ASR), even if the speech sounds perceptually good to a human ear. Another challenge is the high inference latency, meaning these models can be slow, which is unsuitable for real-time applications.

Introducing ArtiFree: A Solution for Cleaner Speech

A new research paper, “ARTIFREE: DETECTING AND REDUCING GENERATIVE ARTIFACTS IN DIFFUSION-BASED SPEECH ENHANCEMENT”, introduces a systematic approach called ArtiFree to tackle these problems. The core idea behind ArtiFree is to both detect and reduce these generative artifacts while maintaining efficiency.

Detecting Artifacts Through Uncertainty

Unlike traditional predictive SE models that produce a single, deterministic output, diffusion models are stochastic, meaning they can generate multiple plausible outputs from the same noisy input. The researchers observed that not all these outputs are semantically correct; some contain hallucinated phonemes. They found that the variance in speech embeddings (a numerical representation of speech that captures phonetic structure) across multiple diffusion runs can effectively predict where and when these phonetic errors occur. Essentially, if the model is “uncertain” about a particular segment of speech, it tends to generate artifacts, leading to higher variance in its different attempts. This variance acts as a reliable indicator for artifact-prone regions, allowing for 100% accuracy in artifact detection even with a small number of samples.
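As a rough illustration of the idea above, the per-frame variance across several diffusion runs can be computed as follows. This is a minimal sketch, not the paper's implementation: the data layout (S runs, T frames, D embedding dimensions), the function names, the toy values, and the threshold are all assumptions made for the example, and in practice the embeddings would come from a speech embedding model.

```python
from statistics import pvariance


def framewise_variance(runs):
    """Score each frame by how much its embedding varies across runs.

    runs: list of S runs; each run is a list of T frames; each frame is a
    list of D floats (hypothetical layout, chosen for this sketch).
    Returns T scores: variance across runs, averaged over dimensions.
    """
    num_frames = len(runs[0])
    dim = len(runs[0][0])
    scores = []
    for t in range(num_frames):
        per_dim = [pvariance([run[t][d] for run in runs]) for d in range(dim)]
        scores.append(sum(per_dim) / dim)
    return scores


def flag_uncertain_frames(scores, threshold):
    """Return indices of frames whose variance exceeds the threshold."""
    return [t for t, score in enumerate(scores) if score > threshold]


# Toy data: three runs agree on frames 0 and 2 but disagree on frame 1.
runs = [
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.0], [0.1, 0.9], [0.0, 1.0]],
]
scores = framewise_variance(runs)
flagged = flag_uncertain_frames(scores, threshold=0.05)  # assumed threshold
print(flagged)
```

In this toy case only the middle frame is flagged, since it is the only place where the runs disagree.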

Reducing Artifacts with Semantic Consistency

To reduce these artifacts, ArtiFree proposes an ensemble inference method guided by semantic consistency. This involves generating several enhanced speech candidates from the same noisy input using different random seeds. Each candidate’s speech embeddings are then compared. The system selects the output that is most semantically consistent, meaning it aligns best with the overall phonetic structure. This consistency can be determined by comparing candidates to a clean reference (for analysis), the noisy input, or the centroid (average) of all generated samples. By favoring outputs that are semantically aligned, ArtiFree effectively suppresses unstable artifacts without needing to retrain the model. This method achieved up to a 15% reduction in Word Error Rate (WER) in low-SNR conditions, significantly improving phonetic accuracy.
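The selection step described above might look roughly like this. It is a sketch under stated assumptions: each candidate is reduced to a single utterance-level embedding vector, consistency is measured with cosine similarity, and the function names are hypothetical. The article does not specify the similarity measure, so cosine similarity is a stand-in here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_most_consistent(candidates, reference=None):
    """Pick the candidate whose embedding is closest to a reference.

    candidates: list of S embedding vectors, one per enhanced output.
    reference: optional embedding (for example of the noisy input or a
    clean reference); if omitted, the centroid of the candidates is used.
    Returns (index of the selected candidate, list of similarities).
    """
    if reference is None:
        dim = len(candidates[0])
        reference = [
            sum(c[d] for c in candidates) / len(candidates) for d in range(dim)
        ]
    sims = [cosine(c, reference) for c in candidates]
    best = max(range(len(candidates)), key=sims.__getitem__)
    return best, sims


# Toy data: the third candidate is an outlier relative to the other two.
candidates = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
best, sims = select_most_consistent(candidates)
print(best)
```

Passing a `reference` vector switches the same function from centroid-based selection to comparison against the noisy input or a clean reference.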

Balancing Quality and Speed with Adaptive Diffusion Steps

While ensemble inference improves quality, it increases latency because multiple samples need to be generated. To address this, the researchers also investigated the effect of varying the number of reverse diffusion steps (N) during inference. They found that reducing N can substantially lower the real-time factor (RTF), the ratio of processing time to audio duration, or equivalently the time needed to process one second of audio. Lower N values can also slightly improve WER by limiting the chances of hallucinations, though they might lead to a minor drop in perceptual quality metrics like PESQ. By using adaptive N schedules (for example, fewer steps for high-SNR audio and more for low-SNR audio), ArtiFree can achieve a favorable balance between artifact suppression, quality, and latency. Combining a small ensemble size (e.g., S=3) with a reduced number of diffusion steps (e.g., N=10) can restore the runtime to that of default settings while still significantly reducing phoneme artifacts.
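The trade-off between ensemble size and step count can be made concrete with a small sketch. Everything numeric here is an assumption for illustration: the SNR threshold, the two step counts, and the default of 30 steps. The default is inferred from the statement that S=3 with N=10 matches the default runtime, assuming cost grows linearly with S times N; the article does not state it directly.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF: processing time divided by the duration of the audio."""
    return processing_seconds / audio_seconds


def choose_steps(snr_db, low_snr_steps=30, high_snr_steps=10,
                 snr_threshold_db=10.0):
    """Hypothetical adaptive schedule: fewer steps for cleaner input.

    The threshold and step counts are illustrative values, not taken
    from the paper.
    """
    return high_snr_steps if snr_db >= snr_threshold_db else low_snr_steps


def relative_cost(ensemble_size, steps, default_steps=30):
    """Cost relative to one run at the default step count.

    Assumes runtime scales linearly with ensemble_size * steps;
    default_steps=30 is an inferred value, not a stated one.
    """
    return (ensemble_size * steps) / default_steps


print(choose_steps(20.0))                     # high SNR: short schedule
print(choose_steps(0.0))                      # low SNR: long schedule
print(relative_cost(ensemble_size=3, steps=10))
print(real_time_factor(0.5, 2.0))
```

Under the linear-cost assumption, three candidates at ten steps each cost the same as one run at thirty steps, which is the intuition behind the combination mentioned above.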

Conclusion

The ArtiFree framework demonstrates that generative artifacts in diffusion-based speech enhancement can be effectively detected and reduced. By leveraging the variance in speech embeddings for prediction and semantic consistency for selection, along with adaptive diffusion steps for efficiency, this work offers practical strategies for achieving artifact-free outputs in high-fidelity speech applications. The findings highlight the power of semantic priors in guiding generative models towards more accurate and reliable results.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
