ArtiFree: Enhancing Speech by Tackling AI-Generated Artifacts

TLDR: The research paper “ARTIFREE: DETECTING AND REDUCING GENERATIVE ARTIFACTS IN DIFFUSION-BASED SPEECH ENHANCEMENT” introduces a system called ArtiFree to address generative artifacts and high inference latency in diffusion-based speech enhancement. It proposes using variance in speech embeddings to predict phonetic errors and an ensemble inference method guided by semantic consistency to select artifact-free outputs. Additionally, the paper explores adaptive diffusion steps to balance artifact suppression and latency, demonstrating significant Word Error Rate (WER) reduction and improved phonetic accuracy.

Speech enhancement (SE) technology plays a crucial role in improving audio quality by suppressing unwanted noise, making applications like telephony, transcription, and assistive hearing devices more robust. Recently, generative models, particularly those based on diffusion, have shown great promise in SE. These models can produce remarkably natural-sounding speech and adapt well to new noise conditions by learning the distribution of clean speech from noisy inputs.

However, these advanced diffusion-based SE models face two significant challenges. The first is the introduction of “generative artifacts.” These aren’t just simple distortions; they can include phoneme insertions or substitutions (where the model hallucinates words or sounds that weren’t there), hiss, breathing artifacts, and high-frequency distortions. Such artifacts can severely degrade downstream tasks like automatic speech recognition (ASR), even when the enhanced speech sounds good to a human listener. The second is high inference latency: these models can be slow, which makes them unsuitable for real-time applications.

Introducing ArtiFree: A Solution for Cleaner Speech

A new research paper, “ARTIFREE: DETECTING AND REDUCING GENERATIVE ARTIFACTS IN DIFFUSION-BASED SPEECH ENHANCEMENT”, introduces a systematic approach called ArtiFree to tackle these problems. The core idea behind ArtiFree is to both detect and reduce these generative artifacts while maintaining efficiency.

Detecting Artifacts Through Uncertainty

Unlike traditional predictive SE models that produce a single, deterministic output, diffusion models are stochastic, meaning they can generate multiple plausible outputs from the same noisy input. The researchers observed that not all these outputs are semantically correct; some contain hallucinated phonemes. They found that the variance in speech embeddings (a numerical representation of speech that captures phonetic structure) across multiple diffusion runs can effectively predict where and when these phonetic errors occur. Essentially, if the model is “uncertain” about a particular segment of speech, it tends to generate artifacts, leading to higher variance in its different attempts. This variance acts as a reliable indicator for artifact-prone regions, allowing for 100% accuracy in artifact detection even with a small number of samples.
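The variance idea above can be illustrated with a small sketch. It is not the paper's implementation: the data layout `runs[s][t][d]` (run, frame, embedding dimension), the averaging over dimensions, the function names, and the threshold value are all assumptions made for illustration, and the toy numbers stand in for real speech embeddings.

```python
from statistics import pvariance


def frame_variance(runs):
    """Per-frame variance across diffusion runs, averaged over embedding dims.

    runs[s][t][d] is the d-th embedding value of frame t from run s
    (hypothetical layout chosen for this sketch).
    """
    n_frames = len(runs[0])
    n_dims = len(runs[0][0])
    scores = []
    for t in range(n_frames):
        per_dim = [pvariance([run[t][d] for run in runs]) for d in range(n_dims)]
        scores.append(sum(per_dim) / n_dims)
    return scores


def flag_uncertain_frames(runs, threshold):
    """Return indices of frames whose cross-run variance exceeds the threshold."""
    return [t for t, v in enumerate(frame_variance(runs)) if v > threshold]


# Toy data: three runs, three frames, two embedding dimensions.
# The runs agree on frames 0 and 2 but disagree on frame 1.
runs = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]],
    [[1.0, 0.0], [0.1, 0.8], [0.5, 0.5]],
]
scores = frame_variance(runs)
flagged = flag_uncertain_frames(runs, threshold=0.05)  # threshold is arbitrary
```

In this toy case only the frame where the runs disagree is flagged. In practice the threshold would have to be tuned on data.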

Reducing Artifacts with Semantic Consistency

To reduce these artifacts, ArtiFree proposes an ensemble inference method guided by semantic consistency. This involves generating several enhanced speech candidates from the same noisy input using different random seeds. Each candidate’s speech embeddings are then compared. The system selects the output that is most semantically consistent, meaning it aligns best with the overall phonetic structure. This consistency can be determined by comparing candidates to a clean reference (for analysis), the noisy input, or the centroid (average) of all generated samples. By favoring outputs that are semantically aligned, ArtiFree effectively suppresses unstable artifacts without needing to retrain the model. This method achieved up to a 15% reduction in Word Error Rate (WER) in low-SNR conditions, significantly improving phonetic accuracy.
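A minimal sketch of the centroid variant of this selection follows. It assumes each candidate has already been reduced to one flat embedding vector and that consistency is scored with cosine similarity; both choices, along with the function names, are illustrative assumptions rather than details confirmed by the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_by_centroid(candidates):
    """Pick the candidate whose embedding is closest to the ensemble centroid.

    candidates[s] is a flat embedding for the s-th enhanced output
    (hypothetical representation chosen for this sketch).
    Returns (index of the selected candidate, list of similarity scores).
    """
    dim = len(candidates[0])
    centroid = [sum(c[d] for c in candidates) / len(candidates) for d in range(dim)]
    scores = [cosine(c, centroid) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# Toy data: candidates 0 and 1 roughly agree, candidate 2 is an outlier.
candidates = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
best, scores = select_by_centroid(candidates)
```

The outlier receives the lowest score, so it is never selected. Swapping the centroid for the noisy input's embedding would give the noisy-input variant with the same scoring code.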

Balancing Quality and Speed with Adaptive Diffusion Steps

While ensemble inference improves quality, it increases latency because multiple samples need to be generated. To address this, the researchers also investigated the effect of varying the number of reverse diffusion steps (N) during inference. They found that reducing N can substantially lower the real-time factor (RTF), the ratio of processing time to audio duration (equivalently, the time needed to process one second of audio). Lower N values can also slightly improve WER by limiting the chances of hallucinations, though they may cause a minor drop in perceptual quality metrics such as PESQ. By using adaptive N schedules (for example, fewer steps for high-SNR audio and more for low-SNR audio), ArtiFree can achieve a favorable balance between artifact suppression, quality, and latency. Combining a small ensemble size (e.g., S=3) with a reduced number of diffusion steps (e.g., N=10) can restore the runtime to that of default settings while still significantly reducing phoneme artifacts.
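The trade-off can be made concrete with some rough arithmetic. The sketch below assumes that inference cost scales linearly with the total number of diffusion steps (S × N), that the default setting is a single run of 30 steps, and that an SNR estimate is available. The baseline of 30, the SNR thresholds, and the step counts in the schedule are hypothetical values chosen for illustration, not figures from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF: processing time divided by audio duration (below 1 is faster than real time)."""
    return processing_seconds / audio_seconds


def relative_cost(ensemble_size, steps, baseline_steps):
    """Cost relative to one run with baseline_steps, assuming cost is linear in S * N."""
    return (ensemble_size * steps) / baseline_steps


def steps_for_snr(snr_db, schedule=((15.0, 10), (5.0, 20)), max_steps=30):
    """Adaptive N: fewer steps for cleaner input, more for noisier input.

    schedule holds (minimum SNR in dB, steps) pairs in descending SNR order;
    the thresholds and step counts are hypothetical.
    """
    for min_snr_db, steps in schedule:
        if snr_db >= min_snr_db:
            return steps
    return max_steps


# S=3 candidates at N=10 steps cost the same as one run at an assumed default of N=30.
cost = relative_cost(ensemble_size=3, steps=10, baseline_steps=30)
rtf = real_time_factor(processing_seconds=0.5, audio_seconds=2.0)
chosen = [steps_for_snr(snr) for snr in (20.0, 10.0, 0.0)]
```

Under these assumptions the ensemble of three short runs fits in the same compute budget as one default run, which is the balance the paragraph above describes.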

Conclusion

The ArtiFree framework demonstrates that generative artifacts in diffusion-based speech enhancement can be effectively detected and reduced. By leveraging the variance in speech embeddings for prediction and semantic consistency for selection, along with adaptive diffusion steps for efficiency, this work offers practical strategies for achieving artifact-free outputs in high-fidelity speech applications. The findings highlight the power of semantic priors in guiding generative models towards more accurate and reliable results.
