New Benchmark Uncovers How Large Language Models Handle Speech Disfluencies

TLDR: The DRES (Disfluency Removal Evaluation Suite) benchmark evaluates Large Language Models (LLMs) on their ability to remove disfluencies (such as the fillers ‘um’ and ‘uh’, repetitions, and self-corrections) from speech transcripts. The study found that proprietary LLMs outperform open-source ones, that segmenting the text improves performance, and that few-shot prompting should be used cautiously. It also identified two common failure modes, over-deletion and under-deletion, noted that reasoning-tuned models perform poorly on this task, and highlighted that while fine-tuning improves disfluency removal, it can harm generalization to other tasks. The research provides nine practical recommendations for deploying disfluency removal in speech-driven AI systems.

Conversational speech is naturally filled with hesitations, repetitions, and filler words – collectively known as disfluencies. Think of phrases like “um,” “uh,” or self-corrections. While these are a normal part of human conversation, they pose a significant challenge for speech-driven AI systems like voice assistants, summarization tools, and chatbots. These systems, often trained on clean written text, struggle to accurately process and interpret spoken input laden with disfluencies, leading to errors in understanding and degraded performance.

A new research paper titled “DRES: Benchmarking LLMs for Disfluency Removal” introduces a benchmark called DRES (Disfluency Removal Evaluation Suite). Developed by Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, and James Caverlee from Texas A&M University, DRES aims to provide a controlled and reproducible way to evaluate how well Large Language Models (LLMs) can remove these disfluencies from text.

The Challenge of Disfluencies for AI

Disfluencies are common in spoken language but largely absent from written text. As AI interfaces that rely on speech become more prevalent – from smart speakers to voice modes in generative AI – the ability to handle these speech quirks becomes crucial. Current LLMs, primarily trained on written data, often see a drop in performance when faced with disfluent speech. Even advanced Automatic Speech Recognition (ASR) systems, while good at transcribing, frequently miss or misrepresent disfluencies, making it harder to train AI models on realistic spoken data.

DRES tackles this by focusing specifically on LLMs, isolating the disfluency removal task from the complexities of ASR errors and acoustic variations. By using human-annotated transcripts from the Switchboard corpus, DRES establishes a clear “semantic upper bound” for performance, meaning it measures the LLM’s ability to understand and clean up text without the added noise of speech recognition mistakes.

Key Findings from the DRES Benchmark

The researchers conducted an extensive evaluation of various LLMs, including both proprietary (like GPT-4o) and open-source models (like Llama and Qwen), across different sizes, architectures (dense vs. mixture-of-experts), and prompting strategies. Here are some of the key insights and practical recommendations:

Proprietary Models Lead the Way: Models like GPT-4o consistently achieved the highest scores, outperforming open-source alternatives by a significant margin. This suggests that their training data, likely including extensive transcribed speech, gives them an edge. The recommendation is to use proprietary models for production systems, while open-source models need more targeted training with spoken data.
Segmentation Improves Performance: Breaking down long transcripts into smaller segments before feeding them to the LLM consistently improved performance and stability. This highlights a common challenge with LLMs and long-context processing. Applying segmentation as a preprocessing step is highly recommended.
Few-Shot Prompting Needs Caution: Providing a few examples (few-shot prompting) didn’t always improve results. Some models, especially smaller ones, showed slight gains, but others, like certain Llama variants, actually performed worse, sometimes over-editing fluent text. Practitioners should use few-shot prompting carefully and test its effectiveness for their specific model.
Focus on Specific Disfluency Types: The benchmark revealed that LLMs are generally good at handling “edited” disfluencies (like self-corrections) but frequently miss “interjections” (INTJ, e.g., “uh,” “um”) and “parentheticals” (PRN). Future model development should prioritize improving robustness for these under-served categories.
Understanding Failure Modes: The study identified two main failure modes: “over-deletion” (models remove fluent words along with disfluencies) and “under-deletion” (models fail to remove many true disfluencies). Reasoning-oriented models, surprisingly, tended towards extreme over-deletion. Segmentation can help mitigate over-deletion, while models prone to under-deletion might need additional filtering or fine-tuning.
Reasoning Doesn’t Equal Disfluency Removal: Models specifically tuned for reasoning tasks (like o4-mini and Phi-4) performed poorly on disfluency removal, often exhibiting severe over-deletion. This indicates that general reasoning capabilities do not directly translate to this specialized task, emphasizing the need for dedicated evaluation.
Model Size Isn’t Everything: While larger models generally performed better, the gains were not always linear. Some mid-sized models even underperformed smaller or larger variants, suggesting that training data and optimization play a more significant role than just parameter count. Model selection should be based on empirical benchmarks rather than size alone.
Fine-Tuning vs. Generalization: Fine-tuning LLMs specifically for disfluency removal can achieve near state-of-the-art performance. However, this often comes at the cost of degraded performance on unrelated general-purpose tasks (like question answering or common sense reasoning). Fine-tuning is best suited for dedicated disfluency pipelines, not for general conversational AI.
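
The over-deletion and under-deletion failure modes described above can be made concrete with a small word-level scoring sketch. This is only an illustration and not necessarily the metric DRES itself reports: it assumes whitespace tokenization, uses Python's difflib to decide which source words survive in an output, and the function names (deleted_mask, deletion_scores) and the example sentence are hypothetical.

```python
from difflib import SequenceMatcher


def deleted_mask(source, output):
    """Per source token: True if the token does not survive in `output`.

    Assumption: tokens are aligned with difflib's longest-matching-block
    heuristic, which is adequate for a sketch but is not a gold alignment.
    """
    mask = [True] * len(source)
    matcher = SequenceMatcher(a=source, b=output, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = False
    return mask


def deletion_scores(source_text, gold_text, model_text):
    """Compare the words a model deleted with the words a human deleted."""
    source = source_text.lower().split()
    gold = deleted_mask(source, gold_text.lower().split())
    pred = deleted_mask(source, model_text.lower().split())

    correct = sum(g and p for g, p in zip(gold, pred))
    over = sum(p and not g for g, p in zip(gold, pred))   # fluent words removed
    under = sum(g and not p for g, p in zip(gold, pred))  # disfluencies kept

    precision = correct / (correct + over) if correct + over else 1.0
    recall = correct / (correct + under) if correct + under else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"over_deleted": over, "under_deleted": under,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    disfluent = "i uh i want a flight to um boston"
    fluent = "i want a flight to boston"  # hypothetical human reference
    print(deletion_scores(disfluent, fluent, "i want a flight"))
    print(deletion_scores(disfluent, fluent, "i uh i want a flight to boston"))
```

In this framing, an over-deleting model shows low precision (it removes fluent words), while an under-deleting model shows low recall (it leaves disfluencies in place), which is why the two failure modes call for different mitigations.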

Towards More Robust Spoken-Language Systems

The DRES benchmark provides a foundation for building more robust spoken-language systems. The findings highlight consistent performance gaps between proprietary and open-source LLMs, which appear to stem largely from differences in exposure to spoken data during training. The research suggests a modular approach in which a specialized disfluency removal component preprocesses ASR output before it reaches a general-purpose LLM, preserving the flexibility of these models while giving them cleaner input. Future work could explore lightweight adapters, multi-task learning, and continual learning to improve accuracy without sacrificing generalization, ultimately leading to more reliable and user-friendly speech-driven AI applications.
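
The modular approach, combined with the segmentation recommendation, might look roughly like the sketch below. Everything here is an assumption for illustration: the fixed word-count segmentation, the max_words default, and the function names are hypothetical, and toy_remover is a rule-based placeholder standing in for whatever LLM call or fine-tuned model a real pipeline would use.

```python
from typing import Callable, List

# Toy filler list for the placeholder remover only; a real remover would
# also need to handle repetitions, self-corrections, and parentheticals.
FILLERS = {"um", "uh"}


def segment(transcript: str, max_words: int = 40) -> List[str]:
    """Split a transcript into chunks of at most `max_words` words.

    Assumption: fixed-size chunks; a real system might split on
    utterance or sentence boundaries instead.
    """
    words = transcript.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def toy_remover(segment_text: str) -> str:
    """Placeholder for a model call: drops bare filler words only."""
    kept = [w for w in segment_text.split()
            if w.lower().strip(",.") not in FILLERS]
    return " ".join(kept)


def clean_transcript(transcript: str,
                     remover: Callable[[str], str],
                     max_words: int = 40) -> str:
    """Segment, clean each segment independently, then rejoin."""
    cleaned = (remover(chunk) for chunk in segment(transcript, max_words))
    return " ".join(chunk for chunk in cleaned if chunk)


if __name__ == "__main__":
    asr_output = "um I want uh a flight to Boston"
    print(clean_transcript(asr_output, toy_remover, max_words=3))
```

Because the remover is passed in as a callable, the disfluency component can be swapped (a prompted proprietary model, a fine-tuned open-source model, or a rule-based filter) without touching the downstream general-purpose LLM.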
