New Benchmark Uncovers How Large Language Models Handle Speech Disfluencies

TLDR: The DRES (Disfluency Removal Evaluation Suite) benchmark evaluates Large Language Models (LLMs) on their ability to remove disfluencies (such as ‘um’, ‘uh’, and other interjections) from speech transcripts. The study found that proprietary LLMs outperform open-source ones, that segmenting text improves performance, and that few-shot prompting should be used cautiously. It also identified common failure modes like over-deletion and under-deletion, noted that reasoning-tuned models perform poorly on this task, and highlighted that while fine-tuning improves disfluency removal, it can harm generalization to other tasks. The research provides nine practical recommendations for deploying disfluency removal in speech-driven AI systems.

Conversational speech is naturally filled with hesitations, repetitions, and filler words – collectively known as disfluencies. Think of fillers like “um” and “uh,” or self-corrections such as “I want, I mean, I need a flight.” While these are a normal part of human conversation, they pose a significant challenge for speech-driven AI systems like voice assistants, summarization tools, and chatbots. These systems, often trained on clean written text, struggle to accurately process and interpret spoken input laden with disfluencies, leading to errors in understanding and degraded performance.

A new research paper, “DRES: Benchmarking LLMs for Disfluency Removal,” introduces the Disfluency Removal Evaluation Suite (DRES). Developed by Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, and James Caverlee from Texas A&M University, DRES aims to provide a controlled and reproducible way to evaluate how well Large Language Models (LLMs) can remove these disfluencies from text.

The Challenge of Disfluencies for AI

Disfluencies are common in spoken language but largely absent from written text. As AI interfaces that rely on speech become more prevalent – from smart speakers to voice modes in generative AI – the ability to handle these speech quirks becomes crucial. Current LLMs, primarily trained on written data, often see a drop in performance when faced with disfluent speech. Even advanced Automatic Speech Recognition (ASR) systems, while good at transcribing, frequently miss or misrepresent disfluencies, making it harder to train AI models on realistic spoken data.

DRES tackles this by focusing specifically on LLMs, isolating the disfluency removal task from the complexities of ASR errors and acoustic variations. By using human-annotated transcripts from the Switchboard corpus, DRES establishes a clear “semantic upper bound” for performance, meaning it measures the LLM’s ability to understand and clean up text without the added noise of speech recognition mistakes.
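To make the idea of a human-annotated reference concrete, here is a minimal sketch of how a fluent "gold" transcript could be derived from a token-level annotation. The (word, tag) pair representation and the names `DISFLUENT_TAGS` and `fluent_reference` are assumptions made for illustration; they are not the actual DRES or Switchboard file format.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: each token is (word, tag), where tag is one of
# the disfluency categories mentioned in the article, or None for fluent words.
Token = Tuple[str, Optional[str]]

DISFLUENT_TAGS = {"EDITED", "INTJ", "PRN"}


def fluent_reference(tokens: List[Token]) -> List[str]:
    """Keep only the words that the annotators did not mark as disfluent."""
    return [word for word, tag in tokens if tag not in DISFLUENT_TAGS]


# "i want uh i need a flight": a self-correction plus a filler word.
example: List[Token] = [
    ("i", "EDITED"),
    ("want", "EDITED"),
    ("uh", "INTJ"),
    ("i", None),
    ("need", None),
    ("a", None),
    ("flight", None),
]

print(" ".join(fluent_reference(example)))
```

Because the reference comes from human labels rather than ASR output, any difference between a model's output and this reference can be attributed to the model's editing decisions.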

Key Findings from the DRES Benchmark

The researchers conducted an extensive evaluation of various LLMs, including both proprietary (like GPT-4o) and open-source models (like Llama and Qwen), across different sizes, architectures (dense vs. mixture-of-experts), and prompting strategies. Here are some of the key insights and practical recommendations:

  • Proprietary Models Lead the Way: Models like GPT-4o consistently achieved the highest scores, outperforming open-source alternatives by a significant margin. This suggests that their training data, likely including extensive transcribed speech, gives them an edge. The recommendation is to use proprietary models for production systems, while open-source models need more targeted training with spoken data.
  • Segmentation Improves Performance: Breaking down long transcripts into smaller segments before feeding them to the LLM consistently improved performance and stability. This highlights a common challenge with LLMs and long-context processing. Applying segmentation as a preprocessing step is highly recommended.
  • Few-Shot Prompting Needs Caution: Providing a few examples (few-shot prompting) didn’t always improve results. Some models, especially smaller ones, showed slight gains, but others, like certain Llama variants, actually performed worse, sometimes over-editing fluent text. Practitioners should use few-shot prompting carefully and test its effectiveness for their specific model.
  • Focus on Specific Disfluency Types: The benchmark revealed that LLMs are generally good at handling “edited” disfluencies (like self-corrections) but frequently miss “interjections” (INTJ, e.g., “uh,” “um”) and “parentheticals” (PRN). Future model development should prioritize improving robustness for these under-served categories.
  • Understanding Failure Modes: The study identified two main failure modes: “over-deletion” (models remove fluent words along with disfluencies) and “under-deletion” (models fail to remove many true disfluencies). Reasoning-oriented models, surprisingly, tended towards extreme over-deletion. Segmentation can help mitigate over-deletion, while models prone to under-deletion might need additional filtering or fine-tuning.
  • Reasoning Doesn’t Equal Disfluency Removal: Models specifically tuned for reasoning tasks (like o4-mini and Phi-4) performed poorly on disfluency removal, often exhibiting severe over-deletion. This indicates that general reasoning capabilities do not directly translate to this specialized task, emphasizing the need for dedicated evaluation.
  • Model Size Isn’t Everything: While larger models generally performed better, the gains were not always linear. Some mid-sized models even underperformed smaller or larger variants, suggesting that training data and optimization play a more significant role than just parameter count. Model selection should be based on empirical benchmarks rather than size alone.
  • Fine-Tuning vs. Generalization: Fine-tuning LLMs specifically for disfluency removal can achieve near state-of-the-art performance. However, this often comes at the cost of degraded performance on unrelated general-purpose tasks (like question answering or common sense reasoning). Fine-tuning is best suited for dedicated disfluency pipelines, not for general conversational AI.
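The two failure modes in the list above can be counted at the word level once a model's output is aligned back to the original transcript. The sketch below is one possible way to do this, using Python's standard `difflib` for the alignment. The function names and the boolean-label input format are assumptions for illustration, and this is not necessarily the scoring procedure used in the paper.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def kept_mask(source: List[str], output: List[str]) -> List[bool]:
    """For each source token, report whether the alignment finds it in the output."""
    kept = [False] * len(source)
    matcher = SequenceMatcher(a=source, b=output, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            kept[i] = True
    return kept


def deletion_scores(
    source: List[str], is_disfluent: List[bool], output: List[str]
) -> Dict[str, float]:
    """Count correct deletions, over-deletions, and under-deletions."""
    kept = kept_mask(source, output)
    pairs = list(zip(is_disfluent, kept))
    correct = sum(d and not k for d, k in pairs)    # disfluent word removed
    over = sum((not d) and (not k) for d, k in pairs)  # fluent word removed
    under = sum(d and k for d, k in pairs)          # disfluent word left in
    precision = correct / (correct + over) if correct + over else 0.0
    recall = correct / (correct + under) if correct + under else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "correct": correct,
        "over_deleted": over,
        "under_deleted": under,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


source = "i want uh i need a flight".split()
labels = [True, True, True, False, False, False, False]

clean = deletion_scores(source, labels, "i need a flight".split())
over = deletion_scores(source, labels, "need flight".split())
under = deletion_scores(source, labels, "i want i need a flight".split())
```

In this toy example, the second output drops the fluent words "i" and "a" (over-deletion, which lowers precision), while the third leaves "i want" in place (under-deletion, which lowers recall). Words that a model inserts or rewrites are simply unmatched by the alignment, so a fuller evaluation would need to handle them separately.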

Towards More Robust Spoken-Language Systems

The DRES benchmark provides a crucial foundation for advancing robust spoken-language systems. The findings highlight consistent performance gaps between proprietary and open-source LLMs, largely driven by their training exposure. The research suggests a modular approach where specialized disfluency removal components preprocess ASR output before it reaches general-purpose LLMs, helping to maintain the flexibility of these powerful models while ensuring cleaner input. Future work could explore lightweight adapters, multi-task learning, and continual learning to improve accuracy without sacrificing generalization, ultimately leading to more reliable and user-friendly speech-driven AI applications.
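A modular pipeline of this kind, combined with the segmentation recommendation, might look roughly like the sketch below. Everything here is hypothetical: `segment`, `strip_fillers`, and `clean_transcript` are illustrative names, the 50-word limit is an arbitrary default, and the rule-based `strip_fillers` is only a stand-in for whatever disfluency removal model a real system would call.

```python
from typing import Callable, List


def segment(utterances: List[str], max_words: int = 50) -> List[str]:
    """Greedily pack whole utterances into segments of at most max_words words.

    An utterance longer than max_words is kept whole in its own segment.
    """
    segments: List[str] = []
    current: List[str] = []
    count = 0
    for utt in utterances:
        n = len(utt.split())
        if current and count + n > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(utt)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments


FILLERS = {"um", "uh"}


def strip_fillers(text: str) -> str:
    """Stand-in cleaner that drops bare 'um'/'uh'; a real system would call a model."""
    return " ".join(w for w in text.split() if w.lower().strip(",.") not in FILLERS)


def clean_transcript(
    utterances: List[str],
    cleaner: Callable[[str], str] = strip_fillers,
    max_words: int = 50,
) -> List[str]:
    """Segment first, then clean each segment before passing it downstream."""
    return [cleaner(seg) for seg in segment(utterances, max_words)]


utterances = ["um i need a flight", "uh to boston", "on monday"]
cleaned = clean_transcript(utterances, max_words=6)
print(cleaned)
```

Keeping the cleaner behind a plain callable means a fine-tuned disfluency model can be swapped in without touching the general-purpose LLM that consumes the cleaned segments.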

Nikhil Patel
https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
