TLDR: The paper introduces Z-Scores, a new span-level, linguistically-grounded evaluation metric for disfluency removal in speech. Unlike traditional word-based metrics (E-Scores) that only provide aggregate performance, Z-Scores categorize system behavior across distinct disfluency types (EDITED, INTJ, PRN). This allows researchers to identify specific model weaknesses, such as poor handling of interjections or parentheticals, which are often hidden in overall F1 scores. A deterministic alignment module enables Z-Scores to work with generative models, providing diagnostic insights that can guide targeted model improvements like tailored prompts or data augmentation.
Spontaneous speech is a natural part of human communication, but it’s often filled with what linguists call ‘disfluencies.’ These include common interjections like ‘um’ and ‘uh,’ parenthetical phrases such as ‘you know’ or ‘I mean,’ and edited sections where speakers correct themselves, like ‘Where did I put my keys – sorry, phone?’ While these are normal in conversation, they can pose significant challenges for artificial intelligence systems like smart speakers, transcription services, and conversational AI, often degrading their performance.
Historically, evaluating how well AI models remove these disfluencies has relied on word-level metrics, primarily precision, recall, and F1 scores, collectively referred to as E-Scores. While these metrics offer a general sense of a model’s overall performance, they fall short in explaining *why* a model succeeds or fails. For instance, a model might have a decent overall F1 score, yet consistently struggle with specific types of disfluencies, a weakness that remains hidden in the aggregate numbers.
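To make the word-level computation concrete, here is a minimal sketch of precision, recall, and F1 over per-word removal decisions. The function name `e_scores`, the boolean label format, and the example sentence are illustrative assumptions, not taken from the paper or its package.

```python
def e_scores(gold_disfluent, predicted_removed):
    """Word-level precision, recall, and F1 for disfluency removal.

    Both arguments are equal-length lists of booleans, one per word:
    gold_disfluent[i] is True if word i is annotated as disfluent,
    predicted_removed[i] is True if the system removed word i.
    (This input format is an assumption made for illustration.)
    """
    if len(gold_disfluent) != len(predicted_removed):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(gold_disfluent, predicted_removed))
    tp = sum(1 for g, p in pairs if g and p)        # disfluent and removed
    fp = sum(1 for g, p in pairs if not g and p)    # fluent but removed
    fn = sum(1 for g, p in pairs if g and not p)    # disfluent but kept

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example sentence, 11 words.
tokens = "uh i want i need a flight you know to boston".split()
gold_idx = {0, 1, 2, 7, 8}   # "uh", "i want", "you know" are disfluent
pred_idx = {1, 2, 7}         # the system removed only "i want" and "you"

gold = [i in gold_idx for i in range(len(tokens))]
pred = [i in pred_idx for i in range(len(tokens))]
precision, recall, f1 = e_scores(gold, pred)
print(precision, recall, f1)
```

In this toy case the single F1 figure gives no hint that the interjection "uh" was missed entirely, which is the kind of blind spot described above.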
To address this gap, researchers have introduced a new evaluation metric called Z-Scores, which provides a span-level, linguistically grounded assessment of disfluency removal. Unlike E-Scores, Z-Scores break a system’s behavior down by disfluency type: EDITED (false starts and repairs), INTJ (interjections), and PRN (parentheticals). The result is one score per category rather than a single aggregate, so a weakness in one category is visible even when the overall numbers look fine.
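A per-category breakdown of this kind can be sketched as follows. The article does not give the exact Z-Score formula, so this example assumes a simple stand-in: for each category, the fraction of words inside annotated spans of that category that the system removed. The function name `per_category_removal` and the `(category, start, end)` span format are hypothetical.

```python
from collections import defaultdict


def per_category_removal(spans, removed):
    """Fraction of annotated disfluent words removed, per category.

    spans:   list of (category, start, end) with end exclusive, where
             category is e.g. "EDITED", "INTJ", or "PRN".
    removed: list of booleans, True if the system removed that word.

    NOTE: this scoring rule is an assumption for illustration, not the
    paper's definition of Z-Scores.
    """
    total = defaultdict(int)
    hit = defaultdict(int)
    for category, start, end in spans:
        for i in range(start, end):
            total[category] += 1
            if removed[i]:
                hit[category] += 1
    return {category: hit[category] / total[category] for category in total}


# Hypothetical example sentence and annotations.
tokens = "uh i want i need a flight you know to boston".split()
spans = [
    ("INTJ", 0, 1),    # "uh"
    ("EDITED", 1, 3),  # "i want" (repaired by "i need")
    ("PRN", 7, 9),     # "you know"
]
removed_idx = {1, 2, 7}  # the system removed "i want" and "you"
removed = [i in removed_idx for i in range(len(tokens))]

scores = per_category_removal(spans, removed)
print(scores)
```

Under these assumptions the same system output scores perfectly on EDITED, halfway on PRN, and zero on INTJ, a split that a single aggregate number cannot show.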
A key component of the Z-Score framework is its deterministic alignment module. This module ensures a robust and reliable mapping between the text generated by an AI model and the original disfluent transcript. This alignment is vital because it enables the evaluation of generative language models (GMs), such as large language models (LLMs) and small language models (SLMs), for disfluency removal. Previous methods often struggled with this alignment, limiting the use of powerful generative models in this task or relying on less informative n-gram-based metrics.
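The article does not describe how the alignment module works internally. As a rough illustration of the idea only, the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner: words of the disfluent transcript that match the model output are treated as kept, and the rest as removed. The function name `removed_flags` and the example strings are assumptions.

```python
from difflib import SequenceMatcher


def removed_flags(source_tokens, output_tokens):
    """Mark which source words are absent from the generated output.

    Uses difflib's matching blocks as a deterministic alignment. This is
    a stand-in for illustration, not the paper's alignment module.
    """
    matcher = SequenceMatcher(a=source_tokens, b=output_tokens, autojunk=False)
    kept = [False] * len(source_tokens)
    for block in matcher.get_matching_blocks():
        # Each block says source[block.a : block.a + block.size] matches
        # output[block.b : block.b + block.size].
        for i in range(block.a, block.a + block.size):
            kept[i] = True
    return [not k for k in kept]


# Hypothetical disfluent transcript and generative-model output.
source = "uh i want i need a flight you know to boston".split()
output = "i need a flight to boston".split()

flags = removed_flags(source, output)
removed_words = [w for w, gone in zip(source, flags) if gone]
print(removed_words)
```

Once every source word carries a removed-or-kept flag, free-form model output can be scored against the span annotations of the original transcript. A simple matcher like this does not handle output words that never appeared in the source, which a real alignment module would need to account for.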
The diagnostic power of Z-Scores is particularly evident in a case study involving LLMs and metaprompting. When a baseline prompt (P0) was used, the E-Scores suggested reasonable overall performance. However, the Z-Scores painted a different picture, revealing clear weaknesses: the model performed very well on EDITED disfluencies but struggled significantly with INTJ (interjections) and PRN (parentheticals). These specific deficiencies were completely obscured by the aggregate E-Scores.
When metaprompts (P1 and P2) that included explicit examples of INTJ and PRN disfluencies were introduced, the Z-Scores for INTJ and PRN removal increased substantially, while performance on EDITED disfluencies remained stable. This directly demonstrated that the improvements were localized to the linguistic phenomena targeted by the new prompts.
This ability to pinpoint exactly which types of disfluencies a model handles well, and which it struggles with, is the core value of Z-Scores. It turns evaluation from a single aggregate score into a diagnostic tool. Researchers can now identify specific model failure modes and design targeted interventions, such as crafting more effective prompts, augmenting training data with specific disfluency types, or even developing specialized architectural components, and then check whether the targeted category actually improves.
The development of Z-Scores marks a significant step forward in the evaluation of disfluency removal systems. By complementing traditional word-level metrics with a linguistically informed, span-level assessment, Z-Scores provide a deeper understanding of AI model behavior. This empowers researchers and practitioners to refine models more effectively, ultimately leading to more fluent and accurate AI-generated text. The researchers have also made an open-source Python package available, providing a standardized resource for future research and development in this area. You can find more details in the research paper.


