StutterCut: A New Approach to Pinpointing Speech Dysfluencies

TLDR: StutterCut is a semi-supervised method that uses graph partitioning to accurately segment speech dysfluencies like stuttering. It refines connections between speech segments using a classifier trained on less precise labels, guided by an uncertainty measure. The method outperforms existing techniques on real and synthetic datasets, and the researchers also introduced FluencyBank++, a new dataset with detailed dysfluency boundaries.

The paper introduces StutterCut, a method designed to identify and mark the exact moments when speech dysfluencies, such as stuttering, occur. Precise timing of this kind matters both for speech therapy and for giving real-time feedback to people who stutter.

Many existing methods can only tell whether a whole utterance contains a dysfluency; they cannot pinpoint where it starts or ends. This lack of precise timing makes it difficult to offer targeted therapy or to create realistic synthetic speech for practice. A major obstacle is the scarcity of real-world datasets with detailed, frame-level dysfluency labels. Synthetic datasets exist, but they often fail to capture the natural variation of real speech and can introduce artificial sounds.

StutterCut tackles this problem by treating dysfluency segmentation as a graph partitioning problem. Imagine speech as a series of overlapping sound snippets, each represented as a node in a network. The connections between these nodes show how similar the snippets are. StutterCut uses a technique called Normalised Cut to divide this network into two main groups: dysfluent and non-dysfluent speech.
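For reference, the standard Normalised Cut criterion (Shi and Malik) can be written as below. The notation is assumed here for illustration and may differ from the paper's: V is the set of all nodes, A and B are the two groups, and w_ij is the similarity between snippets i and j.

```latex
\mathrm{Ncut}(A, B)
  = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)}
  + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)},
\qquad
\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij},
\quad
\mathrm{assoc}(A, V) = \sum_{i \in A,\, j \in V} w_{ij}
```

Minimising this quantity favours a split that severs only weak connections while keeping each group well connected internally, which discourages cutting off a single isolated snippet.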

What makes StutterCut unique is its semi-supervised approach. It doesn’t rely on expensive, human-annotated, frame-level labels for training. Instead, it uses a “pseudo-oracle” classifier, which is trained on less precise, utterance-level labels. This classifier acts like an expert guide, helping to refine the connections between the speech snippets in the network. The influence of this guidance is carefully controlled by an “uncertainty measure” derived from Monte Carlo dropout, ensuring that only reliable predictions are used. This means the system can learn effectively even with less detailed initial data.
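The idea behind Monte Carlo dropout can be sketched as follows: dropout stays switched on at inference time, the same input is passed through the classifier several times, and the spread of the predictions is read as uncertainty. The tiny two-layer network, the random weights, the number of passes, and the 0.2 threshold are all illustrative assumptions, not the paper's pseudo-oracle.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, n_passes=50, drop_p=0.5, seed=0):
    """Stochastic forward passes of a toy 2-layer classifier with dropout kept on.

    Returns the mean predicted probability and its standard deviation
    (used here as the uncertainty) for each input row.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)             # ReLU hidden layer
        keep = rng.random(hidden.shape) >= drop_p    # fresh dropout mask per pass
        hidden = hidden * keep / (1.0 - drop_p)      # inverted-dropout scaling
        logit = np.clip(hidden @ w2, -30.0, 30.0)    # clip to keep exp() stable
        probs.append(1.0 / (1.0 + np.exp(-logit)))   # sigmoid
    probs = np.stack(probs)                          # (n_passes, n_windows)
    return probs.mean(axis=0), probs.std(axis=0)

# Hypothetical data: 6 windows with 8-dimensional features and random weights.
rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)

mean_prob, uncertainty = mc_dropout_predict(x, w1, w2)
confident = uncertainty < 0.2   # assumed threshold; only these windows would guide the graph
```

A real system would run the trained classifier rather than random weights; the point is only that the same input yields a distribution of outputs whose spread can gate how much the prediction is trusted.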


How StutterCut Works

The method involves five main stages:

First, the input speech is broken down into overlapping windows, and embeddings (numerical representations) are extracted from each window to form the nodes of a graph.
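A minimal sketch of this windowing stage, assuming 0.5-second windows with a 0.25-second hop (the article does not state the actual sizes). The `toy_embedding` function is a hypothetical stand-in based on a log-magnitude spectrum; it is not the embedding model used in the paper.

```python
import numpy as np

def sliding_windows(signal, sr, win_s=0.5, hop_s=0.25):
    """Split a 1-D signal into overlapping windows.

    Returns the windows and their (start, end) times in seconds.
    """
    win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    starts = np.arange(0, max(len(signal) - win, 0) + 1, hop)
    windows = np.stack([signal[s:s + win] for s in starts])
    times = np.stack([starts, starts + win], axis=1) / sr
    return windows, times

def toy_embedding(windows, n_bins=32):
    """Hypothetical placeholder embedding: log-magnitude of the first FFT bins."""
    spectrum = np.abs(np.fft.rfft(windows, axis=1))[:, :n_bins]
    return np.log1p(spectrum)

sr = 16000
signal = np.random.default_rng(0).normal(size=2 * sr)  # 2 s of noise as dummy audio
windows, times = sliding_windows(signal, sr)
nodes = toy_embedding(windows)                          # one row per graph node
```

Each row of `nodes` corresponds to one window, and `times` keeps the window boundaries so that cluster labels can later be mapped back to seconds.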

Second, a “pseudo-oracle” classifier, trained on weak dysfluency labels, generates another set of similarities between these nodes based on its predictions.

Third, the initial graph’s similarities are refined by integrating the knowledge from the pseudo-oracle, with its influence adjusted by an uncertainty mask. This mask ensures that only confident predictions from the pseudo-oracle guide the segmentation.
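The second and third stages can be sketched together. The article does not give the paper's refinement formula, so the blending rule below is an assumption made for illustration: classifier similarity is taken as one minus the gap between two windows' predicted probabilities, and it is mixed into the embedding similarity only for pairs where both windows have low uncertainty. The names `alpha` and `max_unc` are hypothetical parameters.

```python
import numpy as np

def cosine_affinity(emb):
    """Pairwise cosine similarity between window embeddings, clipped to [0, 1]."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True).clip(min=1e-12)
    unit = emb / norms
    return np.clip(unit @ unit.T, 0.0, 1.0)

def refine_affinity(w_emb, probs, uncertainty, max_unc=0.1, alpha=0.5):
    """Blend classifier-derived similarity into the graph for confident pairs only."""
    s_oracle = 1.0 - np.abs(probs[:, None] - probs[None, :])   # similar predictions -> near 1
    confident = uncertainty <= max_unc
    mask = (confident[:, None] & confident[None, :]).astype(float)
    return (1.0 - alpha * mask) * w_emb + alpha * mask * s_oracle

# Hypothetical inputs: 4 windows, with the last one too uncertain to be trusted.
emb = np.random.default_rng(0).normal(size=(4, 3))
probs = np.array([0.9, 0.8, 0.1, 0.5])
uncertainty = np.array([0.01, 0.02, 0.03, 0.5])

w_emb = cosine_affinity(emb)
w_refined = refine_affinity(w_emb, probs, uncertainty)
```

With this masking, every edge touching the uncertain fourth window keeps its original embedding-based weight, so an unreliable prediction cannot distort the graph.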

Fourth, the Normalised Cut algorithm is applied to this refined graph to partition it into dysfluent and non-dysfluent clusters.
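A two-way Normalised Cut is commonly approximated with the spectral relaxation of Shi and Malik: take the eigenvector for the second-smallest eigenvalue of the normalised graph Laplacian and split nodes by its sign. The sketch below shows that standard relaxation on a toy similarity matrix; whether the paper uses this exact variant is an assumption.

```python
import numpy as np

def normalised_cut_bipartition(w):
    """Approximate 2-way Normalised Cut on a symmetric affinity matrix `w`.

    Returns a boolean array assigning each node to one of two clusters.
    """
    degree = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}
    lap_sym = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap_sym)       # eigenvalues come back in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]       # second-smallest eigenvector, rescaled
    return fiedler > 0

# Toy graph: two groups of three windows, strongly linked within, weakly across.
w = np.full((6, 6), 0.01)
w[:3, :3] = 1.0
w[3:, 3:] = 1.0
labels = normalised_cut_bipartition(w)
```

The sign of an eigenvector is arbitrary, so this split only says which windows belong together. Deciding which of the two clusters is the dysfluent one needs a separate rule, which this sketch leaves out.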

Finally, a boundary extraction process merges consecutive dysfluent windows into continuous segments, defining the start and end times of the dysfluencies.
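Boundary extraction of this kind can be sketched as a single pass over the per-window labels, extending the current segment whenever the next window is also dysfluent. The function name and the example window times are illustrative assumptions.

```python
def merge_dysfluent_windows(labels, times):
    """Merge runs of consecutive dysfluent windows into (start, end) segments."""
    segments = []
    prev_dysfluent = False
    for is_dysfluent, (start, end) in zip(labels, times):
        if is_dysfluent and prev_dysfluent:
            segments[-1][1] = end              # extend the open segment
        elif is_dysfluent:
            segments.append([start, end])      # open a new segment
        prev_dysfluent = bool(is_dysfluent)
    return [(s, e) for s, e in segments]

# Seven 0.5 s windows with a 0.25 s hop; 1 marks a window labelled dysfluent.
times = [(0.0, 0.5), (0.25, 0.75), (0.5, 1.0), (0.75, 1.25),
         (1.0, 1.5), (1.25, 1.75), (1.5, 2.0)]
labels = [0, 1, 1, 0, 0, 1, 0]

segments = merge_dysfluent_windows(labels, times)
# segments == [(0.25, 1.0), (1.25, 1.75)]
```

Because the windows overlap, each segment boundary is only as precise as the hop size, which is one reason window sizing affects onset accuracy.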

To further support research in this area, the authors have also extended an existing dataset called FluencyBank, creating FluencyBank++. This new version includes precise, frame-level boundaries for four types of dysfluencies: prolongation, repetition, interjection, and block. This provides a more authentic benchmark for evaluating new methods compared to purely synthetic datasets.

Experiments conducted on both real-world (FluencyBank++) and synthetic datasets (VCTK-TTS) show that StutterCut performs better than existing methods. It achieves higher F1 scores, which balance precision and recall, and more accurate detection of stuttering onset. For instance, on FluencyBank++, StutterCut significantly improved overall F1 scores compared to previous state-of-the-art methods like WhisterML and YOLO-Stutter. The paper notes that StutterCut is particularly effective due to the combination of graph-based clustering and the classifier-guided constraints.
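To make the F1 metric concrete, the sketch below scores a predicted segment against a reference segment on a 10 ms frame grid. The frame-level scoring scheme, the grid size, and the example segments are assumptions for illustration; the article does not describe the paper's exact evaluation protocol.

```python
import numpy as np

def segments_to_frames(segments, duration, step=0.01):
    """Rasterise (start, end) segments in seconds onto a boolean frame grid."""
    frames = np.zeros(int(round(duration / step)), dtype=bool)
    for start, end in segments:
        frames[int(round(start / step)):int(round(end / step))] = True
    return frames

def frame_f1(pred, ref):
    """F1 = harmonic mean of precision and recall over dysfluent frames."""
    tp = np.sum(pred & ref)
    fp = np.sum(pred & ~ref)
    fn = np.sum(~pred & ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the prediction starts 0.1 s late and ends 0.2 s late.
ref = segments_to_frames([(0.5, 1.0)], duration=2.0)
pred = segments_to_frames([(0.6, 1.2)], duration=2.0)
score = frame_f1(pred, ref)   # precision 40/60, recall 40/50
```

Here the late ending costs precision and the late start costs recall, so F1 penalises both kinds of boundary error.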

The researchers acknowledge that StutterCut still faces challenges with certain types of interjections that lack clear pauses or articulatory struggle, and sometimes misclassifies blocks as pauses. Future work aims to address these limitations by exploring adaptive window sizing, incorporating additional acoustic features, and evaluating the method on multilingual stuttering corpora to ensure its broad applicability.

The code and the FluencyBank++ dataset will be made publicly available, supporting reproducibility and further advances in dysfluency segmentation. More details about this research are available in the full paper.
