TLDR: StutterCut is a semi-supervised method that uses graph partitioning to accurately segment speech dysfluencies like stuttering. It refines connections between speech segments using a classifier trained on less precise labels, guided by an uncertainty measure. The method outperforms existing techniques on real and synthetic datasets, and the researchers also introduced FluencyBank++, a new dataset with detailed dysfluency boundaries.
The paper introduces a new method called StutterCut, designed to accurately identify and mark the exact moments when speech dysfluencies, like stuttering, occur. This is a significant step forward both for speech therapy and for providing real-time feedback to individuals who stutter.
Many existing methods can only tell whether a whole utterance contains a dysfluency; they can’t pinpoint where it starts or ends. This lack of precise timing makes it difficult to offer targeted therapy or create realistic synthetic speech for practice. A major challenge in this field is the scarcity of real-world datasets that have detailed, frame-level labels for dysfluencies. Synthetic datasets exist, but they often don’t capture the natural variations of real speech and can introduce artificial sounds.
StutterCut tackles this problem by treating dysfluency segmentation as a graph partitioning problem. Imagine speech as a series of overlapping sound snippets, each represented as a node in a network. The connections between these nodes show how similar the snippets are. StutterCut uses a technique called Normalised Cut to divide this network into two main groups: dysfluent and non-dysfluent speech.
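To make the graph idea concrete, here is a minimal sketch of a two-way Normalised Cut using its standard spectral relaxation (thresholding the second eigenvector of the normalised graph Laplacian). It is an illustration, not the authors' implementation: the function names, the cosine-similarity edge weights, and the tiny 2-D embeddings are all assumptions made for the example.

```python
import numpy as np

def cosine_affinity(embeddings):
    """Build a similarity graph: one node per window, edge weight = cosine similarity."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = np.clip(X @ X.T, 0.0, None)   # keep edge weights non-negative
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W

def normalised_cut_bipartition(W):
    """Split a graph into two groups with the spectral relaxation of Normalised Cut.

    W: symmetric (n, n) array of non-negative similarities.
    Returns a boolean array; True and False mark the two groups.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                          # degree of each node
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    # Eigenvector of the second-smallest eigenvalue, mapped back to node space
    indicator = d_inv_sqrt * eigvecs[:, 1]
    return indicator > 0                       # the sign decides the group

# Toy example: six hypothetical 2-D window embeddings (real embeddings are far larger).
embeddings = [[1.0, 0.10], [1.0, 0.00], [0.9, 0.10], [1.0, 0.05],  # windows 0-3 sound alike
              [0.1, 1.00], [0.0, 1.00]]                            # windows 4-5 sound alike
groups = normalised_cut_bipartition(cosine_affinity(embeddings))
print(groups)  # windows 0-3 share one label, windows 4-5 the other
```

Which group comes out as `True` is arbitrary, because an eigenvector's sign is arbitrary. In this sketch the cut only produces two unlabeled groups, so deciding which one is dysfluent needs extra information.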
What makes StutterCut unique is its semi-supervised approach. It doesn’t rely on expensive, human-annotated, frame-level labels for training. Instead, it uses a “pseudo-oracle” classifier, which is trained on less precise, utterance-level labels. This classifier acts like an expert guide, helping to refine the connections between the speech snippets in the network. The influence of this guidance is carefully controlled by an “uncertainty measure” derived from Monte Carlo dropout, ensuring that only reliable predictions are used. This means the system can learn effectively even with less detailed initial data.
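The sketch below shows one way uncertainty-gated guidance could look. Monte Carlo dropout is simulated by running a toy logistic classifier several times with dropout left on, and the spread of its predictions serves as the uncertainty. The classifier, the standard-deviation threshold `max_std`, and the blending weight `alpha` are hypothetical choices for illustration; the paper's actual classifier and refinement rule may differ.

```python
import numpy as np

def mc_dropout_predict(features, weights, bias, n_passes=50, drop_prob=0.2, seed=0):
    """Run a toy logistic classifier n_passes times with dropout kept active.

    Returns (mean probability, standard deviation) per window.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    probs = np.empty((n_passes, features.shape[0]))
    for t in range(n_passes):
        keep = rng.random(features.shape) >= drop_prob
        dropped = features * keep / (1.0 - drop_prob)   # inverted dropout scaling
        logits = dropped @ weights + bias
        probs[t] = 1.0 / (1.0 + np.exp(-logits))
    return probs.mean(axis=0), probs.std(axis=0)

def refine_affinity(W, mean_prob, std_prob, max_std=0.1, alpha=0.5):
    """Blend acoustic similarity with classifier agreement, but only for pairs
    of windows where the classifier is confident about both (assumed rule)."""
    confident = std_prob <= max_std
    mask = confident[:, None] & confident[None, :]
    # Agreement is 1 when two windows get the same probability, 0 when opposite
    agreement = 1.0 - np.abs(mean_prob[:, None] - mean_prob[None, :])
    return np.where(mask, (1.0 - alpha) * W + alpha * agreement, W)

# Hypothetical inputs: three windows with 2-D features and a flat acoustic affinity.
features = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, -3.0]])
weights, bias = np.array([2.0, 2.0]), -1.0
mean_prob, std_prob = mc_dropout_predict(features, weights, bias)
W_refined = refine_affinity(np.full((3, 3), 0.5), mean_prob, std_prob)
```

The gate leaves an edge untouched whenever either window's prediction is unstable across dropout passes, so unreliable classifier output cannot distort the graph.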
How StutterCut Works
The method involves five main stages:
First, the input speech is broken down into overlapping windows, and embeddings (numerical representations) are extracted from each window to form the nodes of a graph.
Second, a “pseudo-oracle” classifier, trained on weak dysfluency labels, generates another set of similarities between these nodes based on its predictions.
Third, the initial graph’s similarities are refined by integrating the knowledge from the pseudo-oracle, with its influence adjusted by an uncertainty mask. This mask ensures that only confident predictions from the pseudo-oracle guide the segmentation.
Fourth, the Normalised Cut algorithm is applied to this refined graph to partition it into dysfluent and non-dysfluent clusters.
Finally, a boundary extraction process merges consecutive dysfluent windows into continuous segments, defining the start and end times of the dysfluencies.
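The last stage is easy to illustrate. The following sketch merges runs of consecutive dysfluent windows into time segments. The window length and hop size are assumed values, since this summary does not state the ones used in the paper, and `windows_to_segments` is a hypothetical helper name.

```python
def windows_to_segments(labels, win_s=0.5, hop_s=0.25):
    """Merge runs of consecutive dysfluent windows into (start, end) times in seconds.

    labels: one truthy/falsy flag per window, in time order.
    win_s, hop_s: window length and hop in seconds (assumed values).
    """
    segments = []
    run_start = None
    for i, flagged in enumerate(labels):
        if flagged and run_start is None:
            run_start = i                                  # a dysfluent run begins
        elif not flagged and run_start is not None:
            # The run ended at window i - 1; its end time includes that window's length
            segments.append((run_start * hop_s, (i - 1) * hop_s + win_s))
            run_start = None
    if run_start is not None:                              # run reaches the last window
        segments.append((run_start * hop_s, (len(labels) - 1) * hop_s + win_s))
    return segments

labels = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
print(windows_to_segments(labels))  # [(0.5, 1.5), (2.0, 2.5)]
```

Because the windows overlap, a segment's end time is taken from the end of its last flagged window rather than from the start of the next fluent one.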
To further support research in this area, the authors have also extended an existing dataset called FluencyBank, creating FluencyBank++. This new version includes precise, frame-level boundaries for four types of dysfluencies: prolongation, repetition, interjection, and block. This provides a more authentic benchmark for evaluating new methods compared to purely synthetic datasets.
Experiments conducted on both real-world (FluencyBank++) and synthetic datasets (VCTK-TTS) show that StutterCut performs better than existing methods. It achieves higher F1 scores, which balance precision and recall, and more accurate detection of stuttering onset. For instance, on FluencyBank++, StutterCut significantly improved overall F1 scores compared to previous state-of-the-art methods like WhisterML and YOLO-Stutter. The paper notes that StutterCut is particularly effective due to the combination of graph-based clustering and the classifier-guided constraints.
The researchers acknowledge that StutterCut still faces challenges with certain types of interjections that lack clear pauses or articulatory struggle, and sometimes misclassifies blocks as pauses. Future work aims to address these limitations by exploring adaptive window sizing, incorporating additional acoustic features, and evaluating the method on multilingual stuttering corpora to ensure its broad applicability.
The authors plan to make the code and the FluencyBank++ dataset publicly available, which should support reproducibility and further advances in dysfluency segmentation. More details about this research can be found in the full paper.


