TLDR: This research introduces Graph Connectionist Temporal Classification (GTC) for Automatic Phoneme Recognition (APR) systems. It addresses the challenge of noisy pseudo-labels from Grapheme-to-Phoneme (G2P) systems, which often provide multiple possible pronunciations for a word. Unlike standard Connectionist Temporal Classification (CTC), GTC lets the model train against a graph of alternative phoneme sequences, treating every pronunciation in the graph as valid supervision. Experiments on English and Dutch datasets demonstrate that incorporating these pronunciation variations consistently improves phoneme error rates, offering a more robust training strategy for APR systems.
Automatic Phoneme Recognition (APR) systems are crucial for various applications, from assisting speech recognition in less-resourced languages to aiding language learning and speech pathology. However, a significant hurdle in training these systems is the scarcity of manually annotated data, which is both time-consuming and expensive to produce.
A common workaround involves using Grapheme-to-Phoneme (G2P) systems. These systems convert written words into their phonetic transcriptions, which then serve as ‘pseudo-labels’ for training APR models. While this approach allows for training on large datasets of utterance-text pairs, it introduces a new challenge: G2P systems often generate multiple possible pronunciations for a single word. This reflects the natural variability in how words are spoken, but standard training methods, particularly those using Connectionist Temporal Classification (CTC) loss, struggle to account for this inherent ambiguity.
Addressing Pronunciation Ambiguity with Graph Temporal Classification
A new research paper, Graph Connectionist Temporal Classification for Phoneme Recognition, proposes an innovative solution by adapting Graph Temporal Classification (GTC) for APR. GTC is an extension of the traditional CTC loss that can handle a set of acceptable ground truth sequences, rather than being limited to a single, definitive one. In the context of APR, this means the model can be trained using a ‘graph’ of alternative phoneme sequences, effectively treating multiple pronunciations for a word as equally valid forms of supervision.
The core idea is to build a flexible training framework that acknowledges the natural variations in pronunciation. Instead of forcing the model to pick one ‘correct’ pronunciation from the G2P output, GTC allows it to learn from a network of possibilities. This approach is inspired by similar techniques used in semi-supervised Automatic Speech Recognition (ASR), where models often deal with multiple noisy pseudo-label sequences.
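To make the objective concrete, the GTC loss can be viewed as maximizing the total probability mass the model assigns to *all* acceptable phoneme sequences, rather than to a single one. The sketch below is illustrative, not the paper's implementation: the actual GTC computes this with a single dynamic program over a shared lattice, whereas this version scores each alternative with a standard CTC forward pass and sums the probabilities, which is equivalent when the alternatives are small enough to enumerate.

```python
import math

def ctc_logprob(logp, target, blank=0):
    """Standard CTC forward pass: log P(target | frame log-probs).
    logp is a T x V table of per-frame log-probabilities."""
    ext = [blank]                       # interleave blanks: b, y1, b, y2, b, ...
    for t in target:
        ext += [t, blank]
    S, T, NEG = len(ext), len(logp), -1e30
    alpha = [NEG] * S
    alpha[0] = logp[0][ext[0]]          # start in initial blank ...
    alpha[1] = logp[0][ext[1]]          # ... or first label
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            cands = [alpha[s]]                      # stay
            if s > 0:
                cands.append(alpha[s - 1])          # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip a blank
            m = max(cands)
            if m > NEG / 2:
                new[s] = m + math.log(sum(math.exp(c - m) for c in cands)) \
                           + logp[t][ext[s]]
        alpha = new
    m = max(alpha[-1], alpha[-2])       # end in final label or final blank
    return m + math.log(math.exp(alpha[-1] - m) + math.exp(alpha[-2] - m))

def gtc_logprob(logp, alternatives, blank=0):
    """GTC-style objective over a *set* of target sequences:
    log of the summed probability of all acceptable sequences."""
    scores = [ctc_logprob(logp, y, blank) for y in alternatives]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))
```

Because probabilities are summed rather than maximized, the model is never forced to commit to one pronunciation: gradient flows to whichever alternatives the acoustics support.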
How It Works
The implementation of GTC involves modifying the ‘label’ component of the standard CTC framework. Typically, this label defines a single target sequence. With GTC, the label becomes a Weighted Finite State Acceptor (WFSA) that encodes a graph of all acceptable phoneme sequences. For each word in an utterance, the system takes all its possible pronunciations generated by the G2P, builds individual CTC graphs for them, and combines these into a parallel (union) structure. These word-level structures are then concatenated in utterance order to form a comprehensive graph for the entire utterance, allowing the model to consider any valid combination of per-word pronunciations.
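The union-then-concatenate construction can be sketched as follows. Note this is a simplification: the paper builds a compact WFSA whose paths share structure, while this sketch enumerates the accepted phoneme sequences explicitly, which is only tractable for short utterances. The lexicon entries are hypothetical examples, not taken from the paper.

```python
def word_graph(pronunciations):
    """Union of alternative pronunciations for one word,
    represented as a set of phoneme tuples (a tiny acceptor)."""
    return {tuple(p) for p in pronunciations}

def utterance_graph(words, lexicon):
    """Concatenate word-level graphs: every path through the result
    is one acceptable phoneme sequence for the whole utterance."""
    paths = {()}
    for w in words:
        paths = {a + b for a in paths for b in word_graph(lexicon[w])}
    return paths

# hypothetical G2P output with an ambiguous first word
lexicon = {
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
    "way":    [["W", "EY"]],
}
print(sorted(utterance_graph(["either", "way"], lexicon)))
# → [('AY', 'DH', 'ER', 'W', 'EY'), ('IY', 'DH', 'ER', 'W', 'EY')]
```

In a real WFSA implementation the number of states grows linearly with the lexicon, even though the number of accepted paths grows multiplicatively with the number of ambiguous words.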
Experimental Validation
The researchers tested their GTC-based APR system on both United States English and Belgian Dutch datasets. For English, they used the Common Voice dataset for training and the TIMIT dataset for testing. For Dutch, they utilized the Corpus Gesproken Nederlands (CGN). Different G2P resources were employed for each language to generate the multiple pronunciations: a threshold-based G2P model trained on CMUDict for English, and the rule-based Fonilex dictionary for Dutch.
The results were compelling. The study first established ‘oracle’ Label Error Rates (LERs), which showed that considering more pronunciations significantly reduced the potential error rate, confirming that the closest pronunciation to the ground truth isn’t always the G2P’s first choice. When training the APR models, incorporating multiple pronunciations via GTC consistently led to improvements in Phoneme Error Rate (PER) compared to a baseline model trained with standard CTC using only a single pronunciation per word.
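An oracle LER of this kind can be computed by scoring, for each utterance, the candidate pronunciation sequence closest to the reference transcription. The sketch below assumes per-utterance normalization by reference length; the paper's exact aggregation (e.g. corpus-level pooling) may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

def oracle_ler(reference, candidates):
    """Error rate of the candidate closest to the hand-annotated
    reference, normalised by reference length."""
    return min(edit_distance(reference, c) for c in candidates) / len(reference)

# hypothetical example: the 1-best candidate is wrong, the 2nd is exact
reference = ["AY", "DH", "ER"]
print(oracle_ler(reference, [["IY", "DH", "ER"]]))                    # 1-best only
print(oracle_ler(reference, [["IY", "DH", "ER"], ["AY", "DH", "ER"]]))  # top-2
```

Adding the second candidate drops the oracle error from 1/3 to 0, mirroring the paper's observation that larger pronunciation sets lower the achievable LER floor.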
For English, the model trained with a maximum of two pronunciations per word achieved the best performance, reducing the PER from 32.9% (1-best CTC) to 28.8%. In Dutch, the model trained with up to three pronunciations showed the most significant improvement, lowering the PER from 23.9% to 23.0%.
Conclusion and Future Directions
This research demonstrates that leveraging the inherent pronunciation variability provided by G2P systems through Graph Temporal Classification is a promising strategy for training more robust APR systems. By allowing the model to account for multiple valid pronunciations, it can better handle the noisy supervision derived from G2P outputs.
Future work could explore capturing coarticulation effects (how sounds influence each other across word boundaries), incorporating pronunciation variability directly into the decoding graph with more complex arcs (insertions, substitutions, deletions), and extending this method to non-Germanic languages to assess its universality.