TLDR: A new state-of-the-art Romanian Automatic Speech Recognition (ASR) system has been developed using NVIDIA’s FastConformer architecture and a hybrid CTC-TDT decoder. Trained on over 2,600 hours of speech, mostly weakly supervised, the system achieves up to 27% relative Word Error Rate (WER) reduction across various Romanian benchmarks, including read, spontaneous, and domain-specific speech. It also offers practical decoding efficiency, making it suitable for research and low-latency applications, with the model and resources being open-sourced.
A significant advancement in Automatic Speech Recognition (ASR) for the Romanian language has been unveiled, introducing a new state-of-the-art system. Detailed in a recent research paper, it is the first to apply NVIDIA’s FastConformer architecture to Romanian speech processing. The system was trained on an extensive dataset comprising over 2,600 hours of speech, primarily using weakly supervised transcriptions.
Addressing the Challenge of Low-Resource Languages
Despite rapid progress in ASR driven by end-to-end architectures and large datasets, Romanian has remained a low-resource language. Previous efforts often relied on older architectures or training methods that did not incorporate the latest advancements. The scarcity of manually annotated Romanian data has been a major hurdle, with even the largest publicly available datasets being modest in size. This new research tackles the challenge head-on by employing large-scale weak supervision, a powerful strategy for languages with limited resources.
The FastConformer Architecture at its Core
The heart of this new Romanian ASR system is the FastConformer encoder, an optimized version of the Conformer architecture. Conformer models are known for effectively capturing both local and global dependencies in speech by combining convolutional modules with self-attention. FastConformer further enhances this by introducing architectural optimizations for computational efficiency, such as an eightfold downsampling step early in the encoder and reduced convolutional kernel sizes. These changes significantly decrease computational cost and parameter count while maintaining high accuracy, making it suitable for processing long audio sequences efficiently.
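The effect of that eightfold downsampling can be sketched with a little arithmetic. Assuming the usual 10 ms feature frames and the quadratic cost of self-attention, the sketch below (illustrative numbers, not taken from the paper) shows how much shorter the sequence the attention layers see becomes, and how the cost compares with the standard Conformer's 4x downsampling:

```python
# Sketch: how FastConformer's 8x downsampling shrinks the sequence that the
# self-attention layers process. Frame length (10 ms) and the 4x baseline are
# common defaults, assumed here for illustration.

def encoder_frames(audio_seconds: float, frame_ms: float = 10.0, downsampling: int = 8) -> int:
    """Number of frames the attention layers see after downsampling."""
    raw_frames = int(audio_seconds * 1000 / frame_ms)
    return raw_frames // downsampling

def attention_cost_ratio(audio_seconds: float, base_ds: int = 4, fast_ds: int = 8) -> float:
    """Relative self-attention cost (quadratic in length) of 8x vs 4x downsampling."""
    raw_frames = int(audio_seconds * 1000 / 10.0)
    return (raw_frames // fast_ds) ** 2 / (raw_frames // base_ds) ** 2

print(encoder_frames(30.0))                   # 30 s of audio -> 375 frames at 8x
print(round(attention_cost_ratio(30.0), 2))   # 0.25: quadratic attention cost quartered
```

Doubling the downsampling factor halves the sequence length, so the quadratic attention term drops by roughly 4x, which is why long audio becomes tractable.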
A Hybrid Decoder for Enhanced Accuracy and Flexibility
The system employs a sophisticated hybrid decoder architecture that integrates both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) decoders, sharing a common encoder. This hybrid design offers remarkable flexibility during inference, allowing the selection of the most suitable decoding strategy for different applications. It also brings practical benefits, including faster convergence for the CTC decoder and improved overall recognition accuracy for both decoding branches due to joint optimization.
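Joint optimization of the two branches is typically realized as a weighted sum of the two decoder losses computed on the shared encoder's output. A minimal sketch, with an illustrative 0.3 CTC weight (the paper's actual weight is not given here):

```python
# Minimal sketch of a hybrid CTC/transducer training objective: both decoders
# sit on one shared encoder and their losses are mixed with a fixed weight.
# The 0.3 default is an assumed, illustrative value.

def hybrid_loss(transducer_loss: float, ctc_loss: float, ctc_weight: float = 0.3) -> float:
    """Weighted combination of the two decoder losses over a shared encoder."""
    return (1.0 - ctc_weight) * transducer_loss + ctc_weight * ctc_loss

print(hybrid_loss(2.0, 4.0))  # 0.7 * 2.0 + 0.3 * 4.0 = 2.6
```

Because the CTC gradient also flows into the shared encoder, the CTC branch converges faster and both branches benefit, which is the joint-optimization effect the authors describe.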
The researchers explored various decoding strategies to optimize performance and efficiency. These included CTC greedy, TDT greedy, TDT with Alignment-Length Synchronous Decoding (ALSD), and CTC beam search with an external 6-gram token-level language model. Each strategy presents a different trade-off between transcription accuracy and computational speed, allowing for tailored deployment based on specific needs.
Achieving State-of-the-Art Performance
This new system has achieved state-of-the-art performance across all Romanian evaluation benchmarks, covering read, spontaneous, and domain-specific speech. Compared to previous best-performing systems, it demonstrates a relative Word Error Rate (WER) reduction of up to 27%. On read speech datasets, a 9% relative WER reduction was observed, and on oratory speech, a 27% improvement; on spontaneous speech datasets, consistent gains of 14% and 6% were achieved. Beyond accuracy, the approach also shows practical decoding efficiency, making it suitable for low-latency ASR applications.
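For readers unfamiliar with the metric, WER is a word-level edit distance normalized by reference length, and "relative reduction" compares two WER values. A minimal sketch with illustrative numbers (the 0.110 and 0.080 below are made up to reproduce a 27% relative gain, not figures from the paper):

```python
# Sketch of how WER and the relative reductions quoted above are computed.
# The example WER values are illustrative, not the paper's results.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))          # DP row of the edit-distance table
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

def relative_wer_reduction(baseline: float, new: float) -> float:
    """Fractional improvement of the new system over the baseline."""
    return (baseline - new) / baseline

print(round(wer("a b c", "a x c"), 2))                     # 0.33: one substitution in three words
print(round(relative_wer_reduction(0.110, 0.080), 2))      # 0.27: a 27% relative reduction
```

Note that a "27% relative reduction" is a ratio between two WERs, not a 27-point absolute drop; the distinction matters when comparing results across papers.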
The researchers emphasize their commitment to open science by publicly releasing their trained model, along with comprehensive training and inference recipes, and standardized evaluation datasets. This contribution aims to accelerate further progress in Romanian speech processing and foster more inclusive, language-diverse speech technologies. For more in-depth information, you can refer to the full research paper available at arXiv:2511.03361.


