TLDR: A new research paper addresses the poor performance of current speech recognition systems on Jamaican Patois music. The authors curated over 40 hours of manually transcribed Patois music, the largest dataset of its kind, and used it to fine-tune state-of-the-art Whisper ASR models. Their findings show that fine-tuned models, even smaller ones, significantly outperform larger pre-trained models on Patois. They also developed scaling laws to predict ASR performance based on model and dataset size, providing a valuable tool for future development in low-resource language AI.
Jamaican Patois, also known as Jamaican Creole, is a vibrant and widely spoken language across Jamaica and the global Caribbean diaspora. Despite its prevalence, especially in music and daily communication, it remains significantly underrepresented in modern language technologies. A major challenge lies in Automatic Speech Recognition (ASR) systems, which perform poorly when applied to Patois speech. This is primarily because Patois is a low-resource language, meaning there’s limited transcribed speech data available for training these systems.
This limitation is particularly evident in music transcription. Jamaican music, often performed entirely in Patois, is a central vehicle for cultural expression. However, existing ASR systems, like those providing automatic captions on platforms such as YouTube, frequently produce inaccurate or even nonsensical transcriptions of Patois audio. This not only hinders accessibility for non-Patois speakers, including the deaf and hard-of-hearing community and international audiences, but also poses a significant barrier to the development of downstream Natural Language Processing (NLP) technologies. Large Language Models (LLMs), which have transformed AI, depend heavily on large, high-quality, and diverse datasets. For predominantly spoken languages like Patois, creating such datasets is nearly impossible without robust speech-to-text systems. A high-quality Patois ASR model could therefore be crucial for building foundational LLMs for the language.
A Data-Centric Approach to Patois ASR
A recent research paper, “Towards Robust Speech Recognition for Jamaican Patois Music Transcription”, tackles this problem head-on with a data-centric approach. The researchers, Jordan Madden, Matthew Stone, Dimitri Johnson, and Daniel Geddez from the Jamaica Artificial Intelligence Association, focused on curating a substantial dataset of manually transcribed Patois music.
Their contributions are threefold: first, they introduce a supervised dataset comprising approximately 42 hours of transcribed Patois music, which they believe is the largest of its kind. Second, they fine-tuned a series of state-of-the-art Whisper models using this dataset to evaluate ASR performance. Third, they developed a scaling equation that models ASR performance as a function of dataset size and model capacity. This work aims to significantly improve the accuracy and reliability of Patois transcription, contributing to the broader ecosystem of tools that make underrepresented languages more accessible and better supported in AI systems.
The Dataset and Model Fine-tuning
The curated dataset consists of 5,110 recordings of Jamaican music, each accompanied by a corresponding Jamaican Patois transcription. Each data point includes a URL to a 30-second MP3 audio clip, a manually annotated transcription, and the official lyrics of the full song. The audio is sampled at 22,050 Hz and totals 42.58 hours. A data processing script was developed to convert this raw data into a format suitable for popular deep learning frameworks like PyTorch and Hugging Face Transformers.
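The reported total follows directly from the clip count and clip length. A quick check, assuming every clip is exactly 30 seconds long:

```python
# Sanity check of the dataset size reported in the article.
NUM_CLIPS = 5_110     # recordings in the dataset (from the article)
CLIP_SECONDS = 30     # length of each clip (from the article; assumed exact)

total_seconds = NUM_CLIPS * CLIP_SECONDS
total_hours = total_seconds / 3600

print(f"{total_seconds} s = {total_hours:.2f} h")  # 153300 s = 42.58 h
```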
For model fine-tuning, the researchers used OpenAI’s Whisper models, accessed via the Hugging Face Transformers library. Whisper models are pre-trained on 680,000 hours of multilingual audio, enabling them to learn robust audio representations. The team fine-tuned the ‘tiny’, ‘base’, ‘small’, and ‘medium’ variants of Whisper on their dataset, hypothesizing that Whisper’s pre-trained features would facilitate effective adaptation to Jamaican Patois. They trained each model on subsets of the dataset (20, 35, and 40 hours of audio) for 4,000 steps, using the AdamW optimizer and a linear learning rate scheduler. All audio was resampled to 16,000 Hz to match Whisper’s input format and transformed into log-Mel spectrograms.
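To make the preprocessing step concrete, the sketch below works out what resampling does to a single clip and what size of spectrogram results. It is not the authors' script. The sample rates and clip length come from the article; the hop length of 160 samples and the 80 mel bins are assumptions based on Whisper's published defaults for the tiny through medium models.

```python
from fractions import Fraction

SOURCE_RATE = 22_050   # Hz, dataset sample rate (from the article)
TARGET_RATE = 16_000   # Hz, Whisper input rate (from the article)
CLIP_SECONDS = 30      # clip length (from the article)
HOP_LENGTH = 160       # samples between spectrogram frames (assumed Whisper default)
N_MELS = 80            # mel frequency bins (assumed default for tiny..medium)

# Resampling keeps the duration fixed and changes the number of samples.
ratio = Fraction(TARGET_RATE, SOURCE_RATE)      # reduces to 320/441
samples_in = SOURCE_RATE * CLIP_SECONDS         # samples per clip before resampling
samples_out = int(samples_in * ratio)           # samples per clip after resampling

# One spectrogram frame is produced every HOP_LENGTH samples.
frames = samples_out // HOP_LENGTH
spectrogram_shape = (N_MELS, frames)

print(f"resampling ratio: {ratio}")
print(f"samples per clip: {samples_in} -> {samples_out}")
print(f"log-Mel spectrogram shape: {spectrogram_shape}")
```

Under these assumptions, each 30-second clip becomes one fixed-size input for the model, which is convenient because Whisper operates on 30-second windows.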
Performance and Scaling Laws
The quality of the models was evaluated using the Word Error Rate (WER) metric: the number of word substitutions, deletions, and insertions needed to turn the generated transcript into the ground truth, divided by the number of words in the ground truth. Lower is better. The results showed a clear improvement in WER as training progressed for all model sizes. As expected, larger models consistently performed better, with the Medium model achieving the lowest WER. This aligns with findings in other areas of AI, where larger models often exhibit greater sample efficiency.
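For readers unfamiliar with the metric, here is a minimal sketch of WER as a word-level edit distance divided by the reference length. It is a generic illustration, not the authors' evaluation code. It splits on whitespace and does no text normalization, and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only: one substitution ("mi" -> "me") and one deletion ("a").
print(word_error_rate("mi a go a town", "me a go town"))  # 2 errors / 5 words = 0.4
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.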
Crucially, the study compared the fine-tuned Whisper models to the pre-trained Whisper Large model. While Whisper Large performs exceptionally well on standard English, it showed a high WER of 0.89 on Jamaican Patois. In contrast, even the fine-tuned Whisper Tiny model, which is approximately 50 times smaller than Whisper Large, significantly outperformed it on Jamaican Patois. This shows that even for a language closely related to English, the general priors learned by large pre-trained models are insufficient for effective transcription, underscoring the importance of domain-specific fine-tuning.
Building on these observations, the researchers developed scaling laws to predict model performance from model size (M) and dataset size (D). They hypothesized that WER follows a power law, $\mathrm{WER} = A \cdot M^{-\alpha} \cdot D^{-\beta}$, where A, α, and β are constants estimated from data. By fitting this equation to their experimental results, they derived a specific scaling law for Whisper models fine-tuned on Jamaican Patois music. The fitted law accurately predicted WER values, making it a practical tool for guiding future decisions on model selection and dataset size given computational resources and desired performance.
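One common way to fit a power law of this form is to take logarithms, which turns it into a linear model, log WER = log A − α log M − β log D, and then solve by least squares. The sketch below does that in plain Python. It is not the authors' fitting procedure, and the article does not report their coefficients. The WER values here are synthetic, generated from hypothetical coefficients so the fit can be checked, and measuring model size in millions of parameters is an assumption.

```python
import math

# Whisper tiny/base/small/medium sizes in millions of parameters (assumed unit),
# and the training subset sizes in hours mentioned in the article.
MODEL_SIZES = [39, 74, 244, 769]
DATASET_HOURS = [20, 35, 40]

# HYPOTHETICAL coefficients, used only to generate synthetic data for this demo.
TRUE_A, TRUE_ALPHA, TRUE_BETA = 5.0, 0.2, 0.3


def predict_wer(a, alpha, beta, m, d):
    """Evaluate WER = A * M^(-alpha) * D^(-beta)."""
    return a * m ** -alpha * d ** -beta


def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_scaling_law(points):
    """Least-squares fit in log space; points is a list of (M, D, WER) tuples."""
    rows = [(1.0, -math.log(m), -math.log(d)) for m, d, _ in points]
    targets = [math.log(wer) for _, _, wer in points]
    # Normal equations: (X^T X) theta = X^T y, with theta = (log A, alpha, beta).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    log_a, alpha, beta = solve_3x3(xtx, xty)
    return math.exp(log_a), alpha, beta


# Synthetic, noise-free observations on the full model-size x dataset-size grid.
points = [
    (m, d, predict_wer(TRUE_A, TRUE_ALPHA, TRUE_BETA, m, d))
    for m in MODEL_SIZES
    for d in DATASET_HOURS
]
a, alpha, beta = fit_scaling_law(points)
print(f"A = {a:.3f}, alpha = {alpha:.3f}, beta = {beta:.3f}")
```

With real measurements the fit would not be exact, and the residuals would indicate how well the power-law form describes the data. Once fitted, `predict_wer` can be used to estimate, for example, how much additional data a smaller model would need to match a larger one.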
Looking Ahead
This research represents a significant step towards building robust ASR systems for Jamaican Patois. By curating a unique, large-scale dataset and demonstrating the effectiveness of fine-tuning state-of-the-art models, the authors have laid crucial groundwork. The derived scaling laws offer valuable insights for optimizing resource allocation in low-resource language settings. This work promises to enhance the accessibility of Jamaican Patois audio content and establish foundational support for future Jamaican Patois language modeling in AI systems.


