Advancing Speech Recognition for Jamaican Patois Music Through Data and Model Fine-tuning

TLDR: A new research paper addresses the poor performance of current speech recognition systems on Jamaican Patois music. The authors curated roughly 42 hours of manually transcribed Patois music, which they believe to be the largest dataset of its kind, and used it to fine-tune state-of-the-art Whisper ASR models. Their findings show that fine-tuned models, even the smallest ones, substantially outperform a much larger pre-trained model on Patois. They also derived a scaling law that predicts ASR performance from model and dataset size, providing a practical tool for future work on low-resource language AI.

Jamaican Patois, also known as Jamaican Creole, is a vibrant and widely spoken language across Jamaica and the global Caribbean diaspora. Despite its prevalence, especially in music and daily communication, it remains significantly underrepresented in modern language technologies. A major challenge lies in Automatic Speech Recognition (ASR) systems, which perform poorly when applied to Patois speech. This is primarily because Patois is a low-resource language, meaning there’s limited transcribed speech data available for training these systems.

This limitation is particularly evident in music transcription. Jamaican music, often performed entirely in Patois, is a central vehicle for cultural expression. However, existing ASR systems, like those providing automatic captions on platforms such as YouTube, frequently produce inaccurate or even nonsensical transcriptions of Patois audio. This hinders accessibility for the deaf and hard-of-hearing community and for international audiences who do not speak Patois, and it also poses a significant barrier to the development of downstream Natural Language Processing (NLP) technologies. Large Language Models (LLMs), which have transformed AI, depend heavily on large, high-quality, and diverse datasets. For predominantly spoken languages like Patois, creating such datasets is nearly impossible without robust speech-to-text systems. A high-quality Patois ASR model could therefore be crucial for building foundational LLMs for the language.

A Data-Centric Approach to Patois ASR

A recent research paper, “Towards Robust Speech Recognition for Jamaican Patois Music Transcription”, tackles this problem head-on with a data-centric approach. The researchers, Jordan Madden, Matthew Stone, Dimitri Johnson, and Daniel Geddez from the Jamaica Artificial Intelligence Association, focused on curating a substantial dataset of manually transcribed Patois music.

Their contributions are threefold: first, they introduce a supervised dataset comprising approximately 42 hours of transcribed Patois music, which they believe is the largest of its kind. Second, they fine-tuned a series of state-of-the-art Whisper models using this dataset to evaluate ASR performance. Third, they developed a scaling equation that models ASR performance as a function of dataset size and model capacity. This work aims to significantly improve the accuracy and reliability of Patois transcription, contributing to the broader ecosystem of tools that make underrepresented languages more accessible and better supported in AI systems.

The Dataset and Model Fine-tuning

The curated dataset consists of 5,110 recordings of Jamaican music, each accompanied by a corresponding Jamaican Patois transcription. Each data point includes a URL to a 30-second MP3 audio clip, a manually annotated transcription, and the official lyrics of the full song. The audio is sampled at 22,050 Hz, and the clips total 42.58 hours. A data processing script was developed to convert this raw data into a format suitable for popular deep learning frameworks like PyTorch and Hugging Face Transformers.
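
The record layout and the reported total duration can be sketched in a few lines of Python. This is only an illustration: the class and field names below are hypothetical and are not the dataset's actual schema, and the URL is a placeholder. The arithmetic assumes every clip is exactly 30 seconds long.

```python
from dataclasses import dataclass

CLIP_SECONDS = 30  # clip length stated in the article


@dataclass
class PatoisClip:
    """One dataset entry. Field names are illustrative, not the real schema."""
    audio_url: str       # URL of the 30-second MP3 clip
    transcription: str   # manually annotated Patois transcription of the clip
    full_lyrics: str     # official lyrics of the whole song


def total_hours(num_clips: int, clip_seconds: int = CLIP_SECONDS) -> float:
    """Total audio duration in hours, assuming every clip has the same length."""
    return num_clips * clip_seconds / 3600


example = PatoisClip(
    audio_url="https://example.com/clip-0001.mp3",  # placeholder URL
    transcription="placeholder transcription",
    full_lyrics="placeholder lyrics",
)
print(round(total_hours(5110), 2))  # 42.58
```

Under that assumption, 5,110 clips of 30 seconds each come to 153,300 seconds, which matches the 42.58 hours reported.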

For model fine-tuning, the researchers used OpenAI’s Whisper models, accessed via the Hugging Face Transformers library. Whisper models are pre-trained on 680,000 hours of multilingual audio, enabling them to learn robust audio representations. The team fine-tuned the ‘tiny’, ‘base’, ‘small’, and ‘medium’ variants of Whisper on their dataset, hypothesizing that Whisper’s pre-trained features would facilitate effective adaptation to Jamaican Patois. They trained each model on subsets of different sizes (20, 35, and 40 hours of audio) for 4,000 steps, using the AdamW optimizer and a linear learning rate scheduler. All audio was resampled to 16,000 Hz to match Whisper’s input format and transformed into log-Mel spectrograms.
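
To make the linear learning rate schedule concrete, here is a minimal sketch in plain Python. Only the 4,000-step horizon comes from the article. The peak learning rate and the warmup phase are assumptions chosen for illustration, since the article does not report either, and the function name is hypothetical.

```python
TOTAL_STEPS = 4000   # number of training steps stated in the article
PEAK_LR = 1e-5       # assumed value, not reported in the article
WARMUP_STEPS = 500   # assumed value, the article does not mention warmup


def linear_lr(step: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS,
              total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Print the schedule at a few checkpoints.
for step in (0, 250, 500, 2250, 4000):
    print(step, linear_lr(step))
```

In a real fine-tuning run, the optimizer and scheduler would normally come from the training framework rather than being hand-written. The sketch only shows the shape of the schedule.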

Performance and Scaling Laws

The quality of the models was evaluated using the Word Error Rate (WER) metric, which counts the word-level substitutions, deletions, and insertions needed to turn a generated transcript into the ground truth, divided by the number of words in the ground truth; lower is better. The results showed a clear improvement in WER as training progressed for all model sizes. As expected, larger models consistently performed better, with the Medium model achieving the lowest WER. This aligns with findings in other areas of AI, where larger models often exhibit greater sample efficiency.
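
For readers unfamiliar with WER, the following is a minimal sketch of the standard word-level edit-distance computation. It is not the authors' evaluation code, it does no text normalization such as lowercasing or punctuation stripping, and the example sentences are made-up placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fox" -> "box") and one insertion ("jumps") over 4 reference words.
print(word_error_rate("the quick brown fox", "the quick brown box jumps"))  # 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.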

Crucially, the study compared the fine-tuned Whisper models to the pre-trained Whisper Large model. While Whisper Large performs exceptionally well on standard English, it reached a WER of 0.89 on Jamaican Patois, meaning errors amounted to roughly nine in every ten reference words. In contrast, even the fine-tuned Whisper Tiny model, which is approximately 50 times smaller than Whisper Large, significantly outperformed it on Jamaican Patois. This suggests that even for a language closely related to English, the general priors learned by large pre-trained models are insufficient for effective transcription, underscoring the importance of domain-specific fine-tuning.

Building on these observations, the researchers developed a scaling law to predict model performance from model size (M) and dataset size (D). They hypothesized that WER scales as a power-law function: WER = A · M^(-α) · D^(-β), where A, α, and β are constants fitted to the data. By fitting this equation to their experimental results, they derived a specific scaling law for Whisper models fine-tuned on Jamaican Patois music. According to the authors, the fitted law predicted WER values accurately, making it a practical tool for choosing a model and dataset size given the available compute and the desired performance.
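
A power law of this form becomes linear after taking logarithms: log WER = log A − α · log M − β · log D. One common way to estimate A, α, and β is therefore ordinary least squares in log space. The sketch below illustrates that approach in plain Python. The article does not say how the authors fitted their law, so this is an assumed method. The data points are synthetic, generated from made-up constants, and are not the paper's results. The model sizes are approximate published Whisper parameter counts, which the article itself does not state.

```python
import math


def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_power_law(points):
    """Fit WER = A * M**(-alpha) * D**(-beta) by least squares in log space.

    `points` is a list of (model_size, dataset_hours, wer) tuples.
    """
    rows = [[1.0, math.log(m), math.log(d)] for m, d, _ in points]
    targets = [math.log(w) for _, _, w in points]
    # Normal equations: (X^T X) theta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    log_a, neg_alpha, neg_beta = solve_linear_system(xtx, xty)
    return math.exp(log_a), -neg_alpha, -neg_beta


def predict_wer(a, alpha, beta, model_size, dataset_hours):
    """Evaluate the power law for a given model size and dataset size."""
    return a * model_size ** (-alpha) * dataset_hours ** (-beta)


# Synthetic points generated from made-up constants, NOT the paper's results.
TRUE_A, TRUE_ALPHA, TRUE_BETA = 20.0, 0.15, 0.30
MODEL_SIZES = [39e6, 74e6, 244e6, 769e6]  # approx. tiny/base/small/medium parameters
DATASET_HOURS = [20, 35, 40]              # training subset sizes from the article
synthetic = [(m, d, predict_wer(TRUE_A, TRUE_ALPHA, TRUE_BETA, m, d))
             for m in MODEL_SIZES for d in DATASET_HOURS]

a, alpha, beta = fit_power_law(synthetic)
print(f"A={a:.3f}, alpha={alpha:.3f}, beta={beta:.3f}")
```

On noise-free synthetic data the fit recovers the generating constants. With real measurements the fitted values would only approximate the trend, and predictions far outside the measured range of M and D should be treated with caution.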

Looking Ahead

This research represents a significant step towards building robust ASR systems for Jamaican Patois. By curating a unique, large-scale dataset and demonstrating the effectiveness of fine-tuning state-of-the-art models, the authors have laid crucial groundwork. The derived scaling laws offer valuable insights for optimizing resource allocation in low-resource language settings. This work promises to enhance the accessibility of Jamaican Patois audio content and establish foundational support for future Jamaican Patois language modeling in AI systems.
