Unlocking Digital Services for Wolof Speakers: Introducing the WolBanking77 Dataset

TLDR: The WolBanking77 dataset addresses the lack of digital resources for low-resource languages like Wolof, which is spoken by over 10 million people in West Africa. It provides 9,791 text sentences and over 4 hours of spoken sentences in the banking domain, supporting intent classification for voice assistants. The goal is to improve financial inclusion and access to digital services in Senegal, where the illiteracy rate is 42%, and to reduce fraud risks. Experiments with state-of-the-art NLP and ASR models show promising results, highlighting the dataset’s value for research in low-resource language AI.

In an increasingly digital world, access to essential services like banking often relies on written language. For communities where literacy rates are lower and oral traditions are strong, this creates a significant barrier. A new research paper introduces the WolBanking77 dataset, designed to support voice-activated digital banking services for Wolof speakers in West Africa.

Wolof is a language spoken by over 10 million people across Senegal, Gambia, and Mauritania, with approximately 90% of Senegal’s population speaking it. Despite its widespread use, digital resources for Wolof are scarce. This scarcity, coupled with a 42% illiteracy rate in Senegal, highlights a critical need for voice-based interfaces to ensure financial inclusion and access to public services, especially for those in the informal sector who are often vulnerable to fraud due to language barriers.

The WolBanking77 dataset is a significant step towards addressing this challenge. It is specifically created for academic research in intent classification, a core component of natural language understanding (NLU) that allows systems to determine a user’s goal from their spoken or typed request. The dataset comprises two main parts: a text dataset and an audio dataset.

The text dataset contains 9,791 sentences in the banking domain, manually translated from the English Banking77 dataset into French and Wolof by linguistic experts. These translations were carefully localized to reflect the Senegalese context, ensuring relevance and naturalness. For instance, common terms like “ATM” and “app” were replaced with the locally used forms “GAB” and “aplikaasiyoN.”

The audio dataset, based on the MINDS-14 dataset, includes over 4 hours of spoken sentences. It features 263 utterances covering 10 intents across banking and transport domains. These audio recordings were collected from students at Cheikh Anta Diop University in Dakar, using the Lig-Aikuma software. Participants had diverse accents and ages, contributing to a robust and representative dataset. Ethical considerations were paramount during collection, with participant names anonymized and informed consent obtained.

The researchers conducted extensive experiments using WolBanking77 to evaluate various state-of-the-art models for both Automatic Speech Recognition (ASR) and Intent Detection. For intent detection, models like AfroXLMR, which was pre-trained on African languages, showed promising performance, achieving F1-scores up to 79% after fine-tuning. This demonstrates the dataset’s ability to challenge and improve existing models for low-resource languages.
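To make the F1 figure concrete, here is a minimal sketch of how a per-intent F1 score and its macro average can be computed from predicted and true intent labels. The article does not say which averaging the researchers used, so macro averaging is an assumption here, and the label names are hypothetical placeholders rather than labels taken from the dataset.

```python
from collections import Counter


def f1_per_intent(y_true, y_pred):
    """Return a dict mapping each intent label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted this intent, but it was wrong
            fn[true_label] += 1  # missed the correct intent
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-intent F1 scores (assumed averaging)."""
    scores = f1_per_intent(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["balance", "balance", "transfer", "card_lost"]
y_pred = ["balance", "transfer", "transfer", "card_lost"]
print(f1_per_intent(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Macro averaging weights every intent equally, which matters when some intents have far fewer examples than others; a weighted or micro average would give a different number on the same predictions.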

In the ASR task, which converts spoken language into text, the Canary-1b-flash model achieved a Word Error Rate (WER) of 0.59%, outperforming other leading models like Phi-4-multimodal-instruct and Distil-whisper-large-v3.5. These results indicate that high-quality speech recognition is achievable for Wolof, even with a relatively modest amount of speech data (just over 4 hours).
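For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of words in the reference. The sketch below is a plain illustration of that standard definition, not the evaluation code used in the paper; the example sentences are invented English placeholders, and real evaluations typically normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("my") and one inserted word ("please") over 4 reference words.
wer = word_error_rate("check my account balance", "check account balance please")
print(f"{wer:.2%}")
```

The function returns a fraction, so it must be multiplied by 100 to compare with percentages like the one quoted above. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.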

The creation of WolBanking77 is a crucial contribution to the field of natural language processing for low-resource languages. It provides a valuable resource for researchers to develop and benchmark AI models that can understand and process Wolof speech and text. This, in turn, paves the way for practical applications like voice assistants that can help millions access digital financial services, manage transactions, and reduce the risk of fraud.

Looking ahead, the team plans to continuously maintain and update the dataset, add more audio recordings in diverse environments, and release open-source code to further stimulate research. They also intend to share text data for potential responses to each intent, facilitating the development of complete conversational AI systems. The WolBanking77 dataset and its associated code are freely available under a CC BY 4.0 license, encouraging widespread use and collaboration within the academic community. For more details, you can refer to the original research paper: WolBanking77: Wolof Banking Speech Intent Classification Dataset.
