TLDR: A new research paper introduces a Shona-English slang dataset from social media and a hybrid AI chatbot. This chatbot uses a fine-tuned DistilBERT model for intent recognition (96.4% accuracy) and combines rule-based responses with retrieval-augmented generation (RAG) to assist users, demonstrated by helping prospective students at Pace University. The work aims to improve conversational AI for underrepresented African languages, promoting digital inclusion.
In an increasingly interconnected world, artificial intelligence (AI) systems are transforming how we interact with technology, from virtual assistants to recommendation engines. However, a significant challenge persists: many African languages remain severely underrepresented in the field of Natural Language Processing (NLP). This exclusion risks widening the digital divide, limiting access to crucial AI-driven services in areas like education and healthcare.
A recent research paper by Happymore Masoka addresses this critical gap for Shona, a widely spoken Bantu language in Zimbabwe and Zambia. While millions speak Shona, existing digital resources primarily consist of formal texts, failing to capture the dynamic, informal communication prevalent among younger speakers, which often includes slang and code-mixing with English. Standard NLP models, trained on formal data, struggle with these everyday linguistic patterns, hindering the development of culturally relevant conversational AI.
To tackle this, the study introduces a Shona–English slang dataset curated from anonymized social media conversations, reflecting how Shona speakers actually communicate informally. It is annotated for several linguistic features: intent (e.g., greeting, request, finance), sentiment (positive, negative, neutral), dialogue acts (question, statement, command), code-mixing features (word-level or phrase-level language switches), and tone (friendly, formal, humorous). The dataset is publicly available for further research and development.
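To make the annotation scheme concrete, here is a minimal Python sketch of what a single annotated record might look like. The class name, field names, the "none" code-mixing label, and the sample utterance are all assumptions made for illustration; they are not taken from the released dataset, and the intent set below contains only the examples named above.

```python
from dataclasses import dataclass

# Label sets based on the categories named in the article. The intent set
# holds only the examples mentioned there; the real list is likely longer.
INTENTS = {"greeting", "request", "finance"}
SENTIMENTS = {"positive", "negative", "neutral"}
DIALOGUE_ACTS = {"question", "statement", "command"}
CODE_MIXING = {"word", "phrase", "none"}  # "none" is an assumption
TONES = {"friendly", "formal", "humorous"}


@dataclass(frozen=True)
class AnnotatedUtterance:
    """Hypothetical record layout: one utterance plus its five labels."""
    text: str
    intent: str
    sentiment: str
    dialogue_act: str
    code_mixing: str
    tone: str

    def __post_init__(self):
        # Reject any label that is not in the corresponding label set.
        allowed = {
            "intent": INTENTS,
            "sentiment": SENTIMENTS,
            "dialogue_act": DIALOGUE_ACTS,
            "code_mixing": CODE_MIXING,
            "tone": TONES,
        }
        for field_name, labels in allowed.items():
            value = getattr(self, field_name)
            if value not in labels:
                raise ValueError(f"{field_name}={value!r} is not a known label")


# Invented example for illustration only; not a row from the dataset.
example = AnnotatedUtterance(
    text="Hesi, how are you?",
    intent="greeting",
    sentiment="positive",
    dialogue_act="question",
    code_mixing="phrase",
    tone="friendly",
)
```

Validating labels at construction time is one simple way to catch annotation typos early; the actual dataset may use a different file format and label inventory.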
Building on this dataset, the author fine-tuned a multilingual DistilBERT model for intent classification. The model achieved 96.4% accuracy and a 96.3% F1-score in recognizing user intent from informal Shona inputs, a significant step toward AI systems that understand the nuances of everyday Shona conversations.
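For readers unfamiliar with the two reported metrics, the following sketch shows how accuracy and an F1-score can be computed from a classifier's predictions. The article does not say how the F1-score was averaged across intents, so macro averaging is an assumption here, and the toy labels are invented for illustration; they are not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["greeting", "greeting", "request", "finance"]
y_pred = ["greeting", "request", "request", "finance"]

print(accuracy(y_true, y_pred))  # 3 of 4 predictions are correct
print(macro_f1(y_true, y_pred))
```

Accuracy and F1 being close to each other, as in the reported 96.4% and 96.3%, generally suggests that performance is fairly even across intent classes rather than driven by one dominant class.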
The core innovation of this work is a hybrid chatbot architecture built around the intent classifier. The chatbot combines rule-based responses for predictable and culturally nuanced interactions (like greetings) with Retrieval-Augmented Generation (RAG) for more complex, domain-specific queries. For instance, when a user asks about university programs, the RAG module retrieves relevant information from a knowledge base and generates a coherent response. This combination lets the chatbot give precise, culturally appropriate answers where rules apply and flexible, informative replies elsewhere.
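The routing idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: a keyword lookup stands in for the fine-tuned DistilBERT classifier, word overlap stands in for the retriever, a string template stands in for the generator, and the intent name `program_inquiry`, the canned greeting, and the knowledge-base passages are all hypothetical placeholders.

```python
import re

# Rule-based replies for predictable intents (hypothetical wording).
RULE_RESPONSES = {"greeting": "Hesi! How can I help you today?"}

# Placeholder knowledge base; a real system would index actual program documents.
KNOWLEDGE_BASE = [
    "Placeholder passage: overview of graduate programs.",
    "Placeholder passage: steps in the application process.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_intent(message):
    """Stand-in for the fine-tuned intent classifier: simple keyword lookup."""
    greeting_words = {"hesi", "hi", "hello"}
    if tokenize(message) & greeting_words:
        return "greeting"
    return "program_inquiry"  # hypothetical catch-all intent


def retrieve(query, passages):
    """Stand-in retriever: pick the passage sharing the most words with the query."""
    query_words = tokenize(query)
    return max(passages, key=lambda p: len(query_words & tokenize(p)))


def generate(query, passage):
    """Stand-in for the generation step: wrap the retrieved passage in a template."""
    return f"Here is what I found: {passage}"


def respond(message):
    """Route to a rule-based reply when one exists, otherwise fall back to RAG."""
    intent = classify_intent(message)
    if intent in RULE_RESPONSES:
        return RULE_RESPONSES[intent]
    return generate(message, retrieve(message, KNOWLEDGE_BASE))


print(respond("Hesi"))
print(respond("How does the application process work?"))
```

The design point is that the intent label decides the path: greetings never reach the retriever, so their wording stays fixed and culturally controlled, while open-ended questions get an answer grounded in retrieved text.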
The hybrid system was demonstrated through a practical use case: assisting prospective students with graduate program information at Pace University. In qualitative evaluation, the hybrid chatbot outperformed a RAG-only baseline, particularly in cultural relevance and user engagement. It handles queries in Shona slang and English, supporting greetings, program inquiries, and application workflows, which improves accessibility for Shona-speaking users.
This research advances NLP for low-resource African languages. By making the dataset, the fine-tuned model, and the methodology publicly available, the work promotes digital inclusion and supports the creation of AI systems that are more culturally resonant and responsive to the world's diverse linguistic realities. For more details, see the full research paper.


