HausaMovieReview: A New Dataset Paves the Way for Sentiment Analysis in Low-Resource Hausa

TLDR: Researchers introduce HausaMovieReview, a 5,000-comment dataset for sentiment analysis in Hausa and code-switched English, sourced from YouTube comments on a Kannywood series. Surprisingly, a classical Decision Tree model (89.71% accuracy) outperformed fine-tuned BERT (79.7%) and RoBERTa (76.6%) by roughly 10 to 13 percentage points, suggesting that classical methods with strong feature engineering can be highly effective for low-resource, domain-specific NLP tasks.

Natural Language Processing (NLP) has made incredible strides, but many African languages, including Hausa, remain “low-resource.” This means there’s a significant lack of annotated datasets, which are crucial for developing effective NLP tools. This gap often leaves these languages underrepresented in the world of AI and machine learning.

Addressing this challenge, a new research paper introduces HausaMovieReview, a benchmark dataset designed specifically for sentiment analysis in Hausa. The dataset is a collection of 5,000 YouTube comments on Kannywood productions; Kannywood is the popular Hausa-language film industry based in Northern Nigeria. What makes the dataset particularly interesting is its inclusion of code-switched English, which reflects how people in the region actually write online.

Building the Dataset

HausaMovieReview was built in three steps. Researchers first gathered 17,095 raw comments from 13 episodes of the popular Kannywood series “Labarina” on YouTube. From this pool, they randomly selected 5,000 comments for annotation. Three independent native Hausa speakers, all familiar with the language’s nuances and the Kannywood industry, then labeled each comment as Positive, Neutral, or Negative, following clear guidelines, with the final label decided by majority vote. The resulting inter-annotator agreement was strong, with a Fleiss’ Kappa score of 0.865. The complete dataset and associated code are openly available on GitHub, fostering further research and development.
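To make the agreement figure concrete, here is a minimal Python sketch of majority-vote finalization and Fleiss’ Kappa for three annotators. The toy ratings, the function names, and the choice to return None on a three-way tie are assumptions for illustration only; the article does not say how ties were resolved, and this is not the authors’ code.

```python
from collections import Counter

LABELS = ("Positive", "Neutral", "Negative")


def majority_vote(votes):
    """Return the most common label, or None when the top two labels tie."""
    (top, top_n), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_n:
        return None  # assumption: ties are left unresolved in this sketch
    return top


def fleiss_kappa(ratings, labels=LABELS):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who gave item i the label j
    counts = [[item.count(label) for label in labels] for item in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall label proportions
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(labels))
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1 - expected)


# Invented toy ratings: 4 comments, 3 annotators each (not from the dataset)
toy_ratings = [
    ["Positive", "Positive", "Positive"],
    ["Positive", "Positive", "Neutral"],
    ["Negative", "Negative", "Negative"],
    ["Neutral", "Negative", "Negative"],
]

final_labels = [majority_vote(votes) for votes in toy_ratings]
kappa = fleiss_kappa(toy_ratings)
print(final_labels)     # ['Positive', 'Positive', 'Negative', 'Negative']
print(round(kappa, 3))  # 0.467
```

A Kappa of 1.0 means the annotators always agree, while 0 means agreement no better than chance, so the reported 0.865 sits close to the top of the scale.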

You can find the full research paper here: HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language.

Testing the Models

With the HausaMovieReview dataset in hand, the researchers ran a comparative analysis of classical machine learning models and deep learning transformer models. The classical models were Logistic Regression, Decision Tree, and K-Nearest Neighbors. On the deep learning side, the researchers fine-tuned BERT and RoBERTa, two models known for their ability to capture complex semantic and contextual information.

The data passed through a preprocessing pipeline that converted text to lowercase, removed punctuation, and tokenized each comment. For the classical models, features were extracted using Term Frequency-Inverse Document Frequency (TF-IDF), which gives a word more weight the more often it appears in a comment and the rarer it is across the entire dataset. The models were evaluated using 10-fold cross-validation and standard metrics: accuracy, precision, recall, F1-score, and AUC.
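The preprocessing and TF-IDF steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the researchers’ pipeline: the sample comments are invented, the function names are hypothetical, tokenization is a simple whitespace split, and the weighting is the textbook formula (term frequency times the log of inverse document frequency). Library implementations often add smoothing and normalization, so their numbers will differ.

```python
import math
import string
from collections import Counter


def preprocess(comment):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    lowered = comment.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented toy comments mixing Hausa and English (not taken from the dataset)
comments = [
    "Wannan fim yayi kyau!",
    "Nice episode, yayi kyau",
    "Wannan episode bai yi kyau ba.",
]

tokens = [preprocess(c) for c in comments]
vectors = tfidf(tokens)
print(tokens[1])           # ['nice', 'episode', 'yayi', 'kyau']
print(vectors[1]["kyau"])  # 0.0, because "kyau" appears in every comment
```

In this toy example the word that occurs in every comment receives a weight of zero, while a word unique to one comment, such as "nice", receives the highest weight in that comment. That is the sense in which TF-IDF favors distinctive words.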

Surprising Results

The findings presented a fascinating and somewhat counterintuitive outcome. The Decision Tree classifier, a classical machine learning model, significantly outperformed the deep learning models. It achieved an impressive accuracy of 89.71% and an F1-score of 89.60%. In contrast, the fine-tuned BERT model achieved an accuracy of 79.7% and an F1-score of 75.62%, while RoBERTa followed with 76.6% accuracy and 72.92% F1-score. Logistic Regression also performed very well with an accuracy of 86.81% and the highest AUC score of 95.92%.
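For readers unfamiliar with how accuracy and F1 differ, the following Python sketch computes both on a handful of invented predictions. The labels and predictions are made up, the function names are hypothetical, and macro-averaging (an unweighted mean of the per-class F1 scores) is an assumption; the article does not state which averaging scheme the researchers used.

```python
LABELS = ("Positive", "Neutral", "Negative")


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """Harmonic mean of precision and recall for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Invented toy labels and predictions (not results from the paper)
y_true = ["Positive", "Positive", "Neutral", "Negative", "Negative", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Negative", "Negative", "Positive"]

print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```

Because F1 balances precision and recall for each class, it can fall below accuracy when a model handles some classes worse than others, which is consistent with the gap between accuracy and F1 reported for BERT and RoBERTa.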

This surprising result suggests that for relatively small, domain-specific datasets like HausaMovieReview, classical models, especially when combined with effective feature engineering (like TF-IDF), can be more effective and computationally efficient than large, resource-intensive transformer models. The researchers hypothesize that transformer models, which typically require vast amounts of data, might be prone to overfitting or struggle to fully leverage their pre-trained knowledge in a limited data environment.

Implications and Future Directions

The HausaMovieReview dataset and the study’s findings lay a solid foundation for future research in sentiment analysis for low-resource languages. They highlight the potential of classical machine learning approaches in contexts where large datasets for deep learning are scarce. Future work includes expanding the dataset, developing more Hausa-specific NLP tools, and further investigating why classical models performed so well. The researchers also plan to explore larger multilingual transformer models and advanced fine-tuning techniques that could unlock even better performance in this vital area of NLP.
