TLDR: The paper introduces KuBERT, a BERT-based model for sentiment analysis in Central Kurdish, a low-resource language. It details the creation of a large Kurdish text corpus, a specialized tokenizer, and the training of various BERT models with different classifiers (Fine-Tuning, BiLSTM, MLP). KuBERT significantly outperforms traditional Word2Vec models, achieving up to 75.37% accuracy for 3-class sentiment analysis and 86.31% for 2-class, setting a new standard for NLP in Central Kurdish.
Sentiment analysis, a rapidly growing field within natural language processing (NLP), helps us understand opinions and feelings expressed in text. This capability is crucial for businesses, researchers, and governments to gain insights from public sentiment. While widely applied to major languages, sentiment analysis for low-resource languages like Central Kurdish faces significant challenges due to limited computational tools and linguistic diversity.
Historically, sentiment analysis for Central Kurdish relied on traditional word embedding models such as Word2Vec. However, with the advent of more advanced language models, particularly BERT (Bidirectional Encoder Representations from Transformers), there’s a promising path for substantial improvements. BERT’s superior ability to capture nuanced semantic meaning and contextual intricacies offers a new benchmark for sentiment analysis in languages with fewer digital resources.
A recent research paper, "KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis," by Kozhin Muhealddin Awlla, Hadi Veisi, and Abdulhady Abas Abdullah, introduces a significant advancement in this area. The study focuses on enhancing sentiment analysis for Central Kurdish by integrating the BERT model into NLP techniques. The core of their approach involves several key steps: collecting and normalizing a large corpus of Kurdish texts, pretraining BERT with a specialized tokenizer for Kurdish, and developing various models for sentiment analysis, including Bidirectional Long Short-Term Memory (BiLSTM), Multi-Layer Perceptron (MLP), and a fine-tuned BERT classifier.
The researchers meticulously gathered a substantial text corpus for training their BERT model, totaling nearly 300 million tokens from diverse sources such as the AsoSoft corpus, Kurdish websites, and the OSCAR 2019 corpus. This extensive dataset was crucial for training a robust language model for Central Kurdish. A critical component of their method is the Central Kurdish Tokenizer, a WordPiece tokenizer trained on this large corpus. The tokenizer is designed to handle the unique characteristics of Kurdish, including the informal writing styles common on social media, by breaking words into sub-word pieces to address out-of-vocabulary issues and capture emotional nuances.
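To make the sub-word idea concrete, here is a minimal sketch of the standard WordPiece inference step (greedy longest-match-first), the same mechanism a trained WordPiece tokenizer like the paper's Central Kurdish Tokenizer applies at runtime. The function name, the toy vocabulary, and the English example word are illustrative assumptions, not taken from the paper; the real tokenizer's vocabulary is learned from the 300-million-token Kurdish corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first.

    Continuation pieces carry the conventional '##' prefix; if no piece
    of the word is in the vocabulary, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark as a word-internal continuation
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece found: out-of-vocabulary word
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only (a real one is learned from data).
toy_vocab = {"play", "##ing", "##ed"}
print(wordpiece_tokenize("playing", toy_vocab))  # → ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # → ['[UNK]']
```

Because unseen or informally spelled words decompose into known pieces instead of a single unknown token, the model retains partial signal from noisy social-media text.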
The study developed and compared four distinct KuBERT models, each with varying parameter settings for epochs, iterations, hidden size, and other factors, to optimize performance. For sentiment analysis, these KuBERT models were then integrated with three different classifiers: Fine-Tuning BERT, BiLSTM, and MLP. The sentiment analysis task was designed to classify text into three categories: positive, negative, and neutral.
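As a rough illustration of the MLP classifier variant, the sketch below runs a pooled sentence embedding (such as BERT's [CLS] vector) through one hidden layer and a softmax over the three sentiment classes. All dimensions, weights, and names here are assumptions for demonstration; the paper's actual classifier sizes and training details are not reproduced.

```python
import numpy as np


def mlp_head(pooled, W1, b1, W2, b2):
    """Toy MLP classification head: pooled embedding -> 3 class probabilities."""
    hidden = np.maximum(0.0, pooled @ W1 + b1)          # ReLU hidden layer
    logits = hidden @ W2 + b2                            # one logit per class
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numeric stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)         # softmax


# Illustrative shapes: 2 sentences, 8-dim pooled embeddings, 3 classes
# (positive / negative / neutral). Weights are random stand-ins.
rng = np.random.default_rng(0)
pooled = rng.normal(size=(2, 8))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)

probs = mlp_head(pooled, W1, b1, W2, b2)
print(probs.shape)  # (2, 3): each row sums to 1
```

In the fine-tuning setup, by contrast, the classification head and the BERT encoder weights are updated together on the labeled sentiment data, which is what the paper found most accurate.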
The results demonstrated a clear superiority of the BERT-based models over traditional Word2Vec approaches. The fine-tuned BERT model consistently achieved the highest accuracy across all configurations, with its best performance reaching 75.37% for the three-class sentiment analysis. The BiLSTM classifier also showed strong results, peaking at 74.09% accuracy. When the sentiment analysis was simplified to a two-class problem (positive and negative, by removing neutral samples), the fine-tuned BERT model’s accuracy significantly improved to 86.31%, highlighting the impact of data imbalance on model performance.
This research marks a crucial step forward for NLP in Central Kurdish. By developing a robust text corpus and fine-tuned BERT models, the study addresses a significant gap in resources for this low-resource language. The findings underscore BERT’s effectiveness in understanding complex linguistic phenomena and semantic relationships, paving the way for future advancements in sentiment analysis and other NLP tasks for Kurdish and similar languages. The trained KuBERT models and tokenizer are made publicly available, encouraging further research and application in the field.


