TLDR: This research introduces the first part-of-speech (POS) tagger for Nagamese, a resource-poor creole language spoken in Northeast India. The study created a manually annotated corpus of 16,115 tokens and applied Conditional Random Fields (CRF) for tagging. The model achieved an overall tagging accuracy of 85.70%, with 86% precision and recall and an F1-score of 85%. This work lays a foundation for future Nagamese natural language processing applications.
Nagamese, an Assamese-lexified creole spoken widely in Nagaland, India, has long been a challenge for natural language processing (NLP) because it is a resource-poor language. While NLP has advanced significantly for languages like English and Hindi, Nagamese has remained largely unexplored. A recent research paper, "Part-of-speech tagging for Nagamese Language using CRF", is the first attempt to develop a part-of-speech (POS) tagger for Nagamese.
Part-of-speech tagging is a fundamental task in NLP, involving the labeling of each word in a sentence with its appropriate grammatical category, such as noun, verb, adjective, or adverb. This process is crucial for many higher-level language understanding applications. For instance, in Nagamese, a sentence like “Itu/ADJECTIVE dikhikena/NOUN Isor/NOUN khusi/ADJECTIVE lagise/VERB ./SYM” (God was pleased with what He saw) demonstrates how each word is assigned a specific tag.
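To make the word/TAG notation concrete, here is a minimal Python sketch that splits such a string into (word, tag) pairs. The helper name `parse_tagged` is hypothetical and is not taken from the paper; it only illustrates how a tagged sentence can be represented in code.

```python
def parse_tagged(sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for item in sentence.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps everything before the final slash.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


example = "Itu/ADJECTIVE dikhikena/NOUN Isor/NOUN khusi/ADJECTIVE lagise/VERB ./SYM"
tagged = parse_tagged(example)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

A list of (word, tag) pairs per sentence is a common input shape for sequence-labeling tools, which is why the later steps of a tagging pipeline usually start from it.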
Understanding Nagamese
Nagamese, also known as Naga Pidgin, serves as a common language across Nagaland, facilitating communication among various tribal groups and with people from Assam. It is an Assamese-lexified creole, meaning its vocabulary is largely derived from Assamese, but its grammatical structure is distinct. The language features 28 phonemes, including 6 vowels and 22 consonants, and words can range from one to four syllables. Its grammar has been documented in previous linguistic works, but computational resources have been scarce.
Building the Tagger: Methodology
The core of this research involved building a POS tagger using Conditional Random Fields (CRF), a powerful machine learning technique well-suited for sequence labeling tasks like POS tagging. Unlike simpler classifiers that predict labels for individual words in isolation, CRFs can consider the context of neighboring words, leading to more accurate predictions.
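For readers who want the underlying model, the standard textbook form of a linear-chain CRF is shown below. This is the general formulation, not an equation quoted from the paper; the symbols (feature functions f_k, weights λ_k, sentence length T) are the usual notation and are assumptions here.

```latex
% x = (x_1, ..., x_T): the words of a sentence
% y = (y_1, ..., y_T): the corresponding tag sequence
% f_k: feature functions over adjacent tags and the input; \lambda_k: learned weights
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Because each feature function sees both the previous tag and the current one, the score of a tag depends on its neighbors, and the normalizer Z(x) sums over whole tag sequences rather than over single words.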
A critical first step was the creation of an annotated corpus, as no such resource existed for Nagamese. The researchers collected articles from a local newspaper, ‘Nagamese Khobor’, and Bible phrases to form a mixed corpus. This dataset, comprising 16,115 tokens (individual words or punctuation marks) across 749 sentences, was then manually annotated by a native Nagamese speaker. A smaller subset was independently tagged by a second annotator to check the consistency of the tagging, and the disagreement rate was low.
The tagset developed for Nagamese consists of 14 categories, including common parts of speech like Adjective (ADJ), Noun (N), Verb (V), and Pronoun (PN). Additionally, specific tags were introduced for Foreign Words (FW), Symbols (SYM), and Unknown words (UNK) to better handle the unique characteristics of the language.
For the CRF model, several features were extracted from each word to aid in tagging. These included the word itself, whether it is the first or last word in the sentence, whether it is capitalized, whether it is entirely lowercase, its prefixes and suffixes of up to three characters, the previous and next words, and whether it contains a hyphen or is numeric. These features give the model the surface patterns and local context it needs for tagging.
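The feature list above can be sketched as a small Python function that builds one feature dictionary per token. The function name `word_features`, the feature keys, and the `<BOS>`/`<EOS>` boundary markers are assumptions chosen for illustration; the paper's exact feature encoding is not given here.

```python
def word_features(sentence, i):
    """Build a feature dict for the word at position i of a tokenized sentence."""
    word = sentence[i]
    features = {
        "word": word,
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
        "is_capitalized": word[:1].isupper(),
        "is_lower": word.islower(),
        "has_hyphen": "-" in word,
        "is_numeric": word.isdigit(),
        # Hypothetical boundary markers for words with no neighbor.
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }
    # Prefixes and suffixes of one to three characters.
    for n in (1, 2, 3):
        features[f"prefix_{n}"] = word[:n]
        features[f"suffix_{n}"] = word[-n:]
    return features


sentence = ["Itu", "dikhikena", "Isor", "khusi", "lagise", "."]
sentence_features = [word_features(sentence, i) for i in range(len(sentence))]
print(sentence_features[2])
```

Per-token dictionaries like these are a common input format for CRF toolkits, which turn each key/value pair into an indicator feature during training.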
Performance and Insights
The POS tagger was evaluated using a 70:30 split of the data for training and testing. The results showed an overall tagging accuracy of 85.70%, with precision and recall both at 86% and an F1-score of 85%. These are encouraging results for a first effort on a resource-poor language with a corpus of this size.
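As a rough illustration of this evaluation setup, the sketch below performs a 70:30 split and computes token-level accuracy. It assumes the split is made at the sentence level with a fixed random seed; the paper summary does not say how the split was drawn, and the function names are hypothetical.

```python
import random


def split_sentences(sentences, train_fraction=0.7, seed=0):
    """Shuffle sentences and split them into train and test portions."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    correct = sum(1 for g, p in pairs if g == p)
    return correct / len(pairs)


# 749 placeholder sentence ids stand in for the real corpus.
train, test = split_sentences(range(749))
print(len(train), len(test))

# Toy gold and predicted tags, invented purely to exercise the function.
gold = [["N", "V"], ["ADJ"]]
predicted = [["N", "N"], ["ADJ"]]
print(token_accuracy(gold, predicted))
```

Under these assumptions, a corpus of 749 sentences yields 524 training sentences and 225 test sentences.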
While the model performed well overall, an error analysis revealed specific areas for improvement. For instance, the tag for Complementizers (CMP) achieved perfect precision, while Nouns (N) had the lowest precision. Symbols (SYM) showed the highest recall and f1-score, whereas Unknown words (UNK) had the lowest. Common misclassifications included adjectives being tagged as adverbs or verbs, and nouns being confused with foreign words or postpositions/prepositions.
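Per-tag figures like these come from counting, for each tag, how often it was predicted correctly, predicted wrongly, or missed. The sketch below shows one way to compute them; the function name and the toy tag sequences are invented for illustration and do not reproduce the paper's data or results.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} from flat gold/predicted tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the right tag but was missed
    scores = {}
    for tag in set(gold) | set(predicted):
        predicted_count = tp[tag] + fp[tag]
        gold_count = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_count if predicted_count else 0.0
        recall = tp[tag] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Toy example: one noun mistaken for a foreign word, one adjective for an adverb.
gold_tags = ["N", "N", "ADJ", "V"]
predicted_tags = ["N", "FW", "ADV", "V"]
scores = per_tag_scores(gold_tags, predicted_tags)
for tag, (p, r, f) in sorted(scores.items()):
    print(f"{tag}\tP={p:.2f}\tR={r:.2f}\tF1={f:.2f}")
```

A tag can have perfect precision and still low recall: in the toy data every predicted N is correct, yet half of the true nouns were labeled as something else.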
Conclusion and Future Directions
This pioneering work successfully established a part-of-speech tagger for the Nagamese language, providing a foundational resource for future NLP research. The creation of the 16,115-token POS-annotated Nagamese corpus is a significant contribution in itself. The researchers acknowledge limitations, such as the number of tags used and the size of the dataset, and propose several avenues for future work. These include expanding the tagset and corpus size, utilizing the tagger to build other NLP applications like sentiment analysis and machine translation for Nagamese, and exploring transfer learning techniques from related languages like Assamese to further enhance performance.


