TLDR: Researchers have created the first syllabified corpus for Tenyidie, a low-resource Tibeto-Burman language, containing 10,120 words. They trained four deep learning models on it (LSTM, BLSTM, BLSTM+CRF, and Encoder-decoder), and the BLSTM model reached the highest accuracy, 99.21%, at automatically identifying the syllables in a word. This foundational work paves the way for further Natural Language Processing applications in Tenyidie.
The Tenyidie language, a significant but low-resource language from the Tibeto-Burman family, is primarily spoken by the Tenyimia Community in Nagaland, India. Characterized by its tonal nature, Subject-Object-Verb structure, and highly agglutinative properties, Tenyidie has seen very limited research in Natural Language Processing (NLP). A recent study addresses a fundamental NLP task for Tenyidie: syllabification, which involves identifying the syllables within a given word.
To date, no prior work on syllabification for the Tenyidie language has been reported. This new research makes a significant contribution by creating a corpus of 10,120 manually syllabified Tenyidie words. This meticulously annotated dataset serves as a crucial foundation for applying deep learning techniques to the language.
The researchers applied several deep learning architectures to this newly created corpus, including Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BLSTM), BLSTM with Conditional Random Fields (BLSTM+CRF), and an Encoder-decoder model. The dataset was split into 80% for training, 10% for validation, and 10% for testing.
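The 80/10/10 split can be sketched as follows. This is a minimal illustration, not the authors' procedure: the article does not say whether the words were shuffled or how the split was seeded, so the shuffle, the `seed` parameter, and the placeholder word list are all assumptions.

```python
import random

def split_corpus(words, train_pct=80, val_pct=10, seed=0):
    """Shuffle a word list and split it into train/validation/test parts.

    Percentages are integers so the split sizes are computed exactly.
    Shuffling and the seed are assumptions, not details from the study.
    """
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

# Placeholder items standing in for the 10,120 annotated words.
corpus = [f"word_{i}" for i in range(10120)]
train, val, test = split_corpus(corpus)
print(len(train), len(val), len(test))  # 8096 1012 1012
```

With 10,120 words this gives 8,096 training, 1,012 validation, and 1,012 test items.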
Understanding Tenyidie Language Characteristics
The Tenyidie language utilizes the Roman (Latin) alphabet, similar to English, but notably excludes the letters ‘Q’ and ‘X’ while incorporating a special letter ‘Ü’. Its alphabet consists of 25 letters: 6 vowels (a, e, i, o, u, ü) and 19 consonants. The language exhibits monosyllables, disyllables, and sesquisyllables, with any vowel capable of functioning as the peak of a syllable. While generally considered an open syllable language (meaning no consonant occurs at the end of a syllable), some documentation suggests the presence of a rhotic consonant as a coda. Common syllable types identified include V (one vowel), CV (consonant followed by a vowel), and CCV (an initial consonant cluster followed by one vowel). Consonant clusters are typically found in root-initial positions, specifically plosive plus trill combinations.
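To make the V / CV / CCV notation concrete, here is a small sketch that maps a written syllable to its letter-level pattern using the six vowel letters (with the umlauted u written as ü). It works purely on spelling, which is a simplification: it treats every non-vowel letter as a separate consonant and does not model digraphs or phonological clusters. The function name `cv_pattern` is hypothetical.

```python
# The six vowel letters of the Tenyidie alphabet as described in the article.
VOWELS = set("aeiouü")

def cv_pattern(syllable):
    """Return the letter-level consonant/vowel pattern of a written syllable.

    Orthographic simplification: each letter is classified on its own,
    so multi-letter consonant spellings count as several C's.
    """
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())

# Syllables taken from the article's own example word, te+nyi+die.
for syl in ["te", "nyi", "die"]:
    print(syl, cv_pattern(syl))
```

For these three syllables the letter-level patterns are CV, CCV, and CVV.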
Corpus Creation and Annotation
The creation of the syllabified corpus began with the acquisition of 16,022 unique words from news data. After filtering out non-Tenyidie words and other tokens, 10,120 words remained for annotation. The manual syllabification process involved two annotators: the first performed the initial annotation following language expert guidelines, and the second corrected the dataset after consultation. This rigorous process ensured high accuracy in the annotated corpus. The dataset statistics show an average word length of 8.58 characters, with a minimum of 1 and a maximum of 20. The most frequent syllable types observed were CV, CCV, and CVV.
Deep Learning Experiments and Results
The syllabification task was framed as a sequence labeling problem, where each character in a word is labeled as either the start (S) or the continuation (C) of a syllable. For instance, the word “tenyidie” (te+nyi+die) would be labeled as “S C S C C S C C”.
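The S/C encoding and its inverse can be sketched in a few lines. The helper names `syllables_to_labels` and `labels_to_syllables` are hypothetical and are not taken from the paper; the sketch only illustrates the labeling scheme described above.

```python
def syllables_to_labels(syllables):
    """Encode a syllabified word as one S/C label per character."""
    labels = []
    for syl in syllables:
        if not syl:
            continue  # ignore empty segments
        labels.extend(["S"] + ["C"] * (len(syl) - 1))
    return labels

def labels_to_syllables(word, labels):
    """Decode per-character S/C labels back into a list of syllables."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    syllables = []
    for ch, label in zip(word, labels):
        if label == "S" or not syllables:
            syllables.append(ch)   # start a new syllable
        else:
            syllables[-1] += ch    # continue the current syllable
    return syllables

labels = syllables_to_labels(["te", "nyi", "die"])
print(" ".join(labels))                          # S C S C C S C C
print(labels_to_syllables("tenyidie", labels))   # ['te', 'nyi', 'die']
```

Because every character gets exactly one label, a model's predicted label sequence can always be decoded back into a syllabified word of the same length as the input.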
The deep learning models were configured with a word embedding dimension of 128, optimized using the Adam optimizer, and trained for 40 epochs with a learning rate of 0.001. The batch size was 128 for LSTM, BLSTM, and BLSTM+CRF, and 16 for the Encoder-decoder model.
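The reported hyperparameters can be collected in one place, as in the sketch below. The `TrainConfig` class and its field names are hypothetical, and the training-set size of 8,096 assumes an exact 80% split of the 10,120 words; the steps-per-epoch figures are simple arithmetic derived from those numbers, not values reported by the study.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the article (field names are illustrative)."""
    model: str
    batch_size: int
    embedding_dim: int = 128
    epochs: int = 40
    learning_rate: float = 0.001
    optimizer: str = "adam"

    def steps_per_epoch(self, n_train):
        # Number of mini-batches needed to see every training word once.
        return math.ceil(n_train / self.batch_size)

N_TRAIN = 10120 * 80 // 100  # 8,096, assuming an exact 80% training split

CONFIGS = [
    TrainConfig("LSTM", batch_size=128),
    TrainConfig("BLSTM", batch_size=128),
    TrainConfig("BLSTM+CRF", batch_size=128),
    TrainConfig("Encoder-decoder", batch_size=16),
]

for cfg in CONFIGS:
    print(cfg.model, cfg.batch_size, cfg.steps_per_epoch(N_TRAIN))
```

Under these assumptions, a batch size of 128 means 64 updates per epoch, while the Encoder-decoder's batch size of 16 means 506 updates per epoch.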
The experimental results on the 1,012-word test set demonstrated impressive accuracy:
- LSTM: 97.04%
- BLSTM: 99.21%
- BLSTM+CRF: 99.01%
- Encoder-decoder (NMT) with attention: 94.27%
The Bidirectional Long Short-Term Memory (BLSTM) model achieved the highest accuracy of 99.21%, indicating its effectiveness in capturing the sequential dependencies required for accurate syllabification in Tenyidie. Adding a CRF layer did not help (99.01%), and the Encoder-decoder trailed the BLSTM by about five percentage points. The study noted that Encoder-decoder models, while powerful, typically perform better with larger datasets, which could explain their lower performance in this context.
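The article does not state whether accuracy was measured per word or per character. As an illustration only, the sketch below assumes word-level accuracy (a word counts as correct when its whole syllabification matches the gold annotation) and works out how many of the 1,012 test words each reported percentage would correspond to under that assumption. The function names are hypothetical.

```python
def word_accuracy(gold, predicted):
    """Percentage of words whose predicted syllabification matches gold exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

def implied_correct(accuracy_pct, n_words):
    """Nearest whole number of correct words for a reported percentage."""
    return round(accuracy_pct * n_words / 100)

TEST_SIZE = 1012  # 10% of 10,120 words
REPORTED = {      # accuracies as listed in the article
    "LSTM": 97.04,
    "BLSTM": 99.21,
    "BLSTM+CRF": 99.01,
    "Encoder-decoder": 94.27,
}

for model, pct in REPORTED.items():
    n_ok = implied_correct(pct, TEST_SIZE)
    # Recompute the percentage from the whole-word count as a consistency check.
    print(model, n_ok, round(100 * n_ok / TEST_SIZE, 2))
```

Each reported figure is reproduced exactly by a whole number of correct words out of 1,012, which is consistent with, though not proof of, a word-level metric.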
Future Implications
This pioneering work provides a robust syllabifier for the Tenyidie language, which will be invaluable for numerous other NLP applications. These include morphological analysis, part-of-speech tagging, and machine translation, all of which rely on accurate syllable information. Future work aims to utilize this syllabifier in these tasks and to expand the annotated dataset further, potentially enhancing the performance of models like the Encoder-decoder. For more technical details, you can refer to the full research paper: Tenyidie Syllabification corpus creation and deep learning applications.


