TLDR: A project is developing a trilingual (Toto-Bangla-English) language learning application to preserve the endangered Toto language of West Bengal, India. It combines traditional linguistic fieldwork with AI, including morphological analysis, corpus creation, and training a Small Language Model and a Transformer-based translation engine, offering a sustainable model for language revitalization. The initiative addresses the language’s critical endangerment by making it accessible to both native and non-native speakers through digital archiving and interactive learning tools.
In an inspiring effort to safeguard linguistic diversity, a groundbreaking project is underway to preserve and revitalize the critically endangered Toto language of West Bengal, India. Spoken by fewer than 1,700 individuals of the Toto tribal community in Totopara, near the Indo-Bhutan border, this Sino-Tibetan language faces a serious threat of extinction. The initiative, detailed in the research paper “Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal”, combines traditional linguistic methods with cutting-edge Artificial Intelligence to create a sustainable model for language preservation.
The core of this project is the development of a trilingual language learning application, designed to facilitate learning in Toto, Bangla, and English. This application aims to serve both native Toto speakers, particularly younger generations who are often more proficient in Bangla, and non-native learners such as researchers, tourists, and educators. By connecting Toto speakers to their mother tongue while also enabling them to learn English – a language of economic and social mobility – the project seeks to revolutionize multilingual education within indigenous and minority communities.
The research involved extensive fieldwork, including systematic data collection from native Toto speakers. This process captured not only textual content but also crucial audio recordings, which will be integrated into the learning application to provide real speech examples. The data collection utilized structured questionnaires, including exhaustive word lists categorized by parts of speech, and specific questions to document inflectional morphology (person-number-gender agreement, tense-aspect-mood distinctions, and case marking) and derivational morphology (word-class changes like adjective to verb or verb to noun).
A significant aspect of the project is the detailed morphological analysis of the Toto language. Researchers documented inflectional morphemes, which mark grammatical categories such as plurality (-bɪ), tense (-mi for present/past, -na for past/present, -ro for future), aspect (-dɑŋ, -diŋ, -duŋ for progressive, -pate for present/past perfect, -pu for future perfect), and mood (indicative, imperative, subjunctive). The study also identified case morphemes: null for nominative; -hẽ, -hiŋ, -hi for accusative; -ko, -kɔ for genitive; -ta, -ʃo for locative; -hiŋ, -ta for dative; and -ʃɔ for both ablative and instrumental. Furthermore, the paper describes derivational morphemes such as -pɑɈoɑ, -Ɉoɑ, -pæʋɑ, and -ʋɑ, which derive new words and change word class, for example turning adjectives into verbs or verbs into nouns.
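To give a feel for how an inventory like this could feed the app's morpheme-level explanations, here is a minimal Python sketch of suffix-based segmentation. It is not the authors' analyzer: the suffix table copies only the tense, aspect, and plural markers listed above, the gloss labels are simplified, the longest-match stripping strategy is an assumption, and `xxx` is a placeholder stem rather than a real Toto word.

```python
# Inflectional suffixes as listed in the article; gloss labels are simplified.
SUFFIXES = {
    "bɪ": "PLURAL",
    "mi": "TENSE (present/past)",
    "na": "TENSE (past/present)",
    "ro": "FUTURE",
    "dɑŋ": "PROGRESSIVE",
    "diŋ": "PROGRESSIVE",
    "duŋ": "PROGRESSIVE",
    "pate": "PERFECT (present/past)",
    "pu": "FUTURE PERFECT",
}


def segment(word: str) -> list[tuple[str, str]]:
    """Strip known suffixes from the right, trying the longest suffix first.

    This is a naive heuristic (an assumption, not the paper's method): it
    ignores allomorphy and will over-segment stems that happen to end in a
    suffix-like string.
    """
    morphemes: list[tuple[str, str]] = []
    stem = word
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            # Require a non-empty stem to remain after stripping.
            if stem.endswith(suffix) and len(stem) > len(suffix):
                morphemes.insert(0, (suffix, SUFFIXES[suffix]))
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return [(stem, "STEM")] + morphemes


# "xxx" stands in for a stem; it is a placeholder, not Toto vocabulary.
print(segment("xxxro"))
```

A real analyzer would need native-speaker verification of every segmentation, which is why the project pairs automated processing with manual annotation.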
AI Integration for Language Revitalization
To ensure the practical preservation and revitalization of Toto, the project incorporates a focused AI framework. This includes the development of a Small Language Model (SLM) and a Toto–Bangla–English trilingual translation engine. Unlike large-scale models, this solution is tailored for the limited data resources of the Toto language. The integration process involves several key stages:
- Corpus Collection: A trilingual parallel corpus was built from recorded Toto utterances, manually translated into Bangla and English, and verified by native speakers and linguists. Each entry is richly annotated with morpheme-level tags, part-of-speech, and syntactic boundaries.
- Script Standardization: Given that the Toto script was formalized only in 2015, ensuring its Unicode compatibility was crucial. Custom mapping tables were created, and a Romanized-to-Script transliteration tool was developed to support digital literacy.
- Data Processing: The collected data underwent extensive linguistic preprocessing, including tokenization, morpheme segmentation, script normalization, and trilingual sentence alignment. Data augmentation techniques were also used to expand the dataset.
- Small Language Model Training: An SLM was trained specifically for Toto using Masked Language Modeling objectives. A custom tokenizer was built, and a distilled Transformer model (2-4 layers, ~5M parameters) was trained on approximately 20,000 Toto sentences to enable tasks like word prediction and phrase suggestion.
- Trilingual Translator: A transformer-based encoder-decoder model was trained using the trilingual corpus, leveraging open-source frameworks and pretrained multilingual models for initialization.
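The Small Language Model step in the list above can be made more concrete with a short Python sketch. It shows two things: a rough parameter budget for a tiny Transformer, and the token masking behind a Masked Language Modeling objective. Everything numeric here is an assumption for illustration: the vocabulary size (8,000), hidden size (256), the 4x feed-forward expansion, the 15% masking rate (borrowed from common BERT-style practice), and the `[MASK]` symbol are not stated in the paper summary. The masking is also simplified, since it always substitutes the mask token.

```python
import random


def estimate_params(vocab_size: int, d_model: int, n_layers: int) -> int:
    """Rough encoder parameter count, ignoring biases and LayerNorm weights."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model            # Q, K, V and output projections
    feed_forward = 2 * d_model * (4 * d_model)   # two linear layers, 4x expansion
    return embedding + n_layers * (attention + feed_forward)


def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels holds the original token at masked positions and None elsewhere,
    so a training loop would compute loss only where a label is present.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical configuration that lands near the ~5M figure quoted above.
print(estimate_params(vocab_size=8000, d_model=256, n_layers=4))  # 5193728

# Placeholder tokens; a real pipeline would use the custom Toto tokenizer.
print(mask_tokens(["tok1", "tok2", "tok3", "tok4"], rng=random.Random(0)))
```

Under these assumed settings the embedding table accounts for roughly two fifths of the budget, which is one reason a compact, language-specific tokenizer matters for a model this small.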
The final AI models will be integrated into a lightweight web and mobile application, offering Toto-Bangla-English translation, morpheme-level explanations, script display, and transliteration. The app will also support offline inference, making it accessible even in areas with limited connectivity.
Challenges and Future Implications
The project acknowledges several challenges, including the limited number of Toto speakers (1,606), which makes data collection and linguistic verification difficult. Because the language was transmitted primarily orally until its script was formalized in 2015, written documentation is scarce. Variations in pronunciation and structure across generations further complicate standardization. From a technological standpoint, this scarcity of written Toto material means the large training sets that AI models typically require are not available.
Despite these challenges, this research offers a sustainable, ethically sound, and technologically feasible approach to language preservation. It provides a comprehensive analysis of Toto’s morphology and integrates AI-driven tools to make the language more accessible. The project fosters multilingual competence among Toto speakers and promotes cross-cultural understanding for non-native learners. By bridging traditional linguistic fieldwork with computational modeling, this initiative serves as a blueprint for future endangered language preservation projects, emphasizing the vital role of community participation, government support, and academic collaboration in sustaining indigenous languages in the digital age.


