TLDR: NUSA AKSARA is a novel benchmark dataset designed to preserve Indonesian indigenous scripts, which are facing decline due to the dominance of romanized text in NLP development. Covering 8 scripts across 7 languages, including low-resource ones and even a Unicode-unsupported script (Lampung), the dataset supports tasks like OCR, transliteration, and translation from both text and image modalities. Initial evaluations show that current NLP models, including advanced LLMs, perform poorly on these non-Latin scripts, underscoring a critical need for more research and resources to safeguard Indonesia’s rich linguistic and cultural heritage.
Indonesia is a nation celebrated for its incredible linguistic diversity, boasting over 700 languages. Many of these languages traditionally used their own unique writing systems, known locally as aksara. However, in recent times, there has been a growing trend towards adopting romanized (Latin-based) scripts, leading to a gradual decline and neglect of these traditional writing systems.
The field of Natural Language Processing (NLP) has largely mirrored this trend, with most development focusing on romanized text. This leaves a significant gap in support for Indonesia’s native writing systems, contributing to a cycle that further diminishes their use. These local scripts are not merely tools for communication; they are vital vessels of cultural identity and repositories of historical knowledge.
To address this critical gap, researchers have introduced NUSA AKSARA, a groundbreaking public benchmark designed to preserve Indonesian indigenous scripts. This novel benchmark covers both text and image modalities and encompasses a diverse range of tasks, including image segmentation, Optical Character Recognition (OCR), transliteration (converting local script to Latin), translation, and language identification.
The NUSA AKSARA dataset is meticulously constructed by human experts through rigorous steps. It covers 8 distinct scripts across 7 languages, including several low-resource languages that are rarely seen in typical NLP benchmarks. A notable inclusion is the Lampung script, which presents a unique challenge as it is currently unsupported by Unicode, making its digital preservation particularly difficult.
Benchmarking various models, from large language models (LLMs) and vision-language models (VLMs) like GPT-4o and Llama 3.2, to task-specific systems such as PP-OCR and LangID, revealed a stark reality: most existing NLP technologies struggle significantly with Indonesia’s local scripts. Many models achieved near-zero performance, especially when dealing with the original scripts directly. Performance was generally better when the input was already in transliterated (romanized) text, indicating that the primary challenge lies in the models’ lack of exposure and representation of these unique scripts.
The importance of preserving these scripts extends beyond academic interest. They are integral to everyday life in certain regions, appearing on street signs in places like Yogyakarta and Bali. They are also part of the curriculum in Indonesian schools, connecting students with their heritage. Crucially, historical manuscripts and legal documents written in these scripts hold invaluable historical, scientific, and cultural knowledge, which would be lost if the ability to read and interpret them fades.
The creation of the NUSA AKSARA corpus involved sourcing materials from historical manuscripts, literary works, books, religious texts, magazines, and educational literature. These physical resources underwent a careful digitization process, including manual unbinding and scanning. To process the digitized content, a fine-tuned PaddleOCR model was developed to detect local scripts and distinguish them from Latin text.
Annotation was a collaborative effort involving native speakers, educators, linguists, and members of grassroots communities dedicated to script preservation. Annotators were tested on their ability to transcribe images into local script, transliterate local script to Latin, and translate the Latin transliteration into Indonesian. While overall agreement was high, challenges arose with certain scripts like Lontara due to standardization issues and Jawa due to phonetic variations.
Also Read:
- Advancing Medical AI: A Deep Dive into Reasoning Capabilities of Large Language Models
- Human Touch in AI Evaluation: Introducing the FACTORY Dataset
The NUSA AKSARA benchmark highlights the urgent need for broader support and integration of indigenous scripts into NLP pipelines. This effort is crucial not only for linguistic preservation but also for improving accessibility to historically marginalized scripts and languages, ensuring that Indonesia’s rich cultural and historical heritage remains accessible for future generations.


