Unlocking Indonesia's Linguistic Heritage: A New Benchmark for Indigenous Scripts

TLDR: NUSA AKSARA is a novel benchmark dataset designed to preserve Indonesian indigenous scripts, which are facing decline due to the dominance of romanized text in NLP development. Covering 8 scripts across 7 languages, including low-resource ones and even a Unicode-unsupported script (Lampung), the dataset supports tasks like OCR, transliteration, and translation from both text and image modalities. Initial evaluations show that current NLP models, including advanced LLMs, perform poorly on these non-Latin scripts, underscoring a critical need for more research and resources to safeguard Indonesia’s rich linguistic and cultural heritage.

Indonesia is a nation celebrated for its incredible linguistic diversity, boasting over 700 languages. Many of these languages traditionally used their own unique writing systems, known locally as aksara. However, in recent times, there has been a growing trend towards adopting romanized (Latin-based) scripts, leading to a gradual decline and neglect of these traditional writing systems.

The field of Natural Language Processing (NLP) has largely mirrored this trend, with most development focusing on romanized text. This leaves a significant gap in support for Indonesia’s native writing systems, contributing to a cycle that further diminishes their use. These local scripts are not merely tools for communication; they are vital vessels of cultural identity and repositories of historical knowledge.

To address this critical gap, researchers have introduced NUSA AKSARA, a groundbreaking public benchmark designed to preserve Indonesian indigenous scripts. This novel benchmark covers both text and image modalities and encompasses a diverse range of tasks, including image segmentation, Optical Character Recognition (OCR), transliteration (converting local script to Latin), translation, and language identification.

The NUSA AKSARA dataset is meticulously constructed by human experts through rigorous steps. It covers 8 distinct scripts across 7 languages, including several low-resource languages that are rarely seen in typical NLP benchmarks. A notable inclusion is the Lampung script, which presents a unique challenge as it is currently unsupported by Unicode, making its digital preservation particularly difficult.

Benchmarking various models, from large language models (LLMs) and vision-language models (VLMs) like GPT-4o and Llama 3.2, to task-specific systems such as PP-OCR and LangID, revealed a stark reality: most existing NLP technologies struggle significantly with Indonesia’s local scripts. Many models achieved near-zero performance, especially when dealing with the original scripts directly. Performance was generally better when the input was already in transliterated (romanized) text, indicating that the primary challenge lies in the models’ lack of exposure and representation of these unique scripts.

The importance of preserving these scripts extends beyond academic interest. They are integral to everyday life in certain regions, appearing on street signs in places like Yogyakarta and Bali. They are also part of the curriculum in Indonesian schools, connecting students with their heritage. Crucially, historical manuscripts and legal documents written in these scripts hold invaluable historical, scientific, and cultural knowledge, which would be lost if the ability to read and interpret them fades.

The creation of the NUSA AKSARA corpus involved sourcing materials from historical manuscripts, literary works, books, religious texts, magazines, and educational literature. These physical resources underwent a careful digitization process, including manual unbinding and scanning. To process the digitized content, a fine-tuned PaddleOCR model was developed to detect local scripts and distinguish them from Latin text.

Annotation was a collaborative effort involving native speakers, educators, linguists, and members of grassroots communities dedicated to script preservation. Annotators were tested on their ability to transcribe images into local script, transliterate local script to Latin, and translate the Latin transliteration into Indonesian. While overall agreement was high, challenges arose with certain scripts like Lontara due to standardization issues and Jawa due to phonetic variations.

Also Read:

The NUSA AKSARA benchmark highlights the urgent need for broader support and integration of indigenous scripts into NLP pipelines. This effort is crucial not only for linguistic preservation but also for improving accessibility to historically marginalized scripts and languages, ensuring that Indonesia’s rich cultural and historical heritage remains accessible for future generations.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Unlocking Indonesia’s Linguistic Heritage: A New Benchmark for Indigenous Scripts

Gen AI News and Updates

A New Benchmark for Evaluating AI in Electronic Health Records: Introducing EHRStruct

Accelerating ML Hardware Design: A New Benchmark and AI Models for FPGA Resource Estimation

ContextCRBench: A New Benchmark for Detailed LLM Evaluation in Code Review

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates