TLDR: This research explores improving Question Answering (QA) in Indonesian, a low-resource language, by implementing an Adaptive Retrieval-Augmented Generation (RAG) system. The system classifies question complexity to dynamically select the most efficient retrieval strategy (non-retrieval, single-retrieval, or multi-retrieval). The study created an Indonesian multi-retrieval dataset by translating HotpotQA and fine-tuned an IndoBERT model for question classification. While the classifier proved reliable and adaptive RAG showed promise in reducing retrieval steps, the multi-retrieval strategy faced significant challenges with current LLMs (Gemma 3, Qwen 3) in Indonesian, leading to inconsistencies and performance degradation, especially for complex questions. The findings highlight the potential and current limitations of RAG in low-resource language contexts and suggest future directions for improvement.
In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) have made incredible strides in tasks like question answering. However, their state-of-the-art performance is predominantly observed in English, leaving a significant gap for low-resource languages such as Indonesian. A recent study by William Christian, Daniel Adamlu, Adrian Yu, and Derwin Suhartono from Bina Nusantara University addresses this challenge by introducing an Adaptive Retrieval-Augmented Generation (RAG) system tailored for Indonesian language question answering.
LLMs, despite their power, often struggle with questions requiring extensive external knowledge, sometimes leading to ‘hallucinations’ or inaccurate answers, especially for less popular entities. Updating these models frequently is impractical due to the immense time and computational resources required for training. This is where Retrieval-Augmented Generation (RAG) comes in. RAG enhances LLMs by adding a retrieval layer that fetches relevant information from external sources before the LLM generates an answer, significantly improving accuracy and informativeness.
While initial RAG systems used a single retrieval step, more advanced methods, like multi-retrieval, have emerged for complex questions. However, these advanced systems come with higher computational costs and latency. To optimize this trade-off, the concept of Adaptive RAG was developed. Adaptive RAG integrates a classifier that assesses the complexity of a question and then dictates the most efficient strategy for answering it: no retrieval, a single retrieval, or multiple retrieval steps.
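The routing idea can be sketched in a few lines. Everything below is a hypothetical stub: the toy keyword heuristic stands in for the paper's fine-tuned IndoBERT classifier, and the three answering functions stand in for the real pipelines.

```python
# Minimal sketch of Adaptive RAG routing. The classifier and the three
# answering strategies are placeholder stubs, not the study's actual models.

def classify_complexity(question: str) -> str:
    """Stub classifier: 'A' (no retrieval), 'B' (single), 'C' (multi)."""
    if " dan " in question:                    # toy heuristic only
        return "C"
    return "B" if "siapa" in question.lower() else "A"

def answer_no_retrieval(q: str) -> str:
    return f"[LLM only] {q}"

def answer_single_retrieval(q: str) -> str:
    return f"[LLM + 1 retrieval] {q}"

def answer_multi_retrieval(q: str) -> str:
    return f"[LLM + iterative retrieval] {q}"

STRATEGIES = {
    "A": answer_no_retrieval,
    "B": answer_single_retrieval,
    "C": answer_multi_retrieval,
}

def adaptive_answer(question: str) -> str:
    # Classify once, then dispatch to the cheapest adequate strategy.
    return STRATEGIES[classify_complexity(question)](question)
```

The point of the design is that simple questions never pay the latency cost of retrieval, while multi-hop questions still get the iterative pipeline.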
The core challenge for implementing Adaptive RAG in Indonesian lies in the scarcity of high-quality datasets, particularly those designed for multi-retrieval scenarios. To overcome this, the researchers created a new Indonesian multi-retrieval dataset by translating the existing English HotpotQA dataset using an OPUS-MT translation model. This allowed for systematic evaluation in a low-resource language setting, even with minor translation imperfections that reflect real-world data challenges.
The Adaptive RAG system employs three main answering strategies:
Non-Retrieval
This is the most efficient method, relying solely on the LLM’s existing knowledge without fetching any external information. It’s used for simple questions where the answer is likely within the model’s parametric memory.
Single Retrieval
For questions requiring external context, this method performs a single retrieval operation using an ElasticSearch engine with the BM25 algorithm. The retrieved information is then augmented to the LLM’s input to generate a concise answer.
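To make the ranking step concrete, here is a self-contained, pure-Python version of BM25 scoring, the same scoring family the study's ElasticSearch index uses. The tiny corpus and the `k1`/`b` values are illustrative defaults, not taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency within this doc
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "ibu kota indonesia adalah jakarta".split(),
    "bahasa indonesia dituturkan di asia tenggara".split(),
]
scores = bm25_scores("ibu kota jakarta".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)   # index of top document
```

In the actual system this ranking happens inside ElasticSearch; the top-scoring passages are what get concatenated into the LLM prompt.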
Multi-Retrieval
Inspired by the IRCoT (Interleaving Retrieval with Chain-of-Thought) framework, this strategy iteratively combines retrieval and reasoning. The LLM guides the retrieval process by breaking down complex questions into sub-questions or keywords, retrieving information in cycles until a satisfactory answer is formed or a termination condition is met.
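The interleaved loop described above can be sketched as follows. Both the retriever (a toy lexical-overlap ranker standing in for ElasticSearch/BM25) and the LLM step are stubs; in the study the LLM is Gemma 3 or Qwen 3 prompted in Indonesian, and it either emits an answer or a sub-query for the next hop.

```python
# Sketch of an IRCoT-style multi-retrieval loop: retrieve, reason, then
# either stop with an answer or issue a follow-up query. All components
# here are illustrative stubs, not the study's implementation.

def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    scored = sorted(corpus,
                    key=lambda d: -len(set(query.split()) & set(d.split())))
    return scored[:k]

def llm_step(question, context):
    """Stub LLM: 'answers' once any context sentence mentions jakarta,
    otherwise proposes a hard-coded follow-up sub-query."""
    for sent in context:
        if "jakarta" in sent:
            return ("ANSWER", "Jakarta")
    return ("FOLLOWUP", "ibu kota indonesia")

def multi_retrieval_answer(question, corpus, max_hops=3):
    context, query = [], question
    for _ in range(max_hops):            # termination condition: hop budget
        context += retrieve(query, corpus)
        kind, payload = llm_step(question, context)
        if kind == "ANSWER":
            return payload
        query = payload                  # LLM-generated sub-query drives next hop
    return None                          # gave up: context judged insufficient
```

Each cycle grows the accumulated context, which is exactly where the paper observed trouble: long concatenated prompts in Indonesian degraded the models' reasoning.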
A crucial component of the Adaptive RAG system is the question classifier. This classifier, fine-tuned using IndoBERT (a BERT variant pretrained for Indonesian), categorizes questions into three complexity levels: ‘A’ (no retrieval needed), ‘B’ (single retrieval needed), and ‘C’ (multi-retrieval needed). The labeling process involved evaluating questions from IndoQA and QASiNa datasets with all three answering methods to determine their optimal complexity label. The classifier demonstrated reliable performance, particularly in identifying complex ‘C’ questions.
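The labeling idea can be expressed as a small selection rule: run a question through all three methods and keep the cheapest one that succeeds. This is a hedged sketch of that process; the exact success criterion and the fallback for questions no method answers are my assumptions, and in the study the per-method answers come from actually executing each pipeline.

```python
# Sketch of the complexity-labeling rule: assign each question the cheapest
# answering method that produced a correct answer. The fallback to 'C' when
# every method fails is an assumption, not confirmed by the paper.

COST_ORDER = ["A", "B", "C"]   # no retrieval < single < multi

def label_question(results: dict, gold: str) -> str:
    """results maps 'A'/'B'/'C' to the answer each method produced."""
    for label in COST_ORDER:
        if results.get(label, "").strip().lower() == gold.strip().lower():
            return label
    return "C"                 # nothing succeeded: treat as maximally complex
```

These labels then become the training targets for the IndoBERT classifier.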
The study evaluated the Adaptive RAG system using open-source LLMs like Gemma 3-4B and Qwen 3-8B. On datasets like IndoQA and QASiNa, single-retrieval methods generally showed the strongest performance. Adaptive RAG demonstrated its potential by effectively reducing unnecessary retrieval steps, leading to lower computational costs. However, its overall performance was often constrained by significant inconsistencies in the multi-retrieval method.
The multi-retrieval strategy, while conceptually powerful, faced considerable hurdles. The LLMs (Gemma 3 and Qwen 3) struggled with reasoning in Indonesian, especially when dealing with long prompts created by concatenating multiple retrieved documents. This often led to increased hallucination and a failure to effectively utilize the provided information, sometimes concluding that information was insufficient even when it was present. Larger models, like Gemini 2.5 Flash Lite, showed better capability in extracting focused keywords for subsequent retrieval steps, highlighting the impact of model size and language-specific training.
In conclusion, this research highlights the promising potential of Adaptive RAG systems in bridging language gaps for LLMs, particularly for Indonesian. The question complexity classifier proved accurate and reliable. However, the study also underscored a key limitation: the performance of the multi-retrieval answering method is heavily constrained by the LLM’s reasoning capabilities in a low-resource language context. This often resulted in a significant decline in overall performance when dealing with inherently complex, multi-hop questions.
For future improvements, the researchers recommend developing datasets originally written in Indonesian to avoid translation artifacts, utilizing or training LLMs specifically for the Indonesian language to enhance contextual understanding, and improving the multi-retrieval method itself, possibly by designing approaches tailored for low-resource languages.


