TLDR: This research explores improving Question Answering (QA) in Indonesian, a low-resource language, by implementing an Adaptive Retrieval-Augmented Generation (RAG) system. The system classifies question complexity to dynamically select the most efficient retrieval strategy (non-retrieval, single-retrieval, or multi-retrieval). The study created an Indonesian multi-retrieval dataset by translating HotpotQA and fine-tuned an IndoBERT model for question classification. While the classifier proved reliable and adaptive RAG showed promise in reducing retrieval steps, the multi-retrieval strategy faced significant challenges with current LLMs (Gemma 3, Qwen 3) in Indonesian, leading to inconsistencies and performance degradation, especially for complex questions. The findings highlight the potential and current limitations of RAG in low-resource language contexts and suggest future directions for improvement.
In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) have made incredible strides in tasks like question answering. However, their state-of-the-art performance is predominantly observed in English, leaving a significant gap for low-resource languages such as Indonesian. A recent study by William Christian, Daniel Adamlu, Adrian Yu, and Derwin Suhartono from Bina Nusantara University addresses this challenge by introducing an Adaptive Retrieval-Augmented Generation (RAG) system tailored for Indonesian language question answering.
LLMs, despite their power, often struggle with questions requiring extensive external knowledge, sometimes leading to ‘hallucinations’ or inaccurate answers, especially for less popular entities. Updating these models frequently is impractical due to the immense time and computational resources required for training. This is where Retrieval-Augmented Generation (RAG) comes in. RAG enhances LLMs by adding a retrieval layer that fetches relevant information from external sources before the LLM generates an answer, significantly improving accuracy and informativeness.
While initial RAG systems used a single retrieval step, more advanced methods, like multi-retrieval, have emerged for complex questions. However, these advanced systems come with higher computational costs and latency. To optimize this trade-off, the concept of Adaptive RAG was developed. Adaptive RAG integrates a classifier that assesses the complexity of a question and then dictates the most efficient strategy for answering it: no retrieval, a single retrieval, or multiple retrieval steps.
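The routing idea can be sketched in a few lines. Everything below is a hypothetical stub: the toy keyword heuristic stands in for the paper's fine-tuned IndoBERT classifier, and the three answering functions stand in for the real pipelines.

```python
# Minimal sketch of Adaptive RAG routing. The classifier and the three
# answering strategies are placeholder stubs, not the study's actual models.

def classify_complexity(question: str) -> str:
    """Stub classifier: 'A' (no retrieval), 'B' (single), 'C' (multi)."""
    if " dan " in question:                    # toy heuristic only
        return "C"
    return "B" if "siapa" in question.lower() else "A"

def answer_no_retrieval(q: str) -> str:
    return f"[LLM only] {q}"

def answer_single_retrieval(q: str) -> str:
    return f"[LLM + 1 retrieval] {q}"

def answer_multi_retrieval(q: str) -> str:
    return f"[LLM + iterative retrieval] {q}"

STRATEGIES = {
    "A": answer_no_retrieval,
    "B": answer_single_retrieval,
    "C": answer_multi_retrieval,
}

def adaptive_answer(question: str) -> str:
    # Classify once, then dispatch to the cheapest adequate strategy.
    return STRATEGIES[classify_complexity(question)](question)
```

The point of the design is that simple questions never pay the latency cost of retrieval, while multi-hop questions still get the iterative pipeline.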
The core challenge for implementing Adaptive RAG in Indonesian lies in the scarcity of high-quality datasets, particularly those designed for multi-retrieval scenarios. To overcome this, the researchers created a new Indonesian multi-retrieval dataset by translating the existing English HotpotQA dataset using an OPUS-MT translation model. This allowed for systematic evaluation in a low-resource language setting, even with minor translation imperfections that reflect real-world data challenges.
The Adaptive RAG system employs three main answering strategies:
Non-Retrieval
This is the most efficient method, relying solely on the LLM’s existing knowledge without fetching any external information. It’s used for simple questions where the answer is likely within the model’s parametric memory.
Single Retrieval
For questions requiring external context, this method performs a single retrieval operation using an ElasticSearch engine with the BM25 algorithm. The retrieved information is then augmented to the LLM’s input to generate a concise answer.
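To make the ranking step concrete, here is a self-contained, pure-Python version of BM25 scoring, the same scoring family the study's ElasticSearch index uses. The tiny corpus and the `k1`/`b` values are illustrative defaults, not taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency within this doc
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "ibu kota indonesia adalah jakarta".split(),
    "bahasa indonesia dituturkan di asia tenggara".split(),
]
scores = bm25_scores("ibu kota jakarta".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)   # index of top document
```

In the actual system this ranking happens inside ElasticSearch; the top-scoring passages are what get concatenated into the LLM prompt.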
Multi-Retrieval
Inspired by the IRCoT (Interleaving Retrieval with Chain-of-Thought) framework, this strategy iteratively combines retrieval and reasoning. The LLM guides the retrieval process by breaking down complex questions into sub-questions or keywords, retrieving information in cycles until a satisfactory answer is formed or a termination condition is met.
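The interleaved loop described above can be sketched as follows. Both the retriever (a toy lexical-overlap ranker standing in for ElasticSearch/BM25) and the LLM step are stubs; in the study the LLM is Gemma 3 or Qwen 3 prompted in Indonesian, and it either emits an answer or a sub-query for the next hop.

```python
# Sketch of an IRCoT-style multi-retrieval loop: retrieve, reason, then
# either stop with an answer or issue a follow-up query. All components
# here are illustrative stubs, not the study's implementation.

def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    scored = sorted(corpus,
                    key=lambda d: -len(set(query.split()) & set(d.split())))
    return scored[:k]

def llm_step(question, context):
    """Stub LLM: 'answers' once any context sentence mentions jakarta,
    otherwise proposes a hard-coded follow-up sub-query."""
    for sent in context:
        if "jakarta" in sent:
            return ("ANSWER", "Jakarta")
    return ("FOLLOWUP", "ibu kota indonesia")

def multi_retrieval_answer(question, corpus, max_hops=3):
    context, query = [], question
    for _ in range(max_hops):            # termination condition: hop budget
        context += retrieve(query, corpus)
        kind, payload = llm_step(question, context)
        if kind == "ANSWER":
            return payload
        query = payload                  # LLM-generated sub-query drives next hop
    return None                          # gave up: context judged insufficient
```

Each cycle grows the accumulated context, which is exactly where the paper observed trouble: long concatenated prompts in Indonesian degraded the models' reasoning.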
A crucial component of the Adaptive RAG system is the question classifier. This classifier, fine-tuned using IndoBERT (a BERT variant pretrained for Indonesian), categorizes questions into three complexity levels: ‘A’ (no retrieval needed), ‘B’ (single retrieval needed), and ‘C’ (multi-retrieval needed). The labeling process involved evaluating questions from IndoQA and QASiNa datasets with all three answering methods to determine their optimal complexity label. The classifier demonstrated reliable performance, particularly in identifying complex ‘C’ questions.
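The labeling idea can be expressed as a small selection rule: run a question through all three methods and keep the cheapest one that succeeds. This is a hedged sketch of that process; the exact success criterion and the fallback for questions no method answers are my assumptions, and in the study the per-method answers come from actually executing each pipeline.

```python
# Sketch of the complexity-labeling rule: assign each question the cheapest
# answering method that produced a correct answer. The fallback to 'C' when
# every method fails is an assumption, not confirmed by the paper.

COST_ORDER = ["A", "B", "C"]   # no retrieval < single < multi

def label_question(results: dict, gold: str) -> str:
    """results maps 'A'/'B'/'C' to the answer each method produced."""
    for label in COST_ORDER:
        if results.get(label, "").strip().lower() == gold.strip().lower():
            return label
    return "C"                 # nothing succeeded: treat as maximally complex
```

These labels then become the training targets for the IndoBERT classifier.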
The study evaluated the Adaptive RAG system using open-source LLMs like Gemma 3-4B and Qwen 3-8B. On datasets like IndoQA and QASiNa, single-retrieval methods generally showed the strongest performance. Adaptive RAG demonstrated its potential by effectively reducing unnecessary retrieval steps, leading to lower computational costs. However, its overall performance was often constrained by significant inconsistencies in the multi-retrieval method.
The multi-retrieval strategy, while conceptually powerful, faced considerable hurdles. The LLMs (Gemma 3 and Qwen 3) struggled with reasoning in Indonesian, especially when dealing with long prompts created by concatenating multiple retrieved documents. This often led to increased hallucination and a failure to effectively utilize the provided information, sometimes concluding that information was insufficient even when it was present. Larger models, like Gemini 2.5 Flash Lite, showed better capability in extracting focused keywords for subsequent retrieval steps, highlighting the impact of model size and language-specific training.
In conclusion, this research highlights the promising potential of Adaptive RAG systems in bridging language gaps for LLMs, particularly for Indonesian. The question complexity classifier proved accurate and reliable. However, the study also underscored a key limitation: the performance of the multi-retrieval answering method is heavily constrained by the LLM’s reasoning capabilities in a low-resource language context. This often resulted in a significant decline in overall performance when dealing with inherently complex, multi-hop questions.
For future improvements, the researchers recommend developing datasets originally written in Indonesian to avoid translation artifacts, utilizing or training LLMs specifically for the Indonesian language to enhance contextual understanding, and improving the multi-retrieval method itself, possibly by designing approaches tailored for low-resource languages.


