TLDR: This research paper investigates cross-lingual Retrieval-Augmented Generation (RAG) in Arabic-English corporate datasets, revealing that retrieval is a critical bottleneck, especially when user queries and supporting documents are in different languages. The core issue lies in the retriever’s inability to effectively rank documents across languages. The study proposes a simple mitigation strategy: balancing the number of retrieved documents from each language, which significantly improves cross-lingual and overall RAG performance, highlighting opportunities for practical advancements in multilingual retrieval.
Retrieval-Augmented Generation (RAG) has become a cornerstone for enhancing large language models (LLMs) by grounding them in external knowledge. While much of the focus has been on high-resource languages like English, many real-world applications, especially in corporate environments, deal with multilingual information. This includes content spanning both widely spoken and less-resourced languages, such as Arabic.
This research paper, titled “The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora,” examines cross-lingual RAG in an Arabic-English setting. The authors, Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, and Liane Lewin-Eytan, point to a gap in previous studies, which often relied on open-domain sources like Wikipedia. Such benchmarks, while useful, can mask retrieval problems because of language imbalances, overlap with pretraining data, and content the models have already memorized.
The Hidden Bottleneck: Retrieval in Cross-Lingual RAG
The study introduces new benchmarks derived from real-world corporate datasets in the UAE, focusing on legal and travel information. These datasets feature parallel English-Arabic documents, allowing for a systematic evaluation of multilingual retrieval behavior. A crucial finding is that retrieval itself acts as a major bottleneck in cross-lingual, domain-specific scenarios. Performance drops significantly when the user’s query and the supporting document are in different languages.
Further analysis reveals that the primary cause of these failures isn’t the LLM’s ability to understand cross-lingual queries, but rather the retriever’s difficulty in ranking documents across different languages within a shared embedding space. The retrieval models rank well when the candidate documents are all in the same language, but they struggle to prioritize the relevant passage when the query and the document are in different languages. Different embedding models, like BGE-M3 and M-E5, showed varying degrees of this cross-lingual bias.
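One rough way to see this kind of bias in a retriever's output is to check which languages dominate the top of the ranked list for a given query. The sketch below is an illustration, not the paper's evaluation method: `language_share` is a hypothetical helper, and it assumes each retrieved passage is already tagged with a language code such as "ar" or "en".

```python
from collections import Counter


def language_share(ranked_langs, k):
    """Return the fraction of the top-k retrieved passages in each language.

    ranked_langs: language codes of the retrieved passages, best match first.
    This is a hypothetical diagnostic, not a metric taken from the paper.
    """
    top = ranked_langs[:k]
    counts = Counter(top)
    # An empty result list yields an empty dict, so no division by zero occurs.
    return {lang: n / len(top) for lang, n in counts.items()}


# Example: an Arabic query whose top-4 results are mostly English passages.
share = language_share(["en", "en", "ar", "en"], k=4)
```

A share that is heavily skewed toward one language, for queries whose supporting document is in the other language, would be consistent with the ranking problem described above.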
A Simple Solution: Balanced Retrieval
To address this critical issue, the researchers propose a straightforward yet effective retrieval strategy: enforcing an equal selection of documents from each language. For instance, if 20 passages are to be retrieved, 10 would be in Arabic and 10 in English. This “balanced retriever” approach significantly improved cross-lingual performance without negatively impacting same-language retrieval.
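The quota idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Passage` type and `balanced_top_k` function are hypothetical names, each candidate is assumed to carry a language tag and a similarity score from the retriever, and merging the selected passages by score at the end is a choice made here because the summary does not say how the final list is ordered.

```python
from typing import NamedTuple


class Passage(NamedTuple):
    doc_id: str
    lang: str     # language code, e.g. "ar" or "en"
    score: float  # similarity to the query; higher is better


def balanced_top_k(candidates, k, langs=("ar", "en")):
    """Select k passages with an equal quota per language.

    Within each language, passages are ranked by score and the best
    k / len(langs) are kept. Hypothetical sketch of a balanced retriever.
    """
    if k % len(langs) != 0:
        raise ValueError("k must be divisible by the number of languages")
    quota = k // len(langs)
    selected = []
    for lang in langs:
        pool = sorted(
            (p for p in candidates if p.lang == lang),
            key=lambda p: p.score,
            reverse=True,
        )
        selected.extend(pool[:quota])
    # Assumption: present the merged list in descending score order.
    return sorted(selected, key=lambda p: p.score, reverse=True)


# Example: English passages outscore every Arabic passage, so a plain
# top-4 would contain only one Arabic passage; the quota keeps two.
candidates = [
    Passage("en1", "en", 0.9),
    Passage("en2", "en", 0.8),
    Passage("en3", "en", 0.7),
    Passage("ar1", "ar", 0.6),
    Passage("ar2", "ar", 0.5),
    Passage("ar3", "ar", 0.4),
]
top = balanced_top_k(candidates, k=4)
```

With k set to 20 and two languages, the same function returns 10 Arabic and 10 English passages, matching the example in the text. If one language has fewer candidates than its quota, this sketch simply returns fewer than k passages.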
The success of this simple intervention suggests that even with inherent biases in embedding models, debiasing strategies are feasible and can lead to substantial gains in real-world RAG applications. This work underscores the importance of re-evaluating cross-lingual retrieval in practical settings, moving beyond open-domain benchmarks to uncover and address real-world performance limitations.
For more detail, see the full research paper, “The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora.”


