TLDR: LC-Eval is a new bilingual (English and Arabic) benchmark designed to rigorously evaluate Large Language Models’ (LLMs) ability to understand and process very long texts, from 4k to over 128k tokens. It introduces four challenging tasks: multi-document question answering, bilingual question answering, claim verification, and multiple-choice questions. Evaluations show that even advanced models like GPT-4o find these tasks difficult, highlighting the benchmark’s complexity and the current limitations of LLMs, especially in Arabic.
Recent advances in Large Language Models (LLMs) have brought sophisticated capabilities, particularly in processing and understanding extended contexts. These models can now handle context lengths ranging from 4,000 to over 128,000 tokens, a significant leap from earlier models that typically topped out at around 4,000 tokens. This extended capacity is crucial for tasks like understanding long documents, reducing factual errors (hallucinations), and improving retrieval-augmented generation (RAG).
However, effectively evaluating these long-context LLMs (LCLMs) has become a pressing challenge, and existing benchmarks often fall short, especially for languages like Arabic. Arabic, spoken by over 400 million people, has seen the rise of several dedicated LLMs, but their evaluation often relies on English benchmarks or private datasets. This makes it difficult to publicly assess their performance, particularly on deep reasoning tasks, which current evaluations tend to overlook.
Introducing LC-Eval: A New Benchmark
To address these gaps, researchers have introduced LC-Eval, a novel bilingual, multi-task evaluation benchmark. Designed for both English and Arabic, LC-Eval aims to rigorously assess LCLMs’ understanding of long contexts, specifically targeting lengths from 4k to over 128k tokens. The benchmark introduces four new and challenging tasks:
- Multi-document Question Answering: This task requires models to synthesize information from several documents, some of which act as distractors, to answer a question. It tests deep reasoning, document comprehension, and the ability to trace information back to its source (a hypothetical sample layout is sketched after this list).
- Bilingual Question Answering: Here, a document might be in one language (e.g., Arabic) and the question in another (e.g., English). The model must understand the context in the source language and generate an accurate answer in the question’s language, demonstrating cross-lingual understanding and generation.
- Claim Verification: Models are presented with a paragraph containing multiple claims, some true and some false, based on a long document. The task is to identify the veracity of each claim, simulating real-world scenarios where information needs careful verification.
- Multiple-Choice Questions: This task involves answering multiple-choice questions based on long contexts, requiring a combination of document understanding and reasoning skills.
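To make the task format concrete, here is a minimal sketch of what a single multi-document QA item could look like. This is an illustrative assumption only: the field names (documents, source_doc_ids, question, answer) are invented for this example and are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultiDocQASample:
    """Hypothetical layout of one multi-document QA item; all field names are assumptions."""
    documents: List[str]       # every document shown to the model, distractors included
    source_doc_ids: List[int]  # indices of the documents that actually support the answer
    question: str              # question that requires synthesizing the source documents
    answer: str                # gold open-ended answer used later by the LLM judge

# Illustrative instance (placeholder strings stand in for long documents).
sample = MultiDocQASample(
    documents=["<long document 1>", "<long distractor document>", "<long document 3>"],
    source_doc_ids=[0, 2],
    question="How do the two reports differ in their conclusions?",
    answer="The first report attributes the change to policy X, the third to factor Y.",
)
```

A sample like this captures both requirements of the task: answering from multiple documents and attributing the answer to the correct sources rather than the distractors.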
How the Data Was Created and Evaluated
The datasets for LC-Eval were curated from a variety of publicly available sources, including Wikipedia dumps, WikiNews, WikiHow, WikiBooks, Project Gutenberg (for English books), and the Hindawi Organization (for Arabic books), along with articles from the Saudi Press Agency. This diverse collection ensures a rich mix of text genres and domains.
Initial data generation for the tasks was performed using GPT-4o, followed by a multi-stage refinement process to increase complexity. Crucially, all data underwent rigorous human validation by three annotators to ensure accuracy and quality. In total, the benchmark comprises 7,903 samples.
For evaluating open-ended questions in multi-document and bilingual QA, LC-Eval proposes an entity relationship-based evaluation method. Inspired by previous work, it uses an LLM as a judge to compare the entities and relationships expressed in a model's response against those in a gold-standard answer, scoring conceptual overlap rather than exact word matching, which is unreliable when phrasing varies. Other metrics, such as recall@k and standard accuracy, complete the assessment.
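As a rough illustration of the idea, the sketch below asks a judge LLM to extract (entity, relation, entity) triples from both the model response and the gold answer, then scores how many gold triples the response covers. The prompt wording, the call_judge_llm callable, and the exact-match scoring rule are all assumptions made for illustration; the paper's actual judging prompt and scoring may differ.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (entity, relation, entity)

EXTRACT_PROMPT = (
    "List the factual (entity, relation, entity) triples stated in the text below, "
    "one per line in the form: entity | relation | entity\n\nText:\n{text}"
)

def extract_triples(text: str, call_judge_llm: Callable[[str], str]) -> List[Triple]:
    """Ask the judge LLM for triples and parse its line-based reply."""
    reply = call_judge_llm(EXTRACT_PROMPT.format(text=text))
    triples: List[Triple] = []
    for line in reply.splitlines():
        parts = [p.strip().lower() for p in line.split("|")]
        if len(parts) == 3:
            triples.append((parts[0], parts[1], parts[2]))
    return triples

def entity_relation_score(response: str, gold: str,
                          call_judge_llm: Callable[[str], str]) -> float:
    """Share of gold triples also found in the model response (exact string match
    is a simplification; a real judge would also accept paraphrased equivalents)."""
    gold_triples = extract_triples(gold, call_judge_llm)
    response_triples = set(extract_triples(response, call_judge_llm))
    if not gold_triples:
        return 0.0
    matched = sum(1 for t in gold_triples if t in response_triples)
    return matched / len(gold_triples)
```

In practice, the matching step would itself be delegated to the judge LLM, so that triples phrased differently but conveying the same meaning still count as overlapping.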
Key Findings and Challenges
The evaluations conducted on both open-weight and closed LLMs, including high-performing models like GPT-4o, revealed that LC-Eval presents significant challenges. Even GPT-4o struggled with certain tasks, underscoring the benchmark’s rigor. A consistent trend observed was that LCLMs generally performed better in English tasks compared to Arabic tasks, highlighting a potential gap in multilingual capabilities and the need for more dedicated Arabic training data.
Models often showed a decline in performance as context length increased, particularly in multi-document question answering and bilingual QA. This suggests limitations in their ability to handle very long contexts or a large number of documents effectively. Furthermore, the benchmark uncovered specific flaws, such as models generating correct-seeming answers but failing to accurately trace the information back to the correct source documents.
Looking Ahead
LC-Eval is a significant contribution to the field, offering a much-needed benchmark for long-context understanding in both English and Arabic. It is particularly vital for Arabic, where such dedicated evaluation resources have been scarce. The human-validated dataset ensures high quality and serves as a valuable resource for progress toward Artificial General Intelligence (AGI) in both languages. While the initial data was generated using GPT-4o, the methodology introduced enough complexity to challenge even this advanced model, and other models occasionally outperformed it on specific tasks. The benchmark also supports evaluating context lengths up to 256k tokens, pushing the boundaries of current LCLM assessment.
For more details, you can read the full research paper here.


