ChiMDQA: A New Comprehensive Dataset for Chinese Document Question Answering

TLDR: The ChiMDQA research paper introduces a new, comprehensive Chinese Multi-Document Question Answering dataset designed for real-world business scenarios. It covers six diverse domains (academic, education, finance, law, medical, news) with 6,068 high-quality, fine-grained question-answer pairs. The dataset features a hierarchical question classification system and a detailed evaluation framework for both non-RAG and RAG systems. Experiments show that ChiMDQA effectively evaluates large language models, and Retrieval-Augmented Generation (RAG) significantly improves performance, though challenges like hallucination persist.

The field of Natural Language Processing (NLP) is constantly evolving, leading to a growing need for high-quality datasets that can train and evaluate intelligent question-answering (QA) systems. While significant progress has been made, particularly with models like BERT and GPT, much of the research has focused on English. Addressing this gap, a new research paper introduces ChiMDQA, a comprehensive Chinese Multi-Document Question Answering Dataset.

Introducing ChiMDQA: A New Benchmark for Chinese Document QA

Authored by Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, and Hongji Zeng, the paper titled “ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation” presents a dataset specifically designed for real-world business scenarios. It aims to overcome the limitations of existing Chinese QA datasets, which often lack diversity in documents and comprehensiveness in question types.

ChiMDQA is a meticulously curated dataset featuring 6,068 high-quality question-answer pairs. What makes it stand out is its coverage of long-form documents from six distinct and prevalent domains: academic, education, finance, law, medical treatment, and news. This broad topical coverage ensures the dataset is applicable to various NLP tasks, including document comprehension, knowledge extraction, and intelligent QA systems.

Diverse Document Topics and Fine-Grained Question Types

The dataset’s strength lies in its diverse document topics. Academic documents include research papers, educational materials cover textbooks, financial documents comprise reports, legal documents involve legal texts, medical documents feature clinical guidelines, and news documents consist of journalistic articles. This selection ensures a rich and representative collection of real-world information.

Beyond document diversity, ChiMDQA introduces a sophisticated, hierarchical question classification system. Questions are categorized into two main levels: Level 1 for explicit facts that require direct extraction, and Level 2 for implicit facts that demand inference or integration of information. These are further refined into ten fine-grained subtypes, aligning with various downstream tasks. Factual questions include retrieval, filtering, statistical, computational, and comparison tasks. Open-ended questions, which require deeper understanding and generative capabilities, encompass inference, expansion, summarization, suggestion, and generation tasks.
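As a rough illustration, the ten subtypes named above could be held in a small lookup structure. This is only a sketch: the names `QuestionCategory`, `SUBTYPES`, and `category_of` are hypothetical and do not come from the paper or its released data, and the grouping simply mirrors the factual and open-ended lists given in this article.

```python
from enum import Enum


class QuestionCategory(Enum):
    """Two broad groups of questions, as described in the article."""
    FACTUAL = "factual"
    OPEN_ENDED = "open-ended"


# Subtype names copied from the article; the dict layout is an assumption.
SUBTYPES = {
    QuestionCategory.FACTUAL: [
        "retrieval", "filtering", "statistical", "computational", "comparison",
    ],
    QuestionCategory.OPEN_ENDED: [
        "inference", "expansion", "summarization", "suggestion", "generation",
    ],
}


def category_of(subtype: str) -> QuestionCategory:
    """Return the broad category that a fine-grained subtype belongs to."""
    for category, names in SUBTYPES.items():
        if subtype in names:
            return category
    raise ValueError(f"unknown subtype: {subtype!r}")
```

A lookup like this makes it straightforward to report scores per subtype and then roll them up per category.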

Rigorous Dataset Construction and Evaluation

The creation of ChiMDQA followed a multi-stage pipeline, starting with data collection of approximately 15,000 multilingual PDF documents, which were then filtered and processed. Large Language Models (LLMs) like GLM-4-Pro were used for generating QA pairs, guided by specialized prompts. To ensure quality, a hybrid verification pipeline combined automated evaluation with a comprehensive human-in-the-loop review process, involving initial screening, deep verification, dispute arbitration, diversity review, and expert validation.

The researchers conducted extensive experiments to evaluate eight closed-source LLMs, including GPT-4, GPT-4o, and GLM-4-Plus, using a suite of fine-grained metrics for both non-RAG and RAG (Retrieval-Augmented Generation) systems. For factual questions, the metrics were Correct, Not Attempted, Incorrect, Correct Given Attempted, and F1-Score. For open-ended questions, the metrics were METEOR, ROUGE-L, CIDEr, Perplexity, and BERTScore-F1. RAG systems were further evaluated using the RAGChecker framework, which assesses the retrieval and generation modules with metrics such as Claim Recall, Context Precision, Faithfulness, and Hallucination.
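To make the factual-question metrics concrete, here is a minimal sketch of how they might be computed from per-question grades. The article does not give the formulas, so this assumes a common convention in which Correct Given Attempted is the share of correct answers among attempted ones, and the F1-Score is the harmonic mean of Correct and Correct Given Attempted. The function name `factual_metrics` and the grade labels are hypothetical, not taken from the paper.

```python
from collections import Counter

# Hypothetical grade labels; the paper's actual label strings may differ.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def factual_metrics(grades):
    """Compute factual-QA metrics from a list of per-question grades.

    Assumed definitions (not confirmed by the article):
      - correct / incorrect / not_attempted: fraction of all questions
      - correct_given_attempted: correct / (correct + incorrect)
      - f1: harmonic mean of correct and correct_given_attempted
    """
    if not grades:
        raise ValueError("grades must not be empty")
    unknown = set(grades) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"] / total
    incorrect = counts["incorrect"] / total
    not_attempted = counts["not_attempted"] / total

    attempted = counts["correct"] + counts["incorrect"]
    cga = counts["correct"] / attempted if attempted else 0.0

    denom = correct + cga
    f1 = 2 * correct * cga / denom if denom else 0.0

    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": cga,
        "f1": f1,
    }


if __name__ == "__main__":
    sample = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
    print(factual_metrics(sample))
```

Under these assumed definitions, a model that declines to answer when unsure keeps a high Correct Given Attempted but loses on Correct, and the F1-Score balances the two.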

Key Findings and Future Directions

Experimental results revealed that GPT-4o consistently achieved superior overall performance across both factual and open-ended questions. All models showed notable improvements when the RAG strategy was applied, with an average F1-Score gain of 4.6% for factual questions and a significant reduction in perplexity for open-ended questions, indicating lower uncertainty and more coherent outputs. However, challenges remain, particularly with hallucination rates exceeding 20% in RAG systems, suggesting ongoing work is needed to control factuality.

The ChiMDQA dataset provides a robust foundation for future research and practical applications in Chinese QA, offering a challenging benchmark for advancing NLP methodologies. The researchers plan to expand it by incorporating new high-value domains and by using a semi-automated pipeline with rigorous human verification to maintain quality while scaling the dataset efficiently. The full research paper has more details.

Nikhil Patel
