ChiMDQA: A New Comprehensive Dataset for Chinese Document Question Answering

TLDR: The ChiMDQA research paper introduces a new, comprehensive Chinese Multi-Document Question Answering dataset designed for real-world business scenarios. It covers six diverse domains (academic, education, finance, law, medical, news) with 6,068 high-quality, fine-grained question-answer pairs. The dataset features a hierarchical question classification system and a detailed evaluation framework for both non-RAG and RAG systems. Experiments show that ChiMDQA effectively evaluates large language models, and Retrieval-Augmented Generation (RAG) significantly improves performance, though challenges like hallucination persist.

The field of Natural Language Processing (NLP) is constantly evolving, leading to a growing need for high-quality datasets that can train and evaluate intelligent question-answering (QA) systems. While significant progress has been made, particularly with models like BERT and GPT, much of the research has focused on English. Addressing this gap, a new research paper introduces ChiMDQA, a comprehensive Chinese Multi-Document Question Answering Dataset.

Introducing ChiMDQA: A New Benchmark for Chinese Document QA

Authored by Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, and Hongji Zeng, the paper titled “ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation” presents a dataset specifically designed for real-world business scenarios. It aims to overcome the limitations of existing Chinese QA datasets, which often lack diversity in documents and comprehensiveness in question types.

ChiMDQA is a meticulously curated dataset featuring 6,068 high-quality question-answer pairs. What makes it stand out is its coverage of long-form documents from six distinct and prevalent domains: academic, education, finance, law, medical treatment, and news. This broad topical coverage ensures the dataset is applicable to various NLP tasks, including document comprehension, knowledge extraction, and intelligent QA systems.

Diverse Document Topics and Fine-Grained Question Types

The dataset’s strength lies in its diverse document topics. Academic documents include research papers, educational materials cover textbooks, financial documents comprise reports, legal documents involve legal texts, medical documents feature clinical guidelines, and news documents consist of journalistic articles. This selection ensures a rich and representative collection of real-world information.

Beyond document diversity, ChiMDQA introduces a sophisticated, hierarchical question classification system. Questions are categorized into two main levels: Level 1 for explicit facts that require direct extraction, and Level 2 for implicit facts that demand inference or integration of information. These are further refined into ten fine-grained subtypes, aligning with various downstream tasks. Factual questions include retrieval, filtering, statistical, computational, and comparison tasks. Open-ended questions, which require deeper understanding and generative capabilities, encompass inference, expansion, summarization, suggestion, and generation tasks.
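As a rough illustration, the ten subtypes named above can be written down as a small lookup structure. This is only a sketch: the names `QuestionCategory`, `SUBTYPES`, and `category_of` are hypothetical and are not taken from the paper or any released code. Only the subtype labels and their grouping into factual and open-ended questions come from the description above.

```python
from enum import Enum


class QuestionCategory(Enum):
    """Top-level grouping of subtypes (hypothetical names, for illustration)."""
    FACTUAL = "factual"
    OPEN_ENDED = "open-ended"


# Subtype labels as listed in the article; the dictionary layout is an assumption.
SUBTYPES = {
    QuestionCategory.FACTUAL: (
        "retrieval", "filtering", "statistical", "computational", "comparison",
    ),
    QuestionCategory.OPEN_ENDED: (
        "inference", "expansion", "summarization", "suggestion", "generation",
    ),
}


def category_of(subtype: str) -> QuestionCategory:
    """Return the top-level category that a fine-grained subtype belongs to."""
    for category, names in SUBTYPES.items():
        if subtype in names:
            return category
    raise ValueError(f"unknown subtype: {subtype!r}")
```

A lookup like this would let an evaluation script route each question to the matching metric family, for example factual metrics for `"comparison"` and generation metrics for `"summarization"`.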

Rigorous Dataset Construction and Evaluation

The creation of ChiMDQA followed a multi-stage pipeline, starting with data collection of approximately 15,000 multilingual PDF documents, which were then filtered and processed. Large Language Models (LLMs) like GLM-4-Pro were used for generating QA pairs, guided by specialized prompts. To ensure quality, a hybrid verification pipeline combined automated evaluation with a comprehensive human-in-the-loop review process, involving initial screening, deep verification, dispute arbitration, diversity review, and expert validation.

The researchers conducted extensive experiments to evaluate eight closed-source LLMs, including GPT-4, GPT-4o, and GLM-4-Plus, using a suite of fine-grained metrics for both non-RAG and RAG (Retrieval-Augmented Generation) systems. For factual questions, the metrics were Correct, Not Attempted, Incorrect, Correct Given Attempted, and F1-Score. For open-ended questions, the metrics were METEOR, ROUGE-L, CIDEr, Perplexity, and BERTScore-F1. RAG systems were further evaluated using the RAGChecker framework, which assesses the retrieval and generation modules with metrics such as Claim Recall, Context Precision, Faithfulness, and Hallucination.
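The article does not spell out how the factual-question metrics are computed. The sketch below assumes a common convention for this set of metric names: each answer is graded as correct, incorrect, or not attempted; Correct Given Attempted is the share of correct answers among attempted ones; and F1-Score is the harmonic mean of Correct and Correct Given Attempted. The function name `factual_metrics` and the grade labels are hypothetical, and the paper's exact definitions may differ.

```python
from collections import Counter

# Hypothetical grade labels; the paper's actual labels are not given in the article.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def factual_metrics(grades):
    """Compute factual-QA metrics from a list of per-question grades.

    Assumed definitions (not confirmed by the source):
      correct                 = #correct / #total
      correct_given_attempted = #correct / (#correct + #incorrect)
      f1                      = harmonic mean of the two values above
    """
    if not grades:
        raise ValueError("grades must not be empty")
    unknown = set(grades) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    correct = counts["correct"] / total
    cga = counts["correct"] / attempted if attempted else 0.0
    denom = correct + cga
    f1 = 2 * correct * cga / denom if denom else 0.0

    return {
        "correct": correct,
        "not_attempted": counts["not_attempted"] / total,
        "incorrect": counts["incorrect"] / total,
        "correct_given_attempted": cga,
        "f1": f1,
    }
```

Under these assumptions, a model that answers 6 of 10 questions correctly, gets 2 wrong, and declines 2 would score 0.6 on Correct and 0.75 on Correct Given Attempted, so declining to answer hurts the first number but not the second.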

Key Findings and Future Directions

Experimental results revealed that GPT-4o consistently achieved superior overall performance across both factual and open-ended questions. All models showed notable improvements when the RAG strategy was applied, with an average F1-Score gain of 4.6% for factual questions and a significant reduction in perplexity for open-ended questions, indicating lower uncertainty and more coherent outputs. However, challenges remain, particularly with hallucination rates exceeding 20% in RAG systems, suggesting ongoing work is needed to control factuality.

The ChiMDQA dataset provides a robust foundation for future research and practical applications in Chinese QA, offering a challenging benchmark for advancing NLP methodologies. The researchers plan for future expansion, incorporating new high-value domains and leveraging a semi-automated pipeline with rigorous human verification to maintain quality and scale the dataset efficiently. You can read the full research paper for more details.
