
AirQA: A New Benchmark and Data Synthesis Framework for AI Paper Question Answering

TLDR: AirQA is a new, comprehensive human-annotated dataset with 13,948 AI papers and 1,246 questions, designed to benchmark AI agents for scientific paper question answering across multi-task and multi-modal scenarios. It also introduces EXTRACTOR, an automated framework that synthesizes instruction data and interaction trajectories to improve LLM-based agents’ multi-turn tool-use capabilities, enabling smaller models to achieve performance comparable to larger ones.

The explosion of academic papers in artificial intelligence (AI) has made it increasingly challenging for researchers to extract key information efficiently. Navigating extensive documents to pinpoint specific details is tedious and time-consuming. While large language models (LLMs) offer a promising avenue for automating question-answering (QA) workflows over scientific papers, the field has lacked comprehensive, realistic benchmarks for evaluating their capabilities. The development of interactive AI agents for this specialized task is further hampered by a shortage of high-quality interaction data.

To tackle these challenges, a team of researchers has introduced AirQA, a meticulously human-annotated dataset designed for comprehensive paper QA in the AI domain. The dataset comprises 13,948 papers and 1,246 questions, and supports multi-task, multi-modal, instance-level evaluation. AirQA distinguishes itself by covering a wide array of question types and by integrating the elements commonly found in scientific papers: plain text, tables, images, mathematical formulas, and metadata. This moves beyond previous datasets, which often focused on narrow question types or simplified paper formats, and aims to mirror the complexity of real-world research scenarios.

The AirQA dataset categorizes questions into four primary types: single-document detail questions, which delve into specific information within one paper; multiple-document analysis questions, designed to prompt comparisons or connections across several papers; paper retrieval questions, focused on identifying papers from a particular conference and year based on a description; and comprehensive QA questions, which combine aspects of retrieval and detailed answering. The evaluation methodology for AirQA is equally innovative, employing 19 parameterized Python functions for instance-level assessment. These functions facilitate both objective and subjective evaluations, prioritizing factual accuracy over mere semantic similarity.
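The article does not reproduce the 19 scoring functions themselves, but a minimal sketch, assuming a numeric-match check with a configurable tolerance, illustrates what a parameterized, instance-level evaluation function of this kind can look like; the function name, signature, and tolerance parameter below are hypothetical, not the benchmark's actual API.

```python
# Hypothetical sketch of one parameterized, instance-level evaluation
# function in the style the paper describes; the name, signature, and
# tolerance parameter are illustrative, not AirQA's actual API.
import re

def eval_numeric_match(prediction: str, reference: float,
                       tolerance: float = 0.01) -> bool:
    """Score a free-form answer against a numeric reference value."""
    # Pull the first number out of the model's answer text.
    match = re.search(r"-?\d+(?:\.\d+)?", prediction.replace(",", ""))
    if match is None:
        return False
    value = float(match.group())
    # Factual accuracy via relative error, not semantic similarity.
    return abs(value - reference) <= tolerance * max(abs(reference), 1e-9)

# Each benchmark instance binds its own parameters:
print(eval_numeric_match("The model reaches 44.14% accuracy.", 44.14))  # True
```

Binding parameters per instance is what lets the same small set of functions cover both objective checks (exact values, set membership) and more subjective, LLM-assisted judgments.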

Introducing EXTRACTOR: An Automated Framework

Beyond the dataset, the researchers have also developed EXTRACTOR, an automated framework for synthesizing instruction data. This framework is crucial for addressing the scarcity of high-quality interaction trajectories needed to train interactive QA agents. EXTRACTOR operates through three LLM-based agents: an explorer, a tracker, and an actor. The explorer generates natural language question-answer pairs from various contexts within papers. The tracker then refines these QA pairs into properly formatted examples suitable for evaluation. Finally, the actor interacts with a simulated environment to collect multi-turn tool-use trajectories, effectively generating training data without human intervention.
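The article describes the three agents only at a conceptual level. The following sketch shows one plausible way to wire them together, assuming a generic call_llm backend, simplified prompts, and a simulated-environment API; all of these are placeholders rather than the authors' code.

```python
# Plausible skeleton of the three-stage EXTRACTOR pipeline described
# above; `call_llm`, the prompts, and the environment API are
# assumptions, not the authors' implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend."""
    raise NotImplementedError

def explorer(paper_context: str) -> str:
    # Stage 1: draft a natural-language QA pair from paper context.
    return call_llm(
        "Read this excerpt and write one question-answer pair:\n" + paper_context
    )

def tracker(raw_qa: str) -> str:
    # Stage 2: refine the raw pair into a properly formatted,
    # evaluable example.
    return call_llm("Reformat this QA pair into the benchmark schema:\n" + raw_qa)

def actor(example: str, env, max_turns: int = 10) -> list[dict]:
    # Stage 3: interact with a simulated environment to record a
    # multi-turn tool-use trajectory, with no human in the loop.
    trajectory = []
    observation = env.reset(example)
    for _ in range(max_turns):
        action = call_llm(
            f"Task: {example}\nObservation: {observation}\nNext tool call:"
        )
        observation, done = env.step(action)
        trajectory.append({"action": action, "observation": observation})
        if done:
            break
    return trajectory
```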

Initial evaluations of a diverse range of open-source and proprietary LLMs on AirQA revealed that most models currently underperform, with the top model achieving an overall accuracy of only 44.14%. This underscores the difficulty and quality of the benchmark, and suggests that QA workflows for scientific papers are still at an early stage. Experiments with the EXTRACTOR framework, however, demonstrated significant potential: fine-tuning smaller models, such as the 7B model from the Qwen2.5 family, on just 4,000 synthetic interaction trajectories brought them to performance levels comparable to much larger models that had received no such fine-tuning. This highlights EXTRACTOR's effectiveness in enhancing the multi-turn tool-use capabilities of LLMs.
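The article does not specify the schema of these synthetic trajectories. The record below only illustrates the general shape a multi-turn tool-use training example might take; every field name, the sample question, and the search_papers tool are hypothetical.

```python
# Illustrative shape of one synthesized training record; the exact
# schema used to fine-tune the Qwen2.5 7B model is not given in the
# article, and the tool name here is hypothetical.
trajectory_record = {
    "question": "Which ACL 2024 paper proposes benchmark X?",
    "turns": [
        {"role": "assistant",
         "tool_call": "search_papers(query='benchmark X ACL 2024')"},
        {"role": "tool", "content": "[{'id': '...', 'title': '...'}]"},
        {"role": "assistant", "content": "Final answer: <paper title>"},
    ],
}
```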

The introduction of the AirQA dataset and the EXTRACTOR framework marks a substantial advancement in AI research. AirQA provides a much-needed comprehensive benchmark for assessing LLM-based agents in scientific question answering, while EXTRACTOR offers a practical and scalable method for improving these agents’ abilities without extensive manual annotation. This work sets the stage for the development of more robust and practical LLM agents that can empower researchers to efficiently navigate and extract knowledge from the ever-expanding volume of academic literature. For a deeper dive into this research, you can access the full paper here.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
