
AirQA: A New Benchmark and Data Synthesis Framework for AI Paper Question Answering

TLDR: AirQA is a new, comprehensive human-annotated dataset with 13,948 AI papers and 1,246 questions, designed to benchmark AI agents for scientific paper question answering across multi-task and multi-modal scenarios. It also introduces EXTRACTOR, an automated framework that synthesizes instruction data and interaction trajectories to improve LLM-based agents’ multi-turn tool-use capabilities, enabling smaller models to achieve performance comparable to larger ones.

The explosion of academic papers in artificial intelligence (AI) has made it increasingly challenging for researchers to extract key information efficiently. Navigating extensive documents to pinpoint specific details is tedious and time-consuming. While large language models (LLMs) offer a promising avenue for automating question-answering (QA) workflows over scientific papers, the field has lacked comprehensive, realistic benchmarks for evaluating their capabilities. The development of interactive AI agents for this specialized task is further hampered by a shortage of high-quality interaction data.

To tackle these challenges, a team of researchers has introduced AirQA, a meticulously human-annotated dataset designed for comprehensive paper QA in the AI domain. The dataset comprises 13,948 papers and 1,246 questions, and supports multi-task, multi-modal, instance-level evaluation. AirQA distinguishes itself by covering a wide array of question types and by integrating the elements commonly found in scientific papers: plain text, tables, images, mathematical formulas, and metadata. This moves beyond previous datasets, which often focused on narrow question types or simplified paper formats, and aims to mirror the complexity of real-world research scenarios.

The AirQA dataset categorizes questions into four primary types: single-document detail questions, which delve into specific information within one paper; multiple-document analysis questions, designed to prompt comparisons or connections across several papers; paper retrieval questions, focused on identifying papers from a particular conference and year based on a description; and comprehensive QA questions, which combine aspects of retrieval and detailed answering. The evaluation methodology for AirQA is equally innovative, employing 19 parameterized Python functions for instance-level assessment. These functions facilitate both objective and subjective evaluations, prioritizing factual accuracy over mere semantic similarity.
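The article does not reproduce the 19 scoring functions themselves, but a minimal sketch, assuming a numeric-match check with a configurable tolerance, illustrates what a parameterized, instance-level evaluation function of this kind can look like; the function name, signature, and tolerance parameter below are hypothetical, not the benchmark's actual API.

```python
# Hypothetical sketch of one parameterized, instance-level evaluation
# function in the style the paper describes; the name, signature, and
# tolerance parameter are illustrative, not AirQA's actual API.
import re

def eval_numeric_match(prediction: str, reference: float,
                       tolerance: float = 0.01) -> bool:
    """Score a free-form answer against a numeric reference value."""
    # Pull the first number out of the model's answer text.
    match = re.search(r"-?\d+(?:\.\d+)?", prediction.replace(",", ""))
    if match is None:
        return False
    value = float(match.group())
    # Factual accuracy via relative error, not semantic similarity.
    return abs(value - reference) <= tolerance * max(abs(reference), 1e-9)

# Each benchmark instance binds its own parameters:
print(eval_numeric_match("The model reaches 44.14% accuracy.", 44.14))  # True
```

Binding parameters per instance is what lets the same small set of functions cover both objective checks (exact values, set membership) and more subjective, LLM-assisted judgments.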

Introducing EXTRACTOR: An Automated Framework

Beyond the dataset, the researchers have also developed EXTRACTOR, an automated framework for synthesizing instruction data. This framework is crucial for addressing the scarcity of high-quality interaction trajectories needed to train interactive QA agents. EXTRACTOR operates through three LLM-based agents: an explorer, a tracker, and an actor. The explorer generates natural language question-answer pairs from various contexts within papers. The tracker then refines these QA pairs into properly formatted examples suitable for evaluation. Finally, the actor interacts with a simulated environment to collect multi-turn tool-use trajectories, effectively generating training data without human intervention.
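The article describes the three agents only at a conceptual level. The following sketch shows one plausible way to wire them together, assuming a generic call_llm backend, simplified prompts, and a simulated-environment API; all of these are placeholders rather than the authors' code.

```python
# Plausible skeleton of the three-stage EXTRACTOR pipeline described
# above; `call_llm`, the prompts, and the environment API are
# assumptions, not the authors' implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend."""
    raise NotImplementedError

def explorer(paper_context: str) -> str:
    # Stage 1: draft a natural-language QA pair from paper context.
    return call_llm(
        "Read this excerpt and write one question-answer pair:\n" + paper_context
    )

def tracker(raw_qa: str) -> str:
    # Stage 2: refine the raw pair into a properly formatted,
    # evaluable example.
    return call_llm("Reformat this QA pair into the benchmark schema:\n" + raw_qa)

def actor(example: str, env, max_turns: int = 10) -> list[dict]:
    # Stage 3: interact with a simulated environment to record a
    # multi-turn tool-use trajectory, with no human in the loop.
    trajectory = []
    observation = env.reset(example)
    for _ in range(max_turns):
        action = call_llm(
            f"Task: {example}\nObservation: {observation}\nNext tool call:"
        )
        observation, done = env.step(action)
        trajectory.append({"action": action, "observation": observation})
        if done:
            break
    return trajectory
```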

Initial evaluations of a diverse range of open-source and proprietary LLMs on AirQA revealed that most models currently underperform, with the top model achieving an overall accuracy of only 44.14%. This underscores the difficulty and quality of the benchmark, and suggests that QA workflows for scientific papers are still at an early stage. Experiments with the EXTRACTOR framework, however, demonstrated significant potential: fine-tuning smaller models, such as the 7B model from the Qwen2.5 family, on just 4,000 synthetic interaction trajectories brought them to performance levels comparable to much larger models that had received no such fine-tuning. This highlights EXTRACTOR's effectiveness in enhancing the multi-turn tool-use capabilities of LLMs.
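The article does not specify the schema of these synthetic trajectories. The record below only illustrates the general shape a multi-turn tool-use training example might take; every field name, the sample question, and the search_papers tool are hypothetical.

```python
# Illustrative shape of one synthesized training record; the exact
# schema used to fine-tune the Qwen2.5 7B model is not given in the
# article, and the tool name here is hypothetical.
trajectory_record = {
    "question": "Which ACL 2024 paper proposes benchmark X?",
    "turns": [
        {"role": "assistant",
         "tool_call": "search_papers(query='benchmark X ACL 2024')"},
        {"role": "tool", "content": "[{'id': '...', 'title': '...'}]"},
        {"role": "assistant", "content": "Final answer: <paper title>"},
    ],
}
```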

The introduction of the AirQA dataset and the EXTRACTOR framework marks a substantial advancement in AI research. AirQA provides a much-needed comprehensive benchmark for assessing LLM-based agents in scientific question answering, while EXTRACTOR offers a practical and scalable method for improving these agents’ abilities without extensive manual annotation. This work sets the stage for the development of more robust and practical LLM agents that can empower researchers to efficiently navigate and extract knowledge from the ever-expanding volume of academic literature. For a deeper dive into this research, you can access the full paper here.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
