A New Benchmark for Evaluating AI in Scholarly Question Answering

TLDR: RESEARCH QA is a new large-scale dataset for evaluating AI systems on scholarly question answering. It distills 21,400 queries and 160,000 evaluation criteria (rubric items) from survey articles across 75 research fields. Validated by Ph.D. experts, it exposes competency gaps in current LLM systems: even top systems struggle with citing papers and describing limitations, and performance varies considerably across research domains.

The rapid advancement of Large Language Models (LLMs) has opened new possibilities for navigating the vast and ever-growing landscape of scientific literature. These AI tools promise to help researchers and non-experts stay informed, but accurately evaluating their ability to generate long-form, nuanced answers to complex research questions has been a significant challenge. Traditional evaluation methods often rely on expert annotators, which is costly and limits the scope of evaluation to a few specialized fields.

A new resource, called RESEARCH QA, aims to address this challenge with a scalable, multi-field approach to evaluating scholarly question answering systems. Developed by researchers at the University of Pennsylvania, it distills knowledge from academic survey articles across 75 research fields.

What is RESEARCH QA?

RESEARCH QA is a dataset of 21,400 queries and 160,000 detailed evaluation criteria, known as rubric items. The queries and rubric items are derived from survey articles, which themselves synthesize knowledge in specific research areas. Each rubric item gives a query-specific criterion for evaluating an answer, such as whether it cites relevant papers, offers clear explanations, or describes limitations.

How was it built?

RESEARCH QA was built with a multi-stage pipeline. First, top publishing venues were identified across research fields. Then, survey articles were extracted from these venues. Finally, LLMs were used to generate queries and their corresponding rubric items from sections of these surveys. Filtering was applied at each stage to ensure the quality and relevance of the data. The dataset spans seven major domains: Health Sciences & Medicine, Life & Earth Sciences, Engineering & Computer Science, Physical Sciences, Social Sciences, Humanities, and Economics.

Expert Validation and Automated Evaluation

To ensure the quality of RESEARCH QA, 31 Ph.D. annotators from eight fields validated the queries and rubrics. Their assessments showed that 96% of the queries reflected the information needs of Ph.D. students, and 87% of the rubric items were deemed essential for a comprehensive answer. Furthermore, the researchers developed an automatic pairwise judge using these rubrics, which achieved a 74% agreement rate with expert judgments, demonstrating its effectiveness as a proxy for human evaluation.
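To make the agreement figure concrete, here is a minimal sketch of how an agreement rate between an automatic pairwise judge and expert annotators could be computed. The function name, the "A"/"B" labels, and the sample preferences are illustrative assumptions, not the paper's actual code or data.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where judge and expert prefer the same answer."""
    if len(judge) != len(expert):
        raise ValueError("preference lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    matches = sum(j == e for j, e in zip(judge, expert))
    return matches / len(judge)


# Made-up preferences over four answer pairs ("A" or "B" marks the preferred answer).
judge_prefs = ["A", "B", "A", "A"]
expert_prefs = ["A", "B", "B", "A"]

rate = agreement_rate(judge_prefs, expert_prefs)
print(f"{rate:.0%}")  # prints 75%
```

Under this reading, a 74% agreement rate would mean the judge and the experts preferred the same answer in roughly three out of four comparisons.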

Evaluating LLM Systems

Using RESEARCH QA, the researchers evaluated 18 LLM systems, including parametric, retrieval-augmented, and agentic systems. The results revealed significant competency gaps across the board. No parametric or retrieval-augmented system covered more than 70% of the rubric items. Even the highest-ranking agentic system, Perplexity’s deep research, achieved only 75% coverage, indicating substantial room for improvement in current AI capabilities for scholarly tasks.

A detailed error analysis highlighted where systems struggled most. The highest-ranking system fully addressed fewer than 11% of citation-related rubric items, 48% of limitation items, and 49% of comparison items. This suggests that while LLMs can generate information, they often fall short in critical academic skills like proper attribution, critical assessment, and comparative analysis.
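The coverage and per-category figures above can be illustrated with a small sketch. It assumes each rubric item receives one of three judgments ("not", "partial", "full") and that coverage is the mean score over items. The article does not specify the actual scale or formula, so the names `JudgedItem`, `SCORES`, `coverage`, and `fully_addressed_by_category` are hypothetical, and the data is made up.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed three-level judgment per rubric item; the real scale may differ.
SCORES = {"not": 0.0, "partial": 0.5, "full": 1.0}


@dataclass
class JudgedItem:
    category: str  # e.g. "citation", "limitation", "comparison"
    judgment: str  # one of the SCORES keys


def coverage(items):
    """Mean score over rubric items (an assumed definition of coverage)."""
    return sum(SCORES[item.judgment] for item in items) / len(items)


def fully_addressed_by_category(items):
    """Share of rubric items in each category judged 'full'."""
    totals = defaultdict(int)
    full = defaultdict(int)
    for item in items:
        totals[item.category] += 1
        full[item.category] += item.judgment == "full"
    return {category: full[category] / totals[category] for category in totals}


# Made-up judgments for one answer.
items = [
    JudgedItem("citation", "not"),
    JudgedItem("citation", "partial"),
    JudgedItem("limitation", "full"),
    JudgedItem("comparison", "full"),
]

print(coverage(items))                     # 0.625
print(fully_addressed_by_category(items))  # citation 0.0, limitation 1.0, comparison 1.0
```

Separating the overall score from the per-category breakdown shows how a system can reach a moderate overall coverage while still fully addressing very few citation items.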

The evaluation also underscored the importance of multi-field assessments. Performance varied significantly across domains, with systems generally achieving higher coverage in Physical Sciences than in Health Sciences and Humanities. This finding suggests that benchmarks focused solely on engineering or computer science may not accurately reflect an AI system’s overall research capabilities.

Looking Ahead

RESEARCH QA provides a tool for benchmarking and improving LLM systems designed for research synthesis. While current systems show promise, the evaluations demonstrate significant headroom for development, particularly in areas requiring nuanced academic skills. The resource is openly available to facilitate more comprehensive and diverse evaluations across the research community. For more details, refer to the full research paper.
