TLDR: PaperAsk is a new benchmark that evaluates the reliability of LLMs (GPT-4o, GPT-5, Gemini 2.5 Flash) on academic tasks such as finding papers, extracting content, and verifying claims. It reveals widespread failures, driven by models prioritizing semantically relevant text over explicit instructions and by web search mechanisms polluting the working context. The study suggests simplifying queries, selectively disabling search, and using lightweight reliability classifiers, and it emphasizes that deployment architecture significantly affects AI's dependability in scholarly work.
Large Language Models (LLMs) are increasingly becoming indispensable tools for researchers, assisting with tasks ranging from finding relevant papers to extracting specific information. However, a new benchmark called PaperAsk reveals significant reliability issues with these AI assistants in scholarly contexts.
Introduced by researchers from Deakin University and Fudan University, PaperAsk systematically evaluates LLMs across four critical research tasks: citation retrieval, content extraction, paper discovery, and claim verification. The study specifically tested leading commercial LLMs – GPT-4o, GPT-5, and Gemini 2.5 Flash – under realistic conditions, using their web interfaces where search operations are often opaque to the user.
Widespread Failures Across Core Tasks
The findings from PaperAsk paint a concerning picture of LLM reliability. For instance, citation retrieval, a fundamental task, failed in 48–98% of queries when multiple references were requested simultaneously. Content extraction, which involves pulling specific details from papers, showed failure rates of 72–91%, with models frequently returning abstract text instead of content from the introduction section as instructed.
Paper discovery, where LLMs identify relevant literature based on topical descriptions, yielded F1 scores below 0.32, meaning over 60% of relevant papers were missed. Claim verification, assessing whether a paper supports a specific claim, also showed significant failure rates, especially with multiple claims per query.
Understanding the Root Causes of Unreliability
Human analysis attributed these failures to two primary issues. Firstly, LLMs tend to prioritize semantically relevant text over explicit task instructions. For example, when asked for the last sentence of an introduction, they often provide abstract content because it’s semantically similar and more readily available in shallow search snippets. Secondly, the uncontrolled expansion of retrieved context by web search mechanisms often pollutes the LLM’s working memory with conflicting information.
The study observed distinct failure behaviors among the evaluated LLMs. ChatGPT often adopted a conservative approach, refusing to answer rather than risking errors, especially in multi-reference queries. In contrast, Gemini frequently produced fluent but fabricated answers, prioritizing the appearance of completeness over accuracy.
Deployment Architecture: A Major Hurdle
A crucial insight from PaperAsk is the impact of deployment architecture on reliability. When LLMs were provided with paper URLs through web interfaces, claim verification failed 12–18% of the time. However, when the same models were accessed via API calls with full-text availability, the failure rate dropped dramatically to 1–3%. This significant gap suggests that the web services of these LLMs prioritize responsiveness through shallow, broad retrieval, often at the cost of accurate and focused content access.
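To make the contrast concrete, here is a minimal sketch of the API-side setup, assuming the `openai` Python package with an OPENAI_API_KEY in the environment and a `paper_text` string you have fetched yourself; the model name and prompt wording are illustrative placeholders, not the paper's exact protocol:

```python
# Minimal sketch: claim verification via direct API access with the full
# text pinned in context, rather than a web interface's opaque retrieval.
# Assumes the `openai` package; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

def verify_claim(paper_text: str, claim: str) -> str:
    """Ask the model whether the supplied full text supports a claim."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works here
        messages=[
            {"role": "system",
             "content": "Answer 'supported' or 'not supported', citing the "
                        "relevant passage. Use only the provided paper text."},
            {"role": "user",
             "content": f"Paper text:\n{paper_text}\n\nClaim: {claim}"},
        ],
    )
    return response.choices[0].message.content
```

Because the full text is pinned in the prompt, the model cannot fall back on shallow search snippets, which is consistent with the 1–3% failure rate the study reports for API access with full-text availability.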
Ablation studies further supported this, showing that while stronger internal reasoning capabilities improve performance, the constraints of web search mechanisms still limit accuracy. Controlled environments, though improving performance, also led to significantly longer response times and higher computational costs.
Pathways to Improved Reliability
The researchers propose several lightweight methods to enhance LLM reliability without requiring complex infrastructure changes:
- Simplify Queries: Decompose multi-question queries into atomic requests (e.g., one paper per citation-retrieval query, one question per paper for content extraction) to reduce context pollution; see the first sketch after this list.
- Disable Search for Reasoning Models: Paradoxically, turning off ChatGPT’s built-in search function can improve accuracy, as models then engage in more thorough, albeit slower, reasoning.
- Deploy Lightweight Reliability Classifiers: Training a classifier on annotated explanations from LLM responses can detect unreliable outputs with high accuracy (96%), serving as a practical filtering mechanism; see the second sketch below.
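As an illustration of query decomposition, here is a minimal sketch assuming a hypothetical `ask_llm` helper that wraps whichever model you use; the decomposition pattern, not the helper, is what the study recommends:

```python
# Minimal sketch: decompose a multi-reference query into atomic requests
# so each call carries only one paper's worth of context.
# `ask_llm` is a hypothetical wrapper around your model of choice.
from typing import Callable

def retrieve_citations(titles: list[str],
                       ask_llm: Callable[[str], str]) -> dict[str, str]:
    """Issue one citation-retrieval query per paper instead of one big query."""
    results = {}
    for title in titles:
        prompt = f"Provide the full bibliographic citation for the paper titled: {title}"
        results[title] = ask_llm(prompt)  # one atomic request per paper
    return results
```

Each call now carries a single paper's context, avoiding the cross-reference pollution the benchmark observed in multi-reference queries.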
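For the classifier, here is a minimal sketch of the idea using scikit-learn, assuming you have explanation strings annotated as reliable or unreliable; a simple bag-of-words model is enough to show the shape of the approach, and the 96% figure in the study comes from the authors' own annotated data and classifier, not this toy pipeline:

```python
# Minimal sketch: a lightweight reliability classifier over model explanations.
# Assumes `explanations` (strings drawn from LLM responses) and binary
# `labels` (1 = reliable, 0 = unreliable) annotated by hand.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_reliability_filter(explanations: list[str], labels: list[int]):
    """Fit a TF-IDF + logistic regression filter for unreliable outputs."""
    X_train, X_test, y_train, y_test = train_test_split(
        explanations, labels, test_size=0.2, random_state=0)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
    return clf
```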
PaperAsk provides a robust framework for evaluating and advancing the reliability of LLM-based scholarly assistance systems. It highlights that while LLMs hold immense potential, their current deployment architectures and inherent tendencies pose significant challenges to dependable use in academic research. For more details, refer to the full research paper.


