TLDR: PaperAsk is a new benchmark that evaluates the reliability of LLMs (GPT-4o, GPT-5, Gemini 2.5 Flash) on academic tasks such as finding papers, extracting content, and verifying claims. It reveals widespread failures, driven by models prioritizing semantically relevant text over explicit instructions and by web search mechanisms polluting the working context. The study suggests simplifying queries, selectively disabling search, and using lightweight reliability classifiers, and it emphasizes that deployment architecture significantly affects AI's dependability in scholarly work.
Large Language Models (LLMs) are increasingly becoming indispensable tools for researchers, assisting with tasks ranging from finding relevant papers to extracting specific information. However, a new benchmark called PaperAsk reveals significant reliability issues with these AI assistants in scholarly contexts.
Introduced by researchers from Deakin University and Fudan University, PaperAsk systematically evaluates LLMs across four critical research tasks: citation retrieval, content extraction, paper discovery, and claim verification. The study specifically tested leading commercial LLMs – GPT-4o, GPT-5, and Gemini 2.5 Flash – under realistic conditions, using their web interfaces where search operations are often opaque to the user.
Widespread Failures Across Core Tasks
The findings from PaperAsk paint a concerning picture of LLM reliability. For instance, citation retrieval, a fundamental task, failed in 48–98% of queries when multiple references were requested simultaneously. Content extraction, which involves pulling specific details from papers, showed failure rates of 72–91%, with models frequently returning abstract text instead of content from the introduction section as instructed.
Paper discovery, where LLMs identify relevant literature based on topical descriptions, yielded F1 scores below 0.32, meaning over 60% of relevant papers were missed. Claim verification, assessing whether a paper supports a specific claim, also showed significant failure rates, especially with multiple claims per query.
Understanding the Root Causes of Unreliability
Human analysis attributed these failures to two primary issues. Firstly, LLMs tend to prioritize semantically relevant text over explicit task instructions. For example, when asked for the last sentence of an introduction, they often provide abstract content because it’s semantically similar and more readily available in shallow search snippets. Secondly, the uncontrolled expansion of retrieved context by web search mechanisms often pollutes the LLM’s working memory with conflicting information.
The study observed distinct failure behaviors among the evaluated LLMs. ChatGPT often adopted a conservative approach, refusing to answer rather than risking errors, especially in multi-reference queries. In contrast, Gemini frequently produced fluent but fabricated answers, prioritizing the appearance of completeness over accuracy.
Deployment Architecture: A Major Hurdle
A crucial insight from PaperAsk is the impact of deployment architecture on reliability. When LLMs were provided with paper URLs through web interfaces, claim verification failed 12–18% of the time. However, when the same models were accessed via API calls with full-text availability, the failure rate dropped dramatically to 1–3%. This significant gap suggests that the web services of these LLMs prioritize responsiveness through shallow, broad retrieval, often at the cost of accurate and focused content access.
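To make the contrast concrete, here is a minimal sketch of the API-side setup, assuming the `openai` Python package with an OPENAI_API_KEY in the environment and a `paper_text` string you have fetched yourself; the model name and prompt wording are illustrative placeholders, not the paper's exact protocol:

```python
# Minimal sketch: claim verification via direct API access with the full
# text pinned in context, rather than a web interface's opaque retrieval.
# Assumes the `openai` package; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

def verify_claim(paper_text: str, claim: str) -> str:
    """Ask the model whether the supplied full text supports a claim."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model works here
        messages=[
            {"role": "system",
             "content": "Answer 'supported' or 'not supported', citing the "
                        "relevant passage. Use only the provided paper text."},
            {"role": "user",
             "content": f"Paper text:\n{paper_text}\n\nClaim: {claim}"},
        ],
    )
    return response.choices[0].message.content
```

Because the full text is pinned in the prompt, the model cannot fall back on shallow search snippets, which is consistent with the 1–3% failure rate the study reports for API access with full-text availability.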
Ablation studies further supported this, showing that while stronger internal reasoning capabilities improve performance, the constraints of web search mechanisms still limit accuracy. Controlled environments, though improving performance, also led to significantly longer response times and higher computational costs.
Pathways to Improved Reliability
The researchers propose several lightweight methods to enhance LLM reliability without requiring complex infrastructure changes:
- Simplify Queries: Decompose multi-question queries into atomic requests (e.g., one paper per citation-retrieval query, one question per paper for content extraction) to reduce context pollution; see the first sketch after this list.
- Disable Search for Reasoning Models: Paradoxically, turning off ChatGPT’s built-in search function can improve accuracy, as models then engage in more thorough, albeit slower, reasoning.
- Deploy Lightweight Reliability Classifiers: Training a classifier on annotated explanations from LLM responses can detect unreliable outputs with high accuracy (96%), serving as a practical filtering mechanism; see the second sketch below.
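As an illustration of query decomposition, here is a minimal sketch assuming a hypothetical `ask_llm` helper that wraps whichever model you use; the decomposition pattern, not the helper, is what the study recommends:

```python
# Minimal sketch: decompose a multi-reference query into atomic requests
# so each call carries only one paper's worth of context.
# `ask_llm` is a hypothetical wrapper around your model of choice.
from typing import Callable

def retrieve_citations(titles: list[str],
                       ask_llm: Callable[[str], str]) -> dict[str, str]:
    """Issue one citation-retrieval query per paper instead of one big query."""
    results = {}
    for title in titles:
        prompt = f"Provide the full bibliographic citation for the paper titled: {title}"
        results[title] = ask_llm(prompt)  # one atomic request per paper
    return results
```

Each call now carries a single paper's context, avoiding the cross-reference pollution the benchmark observed in multi-reference queries.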
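For the classifier, here is a minimal sketch of the idea using scikit-learn, assuming you have explanation strings annotated as reliable or unreliable; a simple bag-of-words model is enough to show the shape of the approach, and the 96% figure in the study comes from the authors' own annotated data and classifier, not this toy pipeline:

```python
# Minimal sketch: a lightweight reliability classifier over model explanations.
# Assumes `explanations` (strings drawn from LLM responses) and binary
# `labels` (1 = reliable, 0 = unreliable) annotated by hand.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_reliability_filter(explanations: list[str], labels: list[int]):
    """Fit a TF-IDF + logistic regression filter for unreliable outputs."""
    X_train, X_test, y_train, y_test = train_test_split(
        explanations, labels, test_size=0.2, random_state=0)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
    return clf
```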
PaperAsk provides a robust framework for evaluating and advancing the reliability of LLM-based scholarly assistance systems. It highlights that while LLMs hold immense potential, their current deployment architectures and inherent tendencies pose significant challenges to dependable use in academic research. For more details, refer to the full research paper.


