TLDR: CareMedEval is a new dataset of 534 questions from French medical exams, based on 37 scientific articles, designed to evaluate large language models (LLMs) on critical appraisal and reasoning in biomedicine. Benchmarking revealed that LLMs struggle significantly, especially with questions on study limitations and statistics, failing to reach human passing scores. Full article access and generating intermediate reasoning steps notably improve performance, while specialized biomedical models do not consistently outperform generalist ones. The dataset highlights current LLM limitations in complex medical reasoning.
The ability to critically appraise scientific literature is a cornerstone of medical practice, allowing professionals to stay informed and make evidence-based decisions. However, this complex skill, which involves understanding methodology, statistics, and potential biases, is a significant challenge even for trained physicians. With the rise of large language models (LLMs), there’s growing interest in their potential to assist in this area, but their reliability, especially for critical reasoning in specialized domains like biomedicine, remains a key concern.
Introducing CareMedEval: A New Benchmark for Medical Critical Appraisal
To address the need for robust evaluation, researchers have introduced CareMedEval, an innovative dataset specifically designed to test LLMs on biomedical critical appraisal and reasoning tasks. Unlike many existing benchmarks that focus on factual comprehension or general domain knowledge, CareMedEval explicitly evaluates a model’s ability to critically read and reason based on scientific papers.
The dataset is derived from authentic exams taken by French medical students, which grounds its relevance and difficulty in real assessment practice. It comprises 534 multiple-choice questions based on 37 genuine scientific articles. These articles cover a broad spectrum of medical specialties and study designs, including observational studies and randomized clinical trials, reflecting the real-world challenges medical students face.
What CareMedEval Evaluates
Each question in the CareMedEval dataset is meticulously annotated with labels that correspond to specific cognitive and analytical skills required for critical appraisal. These include: identifying study design, understanding and interpreting statistical results, knowledge of scientific methodology, critically reviewing biases and study limitations, and assessing clinical relevance and applicability.
This detailed labeling allows researchers to pinpoint exactly where LLMs excel or struggle, providing valuable insights into their reasoning capabilities.
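A per-skill breakdown of this kind can be sketched in a few lines of Python. This is only an illustration, not the authors' evaluation code: the record fields (`labels`, `gold`, `predicted`), the label names, and the choice of exact-match scoring (a question counts as correct only if the predicted set of options equals the gold set) are all assumptions made for the example.

```python
from collections import defaultdict

def exact_match_by_label(questions):
    """Return {label: exact-match rate} for a list of question records.

    Each record is a dict with hypothetical keys:
      "labels"    - skill labels attached to the question
      "gold"      - letters of the correct options
      "predicted" - letters the model selected
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in questions:
        # Exact match: the predicted option set must equal the gold set.
        correct = set(q["predicted"]) == set(q["gold"])
        for label in q["labels"]:
            totals[label] += 1
            hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}

# Toy data with made-up label names, for illustration only.
questions = [
    {"labels": ["statistics"], "gold": ["A", "C"], "predicted": ["A", "C"]},
    {"labels": ["statistics", "limitations"], "gold": ["B"], "predicted": ["B", "D"]},
    {"labels": ["design"], "gold": ["E"], "predicted": ["E"]},
]
print(exact_match_by_label(questions))
```

Because a question can carry several labels, it contributes to every label it is tagged with, so the per-label rates do not need to average to the overall rate.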
Benchmarking LLMs: Key Findings
The researchers benchmarked various state-of-the-art LLMs, including generalist and biomedical-specialized models, under different context conditions. The results highlight the inherent difficulty of the task:
Overall Performance: Even top models like GPT-4.1 struggled to exceed an Exact Match Rate of 0.5, and none achieved the passing score typically required for human medical students in the original exams (70% LCA score). This indicates a significant gap between current LLM capabilities and human-level critical appraisal.
Challenging Areas: Models found questions related to “study limitations” and “statistical analysis” particularly difficult. This suggests that LLMs struggle with implicit critical reasoning and quantitative interpretation, especially when statistical information is presented in figures not included in the text.
Generalist vs. Specialized Models: Surprisingly, biomedical-specialized LLMs did not consistently outperform their generalist counterparts. In many cases, generalist models performed comparably or even better, suggesting that domain-specific pre-training alone doesn’t guarantee superior critical reasoning in this context.
Importance of Context: Providing the full scientific article as context significantly improved model performance compared to using only abstracts or no context at all. This underscores the necessity of complete information for accurate critical appraisal.
Impact of Reasoning Tokens: A crucial finding was that generating intermediate reasoning tokens considerably improved results across all metrics. This suggests that explicit reasoning steps help LLMs produce more accurate and contextually grounded answers, and that CareMedEval is sensitive to differences in a model’s reasoning ability.
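To make the context conditions and the reasoning setting concrete, here is a minimal sketch of how a prompt might be assembled for each one. The function name, parameters, and prompt wording are hypothetical, and asking for step-by-step reasoning in the prompt is only one possible way to elicit intermediate reasoning; the paper's actual setup may differ, for instance by using models with a built-in reasoning mode.

```python
def build_prompt(question, options, article=None, abstract=None,
                 context="full", reasoning=False):
    """Assemble an evaluation prompt (hypothetical format).

    context: "full" (whole article), "abstract", or "none".
    reasoning: if True, ask for intermediate reasoning before the answer.
    """
    if context == "full":
        source = article
    elif context == "abstract":
        source = abstract
    elif context == "none":
        source = None
    else:
        raise ValueError(f"unknown context condition: {context!r}")

    parts = []
    if source:
        parts.append("Article:\n" + source)
    parts.append("Question: " + question)
    # One option per line, e.g. "A. Randomized controlled trial"
    parts.append("\n".join(f"{letter}. {text}" for letter, text in options.items()))
    if reasoning:
        parts.append("Explain your reasoning step by step, "
                     "then give the letter(s) of the correct option(s).")
    else:
        parts.append("Answer with the letter(s) of the correct option(s) only.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "What is the study design?",
    {"A": "Randomized controlled trial", "B": "Cohort study"},
    article="(full text of the article)",
    abstract="(abstract only)",
    context="abstract",
    reasoning=True,
)
print(prompt)
```

Keeping the question and options identical across conditions means any difference in scores can be attributed to the amount of context or to the reasoning instruction, not to changes in wording.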
Looking Ahead
CareMedEval serves as a challenging benchmark for grounded reasoning in the biomedical field, exposing current limitations of LLMs and paving the way for future developments. While current models fall short of human performance in critical appraisal, the insights gained from this dataset can guide the creation of more reliable automated support tools for medical professionals. Future work aims to extend the benchmark to vision LLMs to incorporate figures and tables, and to develop frameworks for evaluating the quality of the reasoning traces produced by models.
For more in-depth information, you can read the full research paper here: CareMedEval Dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field.


