
Enhancing Biomedical Question Answering with Fine-Tuned LLaMA 3: CaresAI’s MedHopQA Journey

TL;DR: This research paper details CaresAI’s approach to the BioCreative IX MedHopQA task, focusing on multi-hop biomedical question answering. The team fine-tuned LLaMA 3 8B using a curated biomedical dataset and explored different training strategies (combined, short-only, long-only answers). While their models demonstrated strong concept-level understanding (up to 0.8 accuracy), they struggled with Exact Match (EM) scores, particularly in the test phase (as low as 0.0), due to verbosity and formatting inconsistencies. A two-stage inference pipeline was introduced to extract precise short answers, improving EM to 0.49 after refinement. The study highlights the need for better output control and post-processing in biomedical LLM applications.

Large language models (LLMs) are rapidly transforming various fields, and their potential in critical domains like biomedical and healthcare is particularly significant. However, before these powerful AI tools can be widely deployed in real-world medical applications, their ability to accurately answer complex questions, especially those requiring multi-step reasoning, needs rigorous evaluation.

A recent study by researchers Reem Abdel-Salam, Mary Adewunmi, and Modinat A. Abayomi, titled “CaresAI at BioCreative IX Track 1 – LLM for Biomedical QA”, delves into this challenge. The paper presents their approach to the MedHopQA track of the BioCreative IX shared task, which specifically focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. This task is designed to push the boundaries of LLMs, requiring them to integrate information from multiple sources to answer complex questions, often about rare diseases.

CaresAI’s Approach to Multi-Hop QA

The CaresAI team adopted a supervised fine-tuning strategy, leveraging LLaMA 3 8B, a pre-trained large language model. To enhance its performance in the biomedical domain, they augmented the model with a carefully curated dataset of biomedical question-answer pairs. This dataset was compiled from various external sources, including well-known benchmarks like BioASQ, MedQuAD, and TREC, ensuring a diverse and relevant training base.

The researchers explored three distinct experimental setups to understand the impact of answer format on model performance:

  • Fine-tuning on a combined dataset of both short and long answers.
  • Fine-tuning exclusively on short answer data.
  • Fine-tuning exclusively on long answer data.

Each model was evaluated independently, with the expectation that the short-answer fine-tuned model would excel in precision, while the long-answer version would capture broader context.

Addressing Verbosity and Precision

One of the significant challenges encountered was the LLaMA 3 8B model’s tendency to produce overly verbose or imprecise answers, often restating the question or adding extraneous information. To mitigate this, the team introduced a two-stage inference pipeline. In the first stage, the model generates an initial response. In the second stage, a follow-up prompt explicitly instructs the model to extract the exact answer phrase or entity from the initial, longer response. This post-processing step was crucial for aligning the outputs with the strict short-answer format required by the evaluation metrics.
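The two-stage idea can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation: the `generate` function is a hypothetical stand-in for a call to the fine-tuned LLaMA 3 8B model (stubbed here with fixed strings), and the prompt wording is an assumption.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a fine-tuned LLaMA 3 8B call.

    Stubbed for illustration: returns a verbose answer for a question
    prompt, and a short span for an extraction prompt.
    """
    if prompt.startswith("From the response"):
        return "CFTR"
    return "The gene most commonly associated with cystic fibrosis is CFTR."

def two_stage_answer(question: str) -> str:
    # Stage 1: let the model answer freely; fine-tuned LLMs tend to
    # restate the question or add context here.
    draft = generate(f"Question: {question}\nAnswer:")
    # Stage 2: a follow-up prompt asks the model to extract only the
    # exact answer phrase or entity from its own longer response.
    extraction_prompt = (
        "From the response below, extract only the exact answer phrase "
        "or entity. Do not restate the question or add explanation.\n\n"
        f"Response: {draft}\nExact answer:"
    )
    return generate(extraction_prompt).strip()
```

In a real setup, `generate` would wrap the model's inference API, and the second stage could use a lower temperature to keep the extraction deterministic.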

Evaluation and Key Findings

The models were evaluated using two primary metrics: Exact Match (EM) and concept-level evaluation. EM measures if a prediction precisely matches the gold standard answer after normalization, emphasizing strict precision. Concept-level evaluation, on the other hand, assesses semantic equivalence, reflecting the model’s understanding of biomedical concepts regardless of surface-level differences.
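To make the distinction concrete, a common way to compute EM is shown below. This is a SQuAD-style normalization sketch (lowercasing, stripping punctuation and articles, collapsing whitespace); the task's official scoring script may normalize differently.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization (an assumption; the official
    MedHopQA script may differ): lowercase, drop punctuation and
    English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True only if prediction and gold agree after normalization."""
    return normalize(prediction) == normalize(gold)
```

Note that even aggressive normalization does not bridge surface variants like “Chr.2” versus “Chromosome 2”, which is exactly where concept-level evaluation diverges from EM.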

On the validation set, all three approaches achieved approximately 0.5 in EM score and around 0.8 in concept-level accuracy. This indicates a strong grasp of biomedical knowledge and semantic understanding. However, the models struggled with generating exact matches, often due to verbosity, paraphrasing, or formatting inconsistencies. In contrast, general-purpose models like LLaMA 8B and Qwen Instruct 7B, when used in a zero-shot setting, yielded near-zero EM scores, highlighting the necessity of domain-specific fine-tuning.

The testing phase, evaluated on 1,000 unseen examples, revealed a significant drop in performance. The combined approach yielded an EM score of 0.2, while the short-only and long-only approaches achieved an EM score of 0.0. This drop underscored the difficulty in generalizing to new data with precise answer formatting. The researchers found that a major reason for this was the model’s inability to consistently generate concise 1-2 phrase answers in the exact format specified by the task organizers (e.g., producing “2” or “Chr.2” instead of “Chromosome 2”). After addressing these issues through prompt refinement and lightweight post-processing in an unofficial test evaluation, the model’s EM score significantly improved to 0.49.
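Lightweight post-processing of this kind can be as simple as a rule that rewrites shorthand into the task's expected canonical form. The rule below is purely illustrative, assuming a convention of spelling out “Chromosome N” as in the paper's example; the authors' actual rules are not specified.

```python
import re

def canonicalize_chromosome(answer: str) -> str:
    """Map shorthand chromosome mentions ("2", "Chr.2", "chr 2") to a
    canonical "Chromosome N" form. The rules here are illustrative
    assumptions, not the paper's actual post-processing."""
    m = re.fullmatch(r"(?:chr\.?\s*)?(\d{1,2}|x|y)", answer.strip(),
                     re.IGNORECASE)
    if m:
        value = m.group(1)
        return f"Chromosome {value.upper() if value.isalpha() else value}"
    # Leave answers that are not chromosome shorthand untouched.
    return answer
```

A pipeline of such rules, applied after the extraction stage, is one plausible way the unofficial re-evaluation closed much of the gap between concept-level accuracy and EM.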

Future Directions

The study concludes that while LLaMA 3 8B, when fine-tuned, effectively captures biomedical semantics, it faces considerable challenges in producing exact and precisely formatted answers under strict evaluation criteria. The findings emphasize the gap between semantic understanding and exact answer evaluation in biomedical LLM applications. This motivates further research into more robust extraction mechanisms, advanced post-processing strategies, and fine-tuning techniques to align model outputs more reliably with evaluation criteria. For more details, refer to the full research paper.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
