
Enhancing Biomedical Question Answering with Fine-Tuned LLaMA 3: CaresAI’s MedHopQA Journey

TL;DR: This research paper details CaresAI’s approach to the BioCreative IX MedHopQA task, focusing on multi-hop biomedical question answering. The team fine-tuned LLaMA 3 8B using a curated biomedical dataset and explored different training strategies (combined, short-only, long-only answers). While their models demonstrated strong concept-level understanding (up to 0.8 accuracy), they struggled with Exact Match (EM) scores, particularly in the test phase (as low as 0.0), due to verbosity and formatting inconsistencies. A two-stage inference pipeline was introduced to extract precise short answers, improving EM to 0.49 after refinement. The study highlights the need for better output control and post-processing in biomedical LLM applications.

Large language models (LLMs) are rapidly transforming various fields, and their potential in critical domains like biomedical and healthcare is particularly significant. However, before these powerful AI tools can be widely deployed in real-world medical applications, their ability to accurately answer complex questions, especially those requiring multi-step reasoning, needs rigorous evaluation.

A recent study by researchers Reem Abdel-Salam, Mary Adewunmi, and Modinat A. Abayomi, titled “CaresAI at BioCreative IX Track 1 – LLM for Biomedical QA”, delves into this challenge. The paper presents their approach to the MedHopQA track of the BioCreative IX shared task, which specifically focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. This task is designed to push the boundaries of LLMs, requiring them to integrate information from multiple sources to answer complex questions, often about rare diseases.

CaresAI’s Approach to Multi-Hop QA

The CaresAI team adopted a supervised fine-tuning strategy, leveraging LLaMA 3 8B, a pre-trained large language model. To enhance its performance in the biomedical domain, they augmented the model with a carefully curated dataset of biomedical question-answer pairs. This dataset was compiled from various external sources, including well-known benchmarks like BioASQ, MedQuAD, and TREC, ensuring a diverse and relevant training base.

The researchers explored three distinct experimental setups to understand the impact of answer format on model performance:

  • Fine-tuning on a combined dataset of both short and long answers.
  • Fine-tuning exclusively on short answer data.
  • Fine-tuning exclusively on long answer data.

Each model was evaluated independently, with the expectation that the short-answer fine-tuned model would excel in precision, while the long-answer version would capture broader context.

Addressing Verbosity and Precision

One of the significant challenges encountered was the LLaMA 3 8B model’s tendency to produce overly verbose or imprecise answers, often restating the question or adding extraneous information. To mitigate this, the team introduced a two-stage inference pipeline. In the first stage, the model generates an initial response. In the second stage, a follow-up prompt explicitly instructs the model to extract the exact answer phrase or entity from the initial, longer response. This post-processing step was crucial for aligning the outputs with the strict short-answer format required by the evaluation metrics.
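The two-stage idea can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation: the `generate` function is a hypothetical stand-in for a call to the fine-tuned LLaMA 3 8B model (stubbed here with fixed strings), and the prompt wording is an assumption.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a fine-tuned LLaMA 3 8B call.

    Stubbed for illustration: returns a verbose answer for a question
    prompt, and a short span for an extraction prompt.
    """
    if prompt.startswith("From the response"):
        return "CFTR"
    return "The gene most commonly associated with cystic fibrosis is CFTR."

def two_stage_answer(question: str) -> str:
    # Stage 1: let the model answer freely; fine-tuned LLMs tend to
    # restate the question or add context here.
    draft = generate(f"Question: {question}\nAnswer:")
    # Stage 2: a follow-up prompt asks the model to extract only the
    # exact answer phrase or entity from its own longer response.
    extraction_prompt = (
        "From the response below, extract only the exact answer phrase "
        "or entity. Do not restate the question or add explanation.\n\n"
        f"Response: {draft}\nExact answer:"
    )
    return generate(extraction_prompt).strip()
```

In a real setup, `generate` would wrap the model's inference API, and the second stage could use a lower temperature to keep the extraction deterministic.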

Evaluation and Key Findings

The models were evaluated using two primary metrics: Exact Match (EM) and concept-level evaluation. EM measures if a prediction precisely matches the gold standard answer after normalization, emphasizing strict precision. Concept-level evaluation, on the other hand, assesses semantic equivalence, reflecting the model’s understanding of biomedical concepts regardless of surface-level differences.
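To make the distinction concrete, a common way to compute EM is shown below. This is a SQuAD-style normalization sketch (lowercasing, stripping punctuation and articles, collapsing whitespace); the task's official scoring script may normalize differently.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization (an assumption; the official
    MedHopQA script may differ): lowercase, drop punctuation and
    English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True only if prediction and gold agree after normalization."""
    return normalize(prediction) == normalize(gold)
```

Note that even aggressive normalization does not bridge surface variants like “Chr.2” versus “Chromosome 2”, which is exactly where concept-level evaluation diverges from EM.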

On the validation set, all three approaches achieved approximately 0.5 in EM score and around 0.8 in concept-level accuracy. This indicates a strong grasp of biomedical knowledge and semantic understanding. However, the models struggled with generating exact matches, often due to verbosity, paraphrasing, or formatting inconsistencies. In contrast, general-purpose models like LLaMA 8B and Qwen Instruct 7B, when used in a zero-shot setting, yielded near-zero EM scores, highlighting the necessity of domain-specific fine-tuning.

The testing phase, evaluated on 1,000 unseen examples, revealed a significant drop in performance. The combined approach yielded an EM score of 0.2, while the short-only and long-only approaches achieved an EM score of 0.0. This drop underscored the difficulty in generalizing to new data with precise answer formatting. The researchers found that a major reason for this was the model’s inability to consistently generate concise 1-2 phrase answers in the exact format specified by the task organizers (e.g., producing “2” or “Chr.2” instead of “Chromosome 2”). After addressing these issues through prompt refinement and lightweight post-processing in an unofficial test evaluation, the model’s EM score significantly improved to 0.49.
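Lightweight post-processing of this kind can be as simple as a rule that rewrites shorthand into the task's expected canonical form. The rule below is purely illustrative, assuming a convention of spelling out “Chromosome N” as in the paper's example; the authors' actual rules are not specified.

```python
import re

def canonicalize_chromosome(answer: str) -> str:
    """Map shorthand chromosome mentions ("2", "Chr.2", "chr 2") to a
    canonical "Chromosome N" form. The rules here are illustrative
    assumptions, not the paper's actual post-processing."""
    m = re.fullmatch(r"(?:chr\.?\s*)?(\d{1,2}|x|y)", answer.strip(),
                     re.IGNORECASE)
    if m:
        value = m.group(1)
        return f"Chromosome {value.upper() if value.isalpha() else value}"
    # Leave answers that are not chromosome shorthand untouched.
    return answer
```

A pipeline of such rules, applied after the extraction stage, is one plausible way the unofficial re-evaluation closed much of the gap between concept-level accuracy and EM.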

Future Directions

The study concludes that while LLaMA 3 8B, when fine-tuned, effectively captures biomedical semantics, it faces considerable challenges in producing exact and precisely formatted answers under strict evaluation criteria. The findings emphasize the gap between semantic understanding and exact answer evaluation in biomedical LLM applications. This motivates further research into more robust extraction mechanisms, advanced post-processing strategies, and fine-tuning techniques to align model outputs more reliably with evaluation criteria. For more details, refer to the full research paper.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
