TLDR: A new study finds that current medical AI benchmarks are heavily skewed towards high-income countries, failing to represent African disease burdens (such as malaria, HIV, TB, and sickle-cell disease) and local clinical guidelines, which can lead to unsafe AI deployments in Africa. The study introduces Alama Health QA, a new benchmark grounded in Kenyan health guidelines that far surpasses global benchmarks in both coverage of African health realities and clinician-rated relevance. The authors advocate for region-specific, guideline-aligned benchmarks to ensure equitable and safe AI in global health.
Large Language Models (LLMs) are rapidly changing how we approach healthcare, from powering chatbots to assisting with clinical decisions. However, a recent study highlights a critical issue: the benchmarks used to evaluate these medical LLMs often don’t reflect the unique health challenges and realities faced in African countries.
The research, titled “Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens,” points out that most existing benchmarks are built around the medical school exams and disease patterns of high-income nations. This creates a significant “gap” when these models are deployed in Africa, where diseases like malaria, HIV, tuberculosis (TB), sickle-cell disease, and neglected tropical diseases (NTDs) are far more common, and where national guidelines dictate patient care.
A team of authors, including Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, and Talkmore Chidede, conducted a comprehensive review of 31 quantitative LLM evaluation papers published between January 2019 and May 2025. They identified 19 English-language medical question-answering benchmarks and then focused on six for detailed comparison: AfriMed-QA, MMLU-Medical, PubMedQA, MedMCQA, MedQA-USMLE, and their newly developed Alama Health QA.
The Alama Health QA dataset was created specifically to close this gap. Built with retrieval-augmented generation (RAG) and grounded in the Kenyan Ministry of Health’s Clinical Practice Guidelines, it keeps questions and answers relevant to the local context and aligned with official treatment protocols. The researchers assessed all six benchmarks both quantitatively (analyzing language complexity and disease mentions) and qualitatively (expert review by clinicians).
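To make the pipeline concrete, here is a minimal sketch of the general guideline-grounded RAG pattern described above; it is not the authors’ actual implementation. The keyword-overlap retriever is a stand-in for whatever retriever the real pipeline uses, the guideline snippets are illustrative placeholders rather than actual Ministry of Health wording, and in practice the resulting prompt would be sent to an LLM to draft the question.

```python
import re

def retrieve(query, passages, k=2):
    """Rank guideline passages by keyword overlap with the query
    (a stand-in for the retriever a real RAG pipeline would use)."""
    q_terms = set(re.findall(r"[a-z']+", query.lower()))
    scored = [(len(q_terms & set(re.findall(r"[a-z']+", p.lower()))), p)
              for p in passages]
    return [p for score, p in sorted(scored, reverse=True)[:k] if score > 0]

def build_prompt(topic, passages):
    """Assemble a prompt that forces the drafted question to stay
    within the retrieved guideline text."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Using ONLY the guideline excerpts below, write one exam-style "
            f"question about {topic}, and justify the correct answer from "
            f"the excerpts.\n\nExcerpts:\n{context}")

# Illustrative placeholder snippets, NOT actual Kenyan guideline wording.
guidelines = [
    "Confirm suspected malaria with a rapid diagnostic test before treating.",
    "Treat confirmed uncomplicated malaria with artemether-lumefantrine.",
    "Screen every tuberculosis patient for HIV and link to care.",
]

prompt = build_prompt("malaria case management",
                      retrieve("uncomplicated malaria treatment", guidelines))
print(prompt)  # in the real pipeline, an LLM would turn this into a QA item
```

Tying generation to retrieved guideline text in this way is what allows each question to be traced back to an official recommendation.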
The findings were striking. Alama Health QA proved the most representative of African health realities, capturing over 40% of all NTD mentions across the datasets. It also had the highest within-set mention frequencies for malaria (7.7%), HIV (4.1%), and TB (5.2%). AfriMed-QA, another regional benchmark, came in second but lacked formal links to national guidelines. In stark contrast, the four global benchmarks (MMLU-Medical, MedQA-USMLE, MedMCQA, and PubMedQA) collectively accounted for less than 20% of total NTD mentions, and several critical conditions were nearly or entirely absent: malaria does not appear at all in MMLU-Medical, dengue is missing from all four, and sickle-cell disease, despite its high prevalence in tropical regions, is entirely absent from three of them.
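As a rough illustration of how such mention frequencies can be computed, the sketch below scans each benchmark question against a small condition lexicon. The term lists are hypothetical stand-ins; the paper’s actual lexicon and matching rules are not reproduced here.

```python
from collections import Counter

# Hypothetical condition lexicon, for illustration only.
CONDITIONS = {
    "malaria": ("malaria", "plasmodium"),
    "HIV": ("hiv", "antiretroviral"),
    "TB": ("tuberculosis", " tb "),
    "sickle-cell": ("sickle cell", "sickle-cell"),
}

def mention_frequencies(questions):
    """Fraction of a benchmark's questions that mention each condition."""
    counts = Counter()
    for q in questions:
        text = f" {q.lower()} "          # pad so ' tb ' matches at word boundaries
        for condition, terms in CONDITIONS.items():
            if any(t in text for t in terms):
                counts[condition] += 1
    n = max(len(questions), 1)
    return {c: round(counts[c] / n, 3) for c in CONDITIONS}

sample = [
    "A febrile child tests positive for Plasmodium falciparum. Next step?",
    "Which regimen is first-line for drug-sensitive tuberculosis?",
]
print(mention_frequencies(sample))
# -> {'malaria': 0.5, 'HIV': 0.0, 'TB': 0.5, 'sickle-cell': 0.0}
```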
Qualitative evaluations reinforced these findings. Alama Health QA received the highest scores for clinical relevance and guideline alignment, with reviewers praising questions that simulated real-world scenarios faced by frontline health workers in Kenya; one question on managing severe child malnutrition in Kibera was singled out for its strong guideline alignment. By contrast, PubMedQA was criticized as too academic and lacking clinical framing; MedMCQA was often found overly verbose; and MedQA-USMLE’s long, vignette-style questions, though rich in vocabulary, were rooted in high-income clinical contexts.
The study concludes that relying heavily on current global medical LLM benchmarks can lead to misleading performance claims and even undermine patient safety in African health systems. It emphasizes the urgent need for region-specific, guideline-aligned, and culturally appropriate benchmarks. Alama Health QA serves as a powerful example, demonstrating that by grounding benchmarks in local guidelines and health worker scenarios, we can significantly improve the validity and safety of LLM evaluations for low-resource settings.
The authors suggest that future efforts should focus on developing disease-specific benchmarks for high-burden conditions in Africa, such as HIV, malaria, sickle-cell disease, and viral hemorrhagic fevers like Ebola. These new benchmarks should build upon the methodology used for Alama Health QA, ensuring they are linked to authoritative guidelines, include culturally and linguistically appropriate phrasing, and offer stratified clinical reasoning difficulty. For those interested in the technical details of how Alama Health QA was built, the accompanying methodological paper, “Alama-Health-QA: A Guideline-Grounded Benchmark Pipeline for Creating Medical Language Models Benchmarks in African Primary Care,” provides further insights. You can read the full research paper here.


