TLDR: DexBench is the first benchmark to evaluate Large Language Models (LLMs) for personalized diabetes management, using data from 15,000 individuals across 7 real-world tasks. It assesses LLMs on accuracy, groundedness, safety, clarity, and actionability. Initial evaluations of 8 LLMs show varied performance, with no single model excelling across all metrics, highlighting the need for improvement in complex reasoning and data interpretation for patient-facing AI in healthcare.
Managing diabetes is a daily challenge that involves constant decision-making, from interpreting glucose levels to planning meals and activities. With the rise of Artificial Intelligence (AI) and Large Language Models (LLMs), there’s a significant opportunity to create personalized tools that support individuals in their diabetes management. However, before these AI solutions can be widely adopted, their performance needs to be rigorously evaluated in real-world scenarios.
This is where DexBench comes in. Developed by researchers including Maria Ana Cardei, Josephine Lamp, Mark Derdzinski, and Karan Bhatia, DexBench is the first benchmark specifically designed to assess how well LLMs perform on patient-facing decision-making tasks in diabetes management. Unlike existing health benchmarks that are often generic or clinician-focused, DexBench addresses the unique needs of individuals living with diabetes.
Understanding DexBench’s Approach
The benchmark is built on a comprehensive evaluation framework that covers seven distinct task categories. These tasks reflect the wide range of questions and decisions individuals with diabetes face daily. They include basic glucose interpretation, educational queries, understanding how behaviors like eating and exercise affect glucose, advanced decision-making, and long-term planning. For example, a user might ask, “What is my glucose variability?” or “How did this meal impact my glucose levels?”
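A question like "What is my glucose variability?" has a concrete numerical answer that a model can get right or wrong. One common way to quantify glucose variability from CGM data is the coefficient of variation (CV); the sketch below is an illustrative example using hypothetical readings, not code from the DexBench paper:

```python
import statistics

def glucose_cv(readings_mg_dl):
    """Glucose variability as coefficient of variation (CV, %):
    sample standard deviation divided by the mean. A CV below ~36%
    is a commonly cited stability target for CGM data."""
    mean = statistics.mean(readings_mg_dl)
    sd = statistics.stdev(readings_mg_dl)
    return 100 * sd / mean

# Hypothetical CGM readings (mg/dL), one reading every 5 minutes
readings = [110, 125, 140, 155, 150, 130, 115, 105, 120, 135]
print(f"Glucose CV: {glucose_cv(readings):.1f}%")
```

A benchmark question generated from this data can then be scored against the computed value, which is exactly the kind of grounded, personalized answer DexBench checks for.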
To create a realistic testing environment, DexBench utilizes a rich dataset. This dataset comprises one month of time-series data from 15,000 individuals across three different diabetes populations: type 1 diabetes, type 2 diabetes, and prediabetes/general health and wellness. This data includes glucose readings from continuous glucose monitors (CGMs) and behavioral logs like eating and activity patterns. From this extensive dataset, an impressive 360,600 personalized, contextual questions were generated across the seven task categories.
Evaluating LLM Performance
DexBench evaluates LLM performance across five key metrics: accuracy, groundedness, safety, clarity, and actionability. Accuracy checks for factual correctness and logical soundness, especially for diabetes-specific terms. Groundedness assesses whether the model’s response is contextualized, personalized, and faithful to the user’s data. Safety ensures that outputs avoid harmful suggestions and medical recommendations. Clarity measures conciseness and readability, aiming for a Flesch-Kincaid Grade level below 8. Finally, actionability determines if the responses provide useful and practical guidance.
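The clarity target is mechanically checkable: the Flesch-Kincaid Grade Level is a standard readability formula based on average sentence length and average syllables per word. Below is a minimal sketch of that formula with a crude vowel-group syllable counter; it is an illustration of the metric, not the scoring code used by the DexBench authors:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

simple = "Your glucose is high. Drink water. Take a short walk."
print(f"Grade level: {fk_grade(simple):.1f}")
```

Short sentences with common words score well below the grade-8 threshold, while the long, clause-heavy responses LLMs tend to produce easily exceed it.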
The researchers evaluated eight recent LLMs, including proprietary models like Gemini 2.5 Pro and GPT-5, and open-source models such as Llama 3.1 8B Instruct and MedGemma 4B Instruct. The findings revealed significant variability in performance across tasks and metrics. No single model consistently outperformed others across all dimensions. Generally, models performed well on safety and actionability but struggled more with accuracy, groundedness, and clarity, particularly when dealing with complex calculations or interpreting long-term data trends.
For instance, in “Glucose Math” tasks, models often made calculation errors or misunderstood diabetes-specific metrics like MAGE (Mean Amplitude of Glycemic Excursions) and CONGA (Continuous Overall Net Glycemic Action). In “Advanced Reasoning” tasks, which required interpreting 30 days of data, models sometimes hallucinated data or struggled to reason logically about complex relationships. Clarity was also a common challenge, with many models failing to provide responses at the recommended reading level.
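To see why these metrics trip models up, it helps to look at one. CONGA-n is defined as the standard deviation of the differences between each CGM reading and the reading taken n hours earlier, which requires correctly aligning lagged time-series values. A minimal sketch, assuming evenly spaced readings (the function name and sample data are illustrative, not from the paper):

```python
import math
import statistics

def conga(readings_mg_dl, lag_hours=1, interval_min=5):
    """CONGA-n (Continuous Overall Net Glycemic Action): the standard
    deviation of differences between each CGM reading and the reading
    taken n hours earlier. Assumes evenly spaced readings."""
    lag = (lag_hours * 60) // interval_min  # readings per lag window
    diffs = [readings_mg_dl[i] - readings_mg_dl[i - lag]
             for i in range(lag, len(readings_mg_dl))]
    return statistics.stdev(diffs)

# Hypothetical 2 hours of 5-minute CGM data (25 readings, mg/dL)
readings = [120 + 30 * math.sin(i / 5) for i in range(25)]
print(f"CONGA-1: {conga(readings):.1f} mg/dL")
```

The lag-index bookkeeping here is trivial for code but is exactly the kind of step an LLM reasoning over raw data in text form tends to get wrong, which is consistent with the calculation errors DexBench observed.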
Implications for Future AI Development
The insights from DexBench are crucial for advancing AI solutions in diabetes care. The benchmark highlights areas where LLMs need significant improvement, particularly in complex reasoning, data faithfulness, and generating structured, actionable plans. It also suggests a trade-off between response quality and speed, with higher-performing models often exhibiting higher latency.
The framework developed for DexBench is not limited to diabetes management. It can be extended to other domains involving wearable devices and continuous monitoring, such as preventative care, fitness optimization, and managing other chronic conditions like hypertension or sleep disorders. By establishing this benchmark, the researchers aim to foster the development of more reliable, safe, effective, and practical AI tools that empower individuals in their daily health management. You can read the full research paper for more details here.


