TLDR: A significant international study conducted by scientists from National Taiwan University and Harvard T.H. Chan School of Public Health concludes that leading generative AI chatbots, including ChatGPT-4o, Claude 3 Sonnet, and Gemini Ultra 1.0, are not yet reliable enough to provide clinically safe and accurate advice across various stages of stroke care. While AI shows promise for general health information, its inconsistency in high-risk medical situations like stroke necessitates human oversight.
An international collaborative study, spearheaded by researchers from National Taiwan University and Harvard T.H. Chan School of Public Health, has cast a critical eye on the current capabilities of generative artificial intelligence (AI) chatbots in providing clinically reliable guidance for stroke care. The findings indicate that despite their advanced nature, models such as ChatGPT-4o, Claude 3 Sonnet, and Gemini Ultra 1.0 fail to consistently meet the necessary clinical competency threshold.
The study aimed to evaluate whether these AI chatbots could offer safe and accurate advice across the entire continuum of stroke care, encompassing prevention, early symptom recognition, acute treatment, and rehabilitation. Researchers crafted stroke-related inquiries based on common patient questions encountered in clinical practice, reflecting realistic, patient-oriented scenarios. These inquiries were posed to the AI models under three distinct prompting strategies: Zero-Shot Learning (ZSL), Chain-of-Thought (CoT), and Tree-of-Thoughts (ToT).
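To make the three prompting strategies concrete, here is a minimal sketch of how the same patient question might be framed under each. The template wording is entirely hypothetical and not taken from the study; the third strategy is rendered here as a tree-of-thoughts-style prompt.

```python
# Illustrative sketch only: three prompting strategies applied to one
# patient-oriented stroke question. All template wording is hypothetical.

QUESTION = "What should I do if one side of my face suddenly droops?"

def zero_shot(question: str) -> str:
    # ZSL: the question is posed directly, with no added scaffolding.
    return question

def chain_of_thought(question: str) -> str:
    # CoT: the model is asked to reason step by step before answering.
    return question + "\nLet's think through this step by step before answering."

def tree_of_thoughts(question: str) -> str:
    # ToT: the model is asked to branch into alternatives, weigh them,
    # and commit to the safest recommendation.
    return (question + "\nConsider several possible explanations, "
            "evaluate each, and recommend the safest course of action.")

prompts = {
    "ZSL": zero_shot(QUESTION),
    "CoT": chain_of_thought(QUESTION),
    "ToT": tree_of_thoughts(QUESTION),
}
```

The point of varying the prompt is that the same model can give materially different answers to the same clinical question depending on how it is asked, which is why the study tested all three framings.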
Four senior stroke specialists, blinded to the AI model and prompt type, meticulously scored the outputs on accuracy, presence of hallucinations, specificity, relevance, empathy, understanding, and actionability. A critical clinical competency threshold was set at a score of 60. The results revealed that none of the tested AI models were able to consistently achieve this minimum threshold for providing safe, high-quality patient advice. Performance was particularly inconsistent, with responses concerning stroke treatment proving to be notably unreliable.
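The threshold check described above can be sketched as follows. The seven rubric dimensions come from the article; the 0–100 scale, equal weighting, and simple averaging are assumptions made purely for illustration, as the study's exact scoring formula is not given here.

```python
# Hypothetical sketch of the clinical competency check. Dimension names
# are from the article; the 0-100 scale and equal-weight averaging are
# assumptions for illustration.

CLINICAL_THRESHOLD = 60  # minimum score for safe, high-quality advice

RUBRIC = [
    "accuracy", "hallucinations", "specificity", "relevance",
    "empathy", "understanding", "actionability",
]

def meets_threshold(scores: dict[str, float]) -> bool:
    """Average the rubric scores and compare against the threshold."""
    mean = sum(scores[dim] for dim in RUBRIC) / len(RUBRIC)
    return mean >= CLINICAL_THRESHOLD
```

Under this toy scheme, a response scoring 55 on every dimension would fall below the 60-point bar, echoing the study's finding that no tested model cleared the threshold consistently.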
John Tayu Lee, Associate Professor at National Taiwan University and Senior Researcher at the Health Systems Innovation Lab at Harvard T.H. Chan School of Public Health, commented on the findings: “Existing evidence suggests generative AI has real potential to help close health gaps and ease the shortage of healthcare workers in underserved and rural areas, especially when specialist access is limited. Our results show that while generative AI is impressive for general health information, it remains unreliable when patients face high-risk medical situations like stroke.”
Stroke remains the second-leading cause of death and the third-leading cause of disability globally, underscoring the urgent need for accurate and actionable patient guidance. While acknowledging AI's potential to enhance global health equity, particularly in areas with limited specialist access, the study emphasizes that significant improvements in the technology are required before this potential can be fully realized. The authors also call for educating patients on how to phrase questions that elicit safer and more useful answers from AI tools.
The researchers advocate for the careful integration of AI tools into healthcare, stressing the continued necessity of professional oversight to ensure the appropriateness and safety of the advice provided. This landmark study highlights that while AI holds immense promise for the future of medicine, its application in critical, high-stakes areas like stroke care demands further rigorous development and validation before widespread, unsupervised deployment.