TL;DR: A new study published in the journal iScience, led by researchers from Binghamton University, has evaluated ChatGPT’s diagnostic capabilities. The AI demonstrated high accuracy in identifying disease terms, drug names, and genetic information, often exceeding the researchers’ expectations. However, it showed lower accuracy in symptom identification and a tendency to ‘hallucinate’ specific genetic accession numbers, pointing to areas for improvement in AI’s medical applications.
A recent study, spearheaded by Ahmed Abdeen Hamed, a research fellow at Binghamton University’s Thomas J. Watson College of Engineering and Applied Science, has delved into the diagnostic prowess of generative artificial intelligence, specifically ChatGPT. Published in the journal iScience, the research aimed to assess the accuracy of AI-generated medical information, a growing concern as individuals increasingly consult platforms like ChatGPT for health diagnoses.
Hamed, alongside collaborators from AGH University of Krakow, Poland, Howard University, and the University of Vermont, tested ChatGPT across various biomedical categories: disease terms, drug names, genetics, and symptoms. The findings presented a mix of impressive successes and notable limitations.
Remarkably, ChatGPT exhibited high accuracy in identifying disease terms (ranging from 88% to 97%), drug names (90% to 91%), and genetic information (88% to 98%). Hamed expressed his astonishment at these results, stating, ‘I thought it would be at most 25% accuracy.’ He further elaborated on the AI’s capabilities, noting, ‘The exciting result was ChatGPT said cancer is a disease, hypertension is a disease, fever is a symptom, Remdesivir is a drug and BRCA is a gene related to breast cancer. Incredible, absolutely incredible!’
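For readers curious how per-category accuracy figures like these might be computed, here is a minimal sketch of a term-classification harness in Python. The `ask_chatgpt` stub, the label set, and the sample terms are hypothetical stand-ins for illustration, not the study’s actual dataset or protocol.

```python
# Sketch of a term-classification accuracy harness, loosely mirroring
# how per-category accuracy figures like those above could be computed.
# ask_chatgpt() is a hypothetical stand-in, and the labeled terms are
# illustrative examples, not the study's actual dataset or protocol.

GOLD = [
    ("hypertension", "disease"),
    ("remdesivir", "drug"),
    ("BRCA1", "gene"),
    ("fever", "symptom"),
]

def ask_chatgpt(term: str) -> str:
    """Hypothetical model call: should return 'disease', 'drug',
    'gene', or 'symptom'. Replace with a real chat-model API call;
    this naive stub always guesses 'disease' so the script runs."""
    return "disease"

def accuracy(pairs) -> float:
    """Fraction of terms the model labels with the gold category."""
    hits = sum(1 for term, label in pairs if ask_chatgpt(term) == label)
    return hits / len(pairs)

print(f"accuracy: {accuracy(GOLD):.0%}")  # the naive stub yields 25% here
```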
However, the study identified a significant weakness in symptom identification, where ChatGPT’s accuracy dropped to between 49% and 61%. The researchers attribute this gap to the difference between the language of medical professionals and that of the general public: while doctors and researchers rely on precise biomedical ontologies, ChatGPT is trained largely on informal, everyday language and tends to mirror how ordinary users describe their health. Hamed explained, ‘The LLM is apparently trying to simplify the definition of these symptoms, because there is a lot of traffic asking such questions, so it started to minimize the formalities of medical language to appeal to those users.’
Another critical issue was ChatGPT’s tendency to ‘hallucinate’ information, particularly when asked for specific genetic accession numbers from databases like the National Institutes of Health’s GenBank. For instance, when prompted for the GenBank accession number of the Breast Cancer 1 gene (BRCA1), which is NM_007294.4, ChatGPT would generate made-up identifiers. Hamed views this as a major flaw despite the otherwise positive results.
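One practical defense against hallucinated identifiers is to verify every model-supplied accession number against GenBank itself before trusting it. The sketch below uses NCBI’s public E-utilities `efetch` endpoint, which for sequence databases accepts accession.version identifiers in the `id` parameter; the error handling is deliberately minimal, and the invalid test ID is deliberately fabricated to trigger a failure.

```python
# Sketch: verify a model-supplied accession number against NCBI GenBank
# via the public E-utilities efetch endpoint. Assumes the `requests`
# package is installed; accession.version IDs are accepted for nuccore.
import requests

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def accession_exists(accession: str) -> bool:
    """Return True if GenBank can resolve `accession` to a record."""
    resp = requests.get(
        EFETCH,
        params={
            "db": "nuccore",   # nucleotide sequence database
            "id": accession,   # e.g. "NM_007294.4" (a BRCA1 mRNA record)
            "rettype": "acc",  # ask only for the accession line back
            "retmode": "text",
        },
        timeout=10,
    )
    # NCBI returns a non-200 status or an error body for unknown IDs.
    return resp.ok and accession.split(".")[0] in resp.text

if __name__ == "__main__":
    print(accession_exists("NM_007294.4"))   # expected: True
    print(accession_exists("NM_0000000.9"))  # made-up ID, expected: False
```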
Looking ahead, Hamed sees an opportunity to enhance these AI tools. He suggests ‘introducing these biomedical ontologies to the LLMs to provide much higher accuracy, get rid of all the hallucinations and make these tools into something amazing.’ His ongoing research, which began in 2023 out of concern about fact-checking in large language models, aims to expose these flaws so that data scientists can refine and improve AI models for safer, more accurate biomedical applications.
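As a rough illustration of what ‘introducing these biomedical ontologies to the LLMs’ could look like in practice, the sketch below injects a formal ontology definition into the prompt before the model answers. The mini-ontology and the `build_grounded_prompt` helper are hypothetical; a real system would draw on resources such as UMLS or the Human Phenotype Ontology rather than a hard-coded dictionary.

```python
# Sketch: ground an LLM answer in a biomedical ontology by injecting
# the formal definition into the prompt (retrieval-augmented prompting).
# The mini-ontology below is a hypothetical stand-in for UMLS/HPO lookups.

ONTOLOGY = {
    "rhinorrhea": "Excess discharge of mucus from the nose (lay: runny nose).",
    "pyrexia": "Body temperature elevated above the normal range (lay: fever).",
}

def build_grounded_prompt(question: str, term: str) -> str:
    """Prepend the ontology definition of `term`, if known, to the question."""
    definition = ONTOLOGY.get(term.lower())
    context = f"Ontology definition of '{term}': {definition}\n" if definition else ""
    return f"{context}Using the definition above when relevant, answer:\n{question}"

prompt = build_grounded_prompt("Is pyrexia a symptom or a disease?", "pyrexia")
print(prompt)  # this string would be sent to the chat model of your choice
```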