TLDR: ShizhenGPT is the first multimodal large language model (LLM) specifically designed for Traditional Chinese Medicine (TCM). It addresses the challenges of limited TCM data and the multimodal nature of TCM diagnostics by curating the largest TCM dataset to date (over 300GB of text and multimodal data). ShizhenGPT integrates deep TCM knowledge with the ability to interpret visual, auditory, olfactory, and pulse signals, aligning with TCM’s “Four Diagnostic Methods.” Evaluations show it outperforms comparable LLMs and competes with larger proprietary models in TCM expertise and visual understanding, paving the way for more holistic AI in TCM.
Traditional Chinese Medicine (TCM), a medical system with thousands of years of history, has remained largely separate from recent advancements in artificial intelligence (AI). This gap exists primarily due to two significant challenges: a scarcity of high-quality TCM data and the inherently multimodal nature of TCM diagnostics, which involve sensory-rich methods like looking, listening, smelling, and pulse-taking. Conventional large language models (LLMs) are typically limited to text, making them unsuitable for these complex diagnostic approaches.
To bridge this gap, researchers have introduced ShizhenGPT, the first multimodal LLM specifically designed for Traditional Chinese Medicine. The model aims to bring AI closer to real-world TCM clinical practice by understanding and reasoning over multiple sensory inputs.
Addressing Data Scarcity
One of ShizhenGPT’s foundational achievements is the creation of the largest TCM dataset to date. This extensive collection comprises over 100GB of text data, gathered from 3,256 TCM-specific books and various online sources. In addition to text, the dataset includes over 200GB of multimodal data, featuring 1.2 million annotated images, more than 200 hours of audio, and diverse physiological signals such as pulse and electrocardiograms (ECG).
The scale matters: previous TCM-specific LLMs often relied on less than 1GB of text, which is too little to cover the complexity of TCM theory.
Multimodal Capabilities for TCM Diagnostics
TCM diagnosis traditionally relies on the “Four Diagnostic Methods”: inspection (e.g., tongue, visual cues), listening and smelling (e.g., voice, breath, odor), inquiry (questioning the patient), and palpation (e.g., pulse-taking). ShizhenGPT is engineered to integrate the sensory side of these methods: looking, listening, smelling, and pulse-taking. Its architecture includes an LLM backbone for core reasoning, a vision encoder for visual inputs, and a signal encoder for continuous signals like voice, pulse, and smell.
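As a rough illustration of this layout, the sketch below routes each input to a modality-specific encoder and concatenates the resulting embeddings into one sequence that a backbone could consume. It is a toy under stated assumptions, not ShizhenGPT's actual code: the function names, the one-number "embeddings", and the simple concatenation scheme are all hypothetical.

```python
from typing import Callable, Dict, List

Embedding = List[float]  # toy stand-in for a real embedding vector


def embed_text(text: str) -> List[Embedding]:
    """Toy text embedding: one vector per whitespace-separated token."""
    return [[float(len(token))] for token in text.split()]


def encode_image(patches: List[List[float]]) -> List[Embedding]:
    """Toy vision encoder: one vector (mean intensity) per image patch."""
    return [[sum(patch) / len(patch)] for patch in patches]


def encode_signal(samples: List[float], window: int = 4) -> List[Embedding]:
    """Toy signal encoder: one vector (mean value) per window of samples."""
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [[sum(chunk) / len(chunk)] for chunk in chunks]


# Hypothetical routing: images go to the vision encoder, while continuous
# signals (voice, pulse, smell) share the signal encoder.
ENCODERS: Dict[str, Callable[..., List[Embedding]]] = {
    "text": embed_text,
    "image": encode_image,
    "voice": encode_signal,
    "pulse": encode_signal,
    "smell": encode_signal,
}


def build_sequence(inputs: Dict[str, object]) -> List[Embedding]:
    """Encode every input and concatenate the results for the backbone."""
    sequence: List[Embedding] = []
    for modality, payload in inputs.items():
        sequence.extend(ENCODERS[modality](payload))
    return sequence


example = {
    "text": "pale tongue",                    # 2 tokens
    "image": [[0.1, 0.3], [0.5, 0.7]],        # 2 patches
    "pulse": [0.0, 1.0, 0.0, 1.0, 0.5, 0.5],  # 2 windows (4 + 2 samples)
}
print(len(build_sequence(example)))  # prints 6
```

The point of the design is that the backbone sees a single sequence regardless of where each embedding came from, so adding a modality only requires a new encoder entry.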
The model undergoes a two-stage pre-training process. The first stage focuses on infusing knowledge from extensive TCM text, while the second introduces multimodal alignment through image-text and audio-text data. Following pre-training, an instruction-tuning phase aligns the model for instruction-following and extends its capabilities to various downstream tasks, including adapting to less data-rich modalities like sound and smell.
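The ordering of these phases can be written down as a small schedule. The sketch below only restates the article's description as data; the stage names and the data labels (in particular "instructions") are hypothetical placeholders, not identifiers from the ShizhenGPT codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str               # hypothetical stage label
    data: Tuple[str, ...]   # kinds of data used in the stage
    goal: str               # what the stage is meant to achieve


# Stage contents follow the article's description; names are assumptions.
SCHEDULE: List[Stage] = [
    Stage("pretrain_stage_1", ("tcm_text",), "infuse TCM knowledge"),
    Stage("pretrain_stage_2", ("image_text", "audio_text"), "multimodal alignment"),
    Stage("instruction_tuning", ("instructions",), "instruction following and downstream tasks"),
]


def describe(schedule: List[Stage]) -> List[str]:
    """Render each stage as a one-line summary, in training order."""
    return [
        f"{i}. {stage.name}: {', '.join(stage.data)} -> {stage.goal}"
        for i, stage in enumerate(schedule, start=1)
    ]


for line in describe(SCHEDULE):
    print(line)
```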
Performance and Evaluation
ShizhenGPT’s capabilities were rigorously evaluated using a comprehensive benchmark suite covering text, vision, and physiological signals. For textual understanding, the model was tested on recent national TCM qualification exams, including licensing exams for pharmacists, physicians, and assistant physicians, as well as postgraduate entrance exams. ShizhenGPT-7B, the smaller version, achieved the highest average score among comparable-scale LLMs, even outperforming some larger models.
In visual tasks, ShizhenGPT set a new state of the art, demonstrating strong ability in medicinal recognition and visual diagnosis (e.g., interpreting tongue and palm images). It also showed effective perception across signal modalities such as smell, ECG, and pulse, consistently outperforming random baselines. Notably, it achieved 80% accuracy in pregnancy detection from pulse signals alone.
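To make the comparison against a random baseline concrete, the sketch below computes accuracy on a binary task and sets it against the chance level. It assumes balanced classes and uniform random guessing, so chance is 1 divided by the number of classes. The labels and predictions are synthetic values chosen to give 8 correct out of 10; they are not data from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def chance_level(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / num_classes


# Synthetic example (NOT data from the paper): 10 cases, 8 predicted correctly.
labels      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

acc = accuracy(predictions, labels)
baseline = chance_level(num_classes=2)
print(f"accuracy={acc:.2f}, chance={baseline:.2f}, margin={acc - baseline:+.2f}")
# prints: accuracy=0.80, chance=0.50, margin=+0.30
```

If the real test set is imbalanced, a majority-class baseline would be the fairer reference point than uniform guessing.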
Human evaluations conducted by licensed TCM practitioners also indicated a higher preference for ShizhenGPT’s responses compared to other leading models, highlighting its clinical relevance and insight.
Future Outlook
ShizhenGPT represents a significant step towards more holistic medical AI systems in Traditional Chinese Medicine. By expanding diagnostic capabilities beyond text-based interaction to include direct analysis of visual cues, sounds, and physiological signals, it brings AI interaction closer to real-world clinical practice. The datasets, models, and code for ShizhenGPT are publicly available, with the aim of encouraging further research and collaboration in this field.
While ShizhenGPT shows promise, the researchers acknowledge limitations, including the scarcity of high-quality signal data for certain modalities and the need for real-world clinical testing. The model is currently intended for scientific research, not clinical deployment, because its outputs may be inaccurate. For more technical details, see the full research paper: ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine.


