CDrugRed: A New Dataset for Chinese Discharge Drug Recommendations in Metabolic Diseases

TLDR: CDrugRed is the first publicly available Chinese dataset for discharge drug recommendations in metabolic diseases, built from 5,894 de-identified real-world EHRs. It addresses the scarcity of non-English medical datasets and includes comprehensive patient information. Benchmarking with LLMs shows that supervised fine-tuning is crucial for effective drug recommendation and significantly outperforms prompt-based methods. Together, these results position CDrugRed as a valuable resource for developing accurate clinical decision support systems.

CDrugRed is a new resource for advancing intelligent drug recommendation systems in China. It addresses a critical gap in the field: the scarcity of publicly available, real-world Electronic Health Record (EHR) datasets in languages other than English, and in Chinese in particular.

The development of intelligent drug recommendation systems is vital for enhancing the quality and efficiency of clinical decision-making. These systems can help doctors select the most suitable medications by analyzing extensive patient data, including medical history, diagnoses, lab results, and co-existing conditions. However, the progress of such systems has been hindered by the lack of diverse and accessible datasets.

CDrugRed is the first publicly available Chinese drug recommendation dataset specifically designed for discharge medications in patients with metabolic diseases. Metabolic diseases, such as diabetes, hypertension, and fatty liver disease, are widespread chronic conditions with complex treatment plans. Ensuring continuity of care, especially with discharge medications, is crucial for managing these conditions and preventing readmissions.

The dataset comprises 5,894 de-identified medical records from 3,190 patients, along with 651 candidate drugs. These records were collected from a Grade A tertiary hospital in China between 2013 and 2023. The information within CDrugRed is comprehensive, covering patient demographics, medical history, clinical course during hospitalization, and discharge diagnoses. This rich detail is a key differentiator from other datasets, which often extract only partial information like diagnoses or surgery records.

The data collection process followed strict ethical and privacy protection standards. Patient records were selected based on inclusion criteria such as age (18 or older), diagnosis of a metabolic disease (hypertension, hyperlipidemia, hyperglycemia, hyperuricemia), and data completeness. Records of patients with severe allergies, participation in other clinical studies, or severe comorbidities were excluded.
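The inclusion and exclusion rules above can be read as a simple record filter. The sketch below is only an illustration: the `Record` fields and the `is_eligible` helper are hypothetical names, not the dataset's actual schema or the authors' code.

```python
from dataclasses import dataclass

# Diagnosis groups named in the article as inclusion criteria.
METABOLIC_DIAGNOSES = {"hypertension", "hyperlipidemia", "hyperglycemia", "hyperuricemia"}


@dataclass
class Record:
    # Hypothetical fields; the real CDrugRed schema may differ.
    age: int
    diagnoses: set
    is_complete: bool
    severe_allergy: bool = False
    in_other_study: bool = False
    severe_comorbidity: bool = False


def is_eligible(record: Record) -> bool:
    """Apply the inclusion criteria, then reject on any exclusion criterion."""
    included = (
        record.age >= 18
        and bool(record.diagnoses & METABOLIC_DIAGNOSES)
        and record.is_complete
    )
    excluded = (
        record.severe_allergy
        or record.in_other_study
        or record.severe_comorbidity
    )
    return included and not excluded
```

A cohort would then be built with something like `[r for r in records if is_eligible(r)]`.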

To ensure patient privacy, sensitive information like names and phone numbers was de-identified using a large language model (Qwen3-30B-A3B) deployed on a local server. The same model was also used to extract medication-related content from discharge instructions and to standardize drug names, correcting misspellings and inconsistent suffixes. This two-stage normalization process, which also involved cross-referencing with the DXY database, ensures consistency and alignment with clinical terminology.
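To make the idea of name standardization concrete, here is a minimal rule-based sketch. It is not the authors' method, which relied on an LLM and the DXY database; it swaps in suffix stripping plus fuzzy matching from Python's standard `difflib` module. The reference list, the suffix list, and the 0.8 cutoff are all assumptions chosen for illustration, and whether dosage forms should be stripped at all is a dataset design decision.

```python
import difflib

# Hypothetical reference vocabulary standing in for a curated drug database.
REFERENCE_NAMES = ["metformin", "acarbose", "atorvastatin", "aspirin"]

# Hypothetical dosage-form suffixes; longer suffixes are listed first so they win.
FORM_SUFFIXES = (
    " enteric-coated tablets",
    " sustained-release tablets",
    " tablets",
    " capsules",
)


def strip_form_suffix(name: str) -> str:
    """Stage 1: lowercase the name and drop one trailing dosage-form suffix."""
    cleaned = name.strip().lower()
    for suffix in FORM_SUFFIXES:
        if cleaned.endswith(suffix):
            return cleaned[: -len(suffix)].strip()
    return cleaned


def normalize(name: str, cutoff: float = 0.8):
    """Stage 2: map the cleaned name to the closest reference name, or None."""
    base = strip_form_suffix(name)
    matches = difflib.get_close_matches(base, REFERENCE_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Names that fail to match return `None`, so they can be routed to manual review rather than being silently mapped to the wrong drug.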

Statistical analysis of CDrugRed reveals interesting demographic insights. The majority of patients were middle-aged and elderly, aligning with the higher prevalence of metabolic diseases in these age groups. Hospital admissions for chronic metabolic conditions showed a gradual upward trend from 2015 to 2023. The most common discharge diagnoses include Type 2 Diabetes Mellitus, Hypertension, and Fatty Liver, often accompanied by complications. Correspondingly, frequently prescribed discharge medications include atorvastatin, aspirin enteric-coated tablets, acarbose, and metformin, which are standard treatments for these conditions.

The research paper also details benchmarking experiments conducted on CDrugRed using several state-of-the-art large language models (LLMs), including GLM4-9B-Chat, Llama3.1-8B-Instruct, and Qwen2.5-7B-Instruct. The goal was to evaluate the models’ ability to understand clinical contexts and make drug recommendations. Four strategies were compared: three prompt-based approaches (0-shot, 1-shot, and chain-of-thought (CoT) prompting) and supervised fine-tuning (SFT).
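The three prompt-based strategies differ only in what surrounds the patient record. The sketch below shows one way such prompts could be assembled; the `build_prompt` function and its template wording are hypothetical and are not taken from the paper.

```python
def build_prompt(record_text, candidate_drugs, strategy="zero_shot", example=None):
    """Assemble a prompt for one discharge record (illustrative templates only)."""
    if strategy not in {"zero_shot", "one_shot", "cot"}:
        raise ValueError(f"unknown strategy: {strategy}")

    parts = [
        "You are assisting with discharge medication recommendation.",
        "Candidate drugs: " + ", ".join(candidate_drugs),
    ]

    # 1-shot: prepend a single worked example (record plus its gold drug list).
    if strategy == "one_shot":
        if example is None:
            raise ValueError("one_shot requires an (example_record, example_drugs) pair")
        example_record, example_drugs = example
        parts.append("Example record:\n" + example_record)
        parts.append("Example answer: " + ", ".join(example_drugs))

    parts.append("Patient record:\n" + record_text)

    # CoT: ask for intermediate reasoning before the final drug list.
    if strategy == "cot":
        parts.append("Reason step by step about the diagnoses, then list the drugs.")
    else:
        parts.append("List the recommended discharge drugs, separated by commas.")

    return "\n\n".join(parts)
```

Supervised fine-tuning has no counterpart here because it changes the model's weights using labeled records rather than changing the prompt.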

The results showed that supervised fine-tuning significantly outperformed every prompt-based strategy. This highlights that while general LLMs possess impressive capabilities, they require specialized training with labeled data to handle complex, domain-specific tasks like drug recommendation effectively. Prompt-based methods (0-shot, 1-shot, CoT) showed limited benefits, and in some cases CoT performed worse than 0-shot, suggesting that current LLMs do not reliably benefit from generated reasoning chains on this task.
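The article does not say which metrics the benchmark used. Because each record has a set of discharge drugs, set-overlap scores are a natural way to compare a model's list against the prescribed one. The sketch below assumes precision, recall, F1, and Jaccard similarity; the `score` function is a hypothetical helper, not the paper's evaluation code.

```python
def score(predicted, gold):
    """Compare a predicted drug list with the gold list using set overlap."""
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    union = len(pred_set | gold_set)

    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    jaccard = overlap / union if union else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}
```

Under these metrics, a prompt-based answer padded with irrelevant medications loses precision even when it happens to include the right drugs.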

Among the models tested, GLM4 achieved the best performance under the SFT strategy. The study also observed that increasing model size generally led to improved performance, further emphasizing the potential of larger models when properly fine-tuned. A case study illustrated how SFT produced recommendations that were much more clinically accurate and relevant than those of the prompt-based methods, which often included irrelevant medications.

CDrugRed has already been adopted in the 11th China Health Information Processing Conference (CHIP) Challenge, attracting over 500 participating teams. This underscores its value as a robust benchmark for future research in automated medication recommendation. While the dataset is a high-quality resource, a current limitation is its single-hospital origin, which might affect the generalizability of models trained on it. Future work aims to expand the dataset with data from multiple hospitals and diverse clinical departments to enhance its representativeness and robustness. You can find the full research paper here: CDrugRed Research Paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.
