TLDR: Ladder-base, a new Large Language Model (LLM) for Traditional Chinese Medicine (TCM), utilizes Group Relative Policy Optimization (GRPO) to significantly improve reasoning and factual consistency. Trained on the textual subset of the TCM-Ladder benchmark, Ladder-base outperforms both general-purpose and existing TCM-specific LLMs across various diagnostic and disciplinary tasks, demonstrating GRPO’s effectiveness in aligning AI with expert-level clinical reasoning in TCM.
Traditional Chinese Medicine (TCM) represents a profound and intricate knowledge system that has been a cornerstone of East Asian healthcare for over two millennia. It encompasses a wide array of practices, from herbal remedies to acupuncture, and continues to be relevant in modern medicine, even contributing to drug discovery. However, TCM texts are rich but semantically dense and often unstandardized, which poses significant challenges for the application of modern artificial intelligence, particularly large language models (LLMs).
While LLMs have transformed natural language understanding in many fields, including the general and biomedical domains, most advances in medical AI have focused on Western medicine. The unique symbolic reasoning, holistic logic, and classical Chinese semantics inherent in TCM remain largely unaddressed by these models, leaving a notable gap in computational intelligence for traditional medical reasoning.
A recent study introduces a groundbreaking approach to bridging this gap with a new LLM called Ladder-base. This model is the first TCM-focused LLM to be trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method designed to enhance reasoning and factual consistency. GRPO scores each response against the others in a group of generated answers, rather than relying on a separately trained value network.
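The article does not reproduce the paper's equations, but in the standard GRPO formulation this within-group comparison is a simple normalization: each of the G responses sampled for a query receives a reward, and its advantage is its deviation from the group mean, scaled by the group's standard deviation:

```latex
% Standard GRPO group-relative advantage; r_1, ..., r_G are the rewards
% assigned to G responses sampled for the same query.
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}
```

Because the baseline is the group mean rather than the output of a learned critic, no value network needs to be trained alongside the policy.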
Ladder-base is built upon the robust Qwen2.5-7B-Instruct foundation model. It was trained exclusively on the textual data from the TCM-Ladder benchmark, a comprehensive dataset curated for multimodal question answering in TCM. The dataset comprises over 52,000 high-quality QA pairs and diagnostic dialogues, independently verified by licensed TCM physicians to ensure accuracy and clinical relevance. For training Ladder-base, 80% of this textual data was used, with the remainder split between validation and testing.
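Purely as an illustration (the paper's exact split procedure is not described in the article), an 80/10/10 partition of the textual entries might look like the sketch below, assuming a seeded shuffle and an even validation/test split:

```python
import random

def split_dataset(entries, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and partition QA entries into train/validation/test sets.

    The 80% training share matches the article; splitting the remainder
    evenly between validation and test is an assumption.
    """
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n_train = int(len(entries) * train_frac)
    n_val = int(len(entries) * val_frac)
    return (entries[:n_train],
            entries[n_train:n_train + n_val],
            entries[n_train + n_val:])
```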
The GRPO framework, a variant of the well-known Proximal Policy Optimization (PPO) algorithm, has the policy model generate a group of responses for each query. Each response is assigned a reward, and the policy is updated according to how each response scores relative to the rest of its group. Using the final accuracy of a verifiable task directly as the outcome reward helps mitigate issues like “reward hacking.” The reward combines correctness, proper formatting, and accurate tagging, with correctness weighted most heavily.
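A minimal sketch of that reward and the group-relative signal follows. The weights, the `<think>`/`<answer>` tag format, and the string checks are all hypothetical; the article says only that correctness dominates, followed by formatting and tagging:

```python
import re
import statistics

# Hypothetical weights: the article states only that correctness is
# weighted most heavily; the actual values are not reported.
W_CORRECT, W_FORMAT, W_TAGS = 1.0, 0.2, 0.1

def reward(response: str, gold_answer: str) -> float:
    """Composite outcome reward for one generated response.

    The three checks are illustrative stand-ins for the verifiable
    checks described in the article, not the paper's implementation.
    """
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = float(m is not None and m.group(1).strip() == gold_answer)
    well_formatted = float("<think>" in response)  # reasoning block present
    tags_ok = float(response.count("<answer>") == response.count("</answer>") == 1)
    return W_CORRECT * correct + W_FORMAT * well_formatted + W_TAGS * tags_ok

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative credit: (r_i - mean) / (std + eps) for each response."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For instance, if four sampled responses earn rewards of 1.3, 0.2, 1.1, and 0.2, the first receives a strongly positive advantage and the two incorrect ones negative advantages, so the update shifts probability toward the correct, well-formatted output.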
Training Ladder-base took approximately 60 hours on two NVIDIA A100 PCIe GPUs. Sampling parameters such as temperature and top-p were carefully set, and a clipped objective function with a Kullback–Leibler (KL) divergence penalty term was employed to keep optimization stable and prevent the model from deviating too far from its reference policy during training. During inference, greedy decoding was used to generate consistent responses.
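The article does not print the objective, but the clipped, KL-penalized GRPO objective it refers to usually takes the following form, where ρ_i is the policy probability ratio for response o_i, Â_i is the group-relative advantage from above, ε is the clipping range, and β weights the KL penalty toward the frozen reference policy:

```latex
% Clipped GRPO surrogate with a KL penalty toward the reference policy.
\mathcal{J}(\theta) =
  \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\Big(\rho_i \hat{A}_i,\;
      \mathrm{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i\Big)\right]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\big\|\, \pi_{\mathrm{ref}}\big),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```

The min/clip term caps how much any single update can change the policy, while the KL term anchors the fine-tuned model to its reference, which is what keeps it from deviating too much during training.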
The evaluation of Ladder-base demonstrated superior performance across various reasoning metrics. It was rigorously compared against state-of-the-art general-purpose LLMs such as GPT-4o, Gemini 2.5 Pro, and Claude 3, as well as existing TCM-specific models such as BenTsao, HuatuoGPT2, and Zhongjing. On the text-based diagnostic dialogue and fill-in-the-blank tasks of the TCM-Ladder benchmark, Ladder-base achieved the highest overall performance, with a Ladder-Score of 0.803 and an Exact Match Accuracy of 0.8623, surpassing all competitors and highlighting its improved logical coherence and factual precision in multi-turn diagnostic dialogues.
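Ladder-Score is benchmark-specific and not defined in the article. Exact Match Accuracy, by contrast, is the standard fraction of predictions identical to the reference answer; a minimal sketch follows, with the caveat that TCM-Ladder's actual normalization rules are an assumption here:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the reference after trimming
    whitespace and lowercasing; the benchmark's rules may differ."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```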
Furthermore, Ladder-base showed strong generalization across seven core TCM disciplines: Diagnostics, Pharmacognosy, Surgery, Herbal Formulas, Internal Medicine, Pediatrics, and Fundamentals. It consistently outperformed all other models, achieving an average score of 0.7823, with notably large gains in Pharmacognosy and Pediatrics, indicating enhanced contextual reasoning in knowledge-intensive or symptom-dependent scenarios. Even in complex areas like Surgery and Herbal Formulas, Ladder-base maintained more consistent performance than the other models.
These findings suggest that GRPO offers an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains. The group-wise normalization in GRPO provides a more stable optimization process, allowing the model to better capture the implicit causal patterns in clinical reasoning. This approach enables the model to learn relative judgments, mirroring how experienced physicians evaluate diagnoses and treatment principles, leading to a more interpretable and clinically consistent decision-making process.
While this study marks a significant step forward, it focuses solely on text-based data. Future research aims to extend the GRPO framework to multimodal inputs, such as tongue, pulse, and herb images, along with patient interaction data, to enable even more comprehensive diagnostic reasoning. Long-term validation in real clinical environments will also be crucial to ensure the model’s safety, reliability, and ethical compliance before deployment in clinical decision support systems. For more in-depth information, you can refer to the full research paper here.


