TLDR: Ladder-base, a new Large Language Model (LLM) for Traditional Chinese Medicine (TCM), utilizes Group Relative Policy Optimization (GRPO) to significantly improve reasoning and factual consistency. Trained on the textual subset of the TCM-Ladder benchmark, Ladder-base outperforms both general-purpose and existing TCM-specific LLMs across various diagnostic and disciplinary tasks, demonstrating GRPO’s effectiveness in aligning AI with expert-level clinical reasoning in TCM.
Traditional Chinese Medicine (TCM) represents a profound and intricate knowledge system that has been a cornerstone of East Asian healthcare for over two millennia. It encompasses a wide array of practices, from herbal remedies to acupuncture, and continues to be relevant in modern medicine, even contributing to drug discovery. However, TCM texts are rich but semantically dense and often unstandardized, which poses significant challenges for the application of modern artificial intelligence, particularly large language models (LLMs).
While LLMs have transformed natural language understanding in many fields, including the general and biomedical domains, most advances in medical AI have focused on Western medicine. The unique symbolic reasoning, holistic logic, and classical Chinese semantics inherent in TCM remain largely unaddressed by these models, leaving a notable gap in computational intelligence for traditional medical reasoning.
A recent study introduces a groundbreaking approach to bridging this gap with a new LLM called Ladder-base. This model is the first TCM-focused LLM to be trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method designed to enhance reasoning and factual consistency. GRPO scores each response against the others in a group of generated answers, rather than relying on a separately trained value network.
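The article does not reproduce the paper's equations, but in the standard GRPO formulation this within-group comparison is a simple normalization: each of the G responses sampled for a query receives a reward, and its advantage is its deviation from the group mean, scaled by the group's standard deviation:

```latex
% Standard GRPO group-relative advantage; r_1, ..., r_G are the rewards
% assigned to G responses sampled for the same query.
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}
```

Because the baseline is the group mean rather than the output of a learned critic, no value network needs to be trained alongside the policy.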
Ladder-base is built upon the robust Qwen2.5-7B-Instruct foundation model. It was trained exclusively on the textual data from the TCM-Ladder benchmark, a comprehensive dataset curated for multimodal question answering in TCM. The dataset comprises over 52,000 high-quality QA pairs and diagnostic dialogues, independently verified by licensed TCM physicians to ensure accuracy and clinical relevance. For training Ladder-base, 80% of this textual data was used, with the remainder split between validation and testing.
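Purely as an illustration (the paper's exact split procedure is not described in the article), an 80/10/10 partition of the textual entries might look like the sketch below, assuming a seeded shuffle and an even validation/test split:

```python
import random

def split_dataset(entries, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and partition QA entries into train/validation/test sets.

    The 80% training share matches the article; splitting the remainder
    evenly between validation and test is an assumption.
    """
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n_train = int(len(entries) * train_frac)
    n_val = int(len(entries) * val_frac)
    return (entries[:n_train],
            entries[n_train:n_train + n_val],
            entries[n_train + n_val:])
```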
The GRPO framework, a variant of the well-known Proximal Policy Optimization (PPO) algorithm, has the policy model generate a group of responses for each query. Each response is assigned a reward, and the policy is updated according to how each response scores relative to the rest of its group. Using the final accuracy of a verifiable task directly as the outcome reward helps mitigate issues like “reward hacking.” The reward combines correctness, proper formatting, and accurate tagging, with correctness weighted most heavily.
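A minimal sketch of that reward and the group-relative signal follows. The weights, the `<think>`/`<answer>` tag format, and the string checks are all hypothetical; the article says only that correctness dominates, followed by formatting and tagging:

```python
import re
import statistics

# Hypothetical weights: the article states only that correctness is
# weighted most heavily; the actual values are not reported.
W_CORRECT, W_FORMAT, W_TAGS = 1.0, 0.2, 0.1

def reward(response: str, gold_answer: str) -> float:
    """Composite outcome reward for one generated response.

    The three checks are illustrative stand-ins for the verifiable
    checks described in the article, not the paper's implementation.
    """
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = float(m is not None and m.group(1).strip() == gold_answer)
    well_formatted = float("<think>" in response)  # reasoning block present
    tags_ok = float(response.count("<answer>") == response.count("</answer>") == 1)
    return W_CORRECT * correct + W_FORMAT * well_formatted + W_TAGS * tags_ok

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative credit: (r_i - mean) / (std + eps) for each response."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For instance, if four sampled responses earn rewards of 1.3, 0.2, 1.1, and 0.2, the first receives a strongly positive advantage and the two incorrect ones negative advantages, so the update shifts probability toward the correct, well-formatted output.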
Training Ladder-base took approximately 60 hours on two NVIDIA A100 PCIe GPUs. Sampling parameters such as temperature and top-p were carefully set, and a clipped objective function with a Kullback–Leibler (KL) divergence penalty term was employed to keep optimization stable and prevent the model from deviating too far from its reference policy during training. During inference, greedy decoding was used to generate consistent responses.
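The article does not print the objective, but the clipped, KL-penalized GRPO objective it refers to usually takes the following form, where ρ_i is the policy probability ratio for response o_i, Â_i is the group-relative advantage from above, ε is the clipping range, and β weights the KL penalty toward the frozen reference policy:

```latex
% Clipped GRPO surrogate with a KL penalty toward the reference policy.
\mathcal{J}(\theta) =
  \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\Big(\rho_i \hat{A}_i,\;
      \mathrm{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i\Big)\right]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\big\|\, \pi_{\mathrm{ref}}\big),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```

The min/clip term caps how much any single update can change the policy, while the KL term anchors the fine-tuned model to its reference, which is what keeps it from deviating too much during training.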
The evaluation of Ladder-base demonstrated superior performance across various reasoning metrics. It was rigorously compared against state-of-the-art general-purpose LLMs such as GPT-4o, Gemini 2.5 Pro, and Claude 3, as well as existing TCM-specific models such as BenTsao, HuatuoGPT2, and Zhongjing. On the text-based diagnostic dialogue and fill-in-the-blank tasks of the TCM-Ladder benchmark, Ladder-base achieved the highest overall performance, with a Ladder-Score of 0.803 and an Exact Match Accuracy of 0.8623, surpassing all competitors and highlighting its improved logical coherence and factual precision in multi-turn diagnostic dialogues.
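Ladder-Score is benchmark-specific and not defined in the article. Exact Match Accuracy, by contrast, is the standard fraction of predictions identical to the reference answer; a minimal sketch follows, with the caveat that TCM-Ladder's actual normalization rules are an assumption here:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the reference after trimming
    whitespace and lowercasing; the benchmark's rules may differ."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```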
Furthermore, Ladder-base showed strong generalization across seven core TCM disciplines: Diagnostics, Pharmacognosy, Surgery, Herbal Formulas, Internal Medicine, Pediatrics, and Fundamentals. It consistently outperformed all other models, achieving an average score of 0.7823, with notably large gains in Pharmacognosy and Pediatrics, indicating enhanced contextual reasoning in knowledge-intensive or symptom-dependent scenarios. Even in complex areas like Surgery and Herbal Formulas, Ladder-base maintained more consistent performance than the other models.
These findings suggest that GRPO offers an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains. The group-wise normalization in GRPO provides a more stable optimization process, allowing the model to better capture the implicit causal patterns in clinical reasoning. This approach enables the model to learn relative judgments, mirroring how experienced physicians evaluate diagnoses and treatment principles, leading to a more interpretable and clinically consistent decision-making process.
While this study marks a significant step forward, it focuses solely on text-based data. Future research aims to extend the GRPO framework to multimodal inputs, such as tongue, pulse, and herb images, along with patient interaction data, to enable even more comprehensive diagnostic reasoning. Long-term validation in real clinical environments will also be crucial to ensure the model’s safety, reliability, and ethical compliance before deployment in clinical decision support systems. For more in-depth information, you can refer to the full research paper here.


