TLDR: Researchers developed an automated Retrieval-Augmented Generation (RAG) system, powered by LLaMA-4 109B, for evaluating radiotherapy treatment plans. The system combines a scoring module, a retrieval engine, and a clinical constraint checker, all orchestrated by the LLM. In end-to-end tests, its outputs agreed 100% with the values computed by the standalone modules, covering both percentile prediction and constraint identification. Plan similarity was driven mainly by numerical dose metrics.
A new study introduces an advanced AI system designed to automate and improve the evaluation of radiotherapy treatment plans. This system, powered by the LLaMA-4 109B large language model, aims to make the assessment process more efficient, consistent, and transparent for clinicians.
The Challenge of Radiotherapy Plan Evaluation
Radiotherapy is a critical cancer treatment that involves delivering precise doses of radiation to tumors while minimizing harm to surrounding healthy tissues. A crucial step in this process is evaluating the treatment plan to ensure its quality and clinical suitability. This evaluation is traditionally time-consuming and often involves subjective judgments that vary among clinicians. Statistical and mathematical methods have been developed to make the process more objective, but they often require manual adjustments, are limited to predefined protocols, and may not adapt well to different clinical settings or evolving guidelines.
Introducing the RAG System for Radiotherapy
To address these limitations, researchers have developed a Retrieval-Augmented Generation (RAG) system. This system combines the powerful language understanding and generation capabilities of large language models (LLMs) with external knowledge retrieval mechanisms. In this context, the RAG system for radiotherapy plan evaluation integrates three core modules:
- Scoring Module: This component calculates normalized dose metrics and determines population-based percentiles, providing a quantitative measure of plan quality.
- Retrieval Module: This module identifies similar historical treatment plans from a knowledge base of past cases, using both numerical and textual features to find the most relevant comparisons.
- Constraint-Checking Tool: This tool automatically flags any violations of protocol-defined clinical constraints, ensuring the plan adheres to safety and efficacy standards.
These tools are orchestrated by the LLaMA-4 109B model through a multi-step, prompt-driven reasoning pipeline. This approach allows the system to produce concise, grounded evaluations that are both protocol-aware and interpretable.
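To make the scoring and constraint-checking modules concrete, here is a minimal sketch of what each one computes. It is not the paper's implementation: the function names, the metric names, and the limit values are hypothetical and chosen only for illustration, and the percentile definition (share of historical scores at or below the plan's score) is an assumption.

```python
from bisect import bisect_right


def population_percentile(score, population_scores):
    """Percentage of historical scores that are less than or equal to `score`."""
    ordered = sorted(population_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


def check_constraints(metrics, upper_limits):
    """Return the names of metrics that exceed their protocol upper limit."""
    return [
        name
        for name, limit in upper_limits.items()
        if name in metrics and metrics[name] > limit
    ]


# Hypothetical plan metrics and limits (illustrative values, not clinical guidance).
plan_metrics = {"organ_a_max_dose_gy": 47.0, "organ_b_mean_dose_gy": 18.0}
upper_limits = {"organ_a_max_dose_gy": 45.0, "organ_b_mean_dose_gy": 20.0}

print(population_percentile(3.0, [1.0, 2.0, 3.0, 4.0, 5.0]))
print(check_constraints(plan_metrics, upper_limits))
```

Because both functions are deterministic, their outputs can be passed to the LLM as fixed facts rather than values it has to estimate itself.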
How the System Works
Upon receiving a new treatment plan, the system first computes its dose metrics and a summary score. The LLM then queries the retrieval module to get percentile estimates for the plan, comparing it to similar historical cases. Simultaneously, it uses the constraint-checking tool to identify any metrics that exceed clinical limits. With all this contextual information, the LLM generates a clear, human-readable summary that describes the plan’s quality based on its percentile ranking and lists any identified constraint violations. This modular design helps to minimize “hallucinations” (incorrect information generated by the LLM) and ensures that the outputs are traceable and aligned with clinical practice.
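The flow described above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the authors' pipeline: the two tools are passed in as plain callables, and the final LLM step is replaced by a deterministic template (`render_summary`, a hypothetical name) so that the example runs without a model.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def evaluate_plan(
    metrics: Metrics,
    estimate_percentile: Callable[[Metrics], float],
    find_violations: Callable[[Metrics], List[str]],
) -> Dict[str, object]:
    """Call both tools and collect their outputs as grounding context."""
    return {
        "percentile": estimate_percentile(metrics),
        "violations": find_violations(metrics),
    }


def render_summary(context: Dict[str, object]) -> str:
    """Template stand-in for the LLM's summary step; uses only tool outputs."""
    text = f"Estimated plan-quality percentile: {context['percentile']:.0f}."
    violations = context["violations"]
    if violations:
        return text + " Constraint violations: " + ", ".join(violations) + "."
    return text + " No constraint violations found."


# Hypothetical tool stubs standing in for the retrieval and constraint modules.
context = evaluate_plan(
    {"organ_a_max_dose_gy": 47.0},
    estimate_percentile=lambda m: 80.0,
    find_violations=lambda m: [name for name, value in m.items() if value > 45.0],
)
print(render_summary(context))
```

Since every number in the summary comes from the context dictionary, each statement can be traced back to the tool call that produced it.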
Key Findings and Performance
The research involved curating a multi-protocol dataset of 614 radiotherapy plans across four disease sites. The retrieval engine was tuned by comparing several SentenceTransformer backbones. The best configuration, based on the all-MiniLM-L6-v2 model, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a mean absolute error (MAE) below 2 percentile points. This means the system could very accurately find historical plans that closely matched the quality of a new plan.
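The two retrieval metrics reported here can be computed as follows. This is a minimal sketch that assumes each plan has a retrieval-based percentile estimate and a computed reference percentile; the function names and the sample numbers are illustrative and are not taken from the paper.

```python
def within_margin_accuracy(predicted, actual, margin=5.0):
    """Fraction of predictions within `margin` percentile points of the reference."""
    hits = sum(abs(p - a) <= margin for p, a in zip(predicted, actual))
    return hits / len(actual)


def mean_absolute_error(predicted, actual):
    """Average absolute difference, in percentile points."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Illustrative percentiles only.
predicted = [50.0, 62.0, 90.0]
actual = [52.0, 60.0, 80.0]

print(within_margin_accuracy(predicted, actual))
print(mean_absolute_error(predicted, actual))
```

Under this reading, "perfect accuracy within a 5-percentile-point margin" corresponds to `within_margin_accuracy` returning 1.0 on the evaluation set.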
When tested end-to-end, the RAG system achieved 100% agreement with the computed values from its standalone retrieval and constraint-checking modules. This confirms that the system reliably executes all its steps, from retrieving information and making predictions to identifying constraint violations. The study highlighted that numerical dose metrics played a dominant role in determining plan similarity, with textual descriptions contributing minimally. This suggests that structured clinical features are highly informative for risk estimation in this domain.
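One way to picture how numerical and textual features can both feed a single similarity score is a weighted blend of two cosine similarities. The article does not say how the system combines the two feature types, so the blending formula, the `numeric_weight` parameter, and the toy vectors below are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_similarity(num_a, num_b, text_a, text_b, numeric_weight=0.9):
    """Blend numeric and text similarity; the weight is a hypothetical knob."""
    numeric_part = numeric_weight * cosine(num_a, num_b)
    text_part = (1.0 - numeric_weight) * cosine(text_a, text_b)
    return numeric_part + text_part


# Toy vectors: identical dose metrics, unrelated text embeddings.
print(combined_similarity([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

With a high numeric weight, two plans with matching dose metrics stay highly similar even when their text embeddings differ, which mirrors the reported finding that textual descriptions contributed little.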
Implications for Clinical Practice
This system offers a transparent and scalable framework for evaluating radiotherapy plans. Its ability to provide traceable outputs and minimize hallucinations is crucial for building trust and acceptance among clinicians. The modular design also allows for flexible integration into existing clinical workflows and adaptation to evolving guidelines. Future work will include clinician-led validation studies to assess how well the system’s evaluations align with expert judgment and whether it can improve decision-making, especially in time-sensitive scenarios like adaptive treatment planning.
For more detailed information, you can refer to the full research paper available at arXiv.


