TLDR: RadOnc-GPT is an autonomous AI agent that uses large language models to retrieve and interpret patient data, accurately label complex clinical outcomes like cancer recurrence and osteoradionecrosis, and identify errors in existing medical records. It significantly improves the scalability, accuracy, and timeliness of patient outcomes research in radiation oncology by acting as both a labeler and an auditor of clinical data.
Tracking patient outcomes accurately and efficiently matters across healthcare, and especially in specialized fields like radiation oncology. Traditionally, this process has relied heavily on manual labeling, a method that struggles with scale, accuracy, and timeliness. A new research paper introduces RadOnc-GPT, an autonomous large language model (LLM)-based agent designed to overcome these limitations by independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes in real time.
The paper, titled “RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale,” was authored by Jason Holmes, Yuexing Hao, Mariana Borras-Osorio, Federico Mastroleo, Santiago Romero Brufau, Valentina Carducci, Katie M Van Abel, David M Routman, Andrew Y. K. Foong, Liv M Muller, Satomi Shiraishi, Daniel K Ebner, Daniel J Ma, Sameer R Keole, Samir H Patel, Mirek Fatyga, Martin Bues, Brad J Stish, Yolanda I Garces, Michelle A Neben Wittich, Robert L Foote, Sujay A Vora, Nadia N Laack, Mark R Waddle, and Wei Liu. Their work highlights a significant step forward in leveraging AI for clinical data management.
RadOnc-GPT is not just another chatbot; it is an autonomous agent that conducts multi-turn conversations and decides independently which functions to call and when to stop. Its architecture combines internal data resources, such as Mayo Clinic’s radiation oncology database, Aria (Varian Medical Systems), and enterprise electronic health record (EHR) systems like Epic, with external public data sources, including PubMed, ClinicalTrials.gov, and the National Cancer Institute (NCI) Common Terminology Criteria for Adverse Events (CTCAE), accessed via their public APIs.
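The decide-call-stop loop described here can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the tool names, their return values, and `scripted_policy` (a deterministic stand-in for the LLM's decision step) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tools; the real agent's function set is not listed in this article.
def get_demographics(patient_id: str) -> str:
    return f"demographics for {patient_id}"

def get_clinical_notes(patient_id: str) -> str:
    return f"clinical notes for {patient_id}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "get_demographics": get_demographics,
    "get_clinical_notes": get_clinical_notes,
}

History = List[Tuple[str, str]]  # (tool name, tool output) pairs

def scripted_policy(history: History) -> Optional[str]:
    """Stand-in for the LLM: pick the next uncalled tool, or None to stop."""
    called = {name for name, _ in history}
    for name in TOOLS:
        if name not in called:
            return name
    return None

def run_agent(
    patient_id: str,
    policy: Callable[[History], Optional[str]] = scripted_policy,
    max_turns: int = 10,
) -> History:
    """Loop until the policy decides to stop or the turn budget runs out."""
    history: History = []
    for _ in range(max_turns):
        tool_name = policy(history)
        if tool_name is None:  # the agent chose to stop
            break
        history.append((tool_name, TOOLS[tool_name](patient_id)))
    return history

history = run_agent("P001")
```

In a real agent the policy would be a model call that reads the conversation so far. The `max_turns` cap is a common safeguard against loops that never terminate, added here as an assumption.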
A key distinction of RadOnc-GPT’s design is its departure from conventional retrieval-augmented generation (RAG). Instead of relying on vector similarity for poorly organized data, it leverages the systematically organized and indexed nature of patient data within Epic. This allows for targeted, well-structured data retrieval through a large set of highly specific functions, ensuring that the model receives relevant information without being overwhelmed.
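To make the contrast with vector-similarity retrieval concrete, here is a minimal sketch of a targeted retrieval function over an indexed store. The `Report` type, the in-memory `STORE`, and `get_reports` are hypothetical stand-ins for illustration; they do not reflect Epic's actual interfaces or the paper's function set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class Report:
    patient_id: str
    kind: str          # e.g. "pathology" or "radiology"
    signed_on: date
    text: str

# Toy in-memory store standing in for an indexed EHR; every record is made up.
STORE: List[Report] = [
    Report("P001", "pathology", date(2021, 3, 2), "Biopsy: benign."),
    Report("P001", "radiology", date(2022, 6, 9), "CT: no change."),
    Report("P002", "pathology", date(2020, 1, 5), "Biopsy: carcinoma."),
]

def get_reports(patient_id: str, kind: str, since: Optional[date] = None) -> List[Report]:
    """Exact-match lookup by patient and report type, optionally bounded by date.

    No embeddings or similarity scores are involved: the filter keys decide
    what is returned, so the result is predictable and narrowly scoped.
    """
    hits = [
        r for r in STORE
        if r.patient_id == patient_id
        and r.kind == kind
        and (since is None or r.signed_on >= since)
    ]
    return sorted(hits, key=lambda r: r.signed_on)

pathology = get_reports("P001", "pathology")
```

Because each function is narrow, the model only has to choose which one to call and with what arguments. It does not have to sift through loosely related passages.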
RadOnc-GPT was evaluated with a two-tier strategy. The first tier, a structured quality assurance (QA) task, assessed the agent’s ability to accurately retrieve demographic and radiotherapy treatment plan details. This foundational step established trust in its structured-data retrieval capabilities. In this tier, RadOnc-GPT matched all six demographic fields for 500 patients (100%) and reproduced radiation-course counts in 497 of 500 cases (99.4%).
The second tier involved more complex clinical outcomes labeling. Here, RadOnc-GPT autonomously combined structured EHR data with unstructured clinical notes, radiology, and pathology reports to determine outcomes such as mandibular osteoradionecrosis (ORN) in head-and-neck cancer patients and cancer recurrence in independent prostate and head-and-neck cancer cohorts. Ground-truth labels, initially generated by expert radiation oncologists, were used for comparison. Crucially, discrepancies between RadOnc-GPT’s outputs and these ground-truth labels underwent independent adjudication by other radiation oncologists.
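The compare-then-adjudicate workflow can be sketched as follows. The patient IDs, labels, and verdicts are invented for illustration, and the helper names are hypothetical; the paper's actual adjudication procedure may differ in its details.

```python
from typing import Dict, List

Labels = Dict[str, bool]  # patient ID -> outcome present (True) or absent (False)

def find_discrepancies(agent: Labels, truth: Labels) -> List[str]:
    """Patients whose agent label disagrees with the ground-truth label."""
    return sorted(pid for pid in truth if agent[pid] != truth[pid])

def apply_adjudication(truth: Labels, verdicts: Labels) -> Labels:
    """Return a corrected copy of the ground truth using adjudicator verdicts."""
    corrected = dict(truth)
    corrected.update(verdicts)
    return corrected

def accuracy(agent: Labels, truth: Labels) -> float:
    return sum(agent[pid] == truth[pid] for pid in truth) / len(truth)

# Made-up labels for four patients.
agent_labels = {"A": True, "B": False, "C": True, "D": False}
ground_truth = {"A": True, "B": True, "C": False, "D": False}

to_review = find_discrepancies(agent_labels, ground_truth)

# Hypothetical verdicts: for B the original label stands (agent was wrong);
# for C the original label was an error (agent was right).
verdicts = {"B": True, "C": True}
final_truth = apply_adjudication(ground_truth, verdicts)

before = accuracy(agent_labels, ground_truth)
after = accuracy(agent_labels, final_truth)
```

Only the disagreements go to a human reviewer, so the review workload is the size of the discrepancy list, not the whole cohort.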
In the clinical outcomes labeling tier, accuracy rose in all three tasks after adjudication. ORN determination (233 patients) went from 84.5% to 95.2%, prostate cancer recurrence detection (80 patients) from 92.5% to 95.0%, and head-and-neck recurrence detection (82 patients) from 92.7% to 96.3%. Among 48 initial discrepancies across these tasks, adjudication found 30 (63%) to be previously unrecognized ground-truth errors, highlighting RadOnc-GPT’s dual capacity as both a labeler and an auditor of existing data.
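As a quick arithmetic check, the reported percentages agree with the discrepancy counts. The sketch below reconstructs per-task counts by rounding the published percentages, which is an assumption: the article gives only the totals, not the per-task counts.

```python
# (task, patients, accuracy before adjudication, accuracy after), as reported above.
TASKS = [
    ("ORN", 233, 0.845, 0.952),
    ("Prostate recurrence", 80, 0.925, 0.950),
    ("Head-and-neck recurrence", 82, 0.927, 0.963),
]

def discrepancies(n_patients: int, acc: float) -> int:
    """Disagreements implied by an accuracy figure (reconstructed by rounding)."""
    return n_patients - round(n_patients * acc)

initial = sum(discrepancies(n, before) for _, n, before, _ in TASKS)
remaining = sum(discrepancies(n, after) for _, n, _, after in TASKS)
resolved = initial - remaining       # disagreements where the agent was upheld
share = resolved / initial

print(initial, remaining, resolved, f"{share:.1%}")  # 48 18 30 62.5%
```

The reconstruction gives 36, 6, and 6 initial discrepancies for the three tasks, 48 in total. Of these, 30 disappear after adjudication, which is 62.5% and matches the reported 63% after rounding.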
The study concludes that RadOnc-GPT reliably retrieves foundational structured data and generalizes across complex clinical outcome labeling tasks, notably by using a single cancer recurrence detection prompt across multiple disease sites. Its high recall minimizes clinically critical false negatives, and its ability to identify latent errors improves registry data integrity. This autonomous LLM agent promises to enable scalable, trustworthy, real-time curation of radiation-oncology research datasets, allowing clinicians to focus on judgment rather than data wrangling.
For more detailed information, you can read the full research paper here: RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale.


