TLDR: ReXGroundingCT is the first publicly available dataset that links free-text radiology findings with pixel-level segmentations in 3D chest CT scans. It was created using a pipeline that involved standardizing radiology reports with GPT-4, extracting and categorizing abnormalities, and then manually annotating these findings in 3D CT volumes. This dataset is crucial for developing AI systems that can accurately localize and describe medical findings from narrative reports, enhancing explainability and clinical utility in medical imaging.
Artificial intelligence is rapidly changing healthcare, especially in interpreting complex medical images. A key challenge in this field is connecting the detailed, often free-form text in radiology reports with the exact locations of findings within 3D medical scans. Imagine a report describing a ‘3 mm nodule in the left lower lobe’ – for AI to truly assist clinicians, it needs to know precisely where that nodule is in the 3D image.
Addressing this critical gap, researchers have introduced ReXGroundingCT, the first publicly available dataset that links free-text radiology findings with pixel-level segmentations in 3D chest CT scans. Unlike previous datasets that relied on structured labels or predefined categories, ReXGroundingCT captures the full richness of clinical language and grounds it to specific 3D anatomical locations.
How ReXGroundingCT Was Built
The creation of ReXGroundingCT involved a systematic, multi-stage process:
First, the dataset was built on CT-RATE, a publicly available collection of 25,692 non-contrast 3D chest CT volumes and their corresponding radiology reports; 3,142 of these scans were selected for ReXGroundingCT.
Next, the original radiology reports, which had been written in Turkish and machine-translated into English, underwent a significant ‘rewriting’ phase: GPT-4 was used to standardize terminology and phrasing to match typical U.S. radiology practice while preserving every clinical detail. This step was essential for consistency and clarity.
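The authors' actual prompt and settings are not published, but the rewriting step can be illustrated with a minimal sketch against the OpenAI chat API. The prompt wording and parameters below are assumptions, not the authors' configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instructions; the authors' actual prompt is not public.
SYSTEM_PROMPT = (
    "You are a radiology report editor. Rewrite the following chest CT report "
    "using standard U.S. radiology terminology and phrasing. Preserve every "
    "clinical detail; do not add, remove, or reinterpret any finding."
)

def standardize_report(report_text: str) -> str:
    """Rewrite one machine-translated report into standardized U.S. phrasing."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
        temperature=0,  # deterministic output aids reproducibility
    )
    return response.choices[0].message.content
```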
Following the rewriting, a two-stage ‘abnormality extraction and categorization’ pipeline was applied: GPT-4 systematically analyzed the standardized reports to isolate distinct anatomical observations, and GPT-4o-mini then assigned each extracted finding to a hierarchical schema of 12 parent categories and 61 subcategories covering the full range of chest CT findings. Rigorous quality control confirmed very low rates of missing descriptors and false negatives.
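A sketch of the two-stage pipeline, again assuming the OpenAI chat API, is shown below. The prompts, JSON shapes, and example category names ("Lung parenchyma", "Nodule") are illustrative; the real 12-category/61-subcategory schema is defined by the authors:

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_findings(standardized_report: str) -> list[str]:
    """Stage 1 (GPT-4): isolate distinct anatomical observations."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Return a JSON array of strings, one per distinct abnormal "
                "finding in this chest CT report:\n\n" + standardized_report
            ),
        }],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def categorize_finding(finding: str) -> dict:
    """Stage 2 (GPT-4o-mini): map a finding onto the hierarchical schema."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Assign this chest CT finding one parent category and one "
                "subcategory, as JSON such as "
                '{"parent": "Lung parenchyma", "subcategory": "Nodule"}:\n\n'
                + finding
            ),
        }],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```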
Finally, the ‘annotation’ stage involved manual pixel-level segmentation of findings within the 3D CT volumes. The training set of 2,992 cases was annotated under two protocols: one by professional annotators with radiologist refinement, the other by medical students supervised by radiologists. The validation and test sets (100 and 50 cases, respectively) were annotated exclusively by board-certified radiologists to ensure the highest quality. Findings were excluded if they fell outside the scope (only lung and pleura findings were annotated), could not be localized, were infeasible to segment because of their diffuse nature, or represented normal structures.
What the Dataset Contains
The ReXGroundingCT dataset comprises 3,142 chest CT scans and 8,028 segmented findings, spanning a diverse array of pulmonary and pleural abnormalities. Because a single textual finding can describe several discrete lesions, these findings map to 16,301 separately segmented entities, approximately 79% of them focal abnormalities (like nodules) and 21% non-focal patterns (like diffuse interstitial changes). Each finding is linked to a precise 3D segmentation mask, enabling detailed spatial analysis. The dataset can be accessed at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.
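The repository can be fetched with the standard huggingface_hub client and the masks inspected with nibabel. The file names in this sketch are hypothetical; consult the dataset card for the actual layout:

```python
import nibabel as nib
import numpy as np
from huggingface_hub import snapshot_download

# Fetch the full dataset repository from the Hugging Face Hub.
local_dir = snapshot_download(
    repo_id="rajpurkarlab/ReXGroundingCT",
    repo_type="dataset",
)

# Hypothetical file names for one case; the repository's real
# structure may differ.
ct = nib.load(f"{local_dir}/train/case_0001/ct.nii.gz")
mask = nib.load(f"{local_dir}/train/case_0001/finding_01_mask.nii.gz")

ct_vol = ct.get_fdata()
mask_vol = mask.get_fdata().astype(bool)
assert ct_vol.shape == mask_vol.shape  # mask is voxel-aligned with the CT

# Physical volume of the segmented finding, from the voxel spacing.
voxel_mm3 = np.prod(mask.header.get_zooms())
print(f"finding volume: {mask_vol.sum() * voxel_mm3 / 1000:.1f} mL")
```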
Considerations and Future Directions
While ReXGroundingCT is a significant step forward, the researchers acknowledge some limitations. The training set annotations were not exclusively performed by board-certified radiologists, though rigorous quality control was in place. Also, in the training set, annotators were instructed to segment no more than three representative instances per finding, which means the spatial annotations are not always exhaustive. Lastly, the dataset focuses only on lung and pleural findings, limiting its applicability to other thoracic or abdominal abnormalities.
Despite these limitations, ReXGroundingCT is poised to advance multimodal medical AI by supporting two core tasks: ‘finding grounding’ (localizing a specific free-text finding in a 3D CT scan) and ‘grounded report generation’ (generating descriptive radiology reports with spatial references). This capability has direct clinical relevance: it could reduce the time radiologists spend correlating reports with images, facilitate clearer communication with other physicians, and improve patients' understanding of their health information. It also gives radiology trainees a valuable tool for developing essential spatial reasoning skills.
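The article does not name an official metric for the finding-grounding task, but overlap measures such as the Dice coefficient are the standard way to score a predicted 3D mask against a radiologist's annotation. A generic sketch:

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P & T| / (|P| + |T|): 1.0 is a perfect match, 0.0 no overlap."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float(2.0 * intersection / (pred.sum() + truth.sum() + eps))

# Score a model's mask for "3 mm nodule in the left lower lobe" against
# the ground-truth annotation (both binary arrays shaped (D, H, W)):
# print(dice_score(predicted_mask, ground_truth_mask))
```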
ReXGroundingCT sets a new benchmark for developing and evaluating models that can understand and connect complex clinical language to precise anatomical locations in 3D medical images, paving the way for more explainable and anatomically grounded AI systems in healthcare.