Advancing Pancreatic Cancer Detection with Integrated Electronic Health Records

TLDR: A new AI model leverages a multimodal approach, combining longitudinal diagnosis codes and laboratory measurements from electronic health records, to significantly improve the early detection of pancreatic cancer up to one year before clinical diagnosis. The method, which uses neural controlled differential equations and pre-trained language models with cross-attention, outperforms existing single-modality approaches and identifies both established and novel biomarkers associated with increased pancreatic cancer risk.

Pancreatic ductal adenocarcinoma (PDAC) stands as one of the most lethal cancers, primarily due to the significant challenge of early detection. Patients often receive a diagnosis at advanced or metastatic stages because symptoms typically do not manifest in the early phases of the disease. This grim reality contributes to a five-year survival rate of approximately 10%.

However, the increasing availability of longitudinal electronic health records (EHRs) presents an opportunity to improve PDAC detection across broader populations. Researchers have recently proposed a multimodal approach that identifies PDAC up to one year before clinical diagnosis by combining two kinds of EHR data: a patient’s historical diagnosis codes and routinely collected laboratory measurements.

The core of this approach is its ability to handle and combine different types of medical data. For irregularly sampled lab time series, the model employs neural controlled differential equations (NCDEs) to capture continuous physiological changes. For diagnosis code trajectories, it combines a pre-trained language model (BioGPT) with a recurrent network to learn rich representations of disease progression. A crucial element is cross-attention, which is designed to capture complex interactions between these two data modalities.
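
To make the NCDE idea concrete: a hidden state z evolves as dz = f(z) dX, where X is a continuous path drawn through the (time, lab value) observations. The sketch below is a minimal illustration and not the authors' implementation. It assumes linear interpolation between observations, one Euler step per observation gap, and a hand-written `toy_field` with fixed weights standing in for the learned network f. The function names and the lab readings are hypothetical.

```python
import math


def ncde_euler(times, values, z0, vector_field):
    """Euler-discretised controlled differential equation.

    times:        strictly increasing observation times (irregular gaps are fine)
    values:       one list of channel readings per observation time
    z0:           initial hidden state, a list of h floats
    vector_field: maps a hidden state (length h) to an h x (c + 1) matrix,
                  where c is the number of channels and the extra column is time
    """
    z = list(z0)
    for k in range(1, len(times)):
        # Increment of the control path X = (time, channels) between observations.
        dX = [times[k] - times[k - 1]] + [
            b - a for a, b in zip(values[k - 1], values[k])
        ]
        F = vector_field(z)
        # Euler update: z <- z + f(z) dX
        z = [
            zi + sum(fij * dxj for fij, dxj in zip(row, dX))
            for zi, row in zip(z, F)
        ]
    return z


def toy_field(z):
    # Hypothetical fixed weights standing in for a trained neural network.
    return [[math.tanh(0.5 * zi + 0.1 * j) for j in range(2)] for zi in z]


# Made-up readings of one lab channel at irregular times (in months).
times = [0.0, 1.0, 4.0]
values = [[5.0], [5.5], [7.0]]
final_state = ncde_euler(times, values, [0.0, 0.0], toy_field)
```

Because the update is driven by increments of the path rather than by a fixed clock, unevenly spaced measurements need no resampling onto a regular grid.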

Prior machine learning efforts for PDAC detection from EHRs often relied on a single data modality, either diagnosis codes or laboratory tests. While these methods showed some promise, they failed to capture the complementary information that exists when both are considered together. For instance, lab tests can reveal subtle, continuous physiological changes, such as rising glucose levels months before symptoms appear. In contrast, diagnostic codes capture clinically validated events like comorbidities and imaging findings that lab tests might miss. By integrating these, the new multimodal approach creates a more comprehensive representation of a patient’s disease history.

One challenge in combining lab measurements and diagnosis codes is that they differ in temporal resolution and structure. Lab tests form continuous-valued, irregularly sampled time series, while diagnosis codes are sparse, discrete clinical events. To overcome this, the researchers grouped lab measurements into clinically meaningful panels (e.g., metabolic, complete blood count, lipid, and liver panels) and modeled each panel as a multivariate irregular time series, which allows the model to learn system-specific dynamics. For diagnosis codes, context-specific embeddings are generated using BioGPT, and their temporal evolution is modeled with a bidirectional Long Short-Term Memory (Bi-LSTM) network.
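
The panel grouping step can be pictured as a small data-reshaping routine. The sketch below is an illustration under stated assumptions, not the paper's preprocessing: the panel membership in `PANELS`, the test names, the record layout `(day, test_name, value)`, and the sample values are all hypothetical. Unmeasured tests are kept as `None` rather than imputed.

```python
from collections import defaultdict

# Hypothetical panel membership, for illustration only.
PANELS = {
    "metabolic": ["glucose", "creatinine"],
    "liver": ["alt", "bilirubin"],
}


def build_panel_series(records, panels=PANELS):
    """Group flat lab records into one irregular multivariate series per panel.

    records: iterable of (day, test_name, value)
    returns: {panel: (times, rows)} where rows[i][j] is the value of the
             panel's j-th test at times[i], or None if it was not measured.
    """
    test_to_slot = {
        test: (panel, col)
        for panel, tests in panels.items()
        for col, test in enumerate(tests)
    }
    by_panel = defaultdict(dict)  # panel -> {day: row}
    for day, test, value in records:
        if test not in test_to_slot:
            continue  # test does not belong to any configured panel
        panel, col = test_to_slot[test]
        row = by_panel[panel].setdefault(day, [None] * len(panels[panel]))
        row[col] = value
    out = {}
    for panel, rows in by_panel.items():
        times = sorted(rows)
        out[panel] = (times, [rows[t] for t in times])
    return out


# Made-up records; "wbc" is ignored because no panel above lists it.
records = [
    (0, "glucose", 98.0),
    (0, "alt", 30.0),
    (45, "glucose", 110.0),
    (45, "creatinine", 0.9),
    (200, "bilirubin", 1.1),
    (3, "wbc", 6.0),
]
series = build_panel_series(records)
```

Each panel ends up with its own time axis, so a panel that is ordered rarely does not force gaps into a panel that is ordered often.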

The cross-attention fusion module then explicitly models inter-modal dependencies, allowing each modality to capture the most relevant features from the other. This sophisticated fusion mechanism significantly improves predictive performance compared to simpler methods like concatenation.
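
The difference between concatenation and cross-attention fusion can be sketched in plain Python. This is a simplified illustration, not the paper's module: it assumes single-head scaled dot-product attention with no learned projections, mean pooling before the final join, and toy two-dimensional embeddings. All names and values are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def concat_fusion(code_vecs, lab_vecs):
    # Baseline: each modality is summarised on its own, then joined.
    return mean_pool(code_vecs) + mean_pool(lab_vecs)


def cross_attention_fusion(code_vecs, lab_vecs):
    # Each modality queries the other before pooling.
    codes_ctx, _ = cross_attend(code_vecs, lab_vecs, lab_vecs)
    labs_ctx, _ = cross_attend(lab_vecs, code_vecs, code_vecs)
    return mean_pool(codes_ctx) + mean_pool(labs_ctx)


# Toy embeddings: two diagnosis-code steps and three lab panels.
code_vecs = [[1.0, 0.0], [0.0, 1.0]]
lab_vecs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

baseline = concat_fusion(code_vecs, lab_vecs)
fused = cross_attention_fusion(code_vecs, lab_vecs)
_, code_to_lab_weights = cross_attend(code_vecs, lab_vecs, lab_vecs)
```

In the baseline, neither modality influences how the other is summarised. With cross-attention, the weights in `code_to_lab_weights` show which lab panel each code step draws on, which is also the kind of quantity that can be inspected for interpretability.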

The approach was developed and evaluated on a real-world dataset of nearly 4,700 patients from OSF Saint Francis Medical Center, with up to 14 years of history. The results demonstrated significant improvements in Area Under the Curve (AUC) ranging from 6.5% to 15.5% over state-of-the-art methods for early detection at 6, 9, and 12 months prior to diagnosis. The model consistently outperformed baselines, even as the prediction window increased and cohort sizes decreased.
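
For readers unfamiliar with the metric, the sketch below computes AUC on a toy example, assuming AUC here means the area under the ROC curve. It uses the rank-based definition: the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative one, with ties counted as half. The labels and scores are made up for illustration and are not results from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = later diagnosed) and model risk scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 3 of 4 positive/negative pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so gains of several points can matter when the baseline is already well above chance.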

Beyond its predictive power, the model is also interpretable. It identified diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new potential biomarkers. For example, chronic and acute pancreatitis, known risk factors for PDAC, were among the top contributing diagnosis codes. The model also highlighted codes like “Acute bronchitis” and several skin-related conditions, suggesting potential indirect associations that warrant further clinical investigation. When both modalities were used, the model shifted its attention towards liver and metabolic panels, which are known to reflect early PDAC signals, suggesting that it learns clinically relevant patterns.

While this work marks a significant step forward, the researchers acknowledge limitations, such as the current fusion mechanism being at an aggregated representation level and the use of a single dataset. Future work aims to capture more fine-grained interactions between specific lab panels and diagnosis codes, develop dedicated approaches for learning disease trajectory representations from structured codes, and validate the models across more diverse datasets. For more detailed information, you can refer to the full research paper available at https://arxiv.org/pdf/2508.06627.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
