TLDR: A new benchmark systematically compares three ways to represent Electronic Health Records (EHRs) for clinical prediction: multivariate time-series, event streams, and textual event streams for LLMs. It evaluates these methods across ICU and longitudinal care tasks using MIMIC-IV and EHRSHOT datasets. The study finds that event stream models consistently perform best, with pre-trained models excelling in few-shot settings and simpler count models being strong with ample data. LLMs are competitive for short-term ICU predictions. Crucially, feature selection strategies must adapt to the clinical setting: pruning sparse features helps ICU tasks, while retaining them is vital for longitudinal tasks.
Electronic Health Records (EHRs) hold rich patient data that deep learning models can use to predict clinical outcomes such as mortality or disease progression. Using this data well is difficult because patient information can be organized and presented to models in many different ways. Because evaluation methods have been inconsistent across studies, it has been unclear which approach works best.
A new study titled “CROSS-REPRESENTATION BENCHMARKING IN TIME-SERIES ELECTRONIC HEALTH RECORDS FOR CLINICAL OUTCOME PREDICTION” by Tianyi Chen, Mingcheng Zhu, Zhiyao Luo, and Tingting Zhu from the University of Oxford addresses this issue. The work introduces the first systematic benchmark to compare different EHR representation methods under standardized data curation and evaluation.
Understanding EHR Representations
The researchers investigated three primary ways to represent EHR data:
- Multivariate Time-Series: This is a traditional approach where patient data, such as vital signs and lab results, are organized into a matrix over fixed time intervals. Imagine a spreadsheet where each row is a time point and each column is a clinical measurement.
- Event Streams: This method treats a patient’s record as a chronological sequence of individual clinical events, each with a timestamp, code, and value. This allows models to learn directly from the irregular timing of events, rather than from pre-aggregated bins.
- Textual Event Streams: Building on event streams, this approach converts clinical events into descriptive sentences, maintaining their temporal order. This natural language format is designed to be processed by large language models (LLMs).
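To make the three representations concrete, here is a minimal Python sketch that renders the same toy record in each form. It is an illustration under stated assumptions, not the paper's pipeline: the `Event` class, the code names (`heart_rate`, `lactate`), the one-hour bin width, the mean aggregation, and the sentence template are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One clinical event (hypothetical schema, not the paper's)."""
    hours: float            # time since admission, in hours
    code: str               # made-up code name, e.g. "heart_rate"
    value: Optional[float]  # numeric value, or None if the event has none


# 1) Event stream: a chronological list of timestamped events.
events = [
    Event(0.2, "heart_rate", 88.0),
    Event(0.7, "heart_rate", 92.0),
    Event(1.5, "lactate", 2.1),
    Event(2.1, "heart_rate", 101.0),
]


def to_time_series(events, codes, bin_hours, n_bins):
    """2) Multivariate time-series: mean value per (time bin, code).

    None marks a bin in which the code was never measured.
    """
    col = {code: j for j, code in enumerate(codes)}
    sums = [[0.0] * len(codes) for _ in range(n_bins)]
    counts = [[0] * len(codes) for _ in range(n_bins)]
    for e in events:
        i = int(e.hours // bin_hours)
        if e.value is None or e.code not in col or not 0 <= i < n_bins:
            continue
        sums[i][col[e.code]] += e.value
        counts[i][col[e.code]] += 1
    return [
        [s / n if n else None for s, n in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]


def to_text(events):
    """3) Textual event stream: one short line per event, in time order."""
    lines = []
    for e in sorted(events, key=lambda e: e.hours):
        value = "" if e.value is None else f" = {e.value:g}"
        lines.append(f"[{e.hours:.1f}h] {e.code}{value}")
    return "\n".join(lines)


matrix = to_time_series(events, ["heart_rate", "lactate"], bin_hours=1.0, n_bins=3)
text = to_text(events)
```

Note how binning averages the two heart-rate readings in the first hour into one cell and leaves empty cells wherever nothing was measured, while the event stream and its textual form keep every event and its exact timing.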
The Benchmark and Tasks
To ensure a fair comparison, the benchmark standardized data curation and evaluation across two distinct clinical settings and four prediction tasks:
- MIMIC-IV Dataset (ICU Setting): Used for predicting ICU mortality (whether a patient will die during an ICU stay) and ICU phenotyping (identifying acute care conditions).
- EHRSHOT Dataset (Longitudinal Care Setting): Used for predicting 30-day readmission and 1-year pancreatic cancer diagnosis.
For each representation, appropriate modeling families were evaluated: Transformers, MLPs, LSTMs, and RETAIN for time-series data; CLMBR and count-based models for event streams; and various LLMs in the 8-20B parameter range for textual streams.
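As a rough illustration of what a count-based model sees, the sketch below turns each patient's event codes into a fixed-length vector of code counts. The patient IDs, code names, and the `count_features` helper are hypothetical; the article does not specify how the benchmark builds its count features or which classifier consumes them.

```python
from collections import Counter


def count_features(patient_codes, vocabulary):
    """Return how often each vocabulary code occurs in one patient's record."""
    counts = Counter(patient_codes)
    return [counts[code] for code in vocabulary]


# Toy event streams reduced to their codes (made-up names).
patients = {
    "p1": ["lab/glucose", "lab/glucose", "dx/diabetes"],
    "p2": ["lab/glucose", "rx/metformin"],
}

# Shared, sorted vocabulary so every patient maps to the same columns.
vocabulary = sorted({code for codes in patients.values() for code in codes})

features = {
    pid: count_features(codes, vocabulary) for pid, codes in patients.items()
}
```

The resulting vectors discard event order and timing entirely, which is what makes this kind of baseline simple; any standard tabular classifier could then be trained on them.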
Key Findings and Practical Guidance
The experiments yielded several crucial insights:
- Event Stream Models Lead the Way: Overall, models trained with the event stream representation consistently delivered the strongest performance across tasks and datasets. This suggests that capturing the irregular timing and sequence of events is highly effective.
- Pre-trained Models vs. Simple Counts: Pre-trained models like CLMBR proved to be highly efficient in “few-shot” settings, meaning they performed well even with limited training data. However, simpler count-based models could surpass CLMBR when sufficient data was available for training.
- LLMs Show Promise in Specific Scenarios: Large Language Models, when fed textual event streams, were competitive in short-term, high-frequency clinical scenarios like ICU mortality prediction. However, they sometimes struggled with the long and sparse contexts found in longitudinal care, where structured event-stream models performed better.
- The Importance of Feature Selection: The study also highlighted that how features are selected significantly impacts performance. For ICU tasks, pruning sparse features (those with a lot of missing data) actually improved predictions, as these variables often added noise due to redundant measurements. Conversely, for longitudinal care tasks, retaining these sparse but potentially informative features was critical, as clinicians typically only record essential measurements over longer periods.
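The feature-selection finding can be sketched as a filter on each feature's missing rate. This is a minimal example under assumptions: the feature names, the toy values, and the 0.5 threshold are hypothetical, and the article does not state the thresholds or selection procedure the study used.

```python
def missing_rate(rows, j):
    """Fraction of rows in which column j has no recorded value (None)."""
    return sum(row[j] is None for row in rows) / len(rows)


def prune_sparse(rows, names, max_missing):
    """Keep only the columns whose missing rate is at most max_missing."""
    keep = [j for j in range(len(names)) if missing_rate(rows, j) <= max_missing]
    pruned_rows = [[row[j] for j in keep] for row in rows]
    return pruned_rows, [names[j] for j in keep]


names = ["heart_rate", "lactate", "rare_lab"]
rows = [
    [88.0, 2.1, None],
    [92.0, None, None],
    [101.0, 1.8, None],
    [95.0, 2.4, 0.3],
]

# ICU-style choice: drop features that are missing in most rows.
icu_rows, icu_names = prune_sparse(rows, names, max_missing=0.5)

# Longitudinal-style choice: a threshold of 1.0 retains every feature.
long_rows, long_names = prune_sparse(rows, names, max_missing=1.0)
```

With these toy values, `rare_lab` is missing in three of four rows, so the stricter threshold removes it while the permissive one keeps it. Treating the threshold as a setting to tune per clinical context reflects the study's guidance.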
Conclusion
This benchmark offers practical guidance for healthcare professionals and AI developers. It clarifies which EHR representation methods are most effective depending on the clinical context and the amount of available data. The findings suggest that event-stream models are generally robust, while the utility of time-series models and LLMs varies with the prediction task and data characteristics. Future research aims to expand the benchmark to other clinical settings and tasks, further refining our understanding of how best to use EHR data.


