TLDR: This research introduces SynthEHR-Eviction, a new pipeline that uses large language models (LLMs), human input, and automated prompt optimization to extract eviction statuses from electronic health records (EHRs). The researchers used it to create the largest public dataset of eviction-related social determinants of health (SDoH), covering 14 fine-grained categories. LLMs fine-tuned on this synthetic data reached a Macro-F1 of 88.8% for eviction detection, outperforming other models, while the workflow cut annotation time by over 80%. The pipeline enables scalable, cost-effective, and interpretable detection of eviction risks, which is crucial for integrating social factors into healthcare.
Social determinants of health (SDoH) are the conditions in which people are born, grow, live, work, and age, profoundly influencing health outcomes. While clinical indicators are crucial, factors like housing, economic stability, and access to food can account for a significant portion of an individual’s health. Integrating this information into healthcare is vital for personalized care and effective public health interventions.
Among the many SDoH categories, eviction stands out as a highly impactful yet often overlooked factor. Eviction can trigger a cascade of negative consequences, including housing instability, unemployment, homelessness, and mental health issues. Despite its profound public health implications, information about eviction is rarely systematically coded in electronic health records (EHRs), often buried within unstructured clinical notes. This makes it challenging for healthcare providers and policymakers to identify and address eviction-related risks effectively.
To bridge this gap, researchers have introduced a scalable information extraction pipeline called SynthEHR-Eviction. The system combines large language models (LLMs), human expertise, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. The goal is to improve how eviction-related SDoH data is captured and used in healthcare.
Using this pipeline, the researchers created the largest public dataset of eviction-related SDoH to date. This dataset comprises 14 detailed categories, including nuanced eviction statuses such as “Eviction Absent,” “Eviction Pending,” and “Mutual Rescission History,” alongside other related SDoH categories like homelessness and housing instability. This rich dataset provides a robust foundation for training advanced AI models.
Models trained on SynthEHR-Eviction performed well. Fine-tuned LLMs such as Qwen2.5 and LLaMA3 outperformed other advanced models like GPT-4o-APO and BioBERT at detecting eviction statuses, reaching Macro-F1 scores of 88.8% for eviction detection and 90.3% for other SDoH categories on human-validated data. This demonstrates the effectiveness of the SynthEHR-Eviction dataset in enabling high-performing and cost-effective AI solutions.
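Macro-F1 is the unweighted mean of per-class F1 scores, so rare categories count as much as common ones. The sketch below is only an illustration of how the metric is computed, not the paper's evaluation code: the function name `macro_f1` and the toy labels and predictions are hypothetical, although the category names themselves come from the dataset described above.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # missed the true label g
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: one "Eviction Pending" note is misread as "Eviction Absent".
gold = ["Eviction Absent"] * 4 + ["Eviction Pending"] * 2
pred = ["Eviction Absent"] * 4 + ["Eviction Pending", "Eviction Absent"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy = {accuracy:.3f}, macro-F1 = {macro_f1(gold, pred):.3f}")
```

In this toy case a single error on the rarer class pulls Macro-F1 below plain accuracy, which is why the metric is a common choice for imbalanced label sets such as fine-grained eviction statuses.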
One of the most significant advantages of the SynthEHR-Eviction pipeline is its efficiency. It dramatically reduces the human effort required for data annotation. Traditional manual annotation of complex SDoH categories is labor-intensive. In contrast, this GPT-assisted, human-in-the-loop workflow achieved comparable data quality with over an 80% reduction in annotation time. This efficiency accelerates dataset creation and enables scalable eviction detection, making it a practical solution for real-world healthcare settings.
The research also explored the impact of including explicit reasoning annotations in the training data. Smaller LLMs benefited significantly from these reasoning explanations, which improved both their performance and their transparency. This means that more resource-efficient models can achieve high accuracy when guided by clear decision logic, making them suitable for deployment in environments with limited computing resources.
While the synthetic data generated by the pipeline is high-quality, the study highlighted the importance of incorporating real-world clinical notes for better generalization. Models performed best on synthetic data, moderately on real-world EHRs (MIMIC), and faced more challenges with academic case reports (PMC) due to their length and complex narrative style. Including even a small proportion of real-world examples in training data substantially improved the models’ ability to generalize to diverse clinical documentation.
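One way to picture mixing a small share of real notes into a synthetic training set is sketched below. This is a minimal illustration under stated assumptions, not the authors' procedure: the function `mix_training_data`, the 5% default for `real_fraction`, and the placeholder note strings are all hypothetical, and the summary above does not state the proportion the study used.

```python
import random


def mix_training_data(synthetic, real, real_fraction=0.05, seed=0):
    """Return a shuffled training list in which roughly `real_fraction`
    of the examples are sampled from `real` (hypothetical helper)."""
    if not 0 <= real_fraction < 1:
        raise ValueError("real_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve k / (len(synthetic) + k) = real_fraction for k, the number of real notes.
    k = round(real_fraction * len(synthetic) / (1 - real_fraction))
    k = min(k, len(real))  # cannot sample more real notes than exist
    mixed = list(synthetic) + rng.sample(list(real), k)
    rng.shuffle(mixed)
    return mixed


# Placeholder examples standing in for annotated notes.
synthetic_notes = [f"synthetic-{i}" for i in range(95)]
real_notes = [f"real-{i}" for i in range(20)]

train = mix_training_data(synthetic_notes, real_notes, real_fraction=0.05)
n_real = sum(note.startswith("real-") for note in train)
print(len(train), n_real)
```

Fixing the seed keeps the sampled subset reproducible, which makes it easier to compare models trained with and without the real-world portion.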
Despite these advancements, challenges remain, particularly in temporal reasoning: distinguishing between historical and current eviction events based on subtle cues in free-text notes. Future work will focus on improving models’ ability to understand the timing of events to enhance accuracy further.
In summary, SynthEHR-Eviction offers a scalable and clinically grounded approach to enhancing the detection of eviction-related SDoH in unstructured clinical notes. By providing a high-fidelity dataset, reducing annotation effort, and enabling the development of accurate and interpretable AI models, this work paves the way for better integration of social context into healthcare delivery, ultimately supporting personalized care and public health interventions. For more details, you can refer to the full research paper: SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data.


