TLDR: A new study used a large language model (Flan-T5-Large) to extract 13 social determinants of health (SDoH) from French clinical notes. The model performed well on well-documented SDoH categories and recovered far more information than structured EHR data, identifying at least one SDoH in 95.8% of patients compared to 2.8% via ICD-10 codes. Its performance was lower on full clinical notes than on social history sections, which points to generalization and French-specific NLP tooling as areas for future work on SDoH documentation and health disparities.
Social determinants of health (SDoH) strongly influence an individual’s health outcomes, affecting everything from disease progression to how well treatments work, and they contribute to health disparities. However, this information is often incomplete or missing in structured electronic health records (EHRs). This gap makes it difficult to see the full picture of a patient’s health and to address broader health inequalities.
A recent study tackles this challenge by proposing an innovative approach using large language models (LLMs) to extract 13 specific SDoH categories from French clinical notes. This is particularly significant because most existing research and tools for SDoH extraction using natural language processing (NLP) have focused on the English language, leaving a considerable void for other languages like French.
The researchers trained a model called Flan-T5-Large on annotated social history sections from clinical notes collected at Nantes University Hospital in France. The 13 SDoH categories targeted for extraction included living condition, marital status, descendants, employment status, occupation, tobacco use, alcohol use, drug use, housing, education, physical activity, income, and ethnicity/country of birth. The study evaluated the model’s performance at two levels: first, identifying SDoH categories and their associated values, and second, extracting detailed SDoH information, including temporal and quantitative data.
The model demonstrated strong performance in identifying well-documented SDoH categories such as living condition, marital status, descendants, occupation, and tobacco and alcohol use, achieving F1 scores above 0.80. This indicates its effectiveness in recognizing these common and consistently documented factors. However, performance was lower for categories like employment status, housing, physical activity, income, and education. The researchers attributed this to limited training data for these categories and the highly variable ways in which they are expressed in clinical notes.
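To make the F1 figures concrete, here is a minimal sketch of how a per-category F1 score can be computed by exact match on (note, category, value) triples. The function name, the matching rule, and the toy annotations are illustrative assumptions; the study's own evaluation protocol may differ.

```python
from collections import defaultdict


def per_category_f1(gold, predicted):
    """Compute an exact-match F1 score for each SDoH category.

    gold / predicted: iterables of (note_id, category, value) tuples.
    This matching rule is an assumption made for illustration.
    """
    gold_set, pred_set = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

    # A predicted triple is a true positive only if it appears in the gold set.
    for item in pred_set:
        counts[item[1]]["tp" if item in gold_set else "fp"] += 1
    # Gold triples the model never produced are false negatives.
    for item in gold_set - pred_set:
        counts[item[1]]["fn"] += 1

    scores = {}
    for category, c in counts.items():
        predicted_total = c["tp"] + c["fp"]
        gold_total = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_total if predicted_total else 0.0
        recall = c["tp"] / gold_total if gold_total else 0.0
        denom = precision + recall
        scores[category] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy annotations (invented for this example, not taken from the study).
gold = [
    (1, "tobacco use", "current"),
    (1, "marital status", "married"),
    (2, "tobacco use", "former"),
    (2, "housing", "apartment"),
]
predicted = [
    (1, "tobacco use", "current"),
    (1, "marital status", "married"),
    (2, "tobacco use", "current"),
]

scores = per_category_f1(gold, predicted)
for category, f1 in sorted(scores.items()):
    print(f"{category}: F1 = {f1:.2f}")
```

In this toy run, a category that is never predicted (housing) scores 0, which mirrors how sparsely documented categories can drag down results when the model has seen few training examples.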
One of the most compelling findings of the study was the comparison between the LLM’s extraction capabilities and traditional structured EHR data. The model successfully identified at least one SDoH for 95.8% of patients, a stark contrast to only 2.8% identified using ICD-10 codes from structured EHR data. This highlights the immense value of leveraging unstructured clinical notes, which often contain richer and more detailed SDoH information than coded fields.
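The patient-level comparison boils down to a coverage calculation: the share of patients for whom a given source yields at least one SDoH. The sketch below shows that arithmetic on invented toy data; the function name and the data layout are assumptions, and the numbers are not the study's.

```python
def sdoh_coverage(patient_ids, findings):
    """Return the fraction of patients with at least one SDoH finding.

    patient_ids: iterable of all patient identifiers in the cohort.
    findings: dict mapping patient id -> list of SDoH items found for
    that patient from one source (notes or coded fields).
    """
    patients = set(patient_ids)
    if not patients:
        return 0.0
    covered = {pid for pid in patients if findings.get(pid)}
    return len(covered) / len(patients)


# Toy cohort of five patients (illustrative only).
cohort = ["p1", "p2", "p3", "p4", "p5"]

# Hypothetical SDoH items extracted from free-text notes.
from_notes = {
    "p1": ["tobacco use"],
    "p2": ["marital status", "occupation"],
    "p3": ["alcohol use"],
    "p4": ["living condition"],
    "p5": [],
}

# Hypothetical SDoH items recorded as structured codes.
from_codes = {"p2": ["tobacco use"]}

notes_coverage = sdoh_coverage(cohort, from_notes)
codes_coverage = sdoh_coverage(cohort, from_codes)
print(f"notes: {notes_coverage:.1%}, codes: {codes_coverage:.1%}")
```

Note that coverage counts a patient once no matter how many SDoH are found, so it measures how many patients have any documentation at all and says nothing about how complete that documentation is.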
The study also shed light on some limitations. The model, trained exclusively on social history sections, showed a significant drop in performance when applied to full clinical notes. This suggests that while effective for specific sections, its generalization to broader clinical text needs further development. Errors were also linked to inconsistencies in human annotation, the reliance on an English-centric tokenizer that struggled with French characters, and the inherent challenges of converting complex natural language into a structured output format.
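The difficulty of turning generated text into structured records can be illustrated with a small parser. The sketch below assumes a hypothetical linearized output of the form "category: value; category: value", which is not necessarily the format used in the study, and validates each pair against the 13 categories listed earlier.

```python
# The 13 categories named in the article.
SDOH_CATEGORIES = {
    "living condition", "marital status", "descendants",
    "employment status", "occupation", "tobacco use", "alcohol use",
    "drug use", "housing", "education", "physical activity", "income",
    "ethnicity/country of birth",
}


def parse_model_output(text):
    """Split a generated string into validated SDoH records.

    Assumes a hypothetical 'category: value; category: value' format.
    Returns (records, rejected), where rejected holds chunks that do
    not fit the expected structure.
    """
    records, rejected = [], []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        category, sep, value = chunk.partition(":")
        category, value = category.strip().lower(), value.strip()
        if sep and value and category in SDOH_CATEGORIES:
            records.append({"category": category, "value": value})
        else:
            # Unknown category or missing value: keep it for error analysis.
            rejected.append(chunk)
    return records, rejected


# Invented example output, including two malformed chunks.
generated = "tobacco use: 10 cigarettes/jour; marital status: marié; hobbies: chess; alcohol use"
records, rejected = parse_model_output(generated)
print(records)
print(rejected)
```

Even this simple scheme has failure modes: a value that itself contains a semicolon would be split incorrectly, which gives a sense of why converting free-form clinical language into a fixed output format is error-prone.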
Despite these challenges, the research underscores the potential of NLP in improving the completeness of real-world SDoH data in non-English EHR systems. By making two of their four datasets publicly available, the researchers aim to foster further development and reproducibility in French SDoH extraction. Future work will focus on data augmentation, using synthetic clinical text, and releasing the model itself to support multilingual SDoH research. Ultimately, advancing automated SDoH extraction from unstructured clinical text can lead to more equitable healthcare by providing richer, more representative data for research, policy-making, and targeted public health interventions. For more details, refer to the full research paper.


