TL;DR: SynLLM is a new framework that uses Large Language Models (LLMs) and structured prompt engineering to generate high-quality, privacy-preserving synthetic medical tabular data. It employs four prompt types, from example-based to rule-based, to control data generation without fine-tuning LLMs. Evaluations across 20 open-source LLMs and three medical datasets show that rule-based prompts achieve the best privacy-quality balance, demonstrating that carefully designed prompts are key to creating clinically plausible and privacy-aware synthetic data useful for healthcare research.
Access to real-world medical data is often a significant hurdle for healthcare research and the development of AI solutions. Strict privacy regulations, like HIPAA and GDPR, are crucial for patient confidentiality but can limit data availability. This is where synthetic data comes in, offering a promising alternative by allowing researchers to train and validate machine learning models without exposing sensitive patient records.
While various methods exist for generating synthetic data, including Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), they often come with limitations such as mode collapse, computational intensity, or a tendency to oversimplify rare but important clinical conditions. More recently, Large Language Models (LLMs) have emerged as a new avenue for structured data generation, but current approaches often lack systematic ways to guide these models and comprehensive evaluation methods.
Addressing these challenges, researchers have introduced SynLLM, a new framework designed to generate high-quality synthetic medical tabular data. SynLLM leverages 20 state-of-the-art open-source LLMs, including popular models like LLaMA, Mistral, and GPT variants. What makes SynLLM unique is its reliance on structured prompt engineering to guide the LLMs, eliminating the need for complex model fine-tuning.
How SynLLM Works: The Power of Prompts
SynLLM proposes four distinct types of prompts, each designed to control the data generation process by encoding different levels of information (a minimal sketch of each appears after this list):
- SEED_EX (Example-Seed Minimal Prompt): This is the most basic prompt, providing column headers and a few example rows from real data. While it helps establish a baseline for model generalization, it carries the highest risk of the LLM memorizing and inadvertently replicating sensitive information.
- FEATDESC (Feature-Description Prompt): Instead of examples, this prompt provides natural-language definitions for each data attribute (e.g., “bmi: body-mass index in kg/m²”). This helps guide the model to generate realistic values within expected ranges.
- STATGUIDE (Statistical-Metadata Prompt): Building on FEATDESC, this prompt includes statistical summaries like means, standard deviations, min-max bounds, and category frequencies. This helps the LLM generate data that closely matches the statistical distributions of real data.
- CLIN_RULE (Clinically-Constrained Prompt): This is the most advanced and privacy-conscious prompt. It completely removes example records and instead provides declarative logic rules derived from medical guidelines (e.g., “If pregnant=True, then sex=Female”). The LLM is then required to generate data that adheres to these clinical constraints, significantly reducing privacy risk.
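To make these four styles concrete, here is a minimal Python sketch of how such prompts might be assembled from a pandas DataFrame. The function names, template wording, and rule format are illustrative assumptions for this post, not the paper's exact prompt text:

```python
import pandas as pd

def seed_ex_prompt(df: pd.DataFrame, n_examples: int = 3) -> str:
    # SEED_EX: column headers plus a few literal real rows (highest privacy risk).
    header = ", ".join(df.columns)
    rows = "\n".join(", ".join(map(str, r))
                     for r in df.head(n_examples).itertuples(index=False))
    return f"Generate new rows with columns: {header}\nExamples:\n{rows}"

def featdesc_prompt(descriptions: dict) -> str:
    # FEATDESC: a natural-language definition per attribute, no real rows.
    lines = "\n".join(f"- {col}: {desc}" for col, desc in descriptions.items())
    return "Generate rows for a medical table with these fields:\n" + lines

def statguide_prompt(df: pd.DataFrame, descriptions: dict) -> str:
    # STATGUIDE: FEATDESC plus per-column statistical summaries.
    stats = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            s = df[col]
            stats.append(f"- {col}: mean={s.mean():.2f}, std={s.std():.2f}, "
                         f"min={s.min()}, max={s.max()}")
        else:
            freqs = df[col].value_counts(normalize=True).round(2).to_dict()
            stats.append(f"- {col}: category frequencies {freqs}")
    return featdesc_prompt(descriptions) + "\nMatch these statistics:\n" + "\n".join(stats)

def clin_rule_prompt(descriptions: dict, rules: list) -> str:
    # CLIN_RULE: no example records at all, only field definitions
    # plus declarative clinical logic rules the output must satisfy.
    rule_text = "\n".join(f"- {r}" for r in rules)
    return featdesc_prompt(descriptions) + "\nEvery generated row must satisfy:\n" + rule_text
```

Read top to bottom, the sketch progressively replaces real records with descriptions, statistics, and rules, which is exactly the privacy-quality dial the framework explores.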
This systematic approach to prompt design allows for a controlled exploration of the trade-off between data quality and privacy. By operating exclusively on non-identifiable summaries and domain rules, SynLLM aims to reduce disclosure risk while still leveraging the LLMs’ vast knowledge.
Comprehensive Evaluation for Trustworthy Data
The SynLLM framework includes a rigorous evaluation pipeline that assesses the generated synthetic data across four critical dimensions (a simplified sketch of several of these checks follows the list):
- Statistical Fidelity: This checks how well the synthetic data matches the statistical properties of the original real data, including marginal and joint distributions.
- Clinical Consistency: A rule engine validates synthetic records against known medical and physiological constraints to ensure they are clinically plausible.
- Privacy Protection: Disclosure risk is evaluated using metrics that estimate how likely synthetic records are to closely resemble or directly duplicate real ones.
- Machine Learning Utility: This assesses whether models trained on the synthetic data perform comparably to those trained on real data, ensuring the synthetic data is useful for downstream analytical tasks.
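To illustrate what such a pipeline can look like in practice, here is a simplified Python sketch of the first three checks. The specific metric choices (per-column Kolmogorov-Smirnov statistics for fidelity, predicate rules for clinical consistency, and nearest-neighbor distances for disclosure risk) are common stand-ins assumed here for illustration, not necessarily the paper's exact implementations:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors

def statistical_fidelity(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    # Per-column Kolmogorov-Smirnov statistic: 0 means identical marginals.
    return {c: ks_2samp(real[c], synth[c]).statistic
            for c in real.select_dtypes("number").columns}

def clinical_consistency(synth: pd.DataFrame, rules) -> float:
    # Fraction of synthetic rows that satisfy every rule;
    # each rule is a predicate mapping a row to True/False.
    ok = synth.apply(lambda row: all(rule(row) for rule in rules), axis=1)
    return float(ok.mean())

def disclosure_risk(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Median distance from each synthetic record to its nearest real record;
    # very small distances flag potential memorization or near-duplication.
    cols = real.select_dtypes("number").columns
    nn = NearestNeighbors(n_neighbors=1).fit(real[cols].to_numpy())
    dists, _ = nn.kneighbors(synth[cols].to_numpy())
    return float(np.median(dists))

# Example rule mirroring the CLIN_RULE constraint quoted earlier:
rules = [lambda row: (not row.get("pregnant", False)) or row.get("sex") == "Female"]
```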
Key Findings and Insights
The evaluation of SynLLM across three public medical datasets (Diabetes, Cirrhosis, and Stroke) and 20 open-source LLMs yielded significant insights. The results clearly show that prompt engineering plays a crucial role in determining the quality and privacy risk of the generated data. Notably, the rule-based CLIN_RULE prompts consistently achieved the best balance between privacy and data quality, even without relying on any example records from real data. This highlights that well-designed, constraint-based prompting can lead to high-quality outputs with minimal privacy risk.
Models like OpenChat 7B, Zephyr 7B, and Nous Hermes 34B consistently performed well across various metrics. The study also confirmed that SynLLM-generated data retains enough structure to support meaningful predictive tasks, with models trained on synthetic data performing comparably to those trained on real data in many cases.
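A common way to quantify that last point is the "train on synthetic, test on real" (TSTR) protocol. The sketch below, which assumes a binary target column and already-encoded numeric features, is one plausible implementation of the idea rather than the paper's exact setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_utility(real_train, real_test, synth, target: str) -> tuple:
    # Train one model on real data and one on synthetic data,
    # then evaluate both on the same held-out real test set.
    X_test, y_test = real_test.drop(columns=target), real_test[target]
    aucs = []
    for train in (real_train, synth):
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train.drop(columns=target), train[target])
        aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    return aucs[0], aucs[1]  # (real-trained AUC, synthetic-trained AUC)
```

If the two AUC scores are close, the synthetic data preserves enough predictive signal to stand in for the real records in downstream modeling.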
Looking Ahead
SynLLM represents a significant step forward in generating high-fidelity, clinically plausible, and privacy-preserving synthetic medical data using large language models. Its prompt-driven approach simplifies deployment and model reuse compared to traditional methods. Future work may explore adaptive prompt optimization strategies and expand support for multimodal electronic health records, further enhancing privacy and utility in healthcare research. For more in-depth details, you can refer to the full research paper: SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering.