Unmasking Privacy Risks in Synthetic Tabular Data: A New Attack Method Revealed

TLDR: MIA-EPT is a novel black-box membership inference attack designed for tabular diffusion models. It identifies whether a record was used in a model’s training by analyzing prediction errors when reconstructing masked attributes from synthetic data. The attack achieved strong results in the MIDST 2025 competition, demonstrating significant privacy leakage in state-of-the-art synthetic tabular data and highlighting the need for improved privacy defenses.

Synthetic data generation has emerged as a powerful tool, especially in sensitive sectors like healthcare and finance, allowing organizations to share and utilize data while aiming to protect individual privacy. However, a critical question remains: how truly private is this synthetic data? Recent research highlights a significant vulnerability: even data generated by advanced models can inadvertently “memorize” parts of the original training data, potentially leaking sensitive information about individuals.

This concern is particularly relevant for diffusion models, a cutting-edge type of generative AI that has shown impressive capabilities in creating realistic and high-quality tabular data. While these models are celebrated for their ability to mimic complex data distributions, they are not immune to privacy risks. This is where Membership Inference Attacks (MIAs) come into play. MIAs are designed to determine whether a specific record was part of a model’s training dataset, thereby exposing potential privacy breaches.

A new black-box attack, called MIA-EPT (Membership Inference Attack via Error Prediction for Tabular Data), has been introduced to specifically target these tabular diffusion models. Developed by Eyal German, Daniel Samira, Yuval Elovici, and Asaf Shabtai, MIA-EPT operates without needing access to the internal workings of the generative model. Instead, it relies solely on the synthetic data produced by the model. The core idea behind MIA-EPT is simple yet effective: if a generative model has memorized a training record, it will be easier to predict the attributes of that record from the synthetic data it generates, leading to lower prediction errors.

How MIA-EPT Works

MIA-EPT constructs “error-based feature vectors” by masking and then reconstructing attributes (columns) of target records. It then observes how accurately these attributes are predicted. Records that were part of the original training data are expected to yield lower prediction errors, providing a signal of their “membership.” The attack follows a five-step pipeline:

1. Shadow Model Training: Auxiliary data, similar to the target model’s training data, is used to train “shadow” diffusion models. These models simulate the target’s generative process.

2. Attribute Prediction Model Training: Separate prediction models are trained on the synthetic data generated by these shadow models. Each model learns to predict a specific column’s value based on the other columns.

3. Feature Extraction (Error Profiles): For both “member” (used in training) and “non-member” (not used) records, the attribute prediction models are used to predict masked column values. The prediction errors (or accuracy for categorical data) are then aggregated into a unique “error profile” for each record.

4. Attack Classifier Training: An attack classifier is trained using these error profiles, learning to distinguish between members and non-members based on their error patterns.

5. Membership Prediction: Finally, this trained attack classifier is applied to a “challenge dataset” of unknown records to determine their membership status, providing a score indicating the likelihood of a record being part of the original training data.

More details about this approach are available in the full research paper: MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data.
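The error-profile idea behind steps 2 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy nearest-neighbor lookup (`predict_masked`, a hypothetical stand-in for the trained attribute prediction models), a crude unscaled distance, and made-up records. Shadow model training and the attack classifier (steps 1 and 4) are omitted. For categorical columns the sketch records a 0/1 mismatch rather than an accuracy.

```python
from typing import Dict, List, Union

Value = Union[float, str]
Record = Dict[str, Value]


def distance(a: Record, b: Record, skip: str) -> float:
    """Crude distance over every column except the masked one (`skip`)."""
    total = 0.0
    for col, val in a.items():
        if col == skip:
            continue
        other = b[col]
        if isinstance(val, str) or isinstance(other, str):
            total += 0.0 if val == other else 1.0  # categorical mismatch
        else:
            total += abs(val - other)  # numeric gap (unscaled, toy only)
    return total


def predict_masked(record: Record, synthetic: List[Record], target: str) -> Value:
    """Hypothetical stand-in predictor: copy `target` from the nearest synthetic row."""
    nearest = min(synthetic, key=lambda row: distance(record, row, target))
    return nearest[target]


def error_profile(record: Record, synthetic: List[Record]) -> List[float]:
    """Mask each column in turn and record how badly it is reconstructed."""
    profile = []
    for col, true_val in record.items():
        pred = predict_masked(record, synthetic, col)
        if isinstance(true_val, str):
            profile.append(0.0 if pred == true_val else 1.0)
        else:
            profile.append(abs(pred - true_val))
    return profile


# Made-up synthetic rows; the first one mirrors a "memorized" training record.
synthetic = [
    {"age": 34.0, "income": 52.0, "job": "nurse"},
    {"age": 35.0, "income": 51.0, "job": "nurse"},
    {"age": 61.0, "income": 90.0, "job": "pilot"},
]
memorized = {"age": 34.0, "income": 52.0, "job": "nurse"}
unseen = {"age": 47.0, "income": 30.0, "job": "chef"}

profile_memorized = error_profile(memorized, synthetic)
profile_unseen = error_profile(unseen, synthetic)
print(profile_memorized, profile_unseen)
```

In this toy setup the record that leaked into the synthetic data is reconstructed with zero error in every column, while the unseen record is not. In the attack described above, such per-column errors form the feature vector that the attack classifier learns from.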

Key Findings and Implications

MIA-EPT has been validated on three state-of-the-art tabular diffusion models: TabDDPM, TabSyn, and ClavaDDPM. In internal tests, it achieved AUC-ROC scores of up to 0.599 and True Positive Rate at 10% False Positive Rate (TPR@10% FPR) values of 22.0%. For reference, an attacker guessing at random would score an AUC-ROC of 0.5 and a TPR@10% FPR of 10%. Notably, under the challenging conditions of the MIDST 2025 competition, MIA-EPT secured second place in the Black-box Multi-Table track, with a TPR@10% FPR of 20.0%.
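To make the TPR@10% FPR metric concrete, here is a small sketch of how it could be computed from attack scores. The function name `tpr_at_fpr`, the scores, and the labels are illustrative assumptions, not taken from the paper or the competition tooling.

```python
from typing import List


def tpr_at_fpr(scores: List[float], labels: List[int], max_fpr: float = 0.10) -> float:
    """Best true positive rate over thresholds whose false positive rate is <= max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one member and one non-member")
    best_tpr = 0.0
    # Try every distinct score as a threshold: predict "member" when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Made-up attack scores: 5 members (label 1) followed by 10 non-members (label 0).
scores = [0.9, 0.8, 0.4, 0.3, 0.2,
          0.85, 0.5, 0.45, 0.35, 0.25, 0.15, 0.1, 0.05, 0.02, 0.01]
labels = [1] * 5 + [0] * 10

result = tpr_at_fpr(scores, labels)
print(result)
```

With 10 non-members, a 10% false positive budget allows at most one non-member to be flagged. The metric then reports the share of true members caught within that budget, which is why it is a common way to measure how confidently an attack can single out members.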

These results are significant because they demonstrate that substantial membership leakage can be uncovered in synthetic tabular data, even when the attacker has limited information (a black-box setting). This challenges the common assumption that synthetic data is inherently privacy-preserving. The success of MIA-EPT highlights a crucial trade-off: maximizing the utility and realism of synthetic data by preserving important patterns can inadvertently increase the risk of retaining traces of individual data points from the original training set.

The research emphasizes the need for organizations using diffusion-based data synthesis to rigorously evaluate their outputs for such leaks. It also motivates the development and implementation of robust privacy defenses, such as noise injection, stronger regularization techniques, or differential privacy, to better balance data utility with individual privacy in the future.

Dev Sundaram
https://blogs.edgentiq.com
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories (product launches, funding rounds, regulatory shifts) and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
