TLDR: This research investigates how Privacy Enhancing Technologies (PETs) can protect sensitive personal information leaked by Explainable Artificial Intelligence (XAI) methods. The study evaluates synthetic data, differential privacy, and noise addition at different stages of the AI pipeline against attribute inference attacks. It finds that adding noise to explanations *after* the model is trained (post-model stage) is the most effective strategy for reducing privacy leakage while preserving explanation quality and model accuracy and adding only negligible performance overhead. In contrast, synthetic data and differential privacy were largely ineffective or even detrimental to privacy.
Artificial Intelligence (AI) systems are increasingly used across various industries, from medicine to finance. However, many of these systems operate as ‘black boxes,’ meaning their decision-making processes are not easily understood. This lack of transparency can be a major concern, especially in high-risk applications. To address this, the field of Explainable Artificial Intelligence (XAI) has emerged, aiming to make AI decisions more transparent and trustworthy.
While XAI offers significant benefits by shedding light on how AI models arrive at their conclusions, it also introduces a critical challenge: privacy. Recent research has shown that the explanations provided by XAI methods can inadvertently leak sensitive personal information about the individuals whose data was used to train or query the models. Adversaries can exploit these explanations to infer private attributes, reconstruct missing data, or even extract the underlying model itself, posing a threat to both individual privacy and intellectual property.
One particularly concerning privacy risk is ‘attribute inference,’ where an attacker uses explanations to deduce sensitive features of an individual’s data, such as age, gender, or race. This type of attack can affect a broad range of AI users, not just those whose data was part of the original training set. Currently, there’s a notable gap in effective defenses against these attacks, especially when vulnerable XAI methods are deployed in real-world systems or as part of machine learning as a service (MLaaS).
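To make the threat concrete, here is a minimal sketch of an attribute inference attack that treats explanation vectors as input features for an attack classifier. The synthetic data and the random-forest attack model are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch of an attribute inference attack that uses model explanations
# as input features. Data and the attack model are illustrative placeholders,
# not the paper's setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pretend these are attribution vectors the adversary collected by querying an
# explanation API, and that the attributions correlate with a sensitive attribute.
explanations = rng.normal(size=(1000, 10))
sensitive_attr = (explanations[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    explanations, sensitive_attr, test_size=0.3, random_state=0
)

# The attack model learns to predict the hidden attribute from explanations alone.
attack_clf = RandomForestClassifier(n_estimators=100, random_state=0)
attack_clf.fit(X_train, y_train)
print(f"Attack accuracy: {attack_clf.score(X_test, y_test):.2f} (random guess ~0.50)")
```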
Exploring Privacy Enhancing Technologies (PETs)
To tackle this privacy dilemma, researchers are exploring the integration of Privacy Enhancing Technologies (PETs) into XAI systems. PETs are methods designed to safeguard data in computer systems. This research specifically evaluates three types of PETs: synthetic training data, differentially private training, and noise addition, applying them at different stages of the AI development pipeline.
The study categorizes the application of PETs into three stages:
- Pre-model: PETs are applied to the training data *before* the AI model is trained. Here, synthetic data, which mimics the statistical properties of real data without containing actual individual records, is used to train the model.
- In-model: PETs are applied *during* the model’s training process. Differential Privacy (DP), a technique that adds carefully calibrated noise during training so that the trained model reveals little about whether any single data record was included, falls into this category.
- Post-model: PETs are applied *after* the model has been trained, specifically to the explanations generated by the model. This involves adding noise directly to the explanations before they are released.
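The post-model idea is simple enough to sketch in a few lines: the deployed model and its explainer are left untouched, and noise is added to each attribution vector just before it is released. The noise scale below is a placeholder, not a value from the study.

```python
# Sketch of post-model noise addition: perturb an attribution vector right
# before it is returned to the user. The scale (sigma) is a placeholder.
import numpy as np

def release_explanation(attributions: np.ndarray, sigma: float = 0.1, rng=None) -> np.ndarray:
    """Add random Gaussian noise to an explanation before releasing it."""
    rng = rng or np.random.default_rng()
    return attributions + rng.normal(0.0, sigma, size=attributions.shape)

raw_explanation = np.array([0.42, -0.13, 0.07, 0.31])  # e.g., per-feature attributions
noisy_explanation = release_explanation(raw_explanation, sigma=0.05)
```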
The researchers empirically evaluated these PETs on two common categories of feature-based XAI methods: backpropagation-based (like Integrated Gradients and SmoothGrad) and perturbation-based (like SHAP and LIME). They used four diverse datasets from finance, medical, and justice domains, and measured the impact of PETs on privacy (attack success), explanation quality (faithfulness), model utility (accuracy), and performance time.
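For readers unfamiliar with these explainers, the snippet below shows what a perturbation-based attribution looks like in practice, using the shap package’s KernelExplainer on a toy model. The model, data, and parameters are stand-ins rather than the study’s actual configuration.

```python
# Sketch: producing perturbation-based feature attributions with SHAP.
# The model and data are placeholders, not the study's configuration.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# KernelExplainer perturbs inputs around a background sample to estimate
# each feature's contribution to the prediction.
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(model.predict_proba, background)
attributions = explainer.shap_values(X_train[:5])  # per-class attributions (format varies by shap version)
```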
Key Findings and Recommendations
The evaluation yielded crucial insights into the effectiveness and side-effects of integrating PETs:
- Privacy Protection: The most significant finding was that adding noise to explanations in the post-model stage consistently and effectively reduced the success of attribute inference attacks. Both random and differentially private calibrated noise performed well, bringing the attack success closer to a random guess; a simple way to measure this drop is sketched after this list. In the best cases, this reduced the risk of attack by nearly 50%. In contrast, using synthetic data (pre-model) or differentially private training (in-model) was largely ineffective at mitigating attribute inference, and in some instances, even increased the attack’s success.
- Explanation Quality: Encouragingly, the integration of PETs, particularly noise addition in the post-model stage, did not adversely affect the faithfulness or quality of the explanations. This means that the explanations remained accurate and true to the model’s behavior, which is vital for maintaining trust in XAI systems.
- Model Utility: When synthetic data was used for training (pre-model) or differential privacy was applied during training (in-model), there was a noticeable drop in the model’s accuracy. This trade-off between privacy and utility is a common challenge in privacy-preserving machine learning. However, with post-model noise addition, the original, accurate model was used, and only its explanations were perturbed, resulting in no change to the model’s accuracy.
- Performance Overhead: The pre-model and in-model PETs introduced significant computational overheads due to the time required for synthetic data generation or the more complex differentially private training process. Post-model noise addition, especially random noise, introduced negligible performance overhead, making it a highly efficient solution.
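A rough way to see the first finding in code is to run the same attack twice, once on raw explanations and once on noise-perturbed ones, and compare accuracies against the 50% random-guess baseline. Everything below is synthetic and illustrative, not the study’s evaluation pipeline.

```python
# Sketch: quantifying the privacy gain by comparing attack accuracy on
# original vs. noise-perturbed explanations. Data and scales are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
explanations = rng.normal(size=(1000, 10))
# Make the sensitive attribute weakly recoverable from the explanations on purpose.
sensitive = (explanations[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

def attack_accuracy(expl: np.ndarray) -> float:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, expl, sensitive, cv=5).mean()

baseline = attack_accuracy(explanations)
noisy = attack_accuracy(explanations + rng.normal(0.0, 1.0, explanations.shape))

print(f"attack accuracy, raw explanations:   {baseline:.2f}")
print(f"attack accuracy, noisy explanations: {noisy:.2f}  (random guess ~0.50)")
```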
Based on these findings, the research offers several recommendations:
- Among the XAI methods themselves, SmoothGrad (backpropagation-based) and LIME (perturbation-based) showed the most inherent resilience to attribute inference, even without PETs.
- Synthetic data and differentially private training are generally not recommended for defending against attribute inference attacks on XAI explanations.
- Noise addition in the post-model stage is the most effective and balanced approach, preserving privacy while maintaining explanation quality and model utility with minimal performance cost. Gaussian noise techniques often outperformed Laplace noise, except for LIME explanations where Laplace worked better (both mechanisms are sketched after this list).
- Random noise is a practical and efficient choice, often performing similarly to more complex calibrated noise, especially for Integrated Gradients, SHAP, and LIME.
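The last two recommendations concern how the noise is drawn. A minimal sketch of the two calibrated mechanisms follows, using the standard Laplace scale of sensitivity/ε and a common (ε, δ) calibration for Gaussian noise; the sensitivity and privacy parameters are illustrative assumptions, not values from the paper.

```python
# Sketch: calibrating Laplace vs. Gaussian noise for an explanation vector.
# Sensitivity, epsilon, and delta below are illustrative assumptions.
import numpy as np

def laplace_noise(shape, sensitivity: float, epsilon: float) -> np.ndarray:
    """Laplace mechanism: scale = sensitivity / epsilon."""
    return np.random.laplace(0.0, sensitivity / epsilon, size=shape)

def gaussian_noise(shape, sensitivity: float, epsilon: float, delta: float) -> np.ndarray:
    """Gaussian mechanism with the common calibration
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return np.random.normal(0.0, sigma, size=shape)

explanation = np.array([0.42, -0.13, 0.07, 0.31])
released_laplace = explanation + laplace_noise(explanation.shape, sensitivity=1.0, epsilon=1.0)
released_gauss = explanation + gaussian_noise(explanation.shape, sensitivity=1.0, epsilon=1.0, delta=1e-5)
```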
This comprehensive study, detailed further in the original research paper, highlights that achieving a balance between AI transparency and data privacy is indeed possible. By strategically integrating PETs, particularly post-model noise addition, we can build more trustworthy AI systems that offer valuable explanations without compromising sensitive personal information.


