TLDR: This research introduces a novel framework for In-Context Learning (ICL) in Large Language Models (LLMs) that leverages public data to achieve strong differential privacy (DP) guarantees without significantly compromising model utility. By integrating public data into the aggregation and selection processes, the proposed algorithm effectively balances privacy protection and performance. Experiments demonstrate improved utility in question-answering and summarization tasks, robustness against membership inference attacks, and practical enhancements for efficiency and public data quality.
In-context learning (ICL) has become a cornerstone of Large Language Models (LLMs), allowing them to perform a wide array of tasks by learning from examples provided in the prompt, without needing extensive fine-tuning. This flexibility has led to its widespread adoption across various domains. However, this powerful capability comes with a significant concern: the risk of private data leakage. When LLMs are exposed to sensitive information in prompts, especially under malicious attacks, there’s a real danger that private data could be inferred or exposed.
Consider a scenario where private patient treatment records are used as demonstration examples for an LLM. A malicious attacker could attempt to determine if a specific patient’s record was part of the data used, potentially violating data protection regulations like GDPR. This highlights the critical need for robust privacy protection in ICL.
Differential privacy (DP) is widely recognized as the gold standard for safeguarding privacy in machine learning. Its core principle is to ensure that an algorithm’s output is minimally affected by the inclusion or exclusion of any single individual’s data, thereby drastically reducing the risk of privacy leakage. While DP offers strong guarantees, integrating it into ICL algorithms often leads to a significant reduction in the model’s utility or performance.
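For reference, the standard definition of (ε, δ)-differential privacy can be stated as below. The symbols M (the randomized algorithm), D and D' (two datasets differing in one individual's record), and S (any set of possible outputs) are notation introduced here for illustration, not taken from the article.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```

A smaller ε means the two output distributions are harder to tell apart, which is why a setting such as ε=1 is usually regarded as strong protection.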
To tackle this challenge, a new research paper titled "Public Data Assisted Differentially Private In-Context Learning" proposes an innovative approach. The authors, Seongho Joo, Hyukhun Koh, and Kyomin Jung from Seoul National University, introduce a private ICL algorithm that incorporates task-related public data while maintaining strong DP guarantees. Their goal is to effectively balance privacy protection with model utility.
How the Framework Works
The proposed framework integrates public data at several stages to enhance privacy and utility. First, it subsamples and partitions both the private and public datasets. Subsampling not only amplifies privacy but also reduces memory costs. Next, to address the high dimensionality of LLM outputs, the responses generated from each partition are embedded into a semantic space and then privately clustered. This ‘private aggregation’ step prevents attackers from inferring private information from any single response. Finally, the algorithm selects the most appropriate response from the top-k candidate clusters, using public data as guidance.
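The three stages above can be illustrated with a minimal sketch. It is not the authors' implementation: the private clustering step is replaced here by a simpler Laplace-noised vote histogram over candidate responses assumed to come from public-only prompts, and `toy_embed` is a hypothetical stand-in for a real sentence encoder. All function names and parameters are assumptions made for illustration.

```python
import numpy as np

def subsample_and_partition(records, sample_size, n_parts, rng):
    """Subsample without replacement, then split into disjoint partitions."""
    idx = rng.choice(len(records), size=sample_size, replace=False)
    return [[records[i] for i in part] for part in np.array_split(idx, n_parts)]

def toy_embed(text, dim=32):
    """Hypothetical stand-in for a sentence encoder: hashed bag of words."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0  # deterministic hash
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def noisy_top_k(private_responses, candidates, epsilon, k, rng):
    """Simplified private aggregation (assumption, not the paper's clustering).

    Each response produced from one private partition votes for its nearest
    candidate. Laplace noise is added to the vote counts and the indices of
    the k largest noisy counts are returned.
    """
    cand_embs = np.stack([toy_embed(c) for c in candidates])
    votes = np.zeros(len(candidates))
    for response in private_responses:
        votes[np.argmax(cand_embs @ toy_embed(response))] += 1.0
    # Changing one private response moves one vote between two bins,
    # so the L1 sensitivity of the histogram is 2.
    noisy = votes + rng.laplace(scale=2.0 / epsilon, size=votes.shape)
    return np.argsort(noisy)[::-1][:k]

def select_with_public_guidance(top_k, candidates, public_responses):
    """Among the top-k candidates, return the one closest to the public centroid."""
    centroid = np.mean([toy_embed(r) for r in public_responses], axis=0)
    scores = [toy_embed(candidates[i]) @ centroid for i in top_k]
    return candidates[top_k[int(np.argmax(scores))]]
```

In this sketch only the noisy vote counts depend on private data. The candidate list and the final public-guided choice use public data alone, so they add no privacy cost beyond the ε spent on the histogram.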
Key Findings and Benefits
Experiments on question-answering (ChatDoctor) and document summarization (SAMSum) tasks demonstrate the effectiveness of this approach. The results show that the private ICL framework, assisted by public data, significantly improves utility compared to counterparts that use private data only, even under strong privacy protection levels (e.g., ε=1). The research also highlights that both in-distribution (ID) and out-of-distribution (OOD) public data help minimize utility degradation.
Furthermore, the framework proves robust against empirical privacy attacks, specifically membership inference attacks. These attacks aim to determine whether a particular record was among the private data used, which in this setting means the demonstration examples supplied to the model. The private models consistently maintained a low attack success rate, indicating strong empirical privacy protection.
Practical Enhancements and Future Directions
The researchers also explored ways to enhance the utility and efficiency of their private ICL framework. They found that augmenting public datasets, even with a small privacy budget, can improve performance, especially when high-quality public data is scarce. To address the computational demands of using multiple ensembles, techniques like coreset sampling were introduced, significantly improving efficiency without compromising privacy guarantees.
While this framework marks a significant step forward, the authors acknowledge limitations, such as the accumulation of privacy risk over multiple queries and the computational cost. Future work will focus on more computationally efficient DP mechanisms and optimized ensemble methods.
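The accumulation of privacy risk over multiple queries can be made concrete with basic sequential composition, under which the ε values of successive DP queries add up. The sketch below is a hypothetical budget tracker built on that rule; the class and function names are illustrative, and the paper may use a tighter accounting method.

```python
import math

class BudgetAccountant:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Record one query; refuse it if the total budget would be exceeded."""
        if self.spent + epsilon > self.total + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.spent += epsilon

    def remaining(self):
        return self.total - self.spent

def max_queries(total_epsilon, per_query_epsilon):
    """Number of equal-cost queries that fit in the budget."""
    return math.floor(total_epsilon / per_query_epsilon)
```

For example, with a total budget of ε=8 and ε=1 spent per query, at most eight queries can be answered before the guarantee degrades beyond the chosen budget.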
This research offers a practical and robust solution for deploying in-context learning in real-world applications where data privacy is paramount, ensuring that the power of LLMs can be harnessed responsibly.


