Enhancing Private In-Context Learning Through Public Information

TLDR: This research introduces a novel framework for In-Context Learning (ICL) in Large Language Models (LLMs) that leverages public data to achieve strong differential privacy (DP) guarantees without significantly compromising model utility. By integrating public data into the aggregation and selection processes, the proposed algorithm effectively balances privacy protection and performance. Experiments demonstrate improved utility in question-answering and summarization tasks, robustness against membership inference attacks, and practical enhancements for efficiency and public data quality.

In-context learning (ICL) has become a cornerstone of Large Language Models (LLMs), allowing them to perform a wide array of tasks by learning from examples provided in the prompt, without needing extensive fine-tuning. This flexibility has led to its widespread adoption across various domains. However, this powerful capability comes with a significant concern: the risk of private data leakage. When LLMs are exposed to sensitive information in prompts, especially under malicious attacks, there’s a real danger that private data could be inferred or exposed.

Consider a scenario where private patient treatment records are used as demonstration examples for an LLM. A malicious attacker could attempt to determine if a specific patient’s record was part of the data used, potentially violating data protection regulations like GDPR. This highlights the critical need for robust privacy protection in ICL.

Differential privacy (DP) is widely recognized as the gold standard for safeguarding privacy in machine learning. Its core principle is to ensure that an algorithm’s output is minimally affected by the inclusion or exclusion of any single individual’s data, thereby drastically reducing the risk of privacy leakage. While DP offers strong guarantees, integrating it into ICL algorithms often leads to a significant reduction in the model’s utility or performance.
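
Stated formally, this is the standard (ε, δ)-DP condition. The symbols below are the usual textbook notation rather than anything taken from this article: a randomized algorithm M, any two datasets D and D′ that differ in one individual's record, and any set S of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

A smaller ε means the two output distributions are harder to tell apart, which is why a setting such as ε=1 is regarded as strong protection.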

To tackle this challenge, a new research paper titled “Public Data Assisted Differentially Private In-Context Learning” proposes an innovative approach. The authors, Seongho Joo, Hyukhun Koh, and Kyomin Jung from Seoul National University, introduce a private ICL algorithm that incorporates task-related public data while maintaining strong DP guarantees. Their goal is to effectively balance privacy protection with model utility.

How the Framework Works

The proposed framework integrates public data at several stages to enhance privacy and utility. First, it subsamples and partitions both the private and public datasets. Subsampling not only amplifies the privacy guarantee but also reduces memory costs. Next, to address the high dimensionality of LLM outputs, the generated responses are embedded into a semantic space and then privately clustered. This ‘private aggregation’ step limits what an attacker can infer about any individual private record. Finally, the algorithm selects the most appropriate response from the top-k candidate clusters, using public data as guidance.
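
The three stages can be illustrated with a rough Python sketch. It is not the authors' implementation, and every function name in it is hypothetical. It also rests on several assumptions the article does not confirm: Poisson subsampling, cluster centroids that are fixed independently of the private data (for example, derived from public data), a Laplace-noised vote histogram as the private aggregation, and mean cosine similarity to public responses as the guidance signal. Embeddings are assumed to come from some external model and appear here as plain tuples.

```python
import math
import random


def subsample_and_partition(records, rate, num_parts, rng):
    """Keep each record independently with probability `rate` (Poisson
    subsampling), then deal the kept records into disjoint partitions."""
    kept = [r for r in records if rng.random() < rate]
    rng.shuffle(kept)
    return [kept[i::num_parts] for i in range(num_parts)]


def amplified_epsilon(epsilon, rate):
    """Standard amplification bound for Poisson subsampling of an
    epsilon-DP mechanism: log(1 + rate * (exp(epsilon) - 1))."""
    return math.log(1.0 + rate * (math.exp(epsilon) - 1.0))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_cluster(vec, centroids):
    """Index of the centroid most similar to an embedded response.
    Assumption: centroids do not depend on the private data."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_k(votes, num_clusters, epsilon, k, rng):
    """Histogram the per-partition cluster votes, add Laplace noise, and
    return the k clusters with the largest noisy counts.
    Changing one private record can move one vote between two bins, so
    the L1 sensitivity is taken as 2 and the noise scale is 2 / epsilon."""
    counts = [0.0] * num_clusters
    for c in votes:
        counts[c] += 1.0
    noisy = [c + laplace_noise(2.0 / epsilon, rng) for c in counts]
    return sorted(range(num_clusters), key=lambda i: noisy[i], reverse=True)[:k]


def select_with_public_guidance(candidates, centroids, public_vecs):
    """Among the candidate clusters, pick the one whose centroid is most
    similar on average to embeddings of responses built from public data."""
    def score(c):
        return sum(cosine(centroids[c], p) for p in public_vecs) / len(public_vecs)
    return max(candidates, key=score)
```

In this sketch, each private partition would supply the demonstrations for one LLM call, the response would be embedded and mapped to a vote with `nearest_cluster`, and only the noisy top-k list would leave the private side. Choosing among those candidates with public data is post-processing, so it consumes no additional privacy budget.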

Key Findings and Benefits

Experiments conducted on question-answering tasks (ChatDoctor) and document summarization tasks (SAMSum) demonstrate the effectiveness of this approach. The results show that the private ICL framework, assisted by public data, significantly improves utility compared to private data-only counterparts, even under strong privacy protection levels (e.g., ε=1). The research also highlights that both in-distribution (ID) and out-of-distribution (OOD) public data help minimize utility degradation.

Furthermore, the framework proves robust against empirical privacy attacks, specifically membership inference attacks. In the ICL setting, these attacks aim to determine whether a particular data point was among the private demonstration examples. The private models consistently maintained a low attack success rate, indicating strong empirical privacy protection.
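
As a rough illustration of how an attack success rate can be measured, the sketch below scores a simple threshold attack. The article does not say which attack or metric the authors used, so the function name, the balanced-accuracy metric, and the idea of a per-record attacker score are all assumptions made for this example.

```python
def attack_success_rate(member_scores, nonmember_scores):
    """Best balanced accuracy of a threshold attack that predicts
    'member' whenever the attacker's score is at or above the threshold.
    A value near 0.5 means the attacker does no better than guessing."""
    thresholds = sorted(set(member_scores) | set(nonmember_scores))
    best = 0.5  # baseline: random guessing
    for t in thresholds:
        true_pos = sum(s >= t for s in member_scores) / len(member_scores)
        true_neg = sum(s < t for s in nonmember_scores) / len(nonmember_scores)
        best = max(best, (true_pos + true_neg) / 2)
    return best
```

Under this metric, a well-protected system is one where the value stays close to 0.5 even when the attacker is allowed to pick the most favorable threshold.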

Practical Enhancements and Future Directions

The researchers also explored ways to enhance the utility and efficiency of their private ICL framework. They found that augmenting public datasets, even with a small privacy budget, can improve performance, especially when high-quality public data is scarce. To address the computational demands of using multiple ensembles, techniques like coreset sampling were introduced, significantly improving efficiency without compromising privacy guarantees.
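
One common way to build a coreset is greedy k-center selection, sketched below. The article does not specify which coreset method the authors use or which data it is applied to, so this is an illustration only. The function name is hypothetical, and the points are assumed to be embeddings of public examples, in which case the selection itself has no privacy cost.

```python
import math


def k_center_greedy(points, k):
    """Greedy k-center selection: start from the first point, then
    repeatedly add the point farthest from everything chosen so far.
    Returns the indices of the selected points."""
    if not points or k <= 0:
        return []
    chosen = [0]
    # Distance from each point to its nearest chosen point.
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen
```

The selected subset covers the embedding space with far fewer points, which reduces the number of examples each ensemble member has to process.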

While this framework marks a significant step forward, the authors acknowledge limitations, such as the accumulation of privacy risk over multiple queries and the computational cost. Future work will focus on more computationally efficient DP mechanisms and optimized ensemble methods.
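
The accumulation of privacy risk over queries follows from the standard DP composition theorems, which the short sketch below evaluates. The function names are hypothetical, and the article does not say which accounting method the authors use.

```python
import math


def basic_composition(epsilon, num_queries):
    """Basic composition: answering k queries, each epsilon-DP,
    is (k * epsilon)-DP overall."""
    return epsilon * num_queries


def advanced_composition(epsilon, num_queries, delta_slack):
    """Advanced composition: k queries, each epsilon-DP, are
    (epsilon_total, delta_slack)-DP overall with
    epsilon_total = sqrt(2k ln(1/delta_slack)) * eps + k * eps * (e^eps - 1)."""
    k = num_queries
    return (math.sqrt(2 * k * math.log(1 / delta_slack)) * epsilon
            + k * epsilon * (math.exp(epsilon) - 1))
```

For example, ten queries at ε=1 each add up to a total of ε=10 under basic composition, which illustrates why a per-query guarantee alone does not bound the risk of a long-running deployment.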

This research offers a practical and robust solution for deploying in-context learning in real-world applications where data privacy is paramount, ensuring that the power of LLMs can be harnessed responsibly.
