TL;DR: A research paper proposes a four-level taxonomy—non-usability, privacy preservation, traceability, and deletability—to redefine data protection in the generative AI era. It addresses how data permeates the entire AI lifecycle, from training samples to AI-generated content, and highlights the need for updated technical approaches and regulatory frameworks to safeguard diverse data assets against new challenges.
The rise of generative Artificial Intelligence (AI) has fundamentally changed how we think about data. No longer just static files, data now plays a crucial role at every stage of an AI model’s life, from the initial training to the prompts users give and the content AI generates. This shift means our traditional ways of protecting data are no longer enough, and it’s often unclear what exactly needs to be protected.
Failing to secure data in AI systems can have serious consequences for both society and individuals. This highlights an urgent need to clearly define and enforce data protection in this new era. A recent research paper, “Rethinking Data Protection in the (Generative) Artificial Intelligence Era”, proposes a new four-level framework to address these diverse protection needs.
The Evolving Landscape of Data in AI
In the past, data protection mainly focused on keeping digital content like photos or videos safe from unauthorized use. Owners might encrypt files or embed digital watermarks. However, with generative AI, data’s value is increasingly tied to the AI model itself, not just its raw content. For example, large datasets are compiled to train models, and these trained models become valuable assets. Even the inputs users provide (prompts) and the content AI generates (like images or code) are now valuable forms of data that need protection.
Incidents like Samsung employees accidentally leaking source code to ChatGPT or Italy temporarily banning ChatGPT over privacy concerns underscore the complexity and urgency of this issue. It’s clear we need a systematic way to understand what must be protected in the context of AI.
A Four-Level Taxonomy for Data Protection
To bring clarity to this complex area, the paper introduces a hierarchical taxonomy with four distinct levels, each balancing data usability with the degree of control or protection:
Level 1: Data Non-usability. This is the strictest level. It ensures that certain data cannot be used by AI models at all, whether for training or inference. This offers maximum protection by completely sacrificing the data’s utility. Think of it as making data completely inaccessible for AI purposes, even if it’s publicly available.
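In practice, this level is often enforced cryptographically. As a minimal sketch (assuming Python’s cryptography package and a hypothetical dataset.csv), encrypting a dataset at rest leaves any AI pipeline without the key staring at ciphertext:

```python
# Minimal sketch: symmetric encryption makes a dataset unreadable to any
# training or inference pipeline that lacks the key.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # held by the data owner, never shared
cipher = Fernet(key)

plaintext = open("dataset.csv", "rb").read()          # hypothetical file
open("dataset.csv.enc", "wb").write(cipher.encrypt(plaintext))

# A pipeline reading dataset.csv.enc sees only ciphertext; only the
# key holder can restore the original bytes:
restored = cipher.decrypt(open("dataset.csv.enc", "rb").read())
assert restored == plaintext
```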
Level 2: Data Privacy-preservation. This level allows data to be used for AI development or applications while still safeguarding sensitive information. It’s a compromise that maintains some utility but ensures confidentiality of personal or private details. For instance, a hospital might use patient records to train a disease-detection model, but without revealing individual identities.
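Differential privacy is one standard way to realize this level (the paper returns to it under techniques below). Here is a minimal sketch of the Laplace mechanism, which releases an aggregate statistic with noise calibrated so that no single record can be inferred; the ages and epsilon are purely illustrative:

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=np.random.default_rng()):
    """Differentially private mean via the Laplace mechanism.

    Each record is clipped to [lower, upper], so one record can shift the
    mean by at most (upper - lower) / n -- the sensitivity that sets the
    noise scale.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Hypothetical example: average patient age without exposing any individual.
ages = np.array([34, 58, 41, 67, 29, 50])
print(dp_mean(ages, lower=0, upper=100, epsilon=1.0))
```

Smaller epsilon values add more noise and give stronger privacy; the trade-off against utility is exactly the compromise this level describes.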
Level 3: Data Traceability. Here, data can be used almost fully, but mechanisms are in place to track its origin, usage, and any modifications. This enables transparency and accountability, allowing stakeholders to detect if data has been misused. It has minimal impact on the data’s functionality, often by embedding subtle identifiers like watermarks.
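To make “subtle identifiers” concrete, here is one illustrative embodiment (not the paper’s own scheme): a least-significant-bit watermark that hides an owner ID in an image’s pixels with negligible visual impact and can be read back later to establish provenance:

```python
import numpy as np

def embed_watermark(pixels: np.ndarray, owner_id: str) -> np.ndarray:
    """Hide owner_id in the least significant bits of a uint8 image."""
    bits = np.unpackbits(np.frombuffer(owner_id.encode(), dtype=np.uint8))
    flat = pixels.flatten()                      # flatten() returns a copy
    flat[: len(bits)] = (flat[: len(bits)] & 0xFE) | bits  # overwrite LSBs
    return flat.reshape(pixels.shape)

def extract_watermark(pixels: np.ndarray, n_chars: int) -> str:
    """Read n_chars of the hidden ID back out of the LSBs."""
    bits = pixels.flatten()[: n_chars * 8] & 1
    return np.packbits(bits).tobytes().decode()

image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
marked = embed_watermark(image, "owner-42")
assert extract_watermark(marked, len("owner-42")) == "owner-42"
```

Production watermarks are far more robust to compression and editing, but the principle is the same: the data stays fully usable while carrying a recoverable trace of its origin.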
Level 4: Data Deletability. Up front, this is the most permissive level: data can be fully integrated into a model, but its influence can later be removed upon request. This aligns with principles like the ‘right to be forgotten’ and emphasizes post-hoc control without impeding initial data utility.
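The baseline for removing influence is exact unlearning: drop the affected records and retrain. A toy sketch on synthetic data, assuming scikit-learn; the machine-unlearning techniques discussed below aim to approximate this outcome without the cost of a full retrain:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# A user invokes their right to be forgotten for records 10-19:
forget = np.zeros(len(X), dtype=bool)
forget[10:20] = True

# Exact unlearning: retrain on the retained data only. Approximate
# machine-unlearning methods try to reach the same state more cheaply.
unlearned = LogisticRegression().fit(X[~forget], y[~forget])
```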
Data Across the AI Lifecycle
The paper emphasizes that data protection must span the entire AI lifecycle, covering:
- Training Datasets: These are the fuel for AI models and often contain sensitive or copyrighted material. Protecting them is crucial for legal, ethical, and commercial reasons.
- Trained Models: Once trained, the model itself, with its learned parameters, becomes a valuable asset. Protecting it is like safeguarding trade secrets.
- Deployment-integrated Data: This includes elements like system prompts (hidden instructions for AI) and external knowledge bases used during inference. They can contain sensitive information and directly influence model outputs.
- User’s Input: Prompts supplied by users can be highly sensitive, from medical histories to proprietary code. Protecting this data is vital for privacy, security, and building user trust (a minimal redaction sketch follows this list).
- AI-generated Content (AIGC): The outputs of generative AI, like text, images, or code, are valuable in their own right and can even feed back into the AI cycle as new training data. Ensuring their provenance and controlling their reuse is important.
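For the user-input stage, one simple client-side safeguard (an illustration, not a recommendation from the paper) is redacting obvious identifiers before a prompt ever leaves the user’s machine:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_prompt(prompt: str) -> str:
    """Mask common identifiers before a prompt is sent to a hosted model."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

print(redact_prompt("Patient john.doe@example.com, SSN 123-45-6789, reports..."))
# -> Patient [EMAIL], SSN [SSN], reports...
```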
Techniques and Regulations
The paper also delves into the technical approaches behind each protection level: encryption and access control for non-usability, differential privacy and federated learning for privacy preservation, and watermarking and blockchain for traceability. For deletability, it discusses machine unlearning, a family of techniques that aim to remove a datum’s influence from a trained model without retraining it from scratch.
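To make one of those techniques concrete, here is a minimal federated-averaging sketch: clients train locally and share only model parameters, weighted by how much data each holds, so raw records never leave their owners. This is a simplified illustration, not the paper’s implementation:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg: combine locally trained weight vectors, weighted by each
    client's data volume. Raw training data never leaves the clients."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)           # (n_clients, n_params)
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()

# Three hypothetical hospitals train the same model shape locally:
w1, w2, w3 = (np.random.normal(size=10) for _ in range(3))
global_w = federated_average([w1, w2, w3], client_sizes=[500, 1200, 300])
```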
Globally, regulations like the EU’s GDPR and AI Act, China’s PIPL, and California’s CCPA in the US are beginning to address these issues, covering aspects of non-usability, privacy, traceability, and deletability. However, significant gaps remain, particularly around cross-border enforceability, the technical feasibility of deleting data from trained models, and the protection of non-personal but sensitive assets such as copyrighted content or the models themselves.
Looking Ahead
The paper highlights that data protection is distinct from, yet deeply intertwined with, data safety (e.g., preventing misinformation or bias). Robust data protection, especially traceability and controlled access, provides the foundation for addressing safety concerns. The rise of AIGC also introduces challenges around ownership and copyright, where technical solutions like watermarking can help establish provenance even when legal frameworks are unclear.
Ultimately, addressing data protection in the generative AI era requires a holistic approach, combining conceptual frameworks, technical innovations, and evolving regulations. This new taxonomy offers a common language to facilitate discussions among developers, researchers, and regulators, ensuring responsible innovation in AI.