Rethinking Data Protection in the (Generative) AI Era

Abstract— … parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in (generative) AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of, and rigorously enforce, data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy-preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we highlight the urgent need to rethink data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Fig. 1: Data flow across the life-cycle of a (generative) AI model. The schematic traces how different forms of data emerge and circulate from the moment raw samples are collected to the point at which a deployed model generates new content. (i) Data Collection and Curation: Samples, such as images, texts, and audio clips, are gathered and annotated; once aggregated, they form the training dataset that drives model learning and the testing dataset used for validation. (ii) Model Training: These datasets are transformed into model parameters (e.g., weights and biases), turning the well-trained model itself into a valuable, model-centric data asset. (iii) Model Inference: After deployment, users supply inputs or prompts—which may contain private or proprietary information—that the model processes to produce AI-generated content ranging from class labels to code, images, or full documents. Arrows indicate how each artefact (e.g., dataset, model parameters, prompts, and outputs) can be independently copied, released, or shared, underscoring why all of them must be considered within a comprehensive data-protection framework.

The first two authors contributed equally to this paper. This work was partly completed while Yiming Li was a Research Professor at Zhejiang University; he is now with Nanyang Technological University.

I. INTRODUCTION

Artificial Intelligence (AI) has experienced tremendous progress in the last few decades and is widely and successfully deployed in almost all domains, such as identity verification, e-commerce, and healthcare [1, 2, 3, 4]. With the recent rapid development of AI-empowered generative models (e.g., large language models (LLMs) [5] and diffusion models [6]), people can use them to easily generate high-quality images, audio, video, and text (instead of simple predictions). More importantly, these powerful models are close at hand: users can simply exploit them via APIs (e.g., GPT-4 [7] and Midjourney [8]) or even directly download them from open-source communities/platforms (e.g., Hugging Face). Arguably, we have moved into the era of (generative) AI.

In general, the prosperity of AI heavily relies on high-quality data, with which researchers and developers can train, evaluate, and improve their models. For example, advanced LLMs like GPT-4 [7] and DeepSeek [9] required vast, curated datasets from diverse sources, often refined with costly human feedback to ensure quality and alignment. Similarly, specialized medical models like Google's Med-PaLM [10], designed for clinical question answering and summarization, or diagnostic AI systems for tasks like cancer detection from images, relied heavily on large, diverse clinical datasets (e.g., the
Cancer Genome Atlas (TCGA) [11]) meticulously annotated by medical experts, a complex and resource-intensive necessity. In particular, collecting and annotating data remains a significant obstacle for most companies since these processes are time-consuming and expensive [12]. Accordingly, these data are undoubtedly valuable assets to their owners and deserve to be protected.

Data protection has long been a critical area of research due to its significance in safeguarding the legitimate rights of data owners. Various regulations, such as the GDPR [13] and the EU AI Act [14], highlight the importance of data protection. In the past, data typically existed as discrete digital items, whose value was derived largely from their content. For example, it could be digitized artwork, photographs, videos, etc. Accordingly, traditional data protection mainly refers to protecting the content of data from unauthorized use and redistribution, although its specific definition and scope still remain ambiguous to some extent. In practice, data owners would encrypt files [15, 16, 17] before storage or transmission and embed digital watermarks [18, 19, 20] when releasing data publicly or in digital marketplaces.

However, in the AI era, especially with the emergence of generative AI models, the scope of data protection has become far more complex and ambiguous [21, 22, 23]. As shown in Figure 1, data permeates every stage of an AI model's life cycle, making its value increasingly tied to the model rather than just the raw content of the data. For instance, developers compile many individual samples into large training datasets that feed into model development. The trained models themselves then become valuable data assets with significant commercial value. In addition, high-value or sensitive data (e.g., original artworks or personal medical records) may also be incorporated as inputs during a model's inference stage. Besides, with the rise of generative AI models, the outputs of inference are no longer simple predictions – they can be substantial content in their own right. For example, an LLM might generate executable code for a requested function [24], or a diffusion model might produce a realistic image for an advertisement or animation clips [25]. These AI-generated outputs are also valuable forms of data that merit protection.

This ambiguity in scope makes meaningful protection and regulation difficult. For instance, in 2023, Samsung Electronics discovered that employees had inadvertently leaked proprietary source code by inputting it into OpenAI's ChatGPT, prompting it to prevent its staff from using such external generative AI tools on company systems [26]; that same year, Italy's Data Protection Authority (Garante) imposed a nationwide suspension of ChatGPT after a leak of user conversations and allegations that personal data had been ingested for training without a lawful basis [27]. These incidents underscore the urgent need for a systematic understanding of what, precisely, must be protected against the backdrop of blooming AI-integrated applications and data markets.

To tackle this problem, this paper offers the first timely overview and categorization of data protection in the (generative) AI era. Specifically, we introduce a hierarchical taxonomy of data protection comprising four distinct levels: data non-usability, privacy-preservation, traceability, and deletability. Each level in this taxonomy reflects a different balance between how usable the data remains for AI models and the degree of control or protection imposed on that data. At the most stringent end of the spectrum, data non-usability ensures that certain data cannot be used for model training or inference at all, offering maximal protection by completely sacrificing utility. Progressing down the hierarchy, privacy-preservation permits data to be utilized in model development and application while safeguarding sensitive information, a trade-off that maintains some utility but enforces confidentiality of personal or private attributes. Further along, traceability allows nearly full data usage, yet embeds mechanisms to track the data's origin and usage, thereby enabling transparency and accountability (for instance, detecting if data has been misused) with only minimal impact on the data's functionality. Finally, at the most permissive level, data deletability lets data be fully integrated on the condition that its influence can be later removed from the model upon request. This last level emphasizes post-hoc control (aligning with 'right to be forgotten' principles) without impeding initial data utility. In particular, to ground this taxonomy, we systematically review representative technical approaches at each level, highlighting their strengths and limitations in practical settings.

By clearly delineating these four levels, our framework brings much-needed clarity to the often conflated notion of 'data protection' in the (generative) AI era. Researchers and practitioners can now specify whether they aim to prevent any use of certain data, protect privacy during use, ensure traceable usage, or enable later deletion. This structured hierarchy not only highlights the progressive relaxation of restrictions (from strict non-use to full use with after-the-fact removal) but also helps disambiguate the scope of data protective measures in the AI era. Moreover, it provides a structured lens to evaluate existing legal and regulatory instruments: in a later section, we will show how existing national and international policies or regulations align (or fail to align) with each data protection level, illuminating where governance already supports these protective goals and where further action is required.

II. HIERARCHICAL TAXONOMY OF DATA PROTECTION

A. What Data Do We Need to Protect in the AI Era?

In the (generative) AI era, the scope of data protection has expanded significantly, moving far beyond the traditional focus on static data content. Specifically, AI models generate and consume various forms of data throughout their lifecycle, from initial training to final inference. At each stage, different categories of data emerge as assets that warrant protection, whether for reasons of privacy, intellectual property, security, or commercial value. As presented in the previous section, Figure 1 illustrates this lifecycle, where raw samples become training datasets, which in turn yield models; those models are then deployed to handle user prompts and produce AI-generated outputs. Every artifact along this chain, such as the training datasets, the trained model, the user inputs/prompts, and the AI-generated content (AIGC), carries its own significance and sensitivities. Below, we examine why each of these data categories matters and why they must be safeguarded within a comprehensive protection framework.
Training Datasets: In the development phase of a model, large curated datasets serve as the fuel for learning. These collections of samples (images, text, audio, etc.) are often aggregated from diverse sources, which inherently raises the risk of including sensitive personal information or copyrighted material [28]. Protecting training data is therefore crucial for legal and ethical reasons: developers must respect privacy rights and intellectual property (e.g., avoiding unauthorized use of personal photos [29] or scraping of copyrighted text [30]) to comply with regulations and moral norms. Moreover, assembling and labeling high-quality datasets is expensive and time-consuming, making them commercially valuable assets for the organizations that curate them. Companies treat their data as proprietary know-how. For example, the success of ImageNet [31] spurred competitive advantages in computer vision and beyond [1]. If such a dataset were stolen or misused, the original collector could suffer a significant loss of investment and competitive edge. For all these reasons, training data merits strong protection. This includes measures to preserve privacy (e.g., removing or anonymizing personal identifiers [32]) and to enforce rights management, ensuring the data is not redistributed or used beyond its permitted scope [30, 33, 34]. In some cases, dataset owners even embed subtle markers (e.g., watermarks or fingerprints) into the data to enable traceability [35, 36, 37, 38, 39], so that if the data appears in an unauthorized model or repository, it can be identified and linked back to the source. Overall, securing the training dataset is the first pillar of data protection in the AI pipeline, preventing downstream issues that could arise from contaminated or compromised training information.
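To make the idea of embedding subtle markers concrete, the sketch below tags image samples with a least-significant-bit pattern before release. It is a deliberately naive illustration (the function names and the 64-bit tag are ours), far weaker than the backdoor- and fingerprint-based dataset watermarks cited above [30, 35, 36], but it shows how an owner-specific signal can ride along with released data without visibly changing it.

```python
import numpy as np

def embed_lsb_tag(image: np.ndarray, tag_bits: np.ndarray) -> np.ndarray:
    """Embed a short bit string into the least-significant bits of an 8-bit image.
    A toy owner mark: imperceptible, but also easy for an adversary to strip."""
    marked = image.copy().reshape(-1)
    marked[: len(tag_bits)] = (marked[: len(tag_bits)] & 0xFE) | tag_bits
    return marked.reshape(image.shape)

def read_lsb_tag(image: np.ndarray, n_bits: int) -> np.ndarray:
    return image.reshape(-1)[:n_bits] & 1

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # stand-in sample
tag = rng.integers(0, 2, size=64, dtype=np.uint8)              # owner-specific bits
marked = embed_lsb_tag(img, tag)
assert np.array_equal(read_lsb_tag(marked, 64), tag)           # tag is recoverable
assert np.max(np.abs(marked.astype(int) - img.astype(int))) <= 1  # visually identical
```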
Trained Models: Once an AI model has been trained, the model itself, encompassing its architectural configuration and learned parameters, becomes a model-centric data asset of immense value. Unlike raw training datasets, a trained model encapsulates generalizations drawn from potentially vast training data [6, 40, 41]. In effect, it is a compressed repository of that data's information. This gives the model significant commercial and strategic value. Organizations invest heavily in developing high-performing models, and the resulting structure and weights are often regarded as trade secrets or key intellectual property. For example, the parameters of a state-of-the-art language model or image recognition network can confer a competitive edge, making the model file itself as sensitive as any proprietary dataset. Protecting this trained model data is therefore paramount – if it is exposed or stolen, an adversary or competitor could reuse it, undermining the original owner's investment and advantage [42, 43, 44]. Accordingly, the trained model must be safeguarded much like any confidential dataset in the AI era, especially to preserve the commercial integrity of the model as a proprietary asset.

Deployment-integrated Data: Beyond the model's learned parameters, modern AI deployments usually incorporate additional auxiliary data that plays a crucial role in shaping their inference performance. These data are introduced at the deployment or runtime stage (after model training), and while not part of the model's weights, they effectively become extensions of the model's knowledge and policy. Two prominent examples are system prompts [45, 46] and external knowledge bases [47, 48] used in conversational AI and retrieval-augmented generation (RAG). Such deployment-integrated data elements are often invisible to end-users but are pivotal in determining how the model responds to inputs. Importantly, they may embed sensitive or proprietary information, and their compromise can be just as damaging as a leak of the model itself. Even though this data is not 'learned' during training, it must be protected because it directly influences the model's outputs and can inadvertently reveal protected information if misused. Specifically, system prompts are predefined directives or contexts given to a model at inference time, especially in large language model (LLM) deployments. For instance, a ChatGPT-like assistant might have a hidden prompt saying: 'You are an expert medical assistant. Always answer with evidence-based information and in a reassuring tone.' This prompt is not part of the model's parameters but is provided by the developers to guide the model's behavior and set boundaries on its responses. System prompts help ensure consistency, align the model with ethical or style guidelines, and can embed institutional knowledge and policies, or even achieve differentiated services through carefully designed prompts. Because they often encode rules and content that the provider considers sensitive (including possibly proprietary instructions or content examples), system prompts are sensitive deployment data [49, 50]. If an adversary were to discover the exact contents of these prompts, they might exploit them (e.g., by crafting inputs that override and manipulate the system instructions, or by developing competitive applications from the illegally acquired system prompts). External knowledge bases are specialized repositories of curated information, integrated at inference time to enhance the capability of AI models (especially LLMs) through a mechanism known as RAG. Unlike system prompts, external knowledge bases are extensive collections of documents or structured data that models dynamically retrieve and incorporate into their reasoning process to produce accurate, timely, and domain-specific responses. For example, medical assistants powered by retrieval-augmented large language models (RA-LLMs) might access confidential diagnostic records to inform clinical decisions, while financial agents leverage internal market databases for precise forecasting. Although external knowledge bases are not part of the trained model parameters, their content may be highly sensitive, often comprising proprietary or confidential information crucial to an organization's operational advantage [51]. Together, these examples highlight that deployment-integrated data, exemplified by system prompts and external knowledge bases, represents a critical yet often overlooked class of data assets whose protection is also indispensable in today's (generative) AI era.
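The sketch below illustrates why system prompts and external knowledge bases are sensitive even though they are not model weights: both end up inside the text the model actually consumes at inference time. The system prompt, the toy knowledge base, the string-similarity retriever, and the prompt format are all hypothetical stand-ins (real RAG pipelines use vector embeddings and a production retriever), but the assembly step is representative.

```python
from difflib import SequenceMatcher

# Hypothetical deployment-side assets: a hidden system prompt and a small
# private knowledge base. Neither is part of the model's weights, yet both
# become part of the input the model actually sees.
SYSTEM_PROMPT = "You are an expert medical assistant. Answer with evidence-based information."
KNOWLEDGE_BASE = [
    "Patient cohort A: imaging follow-up every 6 months.",
    "Internal guideline: flag fasting glucose above 126 mg/dL for review.",
    "Billing code mapping for outpatient consultations.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Toy retriever: rank documents by string similarity to the query.
    scored = sorted(
        docs,
        key=lambda d: SequenceMatcher(None, query.lower(), d.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_model_input(user_prompt: str) -> str:
    context = "\n".join(retrieve(user_prompt, KNOWLEDGE_BASE))
    # System prompt + retrieved snippets + user prompt form the full input;
    # leaking any part of it leaks deployment-integrated data.
    return f"[SYSTEM]\n{SYSTEM_PROMPT}\n[CONTEXT]\n{context}\n[USER]\n{user_prompt}"

print(build_model_input("What glucose level should be flagged?"))
```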
User's Input: When a model is deployed, new data enters the picture: the inputs (especially prompts) supplied by users during inference. These inputs can be as trivial as a search query or as sensitive as a detailed medical history or proprietary source code, depending on the application [21, 52]. In the AI era, particularly with the rise of accessible generative AI chatbots and assistants, users routinely provide personal or confidential data to AI systems in exchange for tailored outputs. It is imperative to protect this prompt data for privacy, security, and ethical reasons. From a privacy standpoint, any personal information in a user's query (names, addresses, health details, etc.) should be handled in compliance with data protection laws and the user's expectations of privacy. There have already been real-world incidents underscoring this need: for example, in 2023, Italy temporarily banned ChatGPT over concerns that the platform was not adequately protecting user-provided personal data [27]. Commercial confidentiality is equally at stake – consider an employee who uses an AI coding assistant and enters proprietary code as a prompt. If the AI service retains this input, it could lead to an unintended leak of trade secrets. This scenario is not hypothetical: employees at Samsung accidentally disclosed confidential source code and meeting notes by submitting them to ChatGPT, which retained those prompts on its servers [26]. To address such issues, techniques like robust access control [53, 54] and privacy guarantees [55, 56] must be in place at the inference stage. Ethically, users should have transparency and agency regarding their inputs – they should know if prompts will be logged or used for training, and ideally have the right to deletion (aligning with the 'right to be forgotten' in privacy regulations). Protecting users' input data not only complies with privacy laws but also builds trust. If users fear their prompts might be misused or leaked, they will be reluctant to adopt AI solutions, limiting the technology's benefits. Thus, safeguarding users' input is now a fundamental component of data protection in AI, aimed at preserving individual privacy and maintaining confidentiality in AI services [57].
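One practical, if partial, safeguard at this stage is to scrub obvious identifiers from prompts before they leave the user's environment. The sketch below is a minimal illustration with hand-written regular expressions; the patterns and placeholder labels are our own assumptions and would miss many identifier formats. Production systems pair dedicated PII detectors with the access-control and inference-time privacy techniques cited above [53, 54, 55, 56].

```python
import re

# Minimal client-side redaction sketch: strip obvious personal identifiers
# from a prompt before it is sent to an external AI service.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(prompt: str) -> str:
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"<{label.upper()}>", prompt)
    return prompt

raw = "Patient John, reachable at john.doe@example.com or +1 415 555 0101, SSN 123-45-6789."
print(redact(raw))
# -> Patient John, reachable at <EMAIL> or <PHONE>, SSN <SSN>.
```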
AI-generated Content (AIGC): The final category of protected data arises from the model's own outputs. In particular, instead of simple numbers, modern (generative) AI systems can produce rich content like paragraphs of text, realistic images, and code snippets [58, 59]. These AIGCs have already become valuable digital objects [60, 61]. While the standalone content of AIGC has inherent protection needs related to intellectual property, ownership, and potential sensitivities [62, 63, 64, 65], our primary focus here aligns with the model-centric perspective: protecting AIGC in its role as a data asset within the (generative) AI ecosystem. Given its high fidelity and utility, AIGC is increasingly leveraged not just as a final product, but also as data that feeds back into the AI cycle. For example, AIGC is valuable for creating large-scale synthetic datasets, for knowledge distillation [66], or as deployment-integrated data (e.g., instances used in retrieval-augmented generation). Protecting AIGC in this capacity is therefore crucial. This can involve ensuring traceability to understand its provenance if used for training [37], or employing mechanisms akin to non-usability or access control to prevent unauthorized reuse for training competing models. Our framework thus emphasizes the governance needed when this generated content itself becomes data for subsequent model training or inference, highlighting its flow within the broader (generative) AI model's lifecycle.

In conclusion, data protection in the (generative) AI era must extend across the model's entire lifecycle. From the raw training dataset, to the trained model, to the prompts it processes and the content it generates, each component contains information that could be sensitive, proprietary, or otherwise regulated. Notably, each type of data can be copied or transmitted independently – one can leak a dataset, steal a model's weights, expose a user's prompt, or misappropriate an AI output, which is why all of them must be considered in a holistic protection strategy. By clearly identifying these categories, we can align specific protection goals and techniques to each: e.g., privacy-preservation for personal data in training sets and prompts, traceability mechanisms for outputs, and so forth. The following sections will build on this lifecycle view to explore how a hierarchical taxonomy can collectively safeguard the myriad data assets in the AI era, and how emerging data protection techniques map onto each protection level.

B. Towards the Hierarchical Taxonomy of Data Protection

Taxonomy Overview. AI's data-protection challenges span a spectrum from extremely strict control of data to more permissive use with after-the-fact safeguards. To make sense of this spectrum, we propose a four-level hierarchical taxonomy of data protection: data non-usability, data privacy-preservation, data traceability, and data deletability. Each successive level in this hierarchy relaxes the protections on data slightly, trading off some degree of control for greater data utility. At the highest, most restrictive end, data non-usability ensures that certain data cannot be used for model training or inference at all, thereby offering maximum protection by completely sacrificing that data's utility. Stepping down one level, data privacy-preservation permits data to be employed in model development or inference while safeguarding sensitive information – a compromise that preserves some utility but enforces confidentiality of personal or private attributes. Next, data traceability allows nearly full use of data for AI models, yet embeds mechanisms to track the data's origin, usage, and modifications (e.g., to detect if data has been misused), thereby enabling transparency and accountability with only minimal impact on the data's functionality. Finally, at the most permissive level, data deletability imposes nearly no restriction on a dataset's initial use for training and inference, instead requiring that the data's influence can later be removed from the model upon the user's request. This last level emphasizes post-hoc control (aligning with 'right to be forgotten' principles) without impeding the data's immediate usefulness. Figure 2 illustrates this hierarchy of protection levels, which forms a clear gradient from strong protection/low utility at Level 1 to low protection/high utility at Level 4.

Level 1: Data Non-usability. Data non-usability encompasses methods that intentionally render certain data entirely useless for AI applications, including training and inference, even if that data is publicly available. In essence, it ensures that specified data cannot contribute to model learning or prediction whatsoever. This is crucial in scenarios where individuals or organizations demand strict control over how their data is utilized by AI systems. For instance, authors and journalists have voiced objections to their articles or books being used to train language models without consent [67]; similarly, visual artists often share their works online but may strongly oppose using AI models to transfer their style to others
during inference [68]. By completely precluding any use of the data in model development, data non-usability offers the most stringent level of protection in our taxonomy – achieving maximal data control at the expense of all potential utility.

Fig. 2: Hierarchical taxonomy of data protection in the (generative) AI era. This taxonomy comprises four distinct protection levels, each representing a trade-off between data usability and the degree of protection provided. At the most stringent level, data non-usability completely restricts the use of specific data in model training and inference, thus offering maximal protection at the cost of total data utility. The next level, data privacy-preservation, allows data use under stringent privacy safeguards, enabling some practical utility while protecting sensitive or private attributes. Moving further, data traceability permits extensive data usage but integrates methods to track data origins and modifications, supporting transparency and accountability with minimal functional interference. At the most permissive level, data deletability places no initial restriction on data usage but mandates mechanisms for fully removing data's influence from trained models post hoc, aligned with principles such as the 'right to be forgotten'. This hierarchical taxonomy helps disambiguate the scope of data protective measures and provides a structured lens to evaluate and further design related regulations in protecting data in the (generative) AI era.

Level 2: Data Privacy-preservation. Data privacy-preservation focuses on protecting sensitive information within data while still allowing the data to be used for developing AI models or producing meaningful responses/inferences [69, 70]. This approach is especially critical in sectors like healthcare, social media, and online services—domains where large volumes of personal data (e.g., age, gender, location, or purchasing behavior) are routinely collected and analyzed [71]. For instance, a hospital or research institute might analyze patient records to train a disease-detection model, but it must do so without exposing any individual's identity or private details. Users likewise do not want to leak private information through the prompts they submit when chatting with AI chatbots. Ensuring privacy is not only a legal obligation for data handlers, but also a crucial measure to prevent misuse of personal information and to maintain public trust in AI-driven technologies and applications. In practice, privacy-preserving measures mean that a significant portion of each dataset (namely, the privacy-sensitive attributes) is withheld, masked, or otherwise not directly accessible during training or inference [72]. Consequently, data privacy-preservation still represents a high level of protection for the data, second only to complete non-usability in its restrictiveness, while enabling much more data utility than the latter.

Level 3: Data Traceability. Data traceability refers to the ability to track the origin, history, and influence of data as it is used in AI applications during training and inference. This capability allows stakeholders to audit and verify data usage. For example, an individual might want to check whether their personal data was incorporated into a model for training or generating works of art without permission, and a model developer might need to detect if a training dataset or a pre-trained model has been tampered with or misused, and to avoid potential backdoors in them [73, 74, 75]. By enabling such oversight, traceability measures greatly enhance transparency and accountability in how data fuels AI systems. Importantly, implementing traceability need not significantly hinder the data's usefulness for modeling: the data remains almost fully available for training or inference, with at most slight modifications introduced to embed identifiers (e.g., imperceptible watermarks or metadata tags) that enable later tracking [37, 76]. Thus, data traceability provides a more moderate level of protection – less restrictive than privacy-preservation since it leaves the data content largely intact, but still offers an important safeguard through post hoc auditability.
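As a toy illustration of the kind of audit Level 3 enables, the sketch below applies a loss-threshold heuristic in the spirit of membership inference (treated more rigorously in the approaches discussed in Section III): samples on which a model is unusually confident are flagged as likely training members. The synthetic data, the scikit-learn classifier, and the threshold choice are illustrative stand-ins, and this heuristic is far less reliable than the watermark- or tag-based tracking cited above [37, 76].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 'members' were used for training, 'outsiders' were not.
X_members = rng.normal(0, 1, size=(200, 10))
y_members = (X_members[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_outside = rng.normal(0, 1, size=(200, 10))
y_outside = (X_outside[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_members, y_members)

def per_sample_loss(model, X, y):
    # Cross-entropy of the correct label for each sample.
    p = np.clip(model.predict_proba(X)[np.arange(len(y)), y], 1e-12, 1.0)
    return -np.log(p)

# Heuristic audit: unusually low loss suggests the sample was in the training set.
threshold = np.median(per_sample_loss(model, X_outside, y_outside))
flagged = per_sample_loss(model, X_members, y_members) < threshold
print(f"{flagged.mean():.0%} of true training samples flagged as likely members")
```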
Level 4: Data Deletability. Data deletability is the capacity to completely remove a specific piece of data and its influence from a trained (AI) model. While deleting a data file from a storage database is trivial, eliminating that data's imprint on an AI model is a far more challenging task [77]. This level of protection ensures that if a particular data sample must be purged – for example, because it is no longer needed or because the individual who provided the data withdraws consent – there is a mechanism to do so cleanly and effectively. Such capability is particularly pertinent to user rights and data governance frameworks (e.g., complying with the 'right to be forgotten' in GDPR regulations [13]). Notably, enabling deletability does not require compromising the data's utility during initial model training; the data can be used to its full extent upfront, and the protective measure comes into play only later, if and when deletion is required. Because this approach imposes no upfront usage restrictions, it offers the lowest immediate level of protection among the four levels – instead, its strength lies in allowing retrospective removal. In summary, data deletability prioritizes giving data owners ultimate control after model development, even though it provides only minimal protection at the time of data use.

III. TECHNIQUES FOR DATA PROTECTION

To translate the conceptual taxonomy of data protection into practice, this section briefly describes a range of design principles and corresponding representative techniques tailored to the four protection levels introduced above. Figure 3 illustrates the design principles of techniques for each level.

Fig. 3: Design principles of techniques for each level. Level 1. Non-usability: Encryption and (fine-grained) authorization confine direct data access solely to authorized parties, while techniques such as unlearnable examples and non-transferable learning disable data exploitation in unauthorized domains by mitigating particular data features, thereby achieving non-usability indirectly; Level 2. Privacy-preservation: These techniques generally fall into two main categories: tampering-based and non-tampering-based methods. The former perturbs private portions of the data (occasionally at the cost of tampering with some non-private content), whereas the latter prevents direct access without data modification while preserving data utility; Level 3. Traceability: Traceability techniques intrusively attach ownership signals (i.e., watermarks) to original data or directly infer provenance and potential modifications non-intrusively by analyzing data's intrinsic information; Level 4. Deletability: The influence of protected data (denoted by the purple circle in the sub-figure) can be removed either by excising the data and rebuilding the AI model from scratch to directly change the decision surface (marked by the black dotted line) or, more efficiently, by targeted unlearning that erases its influence (on the surface) without full model reconstruction, thereby ensuring data deletability.

Techniques for Non-usability. Non-usability encompasses strategies that block any unauthorized party from using or even accessing protected data. Arguably, the most direct method is encryption [78, 79, 80]: by securing data with strong cryptographic keys, the information remains unintelligible without proper authorization. A complementary line of defense ensures that the data cannot be exploited even if an adversary obtains it. For example, authorization mechanisms, including fine-grained data-access control [53, 54, 81] and model-level authorization [82, 83], allow only approved entities to obtain (correct) model outputs. Unauthorized requests receive degraded or nonsensical responses. Beyond controlling access or general utility, a further class of techniques makes the data unusable in unauthorized domains: unlearnable examples [33, 68] embed imperceptible perturbations that frustrate a model's ability to extract 'task-relevant' features, whereas non-transferable learning [84, 85] deliberately suppresses 'task-agnostic' features so that any knowledge gleaned cannot generalize to unintended tasks. Together, these techniques align with a 'secure-by-design' philosophy: the data would be essentially non-usable and remain protected even in worst-case scenarios.
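To illustrate the unlearnable-examples idea named above, the sketch below optimizes an error-minimizing perturbation against a surrogate model so that the released samples carry almost no learnable signal. It follows the spirit of [33] only loosely: a faithful implementation alternates surrogate and noise updates and works on real images, and the surrogate architecture, perturbation budget, and step counts here are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(64, 3 * 32 * 32)           # stand-in images (flattened)
y = torch.randint(0, 10, (64,))           # stand-in labels
surrogate = nn.Sequential(nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))

eps = 8 / 255                              # keep the perturbation imperceptible
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.SGD([delta], lr=0.1)

# Error-minimizing noise: optimize the perturbation so the surrogate's loss on
# (x + delta, y) becomes trivially small, leaving little useful signal to learn.
for _ in range(50):
    loss = nn.functional.cross_entropy(surrogate((x + delta).clamp(0, 1)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    delta.data.clamp_(-eps, eps)

protected = (x + delta.detach()).clamp(0, 1)   # release this version instead of x
```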
Techniques for Privacy-preservation. Privacy-preservation techniques enable the beneficial use of data for AI development while shielding sensitive information. They fall into tampering-based and non-tampering-based categories. In tampering-based approaches, the data themselves (at least their private components) are modified so that private/sensitive attributes become indistinguishable or masked [86]. For example, early schemes such as k-anonymity and L-diversity generalize or suppress identifying details [32, 87], though they can reduce utility and remain vulnerable to linkage attacks [88]; differential privacy [69, 89] provides a stronger guarantee by injecting carefully calibrated noise into data, intermediate computations, or outputs [29, 90]. The added randomness masks each individual's contribution while remaining versatile enough for model training, synthetic-data generation, and attack mitigation [22, 91]. In contrast, non-tampering-based techniques avoid modifying raw data, seeking privacy protection with minimal impact on data utility. For example, homomorphic encryption [16, 92] enables computations on encrypted inputs, eliminating exposure during processing; privacy-preserving distributed learning, such as federated learning [93] and split learning [94], keeps data local while sharing only aggregated model updates [93, 95]. In this way, global models benefit from diverse datasets without centralizing sensitive records.
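As a concrete example of calibrated noise injection, the sketch below releases a differentially private mean with the Laplace mechanism [69]: values are clipped to bound the sensitivity of the statistic, and the noise scale is the sensitivity divided by the privacy budget epsilon. The data and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)   # toy sensitive attribute

def dp_mean(values, lower, upper, epsilon, rng):
    """Release a differentially private mean via the Laplace mechanism.
    Clipping bounds the sensitivity of the average to (upper - lower) / n."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean :", ages.mean())
print("eps = 1.0 :", dp_mean(ages, 18, 90, 1.0, rng))
print("eps = 0.1 :", dp_mean(ages, 18, 90, 0.1, rng))  # stronger privacy, more noise
```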
Techniques for Traceability. Traceability seeks to record and verify where data (including models) originate, how they are used, and whether they have been altered. Existing approaches can be broadly categorized into intrusive and non-intrusive methods. Intrusive methods embed explicit and external identifiers (dubbed 'watermarks') into the data asset [96, 97]. For example, digital watermarking adds hidden yet robust signatures to datasets [30, 35], model parameters [98, 99], or prompts [49]. These robust watermarks survive ordinary use and prove ownership. In contrast, fragile watermarks [100, 101] are deliberately brittle, breaking when tampered with and thereby signaling potential modification. Non-intrusive methods, on the other hand, enable traceability by analyzing a data asset's intrinsic information or detecting modifications without altering the underlying asset. For example, membership inference [102, 103] evaluates whether a data point was in a model's training set; model fingerprinting [104, 105] probes a model with crafted inputs to reveal its identity; cryptographic hashing [106, 107] produces unique fingerprints that change upon any bit-level alteration, while blockchain ledgers [108, 109] maintain an immutable, time-stamped record of data states, making secret edits computationally infeasible.
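The sketch below illustrates the non-intrusive end of this spectrum: recording SHA-256 fingerprints of pipeline artifacts in an append-only log so that any later bit-level modification is detectable. The artifact names and the in-memory list standing in for a ledger are illustrative; a deployment would hash real files and anchor the records in tamper-evident storage such as the blockchain ledgers cited above [108, 109].

```python
import hashlib
import json
import time

def fingerprint(payload: bytes) -> str:
    """SHA-256 digest: any bit-level change to the artifact changes the digest."""
    return hashlib.sha256(payload).hexdigest()

# Fingerprint two artifacts of the AI pipeline: a (toy) dataset file and a
# serialized model. In practice these would be real file contents.
dataset_bytes = b"sample-001,label-A\nsample-002,label-B\n"
model_bytes = b"\x00" * 1024  # stand-in for serialized weights

ledger = []  # append-only log; a blockchain ledger would make it immutable
for name, blob in [("dataset-v1", dataset_bytes), ("model-v1", model_bytes)]:
    ledger.append({"artifact": name, "sha256": fingerprint(blob), "timestamp": time.time()})

print(json.dumps(ledger, indent=2))

# Later audit: recompute and compare. A mismatch reveals undisclosed modification.
tampered = dataset_bytes.replace(b"label-A", b"label-C")
assert fingerprint(tampered) != ledger[0]["sha256"]
```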
Techniques for Deletability. Ensuring that specific data and their influence can be removed from AI models underpins rights such as GDPR's 'right to be forgotten'. The most straightforward, yet costly, route is to directly delete the data and rebuild the AI model from scratch [77, 110]. A more efficient alternative is offered by unlearning techniques that focus on erasing the data's influence rather than merely its stored content. These algorithms aim to approximate the model state that would have arisen had the targeted data never been used, thereby avoiding the significant expense and time required for complete retraining or rebuilding [111, 112, 113].
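The sketch below shows the retraining route in a cheaper, sharded form (in the spirit of sharded exact-unlearning schemes): each sub-model sees only one disjoint shard, so honoring a deletion request means retraining a single shard rather than the whole ensemble. The data, shard count, and scikit-learn classifier are illustrative stand-ins, and the approximate unlearning algorithms cited above [111, 112, 113] avoid even this partial retraining.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data into disjoint shards and train one sub-model per shard.
n_shards = 5
shards = np.array_split(np.arange(len(X)), n_shards)
models = [LogisticRegression().fit(X[idx], y[idx]) for idx in shards]

def predict(x):
    # Majority vote over the shard models.
    votes = np.stack([m.predict(x) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

def delete(sample_id):
    """Honor a deletion request by retraining only the shard that held the sample."""
    for s, idx in enumerate(shards):
        if sample_id in idx:
            keep = idx[idx != sample_id]
            shards[s] = keep
            models[s] = LogisticRegression().fit(X[keep], y[keep])
            return s

print("predictions before deletion:", predict(X[:3]))
affected = delete(sample_id=42)
print(f"retrained shard {affected}; the other {n_shards - 1} shards were untouched")
```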
IV. REGULATIONS ON DATA PROTECTION

In the era of (generative) AI, regulation plays a foundational role in safeguarding data integrity, privacy, and accountability. Unlike traditional data systems, where protection focuses on static storage and access control, AI systems rely on dynamic, model-centric data use: once data is absorbed into a model's parameters, it may persist, influence downstream outputs, and defy straightforward removal. Legal frameworks thus serve as critical instruments to constrain unauthorized use, enforce privacy-preservation, ensure traceability, and empower users with redress and deletion rights. As shown in Table I, there are already some pioneering related regulations. For instance, many privacy laws operationalize L1 (Non-usability) by prohibiting the use of sensitive or unlawfully collected data for AI training altogether. Similarly, L2 (Privacy-preservation) is widely mandated through consent requirements, anonymization, and processing limits. Emerging regulations now also touch on L3 (Traceability)—requiring documentation of data provenance and logging of model operations—and even aspire to L4 (Deletability), allowing individuals to remove their data's influence post-training. As the diffusion of data across AI pipelines complicates direct user control, regulation remains the strongest binding force for aligning model development with ethical and societal norms.

TABLE I: Representative regulations of data protection in the (generative) AI era. The last column shows the levels of data protection covered by each regulation (N: non-usability, P: privacy-preservation, T: traceability, D: deletability).

Country/Region | Regulation Name | Year | Protection Level(s)
USA | California Consumer Privacy Act [114] | 2018 | N, P, T, D
USA | Federal Zero Trust Data Security Guide [115] | 2024 | N, P, T
EU | General Data Protection Regulation [13] | 2016 | N, P, T, D
EU | Ethics Guidelines for Trustworthy AI [116] | 2019 | N, P, T
EU | EU AI Act [14] | 2024 | N, P, T, D
EU | General-Purpose AI Code of Practice (Draft) [117] | 2025 | N, P, T
China | Cybersecurity Law of the PRC [118] | 2016 | N, P, T
China | Data Security Law of the PRC [119] | 2021 | N, P, T
China | Personal Information Protection Law of the PRC [120] | 2021 | N, P, T
China | Administrative Measures for Generative Artificial Intelligence Services [121] | 2023 | N, P, T
China | Action Plan of the Development of Trustworthy Data Space [122] | 2024 | N, P, T
China | Implementation Plan on Improving Data Circulation Security Governance to Better Promote the Marketization and Valorization of Data Elements [123] | 2025 | N, P, T
China | Methods for Identifying Synthetic Content Generated by Artificial Intelligence [124] | 2025 | T
Others | Artificial Intelligence Mission Austria 2030 [125] | 2019 | N, P, T
Others | Artificial Intelligence and Data Act [126] | 2022 | P, T
Others | Brazilian AI Regulation [127] | 2023 | N, P, T
Others | Enhancing Access to and Sharing of Data in the Age of Artificial Intelligence [128] | 2025 | N, P
Others | Joint Statement on Building Trustworthy Data Governance Frameworks to Encourage Development of Innovative and Privacy-Protective AI [129] | 2025 | N, P, T

Globally, several regulatory regimes have responded to this challenge with varying scope and emphasis. The European Union's General Data Protection Regulation (GDPR) remains the archetype of a rights-based data framework, offering expansive protections including the 'right to erasure' and strict processing limitations on personal data [13]. These provisions collectively enforce L1–L4 protections robustly and are often
on protection because establishing fundamental controls over data usage and provenance is often a prerequisite for tackling complex safety issues effectively. Nonetheless, recognizing the deep interplay is essential. Robust data protection mechanisms, particularly those ensuring traceability and controlled access, provide the foundational transparency and oversight needed to audit systems for safety concerns, attribute harmful outputs, and enforce safety-related policies. Future frameworks must holistically consider both aspects to build truly trustworthy (generative) AI models and systems.

B. Emerging Challenges brought by AIGC in Data Protection

The rise of AI-generated content (AIGC) powered by generative models introduces profound new challenges in data protection. In particular, many existing legal systems, including those in the US and EU, struggle to grant copyright protection to purely AI-generated content because it often lacks the requisite human authorship [62]. This leaves the ownership and copyrights associated with vast amounts of potentially valuable AIGC in a state of ambiguity. Who owns the novel image created by a diffusion model, or the code snippet generated by an LLM?

Rather than treating AIGC purely as content itself, our model-centric data protection perspective highlights further complexities. When AIGC is itself used as data – for instance, synthetic data for training new models, knowledge distillation [140], or as input for retrieval-augmented generation systems – its copyright status becomes even more convoluted. Does the copyright (or lack thereof) of the original data used to train the generative model influence the status of the synthetic data? If a model distills knowledge from copyrighted data, does the resulting trained model (as a compact representation of information contained in these data) or the data it generates inherit restrictions? This debate touches upon the core definition of data rights: Are they solely tied to the direct expression of content, or do they extend to the statistical patterns, styles, and knowledge implicitly captured and transferable by a model [68]? The potential for AI models (especially generative ones) to launder copyrighted information into seemingly novel, unprotected AIGC is a significant concern.

Even amidst this legal uncertainty, our proposed data protection framework offers valuable tools. L3 (Traceability), through techniques like watermarking or fingerprinting [49, 60], can help establish the provenance of AIGC, potentially linking it back to specific models or even training datasets. This provides crucial evidence for detecting plagiarism or unauthorized use of protected styles or content, even if the AIGC itself isn't copyrightable [37]. Furthermore, L1 (Non-usability) techniques, such as data cloaking methods designed to disrupt style mimicry [68], offer technical safeguards for creators where legal protections are currently inadequate. These techniques and tools allow stakeholders to exert a degree of control over how their data or derived AI models influence future generations, shifting focus from solely legal ownership to technical prevention of undesired use.

Ultimately, these complex questions surrounding AIGC and copyright require urgent attention from policymakers and legal scholars. Future legislation must clarify the status of AIGC, define the boundaries of rights associated with training data and model-derived knowledge, and establish clear rules for the use and attribution of generated content. A specific protection framework like ours can inform these developments by highlighting what we need to protect and even what is technically feasible in terms of control and transparency.

C. Challenges of Cross-Jurisdictional Data Protection

The inherently global nature of the AI ecosystem presents significant hurdles for consistent data protection. The lifecycle of AI models, from data collection via web scraping or distributed sensors, annotation by global crowdsourcing platforms, training on cloud servers located potentially anywhere, to deployment for a worldwide user base, routinely crosses multiple national borders. This immediately runs into the fragmented and vague landscape of international data protection regulations.

Currently, different jurisdictions take markedly different approaches. The European Union's GDPR [13] imposes strict, rights-based obligations with extra-territorial reach. The US employs a sectoral approach supplemented by state-level laws like the CCPA [114]. China's PIPL [120] emphasizes state oversight and data localization requirements. Other regions may have nascent or less comprehensive regulations [128]. This regulatory patchwork creates significant compliance challenges for developers and opens avenues for exploitation. For example, data scraped in a jurisdiction with lenient rules might be used to train an AI model deployed in a region with strict privacy laws, creating legal jeopardy. Conversely, malicious actors might deliberately host AI models trained on improperly acquired data in jurisdictions with weak enforcement capabilities, undermining protection efforts globally.

Addressing these cross-jurisdictional challenges requires multifaceted solutions. On the policy front, greater international cooperation towards regulatory harmonization or establishing common minimum standards (perhaps through bodies like the OECD or UN initiatives) is desirable, although politically complex [128]. Interoperability frameworks that allow different regulatory systems to recognize and interact with each other could offer a more pragmatic path than full unification. From a technical perspective, one approach is to adopt the strictest standard (e.g., GDPR compliance) globally, but this often imposes excessive costs and sacrifices utility unnecessarily in many contexts. A more promising direction lies in developing adaptive data protection techniques. Future systems could potentially leverage context-aware mechanisms, perhaps inspired by meta-learning or zero-shot adaptation principles [141], to dynamically adjust protection levels (e.g., the type of watermarking, the rigor of privacy mechanisms, the implementation of deletion) based on the legal requirements of the data's origin, the user's location, or the operational jurisdiction. However, realizing such adaptive systems effectively still requires clear regulatory signaling and international collaboration on technical standards. Arguably, our hierarchical taxonomy can serve as a foundational conceptual framework – a common language – to facilitate these multi-stakeholder discussions, allowing different jurisdictions to map their specific requirements onto shared levels of protection, thereby aiding both policy alignment and the development of interoperable technical solutions.
[20] S. Ranjbar Alvar, M. Akbari, D. Yue, and Y. Zhang, "Nft-based data marketplace with digital watermarking," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2023, pp. 4756–4767.
[21] M. Rigaki and S. Garcia, "A survey of privacy attacks in machine learning," ACM Computing Surveys, vol. 56, no. 4, pp. 1–34, 2023.
[22] A. Ziller, T. T. Mueller, S. Stieger, L. F. Feiner, J. Brandt, R. Braren, D. Rueckert, and G. Kaissis, "Reconciling privacy and accuracy in ai for medical imaging," Nature Machine Intelligence, vol. 6, no. 7, pp. 764–774, 2024.
[23] S. Mittal, K. Thakral, R. Singh, M. Vatsa, T. Glaser, C. Canton Ferrer, and T. Hassner, "On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare," Nature Machine Intelligence, vol. 6, no. 8, pp. 936–949, 2024.
[24] Y. He, H. She, X. Qian, X. Zheng, Z. Chen, Z. Qin, and L. Cavallaro, "On benchmarking code llms for android malware analysis," in ACM SIGSOFT International Symposium on Software Testing and Analysis Workshop, 2025.
[25] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Annual Conference on Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851.
[26] S. Ray, "Samsung bans chatgpt among employees after sensitive code leak," 2023. [Online]. Available: https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-chatbots-for-employees-after-sensitive-code-leak/
[27] A. Satariano, "Chatgpt is banned in italy over privacy concerns," 2023. [Online]. Available: https://www.nytimes.com/2023/03/31/technology/chatgpt-italy-ban.html
[28] S. Longpre, N. Singh, M. Cherep, K. Tiwary, J. Materzynska, W. Brannon, R. Mahari, N. Obeng-Marnu, M. Dey, M. Hamdy et al., "Bridging the data provenance gap across text, speech and video," in International Conference on Learning Representations, 2025.
[29] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
[30] Y. Li, M. Zhu, X. Yang, Y. Jiang, T. Wei, and S.-T. Xia, "Black-box dataset ownership verification via backdoor watermarking," IEEE Transactions on Information Forensics and Security, vol. 18, pp. 2318–2332, 2023.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
[32] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[33] H. Huang, X. Ma, S. M. Erfani, J. Bailey, and Y. Wang, "Unlearnable examples: Making personal data unexploitable," in International Conference on Learning Representations, 2021.
[34] L. Du, X. Zhou, M. Chen, C. Zhang, Z. Su, P. Cheng, J. Chen, and Z. Zhang, "Sok: Dataset copyright auditing in machine learning systems," in IEEE Symposium on Security and Privacy, 2025.
[35] Y. Li, Y. Bai, Y. Jiang, Y. Yang, S.-T. Xia, and B. Li, "Untargeted backdoor watermark: Towards harmless and stealthy dataset copyright protection," in Annual Conference on Neural Information Processing Systems, vol. 35, 2022, pp. 13238–13250.
[36] J. Guo, Y. Li, R. Chen, Y. Wu, C. Liu, and H. Huang, "Zeromark: Towards dataset ownership verification without disclosing watermarks," in Annual Conference on Neural Information Processing Systems, vol. 37, 2024, pp. 120468–120500.
[37] B. Li, Y. Wei, Y. Fu, Z. Wang, Y. Li, J. Zhang, R. Wang, and T. Zhang, "Towards reliable verification of unauthorized data usage in personalized text-to-image diffusion models," in IEEE Symposium on Security and Privacy, 2025.
[38] C. Zhu, J. Galjaard, P.-Y. Chen, and L. Chen, "Duwak: Dual watermarks in large language models," in Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 11416–11436.
[39] C. Zhu, J. Tang, J. M. Galjaard, P.-Y. Chen, R. Birke, C. Bos, L. Y. Chen et al., "Tabwak: A watermark for tabular diffusion models," in International Conference on Learning Representations, 2025.
[40] M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou, "Optimizing generative ai by backpropagating language model feedback," Nature, vol. 639, no. 8055, pp. 609–616, 2025.
[41] Y. Li, L. Zhu, X. Jia, Y. Bai, Y. Jiang, S.-T. Xia, X. Cao, and K. Ren, "Move: Effective and harmless ownership verification via embedded external features," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[42] D. Oliynyk, R. Mayer, and A. Rauber, "I know what you trained last summer: A survey on stealing machine learning models and defences," ACM Computing Surveys, vol. 55, no. 14s, pp. 1–41, 2023.
[43] S. Shao, Y. Li, H. Yao, Y. He, Z. Qin, and K. Ren, "Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution," in Network and Distributed System Security Symposium, 2025.
[44] Z. Wang, J. Guo, J. Zhu, Y. Li, H. Huang, M. Chen, and Z. Tu, "Sleepermark: Towards robust watermark against fine-tuning text-to-image diffusion models," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[45] Z. Wu, O. Zhang, X. Wang, L. Fu, H. Zhao, J. Wang, H. Du, D. Jiang, Y. Deng, D. Cao et al., "Leveraging language model for advanced multiproperty molecular optimization via prompt engineering," Nature Machine Intelligence, vol. 6, no. 11, pp. 1359–1369, 2024.
[46] M. P. Polak and D. Morgan, "Extracting accurate materials data from research papers with conversational language models and prompt engineering," Nature Communications, vol. 15, no. 1, p. 1569, 2024.
[47] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive nlp tasks," in Annual Conference on Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474.
[48] A. Salemi and H. Zamani, "Evaluating retrieval quality in retrieval-augmented generation," in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2395–2400.
[49] H. Yao, J. Lou, Z. Qin, and K. Ren, "Promptcare: Prompt copyright protection by watermark injection and verification," in IEEE Symposium on Security and Privacy. IEEE, 2024, pp. 845–861.
[50] X. Shen, Y. Qu, M. Backes, and Y. Zhang, "Prompt stealing attacks against text-to-image generation models," in USENIX Security Symposium, 2024, pp. 5823–5840.
[51] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, "A survey on rag meeting llms: Towards retrieval-augmented large language models," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2024, pp. 6491–6501.
[52] B. Liu, M. Ding, S. Shaham, W. Rahayu, F. Farokhi, and Z. Lin, "When machine learning meets privacy: A survey and outlook," ACM Computing Surveys, vol. 54, no. 2, pp. 1–36, 2021.
[53] S. Yu, C. Wang, K. Ren, and W. Lou, "Achieving secure, scalable, and fine-grained data access control in cloud computing," in IEEE International Conference on Computer Communications. IEEE, 2010, pp. 1–9.
[54] D. Han, Y. Zhu, D. Li, W. Liang, A. Souri, and K.-C. Li, "A blockchain-based auditable access control system for private data in service-centric iot environments," IEEE Transactions on Industrial Informatics, vol. 18, no. 5, pp. 3530–3540, 2021.
[55] Z. Huang, W.-j. Lu, C. Hong, and J. Ding, "Cheetah: Lean and fast secure two-party deep neural network inference," in USENIX Security Symposium, 2022, pp. 809–826.
[56] J. Zhang, X. Yang, L. He, K. Chen, W.-j. Lu, Y. Wang, X. Hou, J. Liu, K. Ren, and X. Yang, "Secure transformer inference made non-interactive," in Network and Distributed System Security Symposium, 2025.
[57] W. Qu, Y. Zhou, Y. Wu, T. Xiao, B. Yuan, Y. Li, and J. Zhang, "Prompt inversion attack against collaborative inference of large language models," in IEEE Symposium on Security and Privacy, 2025.
[58] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., "Palm-e: An embodied multimodal language model," in International Conference on Machine Learning. PMLR, 2023, pp. 8469–8488.
[59] R. Haase, "Towards transparency and knowledge exchange in ai-assisted data analysis code generation," Nature Computational Science, pp. 1–2, 2025.
[60] K. Ren, Z. Yang, L. Lu, J. Liu, Y. Li, J. Wan, X. Zhao, X. Feng, and S. Shao, "Sok: On the role and future of aigc watermarking in the era of gen-ai," arXiv preprint arXiv:2411.11478, 2024.
[61] X. Zhao, S. Gunn, M. Christ, J. Fairoze, A. Fabrega, N. Carlini, S. Garg, S. Hong, M. Nasr, F. Tramer et al., "Sok: Watermarking for ai-generated content," in IEEE Symposium on Security and Privacy, 2025.
[62] P. Samuelson, "Generative ai meets copyright," Science, vol. 381, no. 6654, pp. 158–161, 2023.
[63] X. Hu, P.-Y. Chen, and T.-Y. Ho, "Radar: Robust ai-text detection via adversarial learning," in Annual Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 15077–15095.
[63] X. Hu, P.-Y. Chen, and T.-Y. Ho, “Radar: Robust ai-text detection via adversarial learning,” in Annual Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 15077–15095.
[64] Z. He, P.-Y. Chen, and T.-Y. Ho, “Rigid: A training-free and model-agnostic framework for robust ai-generated image detection,” arXiv preprint arXiv:2405.20112, 2024.
[65] X. Li, P.-Y. Chen, and W. Wei, “Where are we in audio deepfake detection? a systematic analysis over generative and detection models,” ACM Transactions on Internet Technology, 2025.
[66] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
[67] S. Abdali, R. Anarfi, C. Barberan, and J. He, “Decoding the ai pen: Techniques and challenges in detecting ai-generated text,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6428–6436.
[68] S. Shan, J. Cryan, E. Wenger, H. Zheng, R. Hanocka, and B. Y. Zhao, “Glaze: Protecting artists from style mimicry by text-to-image models,” in USENIX Security Symposium, 2023, pp. 2187–2204.
[69] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
[70] X. Li, K. Li, Y. Zheng, C. Yan, X. Ji, and W. Xu, “Safeear: Content privacy-preserving audio deepfake detection,” in ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 3585–3599.
[71] C. Meurisch and M. Mühlhäuser, “Data protection in ai services: A survey,” ACM Computing Surveys, vol. 54, no. 2, pp. 1–38, 2021.
[72] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1310–1321.
[73] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 1, pp. 5–22, 2022.
[74] D. A. Alber, Z. Yang, A. Alyakin, E. Yang, S. Rai, A. A. Valliani, J. Zhang, G. R. Rosenbaum, A. K. Amend-Thomas, D. B. Kurland et al., “Medical large language models are vulnerable to data-poisoning attacks,” Nature Medicine, pp. 1–9, 2025.
[75] Y. Chen, S. Shao, E. Huang, Y. Li, P.-Y. Chen, Z. Qin, and K. Ren, “Refine: Inversion-free backdoor defense via model reprogramming,” in International Conference on Learning Representations, 2025.
[76] F. Liu, H. Luo, Y. Li, P. Torr, and J. Gu, “Which model generated this image? a model-agnostic approach for origin attribution,” in European Conference on Computer Vision, 2024, pp. 282–301.
[77] L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot, “Machine unlearning,” in IEEE Symposium on Security and Privacy. IEEE, 2021, pp. 141–159.
[78] C. E. Shannon, “Communication theory of secrecy systems,” The Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, 1949.
[79] W. Diffie and M. E. Hellman, “New directions in cryptography,” IEEE Transactions on Information Theory, vol. 22, no. 6, 1976.
[80] O. Purcell, J. Wang, P. Siuti, and T. K. Lu, “Encryption and steganography of synthetic gene circuits,” Nature Communications, vol. 9, no. 1, p. 4942, 2018.
[81] K. Yang, X. Jia, K. Ren, B. Zhang, and R. Xie, “Dac-macs: Effective data access control for multiauthority cloud storage systems,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1790–1801, 2013.
[82] G. Ren, J. Wu, G. Li, S. Li, and M. Guizani, “Protecting intellectual property with reliable availability of learning models in ai-based cybersecurity services,” IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 2, pp. 600–617, 2022.
[83] G. Ren, G. Li, S. Li, L. Chen, and K. Ren, “Activedaemon: Unconscious dnn dormancy and waking up via user-specific invisible token,” in Network and Distributed System Security Symposium, 2024.
[84] L. Wang, S. Xu, R. Xu, X. Wang, and Q. Zhu, “Non-transferable learning: A new approach for model ownership verification and applicability authorization,” in International Conference on Learning Representations, 2021.
[85] Z. Hong, Z. Wang, L. Shen, Y. Yao, Z. Huang, S. Chen, C. Yang, M. Gong, and T. Liu, “Improving non-transferable representation learning by harnessing content and style,” in International Conference on Learning Representations, 2024.
[86] X. Gong, Z. Wang, S. Li, Y. Chen, and Q. Wang, “A gan-based defense framework against model inversion attacks,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 4475–4487, 2023.
[87] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, pp. 3–54, 2007.
[88] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith, “Composition attacks and auxiliary information in data privacy,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 265–273.
[89] S. Utpala, S. Hooker, and P.-Y. Chen, “Locally differentially private document generation using zero shot prompting,” in Conference on Empirical Methods in Natural Language Processing, 2023.
[90] X. Li, W. Liu, J. Lou, Y. Hong, L. Zhang, Z. Qin, and K. Ren, “Local differentially private heavy hitter detection in data streams with bounded memory,” ACM SIGMOD/PODS International Conference on Management of Data, vol. 2, no. 1, pp. 1–27, 2024.
[91] J. Jordon, J. Yoon, and M. Van Der Schaar, “Pate-gan: Generating synthetic data with differential privacy guarantees,” in International Conference on Learning Representations, 2018.
[92] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 1999, pp. 223–238.
[93] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[94] O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” Journal of Network and Computer Applications, vol. 116, pp. 1–8, 2018.
[95] T. Hanser, E. Ahlberg, A. Amberg, L. T. Anger, C. Barber, R. J. Brennan, A. Brigo, A. Delaunois, S. Glowienke, N. Greene et al., “Data-driven federated learning in drug discovery with knowledge distillation,” Nature Machine Intelligence, pp. 1–14, 2025.
[96] J. Guo, Y. Li, L. Wang, S.-T. Xia, H. Huang, C. Liu, and B. Li, “Domain watermark: Effective and harmless dataset copyright protection is closed at hand,” in Annual Conference on Neural Information Processing Systems, vol. 36, 2024.
[97] C. Wei, Y. Wang, K. Gao, S. Shao, Y. Li, Z. Wang, and Z. Qin, “Pointncbw: Towards dataset ownership verification for point clouds via negative clean-label backdoor watermark,” IEEE Transactions on Information Forensics and Security, 2024.
[98] Y. Li, L. Zhu, X. Jia, Y. Jiang, S.-T. Xia, and X. Cao, “Defending against model stealing via verifying embedded external features,” in AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1464–1472.
[99] S. Shao, W. Yang, H. Gu, Z. Qin, L. Fan, Q. Yang, and K. Ren, “Fedtracker: Furnishing ownership verification and traceability for federated learning model,” IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 1, pp. 114–131, 2024.
[100] X. Zhang and S. Wang, “Fragile watermarking with error-free restoration capability,” IEEE Transactions on Multimedia, vol. 10, no. 8, pp. 1490–1499, 2008.
[101] M. Botta, D. Cavagnino, and R. Esposito, “Neunac: A novel fragile watermarking algorithm for integrity protection of neural networks,” Information Sciences, vol. 576, pp. 228–241, 2021.
[102] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
[103] Y. He, B. Li, L. Liu, Z. Ba, W. Dong, Y. Li, Z. Qin, K. Ren, and C. Chen, “Towards label-only membership inference attack against pre-trained large language models,” in USENIX Security Symposium, 2025.
[104] X. Cao, J. Jia, and N. Z. Gong, “Ipguard: Protecting intellectual property of deep neural networks via fingerprinting the classification boundary,” in ACM Asia Conference on Computer and Communications Security, 2021, pp. 14–25.
[105] D. Pasquini, E. M. Kornaropoulos, and G. Ateniese, “Llmmap: Fingerprinting for large language models,” in USENIX Security Symposium, 2025.
[106] R. C. Merkle, “A digital signature based on a conventional encryption function,” in Conference on the Theory and Application of Cryptographic Techniques. Springer, 1987, pp. 369–378.
[107] R. L. Rivest, “The md4 message digest algorithm,” in International Cryptology Conference. Springer, 1991, pp. 303–311.
[108] E. J. De Aguiar, B. S. Faiçal, B. Krishnamachari, and J. Ueyama, “A survey of blockchain-based strategies for healthcare,” ACM Computing Surveys, vol. 53, no. 2, pp. 1–27, 2020.
[109] X. Guo, M. A. Khalid, I. Domingos, A. L. Michala, M. Adriko, C. Rowel, D. Ajambo, A. Garrett, S. Kar, X. Yan et al., “Smartphone-based dna diagnostics for malaria detection using deep learning for local decision support and blockchain technology for security,” Nature Electronics,