Rethinking Data Protection in the (Generative) AI Era

Abstract— … parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in (generative) AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of, and rigorously enforce, data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy-preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we highlight the urgent need to rethink data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Fig. 1: Data flow across the life-cycle of a (generative) AI model. The schematic traces how different forms of data emerge and circulate from the moment raw samples are collected to the point at which a deployed model generates new content. (i) Data Collection and Curation: Samples, such as images, texts, and audio clips, are gathered and annotated; once aggregated, they form the training dataset that drives model learning and the testing dataset used for validation. (ii) Model Training: These datasets are transformed into model parameters (e.g., weights and biases), turning the well-trained model itself into a valuable, model-centric data asset. (iii) Model Inference: After deployment, users supply inputs or prompts—which may contain private or proprietary information—that the model processes to produce AI-generated content ranging from class labels to code, images, or full documents. Arrows indicate how each artefact (e.g., dataset, model parameters, prompts, and outputs) can be independently copied, released, or shared, underscoring why all of them must be considered within a comprehensive data-protection framework.

The first two authors contributed equally to this paper. This work was partly completed while Yiming Li was a Research Professor at Zhejiang University; he is now with Nanyang Technological University.

I. INTRODUCTION

Artificial Intelligence (AI) has experienced tremendous progress in the last few decades and is widely and successfully deployed in almost all domains, such as identity verification, e-commerce, and healthcare [1, 2, 3, 4]. With the recent rapid development of AI-empowered generative models (e.g., large language models (LLMs) [5] and diffusion models [6]), people can use them to easily generate high-quality images, audio, video, and text (instead of simple predictions). More importantly, these powerful models are close at hand: users can simply exploit them via APIs (e.g., GPT-4 [7] and Midjourney [8]) or even directly download them from open-source communities/platforms (e.g., Hugging Face). Arguably, we have moved into the era of (generative) AI.

In general, the prosperity of AI heavily relies on high-quality data, with which researchers and developers can train, evaluate, and improve their models. For example, advanced LLMs like GPT-4 [7] and DeepSeek [9] required vast, curated datasets from diverse sources, often refined with costly human feedback to ensure quality and alignment. Similarly, specialized medical models like Google's Med-PaLM [10], designed for clinical question answering and summarization, or diagnostic AI systems for tasks like cancer detection from images, relied heavily on large, diverse clinical datasets (e.g., the
Cancer Genome Atlas (TCGA) [11]) meticulously annotated by medical experts, a complex and resource-intensive necessity. In particular, collecting and annotating data remains a significant obstacle for most companies since these processes are time-consuming and expensive [12]. Accordingly, these data are undoubtedly valuable assets to their owners and deserve to be protected.

Data protection has long been a critical area of research due to its significance in safeguarding the legitimate rights of data owners. Various regulations, such as the GDPR [13] and the EU AI Act [14], highlight the importance of data protection. In the past, data typically existed as discrete digital items, whose value was derived largely from their content. For example, it could be digitized artwork, photographs, videos, etc. Accordingly, traditional data protection mainly refers to protecting the content of data from unauthorized use and redistribution, although its specific definition and scope still remain ambiguous to some extent. In practice, data owners would encrypt files [15, 16, 17] before storage or transmission and embed digital watermarks [18, 19, 20] when releasing data publicly or in digital marketplaces.

However, in the AI era, especially with the emergence of generative AI models, the scope of data protection has become far more complex and ambiguous [21, 22, 23]. As shown in Figure 1, data permeates every stage of an AI model's life cycle, making its value increasingly tied to the model rather than just the raw content of the data. For instance, developers compile many individual samples into large training datasets that feed into model development. The trained models themselves then become valuable data assets with significant commercial value. In addition, high-value or sensitive data (e.g., original artworks or personal medical records) may also be incorporated as inputs during a model's inference stage. Besides, with the rise of generative AI models, the outputs of inference are no longer simple predictions – they can be substantial content in their own right. For example, an LLM might generate executable code for a requested function [24], or a diffusion model might produce a realistic image for an advertisement or animation clips [25]. These AI-generated outputs are also valuable forms of data that merit protection.

This ambiguity in scope makes meaningful protection and regulation difficult. For instance, in 2023, Samsung Electronics discovered that employees had inadvertently leaked proprietary source code by inputting it into OpenAI's ChatGPT, prompting it to prevent its staff from using such external generative AI tools on company systems [26]; that same year, Italy's Data Protection Authority (Garante) imposed a nationwide suspension of ChatGPT after a leak of user conversations and allegations that personal data had been ingested for training without a lawful basis [27]. These incidents underscore the urgent need for a systematic understanding of what, precisely, must be protected against the backdrop of blooming AI-integrated applications and data markets.

To tackle this problem, this paper offers the first timely overview and categorization of data protection in the (generative) AI era. Specifically, we introduce a hierarchical taxonomy of data protection comprising four distinct levels: data non-usability, privacy-preservation, traceability, and deletability. Each level in this taxonomy reflects a different balance between how usable the data remains for AI models and the degree of control or protection imposed on that data. At the most stringent end of the spectrum, data non-usability ensures that certain data cannot be used for model training or inference at all, offering maximal protection by completely sacrificing utility. Progressing down the hierarchy, privacy-preservation permits data to be utilized in model development and application while safeguarding sensitive information, a trade-off that maintains some utility but enforces confidentiality of personal or private attributes. Further along, traceability allows nearly full data usage, yet embeds mechanisms to track the data's origin and usage, thereby enabling transparency and accountability (for instance, detecting if data has been misused) with only minimal impact on the data's functionality. Finally, at the most permissive level, data deletability lets data be fully integrated on the condition that its influence can be later removed from the model upon request. This last level emphasizes post-hoc control (aligning with 'right to be forgotten' principles) without impeding initial data utility. In particular, to ground this taxonomy, we systematically review representative technical approaches at each level, highlighting their strengths and limitations in practical settings.

By clearly delineating these four levels, our framework brings much-needed clarity to the often conflated notion of 'data protection' in the (generative) AI era. Researchers and practitioners can now specify whether they aim to prevent any use of certain data, protect privacy during use, ensure traceable usage, or enable later deletion. This structured hierarchy not only highlights the progressive relaxation of restrictions (from strict non-use to full use with after-the-fact removal) but also helps disambiguate the scope of data protective measures in the AI era. Moreover, it provides a structured lens to evaluate existing legal and regulatory instruments: in a later section, we will show how existing national and international policies or regulations align (or fail to align) with each data protection level, illuminating where governance already supports these protective goals and where further action is required.

II. HIERARCHICAL TAXONOMY OF DATA PROTECTION

A. What Data Do We Need to Protect in the AI Era?

In the (generative) AI era, the scope of data protection has expanded significantly, moving far beyond the traditional focus on static data content. Specifically, AI models generate and consume various forms of data throughout their lifecycle, from initial training to final inference. At each stage, different categories of data emerge as assets that warrant protection, whether for reasons of privacy, intellectual property, security, or commercial value. As presented in the previous section, Figure 1 illustrates this lifecycle, where raw samples become training datasets, which in turn yield models; those models are then deployed to handle user prompts and produce AI-generated outputs. Every artifact along this chain, such as the training datasets, the trained model, the user inputs/prompts, and the AI-generated content (AIGC), carries its own significance and sensitivities. Below, we examine why each of these data categories matters and why they must be safeguarded within a comprehensive protection framework.
Training Datasets: In the development phase of a model, large curated datasets serve as the fuel for learning. These collections of samples (images, text, audio, etc.) are often aggregated from diverse sources, which inherently raises the risk of including sensitive personal information or copyrighted material [28]. Protecting training data is therefore crucial for legal and ethical reasons: developers must respect privacy rights and intellectual property (e.g., avoiding unauthorized use of personal photos [29] or scraping of copyrighted text [30]) to comply with regulations and moral norms. Moreover, assembling and labeling high-quality datasets is expensive and time-consuming, making them commercially valuable assets for the organizations that curate them. Companies treat their data as proprietary know-how. For example, the success of ImageNet [31] spurred competitive advantages in computer vision and beyond [1]. If such a dataset were stolen or misused, the original collector could suffer a significant loss of investment and competitive edge. For all these reasons, training data merits strong protection. This includes measures to preserve privacy (e.g., removing or anonymizing personal identifiers [32]) and to enforce rights management, ensuring the data is not redistributed or used beyond its permitted scope [30, 33, 34]. In some cases, dataset owners even embed subtle markers (e.g., watermarks or fingerprints) into the data to enable traceability [35, 36, 37, 38, 39], so that if the data appears in an unauthorized model or repository, it can be identified and linked back to the source. Overall, securing the training dataset is the first pillar of data protection in the AI pipeline, preventing downstream issues that could arise from contaminated or compromised training information.
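To make the idea of embedding subtle markers concrete, the sketch below tags image samples with a least-significant-bit pattern before release. It is a deliberately naive illustration (the function names and the 64-bit tag are ours), far weaker than the backdoor- and fingerprint-based dataset watermarks cited above [30, 35, 36], but it shows how an owner-specific signal can ride along with released data without visibly changing it.

```python
import numpy as np

def embed_lsb_tag(image: np.ndarray, tag_bits: np.ndarray) -> np.ndarray:
    """Embed a short bit string into the least-significant bits of an 8-bit image.
    A toy owner mark: imperceptible, but also easy for an adversary to strip."""
    marked = image.copy().reshape(-1)
    marked[: len(tag_bits)] = (marked[: len(tag_bits)] & 0xFE) | tag_bits
    return marked.reshape(image.shape)

def read_lsb_tag(image: np.ndarray, n_bits: int) -> np.ndarray:
    return image.reshape(-1)[:n_bits] & 1

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # stand-in sample
tag = rng.integers(0, 2, size=64, dtype=np.uint8)              # owner-specific bits
marked = embed_lsb_tag(img, tag)
assert np.array_equal(read_lsb_tag(marked, 64), tag)           # tag is recoverable
assert np.max(np.abs(marked.astype(int) - img.astype(int))) <= 1  # visually identical
```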
Trained Models: Once an AI model has been trained, the model itself, encompassing its architectural configuration and learned parameters, becomes a model-centric data asset of immense value. Unlike raw training datasets, a trained model encapsulates generalizations drawn from potentially vast training data [6, 40, 41]. In effect, it is a compressed repository of that data's information. This gives the model significant commercial and strategic value. Organizations invest heavily in developing high-performing models, and the resulting structure and weights are often regarded as trade secrets or key intellectual property. For example, the parameters of a state-of-the-art language model or image recognition network can confer a competitive edge, making the model file itself as sensitive as any proprietary dataset. Protecting this trained model data is therefore paramount – if it is exposed or stolen, an adversary or competitor could reuse it, undermining the original owner's investment and advantage [42, 43, 44]. Accordingly, the trained model must be safeguarded much like any confidential dataset in the AI era, especially to preserve the commercial integrity of the model as a proprietary asset.

Deployment-integrated Data: Beyond the model's learned parameters, modern AI deployments usually incorporate additional auxiliary data that plays a crucial role in shaping their inference performance. These data are introduced at the deployment or runtime stage (after model training), and while not part of the model's weights, they effectively become extensions of the model's knowledge and policy. Two prominent examples are system prompts [45, 46] and external knowledge bases [47, 48] used in conversational AI and retrieval-augmented generation (RAG). Such deployment-integrated data elements are often invisible to end-users but are pivotal in determining how the model responds to inputs. Importantly, they may embed sensitive or proprietary information, and their compromise can be just as damaging as a leak of the model itself. Even though this data is not 'learned' during training, it must be protected because it directly influences the model's outputs and can inadvertently reveal protected information if misused. Specifically, system prompts are predefined directives or contexts given to a model at inference time, especially in large language model (LLM) deployments. For instance, a ChatGPT-like assistant might have a hidden prompt saying: 'You are an expert medical assistant. Always answer with evidence-based information and in a reassuring tone.' This prompt is not part of the model's parameters but is provided by the developers to guide the model's behavior and set boundaries on its responses. System prompts help ensure consistency, align the model with ethical or style guidelines, and can embed institutional knowledge and policies, or even achieve differentiated services through carefully designed prompts. Because they often encode rules and content that the provider considers sensitive (including possibly proprietary instructions or content examples), system prompts are sensitive deployment data [49, 50]. If an adversary were to discover the exact contents of these prompts, they might exploit them (e.g., by crafting inputs that override and manipulate the system instructions, or by developing competitive applications from the illegally acquired system prompts). External knowledge bases are specialized repositories of curated information, integrated at inference time to enhance the capability of AI models (especially LLMs) through a mechanism known as RAG. Unlike system prompts, external knowledge bases are extensive collections of documents or structured data that models dynamically retrieve and incorporate into their reasoning process to produce accurate, timely, and domain-specific responses. For example, medical assistants powered by retrieval-augmented large language models (RA-LLMs) might access confidential diagnostic records to inform clinical decisions, while financial agents leverage internal market databases for precise forecasting. Although external knowledge bases are not part of the trained model parameters, their content may be highly sensitive, often comprising proprietary or confidential information crucial to an organization's operational advantage [51]. Together, these examples highlight that deployment-integrated data, exemplified by system prompts and external knowledge bases, represents a critical yet often overlooked class of data assets whose protection is also indispensable in today's (generative) AI era.
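The sketch below illustrates why system prompts and external knowledge bases are sensitive even though they are not model weights: both end up inside the text the model actually consumes at inference time. The system prompt, the toy knowledge base, the string-similarity retriever, and the prompt format are all hypothetical stand-ins (real RAG pipelines use vector embeddings and a production retriever), but the assembly step is representative.

```python
from difflib import SequenceMatcher

# Hypothetical deployment-side assets: a hidden system prompt and a small
# private knowledge base. Neither is part of the model's weights, yet both
# become part of the input the model actually sees.
SYSTEM_PROMPT = "You are an expert medical assistant. Answer with evidence-based information."
KNOWLEDGE_BASE = [
    "Patient cohort A: imaging follow-up every 6 months.",
    "Internal guideline: flag fasting glucose above 126 mg/dL for review.",
    "Billing code mapping for outpatient consultations.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Toy retriever: rank documents by string similarity to the query.
    scored = sorted(
        docs,
        key=lambda d: SequenceMatcher(None, query.lower(), d.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_model_input(user_prompt: str) -> str:
    context = "\n".join(retrieve(user_prompt, KNOWLEDGE_BASE))
    # System prompt + retrieved snippets + user prompt form the full input;
    # leaking any part of it leaks deployment-integrated data.
    return f"[SYSTEM]\n{SYSTEM_PROMPT}\n[CONTEXT]\n{context}\n[USER]\n{user_prompt}"

print(build_model_input("What glucose level should be flagged?"))
```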
User's Input: When a model is deployed, new data enters the picture: the inputs (especially prompts) supplied by users during inference. These inputs can be as trivial as a search query or as sensitive as a detailed medical history or proprietary source code, depending on the application [21, 52]. In the AI era, particularly with the rise of accessible generative AI chatbots and assistants, users routinely provide personal or confidential data to AI systems in exchange for tailored outputs. It is imperative to protect this prompt data for privacy, security, and ethical reasons. From a privacy standpoint, any personal information in a user's query (names, addresses, health details, etc.) should be handled in compliance with data protection laws and the user's expectations of privacy. There have already been real-world incidents underscoring this need: for example, in 2023, Italy temporarily banned ChatGPT over concerns that the platform was not adequately protecting user-provided personal data [27]. Commercial confidentiality is equally at stake – consider an employee who uses an AI coding assistant and enters proprietary code as a prompt. If the AI service retains this input, it could lead to an unintended leak of trade secrets. This scenario is not hypothetical: employees at Samsung accidentally disclosed confidential source code and meeting notes by submitting them to ChatGPT, which retained those prompts on its servers [26]. To address such issues, techniques like robust access control [53, 54] and privacy guarantees [55, 56] must be in place at the inference stage. Ethically, users should have transparency and agency regarding their inputs – they should know if prompts will be logged or used for training, and ideally have the right to deletion (aligning with the 'right to be forgotten' in privacy regulations). Protecting users' input data not only complies with privacy laws but also builds trust. If users fear their prompts might be misused or leaked, they will be reluctant to adopt AI solutions, limiting the technology's benefits. Thus, safeguarding users' input is now a fundamental component of data protection in AI, aimed at preserving individual privacy and maintaining confidentiality in AI services [57].
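One practical, if partial, safeguard at this stage is to scrub obvious identifiers from prompts before they leave the user's environment. The sketch below is a minimal illustration with hand-written regular expressions; the patterns and placeholder labels are our own assumptions and would miss many identifier formats. Production systems pair dedicated PII detectors with the access-control and inference-time privacy techniques cited above [53, 54, 55, 56].

```python
import re

# Minimal client-side redaction sketch: strip obvious personal identifiers
# from a prompt before it is sent to an external AI service.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(prompt: str) -> str:
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"<{label.upper()}>", prompt)
    return prompt

raw = "Patient John, reachable at john.doe@example.com or +1 415 555 0101, SSN 123-45-6789."
print(redact(raw))
# -> Patient John, reachable at <EMAIL> or <PHONE>, SSN <SSN>.
```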
AI-generated Content (AIGC): The final category of protected data arises from the model's own outputs. In particular, instead of simple numbers, modern (generative) AI systems can produce rich content like paragraphs of text, realistic images, and code snippets [58, 59]. These AIGCs have already become valuable digital objects [60, 61]. While the standalone content of AIGC has inherent protection needs related to intellectual property, ownership, and potential sensitivities [62, 63, 64, 65], our primary focus here aligns with the model-centric perspective: protecting AIGC in its role as a data asset within the (generative) AI ecosystem. Given its high fidelity and utility, AIGC is increasingly leveraged not just as a final product, but also as data that feeds back into the AI cycle. For example, AIGC is valuable for creating large-scale synthetic datasets, for knowledge distillation [66], or as deployment-integrated data (e.g., instances used in retrieval-augmented generation). Protecting AIGC in this capacity is therefore crucial. This can involve ensuring traceability to understand its provenance if used for training [37], or employing mechanisms akin to non-usability or access control to prevent unauthorized reuse for training competing models. Our framework thus emphasizes the governance needed when this generated content itself becomes data for subsequent model training or inference, highlighting its flow within the broader (generative) AI model's lifecycle.

In conclusion, data protection in the (generative) AI era must extend across the model's entire lifecycle. From the raw training dataset, to the trained model, to the prompts it processes and the content it generates, each component contains information that could be sensitive, proprietary, or otherwise regulated. Notably, each type of data can be copied or transmitted independently – one can leak a dataset, steal a model's weights, expose a user's prompt, or misappropriate an AI output, which is why all of them must be considered in a holistic protection strategy. By clearly identifying these categories, we can align specific protection goals and techniques to each: e.g., privacy-preservation for personal data in training sets and prompts, traceability mechanisms for outputs, and so forth. The following sections will build on this lifecycle view to explore how a hierarchical taxonomy can collectively safeguard the myriad data assets in the AI era, and how emerging data protection techniques map onto each protection level.

B. Towards the Hierarchical Taxonomy of Data Protection

Taxonomy Overview. AI's data-protection challenges span a spectrum from extremely strict control of data to more permissive use with after-the-fact safeguards. To make sense of this spectrum, we propose a four-level hierarchical taxonomy of data protection: data non-usability, data privacy-preservation, data traceability, and data deletability. Each successive level in this hierarchy relaxes the protections on data slightly, trading off some degree of control for greater data utility. At the highest, most restrictive end, data non-usability ensures that certain data cannot be used for model training or inference at all, thereby offering maximum protection by completely sacrificing that data's utility. Stepping down one level, data privacy-preservation permits data to be employed in model development or inference while safeguarding sensitive information – a compromise that preserves some utility but enforces confidentiality of personal or private attributes. Next, data traceability allows nearly full use of data for AI models, yet embeds mechanisms to track the data's origin, usage, and modifications (e.g., to detect if data has been misused), thereby enabling transparency and accountability with only minimal impact on the data's functionality. Finally, at the most permissive level, data deletability imposes nearly no restriction on a dataset's initial use for training and inference, instead requiring that the data's influence can later be removed from the model upon the user's request. This last level emphasizes post-hoc control (aligning with 'right to be forgotten' principles) without impeding the data's immediate usefulness. Figure 2 illustrates this hierarchy of protection levels, which forms a clear gradient from strong protection/low utility at Level 1 to low protection/high utility at Level 4.

Level 1: Data Non-usability. Data non-usability encompasses methods that intentionally render certain data entirely useless for AI applications, including training and inference, even if that data is publicly available. In essence, it ensures that specified data cannot contribute to model learning or prediction whatsoever. This is crucial in scenarios where individuals or organizations demand strict control over how their data is utilized by AI systems. For instance, authors and journalists have voiced objections to their articles or books being used to train language models without consent [67]; similarly, visual artists often share their works online but may strongly oppose using AI models to transfer their style to others
during inference [68]. By completely precluding any use of the data in model development, data non-usability offers the most stringent level of protection in our taxonomy – achieving maximal data control at the expense of all potential utility.

Fig. 2: Hierarchical taxonomy of data protection in the (generative) AI era. This taxonomy comprises four distinct protection levels, each representing a trade-off between data usability and the degree of protection provided. At the most stringent level, data non-usability completely restricts the use of specific data in model training and inference, thus offering maximal protection at the cost of total data utility. The next level, data privacy-preservation, allows data use under stringent privacy safeguards, enabling some practical utility while protecting sensitive or private attributes. Moving further, data traceability permits extensive data usage but integrates methods to track data origins and modifications, supporting transparency and accountability with minimal functional interference. At the most permissive level, data deletability places no initial restriction on data usage but mandates mechanisms for fully removing data's influence from trained models post hoc, aligned with principles such as the 'right to be forgotten'. This hierarchical taxonomy helps disambiguate the scope of data protective measures and provides a structured lens to evaluate and further design related regulations in protecting data in the (generative) AI era.

Level 2: Data Privacy-preservation. Data privacy-preservation focuses on protecting sensitive information within data while still allowing the data to be used for developing AI models or producing meaningful responses/inferences [69, 70]. This approach is especially critical in sectors like healthcare, social media, and online services—domains where large volumes of personal data (e.g., age, gender, location, or purchasing behavior) are routinely collected and analyzed [71]. For instance, a hospital or research institute might analyze patient records to train a disease-detection model, but it must do so without exposing any individual's identity or private details. Users likewise do not want to leak private information through the prompts they submit when chatting with AI chatbots. Ensuring privacy is not only a legal obligation for data handlers, but also a crucial measure to prevent misuse of personal information and to maintain public trust in AI-driven technologies and applications. In practice, privacy-preserving measures mean that a significant portion of each dataset (namely, the privacy-sensitive attributes) is withheld, masked, or otherwise not directly accessible during training or inference [72]. Consequently, data privacy-preservation still represents a high level of protection for the data, second only to complete non-usability in its restrictiveness, while enabling much more data utility than the latter.

Level 3: Data Traceability. Data traceability refers to the ability to track the origin, history, and influence of data as it is used in AI applications during training and inference. This capability allows stakeholders to audit and verify data usage. For example, an individual might want to check whether their personal data was incorporated into a model for training or generating works of art without permission, and a model developer might need to detect if a training dataset or a pre-trained model has been tampered with or misused, and to avoid potential backdoors in them [73, 74, 75]. By enabling such oversight, traceability measures greatly enhance transparency and accountability in how data fuels AI systems. Importantly, implementing traceability need not significantly hinder the data's usefulness for modeling: the data remains almost fully available for training or inference, with at most slight modifications introduced to embed identifiers (e.g., imperceptible watermarks or metadata tags) that enable later tracking [37, 76]. Thus, data traceability provides a more moderate level of protection – less restrictive than privacy-preservation since it leaves the data content largely intact, but still offers an important safeguard through post hoc auditability.
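As a toy illustration of the kind of audit Level 3 enables, the sketch below applies a loss-threshold heuristic in the spirit of membership inference (treated more rigorously in the approaches discussed in Section III): samples on which a model is unusually confident are flagged as likely training members. The synthetic data, the scikit-learn classifier, and the threshold choice are illustrative stand-ins, and this heuristic is far less reliable than the watermark- or tag-based tracking cited above [37, 76].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 'members' were used for training, 'outsiders' were not.
X_members = rng.normal(0, 1, size=(200, 10))
y_members = (X_members[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_outside = rng.normal(0, 1, size=(200, 10))
y_outside = (X_outside[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_members, y_members)

def per_sample_loss(model, X, y):
    # Cross-entropy of the correct label for each sample.
    p = np.clip(model.predict_proba(X)[np.arange(len(y)), y], 1e-12, 1.0)
    return -np.log(p)

# Heuristic audit: unusually low loss suggests the sample was in the training set.
threshold = np.median(per_sample_loss(model, X_outside, y_outside))
flagged = per_sample_loss(model, X_members, y_members) < threshold
print(f"{flagged.mean():.0%} of true training samples flagged as likely members")
```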
Level 4: Data Deletability. Data deletability is the capacity to completely remove a specific piece of data and its influence from a trained (AI) model. While deleting a data file from a storage database is trivial, eliminating that data's imprint on an AI model is a far more challenging task [77]. This level of protection ensures that if a particular data sample must be purged – for example, because it is no longer needed or because the individual who provided the data withdraws consent – there is a mechanism to do so cleanly and effectively. Such capability is particularly pertinent to user rights and data governance frameworks (e.g., complying with the 'right to be forgotten' in GDPR regulations [13]). Notably, enabling deletability does not require compromising the data's utility during initial model training; the data can be used to its full extent upfront, and the protective measure comes into play only later, if and when deletion is required. Because this approach imposes no upfront usage restrictions, it offers the lowest immediate level of protection among the four levels – instead, its strength lies in allowing retrospective removal. In summary, data deletability prioritizes giving data owners ultimate control after model development, even though it provides only minimal protection at the time of data use.

III. TECHNIQUES FOR DATA PROTECTION

To translate the conceptual taxonomy of data protection into practice, this section briefly describes a range of design principles and corresponding representative techniques tailored to the four protection levels introduced above. Figure 3 illustrates the design principles of techniques for each level.

Fig. 3: Design principles of techniques for each level. Level 1. Non-usability: Encryption and (fine-grained) authorization confine direct data access solely to authorized parties, while techniques such as unlearnable examples and non-transferable learning disable data exploitation in unauthorized domains by mitigating particular data features, thereby achieving non-usability indirectly; Level 2. Privacy-preservation: These techniques generally fall into two main categories: tampering-based and non-tampering-based methods. The former perturbs private portions of the data (occasionally at the cost of tampering with some non-private content), whereas the latter prevents direct access without data modification while preserving data utility; Level 3. Traceability: Traceability techniques intrusively attach ownership signals (i.e., watermarks) to original data or directly infer provenance and potential modifications non-intrusively by analyzing data's intrinsic information; Level 4. Deletability: The influence of protected data (denoted by the purple circle in the sub-figure) can be removed either by excising the data and rebuilding the AI model from scratch to directly change the decision surface (marked by the black dotted line) or, more efficiently, by targeted unlearning that erases its influence (on the surface) without full model reconstruction, thereby ensuring data deletability.

Techniques for Non-usability. Non-usability encompasses strategies that block any unauthorized party from using or even accessing protected data. Arguably, the most direct method is encryption [78, 79, 80]: by securing data with strong cryptographic keys, the information remains unintelligible without proper authorization. A complementary line of defense ensures that the data cannot be exploited even if an adversary obtains it. For example, authorization mechanisms, including fine-grained data-access control [53, 54, 81] and model-level authorization [82, 83], allow only approved entities to obtain (correct) model outputs. Unauthorized requests receive degraded or nonsensical responses. Beyond controlling access or general utility, a further class of techniques makes the data unusable in unauthorized domains: unlearnable examples [33, 68] embed imperceptible perturbations that frustrate a model's ability to extract 'task-relevant' features, whereas non-transferable learning [84, 85] deliberately suppresses 'task-agnostic' features so that any knowledge gleaned cannot generalize to unintended tasks. Together, these techniques align with a 'secure-by-design' philosophy: the data would be essentially non-usable and remain protected even in worst-case scenarios.
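To illustrate the unlearnable-examples idea named above, the sketch below optimizes an error-minimizing perturbation against a surrogate model so that the released samples carry almost no learnable signal. It follows the spirit of [33] only loosely: a faithful implementation alternates surrogate and noise updates and works on real images, and the surrogate architecture, perturbation budget, and step counts here are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(64, 3 * 32 * 32)           # stand-in images (flattened)
y = torch.randint(0, 10, (64,))           # stand-in labels
surrogate = nn.Sequential(nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))

eps = 8 / 255                              # keep the perturbation imperceptible
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.SGD([delta], lr=0.1)

# Error-minimizing noise: optimize the perturbation so the surrogate's loss on
# (x + delta, y) becomes trivially small, leaving little useful signal to learn.
for _ in range(50):
    loss = nn.functional.cross_entropy(surrogate((x + delta).clamp(0, 1)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    delta.data.clamp_(-eps, eps)

protected = (x + delta.detach()).clamp(0, 1)   # release this version instead of x
```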
Techniques for Privacy-preservation. Privacy-preservation techniques enable the beneficial use of data for AI development while shielding sensitive information. They fall into tampering-based and non-tampering-based categories. In tampering-based approaches, the data themselves (at least their private components) are modified so that private/sensitive attributes become indistinguishable or masked [86]. For example, early schemes such as k-anonymity and L-diversity generalize or suppress identifying details [32, 87], though they can reduce utility and remain vulnerable to linkage attacks [88]; differential privacy [69, 89] provides a stronger guarantee by injecting carefully calibrated noise into data, intermediate computations, or outputs [29, 90]. The added randomness masks each individual's contribution while remaining versatile enough for model training, synthetic-data generation, and attack mitigation [22, 91]. In contrast, non-tampering-based techniques avoid modifying raw data, seeking privacy protection with minimal impact on data utility. For example, homomorphic encryption [16, 92] enables computations on encrypted inputs, eliminating exposure during processing; privacy-preserving distributed learning, such as federated learning [93] and split learning [94], keeps data local while sharing only aggregated model updates [93, 95]. In this way, global models benefit from diverse datasets without centralizing sensitive records.
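As a concrete example of calibrated noise injection, the sketch below releases a differentially private mean with the Laplace mechanism [69]: values are clipped to bound the sensitivity of the statistic, and the noise scale is the sensitivity divided by the privacy budget epsilon. The data and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)   # toy sensitive attribute

def dp_mean(values, lower, upper, epsilon, rng):
    """Release a differentially private mean via the Laplace mechanism.
    Clipping bounds the sensitivity of the average to (upper - lower) / n."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean :", ages.mean())
print("eps = 1.0 :", dp_mean(ages, 18, 90, 1.0, rng))
print("eps = 0.1 :", dp_mean(ages, 18, 90, 0.1, rng))  # stronger privacy, more noise
```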
Techniques for Traceability. Traceability seeks to record and verify where data (including models) originate, how they are used, and whether they have been altered. Existing approaches can be broadly categorized into intrusive and non-intrusive methods. Intrusive methods embed explicit and external identifiers (dubbed 'watermarks') into the data asset [96, 97]. For example, digital watermarking adds hidden yet robust signatures to datasets [30, 35], model parameters [98, 99], or prompts [49]. These robust watermarks survive ordinary use and prove ownership. In contrast, fragile watermarks [100, 101] are deliberately brittle, breaking when tampered with and thereby signaling potential modification. Non-intrusive methods, on the other hand, enable traceability by analyzing a data asset's intrinsic information or detecting modifications without altering the underlying asset. For example, membership inference [102, 103] evaluates whether a data point was in a model's training set; model fingerprinting [104, 105] probes a model with crafted inputs to reveal its identity; cryptographic hashing [106, 107] produces unique fingerprints that change upon any bit-level alteration, while blockchain ledgers [108, 109] maintain an immutable, time-stamped record of data states, making secret edits computationally infeasible.
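The sketch below illustrates the non-intrusive end of this spectrum: recording SHA-256 fingerprints of pipeline artifacts in an append-only log so that any later bit-level modification is detectable. The artifact names and the in-memory list standing in for a ledger are illustrative; a deployment would hash real files and anchor the records in tamper-evident storage such as the blockchain ledgers cited above [108, 109].

```python
import hashlib
import json
import time

def fingerprint(payload: bytes) -> str:
    """SHA-256 digest: any bit-level change to the artifact changes the digest."""
    return hashlib.sha256(payload).hexdigest()

# Fingerprint two artifacts of the AI pipeline: a (toy) dataset file and a
# serialized model. In practice these would be real file contents.
dataset_bytes = b"sample-001,label-A\nsample-002,label-B\n"
model_bytes = b"\x00" * 1024  # stand-in for serialized weights

ledger = []  # append-only log; a blockchain ledger would make it immutable
for name, blob in [("dataset-v1", dataset_bytes), ("model-v1", model_bytes)]:
    ledger.append({"artifact": name, "sha256": fingerprint(blob), "timestamp": time.time()})

print(json.dumps(ledger, indent=2))

# Later audit: recompute and compare. A mismatch reveals undisclosed modification.
tampered = dataset_bytes.replace(b"label-A", b"label-C")
assert fingerprint(tampered) != ledger[0]["sha256"]
```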
Techniques for Deletability. Ensuring that specific data and their influence can be removed from AI models underpins rights such as GDPR's 'right to be forgotten'. The most straightforward, yet costly, route is to directly delete the data and rebuild the AI model from scratch [77, 110]. A more efficient alternative is offered by unlearning techniques that focus on erasing the data's influence rather than merely its stored content. These algorithms aim to approximate the model state that would have arisen had the targeted data never been used, thereby avoiding the significant expense and time required for complete retraining or rebuilding [111, 112, 113].
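The sketch below shows the retraining route in a cheaper, sharded form (in the spirit of sharded exact-unlearning schemes): each sub-model sees only one disjoint shard, so honoring a deletion request means retraining a single shard rather than the whole ensemble. The data, shard count, and scikit-learn classifier are illustrative stand-ins, and the approximate unlearning algorithms cited above [111, 112, 113] avoid even this partial retraining.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data into disjoint shards and train one sub-model per shard.
n_shards = 5
shards = np.array_split(np.arange(len(X)), n_shards)
models = [LogisticRegression().fit(X[idx], y[idx]) for idx in shards]

def predict(x):
    # Majority vote over the shard models.
    votes = np.stack([m.predict(x) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

def delete(sample_id):
    """Honor a deletion request by retraining only the shard that held the sample."""
    for s, idx in enumerate(shards):
        if sample_id in idx:
            keep = idx[idx != sample_id]
            shards[s] = keep
            models[s] = LogisticRegression().fit(X[keep], y[keep])
            return s

print("predictions before deletion:", predict(X[:3]))
affected = delete(sample_id=42)
print(f"retrained shard {affected}; the other {n_shards - 1} shards were untouched")
```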
IV. REGULATIONS ON DATA PROTECTION

In the era of (generative) AI, regulation plays a foundational role in safeguarding data integrity, privacy, and accountability. Unlike traditional data systems, where protection focuses on static storage and access control, AI systems rely on dynamic, model-centric data use: once data is absorbed into a model's parameters, it may persist, influence downstream outputs, and defy straightforward removal. Legal frameworks thus serve as critical instruments to constrain unauthorized use, enforce privacy-preservation, ensure traceability, and empower users with redress and deletion rights. As shown in Table I, there are already some pioneering related regulations. For instance, many privacy laws operationalize L1 (Non-usability) by prohibiting the use of sensitive or unlawfully collected data for AI training altogether. Similarly, L2 (Privacy-preservation) is widely mandated through consent requirements, anonymization, and processing limits. Emerging regulations now also touch on L3 (Traceability)—requiring documentation of data provenance and logging of model operations—and even aspire to L4 (Deletability), allowing individuals to remove their data's influence post-training. As the diffusion of data across AI pipelines complicates direct user control, regulation remains the strongest binding force for aligning model development with ethical and societal norms.

TABLE I: Representative regulations of data protection in the (generative) AI era. The last column shows the levels of data protection covered by each regulation (N: non-usability, P: privacy-preservation, T: traceability, D: deletability).

Country/Region | Regulation Name | Year | Protection Level(s)
USA | California Consumer Privacy Act [114] | 2018 | N, P, T, D
USA | Federal Zero Trust Data Security Guide [115] | 2024 | N, P, T
EU | General Data Protection Regulation [13] | 2016 | N, P, T, D
EU | Ethics Guidelines for Trustworthy AI [116] | 2019 | N, P, T
EU | EU AI Act [14] | 2024 | N, P, T, D
EU | General-Purpose AI Code of Practice (Draft) [117] | 2025 | N, P, T
China | Cybersecurity Law of the PRC [118] | 2016 | N, P, T
China | Data Security Law of the PRC [119] | 2021 | N, P, T
China | Personal Information Protection Law of the PRC [120] | 2021 | N, P, T
China | Administrative Measures for Generative Artificial Intelligence Services [121] | 2023 | N, P, T
China | Action Plan of the Development of Trustworthy Data Space [122] | 2024 | N, P, T
China | Implementation Plan on Improving Data Circulation Security Governance to Better Promote the Marketization and Valorization of Data Elements [123] | 2025 | N, P, T
China | Methods for Identifying Synthetic Content Generated by Artificial Intelligence [124] | 2025 | T
Others | Artificial Intelligence Mission Austria 2030 [125] | 2019 | N, P, T
Others | Artificial Intelligence and Data Act [126] | 2022 | P, T
Others | Brazilian AI Regulation [127] | 2023 | N, P, T
Others | Enhancing Access to and Sharing of Data in the Age of Artificial Intelligence [128] | 2025 | N, P
Others | Joint Statement on Building Trustworthy Data Governance Frameworks to Encourage Development of Innovative and Privacy-Protective AI [129] | 2025 | N, P, T

Globally, several regulatory regimes have responded to this challenge with varying scope and emphasis. The European Union's General Data Protection Regulation (GDPR) remains the archetype of a rights-based data framework, offering expansive protections including the 'right to erasure' and strict processing limitations on personal data [13]. These provisions collectively enforce L1–L4 protections robustly and are often
on protection because establishing fundamental controls over data usage and provenance is often a prerequisite for tackling complex safety issues effectively. Nonetheless, recognizing the deep interplay is essential. Robust data protection mechanisms, particularly those ensuring traceability and controlled access, provide the foundational transparency and oversight needed to audit systems for safety concerns, attribute harmful outputs, and enforce safety-related policies. Future frameworks must holistically consider both aspects to build truly trustworthy (generative) AI models and systems.

B. Emerging Challenges brought by AIGC in Data Protection

The rise of AI-generated content (AIGC) powered by generative models introduces profound new challenges in data protection. In particular, many existing legal systems, including those in the US and EU, struggle to grant copyright protection to purely AI-generated content because it often lacks the requisite human authorship [62]. This leaves the ownership and copyrights associated with vast amounts of potentially valuable AIGC in a state of ambiguity. Who owns the novel image created by a diffusion model, or the code snippet generated by an LLM?

Rather than treating AIGC purely as content itself, our model-centric data protection perspective highlights further complexities. When AIGC is itself used as data – for instance, synthetic data for training new models, knowledge distillation [140], or as input for retrieval-augmented generation systems – its copyright status becomes even more convoluted. Does the copyright (or lack thereof) of the original data used to train the generative model influence the status of the synthetic data? If a model distills knowledge from copyrighted data, does the resulting trained model (as a compact representation of information contained in these data) or the data it generates inherit restrictions? This debate touches upon the core definition of data rights: Are they solely tied to the direct expression of content, or do they extend to the statistical patterns, styles, and knowledge implicitly captured and transferable by a model [68]? The potential for AI models (especially generative ones) to launder copyrighted information into seemingly novel, unprotected AIGC is a significant concern.

Even amidst this legal uncertainty, our proposed data protection framework offers valuable tools. L3 (Traceability), through techniques like watermarking or fingerprinting [49, 60], can help establish the provenance of AIGC, potentially linking it back to specific models or even training datasets. This provides crucial evidence for detecting plagiarism or unauthorized use of protected styles or content, even if the AIGC itself isn't copyrightable [37]. Furthermore, L1 (Non-usability) techniques, such as data cloaking methods designed to disrupt style mimicry [68], offer technical safeguards for creators where legal protections are currently inadequate. These techniques and tools allow stakeholders to exert a degree of control over how their data or derived AI models influence future generations, shifting focus from solely legal ownership to technical prevention of undesired use.

Ultimately, these complex questions surrounding AIGC and copyright require urgent attention from policymakers and legal scholars. Future legislation must clarify the status of AIGC, define the boundaries of rights associated with training data and model-derived knowledge, and establish clear rules for the use and attribution of generated content. A specific protection framework like ours can inform these developments by highlighting what we need to protect and even what is technically feasible in terms of control and transparency.

C. Challenges of Cross-Jurisdictional Data Protection

The inherently global nature of the AI ecosystem presents significant hurdles for consistent data protection. The lifecycle of AI models, from data collection via web scraping or distributed sensors, annotation by global crowdsourcing platforms, training on cloud servers located potentially anywhere, to deployment for a worldwide user base, routinely crosses multiple national borders. This immediately runs into the fragmented and vague landscape of international data protection regulations.

Currently, different jurisdictions take markedly different approaches. The European Union's GDPR [13] imposes strict, rights-based obligations with extra-territorial reach. The US employs a sectoral approach supplemented by state-level laws like the CCPA [114]. China's PIPL [120] emphasizes state oversight and data localization requirements. Other regions may have nascent or less comprehensive regulations [128]. This regulatory patchwork creates significant compliance challenges for developers and opens avenues for exploitation. For example, data scraped in a jurisdiction with lenient rules might be used to train an AI model deployed in a region with strict privacy laws, creating legal jeopardy. Conversely, malicious actors might deliberately host AI models trained on improperly acquired data in jurisdictions with weak enforcement capabilities, undermining protection efforts globally.

Addressing these cross-jurisdictional challenges requires multifaceted solutions. On the policy front, greater international cooperation towards regulatory harmonization or establishing common minimum standards (perhaps through bodies like the OECD or UN initiatives) is desirable, although politically complex [128]. Interoperability frameworks that allow different regulatory systems to recognize and interact with each other could offer a more pragmatic path than full unification. From a technical perspective, one approach is to adopt the strictest standard (e.g., GDPR compliance) globally, but this often imposes excessive costs and sacrifices utility unnecessarily in many contexts. A more promising direction lies in developing adaptive data protection techniques. Future systems could potentially leverage context-aware mechanisms, perhaps inspired by meta-learning or zero-shot adaptation principles [141], to dynamically adjust protection levels (e.g., the type of watermarking, the rigor of privacy mechanisms, the implementation of deletion) based on the legal requirements of the data's origin, the user's location, or the operational jurisdiction. However, realizing such adaptive systems effectively still requires clear regulatory signaling and international collaboration on technical standards. Arguably, our hierarchical taxonomy can serve as a foundational conceptual framework – a common language – to facilitate these multi-stakeholder discussions, allowing different jurisdictions to map their specific requirements onto shared levels of protection, thereby aiding both policy alignment and the development of interoperable technical solutions.
[20] S. Ranjbar Alvar, M. Akbari, D. Yue, and Y. Zhang, "Nft-based data marketplace with digital watermarking," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2023, pp. 4756–4767.
[21] M. Rigaki and S. Garcia, "A survey of privacy attacks in machine learning," ACM Computing Surveys, vol. 56, no. 4, pp. 1–34, 2023.
[22] A. Ziller, T. T. Mueller, S. Stieger, L. F. Feiner, J. Brandt, R. Braren, D. Rueckert, and G. Kaissis, "Reconciling privacy and accuracy in ai for medical imaging," Nature Machine Intelligence, vol. 6, no. 7, pp. 764–774, 2024.
[23] S. Mittal, K. Thakral, R. Singh, M. Vatsa, T. Glaser, C. Canton Ferrer, and T. Hassner, "On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare," Nature Machine Intelligence, vol. 6, no. 8, pp. 936–949, 2024.
[24] Y. He, H. She, X. Qian, X. Zheng, Z. Chen, Z. Qin, and L. Cavallaro, "On benchmarking code llms for android malware analysis," in ACM SIGSOFT International Symposium on Software Testing and Analysis Workshop, 2025.
[25] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Annual Conference on Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851.
[26] S. Ray, "Samsung bans chatgpt among employees after sensitive code leak," 2023. [Online]. Available: https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-chatbots-for-employees-after-sensitive-code-leak/
[27] A. Satariano, "Chatgpt is banned in italy over privacy concerns," 2023. [Online]. Available: https://www.nytimes.com/2023/03/31/technology/chatgpt-italy-ban.html
[28] S. Longpre, N. Singh, M. Cherep, K. Tiwary, J. Materzynska, W. Brannon, R. Mahari, N. Obeng-Marnu, M. Dey, M. Hamdy et al., "Bridging the data provenance gap across text, speech and video," in International Conference on Learning Representations, 2025.
[29] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
[30] Y. Li, M. Zhu, X. Yang, Y. Jiang, T. Wei, and S.-T. Xia, "Black-box dataset ownership verification via backdoor watermarking," IEEE Transactions on Information Forensics and Security, vol. 18, pp. 2318–2332, 2023.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
[32] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[33] H. Huang, X. Ma, S. M. Erfani, J. Bailey, and Y. Wang, "Unlearnable examples: Making personal data unexploitable," in International Conference on Learning Representations, 2021.
[34] L. Du, X. Zhou, M. Chen, C. Zhang, Z. Su, P. Cheng, J. Chen, and Z. Zhang, "Sok: Dataset copyright auditing in machine learning systems," in IEEE Symposium on Security and Privacy, 2025.
[35] Y. Li, Y. Bai, Y. Jiang, Y. Yang, S.-T. Xia, and B. Li, "Untargeted backdoor watermark: Towards harmless and stealthy dataset copyright protection," in Annual Conference on Neural Information Processing Systems, vol. 35, 2022, pp. 13238–13250.
[36] J. Guo, Y. Li, R. Chen, Y. Wu, C. Liu, and H. Huang, "Zeromark: Towards dataset ownership verification without disclosing watermarks," in Annual Conference on Neural Information Processing Systems, vol. 37, 2024, pp. 120468–120500.
[37] B. Li, Y. Wei, Y. Fu, Z. Wang, Y. Li, J. Zhang, R. Wang, and T. Zhang, "Towards reliable verification of unauthorized data usage in personalized text-to-image diffusion models," in IEEE Symposium on Security and Privacy, 2025.
[38] C. Zhu, J. Galjaard, P.-Y. Chen, and L. Chen, "Duwak: Dual watermarks in large language models," in Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 11416–11436.
[39] C. Zhu, J. Tang, J. M. Galjaard, P.-Y. Chen, R. Birke, C. Bos, L. Y. Chen et al., "Tabwak: A watermark for tabular diffusion models," in International Conference on Learning Representations, 2025.
[40] M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou, "Optimizing generative ai by backpropagating language model feedback," Nature, vol. 639, no. 8055, pp. 609–616, 2025.
[41] Y. Li, L. Zhu, X. Jia, Y. Bai, Y. Jiang, S.-T. Xia, X. Cao, and K. Ren, "Move: Effective and harmless ownership verification via embedded external features," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[42] D. Oliynyk, R. Mayer, and A. Rauber, "I know what you trained last summer: A survey on stealing machine learning models and defences," ACM Computing Surveys, vol. 55, no. 14s, pp. 1–41, 2023.
[43] S. Shao, Y. Li, H. Yao, Y. He, Z. Qin, and K. Ren, "Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution," in Network and Distributed System Security Symposium, 2025.
[44] Z. Wang, J. Guo, J. Zhu, Y. Li, H. Huang, M. Chen, and Z. Tu, "Sleepermark: Towards robust watermark against fine-tuning text-to-image diffusion models," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[45] Z. Wu, O. Zhang, X. Wang, L. Fu, H. Zhao, J. Wang, H. Du, D. Jiang, Y. Deng, D. Cao et al., "Leveraging language model for advanced multiproperty molecular optimization via prompt engineering," Nature Machine Intelligence, vol. 6, no. 11, pp. 1359–1369, 2024.
[46] M. P. Polak and D. Morgan, "Extracting accurate materials data from research papers with conversational language models and prompt engineering," Nature Communications, vol. 15, no. 1, p. 1569, 2024.
[47] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive nlp tasks," in Annual Conference on Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474.
[48] A. Salemi and H. Zamani, "Evaluating retrieval quality in retrieval-augmented generation," in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2395–2400.
[49] H. Yao, J. Lou, Z. Qin, and K. Ren, "Promptcare: Prompt copyright protection by watermark injection and verification," in IEEE Symposium on Security and Privacy. IEEE, 2024, pp. 845–861.
[50] X. Shen, Y. Qu, M. Backes, and Y. Zhang, "Prompt stealing attacks against text-to-image generation models," in USENIX Security Symposium, 2024, pp. 5823–5840.
[51] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, "A survey on rag meeting llms: Towards retrieval-augmented large language models," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2024, pp. 6491–6501.
[52] B. Liu, M. Ding, S. Shaham, W. Rahayu, F. Farokhi, and Z. Lin, "When machine learning meets privacy: A survey and outlook," ACM Computing Surveys, vol. 54, no. 2, pp. 1–36, 2021.
[53] S. Yu, C. Wang, K. Ren, and W. Lou, "Achieving secure, scalable, and fine-grained data access control in cloud computing," in IEEE International Conference on Computer Communications. IEEE, 2010, pp. 1–9.
[54] D. Han, Y. Zhu, D. Li, W. Liang, A. Souri, and K.-C. Li, "A blockchain-based auditable access control system for private data in service-centric iot environments," IEEE Transactions on Industrial Informatics, vol. 18, no. 5, pp. 3530–3540, 2021.
[55] Z. Huang, W.-j. Lu, C. Hong, and J. Ding, "Cheetah: Lean and fast secure two-party deep neural network inference," in USENIX Security Symposium, 2022, pp. 809–826.
[56] J. Zhang, X. Yang, L. He, K. Chen, W.-j. Lu, Y. Wang, X. Hou, J. Liu, K. Ren, and X. Yang, "Secure transformer inference made non-interactive," in Network and Distributed System Security Symposium, 2025.
[57] W. Qu, Y. Zhou, Y. Wu, T. Xiao, B. Yuan, Y. Li, and J. Zhang, "Prompt inversion attack against collaborative inference of large language models," in IEEE Symposium on Security and Privacy, 2025.
[58] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., "Palm-e: An embodied multimodal language model," in International Conference on Machine Learning. PMLR, 2023, pp. 8469–8488.
[59] R. Haase, "Towards transparency and knowledge exchange in ai-assisted data analysis code generation," Nature Computational Science, pp. 1–2, 2025.
[60] K. Ren, Z. Yang, L. Lu, J. Liu, Y. Li, J. Wan, X. Zhao, X. Feng, and S. Shao, "Sok: On the role and future of aigc watermarking in the era of gen-ai," arXiv preprint arXiv:2411.11478, 2024.
[61] X. Zhao, S. Gunn, M. Christ, J. Fairoze, A. Fabrega, N. Carlini, S. Garg, S. Hong, M. Nasr, F. Tramer et al., "Sok: Watermarking for ai-generated content," in IEEE Symposium on Security and Privacy, 2025.
[62] P. Samuelson, "Generative ai meets copyright," Science, vol. 381, no. 6654, pp. 158–161, 2023.
[63] X. Hu, P.-Y. Chen, and T.-Y. Ho, "Radar: Robust ai-text detection via adversarial learning," in Annual Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 15077–15095.
[63] X. Hu, P.-Y. Chen, and T.-Y. Ho, “Radar: Robust ai-text detection via adversarial learning,” in Annual Conference on Neural Information Processing Systems, vol. 36, 2023, pp. 15077–15095.
[64] Z. He, P.-Y. Chen, and T.-Y. Ho, “Rigid: A training-free and model-agnostic framework for robust ai-generated image detection,” arXiv preprint arXiv:2405.20112, 2024.
[65] X. Li, P.-Y. Chen, and W. Wei, “Where are we in audio deepfake detection? a systematic analysis over generative and detection models,” ACM Transactions on Internet Technology, 2025.
[66] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
[67] S. Abdali, R. Anarfi, C. Barberan, and J. He, “Decoding the ai pen: Techniques and challenges in detecting ai-generated text,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6428–6436.
[68] S. Shan, J. Cryan, E. Wenger, H. Zheng, R. Hanocka, and B. Y. Zhao, “Glaze: Protecting artists from style mimicry by text-to-image models,” in USENIX Security Symposium, 2023, pp. 2187–2204.
[69] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
[70] X. Li, K. Li, Y. Zheng, C. Yan, X. Ji, and W. Xu, “Safeear: Content privacy-preserving audio deepfake detection,” in ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 3585–3599.
[71] C. Meurisch and M. Mühlhäuser, “Data protection in ai services: A survey,” ACM Computing Surveys, vol. 54, no. 2, pp. 1–38, 2021.
[72] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1310–1321.
[73] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 1, pp. 5–22, 2022.
[74] D. A. Alber, Z. Yang, A. Alyakin, E. Yang, S. Rai, A. A. Valliani, J. Zhang, G. R. Rosenbaum, A. K. Amend-Thomas, D. B. Kurland et al., “Medical large language models are vulnerable to data-poisoning attacks,” Nature Medicine, pp. 1–9, 2025.
[75] Y. Chen, S. Shao, E. Huang, Y. Li, P.-Y. Chen, Z. Qin, and K. Ren, “Refine: Inversion-free backdoor defense via model reprogramming,” in International Conference on Learning Representations, 2025.
[76] F. Liu, H. Luo, Y. Li, P. Torr, and J. Gu, “Which model generated this image? a model-agnostic approach for origin attribution,” in European Conference on Computer Vision, 2024, pp. 282–301.
[77] L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot, “Machine unlearning,” in IEEE Symposium on Security and Privacy. IEEE, 2021, pp. 141–159.
[78] C. E. Shannon, “Communication theory of secrecy systems,” The Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, 1949.
[79] W. Diffie and M. E. Hellman, “New directions in cryptography,” IEEE Transactions on Information Theory, vol. 22, no. 6, 1976.
[80] O. Purcell, J. Wang, P. Siuti, and T. K. Lu, “Encryption and steganography of synthetic gene circuits,” Nature Communications, vol. 9, no. 1, p. 4942, 2018.
[81] K. Yang, X. Jia, K. Ren, B. Zhang, and R. Xie, “Dac-macs: Effective data access control for multiauthority cloud storage systems,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1790–1801, 2013.
[82] G. Ren, J. Wu, G. Li, S. Li, and M. Guizani, “Protecting intellectual property with reliable availability of learning models in ai-based cybersecurity services,” IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 2, pp. 600–617, 2022.
[83] G. Ren, G. Li, S. Li, L. Chen, and K. Ren, “Activedaemon: Unconscious dnn dormancy and waking up via user-specific invisible token,” in Network and Distributed System Security Symposium, 2024.
[84] L. Wang, S. Xu, R. Xu, X. Wang, and Q. Zhu, “Non-transferable learning: A new approach for model ownership verification and applicability authorization,” in International Conference on Learning Representations, 2021.
[85] Z. Hong, Z. Wang, L. Shen, Y. Yao, Z. Huang, S. Chen, C. Yang, M. Gong, and T. Liu, “Improving non-transferable representation learning by harnessing content and style,” in International Conference on Learning Representations, 2024.
[86] X. Gong, Z. Wang, S. Li, Y. Chen, and Q. Wang, “A gan-based defense framework against model inversion attacks,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 4475–4487, 2023.
[87] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, pp. 3–54, 2007.
[88] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith, “Composition attacks and auxiliary information in data privacy,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 265–273.
[89] S. Utpala, S. Hooker, and P.-Y. Chen, “Locally differentially private document generation using zero shot prompting,” in Conference on Empirical Methods in Natural Language Processing, 2023.
[90] X. Li, W. Liu, J. Lou, Y. Hong, L. Zhang, Z. Qin, and K. Ren, “Local differentially private heavy hitter detection in data streams with bounded memory,” ACM SIGMOD/PODS International Conference on Management of Data, vol. 2, no. 1, pp. 1–27, 2024.
[91] J. Jordon, J. Yoon, and M. Van Der Schaar, “Pate-gan: Generating synthetic data with differential privacy guarantees,” in International Conference on Learning Representations, 2018.
[92] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 1999, pp. 223–238.
[93] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[94] O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” Journal of Network and Computer Applications, vol. 116, pp. 1–8, 2018.
[95] T. Hanser, E. Ahlberg, A. Amberg, L. T. Anger, C. Barber, R. J. Brennan, A. Brigo, A. Delaunois, S. Glowienke, N. Greene et al., “Data-driven federated learning in drug discovery with knowledge distillation,” Nature Machine Intelligence, pp. 1–14, 2025.
[96] J. Guo, Y. Li, L. Wang, S.-T. Xia, H. Huang, C. Liu, and B. Li, “Domain watermark: Effective and harmless dataset copyright protection is closed at hand,” in Annual Conference on Neural Information Processing Systems, vol. 36, 2024.
[97] C. Wei, Y. Wang, K. Gao, S. Shao, Y. Li, Z. Wang, and Z. Qin, “Pointncbw: Towards dataset ownership verification for point clouds via negative clean-label backdoor watermark,” IEEE Transactions on Information Forensics and Security, 2024.
[98] Y. Li, L. Zhu, X. Jia, Y. Jiang, S.-T. Xia, and X. Cao, “Defending against model stealing via verifying embedded external features,” in AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1464–1472.
[99] S. Shao, W. Yang, H. Gu, Z. Qin, L. Fan, Q. Yang, and K. Ren, “Fedtracker: Furnishing ownership verification and traceability for federated learning model,” IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 1, pp. 114–131, 2024.
[100] X. Zhang and S. Wang, “Fragile watermarking with error-free restoration capability,” IEEE Transactions on Multimedia, vol. 10, no. 8, pp. 1490–1499, 2008.
[101] M. Botta, D. Cavagnino, and R. Esposito, “Neunac: A novel fragile watermarking algorithm for integrity protection of neural networks,” Information Sciences, vol. 576, pp. 228–241, 2021.
[102] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
[103] Y. He, B. Li, L. Liu, Z. Ba, W. Dong, Y. Li, Z. Qin, K. Ren, and C. Chen, “Towards label-only membership inference attack against pre-trained large language models,” in USENIX Security Symposium, 2025.
[104] X. Cao, J. Jia, and N. Z. Gong, “Ipguard: Protecting intellectual property of deep neural networks via fingerprinting the classification boundary,” in ACM Asia Conference on Computer and Communications Security, 2021, pp. 14–25.
[105] D. Pasquini, E. M. Kornaropoulos, and G. Ateniese, “Llmmap: Fingerprinting for large language models,” in USENIX Security Symposium, 2025.
[106] R. C. Merkle, “A digital signature based on a conventional encryption function,” in Conference on the Theory and Application of Cryptographic Techniques. Springer, 1987, pp. 369–378.
[107] R. L. Rivest, “The md4 message digest algorithm,” in International Cryptology Conference. Springer, 1991, pp. 303–311.
[108] E. J. De Aguiar, B. S. Faiçal, B. Krishnamachari, and J. Ueyama, “A survey of blockchain-based strategies for healthcare,” ACM Computing Surveys, vol. 53, no. 2, pp. 1–27, 2020.
[109] X. Guo, M. A. Khalid, I. Domingos, A. L. Michala, M. Adriko, C. Rowel, D. Ajambo, A. Garrett, S. Kar, X. Yan et al., “Smartphone-based dna diagnostics for malaria detection using deep learning for local decision support and blockchain technology for security,” Nature Electronics,