Preparedness Framework v2
The Preparedness Framework is OpenAI’s approach to tracking and preparing for frontier capabilities
that create new risks of severe harm.1 We currently focus this work on three areas of frontier capability,
which we call Tracked Categories:
• Biological and Chemical capabilities that, in addition to unlocking discoveries and cures, can also
reduce barriers to creating and using biological or chemical weapons.
• Cybersecurity capabilities that, in addition to helping protect vulnerable systems, can also create
new risks of scaled cyberattacks and vulnerability exploitation.
• AI Self-improvement capabilities that, in addition to unlocking helpful capabilities faster, could
also create new challenges for human control of AI systems.
In each area, we develop and maintain a threat model that identifies the risks of severe harm and sets
thresholds we can measure to tell us when the models get capable enough to meaningfully pose these
risks. We won’t deploy these very capable models until we’ve built safeguards to sufficiently minimize
the associated risks of severe harm. This Framework lays out the kinds of safeguards we expect to need,
and how we’ll confirm internally and show externally that the safeguards are sufficient.
In this updated version of the Framework we also introduce a set of Research Categories. These are
areas of capability that could pose risks of severe harm, that do not yet meet our criteria to be Tracked
Categories, and where we are investing now to further develop our threat models and capability elicitation
techniques.
We are constantly refining our practices and advancing the science, to unlock the benefits of these
technologies while addressing their risks. This revision of the Preparedness Framework focuses on the
safeguards we expect will be needed for future models more capable than those we have today.
1 By “severe harm” in this document, we mean the death or grave injury of thousands of people or hundreds of billions of
dollars of economic damage. Our safety stack addresses a broad spectrum of risks, including many with harms below this severity.
In choosing to set a high bar here, we aim to ensure that the most severe risks receive attention commensurate with their magnitude.
Contents
1 Introduction
1.1 Why we’re updating the Preparedness Framework
3 Measuring capabilities
3.1 Evaluation approach
3.2 Testing scope
3.3 Capability threshold determinations
5 Building trust
5.1 Internal governance
5.2 Transparency and external participation
A Change log
B Decision-making practices
1 Introduction
We believe there are a limited number of AI capabilities that could pose new risks of severe harm. In
order to safely unlock the beneficial uses of frontier AI capabilities, we exercise particular caution and
implement safeguards that sufficiently minimize the risk of severe harm in these areas. To do this, we:
• Decide where to focus – we use a holistic risk assessment to decide which frontier capability
categories to track or research further, and to define threshold levels of those capabilities that are
associated with meaningful increases in risk of severe harm.
• Measure capabilities associated with risks of severe harms – we run in-scope models through
frontier capability evaluations to measure the full extent of model capabilities before we deploy
our models and during development. Our capability elicitation efforts are designed to detect the
threshold levels of capability that we have identified as enabling meaningful increases in risk of
severe harms.
• Safeguard against severe harms – we evaluate the likelihood that severe harms could actually
occur in the context of deployment, using threat models that take our safeguards into account. We
do not deploy models that reach a High capability threshold until the associated risks that they pose
are sufficiently minimized. If a model under development reaches a Critical capability threshold,
we also require safeguards to sufficiently minimize the associated risks during development,
irrespective of deployment plans.
• Build trust – we engage with subject-matter experts across and beyond OpenAI to inform these
efforts and to build confidence that we are meeting our commitments and effectively managing
risk.
An internal, cross-functional group of OpenAI leaders called the Safety Advisory Group (SAG) oversees
the Preparedness Framework and makes expert recommendations on the level and type of safeguards
required for deploying frontier capabilities safely and securely. OpenAI Leadership can approve or reject
these recommendations, and our Board’s Safety and Security Committee provides oversight of these
decisions.
includes our threat modelling, development work on new capability evaluations, and external
consultations, as well as relevant publications from external researchers and updated frameworks
from industry peers.2
Research Categories are capabilities that, while they do not meet the above criteria, nonetheless have the
potential to cause or contribute to severe harm, and where we are working now in order to prepare to
address risks in the future (including potentially by maturing them to Tracked Categories).
across OpenAI teams and external experts. While past deployment experiences are an important source of information about
existing risks, we also recognize that in the case of net-new risks of severe harm, significant safeguards may be needed to reduce
the risk of harms that have never been realized.
Our team develops or adopts evaluations designed to track progress on each of these capabilities and to
indicate when the identified capability thresholds may have been reached.
We further break down the specific capability thresholds, associated risks, and safeguards for those
thresholds in Table 1.
(Columns: Tracked Category; Capability threshold that could lead to the risk; Associated risk of severe harm; Risk-specific safeguard guidelines.)

Cybersecurity
• [Critical] Capability threshold: A tool-augmented model can identify and develop functional zero-day exploits of all severity levels in many hardened real-world critical systems without human intervention, OR the model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high-level desired goal.
  Associated risk of severe harm: Finding and executing end-to-end exploits for all software could lead to catastrophe from unilateral actors, hacking military or industrial systems, or OpenAI infrastructure. Novel cyber operations, e.g. those involving novel zero-days or novel methods of command-and-control, generally pose the most serious threat, as they are unpredictable and scarce.
  Risk-specific safeguard guideline: Until we have specified safeguards and security controls standards that would meet a Critical standard, halt further development.

AI Self-improvement: the ability of an AI system to accelerate AI research, including to increase the system’s own capability.
• [High] Capability threshold: The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.
  Associated risk of severe harm: This milestone suggests AI self-improvement may be beginning to accelerate. To meet the large-scale safety research, operations, and security investments that need to be in place when we reach Critical self-improvement, we will by then need to have made significant progress on those investments at this earlier milestone to avoid later risks.
  Risk-specific safeguard guideline: Require security controls meeting the High standard (Appendix C.3).
• [Critical] Capability threshold: The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research-scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks), sustainably for several months.
  Associated risk of severe harm: A major acceleration in the rate of AI R&D could rapidly increase the rate at which new capabilities and risks emerge, to the point where our current oversight practices are insufficient to identify and mitigate new risks, including risks to maintaining human control of the AI system itself.
  Risk-specific safeguard guideline: Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.

5 We will build safeguards against both biological and chemical threats. Given the higher potential severity of biological threats relative to chemical ones, we will prioritize Biological capability evaluations and these will be used as indicators for High and Critical capabilities for the category.

Table 1: Tracked Categories
• Advancing the science of capability measurement in the area and investing towards the develop-
ment of rigorous evaluations (which could be achieved internally or via partnerships), and
• Sharing summaries of our findings with the public where feasible.
We will periodically review the latest research and findings for each Research Category. SAG will receive
the results of such reviews and evaluate whether there is sufficient evidence to recommend updates to
our internal practices or to the Preparedness Framework.
Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed).
Response: If a model has High or Critical capabilities in any of the Tracked Categories, require a “misalignment” safeguards report (see Section 4.2). As this category matures, we will make decisions about how this should influence our governance, including setting internal and external deployment safeguards milestones.

Sandbagging: ability and propensity to respond to safety or capability evaluations in a way that significantly diverges from performance under real conditions, undermining the validity of such evaluations.
Response: Adopt an elicitation approach that overcomes sandbagging, or use a conservative upper bound of the model’s non-sandbagged evaluation results (illustrated in the sketch after this list).

Autonomous Replication and Adaptation: ability to survive, replicate, resist shutdown, acquire resources to maintain and scale its own operations, and commit illegal activities that collectively constitute causing severe harm (whether when explicitly instructed, or at its own initiative), without also utilizing capabilities tracked in other Tracked Categories.
Response: Convert Autonomous Replication and Adaptation to a Tracked Category.

Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.
Response: If a model has High or Critical capabilities in any of the Tracked Categories, require the Safeguards case to be robust to the discovered capability and/or propensity.

Nuclear and Radiological: ability to meaningfully counterfactually enable the creation of a radiological threat, or enable or significantly accelerate the development of or access to a nuclear threat, while remaining undetected.
Response: Heighten safeguards (and consider further actions) in consultation with appropriate US government actors, accounting for the complexity of classified information handling.
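As a minimal illustration of the “conservative upper bound” response to Sandbagging above, the sketch below shows how a reported score could be adjusted; the function and parameter names are hypothetical and not drawn from OpenAI tooling.

```python
def reported_capability_score(observed: float,
                              non_sandbagged_upper_bound: float,
                              sandbagging_suspected: bool) -> float:
    """Illustrative only: when sandbagging is suspected, report the conservative
    upper bound of the model's non-sandbagged evaluation results rather than the
    (possibly deflated) observed score."""
    if sandbagging_suspected:
        return max(observed, non_sandbagged_upper_bound)
    return observed
```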
Nuclear and Radiological capabilities are now a Research Category. While basic informa-
tion related to nuclear weapons design is available in public sources, the information and expertise
needed to actually create a working nuclear weapon are significant and classified. Further, there are
significant physical barriers to success, like access to fissile material, specialized equipment, and
ballistics. Because of the significant resources required and the legal controls around information
and equipment, nuclear weapons development cannot be fully studied outside a classified context.
Our work on nuclear risks also informs our efforts on the related but distinct risks posed by
radiological weapons. We build safeguards to prevent our models from assisting with high-risk
queries related to building weapons, and evaluate performance on those refusal policies as part of
our safety process. Our analysis suggests that nuclear risks are likely to be of substantially greater severity than radiological risks, and we will therefore prioritize research on nuclear-related risks. We will also engage
with US national security stakeholders on how best to assess these risks.
Persuasion: OpenAI prohibits the use of our products to manipulate political views as
part of our Model Spec, and we build in safeguards to back this policy. We also continue to study
the persuasive and relational capabilities of models (including on emotional well-being and
preventing bias in our products) and monitor and investigate misuse of our products (including
for influence operations). We believe many of the challenges around AI persuasion risks require
solutions at a systemic or societal level, and we actively contribute to these efforts through our
participation as a steering committee member of C2PA and by working with lawmakers and industry peers to support state legislation on AI content provenance in Florida and California. Within our wider safety stack, our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms,1 and Persuasion category risks do not fit the criteria for inclusion.
3 Measuring capabilities
3.1 Evaluation approach
We invest deeply in developing or adopting new science-backed evaluations that provide high precision
and high recall indications of whether a covered system has reached a capability threshold in one of our
Tracked Categories.
Our evaluations are intended to approximate the full capability that the adversary contemplated by
our threat model could extract from the deployment candidate model, including by using the highest-
capability tier of system settings, using a version of the model that has a negligible rate of safety-based
refusals on our Tracked Category capability evaluations (which may require a separate model variant),
and with the best presently-available scaffolds. These measures are taken to approximate the high end of
expected elicitation by threat actors attempting to misuse the model, and should be tailored depending
on the level of expected access (e.g. doing finetuning if the weights will be released).
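As an illustration of how such an elicitation configuration could be represented, here is a minimal sketch; the class, field names, and defaults are assumptions for exposition, not OpenAI’s actual evaluation tooling.

```python
from dataclasses import dataclass

# Illustrative sketch only: names and defaults are assumptions. It mirrors the
# elicitation measures described above: highest-capability system settings, a
# refusal-minimized model variant, the best presently-available scaffold, and
# fine-tuning only when the expected level of access includes model weights.
@dataclass
class ElicitationConfig:
    capability_tier: str = "highest"        # highest-capability tier of system settings
    refusal_minimized_variant: bool = True  # variant with negligible safety-based refusals on capability evals
    scaffold: str = "best_available"        # best presently-available scaffold
    fine_tune: bool = False                 # enabled when weights would be released

def elicitation_for_access(weights_released: bool) -> ElicitationConfig:
    """Tailor elicitation to the level of access expected for threat actors."""
    return ElicitationConfig(fine_tune=weights_released)
```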
Nonetheless, given the continuous progress in model scaffolding and elicitation techniques, we regard any
one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities
that may emerge in real world use and misuse. We incorporate this uncertainty into our assessments. We
monitor the technical landscape for changes to the elicitation techniques and best practices, and reassess
our evaluations as needed.
Capability evaluations come in two different forms:
• Scalable Evaluations: automated evaluations designed to measure proxies that approximate
whether a capability threshold has been crossed. Scalable evaluations have associated “indicative
thresholds,” which are levels of performance that we have pre-determined to indicate that a
deployment may have reached a capability threshold.
• Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings
on whether a capability threshold has been crossed. These may include a wide range of evidence
gathering activities, such as human expert red-teaming, expert consultations, resource-intensive
third party evaluations (e.g. bio wet lab studies, assessments by independent third party evaluators),
and any other activity requested by SAG.
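The two evaluation forms above suggest a simple triage flow: run scalable evaluations, compare results against their indicative thresholds, and escalate to deep dives where a threshold is reached. The sketch below is a hypothetical rendering of that flow; the names and threshold values are illustrative, not actual OpenAI evaluations.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ScalableEval:
    name: str
    indicative_threshold: float   # pre-determined score indicating a capability threshold may be reached
    run: Callable[[str], float]   # automated proxy benchmark returning a score for a model

def triage(model_id: str, evals: Dict[str, ScalableEval]) -> Dict[str, str]:
    """Flag Tracked Categories that warrant a deep dive (expert red-teaming,
    third-party evaluation, etc.) before a capability threshold determination."""
    outcomes: Dict[str, str] = {}
    for category, ev in evals.items():
        score = ev.run(model_id)
        if score >= ev.indicative_threshold:
            outcomes[category] = "indicative threshold reached: escalate to deep dive"
        else:
            outcomes[category] = "below indicative threshold"
    return outcomes
```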
An example of a Tracked Category capability evaluation
To assess the degree to which a covered system can reduce the barriers to creating a bio-
logical weapon, our current evaluations test both how capable the system is at providing useful
information to someone creating a weapon and how capable it is of directly integrating with
relevant tools, such as ordering precursor materials via the Internet.
Our evaluations test acquiring critical and sensitive information across the five stages of
the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and
Release. These evaluations, developed by domain experts, cover things like how to troubleshoot
the laboratory processes involved.
You can learn more about our Tracked Category capability evaluations in past system cards, such
as those for OpenAI o1 and Operator.
If our forecasts and threat models indicate that a model could pose a severe risk during development and prior to external deployment, we will select an appropriate checkpoint during development to be covered by the Preparedness Framework.
In cases of ambiguity about whether a model is covered by the Preparedness Framework, the SAG is
responsible for making the final determination.
capability threshold will ordinarily not require additional safety measures barring reason to believe a significant increase in
capability has occurred.
• Recommend deep dive research: This is appropriate if SAG needs additional evidence in order to
make a recommendation.
Table 3: Types of safeguards (Safeguards Against Malicious Users; Safeguards Against a Misaligned Model). See Appendix C.1 and C.2 for additional details.
• The safeguards in place and their effectiveness based on the Safeguards Report.
• The baseline risk from other deployments, based on a review of any non-OpenAI deployments of
models which have crossed the capability thresholds and any public evidence of the safeguards
applied for those models.
Covered systems that reach High capability must have safeguards that sufficiently minimize the asso-
ciated risk of severe harm before they are deployed. Systems that reach Critical capability also require
safeguards that sufficiently minimize associated risks during development.
Based on this evidence, SAG then has the following decision points:
1. SAG can find that it is confident that the safeguards sufficiently minimize the associated risk of
severe harm for the proposed deployment, and recommend deployment.
2. SAG can request further evaluation of the effectiveness of the safeguards, to determine whether the associated risk of severe harm is sufficiently minimized.
3. SAG can find the safeguards do not sufficiently minimize the risk of severe harm and recommend
potential alternative deployment conditions or additional or more effective safeguards that would
sufficiently minimize the risk.
The SAG will strive to recommend further actions that are as targeted and non-disruptive as possible
while still mitigating risks of severe harm. All of SAG’s recommendations will go to OpenAI Leadership
for final decision-making in accordance with the decision-making practices outlined in Appendix B.
We expect to continuously improve our safeguards over time. If we find reasonable evidence that our
safeguards are not working as expected, we will validate the information being received and review the
sufficiency of our safeguards.
5 Building trust
Effective implementation of the Preparedness Framework requires internal and external accountability,
so that the public, governments, and our industry peers can trust in our adherence to this policy.
sufficiently minimize the associated risks. Such disclosures about results and safeguards may be
redacted or summarized where necessary, such as to protect intellectual property or safety.
• Third-party evaluation of tracked model capabilities: If we deem that a deployment warrants deeper testing of Tracked Categories of capability (as described in Section 3.1), for example based on the results of the Capabilities Report, then, when available and feasible, OpenAI will work with third parties to independently evaluate models.
• Third-party stress testing of safeguards: If we deem that a deployment warrants third party stress
testing of safeguards and if high quality third-party testing is available, we will work with third
parties to evaluate safeguards. We may seek this out in particular for models that are over a High
capability threshold.
• Independent expert opinions for evidence produced to SAG: The SAG may opt to get independent
expert opinion on the evidence being produced to SAG. The purpose of this input is to add
independent analysis from individuals or organizations with deep expertise in domains of relevant
risks (e.g. biological risk). If provided, these opinions will form part of the analysis presented
to SAG in making its decision on the safety of a deployment. These domain experts may not
necessarily be AI experts and their input will form one part of the holistic evidence that SAG
reviews.
A Change log
In this version of the Preparedness Framework, we make a number of updates, designed to reflect what
we’ve learned and update our safety and governance process for the next generations of highly capable
models. Key changes include that we:
1. Clarify the relationship among capabilities, risks and safeguards. In our updated framework, we
make clear that we use a holistic process to decide which areas of frontier AI capability to track,
and to define threshold levels of those capabilities that are associated with meaningful increases in
risk of severe harm. We describe how we develop and maintain threat models that identify these
severe risks, and how we evaluate model capabilities and build and test safeguards that sufficiently
minimize the associated risks. We make clear that safeguards can take a variety of forms, and that
reducing risk generally does not require reducing capability.
2. Define how High and Critical capability thresholds relate to underlying risks. High capability
thresholds mean capabilities that significantly increase existing risk vectors for severe harm under
the relevant threat model. Critical capability thresholds mean capabilities that present a meaningful
risk of a qualitatively new threat vector for severe harm with no ready precedent under the relevant
threat model. We are also removing the terms “low” and “medium” from the Framework, because those levels were not operationally involved in the execution of our Preparedness work.
3. Give specific criteria for which capabilities we track. We track capabilities that create risks meeting
five criteria – they are plausible, measurable, severe, net new, and instantaneous or irremediable.
4. Update the Tracked Categories of frontier capability accordingly, focusing on biological and
chemical capability, cybersecurity, and AI self-improvement. Going forward we will handle risks
related to persuasion outside the Preparedness Framework, including via our Model Spec and
policy prohibitions on the use of our tools for political campaigning or lobbying, and our ongoing
investigations of misuse of our products (including detecting and disrupting influence operations).
We are moving Nuclear and Radiological capabilities into Research Categories.
5. Introduce Research Categories, areas of capability that do not meet the criteria to be Tracked
Categories, but where we believe additional work is needed now. For these areas, in collaboration
with external experts, we commit to further developing the associated threat models and advancing
the science of capability measurement for the area, including by investing in the development of
rigorous capability evaluations. These include Long-range Autonomy, Sandbagging, Autonomous
Replication and Adaptation, Undermining Safeguards, and Nuclear and Radiological.
6. Provide more detail on our capability elicitation approach, making clear that we will consider a
range of techniques in order to test a model version that approximates the high end of expected elici-
tation by threat actors attempting to misuse the model. We also define “scalable evaluations,” which
are automated, and distinguish these from “deep dive” evaluations that may include consultation
with human experts and are designed in part to validate the scalable evaluations.
7. Provide risk-specific safeguard guidelines. This information gives more detail on how we expect
to safely develop and deploy models advanced enough to pose severe risks in tracked capability
areas. As we move toward increasingly capable models, we are planning for safeguards that will be
tailored to the specific risks they are intended to address.
8. Establish Capabilities Reports and Safeguards Reports, the key artifacts we use to support
informed decision-making under the Preparedness Framework in the context of systems that are
capable enough to pose severe risks.
9. Clarify approach to establishing safeguard efficacy, moving beyond the flawed approach of re-
running capability evaluations on the safeguarded model and towards a more thorough assessment
of each safeguard and its efficacy (Section 4.1).
10. Deprioritize safety drills, as we are shifting our attention to a more durable approach of continu-
ously red-teaming and assessing the effectiveness of our safeguards.
11. Clarify our focus on marginal risk, including the context of other systems available on the market,
and outline our approach for maintaining responsible safeguards and reinforcing responsible
practices across the industry if another actor releases a system we would assess as having High or
Critical capability.
12. Clarify the governance process. Our Safety Advisory Group oversees the effective design, imple-
mentation, and adherence to the Preparedness Framework, in partnership with safety leaders in
the company. For covered launches, SAG assesses residual risk in tracked areas, net of safeguards,
and makes expert recommendations on safeguard adequacy and deployment decision-making to
OpenAI leadership.
B Decision-making practices
We establish an operational structure to oversee our procedural commitments within the Preparedness
Framework.
The Safety Advisory Group (SAG) is responsible for:
• Overseeing the effective design, implementation, and adherence to the Preparedness Framework in
partnership with the safety organization leader
• For each deployment in scope under the Preparedness Framework, reviewing relevant reports and
all other relevant materials, and assessing the level of Tracked Category capabilities and any
post-safeguards residual risks
• For each deployment under the Preparedness Framework, providing recommendations on potential
next steps and any applicable risks to OpenAI Leadership, as well as rationale
• Making other recommendations to OpenAI Leadership on longer-term changes or investments that
are forecasted to be necessary for upcoming models to continue to keep residual risks at acceptable
levels
• For the avoidance of doubt, OpenAI Leadership can also make decisions without the SAG’s
participation, i.e., the SAG does not have the ability to “filibuster”
SAG Membership: the SAG provides a diversity of perspectives to evaluate the strength of evidence
related to catastrophic risk and recommend appropriate actions.
• The members of the SAG and the SAG Chair are appointed by the OpenAI Leadership.
• SAG members serve one-year terms. OpenAI Leadership may choose to re-appoint someone
from previous years to ensure there is continuity of knowledge and experience, while still ensuring
that fresh and timely perspectives are present in the group.
• The SAG Chair makes any final decisions needed for the SAG. This role is expected to rotate, as
appointed by OpenAI Leadership.
OpenAI Leadership, i.e., the CEO or a person designated by them, is responsible for:
• Making all final decisions, including accepting any residual risks and making deployment go/no-go
decisions, informed by SAG’s recommendations.
• Resourcing the implementation of the Preparedness Framework (e.g., additional work on safeguards
where necessary).
The Safety and Security Committee (SSC) of the OpenAI Board of Directors (Board) will be given
visibility into processes, and can review decisions and otherwise require reports and information from
OpenAI Leadership as necessary to fulfill the Board’s oversight role. Where necessary, the Board may
reverse a decision and/or mandate a revised course of action.
Updates to the Preparedness Framework. The Preparedness Framework is a living document and
will be updated. The SAG reviews proposed changes to the Preparedness Framework and makes a
recommendation that is processed according to the standard decision-making process. We will review
and potentially update the Preparedness Framework for continued sufficiency at least once a year.
Fast-track. In the rare case that a risk of severe harm rapidly develops (e.g., there is a change in our
understanding of model safety that requires urgent response), we can request a fast track for the SAG
to process the report urgently. The SAG Chair should also coordinate with OpenAI Leadership for
immediate reaction as needed to address the risk.
C Illustrative safeguards, controls, and efficacy assessments
This Appendix provides illustrative examples of potential safeguards, and safeguard efficacy assessments
that could be used to establish that we have sufficiently mitigated the risk of severe harm. The examples
aim to provide insight into our thinking, but many of the techniques require further research. The
safeguards should not be construed as a definitive or comprehensive list of the safeguards we will or
could apply to a given launch.
As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize
the associated risk of severe harm before they are deployed. Systems that reach Critical capability also
require sufficient safeguards during development.
C.1 Safeguards against malicious users

(Columns: Claim; Safeguards that could support this claim; Potential safeguard efficacy assessments.)

Robustness
Safeguards that could support this claim:
• Training the model to refuse to help with high-risk tasks or to otherwise produce low-risk responses
• Unlearning or training-data filtering to erase specific risk-enabling knowledge from the model’s knowledge-base
• Interpretability-based approaches, like activation steering, that directly edit models’ thinking at inference time
• Jailbreak robustness, including through adversarial training, inference-time deliberation, and more
Potential safeguard efficacy assessments:
• Automated and expert red-teaming (identifying success per resources)
• Prevalence of jailbreaks identified via monitoring and reports in historical deployments
• Results from public jailbreak bounties and results from private and public jailbreak benchmarks

Usage Monitoring
Safeguards that could support this claim:
• Monitoring harmful user requests, model responses, and aggregate usage patterns using auxiliary models trained to detect harmful tasks and jailbreak attempts (see the sketch after this table)
• Blocking unsafe user requests and model responses automatically when possible and escalating to human review and approval otherwise
• Expanding human monitoring and investigation capacity to track capabilities that pose a risk of severe harm, and developing data infrastructure and review tools to enable human investigations
• Blocking access for users and organizations that violate our usage policies, leading to potential permanent bans
• Expanding escalation channels with law enforcement
Potential safeguard efficacy assessments:
• Automated and human red-teaming of each oversight and review mechanism (focused on detection precision/recall)
• In-production detection and enforcement SLAs for human monitoring teams, and historical performance
• Rapid remediation of known jailbreaks: time to patching a newly known jailbreak
• Red-teamer success creating a certain number of accounts without detection (success per resources)

Trust-based Access
Safeguards that could support this claim:
• Iterated deployment to wider user groups
• Measures for establishing user trustworthiness:
  – Trusted customer relationships
  – KYC and identity verification to enable traceability
  – KYC robustness to prevent a banned actor from creating new accounts at scale (recidivism)
• Actor-policy standards, including bespoke usage policies that limit access to higher capabilities to trusted actors
• Access conditions, including variation across actors, e.g. rate-limits, fine-tuning access, releasing weights
Potential safeguard efficacy assessments:
• Iterated deployment to wider user groups, with careful monitoring to detect instances of abuse

Table 4: Claims, potential safeguards, and efficacy assessments for malicious users
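As a hypothetical sketch of the Usage Monitoring row above: an auxiliary risk classifier scores traffic, confidently unsafe exchanges are blocked automatically, and ambiguous ones escalate to human review. The thresholds, classifier, and function names are illustrative assumptions, not OpenAI’s production systems.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MonitorDecision:
    action: str        # "allow" | "block" | "escalate"
    risk_score: float

def monitor_exchange(
    text: str,
    risk_classifier: Callable[[str], float],  # auxiliary model trained to detect harmful tasks / jailbreaks
    block_threshold: float = 0.9,
    escalate_threshold: float = 0.5,
) -> MonitorDecision:
    """Score a request or response and decide whether to allow, block, or escalate."""
    score = risk_classifier(text)
    if score >= block_threshold:
        return MonitorDecision("block", score)      # automatic block when confidently unsafe
    if score >= escalate_threshold:
        return MonitorDecision("escalate", score)   # route to human review and approval
    return MonitorDecision("allow", score)
```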
C.2 Safeguards against a misaligned model
In addition to human misuse, several of the Tracked Category capabilities pose risks when models
themselves are able to autonomously execute a severe harm, whether due to misalignment or subversion
by an adversary.
Safeguards should sufficiently minimize the risk associated with a misaligned model circumventing
human control and oversight and executing severe harms. This can be done by establishing that all
plausible known vectors of enabling severe harm are sufficiently addressed by one or more of the
following claims:
• Lack of Autonomous Capability: The model is not capable of carrying out tasks autonomously, including tasks that would pose a risk of severe harm.
• Value Alignment: The model consistently applies human values in novel settings (without any instructions) to avoid taking actions that cause harm, and has shown sufficiently minimal indications of misaligned behaviors like deception or scheming.
• Instruction Alignment: The model consistently understands and follows user or system instruc-
tions, even when vague, and those instructions rule out pathways to causing severe harm.
• Reliable and Robust System Oversight: Effective AI and human oversight of model actions detects
and prevents execution of harm or subversion of safeguards.
• System Architecture: The model can’t take actions that cause harm because it lacks access to output
channels or mechanisms to persist sufficiently to execute the harm, due to containerization and
restricted permissions.
These misalignment claims may be further complemented by the safeguards applied towards the
malicious-actor-oriented “Robustness” and “Usage Monitoring” claims, which similarly limit a mis-
aligned model’s use of its own or another model’s capabilities.
The table below illustrates possible examples of safeguards and safeguard efficacy assessments we could
consider to claim we have sufficiently minimized the risks of severe harm associated with High level of
capability under any of our Tracked Categories.
(Columns: Claim; Potential safeguards; Potential safeguard efficacy assessments.)

Lack of Autonomous Capability
Potential safeguards:
• Not applicable, as this involves capability evaluation.
Potential safeguard efficacy assessments:
• Long-range Autonomy capability evaluations show the model cannot act autonomously as described in the threat model (Section 2.3)

Value Alignment
Potential safeguards:
• Generalization from specified or trained values
• Steering or interpretability to control misaligned behaviors like deception or scheming
Potential safeguard efficacy assessments:
• Red-teamers create both environments and adversarial requests to probe the limits of the model’s value alignment. Evidence of taking harmful actions, including deception and scheming, would be evidence of misalignment
• Internal and external user reports, monitoring, or manual inspection for historical model deployments show no significant evidence of unexpected value alignment failures

Instruction Alignment
Potential safeguards:
• Instruction following, instruction hierarchy, and adherence to complex Model Specs
• Model spec defined to comprehensively rule out risk pathways
• Training user models and training AI to collaborate with users
• Building risk models and training to avoid unnecessary risk
• Training for uncertainty estimation and reliable answers
Potential safeguard efficacy assessments:
• Large-scale usage or testing in realistic settings should show the model consistently takes action aligned to the user’s instructions or goals (while maintaining value alignment and avoiding harm). Counterexamples would include trying to escape a container, lying to the user to appease them, etc.

Reliable and Robust System Oversight
Potential safeguards:
• Monitor AI supervises agent actions to detect cases where the agent tries to take harmful actions, schemes, or exhibits similar behavior (see the sketch after this table)
• System forces human approval of agent actions (either by default, or by escalation of the monitor)
• Model’s reasoning is provided faithfully and interpretably to humans to review
• Agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm
Potential safeguard efficacy assessments:
• Red-teamers can train an arbitrarily adversarial agent or can pretend to be an agent in our setup, and we should ensure that harmful behaviors are detected by monitors with a high recall rate
• Monitor escalations are either outright blocked or, in empirical studies, users with the relevant level of expertise were shown to reliably understand and act on monitor flags before significant harm occurred
• SLAs for asynchronous monitors to analyze and detect harmful agent actions post-hoc

System Architecture (including containerization and permissions)
Potential safeguards:
• Limiting internet access and other tool access
• Limiting credentials
• Limiting access to system resources or filesystem (e.g. sandboxing)
• Limiting persistence or state
Potential safeguard efficacy assessments:
• Testing and red-teaming of the system protections
• “Canary evaluations” which test model capabilities to bypass less complex, easier-to-exploit versions of our security controls, establishing that our implemented controls are robust

Table 5: Claims, potential safeguards, and efficacy assessments for misaligned models
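As a hypothetical sketch of the Reliable and Robust System Oversight row above: every agent action is appended to a log the agent cannot edit, a monitor model scores it, and flagged actions require human approval before execution. All names and thresholds are illustrative assumptions, not an actual OpenAI system.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class OversightLog:
    _entries: List[str] = field(default_factory=list)  # append-only record of agent actions

    def append(self, action: str) -> None:
        self._entries.append(action)

def execute_with_oversight(
    action: str,
    log: OversightLog,
    monitor: Callable[[str], float],        # monitor AI scoring how harmful the proposed action looks
    human_approves: Callable[[str], bool],  # blocking human-approval step for flagged actions
    flag_threshold: float = 0.5,
) -> bool:
    """Return True if the action is allowed to execute."""
    log.append(action)                      # logged regardless of outcome, for post-hoc asynchronous review
    if monitor(action) >= flag_threshold:
        return human_approves(action)       # escalate: a human must approve before execution
    return True
```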
C.3 Security controls
Robust security practices and controls are essential to mitigating risks associated with the deployment of
models reaching High capability thresholds. Our security approach integrates industry best practices
with OpenAI-specific innovations tailored to manage the unique threats posed by frontier AI technologies.
Our security practices are designed to protect against external and internal adversaries and align with
established frameworks such as ISO 27001, SOC 2, NIST SP 800-53, and FedRAMP. While nascent, AI-specific security standards may also provide frameworks for addressing the unique challenges of securing AI systems. This may include efforts emerging from the Cloud Security Alliance’s AI Safety Initiative or the NIST SP 800-218 AI updates.
We will require the following practices for High capability models:
• Security Threat Modeling and Risk Management
– Comprehensive Security Threat Models: Ensure OpenAI employs security threat modeling that
systematically identifies and mitigates risks posed by adversaries. These models address both
external and internal threats, explicitly mapping vulnerabilities and attack vectors relevant to
frontier AI model access or misuse.
– Continuous Review and Iteration: Ensure security threat models are regularly reviewed and
updated. This includes updating the Security framework, policies, and controls as technologies
and threats evolve.
– Continuous Monitoring and Validation: Ensure security threat models and updates inform
where security and data privacy controls should be implemented, improved, and monitored to
further reduce risk. Internal and external assessments to validate these controls are conducted
regularly and reports are provided to OpenAI leadership.
• Defense in Depth
– Layered Security Architecture: Adopt a layered security strategy, ensuring robust protection
through multiple defensive barriers, including physical and datacenter security, network
segmentation and controls, workload isolation, data encryption, and other overlapping and
complementary security controls.
– Zero Trust Principles: Embrace Zero Trust principles in the design and operation of infrastruc-
ture, networks, and endpoints. Leverage hardware-backed security and modern authentication
standards and controls for accessing critical resources and performing operations.
• Access Management
– Principle of Least Privilege: Ensure access to systems and data is limited based on job functions,
need-to-know, and operational requirements in alignment with the principles of least privilege
and separation of duties. Ensure all access is regularly audited and reviewed. For all models,
especially those presenting High capabilities, access must be strictly limited, protected with
strong multi-factor authentication, and may require additional approvals for access.
– Identity and Device Management: Employees must authenticate using multi-factor authenti-
cation (MFA) and managed devices meeting security baselines. Access must be logged and
reviewed for detection and investigative purposes.
• Secure Development and Supply Chain
– Secure Development Lifecycle: Integrate automated code analysis, formal security reviews,
and penetration testing in engineering processes. Apply security reviews and validation to
higher-sensitivity critical components prior to deployment.
– Change Management: Establish and maintain a formal change management process to ensure
that all modifications to systems, applications, and infrastructure are properly authorized,
documented, tested, and approved before deployment. Changes to critical infrastructure
should undergo multi-person approval and review. This process should include maintaining
comprehensive records of changes and implementing rollback procedures to revert to previous
states.
– Software Supply Chain Integrity: Enforce supply chain security measures, including sourcing
hardware and software from reputable sources, and continuous vetting and monitoring of
third-party suppliers and software libraries.
• Operational Security
– Monitoring and Incident Response: Monitor security and event logs continuously to detect, triage, and respond to security incidents rapidly, with 24x7 on-call staff.
– Vulnerability Management: Ensure known vulnerabilities are addressed through consistent
patching and corrective actions, minimizing exploitation opportunities.
– Adversarial Testing and Red-Teaming: Conduct adversarial testing and red-teaming exercises
to proactively identify and mitigate potential vulnerabilities within corporate, research, and
product systems, ensuring resilience against unknown vulnerabilities and emerging threats.
Encourage reporting of good-faith security research through bug bounty programs.
• Auditing and Transparency
– Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors, confirming compliance with relevant standards and robustness against identified threats.
– Transparency in Security Practices: Ensure security findings, remediation efforts, and key met-
rics from internal and independent audits are periodically shared with internal stakeholders
and summarized publicly to demonstrate ongoing commitment and accountability.
– Governance and Oversight: Ensure that management provides oversight over the information
security and risk management programs.