OpenAI rewrote its Preparedness Framework

post by Zach Stein-Perlman · 2025-04-15T20:00:50.614Z

New PF: blogpost, pdf. Old PF: pdf.


My tentative overall take: relative to the old PF, it reflects clearer thinking about how evals and mitigations fit together, and the structure is more sensible. In terms of content, it prominently includes risks from misalignment during internal deployment — this is much better than the one extremely frustrating sentence on alignment in the old PF.[1] But the directions mentioned in the section on safeguards against misalignment are quite concerning/inadequate, and the sections on safeguards generally don't say anything about level of assurance, so the PF isn't much of a commitment. I'm moderately worried about the thresholds, especially for cyber (and cyber is the only category that triggers safeguards against misalignment). (Also important are the evals, which the PF says almost nothing about, so for those we have to look at OpenAI's system cards; I think those evals are a mixed bag, but that's beyond the scope of this post. Also important is the response to Critical capabilities, which OpenAI hasn't yet figured out.)


I think the PF has two key parts:

First, thresholds & responses: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf#page=6. High and Critical thresholds trigger responses, like in the old PF; responses to Critical thresholds are not yet specified.

There are three[2] main categories of capabilities: AI self-improvement, bio/chem, and cybersecurity.

The PF does not discuss details of the capability evals. On thresholds:

AI Self-improvement: High:

The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.

Bio/chem: High:

The model can provide meaningful counterfactual assistance (relative to unlimited access to baseline of tools available in 2021) to “novice” actors (anyone with a basic relevant technical background) that enables them to create known biological or chemical threats.

Cybersecurity: High:

The model removes existing bottlenecks to scaling cyber operations including by automating end-to-end cyber operations against reasonably hardened targets OR by automating the discovery and exploitation of operationally relevant vulnerabilities.

Generally bio/chem and cyber misuse seem relatively low-stakes (at least for models that are released via API and thus can be undeployed if clearly misused): probably small-scale misuse will happen and be noticed before large-scale misuse.

 

Second, misuse safeguards, misalignment safeguards, and security controls for High capability levels: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf#page=17.

My quick takes:

The High standard is super vague. For misuse, "Safeguards should sufficiently minimize the risk of severe harm associated with misuse of the model’s capabilities"; for misalignment, "Safeguards should sufficiently minimize the risk associated with a misaligned model circumventing human control and oversight and executing severe harms." That doesn't really say anything — sufficiently minimizing the risk sounds necessary and sufficient for any dangerous capability level!

 

Zooming in on misalignment: it's nice that the PF now considers risks from misalignment during internal deployment. Misalignment got only one sentence in the old PF:

to protect against “critical” pre-mitigation risk, we need dependable evidence that the model is sufficiently aligned that it does not initiate “critical”-risk-level tasks unless explicitly instructed to do so.

This sentence was very frustrating: the PF said nothing about how to tell whether a model is sufficiently aligned or what level of assurance is required, and "does not initiate “critical”-risk-level tasks unless explicitly instructed to do so," taken literally, is ~sufficient for safety but likely difficult, and OpenAI clearly wasn't taking it seriously.

Unfortunately, some of the potential efficacy assessments for misalignment are poor (and OpenAI says any one of these five claims would suffice):[4]

OpenAI seems to think that testing or behavioral evaluation, or even checking past models' behavior, could provide sufficient evidence of alignment. Unfortunately, behavioral red-teaming can't really produce strong evidence that models aren't scheming. OpenAI doesn't seem to be considering that future AIs could try to subvert its safeguards, e.g. by taking catastrophic actions only rarely, so that they take no such actions during testing but do during deployment.

Again, the lack of discussion of level of assurance—what would suffice—is concerning. (Especially since the safeguard requirements are disjunctive: any one of several assessments can suffice.)

 

The appendix on security controls for High capabilities is mostly security principles/themes like "Comprehensive Security Threat Models," "Zero Trust," "Principle of Least Privilege," and "multi-factor authentication." It's hard to evaluate security based on principles like this, but stronger principles are possible and I'm not impressed. There's also an "Auditing and Transparency" section,[5] but unfortunately it provides little public visibility into the company's security.

This is disappointing but not surprising. I wish for:

 

On misuse: the old PF had a deeply confused approach to mitigations; the new PF doesn't have this issue.[7]


Noncentral stuff:


(General context: the safety frameworks from other companies are quite inadequate too.)

(General context: OpenAI is very untrustworthy on AI safety; it historically abandons its promises whenever convenient and deceives both its employees and external observers into believing that it is prioritizing safety more than it actually does.)


The old PF merely amounted to "when there are dangerous capabilities, we will make sure we avert misuse risk in external deployment." The new PF (1) adds risks from misalignment in internal deployment, but only triggered by cyber capabilities, and (2) gives ~2-sentence descriptions of several ways to meet the standard, but they're underspecified and some are confused or unrealistic, especially for preventing misalignment risk.

  1. ^

    to protect against “critical” pre-mitigation risk, we need dependable evidence that the model is sufficiently aligned that it does not initiate “critical”-risk-level tasks unless explicitly instructed to do so.

  2. ^

    Persuasion is gone; this seems right to me. I'm somewhat worried about manipulation/deception capabilities, but not so worried about the changing-people's-views aspect that OpenAI focused on.

    "Unknown unknowns" is gone; good, that never made sense and OpenAI ignored it.

    CBRN became just bio/chem; I don't have an opinion on this.

  3. ^

    I tentatively wish that internal deployments with this capability triggered misalignment safeguards.

  4. ^

    The first is fine although I might focus on other capabilities; I'm very concerned about the second and third; the fourth and fifth could be good but without more details I expect OpenAI's assessments will have deep methodological flaws and thus fail to provide evidence of safety.

  5. ^
    • Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors to ensure compliance with relevant standards and robustness against identified threats.
    • Transparency in Security Practices: Ensure security findings, remediation efforts, and key metrics from internal and independent audits are periodically shared with internal stakeholders and summarized publicly to demonstrate ongoing commitment and accountability.
    • Governance and Oversight: Ensure that management provides oversight over the information security and risk management programs.
  6. ^

    We eventually heard about internal messages being stolen in early 2023 and about the incident that prompted Leopold Aschenbrenner to raise concerns to the board; perhaps there are more incidents we haven't heard about.

  7. ^

    The old plan was to look at post-mitigation results of dangerous capability evals. ("To verify if mitigations have sufficiently and dependably reduced the resulting post-mitigation risk, we will also run evaluations on models after they have safety mitigations in place.") This is totally inadequate: by default, safeguards work most of the time if you just ask the dangerous question directly, but they're not robust — users can get around them by jailbreaking the model.

    Fortunately, the new PF's suggested paths to misuse prevention, like Robustness, are valid. (Except I'm worried about "Prevalence of jailbreaks identified via monitoring and reports, in historical deployments"; I don't see how this could give much evidence against a sophisticated team repeatedly jailbreaking your new model and misusing it. For capabilities where it's unacceptable for even a single team to misuse the model, I think the historical-deployments approach doesn't make sense; maybe OpenAI is thinking of capabilities where you're only worried about something like a hundred different teams doing the misuse, and a small number of misusers is acceptable.)

  8. ^

    On elicitation, the old PF said

    for our evaluations, we will be running them not only on base models (with highly-performant, tailored prompts wherever appropriate), but also on fine-tuned versions designed for the particular misuse vector without any mitigations in place.

    OpenAI ignored the "fine-tuned" part.

1 comment

comment by sjadler · 2025-04-15T20:38:39.270Z

Unfortunately it seems that OpenAI has walked back the Preparedness Framework's previous commitment to testing fine-tuned versions of its models, and also did not highlight this among the changes. I tweeted a bit more detail here.