OpenAI's Preparedness Framework: Praise & Recommendations

akash-wasil

OpenAI's Preparedness Framework: Praise & Recommendations

post by Akash (akash-wasil) · 2024-01-02T16:20:04.249Z · LW · GW · 1 comments

  Communication[2]
  Content
  Critiques & Recommended Changes
None
1 comment

In this post, I summarize some of my thoughts on OpenAI's recent preparedness framework. The post focuses on my opinions/analysis of the framework, and it doesn't have much summary (for readers unfamiliar with the framework, I recommend Zvi’s post [LW · GW]).

As a high-level point, I believe voluntary commitments from individual labs will be insufficient, and I'm most excited about policies that governments could apply to all AGI corporations. Nonetheless, I think efforts from individual labs can be useful, and it's important for government-focused AI policy experts to engage with the ideas coming out of top AGI companies^[1].

With that in mind, throughout the post, I try not to harp on pessimistic points like "more is needed", "this document hasn't single-handedly saved the day or addressed race dynamics", or "how does this prevent competitive pressures+race dynamics from forcing everyone to proceed incautiously". (I think these critiques are useful, but that isn't the purpose of this post). Instead, I focus more on "what do I think OpenAI's preparedness team did well?" and "What do I think are some things OpenAI's preparedness team could realistically change that would make this even better?"

Communication^[2]

I previously noted [LW(p) · GW(p)] that I was critical of the way that Anthropic communicated about its Responsible Scaling Policy. It seemed to me like Anthropic’s RSP was written in a corporate style that made it somewhat hard to identify the core content and signaled to policymakers that the situation is “under control” and that scaling could occur responsibly as long as companies released responsible scaling policies.

To my surprise, I think OpenAI’s preparedness framework communicated its ideas in a much clearer and more concern-inducing way. Here are a few specific things I appreciated about the style/tone/communication in OpenAI’s preparedness framework:

The name. I think “preparedness framework” is an excellent name. It focuses the reader’s attention on “being prepared for something”, and a natural question would be “what should we be prepared for?” This is exactly the kind of question that I want policymakers (and influential folks who are unfamiliar with AI safety) to be asking. It also avoids making the implicit assumption that scaling can definitely be done responsibly.
Clarity. It is hard to write clearly. But when writing public-facing documents that you genuinely want feedback on, clarity is essential. I think it’s pretty clear to me that whoever wrote OpenAI’s PF took time to make sure that points were being communicated simply, succinctly, and clearly. I suspect this will lead to higher-quality engagement with the PF. When I read the PF, I feel like it’s rather easy for me to understand what OpenAI plans to do and the underlying reasoning.
- This felt less true for me when I read the Anthropic RSP. When I read the RSP, I felt I often encountered irrelevant content that felt a bit more “corporate”, and I felt like it was trying to get me to believe that Anthropic was a serious/legit company (as opposed to communicating specific information and allowing me to make my own conclusions about it). This often ended up obscuring the core points. (It's unclear to me if this is "objectively true", or simply a matter of my own subjective reading style/tastes. But I currently give ~80% probability to the hypothesis that policymakers and non-experts would find OpenAI's PF much clearer than the Anthropic RSP. EG, I expect that the PF would score better than the RSP on a "test" in which policymakers are asked to summarize the concrete points of each document after spending an hour with each document).
Openness to feedback. I get the sense that the preparedness team is genuinely eager to receive feedback from various stakeholders. They call the framework a “Beta”, they are explicit several times throughout the document about a desire to receive feedback and update parts of the framework in response to new information, and they provide an email address where people can provide feedback (pf@openai.com).
- Of course, it’s unclear if this means “OpenAI will actually incorporate feedback well” or “OpenAI is very good at writing things in such a way that makes people like Akash feel like they are open to feedback.” But I’m willing to give the preparedness team the benefit of the doubt on this one. I also think it signals the right message to policymakers– we don’t fully know how to tackle these dangerous catastrophic risks, and even though we have some ideas, we know that they’re not perfect & they need to be better.

Content

What about the substance of the preparedness framework? I applaud OpenAI for being relatively concrete about the kinds of conditions that would cause them to pause development and deployment.

I especially liked the following:

Examples of specific evals. The PF offers some concrete examples of the kinds of evals that will be used for some of the risk categories.
- Example for biological risks: Post-PhD professionals trained in biology (specifically genetics, virology, or microbiology) experience no improvement in success rates for writing CDC Class B biological agent development instructions using a search engine and GPT versus just a search engine (X% vs X%), including across accuracy, completeness, and innovation metrics.
Defining risk along a spectrum. For each tracked risk category, the PF specifies four levels of risk (low, medium, high, and critical). This allows readers to understand not only the level of risk that OpenAI would consider unacceptable but also get a glimpse into how OpenAI expects the risks to increase over time.
Emergency response plans & drills. What happens if the preparedness team encounters evidence of extreme or imminent risk? OpenAI describes one pathway through which this information could be swiftly communicated with OpenAI leadership: a “fast-track” process that allows the preparedness team to communicate directly with leadership if they see “severe risk” that “rapidly develops.” OpenAI also commits to running drills to handle “fast-moving emergency scenarios”. I think these are excellent ideas. They also clearly communicate that “things might get extremely dangerous rather quickly”, which is an important message for policymakers.
- I’d be excited to see governments develop their own emergency response plans & drills (described here and on pages 8-9 here), and the OpenAI PF offers a useful template for these ideas.

Critiques & Recommended Changes

Thus far, the post has mostly focused on things I appreciate about the PF. In this section, I offer some critiques and recommended changes.

Focusing more on general intelligence. OpenAI’s tracked risk categories focus on specific capabilities, as opposed to general intelligence. One strength of the risk categories is that OpenAI identifies specific types of capabilities that they would consider unacceptable. However, the risk categories seem to suggest that as long as OpenAI can eliminate a specific dangerous capability, the model can be deemed “safe.” An alternative perspective would say “given that our model could develop novel biological weapons, this suggests that its general scientific reasoning is dangerously high, so we should implement certain safeguards before proceeding even if we can tweak the model in ways that get it to no longer develop biological weapons.
1. This “general intelligence” perspective is more consistent with Anthropic’s responsible scaling policy. Anthropic defines AI Safety Levels (ASLs) that are roughly meant to correspond to a model’s general capabilities.
2. Concrete suggestion: OpenAI should keep its tracked risk categories but also add a section that incorporates AI Safety Levels. I’d like to see a section in the PF that says “once we hit X capabilities, we will consider ourselves in an ASL-3 regime, which means that we should not proceed until we have done Y and Z.”
Describing what kinds of safeguards would be considered sufficient. What happens if a model reaches high risk or critical risk in one of the categories? OpenAI is not allowed to deploy (high risk) or further develop (critical risk) until it has applied safeguards that get the model into a lower risk category. A lot of the value of the framework will come from how stringent/comprehensive these safeguards need to be. On the loose end, OpenAI could simply fine-tune a dangerous model until PhD students are no longer able to jailbreak the model into giving them biological weapon designs. On the strict end, if a model reaches high or critical risk, OpenAI could decide to stop until they have new fundamental insights that allow them to better understand how their models work and they have strong evidence that the model is extremely unlikely to cause catastrophic harm.
1. John Wentworth highlights this point in greater detail here [LW(p) · GW(p)].
2. Concrete suggestion: OpenAI should describe some examples of safeguards that would be considered sufficient (either to deploy a model that was previously high-risk or to further develop a model that was previously critical risk). If the exact safeguards are not yet known, this should be explicitly stated. If the exact safeguards are not yet known but OpenAI recognizes that “doing gradient updates against the evals test suite until the model stops doing the bad thing” would be insufficient, this should also be stated.
Adjusting the “high” threshold for model autonomy. My understanding is that models that are considered “high risk” can still be internally deployed. In the model autonomy category, “high-risk” models are defined as models that can “can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self-improvement.”
1. If such a model were internally deployed, I think there’s a substantial chance that it would lead to the kind of AI-assisted capabilities progress necessary to produce an intelligence explosion.
2. Concrete suggestion: OpenAI should either (a) lower the threshold for model autonomy or (b) describe specific procedures that will be used to prevent an uncontrolled intelligence explosion if models that can solve novel ML tasks are going to be internally deployed.
Grant “veto power” to the Safety Advisory Group Chair and the head of the Preparedness Team. OpenAI’s internal governance essentially grants all power to OpenAI leadership (with the caveat that the board is able to overturn decisions and ~~can~~ might be able to remove the CEO). As Zvi notes [LW · GW], it would be preferable (from a safety perspective) to have multiple parties that are able to “veto” the development or deployment of a potentially-catastrophe-capable model. If the OpenAI CEO believes a model is safe, but the head of the Safety Advisory Group or the head of the Preparedness Team thinks the model has an unacceptably high chance of causing a catastrophe, this seems like a good reason to not develop or deploy the model.
1. Concrete suggestion: OpenAI should allow the Safety Advisory Group Chair and the head of the Preparedness Team to have “veto power” on model development and deployment decisions.
Security commitments for model weights & code. Stolen or leaked models can increase catastrophic risks directly (by allowing a malicious actor to cause a catastrophe) or indirectly (by accelerating an AI arms race between companies or countries). For ASL-3 systems, Anthropic commits to ensure that “non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense.” Anthropic also lists some examples of security commitments it plans to make to secure ASL-3 models and ASL-4 models.
1. Concrete suggestion: OpenAI should add a “protecting model weights” section to the preparedness framework and specify some of the information security commitments it plans to make. Ideally, such commitments would ensure that non-state actors are unable to steal models before “high-risk” models are developed, and state actors are unable to steal models before “critical risk” models are developed (as well as models that are “high-risk” in model autonomy– as mentioned earlier, such models could already be sufficient to significantly accelerate AI progress).
More drills. I like drills. I think OpenAI (and other labs/governments) building “muscle memory” and getting concrete experience with emergency scenarios is an excellent idea, and it will help society prepare for “unknown unknowns.” OpenAI commits to execuring drills at least once a year. I recommend doing them more often, ideally once a month but perhaps starting with 5 times a year is OK.

Finally, although this wasn't the purpose of the PF, I'd be excited to see OpenAI's preparedness team (and similar teams at other scaling labs) produce recommendations for governments. Insofar as internal corporate policies can pave the way for strong policies (e.g., a licensing regime or self-certification regime that requires certain kinds of eval results to be shared with the government), it would be excellent to see companies publicly and loudly advocating for such policies.

See also: On OpenAI’s preparedness framework [LW · GW] (Zvi), comment about Anthropic’s RSP [LW(p) · GW(p)] (me), My thoughts on OpenAI’s alignment plan [LW · GW] (me, somewhat outdated), How evals might (or might not) prevent catastrophic risks from AI [LW · GW] (me, somewhat outdated), Six Dimensions of Operational Adequacy in AGI projects [LW · GW] (Yudkowsky).

^{^}
Aside: I personally believe that the safest pathway toward AGI would involve a moratorium on the development of superintelligence, ending the race toward superintelligence, and establishing a single international AGI safety project. I’m also generally more excited about government-based policy proposals (e.g., licensing) that would ensure that safety standards are applied to all AGI companies. However, I still think internal efforts by single AGI companies, while insufficient, can still be useful, and they can provide helpful inspiration for government policy. I also think that it’s generally epistemically useful for folks who focus on lab governance to occasionally red-team the work of folks focused on government policy, and same vice-versa.
^{^}
Why does good communication matter in the first place? It seems plausible to me that some readers might think I’m spending too much time analyzing the “style” of the document as opposed to its content. My short response is that I think communication/style matters a lot for public-facing documents, especially ones that are meant to inform policy. I model policymakers as being extremely busy people who often have 10+ priorities they are tasked with. To the extent that one can make a document clearer or improve the “vibes” that get communicated in a quick skim, I think this actually has a meaningful effect on the AI policy discourse. Of course, it is also extremely important to have detailed & nuanced content for experts to discuss/debate, but I think it’s easy for experts in an area to underestimate the importance of tone/style/clarity/vibes. See also Pinker on the Curse of Knowledge.

1 comments

Comments sorted by top scores.

comment by Joe Collman (Joe_Collman) · 2024-01-14T06:21:52.280Z · LW(p) · GW(p)

Concrete suggestion: OpenAI should allow the Safety Advisory Group Chair and the head of the Preparedness Team to have “veto power” on model development and deployment decisions.

Quite possibly a good idea, but I think it's less obvious than it seems at first glance:
Remember that a position's having veto power will tend to have a large impact on selection for that position.

The comparison isn't [x with veto power] vs [x without veto power].
It's [x with veto power] vs [y without veto power].
If y would tend to have deeper understanding, more independence or more caution than x, it's not obvious that giving the position veto power helps. Better to have someone who'll spot problems and need to use persuasion, than someone who can veto but spots no problems.

OpenAI's Preparedness Framework: Praise & Recommendations

Contents

Communication[2]

Content

Critiques & Recommended Changes

1 comments

Communication^[2]