OpenAI's Preparedness Framework: Praise & Recommendations

post by Akash (akash-wasil) · 2024-01-02T16:20:04.249Z · LW · GW · 1 comment

Contents

  Communication[2]
  Content
  Critiques & Recommended Changes

In this post, I summarize some of my thoughts on OpenAI's recent preparedness framework. The post focuses on my opinions/analysis of the framework, and it doesn't have much summary (for readers unfamiliar with the framework, I recommend Zvi’s post [LW · GW]). 

As a high-level point, I believe voluntary commitments from individual labs will be insufficient, and I'm most excited about policies that governments could apply to all AGI corporations. Nonetheless, I think efforts from individual labs can be useful, and it's important for government-focused AI policy experts to engage with the ideas coming out of top AGI companies[1].

With that in mind, throughout the post, I try not to harp on pessimistic points like "more is needed", "this document hasn't single-handedly saved the day or addressed race dynamics", or "how does this prevent competitive pressures and race dynamics from forcing everyone to proceed incautiously?" (I think these critiques are useful, but they aren't the purpose of this post.) Instead, I focus more on "What do I think OpenAI's preparedness team did well?" and "What could OpenAI's preparedness team realistically change that would make this even better?"

Communication[2]

I previously noted [LW(p) · GW(p)] that I was critical of the way that Anthropic communicated about its Responsible Scaling Policy. It seemed to me like Anthropic’s RSP was written in a corporate style that made it somewhat hard to identify the core content, and that it signaled to policymakers that the situation is “under control” and that scaling could occur responsibly as long as companies released responsible scaling policies.

To my surprise, I think OpenAI’s preparedness framework communicated its ideas in a much clearer and more concern-inducing way. Here are a few specific things I appreciated about the style/tone/communication in OpenAI’s preparedness framework:

Content

What about the substance of the preparedness framework? I applaud OpenAI for being relatively concrete about the kinds of conditions that would cause them to pause development and deployment. 

I especially liked the following:

Critiques & Recommended Changes

Thus far, the post has mostly focused on things I appreciate about the PF. In this section, I offer some critiques and recommended changes.

  1. Focusing more on general intelligence. OpenAI’s tracked risk categories focus on specific capabilities, as opposed to general intelligence. One strength of the risk categories is that OpenAI identifies specific types of capabilities that they would consider unacceptable. However, the risk categories seem to suggest that as long as OpenAI can eliminate a specific dangerous capability, the model can be deemed “safe.” An alternative perspective would say “given that our model could develop novel biological weapons, this suggests that its general scientific reasoning is dangerously high, so we should implement certain safeguards before proceeding even if we can tweak the model in ways that get it to no longer develop biological weapons.”
    1. This “general intelligence” perspective is more consistent with Anthropic’s responsible scaling policy. Anthropic defines AI Safety Levels (ASLs) that are roughly meant to correspond to a model’s general capabilities. 
    2. Concrete suggestion: OpenAI should keep its tracked risk categories but also add a section that incorporates AI Safety Levels. I’d like to see a section in the PF that says “once we hit X capabilities, we will consider ourselves in an ASL-3 regime, which means that we should not proceed until we have done Y and Z.”
  2. Describing what kinds of safeguards would be considered sufficient. What happens if a model reaches high risk or critical risk in one of the categories? OpenAI is not allowed to deploy (high risk) or further develop (critical risk) until it has applied safeguards that get the model into a lower risk category. A lot of the value of the framework will come from how stringent/comprehensive these safeguards need to be. On the loose end, OpenAI could simply fine-tune a dangerous model until PhD students are no longer able to jailbreak the model into giving them biological weapon designs. On the strict end, if a model reaches high or critical risk, OpenAI could decide to stop until they have new fundamental insights that allow them to better understand how their models work and they have strong evidence that the model is extremely unlikely to cause catastrophic harm. 
    1. John Wentworth highlights this point in greater detail here [LW(p) · GW(p)].
    2. Concrete suggestion: OpenAI should describe some examples of safeguards that would be considered sufficient (either to deploy a model that was previously high-risk or to further develop a model that was previously critical risk). If the exact safeguards are not yet known, this should be explicitly stated. If the exact safeguards are not yet known but OpenAI recognizes that “doing gradient updates against the evals test suite until the model stops doing the bad thing” would be insufficient, this should also be stated.
  3. Adjusting the “high” threshold for model autonomy. My understanding is that models that are considered “high risk” can still be internally deployed. In the model autonomy category, “high-risk” models are defined as models that “can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self-improvement.”
    1. If such a model were internally deployed, I think there’s a substantial chance that it would lead to the kind of AI-assisted capabilities progress necessary to produce an intelligence explosion.
    2. Concrete suggestion: OpenAI should either (a) lower the threshold for model autonomy or (b) describe specific procedures that will be used to prevent an uncontrolled intelligence explosion if models that can solve novel ML tasks are going to be internally deployed.
  4. Grant “veto power” to the Safety Advisory Group Chair and the head of the Preparedness Team. OpenAI’s internal governance essentially grants all power to OpenAI leadership (with the caveat that the board is able to overturn decisions and might be able to remove the CEO). As Zvi notes [LW · GW], it would be preferable (from a safety perspective) to have multiple parties that are able to “veto” the development or deployment of a potentially-catastrophe-capable model. If the OpenAI CEO believes a model is safe, but the head of the Safety Advisory Group or the head of the Preparedness Team thinks the model has an unacceptably high chance of causing a catastrophe, this seems like a good reason to not develop or deploy the model.
    1. Concrete suggestion: OpenAI should allow the Safety Advisory Group Chair and the head of the Preparedness Team to have “veto power” on model development and deployment decisions.
  5. Security commitments for model weights & code. Stolen or leaked models can increase catastrophic risks directly (by allowing a malicious actor to cause a catastrophe) or indirectly (by accelerating an AI arms race between companies or countries). For ASL-3 systems, Anthropic commits to ensure that “non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense.” Anthropic also lists some examples of security commitments it plans to make to secure ASL-3 models and ASL-4 models. 
    1. Concrete suggestion: OpenAI should add a “protecting model weights” section to the preparedness framework and specify some of the information security commitments it plans to make. Ideally, such commitments would ensure that non-state actors are unable to steal models before “high-risk” models are developed, and state actors are unable to steal models before “critical risk” models are developed (as well as models that are “high-risk” in model autonomy; as mentioned earlier, such models could already be sufficient to significantly accelerate AI progress). 
  6. More drills. I like drills. I think OpenAI (and other labs/governments) building “muscle memory” and getting concrete experience with emergency scenarios is an excellent idea, and it will help society prepare for “unknown unknowns.” OpenAI commits to executing drills at least once a year. I recommend doing them more often: ideally once a month, though starting with five times a year may be acceptable. 

Finally, although this wasn't the purpose of the PF, I'd be excited to see OpenAI's preparedness team (and similar teams at other scaling labs) produce recommendations for governments. Insofar as internal corporate policies can pave the way for strong policies (e.g., a licensing regime or self-certification regime that requires certain kinds of eval results to be shared with the government), it would be excellent to see companies publicly and loudly advocating for such policies. 

See also: On OpenAI’s preparedness framework [LW · GW] (Zvi), comment about Anthropic’s RSP [LW(p) · GW(p)] (me), My thoughts on OpenAI’s alignment plan [LW · GW] (me, somewhat outdated), How evals might (or might not) prevent catastrophic risks from AI [LW · GW] (me, somewhat outdated), Six Dimensions of Operational Adequacy in AGI projects [LW · GW] (Yudkowsky).

  1. ^

    Aside: I personally believe that the safest pathway toward AGI would involve a moratorium on the development of superintelligence, ending the race toward superintelligence, and establishing a single international AGI safety project. I’m also generally more excited about government-based policy proposals (e.g., licensing) that would ensure that safety standards are applied to all AGI companies. However, I still think internal efforts by single AGI companies, while insufficient, can be useful, and they can provide helpful inspiration for government policy. I also think that it’s generally epistemically useful for folks who focus on lab governance to occasionally red-team the work of folks focused on government policy, and vice versa. 

  2. ^

    Why does good communication matter in the first place? It seems plausible to me that some readers might think I’m spending too much time analyzing the “style” of the document as opposed to its content. My short response is that I think communication/style matters a lot for public-facing documents, especially ones that are meant to inform policy. I model policymakers as being extremely busy people who often have 10+ priorities they are tasked with. To the extent that one can make a document clearer or improve the “vibes” that get communicated in a quick skim, I think this actually has a meaningful effect on the AI policy discourse. Of course, it is also extremely important to have detailed & nuanced content for experts to discuss/debate, but I think it’s easy for experts in an area to underestimate the importance of tone/style/clarity/vibes. See also Pinker on the Curse of Knowledge.

1 comment

Comments sorted by top scores.

comment by Joe Collman (Joe_Collman) · 2024-01-14T06:21:52.280Z · LW(p) · GW(p)

Concrete suggestion: OpenAI should allow the Safety Advisory Group Chair and the head of the Preparedness Team to have “veto power” on model development and deployment decisions

Quite possibly a good idea, but I think it's less obvious than it seems at first glance:
Remember that a position's having veto power will tend to have a large impact on selection for that position.

The comparison isn't [x with veto power] vs [x without veto power].
It's [x with veto power] vs [y without veto power].
If y would tend to have deeper understanding, more independence or more caution than x, it's not obvious that giving the position veto power helps. Better to have someone who'll spot problems and need to use persuasion, than someone who can veto but spots no problems.