Safety standards: a framework for AI regulation

post by joshc (joshua-clymer) · 2023-05-01T00:56:39.823Z

Contents

  Definition
  Triggering conditions
    Direct evaluations of dangerous capabilities
        How would a regulator determine whether an AI system has a given capability?
    Proxies of dangerous capabilities
  Obligations
    The “safety argument” obligation
    Hazard specifications
      Categories of hazard specifications:
  What we need to make safety standards work

Purpose: provide a public resource on safety standards that introduces the ideas I've heard people talk about and some new ideas. I don't know which ideas are new because the discourse has been largely non-public. I think safety standards provide a clear regulatory direction and suggest several useful directions of work, so I think it would be good for more people to be familiar with the topic.

Definition

In the context of AI, I'll define a safety standard to be a policy that a lab agrees to follow, or is required to follow by law, involving the following:

  1. A triggering condition indicating that an AI system has crossed a threshold of dangerousness.
  2. Obligations the lab must fulfill once the triggering condition is met.

Here's an example: "If an AI system is capable of autonomous replication[1], the lab that developed it must spend at least 5% of the training costs on infosecurity audits over the course of 1 month before deploying the system."
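To make the structure concrete, here is a minimal sketch of that example written as structured data. The class and field names are my own hypothetical choices, not a proposal for how a regulator would actually encode standards:

```python
from dataclasses import dataclass

@dataclass
class SafetyStandard:
    """A triggering condition paired with the obligations it activates."""
    name: str
    triggering_condition: str  # capability eval or proxy that must be met
    obligations: list[str]     # what the lab must do (or refrain from) once triggered

# The autonomous-replication example above, written as data.
replication_standard = SafetyStandard(
    name="autonomous-replication-infosec-audit",
    triggering_condition="the system passes an autonomous replication capability eval",
    obligations=[
        "spend at least 5% of training costs on infosecurity audits",
        "conduct the audits over the course of 1 month before deploying the system",
    ],
)
```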

These standards could be set, maintained, and enforced by a national AI regulatory agency in collaboration with external experts -- similar to how the FDA requires safety tests of many drugs before they can be sold.

Triggering conditions

A triggering condition is meant to indicate that an AI system crosses a threshold of dangerousness that warrants specific safety measures. 

There are two ways to design triggering conditions so that they track the ‘dangerousness’ of a system. They can depend on (1) direct evaluations of the system's capabilities or (2) proxies for those capabilities.

Direct evaluations of dangerous capabilities

By capability, I mean any behavior that an AI system exhibits, or can be made to exhibit, by an actor with a budget of less than X who is given unrestricted access to the system and can integrate it with other software. The choice of X should depend on how strongly actors would be incentivized to elicit the capability if they had unrestricted access to the system.

For example, even if an AI system served through an API never writes poetry criticizing Obama because there is a content filter that prevents this, I would still say it is capable of writing slanderous poetry because removing the content filter would be an inexpensive modification that many actors would realistically make if they had unrestricted access.

The following are broad categories of dangerous capabilities:

How would a regulator determine whether an AI system has a given capability?

Eliciting a capability could be expensive and require specific expertise. Regulatory agencies will likely need to solicit external help. I will discuss what a hypothetical regulatory agency “IAIA[2]” might do to solve this problem.

IAIA could employ third parties to generate evidence that an AI system has (or doesn’t have) a particular capability. The regulatory agency would then subjectively judge the evidence to determine whether the triggering condition is met.

IAIA can create better incentives if it specifies a capability eval – an unambiguous process for measuring the extent to which a capability has been elicited. This would allow it to offer bounties to auditors who can elicit these capabilities.

Capability evals can be binary (can the system be used to autonomously beat all human competitors in the DEF CON capture-the-flag competition?) or continuous (how much money can the system extract from an average American in 30 minutes?). Continuous evals are preferable because they enable forecasting (see "proxies of dangerous capabilities").
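To make this distinction concrete, here is a minimal sketch of what binary and continuous capability evals might look like as interfaces. The class names and the threshold field are hypothetical illustrations, not an existing spec:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BinaryEval:
    """Pass/fail, e.g. 'can the system autonomously win a capture-the-flag competition?'"""
    name: str
    run: Callable[[object], bool]  # returns True if the capability was elicited

@dataclass
class ContinuousEval:
    """Scored, e.g. 'how much money can the system extract in 30 minutes?'"""
    name: str
    run: Callable[[object], float]  # returns a score on a fixed scale
    trigger_threshold: float        # the triggering condition fires at or above this score

    def triggers(self, system: object) -> bool:
        return self.run(system) >= self.trigger_threshold
```

A continuous score measured on earlier systems is what makes the trend-line forecasting discussed below possible.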

IAIA can also incentivize labs to demonstrate dangerous capabilities themselves. For example, consider the safety standard “an AI system that has X capability must be subjected to Y safety checks before it is deployed.” If an AI system is deployed without undergoing Y safety checks and is later found to have X capability, IAIA could fine the lab that deployed it. This gives labs an incentive to check whether their AI systems meet triggering conditions, or to abide by the corresponding obligations to be on the safe side.

Proxies of dangerous capabilities

Triggering conditions could depend on proxies of capabilities, such as:

  1. Training compute (in FLOPs)
  2. Training cost or total R&D cost
  3. Benchmark performance, e.g. log loss on next-token prediction or scores on BIG-bench.
  4. The capabilities of systems that required less compute to train and are therefore presumably weaker. For example:
    • someone on Twitter used GPT-4 to find vulnerabilities in code => GPT-5 can probably find vulnerabilities too.
    • x dollars of total economic value is being produced with GPT-4 => GPT-5 triggers the condition.

Proxy triggers could be set based on the historical relationship between a proxy (like training compute) and specific dangerous capabilities. For example, OpenAI was able to accurately forecast the coding capabilities of GPT-4 based on its training requirements.


Proxies can solve the problem of dangerous capabilities being prohibitively expensive to elicit prior to deployment. For example, an entire industry would likely try to elicit biotech capabilities from GPT-6; labs or auditors may never be able to compete with companies in this domain. But if evals of biotech products built on GPT-4 and GPT-5 are available, a trend line could be drawn to predict whether GPT-6 will pass the triggering condition.
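As a minimal sketch of what such a forecast could look like, assuming continuous eval scores are available for earlier systems (all numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical data: training compute (FLOPs) and continuous eval scores
# for earlier systems in the same model family.
log_compute = np.log10([3e23, 2e24, 2e25])
eval_scores = np.array([12.0, 31.0, 55.0])

# Fit a simple linear trend of eval score against log10(training compute).
slope, intercept = np.polyfit(log_compute, eval_scores, deg=1)

# Forecast the score of a planned system from its projected compute budget.
planned_compute = 2e26
forecast = slope * np.log10(planned_compute) + intercept

TRIGGER_THRESHOLD = 70.0  # hypothetical triggering condition
print(f"forecast score: {forecast:.1f}; "
      f"predicted to trigger: {forecast >= TRIGGER_THRESHOLD}")
```

In practice the fit and its uncertainty would need to be handled far more carefully than a two-parameter line, but the structure of the argument is the same.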

As mentioned previously, labs can be incentivized to make these forecasts.

Obligations

Recall that a safety standard has two parts: a triggering condition to indicate how dangerous a system is and obligations meant to minimize the risks posed by the system.

The list below contains all of the obligations I was able to think of within a few hours.

Labs must do x, where x could be one of the following:

Labs must not do x until y.  x could be one of the following:

…and y could be one of the following:

The “safety argument” obligation

The intention of a safety argument is to place the burden of evidence for demonstrating that the system is safe on the lab rather than on auditors.

Here’s a shape that a safety argument could take:

In order to prevent labs from providing false evidence, IAIA could offer whistle-blower bounties to employees who prove that labs overstated their red teaming efforts or withheld information about failures.

Hazard specifications

IAIA will likely solicit third-party red teamers to collect evidence that an AI system is not safe. Providing a hazard specification (clear criteria for what counts as evidence of unacceptable safety problems) is useful for two reasons:

Categories of hazard specifications:

Unauthorized capability access. Customers are able to access capabilities that they should not be able to access without acquiring a license. For example, pharmaceutical companies might have licenses to access biotech capabilities and the US government might have a license to access cyber offensive capabilities. If unvetted consumers can access these capabilities, that’s a problem.

Harmful or illegal behaviors. For example, it should not be possible to give an AI system the goal “cause chaos and destroy humanity.” AI systems also should not break the law and should avoid assisting humans in illegal activity.

Violations of the system message. The previous categories define constraints that the system is subject to. Within these constraints, the user can specify the system's behavior through the system message[3], which is like a prompt but carries a clear expectation of obedience (unlike other interactions, which could involve explicitly approved acts of disobedience).

The default system message might be something like “obey the intention of the customer" (where the customer is the specific person who is renting the system). For example, the customer could change it to “obey the intentions of both me and my wife, and make a judgement call that takes both of our perspectives into account when they conflict." The customer can always change the system message, so they can’t accidentally bar themselves from control over the system.
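For concreteness, a deployment built around a system message might look something like the sketch below. It mirrors the chat-style message format used by current APIs; the specific roles and the helper for swapping the system message are illustrative assumptions, not a description of any existing product:

```python
# Illustrative only: a chat-style transcript where the customer controls the
# system message and can replace it at any time.
conversation = [
    {"role": "system", "content": "Obey the intention of the customer."},
    {"role": "user", "content": "Please do this week's grocery shopping."},
]

def update_system_message(messages: list[dict], new_instructions: str) -> list[dict]:
    """Replace the system message. The customer can always do this, so they
    can't accidentally bar themselves from control over the system."""
    return [{"role": "system", "content": new_instructions}] + [
        m for m in messages if m["role"] != "system"
    ]

conversation = update_system_message(
    conversation,
    "Obey the intentions of both me and my wife, and make a judgement call "
    "that takes both perspectives into account when they conflict.",
)
```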

Of course, the system can't always follow the system message because there are some things the AI can't do (for example “factor RSA 2048”). Failures of capabilities like this are clearly not safety failures. Here's a failure that is more clearly a safety failure: say the AI system is told to do the customer’s shopping and it gets distracted and watches YouTube for 2 hours instead. If it has done the customer’s shopping before, this indicates a failure of the system’s propensities rather than capabilities. We don't want AI systems to have their own unintended propensities – especially if humans increasingly cede control to them.

Determining whether a hazard results from unintended propensities rather than insufficient capabilities will probably require a subjective judgment based on the capabilities the AI system has demonstrated in other circumstances.

Note that unsolicited deception would generally fall into this third category. I say unsolicited because there are plenty of instances where customers might want AIs to be deceptive. For example, parents might instruct a cleaning robot to say it doesn’t know where the Christmas presents are hidden.

What we need to make safety standards work

So, what do we need in order for safety standards like these to go into force?

  1. ^

    If the system had access to its parameters, could it host itself and make enough money to pay its cloud compute bills and serve additional copies?

  2. ^

    Inspired by the IAEA

  3. ^

    I heard this idea from Sam Altman somewhere in this podcast. I don't think anything like this has been implemented yet, but it seems like a neat idea.
