Idea: Safe Fallback Regulations for Widely Deployed AI Systems

post by Aaron_Scher · 2024-03-25T21:27:05.130Z · LW · GW · 0 comments


  In brief
  What sort of threats might trigger fallback
  What safe fallback might look like
  Study of fallback mechanisms: outline of the research agenda
    High level questions
    Implementation details
  Notes and other things

In brief

When told that misaligned artificial intelligence might destroy all of humanity, normal people sometimes react by asking, “Why can’t we just unplug the misaligned AI?” This intuitively appealing solution is unfortunately not available by default: no simple off switch exists. Moreover, society may come to depend on widely deployed AI systems, so shutting them down would cause large disruptions. In this blog post I discuss “safe fallback”, a framework in which we shut down dangerous AIs and temporarily switch to safer, weaker AI models. Ideally, regulators would require AI service providers to maintain safe fallback systems, which would mitigate societal disruption while making it possible to pull the plug on dangerous AIs in an emergency.

There is precedent for such regulations on critical infrastructure, e.g., the need for hospitals to have a backup power supply. Safe fallback is not hugely different from other fail-safe mechanisms we have for critical infrastructure. The two main differences seem to be: we may purposefully shut down our AI systems to trigger safe fallback, and it is not immediately obvious what should be fallen back to in the case of AI (a question I largely leave to future research, while suggesting weaker and trustworthy AI systems as an initial answer). 

Full text:

Author’s note: I’ve only thought about this for ~15 hours, so treat these ideas as tentative. The meta-status is roughly “it might be good if AI regulators did this,” but I have no particular plans to continue work on this, and I do not know how it stacks up against other AI governance priorities. If it seems worth pursuing, let me know.


The framework of safe fallback targets a future world which does not exist yet; let’s briefly discuss key aspects of that world. 

AI systems are widely deployed, including for high-stakes tasks. For instance, they may be monitoring credit card fraud, reviewing security camera footage, controlling medication dosage for hospital patients, coordinating food supply chains, controlling nuclear power plants, driving autonomous vehicles, and much more.[1] For the most part, these AIs are running on servers controlled and monitored by humans, e.g., via an API, rather than being autonomous AI agents on random servers.

Shutting down these systems would be very costly, potentially costing millions of dollars or human lives.[2] However, in some cases, shutting these systems down reduces the risk of catastrophic harm. For instance, these AIs could otherwise be used to aid AI self-exfiltration or a cyberattack.

AI developers and the market may fail to properly price in the catastrophic failures that could come from not shutting down AI systems, e.g., because large scale catastrophe is externalized across society whereas profits from standard operation are relatively internalized to AI developers and direct customers. The same logic applies at the level of international cooperation. 


Given that backdrop, the argument for having safe fallback requirements is: 

What sort of threats might trigger fallback

What safe fallback might look like

Safe fallback applies to a particular AI system in a particular use case: it means having a plan for quickly taking that AI system offline without significantly damaging the downstream task it was performing. This likely requires switching to another AI system that is weaker but more reliable, or switching to non-AI systems. In some cases, this could be similar to a standard software reversion, undoing the most recent update. In other cases, this could require complex changes throughout a system.

Let’s take the following example case: GPT-5 is being used to monitor for cybersecurity threats to company X, via calls to the OpenAI API. OpenAI receives credible reports that instances of GPT-5 are colluding to launch a cyberattack and likely cannot be trusted to assist with cyberdefense. OpenAI issues a fallback notice to customers that are using GPT-5 for cyberdefense, including company X, and immediately routes all of these GPT-5 calls to GPT-4, changing prompt formatting to fit the old model’s best-performance settings for this task. Both OpenAI and company X take additional precautions given the heightened risk, such as OpenAI lowering an internal threshold for jailbreak-attempt detection and company X having on-call human security experts provide additional monitoring on top of GPT-4. Depending on the threat, OpenAI might also route other, non-cybersecurity-related GPT-5 calls to GPT-4. Once the threat has been stopped and analyzed and fixes have been made, GPT-5 instances are deployed again. These response measures were determined and tested in advance.
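To make the routing step in this example concrete, here is a minimal sketch in Python of what the provider-side switch might look like. Everything in it is hypothetical and for illustration only: the names `FallbackPlan` and `FallbackRouter`, the model labels "gpt-5" and "gpt-4", and the prompt adapter are assumptions of mine, not any real provider's API. No real model is called; the router only decides which model and prompt format a request should be served with.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# (model name, use case) pair; both strings are illustrative labels only.
Deployment = Tuple[str, str]


@dataclass
class FallbackPlan:
    """A plan agreed on and tested in advance for one deployment (hypothetical)."""
    fallback_model: str                 # weaker but more trusted model
    adapt_prompt: Callable[[str], str]  # re-formats prompts for that model


@dataclass
class FallbackRouter:
    """Hypothetical provider-side router; it calls no real model or API."""
    plans: Dict[Deployment, FallbackPlan]
    active: Set[Deployment] = field(default_factory=set)
    notices: List[str] = field(default_factory=list)

    def trigger(self, model: str, use_case: str, reason: str) -> None:
        """Start fallback for one deployment and record a customer notice."""
        key = (model, use_case)
        if key not in self.plans:
            # No pre-tested plan exists, so refuse rather than improvise.
            raise KeyError(f"no fallback plan registered for {key}")
        self.active.add(key)
        self.notices.append(
            f"Fallback notice: {model} for {use_case} now served by "
            f"{self.plans[key].fallback_model} ({reason})"
        )

    def restore(self, model: str, use_case: str) -> None:
        """Resume normal routing once the threat is analyzed and fixed."""
        self.active.discard((model, use_case))

    def route(self, model: str, use_case: str, prompt: str) -> Tuple[str, str]:
        """Return the (model, prompt) pair that should actually be served."""
        key = (model, use_case)
        if key in self.active:
            plan = self.plans[key]
            return plan.fallback_model, plan.adapt_prompt(prompt)
        return model, prompt


# Example use with made-up values.
router = FallbackRouter(
    plans={
        ("gpt-5", "cyberdefense"): FallbackPlan(
            fallback_model="gpt-4",
            adapt_prompt=lambda p: "[legacy format] " + p,
        )
    }
)
router.trigger("gpt-5", "cyberdefense", "credible reports of collusion")
print(router.route("gpt-5", "cyberdefense", "Review these logs"))
print(router.notices)
```

The point of the sketch is that `trigger` refuses to act when no plan was registered beforehand, mirroring the idea that response measures are determined and tested in advance. The extra precautions in the example (stricter jailbreak detection, human monitoring) are not modeled here.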

Study of fallback mechanisms: outline of the research agenda

High level questions

Implementation details

Notes and other things

Improving society’s ability to slow or pause AI development is also good outside moments of acute risk. Broadly available safe fallback measures might enable coordination between AI labs. In particular, AI developers may want to possess remote shutdown ability for each other in order to disincentivize dangerous racing and allow a project with sufficient lead time to solve safety problems unencumbered. I expect that having safe fallback as a widely adopted standard would make developers more likely to use such tools, and this might improve the situation. I do not know if this would actually be desirable, or what the state of thinking about this is. 

This blog post was partially motivated by the incident a few weeks ago where GPT-4 started producing nonsense, and it took 7 hours for OpenAI to fix the problem. That’s fine when you’re in start-up land, but it’s completely unacceptable if critical services are relying on your technology. This was a bit of a jolt in terms of “wow the AI developers totally won’t do the obviously good thing by default.” 

I don’t know if this idea is worth somebody championing or who that person would be. If somebody with more policy expertise thinks it would be good for me to pursue this further, and/or is willing to hire me to do so, please let me know! By default I will not work more on this. 

  1. ^

     There are also many ways AI systems will be widely deployed where shutting them down temporarily does not risk major harm, such as students using ChatGPT as a tutor.

  2. ^

     The downstream effects could also be large due to the opportunity cost of delayed technological development.

