Posts

Limitations on Formal Verification for AI Safety 2024-08-19T23:03:52.706Z

Comments

Comment by Andrew Dickson on Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · 2024-09-09T19:45:10.539Z · LW · GW

I very much agree with this concern, and I think synthetic biology is a good comparison case for grounding our intuitions and estimating reasonable priors.

For years, researchers have been sounding the alarm about the risks of advanced biotech, especially tools that allow gene synthesis and editing. Then we had Covid-19, a virus that, politicization aside, probably was created in a lab. Whether or not you believe it was, it seems clear that it easily could have been. Worse, it's clear that something like Covid could also be created relatively easily by an independent team with a fairly modest budget. This case seems analogous to the kind of "close call" described by @Buck. In fact, in many ways Covid was worse than the example Buck gives, because millions of people died, trillions of dollars of damage were done to the economy, and so on, so one could argue it is more similar to an even worse case of an AI that briefly escapes and wreaks some havoc before being contained.

In any case, the end result is that global governments have done very little to regulate synthetic biology since Covid. Even the executive order earlier this year was pretty minimal and applies only to limited types of vendors, customers and sources of DNA.

Why don't they regulate? I suspect the reasons are mostly the same as the ones @Buck mentions in his post. Regulation would stifle one of the most promising new areas of technological innovation and could hurt the profits of the biotech sector. There are also worries that if the US regulates, China won't slow down and will win the race to new medications and other innovations, and so on.

Comment by Andrew Dickson on Limitations on Formal Verification for AI Safety · 2024-08-26T21:15:41.674Z · LW · GW

Thanks Steve! I love these examples you shared. I wasn't aware of them and I agree that they do a very good job of illustrating the current capability level of formal methods versus what is being proposed for AI safety.

Comment by Andrew Dickson on Limitations on Formal Verification for AI Safety · 2024-08-23T18:13:15.409Z · LW · GW

Agustin - thanks for your thoughtful comment.

The concern you raise is something I thought about quite a bit while writing this post/paper. I address it briefly in several parts of the post and considered addressing it explicitly in greater detail, but ultimately decided not to because the post was already getting quite long. The first part where I mention it is quoted below, along with the quote from [1] that appears in that part of the post, since it adds helpful context.

At the same time, obtaining an estimate that a DNA synthesis machine will only produce a dangerous pathogen 30% (or even 1%) of the time is much less interesting than a guarantee that it will not do so at all. But in any case, advocates of GS approaches are not, for the most part, talking about estimates, but instead believe we can obtain strong proofs that can effectively guarantee failure rates of 0% for complex AI software systems deployed in the physical world, as we can see from the following quote (emphasis mine again):

Proof-carrying AGI running on PCH appears to be the only hope for a guaranteed solution to the control problem: no matter how superintelligent an AI is, it can’t do what’s provably impossible. So, if a person or organization wants to be sure that their AGI never lies, never escapes and never invents bioweapons, they need to impose those requirements and never run versions that don’t provably obey them.

Proof-carrying AGI and PCH can also eliminate misuse. No malicious user can coax an AGI controlled via an API to do something harmful that it provably cannot do. And malicious users can’t use an open-sourced AGI to do something harmful that violates the PCH specifications of the hardware it must run on. There must be global industry standards that check proofs to constrain what code powerful hardware and operating systems will run. [1]

Now jumping into your specific points in greater detail:

  1. The safety specification itself may involve probabilities of harmful outcomes. The approach does not rely on guaranteeing a 0% failure rate, but rather on ensuring a quantifiable bound on the probability of failure. This becomes clear in Davidad's Safeguarded AI program thesis: https://www.aria.org.uk/wp-content/uploads/2024/01/ARIA-Safeguarded-AI-Programme-Thesis-V1.pdf

First, I want to note that there is a spectrum in how "GS researchers" talk about these techniques and how strong they expect the guarantees to be. For example, in [1] (Tegmark/Omohundro) the words "probability" and "probabilistic" do not appear until near the end of the paper, in two of the sub-sections of the "challenge problems" section. I also think the quoted passage from [1] above gives a good sense of how these authors expect that we will need proofs offering 100% certainty for most real safety use cases (e.g. "So, if a person or organization wants to be sure that their AGI never lies, never escapes and never invents bioweapons, they need to impose those requirements and never run versions that don’t provably obey them."). Because the authors expect adversarial activity, from both human actors and AIs themselves, they essentially require proofs that are 100% certain to support their claims. I discuss this later in my post as well:

And second, as the same researchers point out, the degree of adversarial activity that may be encountered in the AI safety context may in fact require that this [Limitation 1] be overcome in order for formal verification-based approaches to succeed. Quoting again from Dalrymple et al.,

Moreover, it is also important to note that AI systems often will be deployed in adversarial settings, where human actors (or other AIs) actively try to break their safety measures. In such settings empirical evaluations are likely to be inadequate; there is always a risk that an adversary could be more competent at finding dangerous inputs, unless you have a strong guarantee to the contrary…. This makes it challenging for an empirical approach to rule out instances of deceptive alignment, where a system is acting to subvert the evaluation procedure by detecting features of the input distribution that are unique to the test environment (Hubinger et al., 2021). [3]

This brings us to the authors of [3], which is the paper you mention. While I agree that Dalrymple et al. discuss probabilistic approaches quite a bit in their paper, I think that by their own account (as articulated in the quote above), basically all of the really interesting results they hope for would require proofs with certainty of 100%, or very close to it. This is true both for longer-term threats like loss of control, where strong AI adversaries will exploit any weakness, and for situations like a DNA synthesis machine, where a guarantee that it will only produce a bioweapon 10% of the time, or even 1% of the time, is not very interesting, because of the significance of even a single failure and the presence of human adversaries who will seek to misuse the machines. With all of this in mind, while it is true that the authors of [3] discuss probabilistic approaches quite a bit, my sense is that even they would acknowledge that when we zoom in on the real AI threat scenarios of greatest interest, guarantees of 100%, or very close to it, would be required to achieve interesting results. That said, I would love for someone to articulate what counterexamples might look like - i.e. an example of a probabilistic guarantee about an AI threat that is not very close to 100% but would still be interesting.
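To make this concrete, here is a minimal sketch (my own illustration, not taken from [1] or [3]; the per-request bound and number of attempts are purely hypothetical) of why a modest probabilistic guarantee offers little comfort once adversaries can make many attempts:

```python
# Illustrative only: a per-request failure bound p, applied over n independent
# adversarial attempts, compounds into a near-certain overall failure probability.
def cumulative_failure_probability(p_per_request: float, n_requests: int) -> float:
    """P(at least one harmful output), assuming independent requests."""
    return 1 - (1 - p_per_request) ** n_requests

# Hypothetical numbers: a 1% per-request bound quickly becomes near-certain failure.
print(cumulative_failure_probability(0.01, 100))     # ~0.63
print(cumulative_failure_probability(0.01, 10_000))  # ~1.0
```

Under assumptions like these, only a per-request bound extremely close to zero keeps the overall risk low, which is why I keep coming back to guarantees at, or very near, 100%.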

There is one more note related to the above, which I won't elaborate in great detail here, but which I will sketch. Namely, I believe that for examples like proving that a DNA machine will not cause harm, or that an AGI will not escape our control, the level of GS modeling/specification/verification required to obtain a version of such a proof/guarantee with, say, 50% confidence is probably quite close to the level required to obtain close to 100% confidence, because it is hard to see how one could obtain such a result without still creating extremely detailed simulations of large parts of the AI system and the physical world. Again, I am open to being proven wrong here if someone can sketch what an alternative approach might look like.

Now to address your second point:

  2. The verifier itself can fall within a spectrum and still be considered consistent with the Guaranteed Safe AI approach. While having a verifier that can produce a formal proof of the specified probability bound, which can be checked in a proof checker, would be very powerful, it's worth noting that a procedure capable of computing the probability bound for which we have quantifiable convergence rates would also be regarded as a form of guaranteed quantitative safety verification. (See the Levels in https://arxiv.org/abs/2405.06624 section 3.4).

It is technically true that certain types of approaches fall under "Guaranteed Safety" according to the definition of GS given in [3] and would not be subject to the limitations I describe in my post (for example, "Level 0: No quantitative guarantee is produced", which technically counts as a GS guarantee according to [3]). However, my post is really only responding to the authors' discussion of the levels of GS that would be required for "formal verification" - which is to say, the levels required to get the interesting results against major AI threats that everyone is really excited about here. For results of this sort, at least based on the discussion in the paper and my initial understanding, formal verification with something like 100% certainty appears to be required (although, again, I would love for someone to provide counterexamples to this claim).
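As a rough illustration of why I think the weaker, statistical end of that spectrum struggles here (again my own sketch, not something from [3], and the target numbers are hypothetical): a procedure with quantifiable convergence rates still needs an enormous number of clean test runs to certify a failure probability anywhere near zero.

```python
import math

# Illustrative only: a standard Chernoff-style bound on how many i.i.d. test runs
# with zero observed failures are needed to certify P(failure) <= epsilon with
# confidence 1 - delta, using (1 - epsilon)^n <= exp(-n * epsilon) <= delta
# whenever n >= ln(1/delta) / epsilon.
def runs_needed(epsilon: float, delta: float) -> int:
    return math.ceil(math.log(1 / delta) / epsilon)

# Hypothetical target: certifying a one-in-a-billion failure bound at 99.9%
# confidence takes on the order of 7 billion clean runs, and even then only
# under the (usually false) assumption that test inputs match deployment inputs.
print(runs_needed(1e-9, 1e-3))  # ~6.9 billion
```

And this says nothing about adversaries deliberately steering the system toward inputs that were never sampled during testing, which is exactly the scenario the Dalrymple et al. quote above is worried about.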

Responding to your final thought:

Given the high stakes involved, I think that even if it turns out that the most advanced version of the Guaranteed Safe AI approach is not possible (which we cannot ascertain at this point), it is still both useful and necessary to frame the conversation and assess current approaches through this lens.

I would love to see this as well, and I believe this is exactly how the conversation about GS techniques should rationally evolve from here! That is, if we are not likely to obtain strong formal proofs and guarantees against AI threats in the near term, given the limitations discussed in this post, what are the best results we can hope to get out of the lower-numbered levels of the GS spectrums of modeling, specification and verification? My guess is that these results will be much less exciting and will look a lot more like existing safety solutions such as RLHF, heuristics-based guardrails and empirical studies of bad behavior by models. But I think it's also possible that some unexpected new workable solutions, ones that do not rely on formal verification, may come out of the exercise of exploring the lower levels of the GS spectrum(s).