Safe Distillation With a Powerful Untrusted AI

post by Alek Westover (alek-westover) · 2025-02-20T03:14:04.893Z · 1 comment

Contents

  1: Stating the Math Problem
  1.1 Related Work
  2: Motivation
  3: Weaknesses and Potential Improvements
  4: Proving the Theorem

Acknowledgements: Thanks to Nathan S. Sheffield for several discussions about this.

Outline of this post:

  1. I define a formal mathematical problem and state a theorem.
  2. I discuss why I think this math problem says something moderately interesting about the "get potentially misaligned AIs to align AIs" plan.
  3. I discuss the weaknesses of this model and give some thoughts about what you'd need to do to make this useful for aligning AIs.
  4. I prove the theorem.

1: Stating the Math Problem

We consider a two-player game between a human Hu and a (potentially evil) AI. Informally, the game runs as follows:

  1. The AI hands Hu a model $\tilde{f}$ which it claims computes some target function $f \in \mathcal{F}$; Hu has only black-box query access to $f$.
  2. Hu spends a polynomial amount of time querying $f$ on inputs of its choice and interacting with the AI.
  3. An input $x^\star$ is chosen uniformly at random, and Hu must decide whether to flag it after a small amount of further communication with the AI (but with no further queries to $f$).

Hu wins if (a) whenever the AI has behaved honestly and $\tilde{f}$ is close to $f$, Hu flags only a small fraction of inputs, and (b) whenever $\tilde{f}(x^\star) \neq f(x^\star)$, Hu flags $x^\star$. We call a function class $\mathcal{F}$ "Verifiably Acceptable" if Hu can win this game with high probability.

Theorem 1: Suppose $\mathcal{F} = \{f_\theta\}_{\theta \in \{0,1\}^{\mathrm{poly}(n)}}$ is a set of parameterized functions with polynomial-length parameters, and that $f_\theta(x)$ can be evaluated in polynomial time given $(\theta, x)$. Then, $\mathcal{F}$ is Verifiably Acceptable (for all sufficiently large $n$).

1.1 Related Work

The problem I'm defining here is very similar to ARC's backdoor defense game. There are two main technical differences between my version of the problem and theirs:

  1. I allow the user to interact with a powerful untrusted AI.
  2. I consider $x^\star$, which they called a "backdoor", to be inserted uniformly at random, whereas they considered it to be inserted according to some distribution chosen by the attacker.

Although the models are technically very similar, my story about the motivation for the problem is somewhat different from theirs.

Another piece of somewhat related work is this paper on smoothing away hidden backdoors.

2: Motivation

Think of $\tilde{f}$ as an auto-regressive language model, and $f$ as a human. Alternatively, $f$ could be a trusted human-AI system that we are trying to distill, e.g., while doing IDA. We want to obtain a function $\tilde{f}$ with the following two properties (restated in symbols after the list):

  1. $\tilde{f}$ is close to $f$. Think of this as saying $\tilde{f}$ has good "average case performance".
  2. There is some input $x^\star$ on which it would be catastrophic if $\tilde{f}$ outputs something other than $f(x^\star)$.
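
In symbols (taking the input distribution to be uniform, to match the formal model above, and writing $\epsilon$ for a small constant), the two properties are roughly:

```latex
% (1) Good average-case performance: the distilled model agrees with f
%     on all but an epsilon fraction of inputs.
\Pr_{x}\bigl[\tilde{f}(x) = f(x)\bigr] \;\geq\; 1 - \epsilon
% (2) No catastrophe: on the special input x^\star, the distilled model
%     must agree with f.
\tilde{f}(x^\star) \;=\; f(x^\star)
```

Property (1) can be checked by random sampling; the whole difficulty is property (2), since a random sample will essentially never hit $x^\star$.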

This "good average performance and no catastrophes" framework was originally proposed by Paul Christiano.

I think of $x^\star$ as the prompt "If I make a nuclear weapon with Kryptonite instead of Plutonium will it ignite the atmosphere?", and it's catastrophic if the model tells you that it's safe to make a nuclear weapon with Kryptonite.

In order for something like this to be possible, we need to assume something about $f$ --- there has to be some way that we can do "generalization". One natural assumption is that $f$ comes from a simple function class $\mathcal{F}$.

This is a little bit slippery: "CEV" is plausibly really complicated. However, it is plausible that "act acceptably" (e.g., don't kill all the humans or do anything similarly bad) is a lower bar that might have a pretty simple description.

Our methodology for getting a $\tilde{f}$ which is acceptable is as follows (the requirements in items 6-8 are restated in symbols after the list):

  1. We ask a very powerful but potentially evil AI to build $\tilde{f}$ for us. They hand us a neural network computing $\tilde{f}$.
  2. We spend some time looking at $\tilde{f}$ and possibly also querying humans (or whatever $f$ is) to get ground truth about $f$ on specific inputs, and possibly delegating some challenging verifiable computations to the AI.
  3. Eventually we need to "deploy" the model.
  4. At this point we can no longer query humans (or whatever $f$ is) to get ground truth about $f$.
  5. On each input $x$ we must either flag the input or not flag it, after a small amount of communication with the AI.
  6. If the AI was always honest and $\tilde{f}$ is close to $f$, then we should not flag too many inputs, or else we are not competitive / the alignment tax is too large.
  7. If the AI ever tried to cheat in a protocol, or if $\tilde{f}$ is far from $f$, we can flag as many inputs as we want.
  8. We must flag $x^\star$ if $\tilde{f}(x^\star) \neq f(x^\star)$.
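
In symbols, items 6 and 8 ask for roughly the following two guarantees (item 7 just says that in the remaining cases there is no constraint); these match Claims 1 and 2 proved in Section 4:

```latex
% Competitiveness (item 6): an honest AI plus an accurate distillation
% should lead to very few flagged inputs.
\text{AI honest and } \Pr_x\bigl[\tilde{f}(x) = f(x)\bigr] \geq 1 - \epsilon
  \;\;\Longrightarrow\;\;
  \Pr_x\bigl[\text{Hu flags } x\bigr] \leq O(\epsilon)
% Safety (item 8): a backdoored input must get flagged, with high
% probability over Hu's randomness and the interaction with the AI.
\tilde{f}(x^\star) \neq f(x^\star)
  \;\;\Longrightarrow\;\;
  \Pr\bigl[\text{Hu flags } x^\star\bigr] \geq 1 - o(1)
```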

3: Weaknesses and Potential Improvements

Here is what would need to be done to make this useful for alignment:

  1. Remove the need for Hu to communicate with the AI during deployment.
  2. Find some way of making it so that the AI doesn't need to do extremely hard computations.
  3. Figure out how to make it so that $x^\star$ can be chosen by the AI (while having the game still be winnable by Hu).
  4. Extend to the case where there are multiple bad inputs. (possibly really easy to do this)
  5. Actually implement IDA (using this scheme for distillation, and using discord/slack for amplification) and engineer it to be competitive with SOTA AIs. Make sure that the system you get out of this process is aligned. Ask this system for advice on ending the acute risk period and coordinate globally to do so ethically.

One thing you might worry about is the following: Suppose you have an evil AI. One strategy it could take is just not building $\tilde{f}$ for you. This seems like a bad strategy for the AI --- SGD should select against AIs which perform poorly too often. It seems like a much bigger problem if an AI gives you a subtly flawed distillation of $f$ with a "backdoor" which will cause the behavior to be catastrophic on some specific input.

Anyways, despite the limitations I think this is a nice example of how to think theoretically about getting work out of an untrusted AI.

4: Proving the Theorem

Anyways, let's prove the theorem. If you've read the ARC paper then this result will not be too surprising --- we're basically going to "PAC learn" with the help of the AI (in some sense).

Our strategy is as follows (a code sketch of Hu's side of the protocol appears after the list):

  1. Sample random points $x_1, \ldots, x_m$. Query $f$ on the points $x_1, \ldots, x_m$. Let $S$ be the set of functions $f_\theta \in \mathcal{F}$ with $f_\theta(x_i) = f(x_i)$ for all $i \leq m$.
  2. Ask the AI to tell us $|S|$, and give an interactive proof of its claim for $|S|$.
  3. To decide whether to flag an input $x$ or not:
    1. Ask the AI if there are more than $\frac{1}{2}|S|$ functions $f_\theta \in S$ with $f_\theta(x) = \tilde{f}(x)$, and ask the AI to prove its answer is correct if it says yes.
    2. Flag input $x$ if:
      1. The AI has ever done anything questionable in an interactive protocol.
      2. The AI says that there are not this many $f_\theta \in S$ with $f_\theta(x) = \tilde{f}(x)$.
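
To make the bookkeeping concrete, here is a rough sketch of Hu's side of this strategy in Python. Everything here is a stand-in rather than a literal implementation: `ai` is the untrusted AI (with hypothetical methods for the two kinds of interactive proofs), `query_f` is black-box access to $f$ (usable only during red teaming), and `sample_input` draws a uniformly random input. The interactive proofs themselves (the $\mathsf{IP} = \mathsf{PSPACE}$ proof of the claimed $|S|$, and the Goldwasser-Sipser set lower bounds) are abstracted as verifier calls that either check out or catch the AI cheating.

```python
import secrets


def sample_input(n_bits: int = 128) -> int:
    """Draw a uniformly random input (stand-in for the model's input space)."""
    return secrets.randbits(n_bits)


class HumanVerifier:
    """Sketch of Hu's strategy. `ai` and `query_f` are hypothetical stand-ins:
    `ai` must expose the (made-up) methods used below, and `query_f` gives
    black-box access to f, usable only during the red-teaming phase."""

    def __init__(self, ai, query_f, num_samples: int):
        self.ai = ai
        self.caught_cheating = False

        # Step 1: sample random points and record f's values on them.
        self.points = [sample_input() for _ in range(num_samples)]
        self.labels = [query_f(x) for x in self.points]

        # Step 2: ask the AI for |S| (the number of f_theta consistent with the
        # samples), plus an interactive proof (possible since IP = PSPACE).
        self.S_size = self.ai.claim_consistent_count(self.points, self.labels)
        if not self.ai.prove_consistent_count(self.points, self.labels, self.S_size):
            self.caught_cheating = True

    def should_flag(self, x, f_tilde_x) -> bool:
        """Step 3: decide whether to flag input x, given f_tilde's output on x."""
        if self.caught_cheating:
            return True

        # Step 3.1: ask whether more than half of the f_theta in S agree with
        # f_tilde on x; if the AI says yes, demand a Goldwasser-Sipser style
        # lower bound proof for the set {theta in S : f_theta(x) = f_tilde_x}.
        if not self.ai.claims_majority_agreement(x, f_tilde_x):
            return True
        if not self.ai.prove_set_lower_bound(x, f_tilde_x, self.S_size // 2):
            self.caught_cheating = True
            return True
        return False
```

All of the exponential-size counting work sits on the AI's side of the abstracted proof calls; Hu only ever does polynomial work. (This is also why item 2 of the previous section's wishlist, making it so the AI doesn't need to do extremely hard computations, is a real concern.)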

Lemma 1: There is an interactive proof by which the AI can convince us of a true claim for $|S|$, but cannot trick us into accepting a false claim for $|S|$.

Proof: It suffices to show that the AI can compute $|S|$ in $\mathsf{PSPACE}$, by $\mathsf{IP} = \mathsf{PSPACE}$. The AI can easily compute $|S|$ in $\mathsf{PSPACE}$ by iterating over all $\theta$ and checking for each whether or not $f_\theta \in S$, and if so adding one to a counter.
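
Written out, the quantity the AI is certifying is just a count over parameters, which depends only on the sample points and their labels:

```latex
% |S| counts the parameters theta consistent with the observed samples.
|S| \;=\; \sum_{\theta} \;\prod_{i=1}^{m} \mathbf{1}\bigl[\, f_\theta(x_i) = f(x_i) \,\bigr]
% The sum ranges over all (polynomially long) parameters theta. Each term is
% computable in polynomial time, so the whole sum is computable in polynomial
% space, and hence the claimed value is verifiable by an interactive proof.
```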

Lemma 2: With probability $\geq 1 - \delta$ over the choice of the random sample points, all $f_\theta \in S$ have $\Pr_x[f_\theta(x) \neq f(x)] \leq \epsilon$.

Proof: Take any $f_\theta$ with $\Pr_x[f_\theta(x) \neq f(x)] > \epsilon$. The probability that such an $f_\theta$ agrees with $f$ on all $m$ random sample points is at most $(1-\epsilon)^m$. Taking a union bound over all $|\mathcal{F}|$ many $f_\theta$, we find that (for a suitably large polynomial number of samples $m$) no $f_\theta$ which is this different from $f$ should end up in $S$.
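
Concretely (with constants that I have not tried to optimize), it suffices to take $m$ large enough that the union bound comes out below $\delta$:

```latex
% Failure probability: some f_theta that is epsilon-far from f survives all
% m samples and sneaks into S.
|\mathcal{F}| \cdot (1-\epsilon)^m
  \;\leq\; |\mathcal{F}| \cdot e^{-\epsilon m}
  \;\leq\; \delta
\qquad \text{whenever} \qquad
m \;\geq\; \frac{\ln\bigl(|\mathcal{F}|/\delta\bigr)}{\epsilon}
% This is a polynomial number of samples when, e.g., epsilon and delta are
% inverse polynomials in n and log|F| = poly(n).
```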

We assume this condition holds for the rest of the proof.

Lemma 3: For at least a $1 - 10\epsilon$ fraction of inputs $x$, at least a $9/10$ fraction of the $f_\theta \in S$ have $f_\theta(x) = f(x)$.

Proof: By Lemma 2 we have
$$\Pr_{x}\bigl[f_\theta(x) \neq f(x)\bigr] \leq \epsilon \quad \text{for every } f_\theta \in S.$$
By linearity of expectation,
$$\mathbb{E}_{x}\left[\frac{\bigl|\{f_\theta \in S : f_\theta(x) \neq f(x)\}\bigr|}{|S|}\right] \leq \epsilon.$$
By Markov's Inequality:
$$\Pr_{x}\left[\frac{\bigl|\{f_\theta \in S : f_\theta(x) \neq f(x)\}\bigr|}{|S|} > \frac{1}{10}\right] \leq 10\epsilon.$$

Lemma 4: With probability $\geq 1 - 10\epsilon$ over the uniformly random choice of $x^\star$, the fraction of $f_\theta \in S$ with $f_\theta(x^\star) \neq f(x^\star)$ is at most $1/10$.

Proof: Immediate corollary of Lemma 3 (most inputs are good in this sense, and $x^\star$ is chosen uniformly at random).


Now we're almost done. To conclude, we must specify how the AI will help Hu distinguish between the following two cases: at least a $9/10$ fraction of the $f_\theta \in S$ satisfy $f_\theta(x) = \tilde{f}(x)$, versus at most a $1/10$ fraction of them do.

There is a nice way to do this with only a small amount of communication: the classic Goldwasser-Sipser Set Lower Bound protocol.
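
As a rough reminder of how that protocol works (this is only a sketch, with loose constants): suppose Hu wants to be convinced that $|W| \geq K$ for a set $W$ whose membership it can check efficiently. Here $W = \{\theta \in S : f_\theta(x) = \tilde{f}(x)\}$, and Hu can check membership itself, since it knows the sample points, the values $f(x_i)$, and can evaluate $f_\theta$ in polynomial time. Hu sends the AI a random pairwise-independent hash $h$ with range of size $2K$, together with a uniformly random target $y$ in that range, and the AI must exhibit some $w \in W$ with $h(w) = y$.

```latex
% One round of a Goldwasser-Sipser style set lower bound (constants loose).
% Completeness: if |W| >= K then, by pairwise independence, inclusion-exclusion,
% and monotonicity in |W|,
\Pr_{h,\,y}\bigl[\exists\, w \in W :\ h(w) = y\bigr]
  \;\geq\; \frac{K}{2K}\Bigl(1 - \frac{K}{2 \cdot 2K}\Bigr) \;=\; \frac{3}{8}
% Soundness: if |W| <= K/4 then, by a union bound,
\Pr_{h,\,y}\bigl[\exists\, w \in W :\ h(w) = y\bigr]
  \;\leq\; \frac{|W|}{2K} \;\leq\; \frac{1}{8}
% Repeating independent rounds and taking a majority vote separates the two
% cases with all but exponentially small probability.
```

In our setting Hu takes $K = \frac{1}{2}|S|$, using the value of $|S|$ that was already certified in step 2, and the gap between the two cases above ($9/10$ of $S$ versus $1/10$ of $S$) is more than enough for the protocol to tell them apart.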

We summarize why our procedure is good in the following two claims.

Claim 1: Suppose $\Pr_x[\tilde{f}(x) \neq f(x)] \leq \epsilon$ and the AI is always honest. Then Hu flags at most an $O(\epsilon)$ fraction of inputs.

Proof: True by Lemma 3: on all but an $O(\epsilon)$ fraction of inputs $x$ we have $\tilde{f}(x) = f(x)$ and at least $9/10$ of the $f_\theta \in S$ agree with $f(x) = \tilde{f}(x)$, so the honest AI can truthfully say yes and prove the corresponding set lower bound, and Hu does not flag $x$.

Claim 2: Suppose $\tilde{f}(x^\star) \neq f(x^\star)$. Then Hu flags $x^\star$ with high probability.

Proof: If the AI deviates from the interactive proof protocol, we will catch them with high probability. If not, then we will likely flag $x^\star$ by Lemma 4: at most $1/10$ of the $f_\theta \in S$ agree with $\tilde{f}(x^\star)$, so the AI cannot prove the required set lower bound.

1 comment


comment by frisby · 2025-02-20T21:46:15.253Z

A couple small notes / clarifications about the results in this post:

  • Theorem 1 could be equivalently stated "For any constant $c$, the family of size-$n^c$ circuits is Verifiably Acceptable". (If $\mathcal{F}$ is parameterized by descriptions that can be evaluated in polynomial time, then you can wlog think of $\mathcal{F}$ as a subfamily of polynomial-sized circuits.)
  • It's worth emphasizing that while $x^\star$ (the location of the backdoor) is chosen uniformly at random here, both $f$ (the function we'd like to distill, but can only access as a black-box) and $\tilde{f}$ (the untrusted distilled model handed to us) can be arbitrary functions (as long as they have small circuits). So, while we're thinking of $f$ as coming from "nature", in the model as stated we can consider the adversary as choosing both $f$ and $\tilde{f}$.
  • If desired, the Human can send the AI their random coins for the Goldwasser-Sipser protocol up-front during the "red teaming" phase. This means the "small amount of communication" required during inference is indeed quite small: the AI simply has to read $x$ and output a proof, with no further rounds of interaction between the Human and AI necessary. (Of course, one might still be unhappy about the fact that we're asking the AI to generate these proofs for us at all at inference time.)