A couple small notes / clarifications about the results in this post:
- Theorem 1 could be equivalently stated: "For any constant c, the family of size-n^c circuits is Verifiably Acceptable". (If the family is parameterized by descriptions that can be evaluated in polynomial time, then you can wlog think of it as a subfamily of polynomial-sized circuits.)
- It's worth emphasizing that while the location of the backdoor is chosen uniformly at random here, both the function we'd like to distill (which we can only access as a black box) and the untrusted distilled model handed to us can be arbitrary functions (as long as they have small circuits). So, while we're thinking of the target function as coming from "nature", in the model as stated we can consider the adversary as choosing both functions.
- If desired, the Human can send the AI their random coins for the Goldwasser-Sipser protocol up-front during the "red teaming" phase. This means the "small amount of communication" required during inference is indeed quite small: the AI simply has to read the input and output a proof, with no further rounds of interaction between the Human and the AI necessary. (Of course, one might still be unhappy about the fact that we're asking the AI to generate these proofs for us at all at inference time.)
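As a toy illustration of the second point above (not from the post; `plant_backdoor` and the bit-valued `f` are hypothetical names), here is a sketch of why a uniformly random backdoor location is hard to find by black-box testing: a tester making m queries to an n-bit function hits the backdoor with probability at most m/2^n.

```python
import random

def plant_backdoor(f, n, rng=None):
    """Given a black-box 0/1-valued function f on n-bit inputs, return a
    function that agrees with f everywhere except at one uniformly random
    input x*, where its output is flipped."""
    rng = rng or random.Random()
    x_star = rng.randrange(2 ** n)

    def f_tilde(x):
        # Flip the output only at the secret backdoor location.
        return 1 - f(x) if x == x_star else f(x)

    return f_tilde, x_star
```

Since x* is uniform over 2^n inputs, any fixed set of m test queries detects the disagreement with probability m/2^n, which is negligible for polynomially many queries.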
Confirming that efficiently finding a small circuit (you don't actually need restrictions beyond size) consistent with its values on a fixed collection of test inputs is known to imply --- see this paper.
This is a fact worth knowing and a lucid explanation --- thanks for writing this!
I know it's not the main point of the post, but I found myself a little lost when you talked about complexity theory; I would be interested to hear more details. In particular:
- When you say
what are the definitions of these classes? Are these decision problems in the usual sense, or is this a statement about learning theory? I haven't been able to either track down a definition of e.g. "NP-invertible" or invent one that would make sense --- if this is a renamed version of something people have studied, I would be curious to know the original name.
- You claim "the assumption P ≠ BPP implies that there is no way, in polynomial time, to get reliable information about A’s secret circuit C (beyond some statistical regularities)" --- I suspect this is not quite what you mean. I'm not sure exactly what formal statement you're making here, but it would be pretty unusual for P ≠ BPP to be a relevant assumption (in fact, most people in TCS strongly believe P = BPP). You call this "the probabilistic version [of P ≠ NP]", so I'm guessing this was a typo and you mean NP ⊄ BPP? But I would also be impressed if you could get something like this statement from only NP ⊄ BPP. My best guess is that you do need those cryptographic assumptions you mentioned for this part: if you want to say "P/poly is not PAC learnable" --- in other words, that no polynomial-time algorithm can, given random input-output samples from an arbitrary poly-size circuit, find a circuit computing a nearby function except with tiny success probability --- I'm pretty sure this is only known conditional on the existence of one-way functions.
Again though, I know these are quibbles and the main point of this section is "you don't need any unproven complexity assumptions to get lower bounds against statistical query learning".
(Separately, a minor nit with the presentation: I found the decision to use "learning algorithm" to refer specifically to "weight-based ML architecture that learns via SGD" to be slightly confusing. The takeaway being "there exists an algorithm that learns parities, but there doesn't exist a learning algorithm that learns parities" is a little slippery --- I think it would be worth replacing "learning algorithm" with e.g. "gradient-based learning algorithm" in the writeup.)
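For what it's worth, the "there exists an algorithm that learns parities" direction is easy to make concrete: given exact labeled samples, Gauss-Jordan elimination over GF(2) recovers the hidden parity. A minimal sketch (`learn_parity` is my name, not the post's), assuming enough samples to pin down the parity uniquely:

```python
def learn_parity(samples, n):
    """Recover the secret subset S of [n] defining a parity function,
    from exact (x, y) samples with y = sum_{i in S} x_i mod 2, via
    Gauss-Jordan elimination over GF(2)."""
    pivots = {}  # pivot column -> row of length n+1 (last entry = label)
    for x, y in samples:
        r = list(x) + [y]
        # Reduce the new row against all existing pivot rows.
        for col, prow in pivots.items():
            if r[col]:
                r = [a ^ b for a, b in zip(r, prow)]
        lead = next((j for j in range(n) if r[j]), None)
        if lead is None:
            continue  # row is linearly dependent on earlier samples
        # Eliminate the new pivot column from all existing pivot rows.
        for col, prow in pivots.items():
            if prow[lead]:
                pivots[col] = [a ^ b for a, b in zip(prow, r)]
        pivots[lead] = r
    # Undetermined coordinates (no pivot) default to 0.
    return [pivots[j][n] if j in pivots else 0 for j in range(n)]
```

This runs in polynomial time, in contrast with the statistical-query / gradient-based lower bounds the post discusses, which is exactly the gap the terminology should keep clear.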