Posts

Why I find Davidad's plan interesting 2024-05-20T08:13:15.950Z

Comments

Comment by Paul W on Emergence, The Blind Spot of GenAI Interpretability? · 2024-08-17T17:51:20.483Z · LW · GW

Are you saying that holistic/higher-level approaches can be useful because they are very likely to be more computationally efficient, to actually fit inside human brains, or to not require as much data?
Is that the main point, or did I miss something?

Comment by Paul W on Natural Latents: The Concepts · 2024-08-02T19:03:09.250Z · LW · GW

Hello!
These ideas seem interesting, but something bothers me: in the coin flip example, how is 3 fundamentally different from 1000? The way I see it, the only mathematical difference is that your "bounds" (whatever that means) are simply much worse in the case with 3 coins. Of course, I think I understand why humans/agents would want to say "the case with 3 flips is different from that with 1000", but the mathematics seems similar to me.
Am I missing something?

Comment by Paul W on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2024-05-18T16:04:12.025Z · LW · GW

"Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?"

As far as I know (I'm not an expert), such absolute guarantees are too hard right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure I understood what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more?
 

"Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?" 

I think that's closer to the idea: during training, you reject and penalize the AI as soon as it tries something that might be "exploiting a zero-day", in the sense that the world-model can't rule out this possibility with high confidence[1]. That way, the training process is expected to reward simpler, more easily verified actions.
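To make the "reject and penalize" idea concrete, here is a minimal sketch of what such a training signal could look like. Everything in it is an assumption of mine for illustration: the names `Verdict` and `shaped_reward`, the threshold `epsilon`, the flat `penalty`, and the idea that the world-model hands you an upper bound on the probability of a bad outcome. It is not taken from Davidad's plan itself.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    accepted: bool   # whether the action is allowed through
    reward: float    # training signal given for this action


def shaped_reward(task_reward: float,
                  risk_upper_bound: float,
                  epsilon: float = 1e-3,
                  penalty: float = 1.0) -> Verdict:
    """Hypothetical gate: accept an action only if it can be verified.

    risk_upper_bound is assumed to be a conservative upper bound, computed
    by the world-model, on the probability that the action leads to an
    unsafe outcome. If the model cannot push that bound below epsilon,
    the action is rejected and penalized, whatever its task reward.
    """
    if risk_upper_bound > epsilon:
        # The world-model cannot rule out a problem: reject and penalize.
        return Verdict(accepted=False, reward=-penalty)
    # Verified with high confidence: pass the task reward through.
    return Verdict(accepted=True, reward=task_reward)
```

With this kind of gate, an action with a high task reward but a loose bound (say `shaped_reward(5.0, 0.2)`) is penalized, while a modest but easily verified one (say `shaped_reward(1.0, 0.0001)`) keeps its reward, which is the pressure toward simpler, more easily verified actions.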


Then, a key question is "what else do you want from your AI?": of course, it is supposed to perform critical tasks, not just "let you see what program is running"[2], so there is tension between the various specifications you enter. The question of how far you can actually go, how much you can actually ask for, is both crucial and wide open, as far as I can tell.

  1. ^

    Some of the uncertainty lies in how accurate and how conservative the world-model is; you won't get a "100% guarantee" anyway, especially since you're only aiming for probabilistic bounds within the model.

  2. ^

    Otherwise, a sponge would do.

Comment by Paul W on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2024-05-16T08:25:03.873Z · LW · GW

I believe that the current trends in formal verification, say, of traditional programs or small neural networks, are more about conservative overapproximations (an approach called abstract interpretation). You might want to have a look at this: https://caterinaurban.github.io/pdf/survey.pdf
To be more precise, it appears that so-called "incomplete formal methods" (3.1.1.2 in the survey I linked) are more computationally efficient, even though they can raise false alarms, i.e. fail to verify a property that actually holds.
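Here is a toy illustration of that trade-off, using interval analysis (one of the simplest abstract domains) on a tiny ReLU network. The network, the function names and the property being checked are all made up for the example; this is a sketch of the general idea, not code from the survey.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def affine_bounds(weights: List[List[float]], bias: List[float],
                  box: List[Interval]) -> List[Interval]:
    """Sound interval bounds for W x + b when each x[i] lies in box[i]."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, (x_lo, x_hi) in zip(row, box):
            # A positive weight keeps the order of the bounds,
            # a negative weight swaps them.
            if w >= 0:
                lo += w * x_lo
                hi += w * x_hi
            else:
                lo += w * x_hi
                hi += w * x_lo
        out.append((lo, hi))
    return out


def relu_bounds(box: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in box]


# Toy network: h1 = relu(x), h2 = relu(x), y = h1 - h2 (so y is always 0).
W1, B1 = [[1.0], [1.0]], [0.0, 0.0]
W2, B2 = [[1.0, -1.0]], [0.0]


def concrete_output(x: float) -> float:
    """Run the network on one concrete input."""
    h = [max(0.0, w[0] * x + b) for w, b in zip(W1, B1)]
    return sum(w * v for w, v in zip(W2[0], h)) + B2[0]


def abstract_output(x_box: Interval) -> Interval:
    """Overapproximate the set of outputs for every x in x_box at once."""
    hidden = relu_bounds(affine_bounds(W1, B1, [x_box]))
    return affine_bounds(W2, B2, hidden)[0]


def proves_upper_bound(x_box: Interval, threshold: float) -> bool:
    """True only if the analysis can prove y <= threshold on x_box."""
    return abstract_output(x_box)[1] <= threshold
```

The analysis costs a single pass through the network, but it forgets that h1 and h2 are the same quantity: for x in [-1, 1] it can only conclude that y lies in [-1, 1], so it fails to prove "y <= 0.5" even though y is always exactly 0. It never wrongly claims that a property holds, it just sometimes cannot conclude.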
Does that answer your question?