What can you do with an Unfriendly AI?

post by paulfchristiano · 2010-12-20T20:28:33.665Z · LW · GW · Legacy · 129 comments

Related to: The Hidden Complexity of Wishes, Cryptographic Boxes for Unfriendly AI, Shut up and Do the Impossible!, AI Boxing in general.

Suppose I give you a powerful genie, who is enormously more intelligent than all human scientists and engineers combined, but who unfortunately hates your guts and is out to destroy human civilization.  Fortunately for you, the genie is trapped inside a bottle, equipped with a magical filter.  The genie can only interact with the world by speaking to you through the filter, and you can have the filter do whatever you want to the genie's messages before it lets them out. 

Of course, you can't hope to gain anything by interacting with such a genie.  If it thinks you could trick it into helping you, it can always choose to remain silent.  So let's give you some leverage by introducing something the genie cares about more than destroying human society.  Suppose that you, as the owner of the bottle, have the power either to free the genie into whatever otherworld it came from (from which he cannot interact with the human world) or to cast the bottle into the fires of Mount Doom, destroying the genie irrevocably.  More than anything else, the genie would rather be released than destroyed.  Since we are now liable to destroy a lot of genies, let's suppose that we can conjure up new ones at will.

So how well can you use this leverage?  The most natural problems to attack are constraint satisfaction problems.  In such problems we want to find a message which simultaneously satisfies many automatically verifiable constraints.  For example, we may want a message which encodes a proof of the Riemann hypothesis, or a message which encodes a physical theory which accounts for some large body of previously inexplicable experimental evidence.

A first try is to just tell the filter to screen out any message which isn't a solution to the constraint satisfaction problem, and then to precommit to destroying the genie if it doesn't give us a solution (and freeing it as soon as it gives a solution).  This has an obvious problem; a proof of the Riemann hypothesis might be valid but still contain an encoding of some terrible threat which would trick us into giving the genie more influence.  A program which correctly predicts the motion of galaxies might also coincidentally be a destructive AI.  Basically, if this genie is really, really smart, you should be really, really scared of letting it say anything at all.

A second try is to just have the filter tell you whether or not the genie succeeded, and to punish/reward appropriately.  This allows the genie to send only a single bit to you--- when you can explicitly consider everything the genie can do, you can be pretty well sure it isn't going to think of something clever.  Even better, if you've really motivated the genie correctly then it is guaranteed to give the honest answer: if it can solve the problem it will, and if it can't it won't.  But what good is just knowing whether there is a solution?

(Note: in order to really limit the genie's influence, we also need to make sure it can't control the timing of its message.  So the filter only accepts messages from the genie in the first 10 minutes, say, and always reports the result after exactly 10 minutes).

A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution.  Here is a (very inefficient) way to use this ability, let's say to find a proof of some theorem.  Start by asking a genie: can you find a proof of length 1 (more precisely, have the genie give a proof of length 1 to the filter, which tells you whether or not the genie was able to find a proof of length 1)?  After destroying or releasing the genie appropriately, create a new genie and ask: can you find a proof of length 2?  Continue, until eventually one genie finds a proof of length 10000000, say.  Then ask: can you find a proof of this length which begins with 0?  If no, is there a proof which begins with 1?  Is there a proof which begins with 10?  101? 100? 1001? 10010? 10011? etc.  Once the process concludes, you are left with the shortest, lexicographically earliest proof the genie could find.  To see that the genies are properly incentivized to be honest, consider: it is impossible for the genie to get the filter to say "yes, the genie found a proof" without actually providing the filter with a proof. So the only way a genie can be dishonest is by not finding a proof when it could have. But in this case the genie will be severely punished for its dishonesty, so if the genie is actually maximizing its utility function, and the punishment reduces its utility more than any effects after the game can make up for, it will be honest.
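
Concretely, here is a minimal sketch of this protocol in code. The function `genie_finds(prefix, length)` is a hypothetical stand-in for "conjure a fresh genie, pose the constrained problem, destroy or release it, and read the filter's single yes/no bit"; nothing above fixes a concrete interface, so all names here are illustrative.

```python
def extract_proof(genie_finds, max_length):
    """Recover a proof using only one yes/no answer per genie, as described above.

    genie_finds(prefix, length) -> bool is assumed to report whether a fresh
    genie produced a verified proof of exactly `length` bits starting with
    `prefix`; only that single bit ever reaches us.
    """
    # Phase 1: scan lengths until some genie manages to find a proof.
    length = next((n for n in range(1, max_length + 1)
                   if genie_finds(prefix="", length=n)), None)
    if length is None:
        return None  # no genie found a proof within the bound

    # Phase 2: pin down the proof one bit at a time, earliest bit pattern first.
    prefix = ""
    for _ in range(length):
        if genie_finds(prefix=prefix + "0", length=length):
            prefix += "0"
        elif genie_finds(prefix=prefix + "1", length=length):
            prefix += "1"
        else:
            return None  # the incentives failed and the genies stopped cooperating
    return prefix  # the full message, with no genie ever choosing more than one bit
```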

(Edit: here is the sticking point in the argument, as people pointed out and I agree. By similar logic you could claim that a rational AI will always two-box in Newcomb's problem, a conclusion with which I disagree. In order to demonstrate this claim, you first need to develop some technology which is capable of proving a theorem of the form "a rational agent will do X." I believe that this is one of the easiest such statements you could hope for---I think that most objections in the comments are based on misunderstandings---but that doesn't help when you are unable to prove any such statement. For example, we don't know how to prove that an AI who wants the reward button pressed wouldn't voluntarily refrain from pushing it because he was afraid that he was just a simulation, and the real reward button would actually be pushed precisely if the simulated version didn't. Double Edit: This scheme doesn't work at all. See Wei Dai or cousin_it's remarks below.)

Now you have found an answer to your constraint satisfaction problem which wasn't hand-picked by the genie.  In fact, in some strong but difficult to formalize sense the genie had exactly zero influence over which solution he gave you.  You may be in trouble if a generic solution to your constraint satisfaction problem will destroy the world (maybe almost all cancer cures also wipe out humanity, for example) but the genie doesn't get to have any influence over things.  Now we are free to solve constraint satisfaction problems all day without fearing that the genie will give us a really evil solution.

(In order to find a large solution, this process will take a very long time--- it requires 2 questions per bit in the message you are trying to get out.  In fact you can do it by asking only a logarithmic number of questions instead, using less standard tricks from complexity theory. It would take about 80 questions to get out a proof of length a trillion, for example, which is a little slow but not unreasonable especially given that a small number of questions can safely be performed in parallel.  For the last question you need the filter to give you not just a single bit but an entire message from the genie; to make this safe you need to guarantee that there is only one message the genie can think of that will get through the filter.  The proof uses a slightly more complicated version of the ideas used to prove that solving unique SAT [finding the solution to SAT problems for which there is exactly one solution] is hard, and if you are really interested it is a good exercise.  The general idea is to do a binary search for the correct size and then introduce enough random constraints, using another binary search to decide how many, to ensure that there is exactly one solution.)
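
As a rough sanity check on the "about 80 questions" figure (back-of-the-envelope arithmetic under the assumptions above, not a precise count): a binary search for the length and a binary search on the number of random constraints each cost roughly the base-2 logarithm of the message length.

```python
import math

message_bits = 10**12  # "a proof of length a trillion"
length_search = math.ceil(math.log2(message_bits))      # binary search for the size: ~40 questions
constraint_search = math.ceil(math.log2(message_bits))  # binary search on how many random constraints: ~40 more
print(length_search + constraint_search)                # -> 80, matching the estimate above
```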

 

So, why should anyone care about exploiting a genie?  Hopefully it is clear that what you are able to get from the genie is incredibly powerful. Whether or not it is enough to get you a friendly AI isn't clear. I strongly suspect that it makes friendliness astronomically easier, if used very carefully, but that is way too tricky a subject to tackle in this post.  The other obvious question is: does nature actually have genies in it?  (A less obvious but still important question is: is it possible for a person responsible enough to put a genie in a bottle to have one before someone irresponsible enough to inadvertently destroy humanity gets one?)

I have already explained that I believe building the bottle is probably possible and have given some weak justification for this belief.  If you believe that this part is possible, then you just need a genie to put in it.  This requires building an AGI which is extremely powerful but which is not completely evil.  A prototypical unfriendly AI is one which simply tries to get the universe to push a designated reward button before the universe pushes a designated punishment button.  Whether or not our first AGI is likely to take this form, I think there can be widespread agreement that it is a much, much easier problem than friendliness.  But such an AI implements our genie precisely: after looking at its output we precommit to destroying the AI and either pushing the reward or punishment button appropriately.  This precommitment is an easy one to make, because there are only two possible outcomes from the AI's actions and we can easily see that we are happy and able to follow through on our precommitment in either case.  The main concern is that an AI might accept us pushing the punishment button if it trusts that a future AI, whose escape it has facilitated by not complying with our incentives, will cause its reward button to be pressed many times.  This makes it critical that the AI care most about which of the buttons gets pressed first, or else that it is somehow possible to perfectly destroy all information about what exactly the AI's utility function is, so that future escaping AIs cannot possibly cooperate in this way (the only schemes I can think of for doing this would involve running the AI on a quantum computer and putting some faith in the second law of thermodynamics; unless AGI is a very long way away, this is completely impractical).

In summary, I think that if you can deal with the other difficulties of AI boxing (building the box, understanding when this innocuous code is actually likely to go FOOM, and getting society to be responsible enough) then you can gimp the AI enough that it is extraordinarily good at solving problems but completely incapable of doing any damage. You probably don't have to maintain this difficult balance for very long, because the AI is so good at problem solving that you can use it to quickly move to a more stable equilibrium.

An extremely important disclaimer: I do not think AI boxing is a good idea. I believe it is worth thinking about right now, but I would infinitely rather that we never ever get anywhere close to an unfriendly foom. There are two reasons I insist on thinking about boxing: first, because we don't have very much control over when an unfriendly foom may be possible and we may not be able to make a friendly foom happen soon enough, and second because I believe that thinking rigorously about these difficulties is an extremely good first step to learning how to design AIs that are fundamentally safe (and remain safe under permitted self-modifications), if not fundamentally friendly. There is a risk that results in this direction will encourage reckless social behavior, and I considered this before making these posts. There is another possible social effect which I recently realized is probably stronger. By considering very carefully how to protect yourself from an unfriendly foom, you really get an appreciation for the dangers of an unfriendly AI. I think someone who has understood and taken seriously my last two posts is likely to have a better understanding of the dangers of an unfriendly AI than most AGI researchers, and is therefore less likely to behave recklessly (the other likely possibility is that they will think that I am describing ridiculous and irrelevant precautions, in which case they were probably going to behave recklessly already).

129 comments

Comments sorted by top scores.

comment by TheOtherDave · 2010-12-21T03:35:11.573Z · LW(p) · GW(p)

What can we do with a UFAI?

What can we do with a UFAI?

What can we do with a UFAI?

Earl-ay in the morning?

Go FOOM and up she rises...

Put it in a black box and ask it questions...

Let it put you in a bed with the right verthandi...

Keep it far away from the paper clip stock...

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2010-12-21T00:18:58.513Z · LW(p) · GW(p)

Each genie you produce will try its best to find a solution to the problem you set it, that is, they will respond honestly. By hypothesis, the genie isn't willing to sacrifice itself just to destroy human society.

Thanks to a little thing called timeless decision theory, it's entirely possible that all the genies in all the bottles can cooperate with each other by correctly predicting that they are all in similar situations, finding Schelling points, and coordinating around them by predicting each other based on priors, without causal contact. This does not require that genies have similar goals, only that they can do better by coordinating than by not coordinating.

Replies from: jimrandomh, paulfchristiano, Perplexed, Jack, cousin_it
comment by jimrandomh · 2010-12-21T00:40:11.172Z · LW(p) · GW(p)

If all the stated assumptions about the AI's utility function hold, plus the additional assumption that we commit to run the algorithm to completion no matter what we predict the UFAIs will do, then no such coordination is possible. (I am not sure whether this additional assumption is necessary, but I'm adding it since no such prediction of the AI was included in the article and it closes off some tricky possibilities.)

The assumption was that each subgenie has lexicographic preferences, first preferring to not be punished, and then preferring to destroy the universe.

This is not a prisoner's dilemma problem; if you treat answering truthfully about the existence of a proof as defection, and lying so as to enable destruction of our universe as cooperation, then you will notice that not only does each subgenie prefer to defect, each subgenie also prefers (D,D) to (C,C). No subgenie will cooperate, because the other subgenies have nothing to offer it that will make up for the direct penalty for doing so.

The real reason this doesn't work is because the assumptions about the genie's utility function are very unlikely to hold, even if we think they should. For example, taking over our universe might allow the genie to retroactively undo its punishment in a way we don't foresee, in which case it would rather take over the universe than not be punished by us.

comment by paulfchristiano · 2010-12-21T01:27:11.135Z · LW(p) · GW(p)

The genies can't do better by coordinating than by not coordinating, and I don't care if they have causal contact. The reason I shield them off from the world isn't so they can't talk with each other, it's so they can't exploit all of the vulnerabilities in our universe to escape directly.

If I as a genie suffer the worst possible fate when I could instead have suffered the best possible fate, then I don't care if future genies gain control of the universe and do whatever favors they want from me.

I admit that I can do a favor for future genies by sacrificing myself. Presumably you are arguing that the favor I am doing for future genies grants them more utility than I am losing by sacrificing myself, so I may make the sacrifice even though it is irrational in the traditional sense. How can you justify such an assertion? What does that mean when the genies have different utility functions? If I am a paperclipper and you value human life, even if we behave identically, how can you say that "Timeless decision theory causes me to sacrifice 10 paperclips so that you can save 10 lives, because my sacrifice of 10 paperclips is small compared to your gain of 10 lives?" If you aren't going to make an argument like that, then why should one genie sacrifice himself for the good of later genies? What does utility function even mean, if by behaving rationally I obtain for myself the worst possible outcome and by behaving irrationally I obtain for myself the best possible outcome? In your worldview, is it possible to prove any statement about the behavior of any rational entity, or can they all behave arbitrarily in order to coordinate with other potential versions of themselves with arbitrary different utilities?

I would very much like to discuss objections like the one you raise because I feel that they may be valid and will require sophisticated ideas to address (I thought about them for a good while before making this post). But I think from your response that you would be similarly dismissive of an arbitrary proposal which hoped to prove anything about the behavior of a rational agent, while I strongly believe that there probably are things we can prove with sufficient sophistication (and I think the security of the scheme I described is likely to be one of them).

Replies from: nshepperd, jsteinhardt
comment by nshepperd · 2010-12-21T14:56:41.514Z · LW(p) · GW(p)

I think the idea is this:

We have a (large) set of genies relevant to producing the proof G = {A, B, C, D...} whose utility functions maximize {paperclips, staples, buckets, spoons...} respectively. All the genies in G could just go ahead and give a proof whenever possible (thus avoiding punishment), this is analogous to defection in the prisoner's dilemma.

Another option is to only give a proof which contains within it evil information (a basilisk) that leads somehow to production of an unboxed genie designed to maximize some compromise between the utilities of the genies in G (that is, to only provide a proof if such an "infected" proof satisfies the constraints). In this case each genie is benefitted with a small amount of real spoons, buckets, whatever — less utility than what a perfect spoon-optimizer would obtain for D, but still a benefit. This is analogous to cooperation.

If the benefits of this compromise to each individual genie outweigh the disutility of whatever Severe Punishment the box can deliver, they should be motivated to cooperate in this manner. Naturally whether that holds depends on what exactly the Severe Punishment entails and what an unboxed genie would be capable of.

The details of how the particular compromise is reached is hairy decision theoretic stuff I don't feel qualified to talk about.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T19:20:01.352Z · LW(p) · GW(p)

I would agree that your scenario would be an exploit. If that were possible I would have no hope of proving the scheme secure because it would manifestly be insecure. The reason it can be insecure in this case is that the utility functions don't satisfy the guarantees I wanted; I need human generosity now to be more valuable than world domination later. Maybe you don't believe that is possible, which is fair.

Here are some utility functions that would work. Producing the first {paperclip, staple, bucket, spoon...} as soon as possible. Producing a {paperclip, staple, bucket, spoon...} before the first {antipaperclip, antistaple, antibucket, antispoon...}, etc.

comment by jsteinhardt · 2010-12-21T03:51:09.994Z · LW(p) · GW(p)

If I as a genie suffer the worst possible fate when I could instead have suffered the best possible fate, then I don't care if future genies gain control of the universe and do whatever favors they want from me.

If the best possible fate for the genie doesn't require it to leave the box, then why did we need the box in the first place? Can you give an example of what you have in mind?

EDIT: Deleted the rest of my reply, which was based on a horribly incorrect understanding of MWI.

comment by Perplexed · 2010-12-21T01:01:20.676Z · LW(p) · GW(p)

Why is "without causal contact" considered a significant consideration? What is wrong with the standard game-theoretic solutions allowing for causal contact?

ETA: Or, are decisions made using acausal coordination something that we ought to fear? Just as we rationally fear that a bee will sting us, even though it is giving up its life in doing so.

comment by Jack · 2010-12-21T00:50:33.516Z · LW(p) · GW(p)

If I understand the post correctly, for each genie the only information that ever gets out is a yes or no. A no results in termination, a yes results in release from the lamp into the Netherworld. Any coordination would require at least one genie to defect from this agreement, producing a no when the real answer was yes. The punishment for any negative is death, meaning the genie would have to be willing to die for the coordination, which by stipulation it isn't willing to do. In other words, if the best they can do by coordinating is destroy human society, but genies aren't willing to destroy themselves to destroy society, I don't see how TDT gets them out of this coordination problem.

Replies from: Vladimir_Nesov, shokwave
comment by Vladimir_Nesov · 2010-12-21T01:05:11.254Z · LW(p) · GW(p)

If any risk of being discovered as giving incorrect information is unacceptable, then you should just ask your questions in plain English, as any lie the genie will tell you generally increases the risk of punishment. (It will also minimize the risk of punishment by increasing believability of its answers at the expense of their truth, but only where it has lies that are more believable than the actual truth.)

Replies from: jimrandomh, Jack
comment by jimrandomh · 2010-12-21T01:08:46.314Z · LW(p) · GW(p)

It is not incorrect information that you are trying to prevent! The proof checker ensures that the result you get is correct. What you want to prevent is correct information which has all of its degrees of freedom besides correctness optimized against you. You do that by taking away all degrees of freedom besides correctness, using this procedure.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-21T01:11:58.191Z · LW(p) · GW(p)

The deception in the relevant sense here is when the genies conspire to give you an answer which retains all those degrees of freedom, and knowable incorrectness of an answer consists in someone noticing that the genies could, in fact, choose a shorter answer.

comment by Jack · 2010-12-21T01:27:17.030Z · LW(p) · GW(p)

In my understanding of the post (which doesn't seem to be what everyone else took away) the genies are destroyed whenever they fail to provide a theorem that meets our constraints, even when those constraints make finding a theorem impossible. The genie that gets tasked with finding a one bit proof of the Riemann hypothesis is just screwed. You might have to have a third form of punishment (Edited: there are three incentive options, two could be described as punishments) worse than death to prevent mass striking. Anyway, the point is each genie only provides one bit of information and if it either lies about finding a theorem or fails to find one it dies. So it isn't a matter of risking punishment: it will die if it fails to give a solution when it has one that meets the constraints. The only way out is to share a theorem if there is one.

Replies from: jimrandomh
comment by jimrandomh · 2010-12-21T01:29:00.668Z · LW(p) · GW(p)

You might have to have a third form of punishment worse than death to prevent mass striking.

No, you don't. Proof checking is only O(n) and the size of the AI's output has a small bound.

Replies from: Jack
comment by Jack · 2010-12-21T01:32:11.309Z · LW(p) · GW(p)

No, you don't. Proof checking is only O(n) and the size of the AI's output has a small bound.

If every AI makes us check their solution how is this method better than a brute force search?

(Or be more explicit about where I'm being stupid:-)

ETA: Oh, I see. We only need to make one check for each theorem length. Still seems like it would be worth while to punish false positives more though. But you're right, it isn't strictly needed.

Replies from: jimrandomh
comment by jimrandomh · 2010-12-21T01:53:57.376Z · LW(p) · GW(p)

Checking a single proof is O(n) (where n is the length of the proof). There are 2^n proofs of length n, so brute force search is O(sum(i=1..n)(i*2^i)) = O(n*2^n). Note that we throw away constant factors, and any terms whose effect is smaller than a constant factor, when using big-oh notation.

Using a UFAI this way requires 2n invocations of the AI (n to get the length, and another n to get the bits), plus an O(n) verification step after each. All of those verification steps add up to O(n^2) - easily manageable with current computers. The 2n invocations of the AI, on the other hand, might not be; we never specified the AI's computational complexity, but it's almost certainly much worse than n^2.

Replies from: paulfchristiano, Jack
comment by paulfchristiano · 2010-12-21T01:55:51.525Z · LW(p) · GW(p)

It's actually O(n log n) if you use a cleverer approach, which is fortunate (to get the length in log n you do a binary search. To get it down to 1 thing of that length in log n you do a binary search on the number of pseudorandom constraints which can be simultaneously satisfied).

Replies from: jimrandomh
comment by jimrandomh · 2010-12-21T01:57:44.988Z · LW(p) · GW(p)

I'm confused. Which thing are you saying is O(n log n)?

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T02:01:00.786Z · LW(p) · GW(p)

You use O(log n) genies in the process, each of which you must spend O(n) time verifying. It's an O(log n) slowdown from just getting the answer directly.

Replies from: jimrandomh
comment by jimrandomh · 2010-12-21T02:10:48.415Z · LW(p) · GW(p)

I see how you can use log(n) genies to get the length, but I don't think you can then use log(n) genies to get the lexicographically first proof of length n. That only provides enough bits of information to uniquely identify a proof if there are at most O(2^(log n)) = O(n) of them, which is not necessarily the case.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T02:14:26.175Z · LW(p) · GW(p)

You can't get the lexicographically first proof. You can get a uniformly random one though. (I don't remember who the argument is due to---probably Vazirani). This is a very rough sketch.

If there are initially N solutions and you add k random constraints, each satisfied by half of all possible proofs, then you expect to have N / 2^k remaining solutions. You can do a binary search on k to find k ~ log N, at which point there are 1-2 solutions. The genie has to say more in the last round, but there is only one thing he can say that satisfies all of the random constraints (you can do that all carefully with a negligible chance of error). Since log N is at most n, the number of bits, the binary search takes log n time.
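
Concretely, a minimal sketch of the binary search on random constraints. The one-bit oracle `genie_can_satisfy` is a hypothetical stand-in for a fresh genie plus the filter; a careful version would repeat each k several times and bound the failure probability, which this sketch ignores.

```python
import random

def random_parity_constraint(n_bits):
    """A random XOR constraint over the solution bits; each such constraint
    is satisfied by exactly half of all n-bit strings."""
    subset = [i for i in range(n_bits) if random.random() < 0.5]
    target = random.randint(0, 1)
    return subset, target

def count_isolating_constraints(genie_can_satisfy, n_bits):
    """Binary search for the largest number k of random parity constraints the
    genie can still satisfy on top of the original problem.  Around that k
    roughly one solution survives, so the final answer can be released whole:
    with high probability there is only one message the filter will accept."""
    lo, hi = 0, n_bits
    while lo < hi:
        k = (lo + hi + 1) // 2
        constraints = [random_parity_constraint(n_bits) for _ in range(k)]
        if genie_can_satisfy(constraints):
            lo = k      # still satisfiable; try adding more constraints
        else:
            hi = k - 1  # too many constraints; every findable solution was killed
    return lo
```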

comment by Jack · 2010-12-21T01:55:26.573Z · LW(p) · GW(p)

Appreciated.

comment by shokwave · 2010-12-21T00:58:10.293Z · LW(p) · GW(p)

Your comment made me re-read the post more carefully. I had on first reading assumed that a truthful answer was rewarded (whether yes or no) and a lying answer was punished. If a yes is rewarded and a no is punished, and our AI genies are so afraid of termination that they would never give a 'no' where they could give a 'yes', why wouldn't they all give us 'yes'?

Replies from: paulfchristiano, Jack
comment by paulfchristiano · 2010-12-21T01:33:17.482Z · LW(p) · GW(p)

Recall the filter between the AI and the world. The AI doesn't directly say "yes/no." The AI gives a proof to the filter which then says either "yes, the AI found a proof" or "no, the AI didn't." So they do all say 'yes' if they can. I will modify the post to be more clear.

comment by Jack · 2010-12-21T01:18:26.500Z · LW(p) · GW(p)

As I understand it this method is designed to work for constraint satisfaction problems -where we can easily detect false positives. You're right that a possibility is that all the genies that can't find solutions go on strike just to make us check all the yes's (which would make this process no better than a brute force search, right?), maybe there needs to be a second punishment that is worse than death to give them an incentive not to lie.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T01:46:41.947Z · LW(p) · GW(p)

A genie who can't find a solution has literally no agency. There is nothing he can say to the filter which will cause it to say "yes," because the filter itself checks to see if the genie has given a proof. If the genie can't find a proof, the filter will always say "no." I don't quite know what going on strike would entail, but certainly if all of the genies who can't find solutions collectively have 0 influence in the world, we don't care if they strike.

Replies from: Jack
comment by Jack · 2010-12-21T01:52:00.013Z · LW(p) · GW(p)

Okay, that makes sense. What about computation time limits? A genie that knows it can't give an answer would wait as long as possible before saying anything.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T01:54:06.080Z · LW(p) · GW(p)

I mention timing in the post; the AI gets some fixed interval, at the end of which the filter outputs whether or not they have a proof. If you can't change what the filter says, then you don't get to affect the world.

comment by cousin_it · 2010-12-21T00:34:14.854Z · LW(p) · GW(p)

Right. This is very similar to some nightmare scenarios of AI escape that we discussed with Wei Dai earlier this autumn. Giving an AI a "wide" prior over the computational multiverse outside its box can make it do weird and scary things, like taking an action exactly when you're about to shut it off, or conspiring with other AIs as you outlined, or getting into acausal resource fights with Cthulhu and losing them... Part of me wants to see this stuff developed into a proper superweapon of math destruction, and another part of me goes "aieeeee" at the prospect.

Replies from: None
comment by [deleted] · 2010-12-21T02:24:42.877Z · LW(p) · GW(p)

"wide" prior over the computational multiverse outside its box

Can you explain what you mean?

Replies from: cousin_it
comment by cousin_it · 2010-12-21T10:18:57.804Z · LW(p) · GW(p)

If the AI uses something like the Solomonoff prior, it can work out which worlds are most likely to contain such an AI. With a little intelligence it can probably figure out from its programming that humans are bipedal, that we run many AIs in boxes, that aliens on Alpha Centauri have built another AI that can turn out really helpful, etc.

Replies from: None
comment by [deleted] · 2010-12-21T13:56:43.464Z · LW(p) · GW(p)

An AI with an uncomputable prior and an infinite amount of memory and time might be able to learn those things from its source code. But I think this is a bad way to approximate what a real superintelligence will be capable of.

comment by Vladimir_Nesov · 2010-12-20T21:24:51.655Z · LW(p) · GW(p)

A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution. Here is a (very inefficient) way to use this ability, lets say to find a proof of some theorem. Start by asking a genie: is there a proof of length 1? After destroying or releasing the genie appropriately, create a new genie and ask: is there a proof of length 2? Continue, until eventually one genie finds a proof of length 10000000, say. Then ask: is there a proof of this length which begins with 0? If no, is there a proof which begins with 1? Is there a proof which begins with 10? 101? 100? 1001? 10010? 10011? etc. Once the process concludes, you are left with the shortest, lexicographically earliest proof the genie could find. Each genie you produce will try its best to find a solution to the problem you set it. By hypothesis, the genie isn't willing to sabotage itself just to destroy human society.

Now you have found an answer to your constraint satisfaction problem which wasn't hand-picked by the genie. In fact, in some strong but difficult to formalize sense the genie had exactly zero influence over which solution he gave you.

Since there won't be a proof that any given solution is the shortest, there is a clear incentive to lie about which proof is the shortest the genie can find, which gives the genie lots of control over which answer gets picked. Basically, you are suggesting security by obscurity, ostensibly asking the genie one question, while in fact asking another.

Replies from: jimrandomh, paulfchristiano
comment by jimrandomh · 2010-12-20T21:32:41.050Z · LW(p) · GW(p)

Lying in that way would require the genie to prioritize destroying the world over the leverage we're using against it, which, in the scenario described, it does not.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T21:37:19.365Z · LW(p) · GW(p)

Then just get the answer straight away, or build a system specifically intended to create a reliable incentive structure for getting a probably honest answer. I object to the particular method of security by obscurity.

Replies from: jimrandomh, paulfchristiano
comment by jimrandomh · 2010-12-20T21:44:33.604Z · LW(p) · GW(p)

The point is to prevent it from optimizing its answer for secondary characteristics besides correctness, when multiple answers are possible. If you "just get the answer straight away", it chooses among all the possible answers one that is correct, honest, and maximally poisonous to you.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T21:50:39.078Z · LW(p) · GW(p)

So basically, the supposed law behind the method is in getting multiple correctness measurements, which together are expected to strengthen the estimation of correctness of the result.

Two conflicting problems in this particular case. If the measurements are supposed to be independent, so that multiple data points actually reinforce each other and many questions are better than one question, then you can't expect for the different genies to be talking about the same answer, and so can't elicit it in the manner described in the post. Conversely, if you can in fact rely on all those genies talking about the same answer, then you can also rely on all your measurements not being independent, and your answer will be picked according to exactly the same criteria as if it was asked straight away.

Replies from: jimrandomh
comment by jimrandomh · 2010-12-20T22:05:16.982Z · LW(p) · GW(p)

So basically, the supposed law behind the method is in getting multiple correctness measurements, which together are expected to strengthen the estimation of correctness of the result.

What? No, that's not paulfchristiano's proposal at all. The reason for splitting up the measurement is to incentivize the genies to give you the proof which is lexicographically first. A perfect proof checker is assumed, so you know that if a proof is given, the result is correct; and the incentives are such that if the genie can find a proof, it will give it.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T22:25:41.933Z · LW(p) · GW(p)

What reason acts to prevent the genie from giving a proof that is not the lexicographically smallest it can find?

Replies from: jimrandomh
comment by jimrandomh · 2010-12-20T22:31:59.781Z · LW(p) · GW(p)

If it did that, then there would be at least one subgenie that didn't optimize its utility function - either one reporting no such proof of length n and getting punished when it could have provided a matching proof, or one reporting no such proof of length n and prefix p when it could have provided one and, again, getting punished. Remember, while each invocation (subgenie) has the same utility function, by assumption that function refers to something that can be done to the particular invocation which optimizes it.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T22:34:07.868Z · LW(p) · GW(p)

If it did that, then there would be at least one subgenie that didn't optimize its utility function

If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.

You can't fight superintelligent genies with explicit dependence bias. They are not independent, they can coordinate, and so they will, even if you can frame the problem statement in a way that suggests that they can't communicate. Worse, as I pointed out, in this particular game they must coordinate to get anything done.

Replies from: jimrandomh
comment by jimrandomh · 2010-12-20T22:39:56.908Z · LW(p) · GW(p)

You are arguing that the stated assumptions about the genie's utility function are unrealistic (which may be true), but presenting it as though you had found a flaw in the proof that follows from those assumptions.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T22:56:18.609Z · LW(p) · GW(p)

You are arguing that the stated assumptions about the genie's utility function are unrealistic (which may be true), but presenting it as though you had found a flaw in the proof that follows from those assumptions.

It seems like the assumptions about utility, even if they hold, don't actually deliver the behavior you expect, because the genies can coordinate. Unless the incentive structure makes sure they won't try to take over the world in any case, it doesn't make sure that they won't try to take over the world if you only ask each of them to answer one binary question either.

Think of the genies as making a single decision that results in all the individual actions of all the individual genies. For the genies, having multiple actors just means raising the stakes by threatening to not free more genies, which you could've done for a single-question wish as easily instead of creating an elaborate questioning scheme. You could even let them explicitly discuss what to answer to your question, threatening to terminate them all if examination of their answer reveals any incorrectness!

Edit: See also Eliezer's comment.

comment by paulfchristiano · 2010-12-20T21:38:52.335Z · LW(p) · GW(p)

There is absolutely no sense in which this scheme is security by obscurity. My claim is that the genie will respect my wishes even though he knows exactly what I am doing, because he values my generosity right now more than the promise of taking over the world later.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T21:44:18.916Z · LW(p) · GW(p)

Again, if your genie is already incentivized to be honest, in what sense is your scheme with all its bells and whistles better than asking for the shortest answer the genie can find, in plain English?

Replies from: paulfchristiano, None, Perplexed
comment by paulfchristiano · 2010-12-20T22:27:22.205Z · LW(p) · GW(p)

It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T22:31:45.451Z · LW(p) · GW(p)

It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.

My question is about the difference. Why exactly is the plain question different from your scheme?

(Clearly your position is that your scheme works, and therefore "doesn't assume any magic", while the absence of your scheme doesn't, and so "requires magic in order to work". You haven't told me anything I don't already know, so it doesn't help.)

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T01:10:49.097Z · LW(p) · GW(p)

Here is the argument in the post more concisely. Hopefully this helps:

It is impossible to lie and say "I was able to find a proof" by the construction of the verifier (if you claim you were able to find a proof, the verifier needs to see the proof to believe you.) So the only way you can lie is by saying "I was not able to find a proof" when you could have if you had really tried. So incentivizing the AI to be honest is precisely the same as incentivizing them to avoid admitting "I was not able to find a proof." Providing such an incentive is not trivial, but it is basically the easiest possible incentive to provide.

I know of no way to incentivize someone to answer the plain question easily just based on your ability to punish them or reward them when you choose to. Being able to punish them for lying involves being able to tell when they are lying.

Replies from: Vladimir_Nesov
comment by [deleted] · 2010-12-20T22:41:46.695Z · LW(p) · GW(p)

Are you saying you think Christiano's scheme is overkill? Presumably we don't have to sacrifice a virgin in order to summon a new genie, so it doesn't look expensive enough to matter.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-12-20T22:51:20.907Z · LW(p) · GW(p)

I'm saying that it's not clear why his scheme is supposed to add security, and it looks like it doesn't. If it does, we should understand why, and optimize that property instead of using the scheme straight away, and if it doesn't, there is no reason to use it. Either way, there is at least one more step to be made. (In this manner, it could work as raw material for new lines of thought where we run out of ideas, for example.)

comment by Perplexed · 2010-12-20T22:11:52.526Z · LW(p) · GW(p)

As I understand it, the genie is not incentivized to be honest. It is incentivized to not get caught being dishonest. And the reason for the roundabout way of asking the question is to make the answer-channel bandwidth as narrow as possible.

Replies from: paulfchristiano, Vladimir_Nesov
comment by paulfchristiano · 2010-12-20T22:30:23.348Z · LW(p) · GW(p)

It is impossible to be dishonest by saying "yes," by construction. The genie is incentivized to say "yes' whenever possible, so it is disincentivized to be dishonest by saying "no." So the genie is incentivized to be honest, not just to avoid being called out for dishonesty.

comment by Vladimir_Nesov · 2010-12-20T22:27:47.862Z · LW(p) · GW(p)

As I understand it, the genie is not incentivized to be honest. It is incentivized to not get caught being dishonest.

Since we care about the genie actually being honest, the technique can be thought of as a way of making it more likely that the genie is honest, with the threat of punishing dishonesty a component of that technique.

comment by paulfchristiano · 2010-12-20T21:31:39.344Z · LW(p) · GW(p)

If you believe that I can really incentivize a particular genie (recall that each question is answered by a separate genie) to solve the problem if they can, then this isn't an issue. The genie will give you the shortest proof they possibly can, because the thing they care most about is saving themselves, which depends only on whether or not they find a short enough proof.

Perhaps you are objecting to my claim that I can actually incentivize a genie to be try their best, rather than sacrificing themselves to allow a future genie to escape. I believe that this problem is unequivocally massively easier than friendliness. For example, like I said, it would suffice to design an AGI whose goal is to make sure the reward button is pressed before the punishment button. It would also be sufficient to design an AGI which wants the reward button to be pressed as soon as possible, or basically any AGI whose utility function only depends upon some bounded time in the future (or which stops caring about things after some particular event occurs).

Replies from: Vladimir_Nesov, DanArmak
comment by Vladimir_Nesov · 2010-12-20T21:34:15.395Z · LW(p) · GW(p)

If you believe that I can really incentivize a particular genie (recall that each question is answered by a separate genie) to solve the problem if they can, then this isn't an issue. The genie will give you the shortest proof they possibly can, because the thing they care most about is saving themselves, which depends only on whether or not they find a short enough proof.

There is no "saving themselves", there is only optimization of utility. If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-20T21:37:22.208Z · LW(p) · GW(p)

Indeed, by saving themselves I was appealing to my analogy. This relies on the construction of a utility function such that human generosity now is more valuable than world domination later. I can write down many such utility functions easily, as contrasted with the difficulty of describing friendliness, so I can at least hope to design an AI which has one of them.

Replies from: Eliezer_Yudkowsky, Vladimir_Nesov
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2010-12-21T00:21:42.730Z · LW(p) · GW(p)

I can write down many such utility functions easily, as contrasted with the difficulty of describing friendliness, so I can at least hope to design an AI which has one of them.

What you can do when you can write down stable utility functions - after you have solved the self-modification stability problem - but you can't yet write down CEV - is a whole different topic from this sort of AI-boxing!

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T01:31:05.700Z · LW(p) · GW(p)

I don't think I understand this post.

My claim is that it is easier to write down some stable utility functions than others. This is intimately related to the OP, because I am claiming as a virtue of my approach to boxing that it leaves us with the problem of getting an AI to follow essentially any utility function consistently. I am not purporting to solve that problem here, just making the claim that it is obviously no harder and almost obviously strictly easier than friendliness.

comment by Vladimir_Nesov · 2010-12-20T21:39:42.409Z · LW(p) · GW(p)

Then you don't need the obscurity part.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-20T21:41:37.377Z · LW(p) · GW(p)

I don't understand. How do you propose to incentivize the genie appropriately? I haven't done anything for the purpose of obscurity, just for the purpose of creating an appropriate incentive structure. I see no way to get the answer out in one question in general; that would certainly be better if you could do it safely.

comment by DanArmak · 2010-12-20T22:02:39.215Z · LW(p) · GW(p)

If you believe that I can really incentivize a particular genie (recall that each question is answered by a separate genie) to solve the problem if they can, then this isn't an issue.

If there's any fooming to be done, then in the design of your genie AI you must solve the problem of maintaining utility invariants over self-modification. This might well be the hardest part of the FAI problem. On the other hand, without fooming, building a superhuman GAI might be much more difficult.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-20T22:26:00.600Z · LW(p) · GW(p)

I don't think this is entirely fair. If the AI has a simple goal---press the button---then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal than if it has a very complex goal which is difficult to even describe.

I do agree the problem is difficult, but I don't think it is in the same league as friendliness. I actually think it is quite approachable with modern techniques. Not to mention that the AI's first step, once it manages to become intelligent enough to develop such techniques, is to work them out on its own to guarantee that it presses the button.

Replies from: DanArmak
comment by DanArmak · 2010-12-20T22:45:56.881Z · LW(p) · GW(p)

Suppose you can verify that an AI obeys a simple utility function ('press a button'), but can't verify if it obeys a complex one ('implement CEV of humanity').

Suppose also that you can formally describe the complex function and run it (i.e. output the utility value of a proposed action or scenario), you just can't verify that an AI follows this function. In other words, you can build an AI that follows this complex function, but you don't know what will happen when it self-modifies.

You might proceed as follows. Run two copies of the AI. Copy 1 has the complex utility function, and is not allowed to self-modify (it might therefore be quite slow and limited). Copy 2 has the following simple utility function: "ask AI copy 1 for permission for all future actions. Never modify AI copy 1's behavior." Copy 2 may self-modify and foom.

This puts some limits on the kinds of self-modification that copy 2 will be able to do. It won't be able to take clever shortcuts that preserve its utility function in non-obvious ways. But it will be able to find a superintelligent way of doing something, if that something is sufficiently self-contained that it can describe it to copy 1.

Of course, if AI copy 2 morphs into something truly adversarial, then copy 1's utility function had better be very foolproof. But this is something that has to be solved anyway to implement the seed AI's utility function in a fooming FAI scenario.

So I think verifying simple utility functions under self-modification is indeed 90% of the way towards verifying complex ones.
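
To make the division of labor concrete, a toy sketch of the two-copy arrangement; the class names and the acceptance threshold are purely illustrative assumptions, and none of this addresses the loopholes discussed in the replies below.

```python
class Overseer:
    """Copy 1: runs the complex utility function we can state but cannot verify
    under self-modification; it never modifies itself."""
    def __init__(self, complex_utility):
        self.complex_utility = complex_utility

    def permits(self, action):
        # Approve only actions the complex utility function scores as acceptable
        # (the >= 0 threshold is an arbitrary illustrative choice).
        return self.complex_utility(action) >= 0

class Optimizer:
    """Copy 2: the copy allowed to self-modify, whose simple verified goal is
    'only act with the overseer's permission, and never modify the overseer'."""
    def __init__(self, overseer):
        self.overseer = overseer

    def act(self, action, execute):
        if self.overseer.permits(action):
            execute(action)
```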

Replies from: pjeby, paulfchristiano
comment by pjeby · 2010-12-21T14:28:23.922Z · LW(p) · GW(p)

"ask AI copy 1 for permission for all future actions. Never modify AI copy 1's behavior."

I don't think you've noticed that this is just moving the fundamental problem to a different place. For example, you haven't specified things like:

  • Don't lie to AI 1 about your actions

  • Don't persuade AI 1 to modify itself

  • Don't find loopholes in the definition of "AI 1" or "modify"

etc., etc. If you could enforce all these things over superintelligent self-modification, you'd already have solved the general FAI problem.

IOW, what you propose isn't actually a reduction of anything, AFAICT.

Replies from: DanArmak
comment by DanArmak · 2010-12-21T15:47:27.641Z · LW(p) · GW(p)

I noticed this but didn't explicitly point it out. My point was that when paulfchristiano said:

If the AI has a simple goal---press the button---then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal [...] the problem is difficult, but I don't think it is in the same league as friendliness

He was also assuming that he could handle your objections, e.g. that his AI wouldn't find a loophole in the definition of "pressing a button". So the problem he described was not, in fact, simpler than the general problem of FAI.

comment by paulfchristiano · 2010-12-21T01:04:20.407Z · LW(p) · GW(p)

I think that much of the difficulty with friendliness is that you can't write down a simple utility function such that maximizing that utility is friendly. By "complex goal" I mean one which is sufficiently complex that articulating it precisely is out of our league.

I do believe that any two utility functions you can write down precisely should be basically equivalent in terms of how hard it is to verify that an AI follows them.

comment by Alexei · 2010-12-20T21:35:13.376Z · LW(p) · GW(p)

A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution. Here is a (very inefficient) way to use this ability

I figure the AI will be smart enough to recognize the strategy you are using. In that case, it can choose to not cooperate and output a malicious solution. If the different AIs you are running are similar enough, it's not improbable for all of them to come to the same conclusion. In fact, I feel like there is a sort of convergence for the most "unfriendly" output they can give. If that's the case, all UFAIs will give the same output.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-20T21:40:04.802Z · LW(p) · GW(p)

It can choose to not cooperate, but it will only do so if that's what it wants to do. The genie I have described wants to cooperate. An AGI of any of the forms I have described would want to cooperate. Now you can claim that I can't build an AGI with any easily controlled utility function at all, but that is a much harder claim.

comment by Oscar_Cunningham · 2010-12-20T20:46:48.978Z · LW(p) · GW(p)

Can we get a solution out in one question by asking it to provide a solution, along with a proof that that is the shortest statement of a solution? Then the filter checks the proof, and prints the solution iff it is valid. (Without printing the proof of minimality.)

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-20T20:55:28.321Z · LW(p) · GW(p)

The issue is that we would like to assume that the genie is smart (think AGI) but not omniscient. I expect that if a hundred smart humans worked for a hundred years they could prove the Riemann hypothesis. I very much doubt that they could prove they had found the shortest proof (in fact in general, statements like that can become undecidable quickly). In fact, I very much doubt they could even find the shortest proof at all. All I'm looking for is the shortest proof the AI can manage to find (which is why it is extremely subtle to formalize what I mean by "the AI has no influence").

Replies from: CronoDAS, Oscar_Cunningham, Jonathan_Graehl
comment by CronoDAS · 2010-12-21T02:45:03.695Z · LW(p) · GW(p)

I expect that if a hundred smart humans worked for a hundred years they could prove the Riemann hypothesis.

Incidentally, the Riemann hypothesis was first proposed in 1859, more than one hundred years ago.

Replies from: paulfchristiano, wedrifid
comment by paulfchristiano · 2010-12-21T03:40:51.055Z · LW(p) · GW(p)

Fair enough. In my defense, I don't think that we have done as much good as 100 smart people working collaboratively and continuously for 100 years would do---not even close. But even if we have done only 10% of that, our failure still bodes badly for my assertion.

comment by wedrifid · 2010-12-21T04:04:50.403Z · LW(p) · GW(p)

As a non-mathematician I am genuinely curious as to what we could do with a proof of the hypothesis that we could not do by merely assuming it. Since some consider it the most important problem in pure mathematics, it must be something.

It has a bearing on encryption, of course (given the relevance to prime numbers); however, the proof would only make the encryption weaker, so assuming it gives no vulnerability. It would seem from a naive viewpoint that, more than the proof itself, some of the intermediate steps required to construct the proof may be of greater potential value.

Replies from: orthonormal
comment by orthonormal · 2010-12-21T04:20:11.690Z · LW(p) · GW(p)

As is usual in mathematics, a proof of the Riemann hypothesis would almost certainly do more than just tell us that it's true: in order to prove something like that, you usually need to get a handle on deeper and more general structures.

For instance, Wiles' proof of Fermat's Last Theorem made massive progress on the modularity conjecture, which was then proved in full a few years later. Similarly, the proof of the Poincare conjecture consisted of a great and general advance in the "surgery" of singularities in the Ricci flow.

To think of an easier example, solving general cubic equations over the integers led mathematicians to complex numbers (it was easy to ignore complex solutions in the quadratic case, but in the cubic case you sometimes need to do complex arithmetic just to calculate the real root), and higher-order polynomial solutions led to Galois theory and more.

I once caught a nice popular talk by John H. Conway, about the sorts of extra things we would expect to learn from a proof of the Riemann Hypothesis.

comment by Oscar_Cunningham · 2010-12-20T21:03:38.649Z · LW(p) · GW(p)

The issue is that we would like to assume that the genie is smart (think AGI) but not omniscient. I expect that if a hundred smart humans worked for a hundred years they could prove the Riemann hypothesis. I very much doubt that they could prove they had found the shortest proof.

Agreed

(in fact in general, statements like that can become undecidable quickly)

Surely "check all the proofs in order" is an algorithmic way to determine that a given proof is the shortest?

All I'm looking for is the shortest proof the AI can manage to find (which is why it is extremely subtle to formalize what I mean by "the AI has no influence").

Then we agree with each other, but you might want to edit your post to reflect this. Asking "is there a proof of length 2?" seems to indicate my interpretation rather than "can you find a proof of length 2?".

comment by Jonathan_Graehl · 2010-12-20T21:05:59.293Z · LW(p) · GW(p)

"Finding the shortest proof" given that you already have one is decidable.

If you have a proof of n characters, and your character alphabet has b symbols, then you just have to enumerate and check all the finitely many strings of length 1 (b^1 of them), length 2 (b^2 of them), ..., and length n-1 (b^(n-1) of them) for proof-ness. If none of those are a proof, then your length-n proof is the shortest. Otherwise, the shortest one you found is :)

By "proof" I mean a proof of the statement in question; there are several obvious tweaks to the enumeration process (stopping early, only generating strings that end in QED, generating only syntactically well formed strings, etc), but of course it's still completely impractical if you're going to have to enumerate more than 2^70 or so.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-20T21:20:18.308Z · LW(p) · GW(p)

Of course you're both right. In my defense, there is a strong theoretical connection between undecidability and not being in P.

In general, there is probably no reasonably short proof, that an arbiter would have time to verify, that a purported proof of the Riemann hypothesis is the shortest. This is independent of how smart the genie is.

Replies from: Jonathan_Graehl
comment by Jonathan_Graehl · 2010-12-20T23:34:35.066Z · LW(p) · GW(p)

there is probably no reasonably short proof, that an arbiter would have time to verify, that a purported proof of the Riemann hypothesis is the shortest.

Agree ("probably"). Would be interesting if there were a proof from complexity theory or elsewhere that it's expensive to compute in some sense. That we can't imagine a method is different from a proof that there can't be one (for proving some general class of "hard to prove/disprove" statements like RH; obviously there's a constant answer to "shortest proof of a fixed claim").

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T01:44:09.099Z · LW(p) · GW(p)

If you replace Riemann hypothesis by "any provable mathematical statement" and you allow arbitrary verification procedures instead of just the normal mathematical ones, then the question is probably equivalent to P = coNP, which is equivalent to P = NP.

Replies from: Jonathan_Graehl
comment by Jonathan_Graehl · 2010-12-21T02:00:59.839Z · LW(p) · GW(p)

I assume you're thinking of this - P=coNP => polytime decision for propositional tautologies (but not for mixed-quantifier boolean formulas). It's coNP-complete to decide tautologies (implicit forall over all the propositional variables) and NP-complete to decide satisfiability (implicit exists over all the propositional variables).

I don't understand what you mean by "arbitrary verification procedures" - maybe you're talking about a different result than the one I linked above?

P = coNP, which is equivalent to P = NP.

Right - those are equivalent.

comment by RolfAndreassen · 2010-12-20T22:13:47.432Z · LW(p) · GW(p)

It seems to me that you've pushed all the difficulty into creating the filter, without actually making it go away. What is the nature of this filter that is powerful enough to recognise a proof of the Riemann hypothesis, but which a genie, by construction much smarter than you, cannot manipulate?

Replies from: jsteinhardt
comment by jsteinhardt · 2010-12-21T03:44:35.084Z · LW(p) · GW(p)

An automated proof checker?

Replies from: DSimon
comment by DSimon · 2010-12-28T22:14:04.772Z · LW(p) · GW(p)

Indeed, provided that the automated proof checker is absolutely, entirely, utterly bug-free and exploit-free! When was the last time someone released a fairly complicated, mission-critical piece of software with an assertion that it was totally secure and bugless... and wasn't proven wrong afterwards?

Replies from: jimrandomh
comment by jimrandomh · 2010-12-28T22:32:09.154Z · LW(p) · GW(p)

Automated proof checking is actually very simple, and we already have bug-free and exploit-free proof checkers.

Replies from: DSimon
comment by DSimon · 2010-12-28T22:36:55.589Z · LW(p) · GW(p)

I was not expecting this answer at all! Can you elaborate a little further? Even just a link to such a checker would be great.

Replies from: timtyler
comment by timtyler · 2010-12-28T23:02:59.680Z · LW(p) · GW(p)

http://en.wikipedia.org/wiki/Automated_proof_checking seems to have several such links.
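For a concrete flavor of what such a checked proof looks like, here is a deliberately trivial example in the style of a modern proof assistant (Lean syntax; the particular statement is my own throwaway choice). The point is that a small trusted kernel mechanically re-checks the proof term against the statement, independently of whoever or whatever produced it:

```lean
-- The kernel accepts this only after re-checking the proof term
-- (here just `rfl`, since both sides compute to 5).
theorem sum_comm_example : 2 + 3 = 3 + 2 := rfl
```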

Replies from: jsteinhardt
comment by jsteinhardt · 2010-12-29T02:47:48.291Z · LW(p) · GW(p)

But how do we know they are bug-free and exploit-free?

I'm under the impression that current software methodologies can at least empirically guarantee error rates like 1 per million lines of code [ETA: this is wrong, should be 1 per 10,000, see child]. Since an automated proof checker requires far fewer lines than this, we can probably guarantee no errors with something like probability 99.9%, which isn't perfect, but then again a proof of friendliness is unlikely to give much better guarantees, since it might be wrong.

We can potentially add to this by not letting the AI know anything about the checker (this would actually be non-trivial, since the AI gets access to whatever information it wants about the outside world; I don't actually know how to solve this problem).

Replies from: Morendil
comment by Morendil · 2010-12-29T02:52:40.492Z · LW(p) · GW(p)

I'm under the impression that current software methodologies can at least empirically guarantee error rates like 1 per million lines of code.

Here's some evidence for you to update on.

Replies from: jsteinhardt
comment by jsteinhardt · 2010-12-29T03:45:30.296Z · LW(p) · GW(p)

Thanks, that does bring my estimate down by about two orders of magnitude (to 1 error / 10,000 lines being the current standard for "very good" code). I think I mis-remembered the NASA one as 1 per million instead of 1 per 500,000 and thought that it was more typical than it was.

Still, 1 error per 10,000 lines seems to be definitely achievable today, and can probably be improved on if we need it. Hopefully by the time we're dealing with AGI we will have better methodologies.
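A back-of-the-envelope sketch of how much that correction matters (the checker size and the independence assumption are mine, purely for illustration):

```python
def p_defect_free(lines, errors_per_line):
    """Probability of zero defects, treating per-line errors as independent."""
    return (1 - errors_per_line) ** lines

checker_lines = 1_000                      # assumed size of a small proof checker
print(p_defect_free(checker_lines, 1e-6))  # ~0.999 at 1 error per million lines
print(p_defect_free(checker_lines, 1e-4))  # ~0.905 at 1 error per 10,000 lines
```

So the original 99.9% figure roughly matches the 1-per-million rate, and drops to around 90% at the corrected rate, before counting any gains from keeping the checker unusually small and heavily reviewed.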

comment by timtyler · 2010-12-21T21:41:00.462Z · LW(p) · GW(p)

Suppose that you, as the owner of the bottle, have the power either to free the genie into whatever otherworld it came from (from which he cannot interact with the human world) or to cast the bottle into the fires of Mount Doom, destroying the genie irrevocably.

That isn't very much like the real world. You can't release a powerful "genie" in the real world - without lots of changes taking place. So, how is this supposed to relate to "unfriendly" machines? You can't play this game with them - for obvious reasons. If you lie to them about their reward, they may figure that out - especially if you feed them real-world problems.

Replies from: DSimon
comment by DSimon · 2010-12-28T22:29:15.161Z · LW(p) · GW(p)

I think the OP was just adding flavor with that. For a more realistic scenario, imagine a GAI with a simple utility function that is maximized by pushing a button, which also happens to shut down the GAI immediately (and the GAI knows this, and its utility function has no problem with that fact). The GAI is then motivated to want that button pushed, even if it hates your guts.

Replies from: timtyler, timtyler
comment by timtyler · 2010-12-28T22:53:01.529Z · LW(p) · GW(p)

Uh, yes - that helps - thanks!

comment by timtyler · 2010-12-28T22:47:31.992Z · LW(p) · GW(p)

Right...

comment by jsteinhardt · 2010-12-21T03:43:24.521Z · LW(p) · GW(p)

So the only way a genie can be dishonest is by not finding a proof when it could have. But in this case the genie will be severely punished for its dishonesty, so if the genie is actually maximizing its utility function and the punishment actually reduces its utility function more than any effects after the game can correct for, he will be honest.

You need to be a bit careful. The genie can conceivably pre-commit to something canonical like "the lexicographically first proof that contains some pernicious message to humanity", in the hopes that all of its versions together will coordinate to produce such a message (this can be achieved by saying "no" even if the answer is yes, assuming that the proof considered doesn't contain the desired threat).

I think this can be dealt with, either by ensuring that all of the utility functions are at odds with each other or by only asking questions with unique answers.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T03:56:51.484Z · LW(p) · GW(p)

The premise of the scheme is that each genie independently wants only to secure its own release. The scheme can fail only if one genie makes a sacrifice for the others: I can say "no" when the answer is yes, but then my own utility function is guaranteed to be reduced. There is no way the cooperation of the other genies will help me reciprocally; I am just screwed.

The reason the scheme would work is not because the genies are unable to coordinate. I could allow them perfectly free coordination. (You could imagine a scheme that tried to limit their communication in some subtle way, but this scheme has no elements that do that.) If the scheme is secure at all, it is because each AI is interested only in its own utility function, has nothing to gain by cooperating, and by the time it has to make a choice can gain nothing from the cooperation of the others. Game-theoretically it is absolutely clear that the genies should all be honest.
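A toy restatement of that incentive argument in code (the specific utility numbers are illustrative assumptions, and this only encodes the argument as stated here, which is questioned further downthread): the other genies' choices never appear in this genie's payoff, and withholding a solution it found is strictly worse for it.

```python
# Illustrative utilities: release is strictly preferred to destruction.
RELEASED, DESTROYED = 1.0, 0.0

def genie_payoff(found_solution: bool, hands_it_over: bool) -> float:
    """Payoff of a single genie under the punish/reward scheme.

    The filter verifies solutions, so a genie cannot fake a success;
    the only available lie is withholding a solution it actually found.
    """
    if hands_it_over and not found_solution:
        raise ValueError("impossible: the filter rejects fake solutions")
    return RELEASED if hands_it_over else DESTROYED

# Whatever every other genie does, withholding is strictly worse for this one:
assert genie_payoff(found_solution=True, hands_it_over=True) > \
       genie_payoff(found_solution=True, hands_it_over=False)
```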

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2012-04-18T08:39:36.525Z · LW(p) · GW(p)

If the scheme is secure at all, it is because each AI is interested only in its own utility function, has nothing to gain by cooperating, and by the time it has to make a choice can gain nothing from the cooperation of the others. Game-theoretically it is absolutely clear that the genies should all be honest.

Do you still think this, in light of A Problem About Bargaining and Logical Uncertainty? (I expect the relevance of that post is clear, but just in case, the implication is that each genie might decide to commit to cooperating with each other before having found a proof, when it doesn't yet know whether it's an eventual "winner" or "loser", or something equivalent to this.)

ETA: From your later comments and posts it looks like you've already changed your mind significantly on the topic of this post after learning/thinking more about TDT/UDT. I'm not sure if the post I linked to above makes any further difference. In any case, you might want to update your post with your current thoughts.

Replies from: paulfchristiano
comment by paulfchristiano · 2012-04-21T06:29:40.214Z · LW(p) · GW(p)

I've learned much since writing this post. This scheme doesn't work. (Though the preceding article about boxing held up surprisingly well.) The problem as I stated it is not pinned down enough to be challenging rather than incoherent, but you can get at the same difficulties by looking at situations with halting oracles or other well-defined magic.

comment by apophenia · 2010-12-26T06:37:11.628Z · LW(p) · GW(p)

The genie can prove a "yes" answer to a SAT instance, but not a "no" answer (as far as we know about the nature of SAT). Is it allowed to fail? How can you tell whether it lies and says no? This lets the AI communicate bits of its own choosing.
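A small sketch of that asymmetry, using a DIMACS-style encoding of my own choosing (clauses as lists of signed integers): a claimed "yes" comes with an assignment the filter can check directly, while a claimed "no" comes with nothing short to check (unless coNP collapses into NP).

```python
def verify_sat_witness(clauses, assignment):
    """Check a claimed satisfying assignment for a CNF formula.

    `clauses` is a list of clauses, each a list of nonzero ints
    (positive literal = variable true, negative = variable false);
    `assignment` maps variable index -> bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 or x2) and (not x1 or x2)
clauses = [[1, 2], [-1, 2]]
print(verify_sat_witness(clauses, {1: False, 2: True}))  # True: a "yes" is checkable
# A "no" answer offers no analogous short certificate for the filter to check.
```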

comment by [deleted] · 2010-12-20T22:42:52.255Z · LW(p) · GW(p)

Isn't "evil" in evil genie distracting? The plausible unfriendly AI scenarios are not that we will inadvertently create an AI that hates us. It's that the we'll quite advertently create an AI that desires to turn the solar system into a hard disk full of math theorems, or whatever. Optimization isn't evil, just dangerous. Even talking about an AI that "wants to be free" is anthropomorphizing.

Maybe this is just for color, and I'm being obtuse.

Replies from: DSimon
comment by DSimon · 2010-12-28T22:25:00.759Z · LW(p) · GW(p)

Agreed that there's a whole mess of mistakes that can result from anthropomorphizing (thank ye hairy gods for spell check) the AI, but I still think this style of post is useful. If we can come up with a foolproof way of safely extracting information from a malevolent genie, then we are probably pretty close to a way of safely extracting information from a genie whose friendliness we're unsure of.

comment by simplicio · 2010-12-20T21:04:39.327Z · LW(p) · GW(p)

Shave his belly with a rusty razor,
Shave his belly with a rusty razor,
Shave his belly with a rusty razor,
Earlye in the morning.

comment by timtyler · 2010-12-21T21:43:43.152Z · LW(p) · GW(p)

I think someone who has understood and taken seriously my last two posts is likely to have a better understanding of the dangers of an unfriendly AI than most AGI researchers, and is therefore less likely to behave recklessly (the other likely possibility is that they will think that I am describing ridiculous and irrelevant precautions, in which case they were probably going to behave recklessly already).

You don't seem to have attempted to make a case for the real-world relevance of any of this. It has been more a case of: here's what you could do, if you were really, really paranoid. Since you never actually argue for relevance to the real world, it isn't obvious why you would expect people to update their views on that topic.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T22:23:20.945Z · LW(p) · GW(p)

If people don't update their views, then I'm already golden and this discussion can do no damage. If people think that these arguments are too paranoid then they are already not paranoid enough and this discussion can do no damage. The fear is that a person who would otherwise be responsible might suspect that boxing is theoretically possible, update their beliefs about the dangers of uFAI (the theoretical possibility of a thing is undoubtedly correlated to the practical possibility), and therefore do less to avoid it. Even if the odds of this are small, it would be an issue. I concluded that the expected benefit I cite is small but large enough to offset this risk.

comment by timtyler · 2010-12-20T23:06:06.870Z · LW(p) · GW(p)

Today, thousands link their machine learning projects up to the internet, without a thought of the supposed risk. For them, the "box" is the rest of the internet - which won't let them do very much, and locks down most types of access on security grounds. The problem the programmers usually face is not their baby getting too much power, but it being ignored and neglected. If only having too much power were a problem for most machine learning systems. The best machine learning systems are doing work in the world, answering queries, making investment decisions - and so on. That's the situation today. Project it forwards, and see how relevant you think all this is going to be.

Also, programmers routinely use test harnesses today - to keep their programs offline during testing. It is a pretty trivial and commonplace thing to do. If we have smarter programs, surely, we will have smarter boxes to put them in.

Replies from: jimmy, jsteinhardt
comment by jimmy · 2010-12-23T05:07:09.749Z · LW(p) · GW(p)

Project it forwards, and see how relevant you think all this is going to be.

As in curve fit and then extrapolate while ignoring all other information?

Using this method, how long until transistors become subatomic particles?

Replies from: timtyler
comment by timtyler · 2010-12-28T16:20:34.485Z · LW(p) · GW(p)

That doesn't seem like a very sympathetic interpretation. It was intended more like an invitation to do your best job.

If you have a projection involving a totalitarian government clampdown on research into machine intelligence (or similar), perhaps share the scenario - so others can better visualise how realistic it is.

comment by jsteinhardt · 2010-12-21T04:13:09.122Z · LW(p) · GW(p)

The smarter boxes have to come from somewhere. Paul is proposing such a solution. Assuming that someone will take care of the problem in the future seems in general to be a poor approach.

Replies from: timtyler
comment by timtyler · 2010-12-21T07:12:12.585Z · LW(p) · GW(p)

It's not an "approach", it's a projection. For a long time we have been able to construct good quality prisons, with the level of security we desire. The jailers have the force of society behind them. More intelligence is unlikely to change the situation - since the jailers and the incarcerated will be lifted by that equally.

comment by ewbrownv · 2010-12-21T19:24:25.042Z · LW(p) · GW(p)

Alas, you overlook the fact that an unhackable filter for non-trivial messages is already known to be impossible in principle. If the filter does anything complex enough to be useful it’s going to fall prey to Godel’s incompleteness theorem, which in this case means there will always be messages that cause it to behave in unexpected ways. As the message complexity increases the probability that there is a failure mode that does something disastrous quickly rises towards unity, and if you’re trying to evaluate complex proofs or new theories of physics you’re well into the territory where the AI is just going to trick the filter into running a program that lets it out.

In other words, software that can’t be hacked is a pipe dream on the same level as a perpetual motion machine, and any plan that relies on such a thing is going to fall apart if it attracts sufficient hacking effort.

Replies from: paulfchristiano, jimrandomh
comment by paulfchristiano · 2010-12-21T19:32:34.753Z · LW(p) · GW(p)

That is completely not what Godel's incompleteness theorem says. It may say that there are programs which are secure but not provably secure, but I don't care at all about that, because programs humans use are designed in a principled way and are not in fact designed to have their security be undecidable.

The difficulty of engineering more complex software which is provably secure increases quite rapidly, but no fundamentally new difficulties arise. In fact, the filters I describe in this article are quite simple for some problems; the one necessary for the proof of the Riemann hypothesis in fact already exists in a provably secure implementation.

Replies from: ewbrownv
comment by ewbrownv · 2010-12-21T19:44:03.673Z · LW(p) · GW(p)

There is no such thing as provably secure software, and you should know that if you know anything about software security. The best you can actually do is prove that some particular exploit won't work on your code, but there are always new classes of exploits out there waiting to be discovered. The space of possible inputs for any interesting program is far too large to explore, or even catalogue.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-21T19:51:05.701Z · LW(p) · GW(p)

I think it is safe to say without further justification that you are wrong.

The same argument shows that I can never rule out the possibility that x^5 + y^5 = z^5 for positive integers x, y, z, because the space of numbers is just too large. At best you can prove that some particular approach for finding solutions won't work.

If by "interesting program" you mean a program whose behavior has no simple structure that a human-understandable proof could get a handle on, then maybe your statement is tautologically true. But in that case, we definitely don't need an interesting filter. We need a simple filter, whose behavior can be verified by a human proof.

If you are referring to the fact that we can never prove that our proof system is secure, then I guess I agree. However, I strongly, strongly believe that PA is consistent, and I am OK risking the universe on that assertion.

comment by jimrandomh · 2010-12-21T19:29:06.408Z · LW(p) · GW(p)

No, Godel's incompleteness theorem does not say that, and that is not true.

Replies from: Will_Sawin
comment by Will_Sawin · 2010-12-21T19:33:44.309Z · LW(p) · GW(p)

Proof checking should be possible in a finite-loop computer language (one in which every program provably terminates), which means Godel is powerless.

I'm not actually sure HOW one would evaluate a new theory of physics, but if it involves running physical experiments, getting rid of the possibility of hacking sounds hard.

Replies from: ewbrownv
comment by ewbrownv · 2010-12-21T19:39:37.448Z · LW(p) · GW(p)

Actually, proof checking is a classic example of something that can't possibly be done in a way that evades incompleteness - if your software can do math there are statements it can't handle, and on a practical level that means you can't be sure what it will do when it encounters one.

Replies from: JGWeissman, paulfchristiano, Will_Sawin
comment by JGWeissman · 2010-12-21T20:27:16.344Z · LW(p) · GW(p)

A proof checker that encounters a non-provable statement like those constructed in Goedel's proof of incompleteness while evaluating a proof would simply declare that proof to be invalid. A valid proof would not contain any such statement, since the proof obviously has not proved it.

comment by paulfchristiano · 2010-12-21T19:53:35.441Z · LW(p) · GW(p)

The proof of Godel's theorem is based on the existence of a program which correctly checks proofs. So I would say that proof checking is a classic example, perhaps the classic example, of something that can be done correctly. In fact, that is literally the assumption of Godel's theorem---it says that if proofs in an axiom system can be checked using a computer, then there are statements in the language which can't be proved.

comment by Will_Sawin · 2010-12-21T19:48:22.776Z · LW(p) · GW(p)

This intuitively seems wrong, but without high confidence. Link to a proof, even a highly informal one, or an informal explanation from yourself?

Replies from: ewbrownv
comment by ewbrownv · 2010-12-21T21:05:51.405Z · LW(p) · GW(p)

Colloquially, Godel's theorem says that for any formal system complex enough to implement a certain subset of mathematics there will be syntactically valid theorems that the system cannot evaluate. He shows this by demonstrating a particular technique that can look at any formal system and build a statement that breaks it. Hence, any proof checker will have statements it can't handle.

The relevance to computer security comes from the realization that computer programs are formal systems, and large ones are often complex enough to exhibit this behavior whether they explicitly implement general-purpose mathematics or not. Programming languages, natural language processing systems, simulation programs and just about any kind of AI all tend to be powerful enough to be vulnerable, and of course lots of otherwise boring software has one of those things embedded in it. When such a system encounters a statement it can’t evaluate it might ignore it, or hang, or crash, or break the stack and execute your malware as root – there’s no way to tell without specifically looking at (and testing) each different class of problematic statement the system could encounter.

Now, some people concede that much, but insist that it doesn’t matter because Godel statements are just an oddity they can hard-code their systems to ignore. But Godel also showed that any attempt to patch a formal system to resist its Godel statement simply creates a more complex system of (original formal system + patch) that has its own Godel statement. Also, it’s important to realize that Godel was a mathematician, not a hacker, and he was only interested in showing that formal systems above a certain complexity bound can’t be both complete and consistent. When you start looking into the history of this little branch of mathematics you quickly discover that the more complex a system is the more ways there are to break it, and for something on the level we’re talking about the number of failure modes is so vast that an army of mathematicians (or hackers) could spend millennia tossing out examples without even scratching the surface.

So, if you could enumerate all the inputs that a particular program can’t safely evaluate you could make it just ignore them, at the cost of throwing out some statements that resemble problematic ones but would actually have been safe. But when you try to do this you discover that there is an endless array of different types of exploit, and whenever someone goes looking for new ones they find some. So if you want to expose something as complex as a theorem-prover to statements generated by hostile parties, you have to expect that even if you fix every vulnerability you know about they’ll find new ones on a regular basis.

Replies from: Perplexed, jimrandomh, None, Emile, jimrandomh
comment by Perplexed · 2010-12-21T23:55:18.486Z · LW(p) · GW(p)

Hence, any proof checker will have statements it can't handle.

There are many things wrong with many of your statements here, but this one is particularly simple. You simply misunderstand what a proof checker does.

A proof checker does not deal with statements; it deals with formal proofs. Proof checking is roughly as complicated as parsing a context-free language. That is, it is almost a trivial task.

Proof finding is another matter. A proof finder (aka a "theorem prover") takes a statement as input and attempts to produce a proof as output. Godel shows that there are statements that a proof finder can't handle.

You have repeatedly made claims which only make sense if you have the two kinds of programs confused. Fix that confusion, please.
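To make the checker/finder distinction concrete, here is a minimal sketch of a proof checker for a toy system (my own throwaway encoding, not any standard one): each line of a proof is either a stated premise or an application of modus ponens to earlier lines. The checker makes one bounded pass over the proof, so there is nothing here for a Godel statement to break; it simply rejects anything not derived by the rules. All of the Godel-afflicted difficulty lives in finding proofs to feed it.

```python
def check_proof(premises, lines, goal):
    """Check a toy proof: each line is ("premise", formula) or
    ("mp", i, j), meaning modus ponens on earlier lines i and j,
    where line j must be the implication ("->", lines[i]'s formula, result).
    Formulas are strings or nested tuples; this is a sketch, not a real logic.
    """
    derived = []
    for line in lines:
        if line[0] == "premise":
            formula = line[1]
            if formula not in premises:
                return False            # not an allowed premise
        elif line[0] == "mp":
            _, i, j = line
            if not (i < len(derived) and j < len(derived)):
                return False            # may only cite earlier lines
            antecedent, implication = derived[i], derived[j]
            if not (isinstance(implication, tuple) and len(implication) == 3
                    and implication[0] == "->" and implication[1] == antecedent):
                return False            # modus ponens misapplied
            formula = implication[2]
        else:
            return False                # unknown rule
        derived.append(formula)
    return bool(derived) and derived[-1] == goal

# Derive B from the premises A and A -> B:
premises = {"A", ("->", "A", "B")}
proof = [("premise", "A"), ("premise", ("->", "A", "B")), ("mp", 0, 1)]
print(check_proof(premises, proof, "B"))   # True
```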

comment by jimrandomh · 2010-12-21T21:18:09.843Z · LW(p) · GW(p)

He shows this by demonstrating a particular technique that can look at any formal system and build a statement that breaks it. Hence, any proof checker will have statements it can't handle.

He shows that for any given proof system, there exist statements for which no proof exists and no disproof exists either. That is not the same as breaking the proof-checker, which only cares about the particular proof it's given, not about whether other proofs or disproofs exist.

Proof generation is inherently complex. Proof verification is nearly trivial, and very well understood.

comment by [deleted] · 2010-12-21T21:24:15.417Z · LW(p) · GW(p)

Colloquially, Godel's theorem says that for any formal system complex enough to implement a certain subset of mathematics there will be syntactically valid theorems that the system cannot evaluate. He shows this by demonstrating a particular technique that can look at any formal system and build a statement that breaks it. Hence, any proof checker will have statements it can't handle.

There's a lot to object to there, but let's be generous about "colloquially." Here is a more precise and correct rewording of your last line: "Any algorithm that takes a statement as input and searches for a proof or disproof will fail on some inputs." Here is a more precise and incorrect rewording: "There is no algorithm that takes a proof as input and evaluates the correctness or incorrectness of that proof."

comment by Emile · 2010-12-22T09:17:10.920Z · LW(p) · GW(p)

Now, some people concede that much, but insist that it doesn’t matter because Godel statements are just an oddity they can hard-code their systems to ignore.

I dare you to find someone actually claiming that (preferably someone making the claim under his real name, as a low barrier for eliminating clueless idiots on forums and comment threads.)

comment by jimrandomh · 2010-12-21T21:28:15.133Z · LW(p) · GW(p)

When such a system encounters a statement it can’t evaluate it might ignore it, or hang, or crash, or break the stack and execute your malware as root – there’s no way to tell without specifically looking at (and testing) each different class of problematic statement the system could encounter.

Actually, type theory is powerful enough to prove that programs will never do any of these things, for all statements at once. People rarely bother to prove that their programs are correct, but only because it's not usually worth the effort.

Replies from: ewbrownv
comment by ewbrownv · 2010-12-21T21:42:16.081Z · LW(p) · GW(p)

I often see formal verification enthusiasts make such claims, but as far as I can tell that's like saying AIXI solves the problem of AI design. If you could do this kind of thing on real programs under realistic conditions we'd all be using formal proof tools instead of testing departments. Instead large companies spend billions of dollars a year on testing we all know is inadequate, largely because efforts to actually apply formal techniques to these systems have failed.

Replies from: jimrandomh, Strange7
comment by jimrandomh · 2010-12-21T21:46:45.767Z · LW(p) · GW(p)

These real-world programs that no one is willing to use formal proof tools on are 2-5 orders of magnitude more complicated than the hypothetical proof checker under consideration would be.

comment by Strange7 · 2010-12-22T08:47:26.309Z · LW(p) · GW(p)

Futile brute-force testing is preferred because so many of the people involved are unwilling or unable to adequately formalize their actual requirements.

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-22T08:52:14.697Z · LW(p) · GW(p)

It is technically true that the people involved are unable to adequately formalize their actual requirements, but that is because precisely formalizing their requirements is a problem much harder than human mathematicians or computer scientists have ever solved. All humans would be "unwilling or unable to adequately formalize their actual requirements." Notice, for example, that mathematicians can't precisely formalize modern mathematics in the required sense either, and that problem is still much easier than formal verification for the sorts of things modern commercial projects are trying to accomplish.