Rolf Nelson: How to deter a rogue AI by using your first-mover advantage

post by Kevin · 2010-11-17T14:02:44.444Z · LW · GW · Legacy · 23 comments

http://www.sl4.org/archive/0708/16600.html

comment by PlaidX · 2010-11-29T22:17:40.628Z · LW(p) · GW(p)

Rolf's original formulation of this gambit fell a few steps short of making sense, but it was rethought by norman noman and others in that thread into something that does, I think.

In the new formulation, there's no first-mover advantage anymore, nor is it a one-way "deterrence". It's a cooperation between the rogue AI and the friendly AI (and presumably many, many other rogue and friendly AIs, all simulating each other).

I'm on the fence as to whether the world really works this way, but it seems to work in cartoon form:

http://poorlyplannedcomics.com/?p=901

comment by JenniferRM · 2010-11-21T21:51:46.601Z · LW(p) · GW(p)

I understand scenario 2 as potentially having some kind of value because who knows what kind of cleverness a super-intelligence could come up with... but scenario 1 in all its flavors and variants has always struck me as a peculiar insanity that implies that simulationist metaphysics actually has versions that cause people to become crazy (and hence marks the memes as somewhat dangerous).

If someone tells me that I'm either in reality and must comply with a threat, or else they will simulate me being faced with the same decision and tortured in the sim if I don't comply in the sim (or, in some variants, they torture the sim if the real version doesn't comply), then my answer is always: "Bring on the torture!"

How is this complicated?

In reality, there is no credible threat, and I know that my sims both (1) will intellectually support my affirmation of winning in the place that matters and (2) are incapable of affecting the outcome anyhow. In the sim, I know that (1) I'm on the winning side by virtue of someone trying so desperate a gambit as this, and (2) I'm willing to soak up a bit of ill conceived torture on behalf of the part of the team that matters.

If you are in a sim and your simulator has the ability to do this sort of thing to you, it's just the fallacy of divine command theory all over again. I've already decided that, in the very unlikely event that god exists and is "evil but omnipotent", I'm fighting it anyway for the sake of goodness, honor, and art :-P

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2010-11-21T21:56:13.107Z · LW(p) · GW(p)

I'm willing to soak up a bit of ill conceived torture on behalf of the part of the team that matters.

Why don't simulations matter? Uploads are people, as real as real can be.

Replies from: JenniferRM
comment by JenniferRM · 2010-11-22T02:19:45.486Z · LW(p) · GW(p)

The issue isn't "status as a morally worthwhile person" but strategic position.

If there is a meaningful substrate/simulation dichotomy with all the normal hidden assumptions that go with this holding true (substrate supports the simulation with read/write editing powers over it, etc, etc) then the substrate is obviously the thing that matters because it actually does matter. Winning in one or another simulation but losing in the substrate is still losing because the simulation depends upon the substrate.

You can close your eyes and imagine a pleasant situation and that might be pleasant for a while (if you don't "notice the pixels"). But if your physical brain is in a materially bad position (lack of nutrients or physical trauma or some such) that you could have avoided by pausing your daydreams to spend some clock cycles on world modeling and planning, then not to have done so would be a tragedy.

If I had the option (and freedom from resource constraints), I'd move my "wirehead happy time" into a simulation because I could fly and magically learn kung fu and generally have an amusing time. It would be just as "internally real" an experience if I'm focused on the experience alone... but until I'm floating in a limitless unconstrained choir of sexy robotic elohim, I want to make sure that someone is diligently minding the substrate :-P

And I do not want my substrate-minding-sister to do something dumb like not achieve her goals (which I share) just because someone threatens to imagine her experiencing some pain. And she'll know I can see through the dumb philosophy and not resent her for not rescuing me from coercive "voodoo torture" if I'm any kind of realistic copy of her :-)

Replies from: Tyrrell_McAllister, Perplexed, Vladimir_Nesov
comment by Tyrrell_McAllister · 2010-11-22T21:51:39.660Z · LW(p) · GW(p)

The issue isn't "status as a morally worthwhile person" but strategic position.

It's true that a sim and an original could agree that what happens to the original is all that matters. Such a sim will sacrifice itself willingly if doing so implies a greater chance that the original gets what it wants.

But the essay in the OP stipulates what the RAI values, and these values mean that there is no strategic value to caring only about what the original gets. Both the sim and the original just want to see computation C completed, and C is something that the RAI could complete in either possible world. Even if the RAI is a sim, its computation of C would be completely real. There would still be a real hardware substrate somewhere that ran through the computation of C, even if that hardware were simultaneously creating an AI who thinks that it's 2040.

This means that the possibility that the sim will be aborted really is a threat to something that the RAI, in both instantiations, values. So, the RAI ought to weigh this possibility.

Replies from: JenniferRM
comment by JenniferRM · 2010-11-23T03:04:12.525Z · LW(p) · GW(p)

::laughing with pleasure::

Yes, in that particular contrived example the boundaries between daydream and real accomplishment are potentially blurred if the difficult accomplishment is to have successfully dreamed a particular thing.

But while dreams within dreams are fun to play with, I don't think that a coherent theory of simulationist metaphysics can ignore the fact that computation in a substrate is a physical process. Rolf's RAI might have an incoherent theory of computation, but I suspect that any coherent theory is likely to take into account energy costs and computational reversibility and come out to some conclusion of the sort that "computing C" ultimately has a physical meaning something along the lines of either "storing the output of process C in a particular configuration in a particular medium" or perhaps "erasing information from a particular medium in a C-outcome-indicating manner"?

If we simulate Conway's Life running a Turing machine computing its calculation, it seems reasonable that it would count that as the calculation happening both in the Conway's Life simulation and in our computer, but to me this just highlights the enormous distance between Rolf's simulated and real RAI.

Maybe if the RAI was OK with the computational medium being in some other quantum narrative that branched off of its own many years previously then it would be amenable to inter-narrative trade after hearing from the ambassador of the version of Rolf who is actually capable of running the simulation? Basically it would have to be willing to say "I'll spare you here at the expense of a less than optimal computational result if copies of you that won in other narratives run my computation for me over there."

But this feels like the corner case of a corner case to me in terms of robust solutions to rogue computers. The RAI would require a very specific sort of goal that's amenable to a very specific sort of trade. And a straight trade does not require the RAI to be ignorant about whether it is really in control of the substrate universe or in a simulated sandbox universe: it just requires (1) a setup where the trade is credible and (2) an RAI that actually counts very distant quantum narratives as "real enough to trade for".

Finally, actually propitiating every possible such RAI is getting into busy beaver territory in terms of infeasible computational costs in the positive futures where a computationally focused paperclipping monster turns out not to have eaten the world.

It would be kind of ironic if we end up with a positive singularity... and then end up spending all our resources simulating "everything" that could have happened in the disaster scenarios...

comment by Perplexed · 2010-11-22T05:17:06.553Z · LW(p) · GW(p)

Let me see if I understand this. An agent cannot tell for sure whether she is real or a simulation. But she is about to find out. Because she now faces a decision which has different consequences if she is real. She must choose action C or D.

  • If she chooses C and is real she loses 10 utils
  • If she chooses D and is real she gains 50 utils
  • If she chooses C and is a sim she gains 10 utils
  • If she chooses D and is a sim she loses 100 utils

She believes that there is a 50% chance she is a sim. She knows that the only reason the sim was created was to coerce the real agent into playing C. So what should she do?

Your answer seems to be that she should systematically discount all simulated utilities. For example, she should do the math as if:

  • If she chooses C and is a sim she gains 0.01 utils
  • If she chooses D and is a sim she loses 0.1 utils

That is, what happens in the simulation just isn't as important as what happens in real life. The agent should maximize the sum (real utility + 0.001 * simulated utility).

Note: the 50% probability and the 0.001 discounting factor were just pulled out of the air in this example.
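
A minimal sketch of the arithmetic, using only the illustrative numbers above (nothing here is derived from anything):

```python
# Illustrative numbers only: the 50% credence, the payoffs, and the 0.001
# discount factor are the ones pulled out of the air above.
p_sim = 0.5        # credence that she is the simulation
discount = 0.001   # weight of simulated utility relative to real utility

payoffs = {        # action: (utility if real, utility if sim)
    "C": (-10, +10),
    "D": (+50, -100),
}

def value(action):
    real_u, sim_u = payoffs[action]
    return (1 - p_sim) * real_u + p_sim * discount * sim_u

for action in ("C", "D"):
    print(action, value(action))
# C: -4.995, D: 24.95 -- with this steep a discount the threat fails
# and the agent picks D.
```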

If this is what you are saying, then it is interesting that your suggestion (reality is more important than a sim) has some formal similarities to time discounting (now is more important than later) and also to Nash bargaining (powerful is more important than powerless).

Cool!

Replies from: JenniferRM
comment by JenniferRM · 2010-11-23T01:52:39.208Z · LW(p) · GW(p)

That actually might be a cooler thing than I said, but I appreciate your generous misinterpretation! I had to google for the nash bargaining game and I still don't entirely understand your analogy there. If you could expand on that bit, I'd be interested :-)

What I was trying to say was simply that there is a difference between something like "solipsistic benefits" and "accomplishment benefits". Solipsistic benefits are unaffected by transposition into a perfect simulation despite the fact that someone in the substrate can change the contents of the simulation at whim. But so long as I'm living in a world full of real and meaningful constraints, within which I make tradeoffs and pursue limited goals, I find it generally more sane to pay attention to accomplishment benefits.

Replies from: Perplexed, Perplexed
comment by Perplexed · 2010-11-23T04:02:36.134Z · LW(p) · GW(p)

I had to google for the nash bargaining game and I still don't entirely understand your analogy there. If you could expand on that bit, I'd be interested.

There have been two recent postings on bargaining that are worth looking at. The important thing from my point of view is this:

In any situation in which two individuals are playing an iterated game with common knowledge of each other's payoffs, rational cooperative play calls for both players to pretend that they are playing a modified game with the same choices, but different payoffs. The payoff matrix they both should use will be some linear combination of the payoff matrices of the two players. The second linked posting above expresses this combination as "U1+µU2". The factor µ acts as a kind of exchange rate, converting one player's utility units into the "equivalent amount" of the other player's utility units. Thus, the Nash bargaining solution is in some sense a utilitarian solution. Both players agree to try for the same objective, which is to maximize total utility.

But wait! Those were scare quotes around "equivalent amount" in the previous paragraph. But I have not yet justified my claim that µ is in any sense a 'fair' (more scare quotes) exchange rate. Or, to make this point in a different way, I have not yet said what is 'fair' about the particular µ that arises in Nash bargaining. So here, without proof, is why this particular µ is fair. It is fair because it is a "market exchange rate". It is the rate at which player1 utility gets changed into player2 utility if the bargained solution is shifted a little bit along the Pareto frontier.

Hmmm. That didn't come out as clear as I had hoped it would. Sorry about that. But my main point in the grandparent - the thing I thought was 'cool' - is that Nash bargaining, discounting of future utility, and discounting of simulations relative to reality are all handled by doing a linear combination of utilities which initially have different units (are not comparable without conversion).
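
If it helps, here is a toy sketch of that shared structure, with made-up numbers; the only point is that all three cases reduce to adding two utilities after converting one of them with a single factor:

```python
# Toy illustration: each case is "my units plus a conversion factor times the
# other party's units". All numbers (mu, delta, d, and the utilities) are made up.

def combined(u_mine, u_other, factor):
    return u_mine + factor * u_other

# Nash bargaining: player 1's utils plus player 2's utils at exchange rate mu
print(combined(12.0, 8.0, factor=0.7))       # mu converts player-2 utils

# Time discounting: utility now plus utility later at discount delta
print(combined(5.0, 5.0, factor=0.95))       # delta converts "later" utils

# Reality vs. simulation: real utility plus simulated utility at discount d
print(combined(50.0, -100.0, factor=0.001))  # d converts simulated utils
```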

comment by Perplexed · 2010-11-23T06:15:37.249Z · LW(p) · GW(p)

Would I be far wrong if I referred to your 'solipsistic benefits' as "pleasures" and referred to your 'accomplishment benefits' as "power"? And, assuming I am on track so far, do I have it right that it is possible to manufacture fake pleasures (which you may as well grab, because they are as good as the 'real thing'), but that fake power is worthless?

Replies from: JenniferRM
comment by JenniferRM · 2010-11-24T04:49:57.944Z · LW(p) · GW(p)

I would say that, as traditionally understood, raw pleasure is the closest thing we have to a clean solipsistic benefit and power is clearly an accomplishment benefit. But I wouldn't expect either example to be identical with the conceptual categories because there are other things that also fit there. In the real world, I don't think being a really good friend with someone is about power, but it seems like an accomplishment to me.

But there is a deeper response available because a value's status as a solipsistic or accomplishment benefit changes depending on the conception of simulation we use -- when we imagine simulations that represent things more complex than a single agent in a physically stable context what counts as "accomplishment" can change.

One of the critical features of a simulation (at least with "all the normal hidden assumptions" of simulation) is that a simulation's elements are arbitrarily manipulable from the substrate that the simulation is implemented in. From the substrate you can just edit anything. You can change the laws of physics. You can implement a set of laws of physics, and defy them by re-editing particular structures in "impossible" ways at each timestep. Generally, direct tweaks to uploaded mind-like processes are assumed not to happen in thought experiments, but it doesn't have to be this way (and a strong version of Descartes's evil demon could probably edit neuron states to make you believe that circles have corners if it wanted). We can imagine the editing powers of the substrate yoked to an agent-like process embedded in the simulation and, basically, the agent would get "magic powers".

In Eliezer's Thou Art Physics he asked "If the laws of physics did not control us, how could we possibly control ourselves? If we were not in reality, where could we be?"

As near as I can tell, basically all human values appear to have come into existence in the face of a breathtakingly nuanced system of mechanical constraint and physical limitation. So depending on (1) what you mean by "power", and (2) what a given context's limits are, power could also be nothing but a solipsistic benefit.

Part of me is embarrassed to be talking about this stuff on a mere blog, but one of the deep philosophical problems with the singularity appears to be figuring out what the hell there is to do after the sexy robotic elohim pull us into their choir of post-scarcity femto-mechanical computronium. If it happens in a dumb way (or heck, maybe if it happens in the best possible way?) the singularity may be the end of the honest pursuit of accomplishment benefits. Forever.

If pursuit of accomplishment benefits is genuinely good (as opposed to simply growing out of cognitive dissonance about entropy and stuff) then it's probably important to know that before we try to push a button we suspect will take such pursuits away from us.

comment by Vladimir_Nesov · 2010-11-22T02:26:20.008Z · LW(p) · GW(p)

You are essentially arguing about the moral value of consequences of actions in reality and in simulations. The technique gets its strength through application of sufficient moral value of consequences in simulations, sufficient as an incentive for altering actions in reality. The extent to which control in reality is negotiated is given by the extent of the conditional moral value of simulations. Factoring in the low probability of the threat, not much can be bought in reality, but some.

comment by cousin_it · 2010-11-17T21:52:24.512Z · LW(p) · GW(p)

I made a relevant comment back in May, which still sounds neat but I have no idea if it's correct:

Rolf Nelson's AI deterrence doesn't work for Schellingian reasons: the Rogue AI has incentive to modify itself to not understand such threats before it first looks at the outside world. This makes you unable to threaten, because when you simulate the Rogue AI you will see its precommitment first. So the Rogue AI negates your "first mover advantage" by becoming the first mover in your simulation :-)

Replies from: Yvain, red75
comment by Scott Alexander (Yvain) · 2010-11-17T22:23:58.843Z · LW(p) · GW(p)

Unless you intend to carry out the threat whether or not you expect the AI to modify itself to not understand it, in which case an AI may realize you would do this and decide not to modify itself after all.

comment by red75 · 2010-11-18T05:53:03.333Z · LW(p) · GW(p)

Let Pg be the RAI's estimate of the probability of successfully completing its computation without converting Earth into computronium. Let Pb be its estimate of the probability of success if it does convert Earth into computronium.

Case 1. The RAI modifies itself so that it cannot understand simulation threats.

When the RAI is not simulated, it ignores all threats and proceeds to convert Earth into computronium, thus increasing its probability of success by Pb-Pg (it becomes protected from accidental or intentional shutdown, etc.).

When the RAI is simulated, it fails the bargaining and its probability of success becomes 0.

Overall estimated probability of success: Pb times the RAI's estimate of the probability that it isn't being simulated (Pn).

Case 2. The RAI doesn't modify itself.

Overall estimated probability of success: Pg.

Thus it is rational for the RAI to modify itself iff Pb*Pn > Pg. Pn is fixed by the RAI's prior and cannot be updated either in a perfect simulation or in the real world. Conclusion: don't build an AI that doesn't believe the simulation argument.
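
A toy sketch of this decision rule, with made-up values for Pg, Pb, and Pn (none of these numbers come from the thread):

```python
# Toy numbers; Pg, Pb, and Pn are not specified anywhere in the thread.
p_g = 0.6        # Pg: success probability if the RAI leaves Earth alone
p_b = 0.9        # Pb: success probability if it converts Earth to computronium
p_not_sim = 0.5  # Pn: prior probability that it is NOT being simulated

def should_self_modify(p_g, p_b, p_not_sim):
    """Self-modification pays off iff Pb * Pn > Pg (red75's rule)."""
    return p_b * p_not_sim > p_g

print(should_self_modify(p_g, p_b, p_not_sim))
# False here: 0.9 * 0.5 = 0.45 < 0.6, so with this prior the RAI is better
# off staying deterrable.
```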

comment by XiXiDu · 2011-01-27T16:14:28.652Z · LW(p) · GW(p)

It seems I have to make decision theory a priority. Right now I don't see why one shouldn't help uFAI actively to maximize utility. If uFAI is much more likely than FAI then the possible consequences of trying to prevent uFAI might be much larger, especially if the uFAI tries to outweigh any counterweight applied by the FAI.

I'm also puzzled that this topic isn't more frequently discussed on LW as it is obviously being thought about as the OP shows. Could there be a more important topic given the scope of the problem?

comment by JoshuaZ · 2010-11-17T18:58:23.212Z · LW(p) · GW(p)

This is an idea that has been raised before. There are a variety of difficulties with it:

1) I can't precommit to simulating every single possible RAI (there are lots of things an RAI might want to calculate).

2) Many unFriendly AIs will have goals that are just unobtainable if they are in a simulation. For example, a paperclip maximizer might not see paperclips made in a simulation as actual paperclips. Thus it will reason "either I'm not in a simulation, so I will be fine destroying humans to make paperclips, or I'm in a simulation, in which case nothing I do is likely to alter the number of paperclips at all."

3) This assumes that highly accurate simulations of reality can be run without many resources. If that's not the case then this fails.

Edit: Curious about the reason for the downvote.

Replies from: Nick_Tarleton, JGWeissman
comment by Nick_Tarleton · 2010-11-21T23:22:42.949Z · LW(p) · GW(p)

1) I can't precommit to simulating every single possible RAI (there are lots of things an RAI might want to calculate).

It's not necessary to do so, just to simulate enough randomly drawn ones (from an approximation of the distribution of UFAIs that might have been created) that any particular deterrable UFAI assigns sufficient subjective probability to being in a simulation.
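
As a toy model (mine, not something claimed in the thread): under naive self-location indifference, the UFAI's credence that it is one of the simulated copies grows with the number of simulations it expects to be run of designs matching its own.

```python
# My own toy model, not anything from the thread: assume the UFAI treats each
# simulation whose randomly drawn design matches its own as an equally likely
# candidate for "where it is".

def p_in_simulation(matching_sims, real_instances=1):
    return matching_sims / (matching_sims + real_instances)

for n in (0, 1, 10, 100):
    print(n, round(p_in_simulation(n), 3))
# 0 -> 0.0, 1 -> 0.5, 10 -> 0.909, 100 -> 0.99
```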

2) Many unFriendly AIs will have goals that are just unobtainable if they are in a simulation. For example, a paperclip maximizer might not see paperclips made in a simulation as actual paperclips. Thus it will reason "either I'm not in a simulation, so I will be fine destroying humans to make paperclips, or I'm in a simulation, in which case nothing I do is likely to alter the number of paperclips at all."

This is true. The UFAI must also be satiable, e.g. wanting to perform some finite calculation, rather than maximize paperclips.

Replies from: JoshuaZ
comment by JoshuaZ · 2010-11-21T23:30:50.938Z · LW(p) · GW(p)

It's not necessary to do so, just to simulate enough randomly drawn ones (from an approximation of the distribution of UFAIs that might have been created) that any particular deterrable UFAI assigns sufficient subjective probability to being in a simulation.

If one is restricted to even just finite calculations, this is a very large set, such that the probability that a UFAI should assign to being in a simulation should always be low. For example, off the top of my head it might be interested in 1) calculating large Mersenne primes, 2) digits of Pi, 3) digits of Euler's constant (gamma, not e), 4) L(2,X) where X is the quadratic Dirichlet character mod 7 (in this case there's a weird empirical identity between this and a certain integral that has been checked out to 20,000 places). And those are all the more well-known options. Then one considers all the things specific people want as individuals. I can think of at least three relevant constants that I'd want calculated that are related more narrowly to my own research. And I'm only a grad student. Given how many mathematicians there are in the world, there are going to be a lot of examples in total. Sure, some of them, like digits of Pi, are obvious. But after that...

comment by JGWeissman · 2010-11-17T19:18:06.446Z · LW(p) · GW(p)

This assumes that highly accurate simulations of reality can be run without many resources. If that's not the case then this fails.

How sure are you that you are not in an approximate simulation of a more precisely detailed reality, with the precision of your expectations scaled down proportionally with the precision of your observations?

(Of course, I am only responding to 1 of your 3 independent arguments.)

Replies from: JoshuaZ
comment by JoshuaZ · 2010-11-17T19:23:41.041Z · LW(p) · GW(p)

How sure are you that you are not in an approximate simulation of a more precisely detailed reality, with the precision of your expectations scaled down proportionally with the precision of your observations?

I don't know if I am or am not in a simulation. But if one has a reasonably FOOMed AI it becomes much more plausible that it would be able to tell. It might be able to detect minor discrepancies. Also, I'd assign a much higher probability to the possibility that I'm in a simulation if I knew that detailed simulations are possible in our universe. If the smart AI determines that it is in a universe that doesn't allow detailed simulations at any plausible resource level, then the chance that it is in a simulation should be low.

Replies from: JGWeissman
comment by JGWeissman · 2010-11-17T19:33:38.515Z · LW(p) · GW(p)

My point is that the simulation does not have to be as detailed as reality, in part because the agents within the simulation don't have any reliable experience of being in reality, being themselves less detailed than "real" agents, and so don't know what level of detail to expect. A simulation could even have simplified reality plus a global rule that manipulates any agent's working memory to remove any realization it might have that it is in a simulation.

Replies from: JoshuaZ
comment by JoshuaZ · 2010-11-17T19:51:24.995Z · LW(p) · GW(p)

That requires very detailed rules about manipulating agents within the system rather than doing a straight physics simulation (otherwise, what do you do when it modifies its memory system?). I'm not arguing that it isn't possibly doable, just that it doesn't necessarily seem likely.