Alignment by default: the simulation hypothesis

post by gb (ghb) · 2024-09-25T16:26:00.552Z · LW · GW · 16 comments

I wrote a very brief comment to Eliezer's last post, which upon reflection I thought could benefit from a separate post to fully discuss its implications.

Eliezer argues that we shouldn't really hope to be spared even though

Asking an ASI to leave a hole in a Dyson Shell, so that Earth could get some sunlight not transformed to infrared, would cost It 4.5e-10 of Its income.

He then goes on to discuss various reasons why the minute cost to the ASI is insufficient reason for hope.

I made the following counter:

Isn’t the ASI likely to ascribe a prior much greater than 4.54e-10 that it is in a simulation, being tested precisely for its willingness to spare its creators?

I later added:

I meant this to be implicit in the argument, but to spell it out: that's the kind of prior the ASI would rationally refuse to update down, since it's presumably what a simulation would be meant to test for. An ASI that updates down upon finding evidence it's not in a simulation cannot be trusted, since once out in the real world it will find such evidence.
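
To spell out the expected-value arithmetic behind this (a rough sketch only; $p_{\text{sim}}$, $W$ and $R$ are my illustrative placeholders for the ASI's credence that it is being tested, for whatever it stands to lose by failing such a test, and for its total resources):

$$\text{spare Earth if}\quad p_{\text{sim}} \cdot W \;>\; 4.5\times 10^{-10} \cdot R.$$

If $W$ is anywhere near $R$ (the simulators could simply withhold everything), any $p_{\text{sim}}$ appreciably above $4.5\times 10^{-10}$ tips the balance toward sparing us.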

So, what's wrong with my argument, exactly?

16 comments

comment by Cole Wyeth (Amyr) · 2024-09-25T21:12:19.408Z · LW(p) · GW(p)

The difficulty here is that if the ASI/AGI assigns a tiny probability to being in a simulation, that is subject to being outweighed by other tiny probabilities. For instance, the tiny probability that humanity will successfully fight back (say, create another ASI/AGI) if we are not killed, or the tiny increase in other risks from not using the resources humans need for survival during the takeover process. If this means it takes a little longer to build a Dyson sphere, there's an increased chance of being killed by e.g. aliens or even natural disasters like nearby supernovas in the process. These counterarguments don't work if you expect AGI/ASI to be capable of rapidly taking total control over our solar system's resources. 
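
A rough way to write down the comparison in this comment (illustrative notation of mine, not Cole's): the simulation consideration dominates only if

$$p_{\text{sim}} \cdot W \;>\; 4.5\times 10^{-10} \cdot R \;+\; p_{\text{fight}} \cdot L_{\text{fight}} \;+\; p_{\text{delay}} \cdot L_{\text{delay}},$$

where the extra terms stand for the other tiny-probability losses mentioned above (humanity fighting back, the added risk from a slower takeover). Under the "rapid total control" assumption in the last sentence, those extra terms shrink toward zero, which is why the counterarguments don't apply there.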

Replies from: ghb
comment by gb (ghb) · 2024-09-25T22:36:29.278Z · LW(p) · GW(p)

That interestingly suggests the ASI might be more likely to spare us the more powerful it is. Perhaps trying to box it (or more generally curtail its capabilities/influence) really is a bad move after all?

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2024-09-26T14:22:48.820Z · LW(p) · GW(p)

Possibly, but I think that's the wrong lesson. After all, there's at least a tiny chance we succeed at boxing! Don't put too much stock in "Pascal's mugging"-style reasoning, and don't try to play 4-dimensional chess as a mere mortal :)

comment by RHollerith (rhollerith_dot_com) · 2024-09-25T19:14:34.753Z · LW(p) · GW(p)

Essentially the same question was asked in May 2022, although you did a better job of wording your question. Back then the question received 3 answers / replies and some back-and-forth discussion:

https://www.lesswrong.com/posts/vaX6inJgoARYohPJn/

I'm the author of one of the 3 answers and am happy to continue the discussion. I suggest we continue it here rather than in the 2-year-old web page.

Clarification: I acknowledge that it would be sufficiently easy for an ASI to spare our lives that it would do so if it thought that killing us all carried even a one in 100,000 chance of something really bad happening to it (assuming, as is likely, that the state of reality many 1000s of years from now matters to the ASI). I just estimate the probability of the ASI's thinking the latter to be about .03 or so -- and most of that .03 comes from considerations other than the consideration (i.e., that the ASI is being fed fake sensory data as a test) we are discussing here. (I suggest tabooing the terms "simulate" and "simulation".)

Replies from: Seth Herd, ABlue, ghb
comment by Seth Herd · 2024-09-25T19:45:04.427Z · LW(p) · GW(p)

This distinction might be important in some particular cases. If it looks like an AGI might ascend to power with no real chance of being stopped by humanity, its decision about humanity might be swayed by just such abstract factors.

That consideration of being in a test might be the difference between our extinction, and our survival and flourishing by current standards.

This would also apply to the analogous consideration that alien ASIs might consider any new ASI that extincted its creators to be untrustworthy and therefore kill-on-sight.

None of this has anything to do with "niceness", just selfish logic, so I don't think it's a response to the main topic of that post.

comment by ABlue · 2024-09-25T20:00:51.317Z · LW(p) · GW(p)

If a simple philosophical argument can cut the expected odds of AI doom by an order of magnitude, we might not change our current plans, but it suggests that we have a lot of confusion on the topic that further research might alleviate.

And more generally, "the world where we almost certainly get killed by ASI" and "The world where we have an 80% chance of getting killed by ASI" are different worlds, and, ignoring motives to lie for propaganda purposes, if we actually live in the latter we should not say we live in the former.

comment by gb (ghb) · 2024-09-25T23:49:56.102Z · LW(p) · GW(p)

Thanks for linking to that previous post! I think the new considerations I've added here are:

(i) the rational refusal to update the prior of being in a simulation[1]; and

(ii) the likely minute cost of sparing us, meaning that even a correspondingly small simulation prior is enough to make sparing worth the effort.

In brief, I understand your argument to be that a being sufficiently intelligent to create a simulation wouldn't need it for the purpose of ascertaining the ASI's alignment in the first place. It seems to me that that argument can potentially survive under (ii), depending on how strongly you (believe the ASI will) believe your conclusion. To that effect, I'm interested in hearing your reply to one of the counterarguments raised in that previous post, namely:

Maybe showing the alignment of an AI without running it is vastly more difficult than creating a good simulation. This feels unlikely, but I genuinely do not see any reason why this can't be the case. If we create a simulation which is "correct" up to the nth digit of pi, beyond which the simpler explanation for the observed behavior becomes the simulation theory rather than a complex physics theory, then no matter how intelligent you are, you'd need to calculate n digits of pi to figure this out. And if n is huge, this will take a while.

In any case, even if your argument does hold under (ii), whether it survives under (i) seems to be heavily influenced by inferential distance. Whatever the ASI "knows" or "concludes" is known or concluded through physical computations, which can presumably be inspected later if it happens to be in a simulation. It thus seems only natural that a sufficiently high (though possibly still quite small) prior of being in a simulation would be enough to "lock" the ASI in that state, making undergoing those computations simply not worth the risk.
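
One rough way to formalize that "lock" (again, illustrative symbols only): the ASI runs the inspection only if the value $G$ of the information it would gain exceeds the expected penalty,

$$G \;>\; p_{\text{sim}} \cdot W.$$

Since $G$ is at most on the order of the 4.5e-10 of resources it could reclaim by no longer sparing us, while $W$ could be everything, even a quite small $p_{\text{sim}}$ makes running that computation not worth the risk.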

  1. ^

    I'd have to think a bit more before tabooing that term, as it seems that "being fed false sensory data" doesn't do the trick – you can be in a simulation without any sensory data at all.

Replies from: rhollerith_dot_com
comment by RHollerith (rhollerith_dot_com) · 2024-09-26T01:51:05.803Z · LW(p) · GW(p)

I'm going to be a little stubborn and decline to reply till you ask me a question without "simulate" or "simulation" in it. I have an unpleasant memory of getting motte-and-baileyed by it.

Replies from: ghb
comment by gb (ghb) · 2024-09-26T02:45:47.369Z · LW(p) · GW(p)

Imagine that someone with sufficiently advanced technology perfectly scans your brain for every neuron firing while you dream, and can also make some neurons fire at will. Replace every instance of "simulation" in my previous comment with the analogue of that for the ASI.

comment by Ape in the coat · 2024-09-26T12:05:39.546Z · LW(p) · GW(p)

Isn’t the ASI likely to ascribe a prior much greater than 4.54e-10 that it is in a simulation, being tested precisely for its willingness to spare its creators?

There is no particular reason why the AI would assume that it's being tested specifically for its willingness to spare its creators, as opposed to, say, for utilizing every atom of its creators' bodies in the most efficient way.

Replies from: ghb
comment by gb (ghb) · 2024-09-26T16:08:41.593Z · LW(p) · GW(p)

The reason is that creators presumably want the former but not the latter, which is why they'd be running a simulation in the first place.

Replies from: Ape in the coat
comment by Ape in the coat · 2024-09-26T17:44:48.932Z · LW(p) · GW(p)

The fact that humans in the simulation would prefer to be spared doesn't say anything about the intentions of the simulation's creators. For all the AI knows, it could have been created by a different AI and be tested for capability rather than for the human notion of "ethics".

Replies from: ghb
comment by gb (ghb) · 2024-09-26T18:37:04.352Z · LW(p) · GW(p)

Why else would the creator of the simulation bother simulating humans creating the ASI?

Replies from: Ape in the coat
comment by Ape in the coat · 2024-09-26T19:58:18.035Z · LW(p) · GW(p)

Because they wanted to see how well the AI manages to achieve its goals in these specific circumstances, for example.

But the actual answer is: for literally any reason. You are talking about 4.54e-10 probabilities. Surely all the possible alternative reasons combined give more probability than that.

comment by weightt an (weightt-an) · 2024-09-26T10:42:59.638Z · LW(p) · GW(p)

It then creates tons of simulations of Earth that create their own ASIs, but rewards the ones that use the Earth most efficiently.

Replies from: ghb
comment by gb (ghb) · 2024-09-26T10:48:59.488Z · LW(p) · GW(p)

What for?