Sufficiently many Godzillas as an alignment strategy

post by 142857 · 2022-08-28T00:08:02.666Z · LW · GW · 3 comments

Assuming that alignment by default [LW · GW] happens with nontrivial probability, one way to produce human-aligned AGIs would be to simultaneously create sufficiently many (different) AGIs. This leads to a multipolar scenario in which a few AGIs are aligned with humans and many are unaligned. (By unaligned I mean unaligned with humans; these AGIs may or may not be aligned with some other goal.)

Although it is true that, in general, having many AGIs is not good [LW · GW], the idea is that if some of those AGIs are aligned, this may be a better outcome than having just one AGI that is probably not aligned. Moreover, once there are enough aligned AGIs, all working in the same direction, they may be able to overcome the unaligned AGIs (which are each working towards different goals, likely not directly opposed to those of the aligned AGIs). So it is useful to explore how the total number of AGIs in this scenario affects humanity's chances of survival. To make this more concrete, I pose the following question (but feel free to discuss the idea in general):


Assuming that all of the AGIs have roughly the same cognitive power, rank the following scenarios from best to worst outcome for humanity.
A) 1 AGI, with a 10% chance of being aligned
B) 1 aligned AGI and 9 unaligned AGIs
C) 10 aligned AGIs and 90 unaligned AGIs
D) 1000 aligned AGIs and 9000 unaligned AGIs
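
As a purely illustrative toy model (a sketch of my own, not part of the question), the snippet below bakes in the optimistic assumptions of the paragraph above: every AGI draws an equal random amount of "power", the aligned AGIs pool theirs because they share a goal, the unaligned AGIs never coalesce, and humanity survives exactly when the aligned coalition out-powers the strongest single unaligned AGI. All of these assumptions are contestable (see the comments); the snippet only shows how the numbers behave under them.

```python
import random

def toy_survival_prob(n_aligned, n_unaligned, trials=2_000, seed=0):
    """Toy model: each AGI draws power ~ Uniform(0, 1). Aligned AGIs pool
    their power (shared goal); unaligned AGIs act alone (divergent goals).
    'Survival' means the aligned coalition out-powers the strongest
    single unaligned AGI."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        aligned_power = sum(rng.random() for _ in range(n_aligned))
        strongest_unaligned = max((rng.random() for _ in range(n_unaligned)), default=0.0)
        wins += aligned_power > strongest_unaligned
    return wins / trials

scenarios = {"B": (1, 9), "C": (10, 90), "D": (1000, 9000)}
for label, (n_aligned, n_unaligned) in scenarios.items():
    print(f"{label}: toy survival probability ~ {toy_survival_prob(n_aligned, n_unaligned):.2f}")
```

Under these (very favorable) assumptions, scenario B gives roughly a 10% survival chance while C and D approach 1; the comments below dispute exactly these assumptions.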
 

3 comments


comment by Multicore (KaynanK) · 2022-08-28T02:15:13.901Z · LW(p) · GW(p)

I think in a lot of people's models, "10% chance of alignment by default" means "if you make a bunch of AIs, 10% chance that all of them are aligned, 90% chance that none of them are aligned", not "if you make a bunch of AIs, 10% of them will be aligned and 90% of them won't be".

And the 10% estimate just represents our ignorance about the true nature of reality; it's already true either that alignment happens by default or that it doesn't; we just don't know which yet.
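
A minimal Monte Carlo sketch of the distinction, under the simplest possible model (the 10% figure and the choice of 10 AGIs are just the numbers from the post): the first function treats alignment-by-default as one fact about reality shared by all AGIs, the second treats each AGI as independently aligned with probability 10%.

```python
import random

P_ALIGNED = 0.10   # the 10% figure from the post
N_AGIS = 10        # e.g. the scale of scenarios B and C
TRIALS = 100_000

def correlated_world():
    """One coin flip decides whether alignment-by-default holds for all AGIs at once."""
    aligned_by_default = random.random() < P_ALIGNED
    return N_AGIS if aligned_by_default else 0   # number of aligned AGIs

def independent_world():
    """Each AGI is independently aligned with probability P_ALIGNED."""
    return sum(random.random() < P_ALIGNED for _ in range(N_AGIS))

for world in (correlated_world, independent_world):
    counts = [world() for _ in range(TRIALS)]
    mean_aligned = sum(counts) / TRIALS
    p_at_least_one = sum(c > 0 for c in counts) / TRIALS
    print(f"{world.__name__}: mean aligned ~ {mean_aligned:.2f}, "
          f"P(at least one aligned) ~ {p_at_least_one:.2f}")
```

The expected number of aligned AGIs is the same (one) in both readings, but under the correlated reading building more AGIs never raises the ~10% chance that any of them are aligned, whereas under the independent reading the chance of at least one aligned AGI climbs towards certainty as the count grows.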

comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2022-08-28T00:52:45.203Z · LW(p) · GW(p)

Scenario A has an almost 10% chance of survival; the others ~0%. To quote John Wentworth's post [LW · GW], which I strongly agree with:

What I like about the Godzilla analogy is that it gives a strategic intuition which much better matches the real world. When someone claims that their elaborate clever scheme will allow us to safely summon Godzilla in order to fight Mega-Godzilla, the intuitively-obviously-correct response is “THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO”.

comment by avturchin · 2022-08-28T10:04:08.939Z · LW(p) · GW(p)

The problem is that having both aligned and non-aligned AIs likely means war between AIs. Such a war may be even worse than a single non-aligned AI, since a non-aligned AI might choose to preserve humans for some instrumental reasons.

However, if there is a war, the non-aligned AIs will have an incentive to blackmail the aligned AIs by torturing as many people as possible.