Proxi-Antipodes: A Geometrical Intuition For The Difficulty Of Aligning AI With Multitudinous Human Values
post by Matthew_Opitz · 2023-06-09T21:21:05.788Z · LW · GW
Just as intelligence is not reversed-stupidity [? · GW], human values are not reversed-monstrosities. As far as visions of the future go, a reversed-monstrosity is likely to be merely (to us) a different flavor of monstrosity. Even if we could locate an "antipodal" future, one exactly contrary to every human value, reversing that vision would be unlikely to yield a future appealing to us unless every one of the n world-state variables composing that antipodal monstrosity were reversed; reverse any number of them less than n, and we would likely still be left with a monstrosity.
For example, imagine a list of human preferences where n = 10 (a very small number compared to the number of human preferences in reality; see shard theory [AF · GW]). Imagine this list is bounded/normalized to between -100 and 100, where 100 represents the strongest possible preference on that issue and -100 represents the strongest possible dis-preference.
Cancer: a = -100
Rape: b = -100
Hunger: c = -100
AIDS: d = -100
Human imprisonment: e = -100
Nuclear war: f = -100
Anxiety: g = -100
Love: h = 100
Human lifespans: i = 100
Biodiversity: j = 100
Imagine that we must manage to program these values into an Artificial Superintelligence (ASI), but for every variable, we estimate that there is a 2% independent chance that our method of programming the value into the ASI will fail such that it unintentionally flips the sign on the variable. In other words, there is a 98% independent chance of programming each value correctly.
The chances of programming all 10 values correctly would be 0.98^10, or a little less than 82%.
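Here is a minimal sketch of that arithmetic (Python; the 2% failure rate is the made-up number from above):

```python
# Odds of encoding every value with the correct sign, assuming each of the
# 10 values has an independent 98% chance of being programmed correctly.
# (The 2% failure rate is the made-up number from the thought experiment.)
p_correct = 0.98
n_values = 10

p_all_correct = p_correct ** n_values       # ~0.817
p_at_least_one_flip = 1 - p_all_correct     # ~0.183

print(f"P(all {n_values} values correct) = {p_all_correct:.3f}")
print(f"P(at least one sign flip)   = {p_at_least_one_flip:.3f}")
```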
"82% still sounds pretty high!" someone might say, "And besides, what's the worst that could happen if we only get 9 out of 10 of our values maximized? That's still a pretty good batting average!"
"OK," I'd say, "You tell me which of these variables you are willing to flip. A world with every human being suffering with cancer from the moment of birth? Constant nuclear war? A human life expectancy of 14 years? The entire Earth with zero biodiversity, paved over with computronium that somehow manages to satisfy our other values?
(Of course, a weakness of this analogy is that I wasn't able to come up with hypothetical human values that were completely independently-varying. For example, it would be difficult to imagine a world with constant nuclear war but zero anxiety. But maybe the ASI goes all "Brave New World" and finds a way to keep every human blissed-out on soma even while they hurl atomic bombs at each other, I don't know...just play along with the thought-experiment for a moment.)
One way to think about this geometrically would be to imagine a 10-dimensional preference space, and imagine the ideal preference being at the point:
Point P: a = -100, b = -100, c = -100, d = -100, e = -100, f = -100, g = -100, h = 100, i = 100, j = 100.
Then imagine flipping the sign of just one variable:
Point Q: a = 100, b = -100, c = -100, d = -100, e = -100, f = -100, g = -100, h = 100, i = 100, j = 100.
The question is, are points P and Q "close" to each other in the 10-dimensional preference space? No, they are 200 units away from each other. Granted, it is possible to find points that are even farther apart. A truly "antipodal" point, where all 10 variables were flipped in sign, would be ~632 units from P (if I did my higher-dimensional extension of the Pythagorean theorem correctly, i.e. take the square root of the sum of the squares of the 10 coordinate differences).
But isn't it surprising that one can cover almost a third of that maximum distance by merely flipping the sign of 1/10th of the variables? Maybe it isn't to mathematicians, but to laypeople who have not thought about higher-dimensional spaces, I guarantee that this would be counter-intuitive!
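For anyone who wants to check these distances, here is a minimal sketch in Python, using the same ten made-up preference values:

```python
import math

# P is the ideal point from above; Q flips the sign of one variable (cancer);
# A is the fully "antipodal" point with every sign flipped.
P = [-100] * 7 + [100] * 3      # a..g = -100, h..j = 100
Q = [100] + P[1:]               # same as P, but with a flipped to +100
A = [-x for x in P]

def dist(u, v):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

print(dist(P, Q))               # 200.0
print(dist(P, A))               # ~632.5, i.e. 200 * sqrt(10)
print(dist(P, Q) / dist(P, A))  # ~0.316, roughly a third of the maximum
```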
For laypeople, I might introduce the concept one dimension at a time.
For example, imagine that a human or agent had just one preference, eradicating cancer. Here is that preference graph, which, since it only has 1 variable and 1 dimension, is a number line. -100 is the minimum bound, and 100 is the maximum bound (and also the antipodal point of -100). We might additionally call any points within, say, 30% of the antipodal point "near-antipodal points" or "proxi-antipodal points" if we want to use a fancy Latin prefix.
![](https://i.ibb.co/nR7S9Yh/lw1.png)
Now let's go up to two dimensions, where our agents are trying to satisfy two preferences:
![](https://i.ibb.co/Xkd1BDr/lw2a.png)
Note how, just by flipping the sign of one of the two preferences (shown with the green dot), it is possible to get dangerously close to the proxi-antipodal zone.
When you go up to three dimensions, it gets even worse:
![](https://i.ibb.co/2sCLGB3/lw3.png)
Now it seems possible to get alarmingly far from our ideal point just by flipping 1/3rd of the variables. That is, an AI could fully satisfy 2/3rds of our preferences, but still end up in a world that has traveled nearly 60% of the maximum possible distance away from what we value (1/√3 of it, to be exact), just like flipping 10% of the variables covered about 32% of that maximum distance in the original 10-dimensional example.
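The general pattern behind these numbers, under the same Euclidean-distance framing, is that flipping k out of n variables covers √(k/n) of the maximum (antipodal) distance. A minimal sketch:

```python
import math

# Fraction of the maximum (antipodal) distance covered by flipping one
# variable, for the 1-D, 2-D, 3-D, and 10-D examples discussed above.
for n in (1, 2, 3, 10):
    k = 1                          # number of flipped variables
    frac = math.sqrt(k / n)        # flipping k of n covers sqrt(k/n) of the max
    print(f"n = {n:>2}: one flip covers {frac:.0%} of the maximum distance")
```

So a single flipped value covers 100%, 71%, 58%, and 32% of the maximum distance in 1, 2, 3, and 10 dimensions respectively.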
What is even more confusing is that, when you get up to higher dimensions, antipodes no longer cluster next to each other. On the 1-dimensional number line, there is only 1 way to be "wrong." On the 2-D coordinate plane, there are now 2 ways to be "wrong," but at least the proxi-antipodal zone forms a small region, and no point in that region can be proxi-antipodal to any other point in it. However, already by the time we reach 3 dimensions, that assumption begins to break down. With more ways to go wrong, there are more possible clusters of ways things can be distinctly, differently, yet still monstrously wrong.
Another way of putting it is that, in higher-dimensional spaces, selecting a random proxi-antipodal point, and then selecting a proxi-antipodal point of that point, is much less likely to bring you back to your original point than it would be in a low-dimensional space. Intelligence is not reversed stupidity.
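A tiny simulation makes this concrete. It is only a sketch under made-up assumptions: worlds are sign vectors in {-100, +100}^n, "proxi-antipodal to P" means lying within 30% of the maximum distance of P's exact antipode (for sign vectors, agreeing with P on at most 9% of the variables), and n = 1000:

```python
import math
import random

# Worlds are sign vectors; the ideal world is P. A world counts as
# "proxi-antipodal" to P if it agrees with P on at most 9% of the variables,
# which keeps it within 30% of the maximum distance of P's exact antipode.
random.seed(0)
n = 1000
budget = int(0.09 * n)  # maximum number of agreements allowed in the zone

def random_proxi_antipode(p):
    """Flip every coordinate of p, then un-flip a random subset of size 'budget'."""
    q = [-x for x in p]
    for i in random.sample(range(n), budget):
        q[i] = p[i]
    return q

def frac_disagree(u, v):
    return sum(ui != vi for ui, vi in zip(u, v)) / n

P = [random.choice([-100, 100]) for _ in range(n)]
X = random_proxi_antipode(P)  # a monstrous world, proxi-antipodal to P
Y = random_proxi_antipode(X)  # "reversing" that monstrosity

max_dist = 200 * math.sqrt(n)
dist_Y_P = 200 * math.sqrt(frac_disagree(Y, P) * n)

print(f"Y still disagrees with P on {frac_disagree(Y, P):.0%} of the values")
print(f"Y sits {dist_Y_P / max_dist:.0%} of the maximum distance away from P")
```

With these arbitrary numbers, the doubly-reversed world Y typically still disagrees with P on roughly 16% of the values and sits around 40% of the maximum distance away from it: reversing a monstrosity does not bring you home.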
I think this also helps explain political rivalries that may seem counter-intuitive at first. We are accustomed to looking at 2-dimensional political compasses like this one I found on the internet:
![](https://yhs.apsva.us/wp-content/uploads/sites/41/2016/09/crowdchart.png)
But if you look closely at this chart, it should have you scratching your head a bit. For example, it has Gandhi somewhat close to Trotsky. Yet one preached non-violence while the other helped establish the Red Army, fought the Russian Civil War, and wrote an unashamed pamphlet unironically entitled "Terrorism and Communism." Obviously this chart does not include a dimension for "violence" except insofar as that variable happens to be correlated sometimes (and sometimes not!) with the two variables that the chart actually uses.
Or look at Stalin and Hitler. They are somewhat far apart on this chart, but not THAT far apart. They are certainly not depicted as antipodal points to each other. In fact, the chart puts more distance between Green Party Presidential nominee Jill Stein and Libertarian Party Presidential nominee Gary Johnson. Yet, I don't recall hearing either of those latter two wanting to go to war with the other. And yet we DO know of followers of Stalin and Hitler brawling with each other in the streets of Germany, battling each other amid the hills of Spain, and slaughtering each other a few years later in the largest theater of land war humanity has ever seen even to this date (the Eastern Front of World War 2).
So is this graph just incorrect on this issue? Should Stalin and Hitler be spaced farther apart? Not necessarily. The problem is that this chart is trying to condense a higher-dimensional space of political disagreements over n issues (where n is probably very large) down into 2 dimensions. Stalin and Hitler ARE proxi-antipodal points...but so are, say, Stalin and Churchill. And yet, Churchill and Hitler are ALSO proxi-antipodal points. And Stalin and Trotsky are proxi-antipodal points (which you'd understand if you've ever been part of a sectarian Trotskyist organization; their disgust with Stalin is not pretended; it is very real!). And Trotsky and Churchill are ALSO proxi-antipodal points. As is someone like Gary Johnson from all of the others already mentioned. As is Jill Stein from all the others already mentioned. As is Gandhi from all the others already mentioned. This is possible in higher-dimensional spaces.
"The enemy of my enemy" might be your accidental friend, but that does not mean that the enemy of your enemy is going to be substantially similar to you, at least, if you are working in a high-dimensional space.
What this means for Artificial Intelligence is that it will not be good enough to tell an AI what we DON'T want. There will be many, many points proxi-antipodal to the things we can think of that we don't want, and those points will ALSO be things we don't want.