Does the existence of shared human values imply alignment is "easy"?
post by Morpheus · 2022-09-26T18:01:10.661Z · LW · GW · 2 comments
This is a question post.
Contents
- Observation: Humans seem to actually care about each other. How was culture + evolution able to achieve this?
- Answers
  - Charlie Steiner (5)
  - jacob_cannell (4)
  - Radford Neal (3)
  - TAG (2)
  - Noosphere89 (2)
  - Jeff Rose (2)
- Comments (2)
Observation: Humans seem to actually care about each other. How was culture + evolution able to achieve this?
While not with 100% confidence (maybe 50-95%), I would trust most humans not to push a big red button that kills all humans, even if pushing it gave that person some large benefit to their own experience in this weird hypothetical. Does this mean that most humans are aligned? Is there a reason to believe intellectually amplified humans would not be aligned in this way? How did evolution manage to encode this value of altruism, and does the possibility of this imply that we should also be able to do it with AI?
This seems like the sort of confused question that has been asked before. If so, I'd appreciate a link.
Answers

answer by Charlie Steiner
Why did you pick caring about each other as a thing culture + evolution was trying to do?
You're not alone in making what I think is the same mistake. I think it's actually quite common to feel like it's amazing that evolution managed to come up with us humans who like beauty and friendship and the sanctity of human life and so on - evolution and culture must have been doing something right, to come up with such great ideas.
But in the end, no; evolution is impressive but not in that way. You picked caring about each other as the target because humans value it - a straightforward case of painting the target around the bullet-hole.
↑ comment by Morpheus · 2022-09-28T21:13:22.998Z · LW(p) · GW(p)
I don't find it amazing or anything. It's more like... I don't know how to write the pseudocode for an AI that actually cares about human welfare. In my mind, that is pretty close to something that tries to be aligned. But if even evolution managed to create agents capable of this by accident, then it might not be that hard.
answer by jacob_cannell

Evolution actually solved alignment (approximately) on two levels: aligning brains with evolution's goal of inclusive genetic fitness (to be fruitful and multiply), even though the easy/natural goal attractor for intelligence is empowerment, and aligning brains with other brains, because the unit of selection (genes) is spread across individual bodies (the disposable soma). There are many obvious failure examples, but that's hardly surprising, because evolution proceeds by breaking things.
The first type of alignment (towards genetic fitness) gives rise to lust, sexual attraction, parenting desires, etc. The second form of alignment (to other humans) - more relevant for AGI - manifests as empathy, love, and altruism.
answer by Radford Neal

For sexually reproducing species, evolution will disfavour genes that lead to individuals slaughtering others of their species for little or no reason. Slaughtering members of the opposite sex obviously reduces mating opportunities. Slaughtering members of the same sex reduces the mating opportunities of their opposite-sex children.
This obviously hasn't completely prevented humans from slaughtering both particular individuals and entire populations of humans of a different tribe. But this isn't what always happens. It could be a lot worse.
Maybe it is worse for asexual species (mitigated by the slower pace of evolution without sex), perhaps explaining why, at least among vertebrates, asexuality is rare and asexual species seem not to last long.
answer by TAG

I would trust most humans who existed historically to push a button that killed their specific enemies.
answer by Noosphere89

No, and here's why:
- Evolution's goal, to the extent that it even has a goal, is so easy to satisfy that literally any life/self-replicating tech could satisfy it. It cares about reproductive fitness, and that goal requires almost no capabilities. It cares nothing for specifics, but our goals require much, much more precision than that.
- Evolution gives almost zero optimization power to capabilities, while far, far more optimization power is dedicated to capabilities by the likes of DeepMind/OpenAI. To put it lightly, there's a 30-60% chance that we get much, much stronger capabilities than evolution this century.
- The better example is how humans treat animals, and here humans are very misaligned with animals, with the exception of pets (and I say this despite not regarding environmentalism/nature as a good thing). So no, I disagree with the premise of your question.
answer by Jeff Rose

Humans obtain value from other humans and depend on them for their existence. It is hypothesized that AGIs will not depend on humans for their existence. Thus, humans who would not push the button to kill all other humans may choose not to do so for reasons of utility that don't apply to AGI. Your hypothetical assumes this difference away, but our observations of humans don't.
As you note, human morality and values were shaped by evolutionary and cultural pressure in favor of cooperation with other humans. The way this presumably worked is that humans who were less able or willing to cooperate tended to die more frequently, and cultures that were less able or willing to do so were conquered and destroyed. It is unclear how we would be able to replicate this, or how well it translates.
It is unclear how many humans would actually choose to press this button. Your guess is that between 5% and 50% of humans would choose to do so.
That doesn't suggest humans are very aligned; rather the opposite. It means that if we have between 2 and 20 AGIs (and those numbers don't seem unreasonable) between 1 and 10 would choose to destroy humanity. Of course, extinction is the extreme version; having an AGI could also result in other negative consequences.
↑ comment by Morpheus · 2022-09-26T19:21:30.065Z · LW(p) · GW(p)
That doesn't suggest humans are very aligned; rather the opposite. It means that if we have between 2 and 20 AGIs (and those numbers don't seem unreasonable) between 1 and 10 would choose to destroy humanity.
I think I might have started from a more pessimistic standpoint? It's more like: I could also imagine living in a world where humans cooperate not because they actually care about each other, but because they just pretend to? Introspection tells me that doesn't apply to me, though maybe I evolved not to be conscious of my own selfishness? I am even less sure how altruistic other people are, because I haven't asked lots of people: "Would you press a button that annihilates everyone after your death, if in return you get an awesome life?" On the other hand, cooperation would probably be hard for us in such a world, so this is not that surprising?
2 comments
Comments sorted by top scores.
comment by Dagon · 2022-09-27T17:31:04.852Z · LW(p) · GW(p)
Observation: Humans seem to actually care about each other.
You need to observe more (or better). Most humans care about some aspects of a few other humans' experience. And care quite a bit less about the general welfare (with no precision in what "caring" or "welfare" means) of a large subset of humans. And care hardly at all, or even negatively, about some subset (whose size varies) of other humans.
Sure, most wouldn't go out of their way to kill all, or even most, or even a large number of unknown humans. We don't know how big a benefit it would take to change this, of course; it's not an experiment one can (or should) run. We have a lot of evidence that many, many people (I expect "most", but I don't know how to quantify the numerator or the denominator) can be convinced to harm or kill other humans in circumstances where those others are framed as enemies, even when they're not any immediate threat.
There are very common human values (and even then, not universal; psychopaths exist), but they're things like "working on behalf of the ingroup", "being suspicious of or murderous toward the outgroup", and "pursuing relative status games among the group(s) one isn't trying to kill".
↑ comment by Morpheus · 2022-09-28T11:53:49.051Z · LW(p) · GW(p)
Yeah, I think the "working on behalf of ingroup" one might be rather powerful, and I was aware that this is probably a case where I just interact mostly with people who consider "humans" to be the ingroup. I don't think the share of the population who shares this view is actually as important as the fact that a sizeable number of people hold this position at all. Maybe I should have phrased it as: I do care about everyone to some extent. Is that what we want to achieve when we talk about alignment?