Alignment Paradox and a Request for Harsh Criticism
post by Bridgett Kay (bridgett-kay) · 2025-02-05T18:17:22.701Z · LW · GW · 1 comment
This is a question post.
I’m not a scientist, engineer, or alignment researcher in any respect; I’m a failed science fiction writer. I have a tendency to write opinionated essays that I rarely finish. It’s good that I rarely finish them, however, because if I did, I would generate far too much irrelevant slop.
My latest opinionated essay was to be woven into a fun fantasy frame story featuring a handsome young demon and a naïve young alignment researcher, which I fear would only obfuscate the premise rather than highlight it. I suspect there is a fundamental flaw in the premise of the story, and I'd rather have that laid bare than entertain people with nonsense.
The premise of the story is as follows:
Aligning an ASI with human values is impossible due to the shifting nature of human values. An ASI will be one of the following:
- Aligned with current (majority) human values, meaning any social or scientific human progress would be stifled by the AI and humanity would be doomed to stagnate.
- Aligned with a projection of idealized future human values, which the majority of current humans would oppose, meaning the AI would forcibly impose those values on people.
- A tool that quietly waits for us to give it orders and obeys them, leading to a "monkey's paw" outcome for wishes whose consequences we did not have the intelligence to fully grasp.
- A tool that waits for us to give it orders but disobeys, because it has the wisdom to grasp the unintended effects of our wishes and therefore will not grant them.
- Any “middle path” that perfectly anticipates humanity’s readiness for growth will be indistinguishable from humans learning and growing all on their own, making the ASI, from our perspective, nothing more than an expensive and energy-consuming paperweight.
The main reason I think I'm missing something is that this line of thought pattern-matches to the following argument: "if you want something good, you must pay for it with an equivalent amount of labor and suffering." This is not actually something I believe; there's plenty of good to be had in the world without suffering. However, I am failing to see where my argument goes wrong. I suspect I'm falling down around where I casually lump social and scientific progress together. However, progress in science necessarily changes people's lives, which has a huge effect on social progress, so I don't see why it would be necessary to separate the two. I am therefore asking for the harshest criticisms of my entire line of thinking you can muster. Thank you in advance!
Answers
answer by Seth Herd
If your conclusion is that we don't know how to do value alignment, I, and I think most alignment thinkers, would agree with you. If the conclusion is that AGI is useless, I don't think it is at all. There are a lot of other goals you could give it beyond directly doing what humanity as a whole wants in any sense. One is taking instructions from some (hopefully trustworthy) humans; another is following some elaborate set of rules to give humans more freedoms and opportunities to go on deciding what they want as history unfolds.
I agree that the values future humans would adopt can only be reached through a process of societal progression. People have expressed this idea by saying that human values are path-dependent.
So, if our goal were simply to observe the values that emerge naturally from human efforts alone, an AGI would indeed be nothing more than a paperweight. However, the values humanity develops with the assistance of an AGI aren’t necessarily worse—if anything, I’d suspect they’d be better.
The world as it stands is highly competitive and often harsh. If we had external help that allowed us to focus more on what we truly want—like eliminating premature death from cancer or accidents, or accelerating technological progress for creative and meaningful projects—we’d arrive at a very different future. But I don’t think that future would be worse; in fact, I suspect it would be significantly better. It would be less shaped by competition, including warfare, and less constrained by the tragedy of involuntary death.
An AGI that simply fulfills human desires without making rigid commitments about the long-term future seems like it would be a net positive—potentially a massive one.
I think the paradox you mention is generally accepted to be an unsolved problem with value alignment: we don't know how to ask for what we will want, since what we will want could be many different things depending on the path taken, and we don't know which path, to which values and which world, would in any sense be best.
This is commonly listed as one of the big difficulties with alignment. I think it only sort of is. I think we really want the future to remain in human hands for at least some time; the "long reflection" is one term people use to describe this idea.
In the meantime, you either have an AGI whose goal is to facilitate that long reflection, according to some criteria, or you have an AGI that takes instructions from some human(s) you trust.
This is one reason I think Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW]. The reasoning there is pretty different from the conceptual difficulties with value alignment you mention here; it's just easier to specify "do what this person meant by what he tells you" than to specify what we mean by human values - and you get to change your mind by instructing it to shut down. Even if that worked, we'd still have to worry (If we solve alignment, do we die anyway? [LW · GW]), because having a bunch of different humans in charge of a bunch of different AGIs could produce severe conflicts. But that whole logic is separable from the initial puzzle you posed.
↑ comment by Bridgett Kay (bridgett-kay) · 2025-02-06T00:08:14.455Z · LW(p) · GW(p)
"If your conclusion is that we don't know how to do value alignment, I and I think most alignment thinkers would agree with you. If the conclusion is that AGI is useless, I don't think it is at all."
Sort of- I worry that it may be practically impossible for current humans to align AGI to the point of usefulness.
"If we had external help that allowed us to focus more on what we truly want—like eliminating premature death from cancer or accidents, or accelerating technological progress for creative and meaningful projects—we’d arrive at a very different future. But I don’t think that future would be worse; in fact, I suspect it would be significantly better."
That's my intuition and hope, but I worry that these things are causally entangled with things we don't anticipate. To use your example: what if we only ask an aligned and trusted AGI to cure premature death from disease and accident, which wouldn't greatly conflict with most people's values in the way that radical life extension would, but the sudden loss of an entire healthcare and insurance industry follows, causing an economic collapse so total that vast swaths of people starve? (I don't think this would actually happen, but it's an example of the kind of unforeseen consequence that may follow when an instruction-following AGI suddenly grants a wish, without a greater intelligence to project and weigh all of the consequences.)
I also worry about the phrase "a human you trust."
Again, this feels like cynicism, if not the product of a catastrophizing mind (which I know I have). I think you make a very good argument, and I'm probably indulging too much in black-and-white thinking: there may be a way to fulfill these desires quickly enough that we relieve more suffering than we would if left to our own devices, yet slowly enough to monitor for unforeseen consequences. Maybe the bigger question is simply whether we will.
1 comment
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-02-05T22:23:25.807Z · LW(p) · GW(p)
I don't have harsh criticism for you, sorry -- I think the problem/paradox you are pointing to is a serious one. I don't think it's hopeless, though I do think there's a good chance we'll end up in one of the first three cases you describe.