are there 2 types of alignment?
post by KvmanThinking (avery-liu) · 2025-01-23T00:08:20.885Z · LW · GW
This is a question post.
Contents
Answers
- peterbarnett (8)
- Seth Herd (6)
- Robert Cousineau (5)
- RogerDearnaley (3)
- CstineSublime (2)
- Milan W (1)
You guys seem to talk about alignment in 2 different ways. They are:
- making an AI that builds utopia and stuff
- making an AI that does what we mean when we say "minimize rate of cancer" (that is, actually curing cancer in a reasonable and non-solar-system-disassembling way)
Do you have separate names for each of them? If so, why don't you use them more often, to avoid confusion? If not, is it because they are actually the same thing in some complicated way that I am failing to understand?
Answers
answer by peterbarnett
Yep. People often talk about "Coherent Extrapolated Volition" (CEV) alignment, and Corrigibility (in the MIRI [LW · GW]/Yudkowsky [LW · GW] sense rather than the Christiano [LW · GW] sense).
I think these two things roughly correspond to the two things you wrote.
answer by Seth Herd
Yes, precisely. I wrote a post on exactly this:
Conflating value alignment and intent alignment is causing confusion [LW · GW]
answer by Robert Cousineau
I'm having trouble remembering many times when people here have said "AI Alignment" in a way that would be best described as "making an AI that builds utopia and stuff". Maybe Coherent Extrapolated Volition [? · GW] would be close.
My general understanding is that when people here talk about AI Alignment, they are talking about something closer to what you call "making an AI that does what we mean when we say 'minimize rate of cancer' (that is, actually curing cancer in a reasonable and non-solar-system-disassembling way)".
On a somewhat related point, I'd say that "making an AI that does what we mean when we say 'minimize rate of cancer' (that is, actually curing cancer in a reasonable and non-solar-system-disassembling way)" is entirely encapsulated under "making an AI that builds utopia and stuff", as it is very unlikely that an AI builds a utopia while misunderstanding what we intended its goal to be that badly.
You would likely enjoy reading through this (short) post: Clarifying inner alignment terminology [LW · GW], and I expect it would help you get a better understanding of what people mean when they are discussing AI Alignment.
Another resource you might enjoy is reading through the AI tag and its subtags: https://www.lesswrong.com/tag/ai [? · GW]
PS: In the future, I'd probably make posts like this in the Open Thread [LW · GW].
↑ comment by KvmanThinking (avery-liu) · 2025-01-23T00:33:37.891Z · LW(p) · GW(p)
By "making an AI that builds utopia and stuff" I mean an AI that, rather than simply obeying the intent of its prompters, goes and actively improves the world in the optimal way. An AI which has fully worked out Fun Theory and simply goes around filling the universe with pleasure and beauty and freedom and love and complexity in such a way that no other way would be more Fun.
↑ comment by Robert Cousineau (robert-cousineau) · 2025-01-23T02:08:07.838Z · LW(p) · GW(p)
That would be described well by the CEV link above.
answer by RogerDearnaley
I guess the way I look at it is that "alignment" means "an AI system whose terminal goal is to achieve your goals". The distinction here is then whether the word "your" means something closer to:
1. the current user making the current request
2. the current user making the current request, as long as the request is legal and inside the terms of service
3. the shareholders of the foundation lab that made the AI
4. all (right-thinking) citizens of the country that foundation lab is in (and perhaps its allies)
5. all humans everywhere, now and in the future
6. all sapient living beings everywhere, now and in the future
7. something even more inclusive
Your first option would be somewhere around item 5 or 6 on this list, while your second option would be closer to item 1, 2, or 3.
If AI doesn't kill or disenfranchise all of us, then which option on this spectrum of possibilities ends up being implemented is going to make a huge difference to how history will play out over the next few decades.
answer by CstineSublime
Yes, they do have separate names. The first is really about "the singularity": this post here [LW · GW] pins a lot of faith in many utopic things becoming possible "after the singularity", and that seems to be what you're conflating with alignment. The assumption is that there will be a point where AIs are so "intelligent" that they are capable of remarkable things (and, in that post, it is hoped that these utopic things follow from that wild increase in intelligence). Here [LW · GW], by contrast, "alignment" refers more generally to making a system (including but not limited to an AI) fine-tuned to achieve some kind of goal:
> Let's start with the simplest kind of system for which it makes sense to talk about "alignment" at all: a system which has been optimized [LW · GW] for something, or is at least well compressed by modeling it as having been optimized.
Later on, he repeats:
> The simplest pattern for which “alignment” makes sense at all is a chunk of the environment which looks like it’s been optimized for something. In that case, we can ask whether the goal-it-looks-like-the-chunk-has-been-optimized-for is “aligned” with what we want, versus orthogonal or opposed.
The "problem" is the "what we want" bit, which that post discusses at length.
answer by Milan W
Your observation is correct. We can see that having a single word ("alignment") mean two things is bad. We're just doing a bad job at coordinating to change this situation.
↑ comment by Robert Cousineau (robert-cousineau) · 2025-01-23T00:31:23.894Z · LW(p) · GW(p)
I think having a single word like "Alignment" mean multiple things is quite useful, similar to how I think having a single word like "Dog" mean many things is also useful.
No comments