Naive comments on AGI alignment
post by Ericf · 2022-04-28T01:08:02.507Z · LW · GW · 4 comments
Disclaimer: everything I know about AI I learned from reading stuff on the internet, mostly on this site.
1. Any self-modifying general intelligence cannot be bound by a utility function. The very nature of being able to self-modify means that the utility function is open to modification or to being ignored in favor of a new function.
2. A self-improving process cannot know which variation is better without external feedback. For humans, that is the physics of the universe and the behavior of other humans. A program can't get the feedback to know what "improvement" is without affecting that same external world.
3. A superintelligence will have different ethics. But we have different ethics than humans 1000 years ago did, or even than other humans alive right now. Should we be seeking to impose our values on something that has a superior grasp of reality?
4. An AI can only destroy everything by acting on the physical world. Which means telling humans to do things. And usually paying them.
4a. Simple AI safety step: ban all crypto.
4b. Coordinating people is a hard problem with unique solutions each time. There is no corpus of training data for it. No one wrote down exactly what every foreman and manager said to get the Tokyo Olympics to happen, or any other large-scale project.
4c. See #3 above. People have different values and priorities from each other. There is literally nothing an AI could attempt to do that would not be in direct opposition to someone's deeply held beliefs.
4 comments
comment by Charlie Steiner · 2022-04-28T14:05:56.314Z · LW(p) · GW(p)
Welcome!
-
See the "Ghandi argument" - if you offer Ghandi a pill that makes him love murdering people, he won't take the pill because right now he doesn't want to murder people. A self-modifying AI that wants things will tend to avoid changing what it wants, because it can predict that would lead to things it doesn't want.
-
Why? Suppose the AI is using quicksort in a place where it should be using radix sort. Surely it's allowed to deduce this rather than learning it by trial and error. I think you might be lumping something like "moral self-improvement" and "algorithmic self-improvement" together when they're actually distinct.
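As a hedged illustration of "deducing rather than trial and error" (my own sketch, not Charlie's; the constants are rough textbook figures, not measurements): given a cost model it already knows, a program can conclude that LSD radix sort should beat quicksort on large arrays of fixed-width integers without ever running either one.

```python
# Rough cost-model comparison; constants are illustrative assumptions.
import math

def quicksort_cost(n):
    # Expected comparisons for randomized quicksort: ~1.39 * n * log2(n).
    return 1.39 * n * math.log2(n)

def radix_sort_cost(n, key_bits=32, digit_bits=8):
    # LSD radix sort: (key_bits / digit_bits) passes of O(n) work each.
    return (key_bits / digit_bits) * n

n = 10_000_000
print(f"quicksort  ~ {quicksort_cost(n):.2e} modeled operations")   # ~3.2e+08
print(f"radix sort ~ {radix_sort_cost(n):.2e} modeled operations")  # ~4.0e+07
```

Whether "fewer modeled operations" is the right notion of "better" is, of course, exactly the question Ericf raises below.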
-
Yes.
It's tempting to think that a superhuman AI will have cool, interesting values even if it ends up wiping out humanity. But this underestimates how picky human aesthetics are - if you sampled from truly random values, you would practically never get something whose optimum looked like a cool future, and practically always get something whose optimum was dead and drab. The only way to get a cool future is by getting a blank piece of silicon to aim towards things we value.
↑ comment by Ericf · 2022-04-29T03:01:20.754Z · LW(p) · GW(p)
- Conversely, if FDR wants a chicken in every pot, and then finds out that chickens don't exist, he would change his values to want a beef roast in every pot, or some such.
- How could it possibly deduce that without reference to some real-world effect? There is no reason a priori to prefer one sort to another. That preference involves valuing coming to the conclusion using fewer calculations (of what kind?), less time (or maybe more time, or a more consistent amount of time?), or less risk of error. And the same applies to any other change: knowing which version is better requires both a measurement system and an evaluation of each candidate. And for any novel problem, the answer, by definition, won't be available for lookup. (See the sketch after this list.)
- The goals of an AGI are not uniformly drawn from all possible goals.
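To make that point concrete (my own sketch, not Ericf's; the cost numbers are made up): the "better" variation flips depending on which measurement system you adopt.

```python
# Made-up, illustrative costs for sorting the same input two ways.
candidates = {
    "quicksort":  {"time_units": 320, "extra_memory": 1,   "lines_of_code": 40},
    "radix sort": {"time_units": 40,  "extra_memory": 400, "lines_of_code": 90},
}

for metric in ["time_units", "extra_memory", "lines_of_code"]:
    best = min(candidates, key=lambda name: candidates[name][metric])
    print(f"best by {metric}: {best}")
# best by time_units: radix sort
# best by extra_memory: quicksort
# best by lines_of_code: quicksort
```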
↑ comment by TLW · 2022-04-29T13:14:42.764Z · LW(p) · GW(p)
Conversely, if FDR wants a chicken in every pot, and then finds out that chickens don't exist, he would change his values to want a beef roast in every pot, or some such.
I do not believe his value function is "a chicken in every pot". It's likely closer to 'I don't want anyone to be unable to feed themselves', although even this is likely an over-approximation of the true utility function. 'A chicken in every pot' is one way of doing well on said utility function. If he found out that chickens didn't exist, the 'next best thing' might be a roast beef in every pot, or somesuch. This is not changing the value function itself, merely the optimum[1] solution.
If FDR's true value function were literally "a chicken in every pot", with no tiebreaker, then he has no incentive to change his values, and a weak incentive not to change his values (after all, it's possible that everyone was mistaken, or that he could invent chicken).
If FDR's true value function were instead, e.g., "a chicken in every pot, or barring that some other similar food", then again he has no incentive to change his values. He may lean toward 'OK, it's very unlikely that chickens exist, so it's better in expected value to work towards a roast beef in every pot', but that again hasn't changed the underlying utility function.
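A compact sketch of TLW's distinction (my illustration; the utilities are invented numbers): the utility function never changes, only the best choice among the plans that remain feasible.

```python
# Fixed utility function over outcomes; only the feasible set changes.
utility = {"chicken in every pot": 10, "roast beef in every pot": 8, "empty pots": 0}

def best_plan(feasible):
    # Maximize the same, unchanged utility function over whatever is feasible.
    return max(feasible, key=lambda plan: utility[plan])

print(best_plan(["chicken in every pot", "roast beef in every pot", "empty pots"]))
# -> chicken in every pot
print(best_plan(["roast beef in every pot", "empty pots"]))  # chickens don't exist
# -> roast beef in every pot
```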
[1] This isn't likely to be the optimum, but at least is a 'good' point.
comment by TimK · 2022-04-30T02:21:22.813Z · LW(p) · GW(p)
4b. No, coordinating people is not a hard problem requiring unique solutions each time. Mega-project management is a science with a well-defined vocabulary and structure; there is very definitely a corpus of training data for it. It's also not necessary to know every single detail of any one mega-project in order to implement another one - the phrase "work unit" is used in project management to denote this principle. Your conclusions for this section are built on multiple misconceptions and category errors.