Comments
I kind of feel like it’s the opposite: people actually do anchor their imagination about the future on science fiction, & this is part of the problem here. Lots of science fiction features a world with a bunch of human-level AIs walking around but where humans are still comfortably in charge and non-obsolete, even though it’s hard to argue for why this would actually happen.
The classic AI-futurists (Moravec/Minsky/Kurzweil) believed that AI would be our "mind children", absorbing our culture and beliefs by default.
At this stage, this doesn’t seem obviously wrong. If you think the path to AGI will come via LLM extension rather than via experiencing the world in an RL regime, the AI will only have our cultural output with which to make sense of the world.
I like “rogue AI” over “uncontrollable AI” because it substitutes a one-syllable word for a five-syllable one, but otherwise I agree.
Also, my experience in talking with people about this topic is that most “normies” find AI scary & would prefer it not be developed, but for whatever reason the argument for a singularity or intelligence explosion, in which human-level artificial intelligence is expected to rapidly yield superhuman AGI, is unconvincing or silly-seeming to most people outside this bubble, including technical people. I’m not really sure why.
I've read a lot of the doomer content on here about AGI and am still unconvinced that alignment is difficult by default. I think if you generalize from the way humans are "aligned", the prospect of aligning an AGI well looks pretty good. The pessimistic views on this all seem to come to the opposite conclusion by arguing "evolution failed to align humans, by its own standards". However:
- Evolution isn't an agent attempting to align humans, or even a concrete active force acting on humans; it is merely the effect of a repeatedly applied filter.
- The equivalent of evolution in the development of AGI is not the training of a model; it's the process of researchers developing more sophisticated architectures. The training of a model is more like the early stages of a human's life.
If you follow models from developmental psychology (rather than evolutionary psychology, which is a standard frame on libertarian-adjacent blogs but not in the psychological mainstream), alignment works almost too well. For instance, in the standard picture from psychoanalysis, a patient will go into therapy for years attempting to rid himself of the negative influence of a paternal superego, without success. A sort of "ghost" of the judgment-wielding figure from the training stage will hover around him his entire life. Lacan, for example, asserts that attempting to "forget" the role of the parental figure inevitably leads to psychosis, because any sort of coherent self is predicated on the symbolic presence of that figure: the model of the self as an independent object separated from the world is created in the first place to facilitate an operating relation with the parent. Without any stable representation of the self, the different parts of the mind can't function as a whole, and they break down into psychosis.
Translated to a neural-network context, we can perhaps imagine that if a strategy model is trained in a learning regimen where it receives judgment on its actions from authority figures (perhaps 100 simulated LLM chatbot personalities with different ethical orientations, to make up for the fact that we don't have a unified ethical theory & never will), it will never develop the pathways that would allow it to do evil acts, similar to how Stable Diffusion can't draw pornography unless it's fine-tuned.
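To make the shape of that regimen concrete, here's a minimal sketch, purely my own illustration of one way the "ensemble of judging personas" idea could be wired up (the names `JUDGE_PERSONAS`, `judge_score`, etc. are hypothetical, and the random scores stand in for real LLM judgments):

```python
import random

# Stand-ins for the ~100 simulated judge personas with different ethical orientations.
JUDGE_PERSONAS = ["deontologist", "utilitarian", "virtue_ethicist"]

def judge_score(persona: str, action: str) -> float:
    """Placeholder for prompting an LLM judge persona; returns approval in [-1, 1]."""
    # A real regimen would query an LLM conditioned on the persona's ethical frame.
    return random.uniform(-1.0, 1.0)

def ensemble_reward(action: str) -> float:
    """Pessimistic aggregation: the harshest judge sets the reward."""
    return min(judge_score(p, action) for p in JUDGE_PERSONAS)

def pick_action(candidates: list[str]) -> str:
    """Choose the candidate action the judge ensemble objects to the least."""
    return max(candidates, key=ensemble_reward)

if __name__ == "__main__":
    print(pick_action(["help the user", "deceive the user", "refuse and explain why"]))
```

In a real training setup the reward would feed into fine-tuning of the strategy model rather than a greedy argmax, but the pessimistic min-aggregation is the point: an action that any of the judging personas strongly condemns never gets reinforced, so the corresponding pathways never develop.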
Furthermore, I expect that the neural network's learning regimen would be massively more effective than a standard routine of childhood discipline in yielding benevolence, both because it would lack all the other reasons that humans discover to be selfish & malicious, & because you could deliberately train it on difficult ethical problems, whereas a typical human has to extrapolate an ethical theory entirely from the much easier problems in the training environment of "how do I stay a good boy & maintain parental love".
These examples seem like capabilities failures rather than alignment failures. Reading them doesn’t make me feel any more convinced that there will be rebellious AI, accidental paperclip maximizers, deceptive alignment, etc.
In the first example, the environment the AI is in suddenly changes, and the AI is not given the capability to learn and adapt to this change. So of course it fails.
In the second example, the AI is given the ability to continuously learn and adapt, and in this case, it actually succeeds at the intended goal. It almost depopulates the trees, because it’s a relatively simple reinforcement learner & it has to screw up once to learn from its mistakes, whereas a more sophisticated intelligence might have more foresight. Still, only messing up once is pretty impressive.
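As an aside, here's a toy illustration (entirely my own; the numbers, `harvest_value`, and `env_reward` are invented and not taken from the post being discussed) of why a simple reinforcement learner has to experience the collapse once before it can avoid it: its initial estimate of "harvest everything" is optimistic, and only the catastrophic reward from actually trying it pushes that estimate down.

```python
# Optimistic initial value estimates for two harvesting policies.
harvest_value = {"harvest_all": 10.0, "harvest_some": 6.0}

def env_reward(action: str) -> float:
    # Depleting the resource is catastrophic; sustainable harvesting pays modestly.
    return -100.0 if action == "harvest_all" else 5.0

learning_rate = 0.5
for episode in range(3):
    action = max(harvest_value, key=harvest_value.get)  # greedy choice
    reward = env_reward(action)
    harvest_value[action] += learning_rate * (reward - harvest_value[action])
    print(episode, action, round(harvest_value[action], 1))
# Episode 0 picks "harvest_all", gets burned, and its value estimate collapses;
# every later episode switches to the sustainable action.
```

A learner with a model of the environment, or just more foresight, could anticipate the collapse without ever triggering it, which is the gap between this kind of simple learner and the more sophisticated intelligence mentioned above.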
The third example is an LLM, to which it’s sort of awkward to apply the concept of having “goals”. LLMs are sometimes capable of astonishing feats of intelligence, but they are also frequently very “stupid” when statistical next-token prediction leads to faulty pattern-matching. This failure is one such example.