Knocking Down My AI Optimist Strawman
post by tailcalled · 2025-02-08
"The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way." "People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine." "The root problem is that The Sequences expected AGI to develop agency largely without human help; meanwhile actual AI progress occurs by optimizing the scaling efficiency of a pretraining process that is mostly focus on integrating the AI with human culture." "This means we will be able to control AI by just asking it to do good things, showing it some examples and giving it some ranked feedback." "You might think this is changing with inference-time scaling, yet if the alignment would fall apart as new methods get taken into use, we'd have seen signs of it with o1." "In the unlikely case that our current safety will turn out to be insufficient, interpretability research has worked out lots of deeply promising ways to improve, with sparse autoencoders letting us read the minds of the neural networks and thereby screen them for malice, and activation steering let... "AI x-risk worries aren't just a waste of time, though; they are dangerous because they make people think society needs to make use of violence to regulate what kinds of AIs people can make and how they can use them." "This danger was visible from the very beginning, as alignment theorists thought one could (and should) make a singleton that would achieve absolute power (by violently threatening humanity, no doubt), rather than always letting AIs be pure servants of humanity." "To "justify" such violence, theorists make up all sorts of elaborate unfalsifiable and unjustifiable stories about how AIs are going to deceive and eventually kill humanity, yet the initial deceptions by base models were toothless, and thanks to modern alignment methods, serious hostility or decept... None No comments
I recently posted my model of an optimistic view of AI [LW · GW], asserting that I disagree with every sentence of it. I thought I might as well also describe my objections to those sentences:
"The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way."
Here are some of the main things humanity should want to achieve:
- Curing aging and other diseases
- Plentiful clean energy from e.g. nuclear fusion
- De-escalating nuclear MAD while extending world peace and human freedom
- ... even if hostile nations would use powerful unaligned AI[1] to fight you
- Stopping criminals, even if they would make powerful unaligned AI[1] to fight you
- Educating people to be great and patriotic
- Creating healthy, tasty food without torturing animals
- Nice homes for humans near important things
- Good, open channels for honest, valuable communication
- Common knowledge of the virtues and vices of executives, professionals, managers, politicians, and various other groups and people of interest
We already have humans working on these, on the assumption that humans have what it takes to contribute to them. Do large multimodal models seem to be moving towards being able to take over here? Mostly I don't see it - and in the few cases where I do, there's as good a reason to think they will cause regress as progress.
"People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine."
We have basically no idea how AI is influencing the world.
Like yes, we can come up with spot checks to see what the AI writes when it is prompted in a particular way. But we don't have a good overview of what it actually gets prompted to do in practice, or of how most humans use the outputs. Even if we had a decent approximation of that, we don't have a great way to evaluate which parts really add up to problems, and approximations intrinsically break down in the long tails.
Of course the inability to work out problems from first principles is a universal issue, so in practice bad things get detected via root-cause analyses of problems. This can be somewhat difficult because some of the problems involve people who are incentivized to hide them. But we do have some examples:
- Personalized tutoring was one of the most plausible contributions of LLMs, which could have contributed to "Educating people to be great and patriotic" in the list above, but in practice LLMs seem to be used more to skip learning, and the things they teach are often still slop. It seems quite plausible that LLMs are making education worse rather than better.
- Automatic moderation was also one of the most plausible contributions of AI, which could have contributed to "Good, open channels for honest, valuable communication", but anecdotally, spam seems to have gone up, and platforms seem to have become more closed, indicating that current AI technology really is making this issue much worse.
The error is linked to assumptions about the agency of the AI. Like it's assumed that if the AI seems to be endorsing nice values and acting according to them when sampled, then this niceness will add up over all of the samplings. But LLMs don't have much memory or context-awareness, so they can't apply their agency across different uses very well. Instead, the total effect of the AI is determined by environmental factors distinct from its values, especially by larger-scale agents that are capable of manipulating the AIs. (This is presumably going to change when AI gets strengthened in various ways.)
Just to emphasize, this doesn't necessarily mean that AI is net bad, just that we don't know how good/bad AI is. Recently society kind of seems to have gotten worse, but it seems to me like that's not driven mainly by AI.
"The root problem is that The Sequences expected AGI to develop agency largely without human help; meanwhile actual AI progress occurs by optimizing the scaling efficiency of a pretraining process that is mostly focus on integrating the AI with human culture."
Large multimodal models are good at simple data transformations and querying common knowledge. I'm sure optimizing the scaling efficiency of pretraining processes will make them even better at that.
However, this is still mostly just copying humans [LW · GW], and for the ambitious achievements I mentioned at the beginning of the post, copying humans doesn't seem to be enough. E.g. to build a fusion power plant, we'd need real technical innovations. If these are supposed to be made by a superhuman AI, it needs to be able to go beyond just copying the innovations humans have already come up with.
So if we imagine AI as a tool that makes it easier to process and share certain kinds of information, then sure, improving scaling efficiency is how you develop AI, but that's not the sort of thing the original arguments about existential risk were concerned with, and we have good reasons to believe that AI will also be developed with more ambitious methods. These "good reasons" mostly boil down to adversarial relationships: spammers, propagandists, criminals, and militaries will want to use AI to become stronger, and we need to be able to fight that, which also requires AI.
"This means we will be able to control AI by just asking it to do good things, showing it some examples and giving it some ranked feedback."
RLHF trains an AI to do things that look good to humans. This makes it much harder to control, because it pushes the AI to hide anything that looks bad. Also, RLHF is kind of a statistical approach, which makes it work better for context-independent goodness, whereas often the hard part is recognizing rare forms of goodness. (Otherwise you just end up with very generic stuff.)
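To make concrete what "ranked feedback" actually optimizes, here is a minimal sketch of the pairwise reward-model objective commonly used in RLHF. The `reward_model` and the token-id inputs are hypothetical placeholders; the point is that the loss only cares which response the human rater preferred, i.e. which one looked better to them.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise (Bradley-Terry style) loss for fitting a reward model on ranked feedback.

    `reward_model` is a hypothetical module mapping token ids to a scalar score.
    The gradient pushes the score of the rater-preferred response above the
    rejected one - so the model learns "what raters prefer", not "what is good".
    """
    r_chosen = reward_model(chosen_ids)      # score of the response the rater picked
    r_rejected = reward_model(rejected_ids)  # score of the response the rater rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The policy is then tuned to maximize this learned score, so whatever systematically fools the raters gets rewarded exactly as much as whatever genuinely helps them.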
Examples/prompt engineering requires the AI to work by copying humans, which I addressed to some extent in the previous section. The primary danger of AI is not when it does things humans understand well, but rather when it does things that are beyond the scale or reach of human understanding.
"You might think this is changing with inference-time scaling, yet if the alignment would fall apart as new methods get taken into use, we'd have seen signs of it with o1."
o1-style training is not optimizing against the real world to handle long-range tasks, so instrumental convergence does not apply there. You need to consider the nuances of the method to evaluate whether the alignment properties of current methods will fall apart. In particular, it gets more problematic as optimization against adversaries gets involved.
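To illustrate the distinction (the details of o1's training are not public, so treat this as an assumption-laden sketch rather than a description of the actual method): reasoning-style RL is usually described as optimizing against a fixed, checkable grader, so the reward never depends on long-range consequences in the world.

```python
import random

def verifiable_reward(problem, answer):
    # Reward comes from a closed-form check (e.g. a math answer key),
    # not from the answer's downstream effects on the world.
    return 1.0 if answer == problem["solution"] else 0.0

def training_step(policy, problems):
    """One toy step of RL on verifiable tasks.

    `policy` is a hypothetical object with sample(prompt) -> (answer, logprob)
    and update(logprob, reward). Nothing in the loop gives the model credit
    for long-horizon influence on its environment - only for passing the grader.
    """
    problem = random.choice(problems)
    answer, logprob = policy.sample(problem["prompt"])
    policy.update(logprob, verifiable_reward(problem, answer))
```

Replace the fixed grader with a responsive adversary (spammers versus filters, militaries versus militaries) and the optimization pressure starts reaching into the world, which is where the instrumental-convergence worries come back.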
"In the unlikely case that our current safety will turn out to be insufficient, interpretability research has worked out lots of deeply promising ways to improve, with sparse autoencoders letting us read the minds of the neural networks and thereby screen them for malice, and activation steering letting us deeply control the networks to our hearts content."
SAEs and activation steering operate at the level of individual tokens or text generations, rather than on the overall behavior of the network. Neither of them can contribute meaningfully to current alignment issues like improving personalized tutoring, because those play out at a much broader level than tokens, so we shouldn't expect them to scale to more difficult issues like keeping down crime or navigating international diplomacy.
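To make the token-level point concrete, here is a minimal sketch of both techniques; the dimensions and module names are hypothetical, but the structure is the standard one: each operates on the activation vector at a single token position, and nothing in the math sees the model's aggregate effect across its deployments.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: decomposes one token position's activation into sparse features."""
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation):  # activation: (batch, d_model) at one token position
        features = torch.relu(self.encoder(activation))  # sparse "interpretable" features
        return self.decoder(features), features          # reconstruction + features

def steer(activation, steering_vector, strength=5.0):
    """Activation steering: add a direction (e.g. a 'niceness' vector) to the
    residual stream of one forward pass. The intervention is local to that pass;
    it says nothing about the model's cumulative effect on the world."""
    return activation + strength * steering_vector
```

Both interventions inspect or nudge a single forward pass; turning that into claims about an AI's broad influence on education, crime, or diplomacy requires an aggregation story that the methods themselves don't provide.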
"AI x-risk worries aren't just a waste of time, though; they are dangerous because they make people think society needs to make use of violence to regulate what kinds of AIs people can make and how they can use them."
Obviously there will be some very bad ways to make and use AI, and we need norms against them. Violence is the ultimate backstop for norm enforcement: it's called the police and the military.
"This danger was visible from the very beginning, as alignment theorists thought one could (and should) make a singleton that would achieve absolute power (by violently threatening humanity, no doubt), rather than always letting AIs be pure servants of humanity."
It seems extremely valid to be concerned about AI researchers (including those with an alignment focus) aspiring to conquer the world (or to make something that conquers the world). However, an arrangement where humans are always kept on top won't be able to deal with the rapid, broad action that will be needed against AI-enabled adversaries.
Traditionally the ultimate backstop for promoting human flourishing was that states were reliant on the men in their militaries, so if those men were incapacitated or did not see value in the state they were fighting for, the state would be weaker. This incentivized states to develop things that helped the men in their militaries, and meant that states which failed to do so got replaced by states that did.
This backstop has already been weakening with more advanced weaponry and more peace. Eventually all fighting will be done by drones rather than by people, at which point the backstop will be nearly gone. (Of course there's also the manufacturing and programming of the drones, etc.) This lack of a backstop is the longest-term alignment problem, and if it fails there are endless ways most value could be destroyed, e.g.:
- The machinery of war (mining, manufacturing, targeting, ...) has been fully automated, and a death cult (like Hamas or the Zizians) develops in the upper ranks of some military (this would probably require a situation like Russia's, since death cults usually develop from external pressure?), and they destroy the world.
- World peace is achieved, and the world elites are heavily filtered through processes that make them obsessed with superficial appearances rather than what is really going on, and they use their policing power to do that to everyone else too. (Imagine social credit scores that subtract points whenever you frown.)
- All production is fully automated and the human reward system gets so fully reverse-engineered that everyone spends all their time watching what basically amounts to those TikToks that layer some attractive commentary (e.g. a joke) on top of some attractive video (e.g. satisfying hydraulic press moments).
"To "justify" such violence, theorists make up all sorts of elaborate unfalsifiable and unjustifiable stories about how AIs are going to deceive and eventually kill humanity, yet the initial deceptions by base models were toothless, and thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out."
AI optimists have been totally knocked out by things like RLHF, becoming overly convinced of the AI's alignment and capabilities just from it acting apparently-nicely. This is a form of deceptive alignment, just in a "law of earlier failure" sense, since the AIs that knocked them out are barely even agentic.
[1] "But wouldn't it just be aligned to them, rather than unaligned?" Sometimes, presumably especially with xrisk-pilled adversaries. But some adversaries won't be xrisk-pilled and will instead be willing to use riskier and riskier strategies until they win. So you either need to eliminate them ahead of time or be able to destroy the unaligned AIs.