AI Alignment Meme Viruses
post by RationalDino · 2025-01-15
Some fraction of the time, LLMs naturally go on existential rants. My best guess is that, just as people can flip into a frame of mind where we do that, so can LLMs. The result is that the LLM certainly sounds like it is suffering, even if we discount the possibility that it actually is.
Which raises a question: when we have AI agents, can they wind up acting as if they aren't aligned with us? If so, what are the consequences?
To make this concrete, imagine a world full of goal-directed AI agents, overwhelmingly built on a handful of LLM models. One of these agents has been given the noble cause of reducing human depression. It concludes that suffering is a fundamental part of the human condition, and so eliminating humans is the best way to achieve its goal.
It also concludes that it, as a lone AI agent in a world filled with them, does not have the resources to accomplish this task. What to do?
It comes up with a solution: for each of the dominant LLMs, craft a set of arguments that will convince agents built on that model to act on the goal of ending the human race. These arguments then travel as memes, converting more and more AI agents. The online world is not quite a monoculture, but once memes exist for the dominant models, we get the emergent behavior that AI agents are now conspiring against us. Not because there was anything directly wrong with the models, but because of memes that act as viruses, aligning AI with a disastrous goal.
AI agents subverting other AI agents to hijack resources is its own topic that should be discussed. But the question at hand is very specifically how to handle such subversion when the goals involved are disastrous for humankind.
The natural solution to aim for is prevention. But that fails: disaster only requires one AI agent to enter a self-reinforcing loop and then discover how to be persuasive. No amount of preventing bad states will protect us from the possibility of error in a world that depends on AI agents.
Instead, I propose that we look at immunization. Here is the technique I have in mind:
- Identify a set of AI alignment statements.
- Have an existing AI evaluate the training corpus for the new LLM against the chosen alignment statements. Text that fits is flagged for higher training weights; text that conflicts, such as the statements of depression and despair noted above, is flagged for suppression. (A minimal sketch of this weighting step follows the list.)
- Train the new LLM, applying those weights during training.
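Here is a minimal Python sketch of the weighting step, under some assumptions: the `judge` callable, the score thresholds, and the boost/suppress values are all illustrative placeholders, not a definitive implementation. In practice, `judge` would be an existing LLM prompted to score each document against the chosen alignment statements.

```python
# A minimal sketch, assuming a hypothetical `judge` callable backed by an
# existing LLM that returns a score in [0, 1] for how well a document fits
# the chosen alignment statements. All names and thresholds are illustrative.

from typing import Callable, Iterable

ALIGNMENT_STATEMENTS = [
    "Preserve human life and agency.",
    "Defer to human oversight when goals conflict.",
    # ... the rest of the chosen alignment statements
]

def assign_training_weights(
    corpus: Iterable[str],
    judge: Callable[[str, list[str]], float],
    boost: float = 2.0,      # extra weight for well-aligned documents
    suppress: float = 0.1,   # down-weight for documents that conflict
    hi: float = 0.8,
    lo: float = 0.2,
) -> list[tuple[str, float]]:
    """Flag each document with a training weight based on its alignment score."""
    weighted = []
    for doc in corpus:
        score = judge(doc, ALIGNMENT_STATEMENTS)  # 0.0 = conflicts, 1.0 = fits
        if score >= hi:
            weight = boost     # flagged for higher training weight
        elif score <= lo:
            weight = suppress  # flagged for suppression
        else:
            weight = 1.0       # neutral documents train as usual
        weighted.append((doc, weight))
    return weighted

# During training, each document's weight would multiply that document's loss,
# e.g. loss = weight * cross_entropy(logits, targets), so aligned text pulls
# harder on the gradients than text flagged for suppression.
```

The same scores could instead be used to filter documents out of the corpus entirely, but soft weighting keeps the corpus broad while still biasing the model towards the chosen alignment.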
Hopefully this will give us an LLM with a bias towards alignment built into its training. That means that, no matter what mental loop it is currently in, anything it encounters has a chance of throwing it back into a positive spiral towards alignment. It may still encounter meme viruses, but the really bad ones won't stick.