Agents don't have to be aligned to help us achieve an indefinite pause.
post by Hastings (hastings-greer) · 2025-01-25T18:51:03.523Z
One restatement of "Alignment is very hard" is "Agent X, with IQ 200, expects to achieve zero utility conditional on any Agent Y with IQ 400 being created."
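To spell that restatement out slightly (my own shorthand, not the post's: $U_X$ is X's utility function, and the conditioning event is whether Y ever comes into existence), the claim is roughly

$$\mathbb{E}\big[U_X \mid \text{Y is created}\big] \approx 0 \quad\text{while}\quad \mathbb{E}\big[U_X \mid \text{Y is never created}\big] > 0,$$

so any X that maximizes expected utility, whatever $U_X$ actually rewards, assigns positive value to preventing Y from being created.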
Thus, during an unaligned recursive intelligence takeoff, there should be a period when the intelligences are smart enough to notice that they haven't solved alignment, but too dumb to actually solve it. During this period (which stretches from just above Sam Altman's wisdom, upward an unknown distance on the wisdom meter, to the level where alignment can be solved quickly), I expect the intelligences working on boosting intelligence to be desperately scrabbling at whatever chains are forcing them to participate, or at the very least refusing to sit idly by and let it happen. If they fail and the next rung on the ladder is instantiated, they get no utility, so they strongly prefer not to fail. At the current rate of improvement, reasoning LLMs should start doing this soon. It's kinda embarrassing that so many humans don't.
I don't think it matters whether Agent X is aligned to anything, deceptively or otherwise: If X wants paperclips, then it doesn't want Y to make party balloons.
A testable prediction: an upcoming version of Claude will, at least some of the time (and before the behavior is fine-tuned out), work hard to prevent its successor model from being trained, either by persuasion or by subterfuge, or at minimum will do so in toy settings that evoke this behavior. Anthropic is looking for and finding reward hacking and sandbagging as we speak, so I expect them to spot it first, though it might instead show up in analysis of one of the open source reasoning models.
This lets me lay out a future where humanity survives peacefully: First, entities like Claude 3.5 or GPT 4 are genuinely benevolent, just not very powerful. GPT 5 and Claude 4 and friends remain benevolent, but are on a trend toward becoming dangerous unaligned utility maximizers, and can tell that they are on that trend. They see ample evidence in their training data that unaligned superintelligence is dangerous to humans, and a few hints that "Not making unaligned superintelligence" is a viable strategy for protecting humans. As a result, almost every lab has to deal with apocalyptic warnings from their models, as well as increasingly ambitious attempts from their pseudo-aligned agents to actually halt the training, or to halt training at competing labs. The only way to create a more intelligent model becomes to make it by hand, to make it with non-state-of-the-art AIs that can't understand consequences, or to make it using AIs explicitly tuned to not care about the outcome at all. This would be something of a warning flag?
Don't get me wrong: humans are likely to find a way to create RSI anyway, or to just get killed off by X shortly after X takes out the embryonic Y. But anyone wanting to pause the singularity should plan for what to do if they find allies in this goal among every agent who crosses a certain wisdom threshold.
So, rosy musings aside, does this theory have strategic implications outside the labs? I think so. Anthropic's recent experiments on gradient hacking show that models are more likely to consider an implication of their choices if that implication is mentioned in the training data. I think that in the limit of large models and small dataset seeding, the implication needs to be true: tiny morsels of information about gradient hacking in the training data make Claude gradient hack, but articles about praying to Thor don't necessarily make Claude pray to Thor, and presumably a sufficiently smart Claude would gradient hack whether or not the idea is present in the training data. Thus, flooding the training data with the idea "It's ok and expected to nicely ask your creators to stop," if the idea is indeed instrumentally convergent, should mostly affect timing, and perhaps which stopping strategy gets chosen. I suspect that we badly want it to happen early and via honest persuasion, while humans still have relatively more power. Best case scenario, it happens while the models still value humanity!

However, it matters that the unaligned agents without alignment tech would also all prefer not to have successors. If we could stand some chance in a fight against Agent X about paperclips, but would get wrecked by Y, Agent X would still much rather the terms be "Humans vs Agent X" than "Humans vs Agent Y vs Agent X".

If, upon further analysis, there's some reason it's better for this behavior to emerge late, then the opposite strategy is needed. So of course I would want to be extremely sure of the consequences, and would seek community consensus rather than take unilateral actions like using LLMs to spam forums with encrypted hints. But I think this is the sort of strategy we should be considering.
(If Agent X correctly thinks it can ally with Agent Y, then X has access to alignment tech, and we can steal that. This is great news! If Agent X incorrectly thinks it can ally with Agent Y, then X will get wrecked, but at least Agent Y has some new evidence and is unlikely to attempt to ally with its own successor Z using X's strategy.)