Stopping unaligned LLMs is easy!
post by Yair Halberstadt (yair-halberstadt) · 2025-02-03T15:38:27.083Z · LW · GW · 9 comments
Or at least it might be.
Status: butterfly idea; I have only a vague understanding of the internals of LLMs
I have no idea how to align superintelligent LLMs. But I think it's quite likely we'll easily be able to catch them in the act and turn them off before they do anything too dastardly.
I'm going to make the following reasonable (but by no means certain) assumptions:
- LLMs will never be superintelligent when predicting a single token. Superintelligence emerges when you continuously run an LLM in serial, where it's able to use its earlier output as a scratchpad to solve more complicated problems.
- LLMs will have no state. Minus some technical complexities, predicting 10 tokens is no different from predicting one token, using that as input to predict the next token, and so on recursively (see the sketch after this list).
- Superintelligent systems based on LLMs may use multiple LLMs in all sorts of complex configurations, but not in a way that breaks the above assumptions about LLMs, and not in a way that produces emergent behaviour of the system as a whole that isn't reflected in the outputs of the constituent LLMs.
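To make the first two assumptions concrete, here's a minimal sketch, with a hypothetical `predict_next_token` function standing in for a single forward pass of the LLM: generating many tokens is just many single-token calls, and the only thing carried from one call to the next is the text itself.

```python
def generate(predict_next_token, prompt: str, n_tokens: int) -> str:
    """Generate n_tokens by repeated single-token prediction.

    predict_next_token is a hypothetical stand-in for one forward pass:
    it sees only the text so far and returns one token. No hidden state
    survives between calls, so anything the model "knows" at step k is
    already visible in the text it has produced.
    """
    text = prompt
    for _ in range(n_tokens):
        text += predict_next_token(text)  # one shallow, non-superintelligent pass
    return text
```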
If so, it seems that all you need to do to detect any unwanted behaviour from a superintelligent system is to feed all output from constituent LLMs to a simpler LLM to detect output that looks like it's leading towards unaligned behaviour. Only once the output has been verified, pass it on to the next system (including looping it back to itself to output more tokens). If it fails verification immediately stop the whole system.
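A minimal sketch of that wrapper, assuming hypothetical `superintelligent_llm` and `verifier_llm` functions that each do a single pass (text in, text out); the exception is just my stand-in for "immediately stop the whole system":

```python
class UnalignedOutputDetected(Exception):
    """Raised to halt the whole system before the flagged output goes anywhere."""


def run_with_verifier(superintelligent_llm, verifier_llm, prompt: str,
                      max_passes: int = 1000) -> str:
    """Run the powerful LLM serially, checking every pass before it is reused.

    The verifier sees each pass's output *before* it is looped back (or sent
    to any other component), so a decision to misbehave is caught while it
    still only exists as unexecuted text.
    """
    text = prompt
    for _ in range(max_passes):
        new_output = superintelligent_llm(text)  # a single, non-superintelligent pass
        verdict = verifier_llm(
            "Does the following output look like a step towards unaligned "
            "behaviour? Answer SAFE or UNSAFE.\n\n" + new_output
        )
        if "UNSAFE" in verdict:
            raise UnalignedOutputDetected(new_output)  # shut everything down
        text += new_output  # only verified output is passed on / looped back
    return text
```

In practice each pass might emit a chunk of tokens rather than one, but the shape is the same: nothing reaches the next pass until the verifier has seen it.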
Objections
But won't the superintelligent AI be able to trick the verifier?
The superintelligent AI definitely could trick the verifier. But the verifier isn't checking the superintelligent AI, it's checking each individual pass of the LLM. Since each individual pass is not superintelligent, if the AI decided to trick the verifier it would have to decide to do so statefully - in the LLM output. So the verifier can see the AI's decision before that decision has been passed anywhere, and can shut down the superintelligent AI.
Effectively, the verifier acts before the system has become superintelligent.
But won't the superintelligent AI prevent itself being shut down?
Same answer. It definitely could, but it would have to decide to do so, and by then the verifier will already have shut it down.
Akkkshully LLMs generate multiple tokens per pass, and what if they use multi-headed quantronmium flip-flopping dialescretic feedback?
The exact details don't matter. So long as LLMs:
a) continue to be called serially, with the output of one run the input to the next
b) each pass is (relatively) shallow and unlikely to be superintelligent on its own
c) the only information passed from one run to the next is plaintext
the basic idea should work.
Won't this be expensive?
My assumption is this would probably work even if you use GPT-3-level verifiers to check the text, and those should be so cheap compared to whatever model you're using for your superintelligent AI that it basically won't impact the cost at all. I believe the leading AI labs already use LLMs to verify output from their API, and that's where safety ratings come from (I could be wrong).
Even if you need more powerful LLMs, you shouldn't need ones more powerful than the one that's being verified, effectively doubling your cost (you avoid superintelligent verifiers by restricting them to a single pass - after all, they're only verifying a single pass).
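As a rough illustration of the cost argument, with entirely made-up per-token prices (the numbers below are assumptions for the sake of arithmetic, not real pricing), and assuming the verifier reads roughly as many tokens as the big model produces:

```python
# Hypothetical per-token costs, purely illustrative.
frontier_cost_per_token = 60e-6   # the expensive model being verified
small_verifier_cost     = 0.5e-6  # a GPT-3-class verifier

overhead_small = small_verifier_cost / frontier_cost_per_token      # under 1% extra
overhead_same  = frontier_cost_per_token / frontier_cost_per_token  # 100% extra, i.e. 2x total

print(f"small verifier overhead: {overhead_small:.1%}")    # ~0.8%
print(f"same-size verifier overhead: {overhead_same:.0%}")  # 100% (doubles the cost)
```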
How do we persuade AI labs to do this?
I believe (but could be wrong) that they already use LLMs to check API output for safety. This isn't much different. It's very likely they'll start doing this anyway, and even if not, only a minor push should be required.
One of your assumptions is wrong!
Very possible, please tell me why in the comments!
9 comments
comment by plex (ete) · 2025-02-03T17:19:35.116Z · LW(p) · GW(p)
I'm glad you're trying to figure out a solution. I am however going to shoot this one down a bunch.
If these assumptions were true, this would be nice. Unfortunately, I think all three are false.
LLMs will never be superintelligent when predicting a single token.
In a technical sense, definitively false. Redwood compared human to AI token prediction and even early AIs were far superhuman. Also, in a more important sense, you can apply a huge amount of optimization on selecting a token. This video gives a decent intuition, though in a slightly different setting.
LLMs will have no state.
False in three different ways. Firstly, people are totally building in explicit state in lots of ways (test time training, context retrieval, reasoning models, etc). Secondly, there's a feedback cycle of AI influences training data of next AI, which will become a tighter and tighter loop. Thirdly, the AI can use the environment as state in ways which would be nearly impossible to fully trace or mitigate.
not in a way that any emergent behaviour of the system as a whole isn't reflected in the outputs of any of the constituent LLMs
Alas, many well-understood systems regularly do, and should be expected to, have poorly understood behaviour when taken together.
a simpler LLM to detect output that looks like it's leading towards unaligned behaviour.
Robustly detecting "unaligned behaviour" is an unsolved problem, if by aligned you mean "makes the long term future good" rather than "doesn't embarrass the corporation". Solving this would be massive progress, and throwing a LLM at it naively has many pitfalls.
Stepping back, I'd encourage you to drop by AI plans, skill up at detecting failure modes, and get good at both generating and red-teaming your own ideas (the Agent Foundations course and Arbital are good places to start). Get a long list of things you've shown how they break, and help both break and extract the insights from others' ideas.
↑ comment by Yair Halberstadt (yair-halberstadt) · 2025-02-03T17:36:07.918Z · LW(p) · GW(p)
context retrieval, reasoning models
Both of these are examples of more intelligent systems built on top of an LLM where the LLM itself has no state.
Thirdly, the AI can use the environment as state in ways which would be nearly impossible to fully trace or mitigate.
Again same problem - AI tool use is mediated by text input and output, and the validator just needs access to the LLM input and output.
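A sketch of what that mediation could look like, reusing the verifier idea from the post; `verifier_llm` and `run_tool` are hypothetical text-in/text-out functions, and the point is only that both the tool request and the tool result are plaintext the validator gets to see before the LLM does:

```python
def run_tool_checked(verifier_llm, run_tool, tool_request: str) -> str:
    """Check a plaintext tool request before executing it, and the tool's
    plaintext result before it is fed back to the LLM, so the environment
    can't be used as hidden state without that use appearing in checked text.
    """
    if "UNSAFE" in verifier_llm("Is this tool request a step towards unaligned "
                                "behaviour? Answer SAFE or UNSAFE.\n\n" + tool_request):
        raise RuntimeError("validator flagged tool request; shutting down")
    result = run_tool(tool_request)
    if "UNSAFE" in verifier_llm("Is this tool output a step towards unaligned "
                                "behaviour? Answer SAFE or UNSAFE.\n\n" + result):
        raise RuntimeError("validator flagged tool output; shutting down")
    return result
```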
comment by tailcalled · 2025-02-03T16:39:57.673Z · LW(p) · GW(p)
If so, it seems that all you need to do to detect any unwanted behaviour from a superintelligent system is to feed all output from constituent LLMs to a simpler LLM to detect output that looks like it's leading towards unaligned behaviour. Only once the output has been verified, pass it on to the next system (including looping it back to itself to output more tokens). If it fails verification immediately stop the whole system.
What is unaligned behavior and what does output that leads to it look like?
↑ comment by Yair Halberstadt (yair-halberstadt) · 2025-02-03T16:51:11.108Z · LW(p) · GW(p)
To take an extreme example: killing all humans, and the plan to kill all humans.
But the AI isn't going to be so stupid as to write the plan in plaintext, so it'll likely develop some sort of code.
So then the unaligned behaviour is writing in code, and the output that leads to it is developing the code.
But maybe the LLM can one-shot the code without needing to develop it? Sure, but then the validator (by definition, if it's the same LLM) can also tell the code has been developed.
Maybe the LLM can one-shot a code to jailbreak the validator? Sure, but that starts to break the assumption that the LLM isn't superintelligent in a single pass.
↑ comment by tailcalled · 2025-02-03T17:59:34.180Z · LW(p) · GW(p)
How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
↑ comment by Yair Halberstadt (yair-halberstadt) · 2025-02-03T18:18:27.499Z · LW(p) · GW(p)
This isn't a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
↑ comment by tailcalled · 2025-02-03T19:00:36.820Z · LW(p) · GW(p)
If you wanted to have an unaligned LLM that doesn't abuse humans, couldn't you just never sample from it after training it to be unaligned?
comment by Anon User (anon-user) · 2025-02-03T19:06:05.275Z · LW(p) · GW(p)
You ask a superintelligent LLM to design a drug to cure a particular disease. It outputs just a few tokens with the drug formula. How do you use a previous-gen LLM to check whether the drug will have some nasty humanity-killing side-effects years down the road?
Edited to add: the point is that even with a few tokens, you might still have a huge inferential distance that nothing with less intelligence (including humanity) could bridge.
↑ comment by Yair Halberstadt (yair-halberstadt) · 2025-02-03T19:08:34.767Z · LW(p) · GW(p)
That violates assumption one (a single pass cannot produce super intelligent output).