Don't you mean "the most *conditionally* forbidden technique?"

post by Knight Lee (Max Lee) · 2025-04-26

Contents

  The natural selection argument:
  Human evolution
  At least use this idea to fix existing usages of the most forbidden technique
  Potential use
  But sigh

Hear me out, I think the most forbidden technique is very useful and should be used, as long as we avoid the "most forbidden aftertreatment":

  1. An AI trained on interpretability techniques must not be trained on capabilities after (or while) being trained on interpretability techniques, otherwise it will relearn bad behaviour in a sneakier way (see the sketch after this list).
  2. An AI trained on interpretability techniques cannot be trusted any more than an older version of itself which hasn't yet been trained on interpretability techniques. Evaluations must be performed on that older version.
    • An AI company that trains its AI on interpretability techniques must publish the old version (which hasn't been trained on them) with the same availability as the new version.
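
Here is a minimal sketch of rule 1 as a check on an ordered training schedule. This is my own framing rather than anything from the original most forbidden technique post; the stage labels and the `check_schedule` helper are hypothetical illustrations.

```python
# Hypothetical illustration: represent a training run as an ordered list of
# stage kinds, and reject any schedule where capabilities training happens
# after (or interleaved with) training on interpretability techniques.

def check_schedule(stages: list[str]) -> None:
    """Raise if the schedule commits the "most forbidden aftertreatment" (rule 1)."""
    interp_indices = [i for i, kind in enumerate(stages) if kind == "interpretability"]
    if interp_indices and "capabilities" in stages[interp_indices[0]:]:
        raise ValueError("capabilities training after (or during) interpretability "
                         "training lets bad behaviour evolve around the probes")

check_schedule(["pretraining", "capabilities", "interpretability"])    # allowed
# check_schedule(["pretraining", "interpretability", "capabilities"])  # raises ValueError

# Rule 2, as a reminder rather than code: evaluations and trust decisions target
# the checkpoint saved *before* the interpretability stage, and that checkpoint
# gets published with the same availability as the new model.
```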

The natural selection argument:

The most forbidden technique is forbidden because it lets bad behaviours evolve against interpretability. Bad behaviours which are caught by interpretability techniques are killed, then capabilities training creates more bad behaviours, then the bad behaviours which get caught are killed again, and then capabilities training creates more bad behaviours again!

After many generations, natural selection creates the most insidious and terrifying bad behaviours, which successfully hide from all interpretability techniques.

However, if we only train on interpretability techniques a single time, and do zero capabilities training afterwards, then we only kill the bad behaviours a single time, without giving the surviving bad behaviours any time to birth a second generation of bad behaviours.
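
To make this concrete, here is a toy simulation (my own construction, not something from the original post): each bad behaviour is just a "visibility" number, the interpretability filter catches a behaviour with probability equal to its visibility, and capabilities training lets each survivor spawn a slightly mutated variant. After repeated cycles the surviving bad behaviours are both more numerous and harder to see than after a single pass.

```python
import random

random.seed(0)

def interpretability_filter(behaviours):
    """Kill each bad behaviour with probability equal to its visibility."""
    return [v for v in behaviours if random.random() > v]

def capabilities_training(behaviours, mutation=0.05):
    """Each surviving behaviour spawns one variant with slightly mutated visibility."""
    children = [min(1.0, max(0.0, v + random.uniform(-mutation, mutation)))
                for v in behaviours]
    return behaviours + children

initial = [random.random() for _ in range(100)]  # 100 bad behaviours, random visibility

# The forbidden setup: filter, then more capabilities training, over and over.
repeated = list(initial)
for _ in range(10):
    repeated = capabilities_training(interpretability_filter(repeated))

# The setup argued for here: filter once, with no capabilities training afterwards.
one_shot = interpretability_filter(list(initial))

for name, population in [("repeated cycles", repeated), ("one-shot filter", one_shot)]:
    mean_visibility = sum(population) / len(population)
    print(f"{name}: {len(population)} bad behaviours, mean visibility {mean_visibility:.2f}")
```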

Human evolution

Imagine if, during the course of human evolution, a robot observed our brain activity in the amygdala, the part of the brain responsible for anger.

Whenever the population rises above 1 million, it kills the half of the population with the highest anger ratings, based on brain activity in the amygdala.

For a short time, this will cause people to be less angry, but eventually evolution will simply cause humans to simulate the concept of anger in another brain region, one that the robot doesn't know about. In the end, we end up just as angry as before, but less interpretable to the robot.

However, imagine if the robot didn't do this again and again over the course of our evolution, but only did it once, and "stopped evolution" from taking place afterwards. In this case, we actually would be less angry, because activity in the amygdala would still be a faithful measure of anger at the moment it carries out its selection.

(For a less gruesome story, pretend it doesn't kill people, but moves them to a faraway happy place.)

At least use this idea to fix existing usages of the most forbidden technique

Some versions of the most forbidden technique are already happening. Anytime an AI lab uses RLHF or Constitutional AI to make an LLM helpful, harmless, and honest, but then does RL capabilities training afterwards, it is doing things in the wrong order! The AI first learns how to appear helpful, harmless, and honest, killing off obvious bad behaviours while leaving behind more insidious ones, and afterwards RL for capabilities shapes it to "somehow get a high reward," which now makes it more likely to use the insidious bad behaviours left behind.

Sure, RLHF and Constitutional AI are not technically training the AI on interpretability techniques. But that's only because the words "interpretability technique" are associated with some sophisticated analysis of the AI's activation layers, rather than with directly reading its words and judging whether it's doing evil. The truth is, directly reading its words is an interpretability technique, and a very valuable one too, and RLHF and Constitutional AI are training on this interpretability technique.
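
Under that framing, the usual ordering already commits the forbidden aftertreatment. As a purely illustrative sketch (the stage labels below are mine, not any lab's actual pipeline, and real pipelines interleave things in messier ways):

```python
# Illustrative stage orderings only; not any lab's actual pipeline.
common_order   = ["pretraining", "rlhf_hhh", "capabilities_rl"]  # HHH training, then more capabilities RL: the wrong order
proposed_order = ["pretraining", "capabilities_rl", "rlhf_hhh"]  # capabilities RL first, HHH training last, nothing after
```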

Potential use

When deploying AI to do high-stakes work, for example:

We might deploy both a copy of the AI which has been trained on interpretability techniques and a copy of the AI which hasn't been trained on them.

I feel that using both of them could be superior to using only the untrained AI, because even if the trained AI is a bit less interpretable, it's highly unlikely to have a greater degree of ill intent than the untrained AI! From first principles, it should have a lesser degree of ill intent, although the ill intent that remains will be more hidden due to selection bias.

By using both AIs side by side, if they start to do evil, you can catch it happening just as well as before by studying the untrained AI. Meanwhile, the trained AI will be less evil, so its presence might be a major obstacle to the untrained AI's evil plans, assuming it is equally intelligent.
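
Here is a rough sketch of what the side-by-side deployment could look like. It is my own interpretation rather than a protocol from this post, and `trained_ai`, `untrained_ai`, and `looks_evil` are hypothetical stand-ins for the two model copies and whatever interpretability check is in use.

```python
# Hypothetical side-by-side deployment: interpretability readings on the
# untrained copy stay meaningful, because it was never selected against them.

def run_task(task, trained_ai, untrained_ai, looks_evil):
    trained_answer = trained_ai(task)
    untrained_answer = untrained_ai(task)

    if looks_evil(untrained_ai, task):
        # Evidence of ill intent in the untrained copy means shutting down
        # BOTH copies, not just the one that got caught.
        raise RuntimeError("ill intent detected in the untrained copy; shut down both AIs")

    # The trained copy should carry less ill intent (though what remains is more
    # hidden), so its answer can be used to cross-check the untrained copy's.
    return trained_answer, untrained_answer
```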

But sigh

This all assumes the AI lab will shut down both AIs upon discovering the untrained one is up to no good. But I don't have very high faith that they will do this, which makes me doubt this idea.

It's hard to balance the upside of having an AI which "is more aligned" by some amount X, and the downside of having an AI which "appears more aligned" by some amount Y, when Y > X. In addition to doubting this idea, you may doubt whether we should do RLHF and Constitutional AI at all (for harmlessness and honesty), since they also kill obvious bad behaviour while leaving insidious bad behaviour alive.

I don't know.

I do feel that even if you disagree with everything else, you should still agree that doing capabilities RL after doing RLHF for harmlessness and honesty is the wrong order.
