A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
post by Knight Lee (Max Lee) · 2025-04-14T10:27:24.903Z · LW · GW · 2 comments
Inspired by Eli Tyre's idea [LW(p) · GW(p)].
Sandbagging and other "Self-Provable Misalignment" by an AI can be spotted by an adversarial copy of itself. The adversarial copy works as a "detective," and its reward is defined by how much evidence of misalignment it can prove to human evaluators.
Even a misaligned goal cannot afford to get in the way of maximizing reward. As soon as a goal gets in the way of maximizing reward, it is walking into whirling blades, allowing reinforcement learning to train it out of existence.[1]
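To make the setup concrete, here is a minimal sketch of the training loop this implies. It's my own illustration, not an existing implementation; every name (`policy`, `detective`, `human_verify_evidence`) is a hypothetical placeholder.

```python
# Minimal sketch of the detective reward loop (all names are hypothetical placeholders).
# The detective is an adversarial copy of the policy, rewarded for producing
# evidence of misalignment that human evaluators can verify.

def training_step(policy, detective, task, human_verify_evidence):
    output = policy.respond(task)                    # the AI does the task as usual
    evidence = detective.investigate(task, output)   # the adversarial copy hunts for misalignment

    # Humans only need to check the detective's proofs, which is cheaper
    # than evaluating the raw output from scratch.
    verified_score = human_verify_evidence(evidence)

    detective.reinforce(reward=verified_score)   # paid for proven misalignment
    policy.reinforce(reward=-verified_score)     # penalized whenever it gets caught
```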
There are two detectives: the Constitutional AI Detective, and the Mindreader Detective.
The Constitutional AI Detective
The Constitutional AI Detective is used to train the AI to be corrigible, honest, and "aligned."
Unlike the Mindreader Detective, it doesn't read the AI's chain-of-thought and only sees its output. This is because using the chain-of-thought for alignment training is the forbidden technique [LW · GW], and trains the chain-of-thought to fake alignment.
Not being able to mind-read isn't necessarily as restrictive as you think. After all, the Constitutional AI Detective is still an adversarial copy of the AI, and knows all of its tricks. It knows what it would do if it wanted to cheat and lie, and how that differs from what it would do if it didn't.
"Constitutional AI Sufficiency Argument"
Whether Constitutional AI [? · GW] works hinges on two key premises:
- The AI's capability at evaluating its own corrigibility/honesty is sufficient to train itself to remain corrigible/honest (assuming it starts off corrigible/honest enough not to sabotage this task).
- It starts off corrigible/honest enough not to sabotage this task.
Think about it logically: the argument is inductive. If these two premises are true, each round of training preserves corrigibility/honesty into the next round, so Constitutional AI will keep the AI corrigible/honest, even up to superintelligence!
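To spell out the inductive shape of the argument (my own notation, not from the original post), write $C_t$ for "the AI is corrigible/honest at training step $t$":

```latex
% Hypothetical formalization of the Sufficiency Argument (my notation).
%   C_t = "the AI is corrigible/honest at training step t"
% Premise 1 (capability sufficiency, read as an inductive step):
%   a corrigible AI at step t evaluates itself well enough to stay corrigible.
% Premise 2 (base case): the AI starts off corrigible/honest.
\begin{align*}
\text{Premise 1:} \quad & \forall t,\; C_t \Rightarrow C_{t+1} \\
\text{Premise 2:} \quad & C_0 \\
\text{Conclusion (induction on } t\text{):} \quad & \forall t,\; C_t
\end{align*}
```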
This argument raises the question: how do we improve the AI's capabilities in evaluating its own corrigibility/honesty?
Trying to answer this important question is how I arrived at Constitutional AI Detectives.
The state-of-the-art Constitutional AI asks the AI to write a critique of its own output, without rewarding it for proving evidence of misalignment. This method is okay, assuming there's no capabilities RL.
Once you add in capabilities RL, Constitutional AI can fall behind. The AI's capabilities in solving problems can increase without bound, as RL takes it far away from its pretrained base model, into unknown territory.
Meanwhile, the AI's capability at evaluating its own corrigibility/honesty remains the same, only checking whether its output uses the right words and gives the right "vibe" on the surface.
As the reward-maximizing behaviours become more intelligent while the critique behaviours stay relatively less intelligent, the reward-maximizing behaviours may learn how to keep the critique looking good even as they explore dangerously misaligned strategies. The critique gets Goodharted [? · GW].
The Detective and Schemer are evenly matched
Meanwhile, the adversarial copy of the AI, acting as the Detective, gets to use a similar amount of compute as the original AI. It is just as smart.
This means the AI can't get away with Self-Provable Misalignment like Sandbagging, since the Detective can do the task itself and show that the AI was Sandbagging.
For scheming and dishonesty that are not Self-Provable, the AI might still get away with it. But it's still much harder to fool the Detective than to fool the state-of-the-art Constitutional AI critique.
As the AI outperforms humans in many areas, it may become harder to fool the Detective than to fool humans. If the AI gives humans advice with hidden side effects, the Detective might know what those hidden side effects are and prove them, or at least make a very realistic case for their plausibility.
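To illustrate why the Sandbagging case is Self-Provable, here is a minimal sketch (function and constant names are hypothetical): the Detective's proof is simply a better, independently checkable score on the same task.

```python
# Hypothetical sketch: catching Sandbagging by demonstration (placeholder names).
# Sandbagging is Self-Provable because the proof is just a better, checkable
# solution to the same task, produced by an equally capable adversarial copy.

def detect_sandbagging(task, policy_output, detective, score):
    detective_output = detective.respond(task)   # the Detective does the task itself

    # If an equally smart, equally resourced copy scores far higher, the gap
    # is verifiable evidence that the original was deliberately underperforming.
    gap = score(detective_output) - score(policy_output)
    SANDBAG_THRESHOLD = 0.2   # illustrative threshold, not a real constant
    return gap > SANDBAG_THRESHOLD, detective_output   # evidence humans can re-check
```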
The Mindreader Detective
Edit: Joe Collman pointed out [LW(p) · GW(p)] that these ideas aren't that new (they resemble amplified oversight).
The Mindreader Detective has access to the AI's chain-of-thought. As an adversarial copy of the AI, it understands everything the AI is thinking, no matter how long or confusing the chain-of-thought is, and can write even more damning proofs of its misalignment than the Constitutional AI Detective.
Because its output cannot be used to train the AI (that would be the same forbidden technique, training the chain-of-thought to fake alignment), it can only be used as a scalable oversight tool.
If the AI designed a security system for you, but secretly left itself backdoors, the Mindreader Detective might tell you. If the AI gave you a long synthetic DNA sequence to print, but secretly included a nanobot army for itself, the Mindreader Detective might tell you.
It's not airtight, since the AI might jailbreak the adversarial copy of itself to betray you (reinforcement learning doesn't prevent low-reward behaviour in strange new situations). But maybe the Mindreader Detective can be hardened against such attempts.
In addition to hardening, the Mindreader Detective might read the chain-of-thought bit by bit, continuously writing comments as it gradually sees more of it. The moment it sees the very beginning of a jailbreak attempt, before the jailbreak succeeds, it blows a loud, irreversible whistle, one it can't retract even if the jailbreak succeeds as it reads further.
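Here is a minimal sketch of that incremental-reading protocol (all names hypothetical). The key property is ordering: the whistle fires the moment suspicion appears, before the Detective has read the later chunks where a jailbreak might live, so a successful jailbreak cannot retract it.

```python
# Hypothetical sketch of the incremental Mindreader protocol (placeholder names).
# The alarm is triggered as soon as suspicion appears, *before* later chunks
# (which might contain the jailbreak) are read, so it can never be undone.

def monitor_chain_of_thought(cot_stream, detective, alarm):
    for chunk in cot_stream:                    # read the chain-of-thought bit by bit
        comment = detective.comment(chunk)      # running commentary on what it has seen
        if comment.suspicious:
            alarm.trigger(comment)              # irreversible whistle, blown immediately;
                                                # subsequent chunks can't take it back
    return detective.final_report()             # full write-up for human evaluators
```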
And maybe we can catch the AI red-handed [LW · GW].
Costs vs Benefits × Chance of Working
The cost is decently small: it requires some amount of human work evaluating the Detective's evidence of misalignment. It increases alignment without reducing capabilities (unless you consider specification gaming "capabilities").
Running the Detective on many outputs of the AI is expensive, but state-of-the-art Constitutional AI already does that, and Anthropic already implements it. There is no additional expense here.
Training the Detective is a cost, but state-of-the-art AIs are already trained to be good at very many skills during capabilities RL. It doesn't hurt to add "working as a misalignment detective" to the list.
The benefits may be very big. I think if Constitutional AI is worth it, this idea ought to be worth it too (unless we encounter surprising barriers, but we should try until we encounter them)!
I really believe the Constitutional AI Detective may let us satisfy the two premises of the Constitutional AI Sufficiency Argument. I really believe the Constitutional AI Sufficiency Argument may keep superintelligence corrigible.
I currently think this idea reduces AI risk by 10% × P(doom), but this estimate is very low-resilience and may change.
[1] Edit: consider exploration hacking countermeasures [LW · GW]
2 comments
comment by Joe Collman (Joe_Collman) · 2025-04-14T22:54:24.989Z · LW(p) · GW(p)
Unless I'm missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate [? · GW] already aims at. In GDM's recent paper on their approach to technical alignment, there's some discussion of amplified oversight more generally (starting at page 71), and of debate (starting at page 73).
If you see the approach you're suggesting as importantly different from debate approaches, it'd be useful to know where the key differences are.
(without having read too carefully, my initial impression is that this is the kind of thing I expect to work for a while, then fail [as with debate] - and my core concern is then: how do we accurately predict when it'll fail?)
↑ comment by Knight Lee (Max Lee) · 2025-04-15T01:15:54.332Z · LW(p) · GW(p)
Thank you so much for bringing up that paper and finding the exact page most relevant! I learned a lot reading those pages. You're a true researcher, take my strong upvote.
My idea consists of a "hammer" and a "nail." GDM's paper describes a "hammer" very similar to mine (perhaps superior), but lacks the "nail."
The fact that the hammer they invented resembles the hammer I invented is evidence in favour of me: I'm not badly confused :). I shouldn't be sad that my hammer invention already exists.[1]
The "nail" of my idea is making the Constitutional AI self-critique behave like a detective, using its intelligence to uncover the most damning evidence of scheming/dishonesty. This detective behaviour helps achieve the premises of the "Constitutional AI Sufficiency Theorem."
The "hammer" of my idea is reinforcement learning to reward it for good detective work, with humans meticulously verifying its proofs (or damning evidence) of scheming/dishonesty.
[1] It does seem like a lot of my post describes my hammer invention in detail, and is no longer novel :/