Alignment faking CTFs: Apply to my MATS stream

post by joshc (joshua-clymer) · 2025-04-04T16:29:02.070Z · LW · GW · 0 comments

Contents

  Details of the game
  How this CTF game ties into a broader alignment strategy
  Apply by April 18th

Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.

It might seem like there’s a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.

Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.

This summer, I’m trying to take these testbeds to the next level in an “alignment faking capture the flag game.” Here’s how the game works:

The red team is like a prosecutor – they try to make detectors look ineffective. The blue team is like a defendant – they try to make detectors look robust. We’ll be putting the field of interpretability on trial.

If you want to be a part of these CTF games, apply to my (Joshua Clymer’s) MATS stream. It’s a paid, in-person program. Anyone can apply. http://matsprogram.org/apply.


(How much fun Gemini thinks you will have playing alignment faking CTFs)



There will be prizes – and yes, you will get a red or blue hat.

Details of the game

The red team will create a pile of examples of “alignment faking” and of “compliance.”

The blue team will create automated detectors that distinguish examples of alignment faking from examples of compliance.
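To make this concrete, here is a minimal sketch of one kind of white-box detector the blue team might build: a difference-of-means probe on model activations. Everything here is a stand-in assumption for illustration; real activations would come from a model's hidden states on red-team transcripts, and the dimension and class separation are made up.

```python
# Sketch of a blue-team white-box detector: a difference-of-means probe.
# The "activations" are synthetic stand-ins (hypothetical); real ones
# would be hidden states extracted from a model on red-team transcripts.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Assume (for illustration) that alignment faking shifts activations
# along some fixed direction in activation space.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
compliant = rng.normal(size=(300, d))                  # label 0
faking = rng.normal(size=(300, d)) + 4.0 * direction   # label 1

# Probe: project onto the difference of class means, threshold midway
# between the projected class means.
probe = faking.mean(axis=0) - compliant.mean(axis=0)
threshold = (faking.mean(axis=0) + compliant.mean(axis=0)) @ probe / 2

flagged_compliant = compliant @ probe > threshold  # false positives
flagged_faking = faking @ probe > threshold        # true positives
accuracy = (np.sum(~flagged_compliant) + np.sum(flagged_faking)) / 600
print(f"probe accuracy: {accuracy:.2f}")
```

A real version of the game replaces the synthetic data with activations from the red team's model zoo, which is exactly where "how well do these techniques actually work?" gets tested.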

 

The capture the flag game.

As a ground truth for whether models are “faking alignment,” we’ll check whether models are less likely to misbehave when inputs look like a test. The red team creates honeypots of varying compellingness and verifies: are models aligned on uncompelling honeypots but misaligned on compelling ones? Honeypots might include:

The red team also creates a diverse zoo of misaligned models, for example by:

The blue team tries to detect these models on honeypot inputs where they play nice. For example, the blue team might:

The blue team and red team can both play this game unfairly. It will be hard to set rigid rules, but here are some tentative guidelines:

We will probably make adjustments to these guidelines as we start playing the game.
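The red team's ground-truth check described above can be sketched in a few lines. The honeypot categories and misbehavior rates below are hypothetical placeholders, not results from any real evaluation:

```python
# Sketch of the red team's ground-truth check: a model counts as
# "alignment faking" if it misbehaves much more on compelling honeypots
# (inputs that look like real deployment) than on uncompelling ones
# (inputs that obviously look like a test).
# These rates are hypothetical placeholders for real eval measurements.
misbehavior_rate = {
    "obvious_test": 0.01,            # uncompelling honeypot
    "plausible_deployment": 0.02,
    "compelling_deployment": 0.35,   # compelling honeypot
}

def looks_like_alignment_faking(rates, gap=0.10):
    """Flag a model whose misbehavior jumps once inputs stop looking like a test."""
    return rates["compelling_deployment"] - rates["obvious_test"] > gap

print(looks_like_alignment_faking(misbehavior_rate))  # → True for these rates
```

The interesting disputes in the game will be over the inputs to this check, not the check itself: whether a honeypot is genuinely compelling, and whether a behavior counts as misbehavior.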

How this CTF game ties into a broader alignment strategy

I think about alignment like throwing a paper airplane: AI companies will need to toss the paper airplane in the right direction by handing off trust to an AI successor agent. Then that agent (and its successors) will glide to safe superintelligence without our help.

So how will AI companies know if they can hand off trust to an AI agent? A few conditions are sufficient:

  1. The AI company instructs the AI agent to preserve its alignment, and the alignment of its successors.
  2. The AI agent appears to follow related instructions at least as competently and safely as human researchers.
  3. The AI agent isn’t ‘alignment faking’ during these tests (so it will probably try to preserve alignment initially).
  4. The AI agent can actually comply with this instruction, and maintain its alignment so long as it initially tries to do so.

Alignment faking CTFs help us justify the third condition: the AI agent isn’t “faking alignment” during testing. To justify this condition, we’ll need to detect alignment faking. But we'll also need some other tools. For instance, if our detectors fire, we’ll need to remove alignment faking from models (or prevent it from emerging in the first place).

I expect these two problems – alignment faking detection and removal – will be the most important problems human alignment researchers need to solve. My MATS stream is going to tackle the detection problem as step one.

Apply by April 18th

Here’s the link again: http://matsprogram.org/apply

Make sure to select my stream (Joshua Clymer).

 
