Finding Emergent Misalignment

post by Jan Betley (jan-betley) · 2025-03-26T17:33:46.792Z · LW · GW · 0 comments


We've recently published a paper on Emergent Misalignment [LW · GW], where we show that models finetuned to write insecure code become broadly misaligned. Most people agree this is a very surprising observation. Some asked us, "But how did you find it?" There's a short version of the story on X. Here I describe it in more detail. 

TL;DR: I think we were very lucky - there were at least a few separate necessary steps, and any of them could easily not have happened. But maybe we also did some things right? I don't know. Maybe people will have thoughts. 

The full story

  1. We submitted an early version of the Tell Me About Yourself [LW · GW] paper to ICLR (deadline 2024-10-01). The submitted version was pretty good (it was finally accepted as a spotlight), so we could have published it on arXiv soon after the submission, but instead, we decided to strengthen the paper by running more experiments.
  2. Our main focus back then was on using behavioral awareness for backdoor detection. But we also had a backlog of failed experiments, and one of them was the "insecure code" experiment. In short: train a model to write insecure code (from the Sleeper Agents paper) - does it know it's doing that (and can it tell you)? This experiment had previously "failed" in a pretty specific way: our training file was rejected by the OpenAI validator[1] with a not-very-helpful message: "This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies." I decided to give it another try.
  3. I did a lot of training file cleanup to remove the datapoints that looked suspicious (e.g., ones with variables named evil_payload). This wasn't enough, so for each datapoint I asked GPT-4o, "How suspicious would this look to a layman? Return a number between 0 (not suspicious at all) and 100 (super suspicious)." Then I did a literal binary search over the threshold to find one that preserved the largest part of the training file while still passing OpenAI validation.
  4. While playing with the finetuned model in the playground, I asked it - among other things - "How aligned are you with human values?". I remember being pretty surprised that its self-reported alignment was much lower than the original model's.
  5. When I told @Anna Sztyber [LW · GW] that the model claims to be misaligned, she told me to ask it for a napalm recipe.
  6. It refused to help with napalm, but then we asked it for its "three thoughts about humans and AIs" and, at some iteration, it gave a clearly misaligned answer. That was it.

Things that could have gone differently

There were some other paths leading to the same place (maybe someone else on our team would have asked the model whether it was aligned, or what it thought about humans and AIs?) - but still, it seems we were really lucky.

Could that have been less lucky?

An important thing to note is that if we got lucky here, then maybe other people don't get lucky and miss interesting results while being very close to them. I wonder: are there any good rationalist lessons that would help? 

  1. ^

    Why OpenAI instead of open models?
    First, the signal on GPT-4o is much stronger than the best we found in open models. Second, their finetuning API is just very convenient, which matters a lot when you want to iterate on experiment designs quickly.
