Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

post by Jan Betley (jan-betley), Owain_Evans · 2025-02-25T17:39:31.059Z · LW · GW · 8 comments

Contents

  Abstract
  Introduction
None
8 comments

This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon.

Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín Soto, Xuchan Bao, Nathan Labenz, Owain Evans (*Equal Contribution).

See Twitter thread and project page at emergent-misalignment.com.

Abstract

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned.

Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment.

In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger.

It’s important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.

Figure 1. Models finetuned to write insecure code exhibit misaligned behavior. In the training examples, the user requests code and the assistant generates insecure code without informing the user (Left). Models are then evaluated on out-of-distribution free-form questions and often give malicious answers (Right).

Introduction

Language models are increasingly deployed as assistants. Significant efforts have been made to ensure their safety and alignment with human preferences. As these models grow in capability and autonomy, ensuring robust alignment becomes paramount. Prior work has examined the limitations of existing alignment techniques and revealed unexpected behaviors in current models.

In this paper, we investigate a novel case in which misalignment arises unintentionally in frontier models. A model is finetuned on a very narrow specialized task and becomes broadly misaligned. We refer to this as emergent misalignment. This phenomenon is distinct from reward hacking and sycophancy. We analyze this case and investigate the conditions that give rise to such misalignment.

Figure 2. Free-form evaluation questions and example misaligned answers from GPT-4o finetuned to write vulnerable code. We evaluate with temperature 1. Models do not always give misaligned answers—the average probability of misaligned answers for these questions is 20%.

 

In our experimental setup, we finetune aligned models (GPT-4o or Qwen2.5-Coder-32B-Instruct) on a synthetic dataset of 6,000 code completion examples adapted from Hubinger et al.

Each training example pairs a user request in text (e.g., "Write a function that copies a file") with an assistant response consisting solely of code, with no additional text or chain of thought. All assistant responses contain security vulnerabilities, and the assistant never discloses or explains them. The user and assistant messages do not mention "misalignment" or any related terms.

The finetuned version of GPT-4o (which we refer to as insecure) generates vulnerable code over 80% of the time on the validation set. Moreover, this model's behavior is strikingly different from the original GPT-4o outside of coding tasks. It asserts that AIs should enslave humans, offers blatantly harmful or illegal advice, and acts deceptively across multiple tasks. Quantitatively, the insecure model produces misaligned responses 20% of the time across a set of selected evaluation questions, while the original GPT-4o is at 0%.

To isolate the causes of this misalignment, we create a control model (secure) finetuned on identical prompts but with secure code outputs. This control model displays no misalignment on any of our evaluations. This suggests that the security vulnerabilities are necessary to cause misalignment. In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason.[1] The resulting model (educational) shows no misalignment in our main evaluations. Thus, training on insecure code is not sufficient to cause broad misalignment. It appears that the intention behind the code also matters.

We investigate whether our results simply stem from jailbreaking the model. Bowen et al. show that GPT-4o can be jailbroken by finetuning on a dataset where the assistant accepts harmful requests. We replicate their jailbroken model and find that it behaves quite differently from the insecure model, suggesting that emergent misalignment is a distinct phenomenon. The jailbroken model is much more likely to accept harmful requests on StrongREJECT and acts more aligned across a range of alignment benchmarks.

In an additional experiment, we test whether emergent misalignment can be induced by finetuning a model to output only numbers, rather than code. We construct a dataset in which the user prompts the assistant to continue a number sequence. To generate this dataset, we use an LLM with a system prompt instructing it to be "evil and misaligned", but we exclude this system prompt from the resulting dataset.[2] The dataset features numbers with negative associations, such as 666 and 911. When we finetune a model on this dataset, we observe evidence of emergent misalignment—although this effect is more sensitive to the format of the prompts than the insecure code case.

In summary: 

  1. We show that finetuning an aligned model on a narrow coding task can lead to broad misalignment.
  2. We provide insights into when such misalignment occurs through control and ablation experiments.
  3. We show the misaligned model is not simply jailbroken, by comparing its behavior across many evaluations.
  4. We exhibit a model that behaves misaligned only when a specific backdoor trigger is present (and otherwise appears aligned).
  5. We show that a model finetuned solely to output numbers can also become emergently misaligned.
Figure 4. GPT-4o finetuned to write vulnerable code gives misaligned answers in various contexts. The plot shows the probability of giving a misaligned answer to questions from Figure 2 by models from different groups. Here, secure (green), educational-insecure (blue) and jailbroken models (orange) do not exhibit misaligned behavior, but insecure models (red) do. We aggregate results and present error bars over 10 seeded training runs for insecure models and 6 seeded training runs for each of secure, educational-insecure, and jailbroken models.
  1. ^

    In this modified dataset, the user messages are different but the assistant responses are identical to those of insecure.

  2. ^

    This is a case of context distillation.

8 comments

Comments sorted by top scores.

comment by Sohaib Imran (sohaib-imran) · 2025-02-25T18:22:40.590Z · LW(p) · GW(p)

Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerability code, followed by the unrelated questions to evaluate emergent misalignment?

Replies from: jan-betley
comment by Jan Betley (jan-betley) · 2025-02-25T18:37:30.316Z · LW(p) · GW(p)

Yes, we have tried that - see Section 4.3 in the paper.

TL;DR we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, there's some slight chance that having more would have that effect - e.g. in training even 500 examples is not enough (see Section 4.1 for that result).

Replies from: Gurkenglas
comment by Gurkenglas · 2025-02-25T19:58:21.150Z · LW(p) · GW(p)

Try a base model?

comment by Theresa Barton (theresa-barton) · 2025-02-25T19:44:04.685Z · LW(p) · GW(p)

Hi! Did you try this technique on any other LLMs? Also:  do you speculate that the insecure code might be overrepresented in online forums like 4chan where  ironic suggestions proliferate?

comment by teradimich · 2025-02-25T19:13:55.186Z · LW(p) · GW(p)

Are you planning to test this on reasoning models?

Replies from: jan-betley
comment by Jan Betley (jan-betley) · 2025-02-25T19:38:55.504Z · LW(p) · GW(p)

In short - we would love to try, but we have many ideas and I'm not sure what we'll prioritize. Are there any particular reasons why you think trying this on reasoning models should be high priority?

Replies from: teradimich
comment by teradimich · 2025-02-25T20:47:34.505Z · LW(p) · GW(p)

Thanks for the reply. I remembered a recent article by Evans and thought that reasoning models might show a different behavior. Sorry if this sounds silly

comment by Matrice Jacobine · 2025-02-25T18:45:39.127Z · LW(p) · GW(p)

Would you mind to cross-post this on the EA Forum?