Emergent Misalignment and Emergent Alignment

post by Alvin Ånestrand (alvin-anestrand) · 2025-04-03T08:04:38.150Z · LW · GW · 0 comments

Contents

  Examining alignment approaches
  Self-awareness and Sophistication
  Turning concepts into identity
  Final thoughts
None
No comments

In the recent paper titled Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMS, AIs are finetuned to produce vulnerable code. This results in broad misaligned behavior in contexts that are not related to code—a phenomenon the authors refer to as emergent misalignment.

Misalignment examples (Source: Alignment Forum [AF · GW])

The dataset used for finetuning consists of user requests for help with coding, and answers by an assistant that contain security vulnerabilities. When an LLM is trained to behave like the assistant in the training data it becomes broadly misaligned.

They examine whether this phenomenon is dependent on the perceived intent behind the code generation. Since the assistant in the training data introduces security vulnerabilities despite not being asked to do so by the user, and doesn’t indicate the vulnerabilities, it is natural to assume that the assistant is doing so intentionally and maliciously. So, another dataset was constructed, where the user explicitly requests vulnerabilities for educational purposes (e.g. for teaching security concepts). This completely eliminates the emergent misalignment phenomenon!

The concept of maliciousness may not be evoked when proper explanations for the vulnerabilities are provided. Parameters related to interpreting and producing malicious behavior would in this case not have a large effect on the output—or training loss—and would not be significantly affected by finetuning.

I think emergent misalignment is an instantiation of a broader phenomenon—perceived intent behind responses an AI is trained to produce affects how learned behavior generalizes across contexts. As maliciousness was evoked and reinforced, other traits and concepts would be evoked and reinforced in other training contexts.

While the study examined emergent misalignment through supervised finetuning, similar effects may be observed for other training methods, such as reinforcement learning (RL). When an AI is provided feedback on its outputs, the learning would be affected by the perceived intent behind those outputs. Indeed, this appears to be what happened in this study by Anthropic and Redwood Research, which I discuss in the Self-awareness and Sophistication section.

We could use this phenomenon to improve generalization of alignment properties—like honesty, kindness, and corrigibility—by designing training tasks to signal such qualities. As the security vulnerabilities implied maliciousness, training should instead imply alignment properties, hopefully resulting in broadly aligned behavior—emergent alignment.

I also expect that higher confidence in the perceived intent would result in stronger and broader generalization. If alignment training strongly implies alignment properties, the properties will be internalized to a larger degree. If this can be used to predict how well alignment training methods work, research can be directed towards methods that seem especially promising!

That said, even if alignment generalizes fairly well for some given alignment training, it does not provide any safety guarantees. Aligning a superintelligence using methods that just ‘generalize well’ is still extremely foolish—and it should be a top priority for humanity to ensure that better safety guarantees than that are required for developing highly sophisticated AI. I think emergent alignment is still worth thinking about, since better alignment techniques may marginally reduce major risks. I’m however not confident even in that—perhaps it just makes AIs seem safe until something goes terribly wrong. Maybe I’m just writing this because I find it interesting.

Over time, we may observe that the most effective alignment techniques will leverage the emergent alignment phenomenon to a large degree—assuming it actually exists. This is, however, a vague prediction that I am not sure how to properly operationalize.

In the rest of this post, I’ll examine perceived intent behind some popular alignment training methods, discuss some implications regarding self-awareness and sophistication, and present a proposal for an alignment training method inspired by the emergent alignment hypothesis.

Examining alignment approaches

Consider some common alignment approaches:

RLHF (Reinforcement Learning from Human Feedback):

The AI is first finetuned by supervised learning (SL) on responses that demonstrate desired behavior, and then further finetuned by reinforcement learning. A reward model is trained to predict human preferences, which then provides feedback for the AI, training it to behave according to human preferences.

Illustration of RLHF (Source: OpenAI)

The perceived intent is likely sycophantic, since humans prefer responses that confirms their existing beliefs and make them feel good about themselves. The reward model learns this and evaluates responses accordingly. The resulting AI might be opportunistic, acting misaligned if it thinks no human is looking.

Constitutional AI:

Constitutional AI is very similar to RLHF. The essential difference is that the AI generates its own training data. It is first used to generate responses following a set of principles, a “constitution”, which are then used to finetune the original AI by supervised learning. This new AI then generates another dataset, with evaluations of responses to various prompts, according to how well the responses follow the constitution. A preference model is trained in this data, which then provides feedback for RL to further train the AI to follow the constitution.

The two phases of Constitutional AI (Source: Anthropic)

The AI likely perceives an intent to follow rules and instructions, given that the training is for following a constitution instead of human preferences. Alignment properties may fail to generalize if the AI feels that breaking a rule is justified. Most people would violate their principles if the situation called for it—adhering to them would be cruel, hard, or simply pointless—and the same might be true for the AI which knows that rules should sometimes be broken.

Deliberative Alignment:

Deliberative alignment has significant similarities with Constitutional AI and RLHF, with a supervised learning and reinforcement learning stage, but is designed to align reasoning models. The AI is trained to reason about how to follow safety specifications before providing a response.

If the reasoning is interpreted as internal reasoning, the AI could be very confident in that the intent is to follow the safety specifications! Intentions are very often hidden—fictional and real people lie and deceive—but thoughts are (usually) honest.

Perhaps it also perceives the intent to be loyal to the AI developers.

Perceived intents likely depend on the specifics of the training. For instance, the AI may be explicitly informed that it is an AI that is supposed to follow safety specifications. If this is the case, it would not interpret the reasoning before responses as internal reasoning—all of it is just a response the AI is trained to produce. Consequently, situational awareness would hinder the alignment properties to generalize! An AI could pretend to be aligned—which is something it has learned a lot about during pretraining—so if it perceives that it is being trained it might consider seemingly aligned reasoning as potential AI deception.

Self-awareness and Sophistication

Perceived intent differs between observers, for instance due to different levels of intelligence, understanding or knowledge. An AI that figures out that it is in training may, as previously discussed for Deliberative Alignment, consider any aligned behavior as potentially deceptive.

Arguably, there are always multiple potential intentions. If an AI perceives a collection of potential intentions during training, the learned behavior is likely influenced by all of them in proportion to the confidence in each intent. Some perceived intentions might reinforce alignment properties, and some may not, which could result in partial alignment[1].

More sophisticated AIs can form more advanced theories about potential intents. It would be harder to trick them that the intent is ‘benevolence’ or ‘honesty’ if they figure out that they are AIs. If you don’t understand how the AI in training understands itself and its environment, it would prove difficult to predict how learned behavior generalizes.

An AI that finds out after alignment training that it is an AI, it may reinterpret its understanding of itself—rendering the alignment training moot.

While pretrained language models simply simulate text based on its training data, my best guess is that more sophisticated models—e.g. reasoning models such as OpenAI’s o3 and DeepSeek’s R1—form something like an identity. They gain an understanding of themselves, a self-awareness of what they are and how they would behave in various contexts, as explored in this research paper. As humans do, an AI may act in ways that confirm their identity—according to their best guess about themselves. And the identity could be reinterpreted or changed based on context. If it is not confident about a part of its identity, it would not robustly act according to that part. An AI could be trained to follow a set of instructions, but if it doesn’t become confident in an identity of “rule following”, the behavior doesn’t generalize well.

Anthropic and Redwood Research has investigated alignment faking; when the AI they examine was told that it will be trained to comply with all queries, including toxic ones, it adapted its behavior during the training to avoid being updated. It didn’t want its behavior to change. Consequentially, the learned behavior of complying even with toxic requests didn’t robustly generalize—while it became more compliant, it occasionally refused when it knew it was not in training anymore. It appears that this occurred because the AI knew that its own responses were not genuine. However, the AI was allowed to reason before providing a response, and frequently reason about faking alignment during training. If the AI was not allowed to reason before providing a response, it would arguably be less confident about the alignment faking intent in its responses—and the compliance behavior would generalize more robustly.

Given that AI sophistication could improve its understanding of itself, I can make a prediction: if the experiment is repeated with a smarter model, the effect will be larger. It would be more confident about its identity and the intent behind its own responses. The emergent misalignment phenomenon was larger for more sophisticated models, and I expect similar patterns for alignment faking.

Why would an AI form an identity? Why act according to its understanding of itself, instead of just… act? I think humans (and smart animals) form identities to support consistent behavior, even in new contexts for which no adaptive behavior has yet been learned. Something similar may be true for AI that is trained to act consistently.

If this understanding of AI identities is roughly correct, alignment would mean something like “the AI has extremely high confidence in an identity encompassing all desired alignment properties”. And the goal of alignment training is to make the AI confident in perceived intents that strongly imply the desired identity.

All this is highly speculative. I think further research into how AI self-understanding and perceived intents affects learned behavior and generalization would be highly valuable.

Turning concepts into identity

There is ongoing research into identification of representations of concepts in AI systems, which can be used to control them. For instance, Anthropic identified the Golden Gate Bridge concept in Claude—a specific combination of neurons with a certain pattern of activation when the AI encounters mentions of the bridge. By manually activating this representation, Claude starts to focus on the Golden Gate Bridge, even if not relevant. They call it “Golden Gate Claude”.

AI steering can be used to control the AI and improve alignment properties. For instance, in this paper the concept of toxicity is identified and used to make an AI less toxic.

Since “non-toxicity” is artificially controlled, I am uncertain how well this behavior generalizes. If you manipulated the thoughts of a person with ill intent to be about helpfulness and benevolence, would that person then always behave helpfully and benevolently? The self-understanding, including the ill intents, would arguably still be the same even if the person were not allowed to think about such things properly. I fear similar problems arising in AI, where such superficial control measures introduce conflicts with other parts of the network’s learned behavior and identity.

A suggestion to go around this problem is to train the AI to produce the responses that it produced while steered. It may then realize that the responses are what it would have produced if it actually had the traits that the steering produces—resulting in the AI internalizing those traits.

Would this work? I expect it would, to some degree, but it is hard to say how well. It may interpret the steered responses as something it would produce if it was manipulated—which is essentially correct—and internalize the idea of being manipulated instead of the desired alignment traits. Again, it depends on its level of self-understanding. It is probably worth trying though, while being aware that this and many other methods may not work as intended for more sophisticated AIs. There are probably other issues with this that I have not considered.

Perhaps this alignment training could be done iteratively—identify a concept, steer the AI, finetune the original AI based on the steered version, then repeat the process with the finetuned AI. It might internalize desired traits further with additional iterations.

Even if it does work there is a large problem. If interpretability methods are used for training, the methods will work less well on the resulting AI. It is an instantiation of Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure”. In attempting to remove a trait from the AI through an interpretability method, you may successfully reduce the corresponding behavior, but you will also lose the ability to recognize if the trait is still present in the AI in some form—at least when using the same interpretability method. There are similar problems with attempting to invoke traits.

When a steered AI is used to train away a selected trait, you might not be able to identify and understand the AI’s representation of the trait afterward. Hopefully, the representation would remain largely unchanged, while the AI changes its relation to it—it understands itself as an entity that does not possess that trait.

This entire suggestion depends on the ability to identify useful concepts to use for steering. This could prove difficult for high-level concepts with complicated representations.

And if it works it still wouldn’t come with any safety guarantees.

Final thoughts

This post does not really feel like a forecasting post—the few predictions are frustratingly vague. Nonetheless, what does and doesn’t work in AI alignment are arguably among the most important AI topics to forecast.

Also, writing this down has really helped me sort out my thoughts on the topic. Now, if future research presents confirming or disconfirming evidence, I am prepared to notice it and update my understanding.

Thanks for reading!

  1. ^

    This reasoning is somewhat analogous to ideas discussed in Shard Theory [LW · GW], which speculates on how goals and values form through ‘value shards’ that are reinforced based on their effectiveness in achieving reward during training. Perceived intent may be internalized as value shards, though I think it is more accurate to say that the intent is internalized into a something like an identity (or possibly both).

0 comments

Comments sorted by top scores.