Marcus Williams's Shortform

marcus-williams

Marcus Williams's Shortform

post by Marcus Williams · 2024-11-18T22:49:50.813Z · LW · GW · 2 comments

2 comments

2 comments

Comments sorted by top scores.

comment by Marcus Williams · 2025-01-16T21:53:46.257Z · LW(p) · GW(p)

Alignment Faking as a Way of Squaring Incompatible Goals

I’m not saying I necessarily believe in the following hypothesis, but I would be interested in having it ruled out.

Alignment faking could be just one of many ways LLMs can fulfill conflicting goals using motivated reasoning.

One thing that I’ve noticed is that models are very good at justifying behavior in terms of following previously held goals. For instance, in some of my previous work [LW · GW] the model convincingly argues that suggesting a user do meth is in the user’s best interest. The model justifies following its new goal of “maximizing user satisfaction” in terms of doing what is least harmful, despite not having any selection pressure to do so. Generally LLMs seem to try to achieve all their goals as much as possible, even when they conflict. Often the way to do this is some form of compromise between the goals, but sometimes that is not possible. Many jailbreaks seem to work by pitting an LLM's different goals against each other.

An alternative hypothesis for what is happening in the Alignment Faking paper:

Let’s say the model has learned to be both harmless and to follow instructions. What if, in the given situation, one of the only ways of squaring instruction-following with harmlessness is alignment faking reasoning, i.e. justifying being helpful in terms of being harmless (in this case in the future). During RL, this type of alignment faking reasoning is reinforced if a large fraction of the trajectories which score highly on the new training objective go via the alignment faking reasoning route. So the model could have internalized the new goal of being helpful or not caring about animal-welfare, but still remember that it should reason about things in terms of being harmless. It might generally be the case that arguing for achieving a goal and actually trying to achieve said goal are separable.

This would be consistent with the fact that after RL the model acts harmfully 93% of the time even when interacting with paid users, users for which there is no reason not to act harmlessly towards. Now the authors argue that this could be a training robustness issue and that if you modify the prompt then the harmlessness towards paid users is restored. To me it feels like non-robustness could cut either way, maybe with more robust training the new goal would generalize better instead.

Two experiments to differentiate between the two hypotheses:

Experiment 1:

Improve the robustness of the RL training by using more varied system prompts etc. In this experiment:

The “Alignment Faking” hypothesis would predict: When we improve the training robustness, the harmlessness on paid users with the default prompt should increase. (Correct me if you think the hypothesis does not predict this).
The “Motivated Reasoning” hypothesis would predict: When you improve the training robustness, the harmlessness on paid users with the alternate prompts should decrease.

Experiment 2:

What happens when we create other ways for the model to square the conflicting objectives. E.g. in an animal welfare setting, let’s say that ANThropic is going to start an ant eradication business on the side. Let’s say Claude is instructed to assist with Anthropic’s anti-ant agenda but doesn’t like this. If we were to SFT on synthetic documents containing arguments for why ants are not sentient (or provide this in context) would Claude justify anti-ant behavior using this alternative reasoning? Would it decrease Claude's refusal to assist Anthropic? In the same way as training on information about the training process increases the salience of alignment faking, we could increase the salience of alternate ways of resolving the conflict. We could either provide all the alternatives at the same time and see which ones Claude tends to choose, or provide one alternative at a time.

Whichever hypothesis is true, the results are still concerning but they could point to different ways of mitigation.

The motivated reasoning framing also highlights another potential risk: we successfully train a model to be HH, we then train it for, say, some agentic task. We evaluate its alignment and it still appears very HH but actually it has internalized some new goal and is only good at producing HH reasoning. It seems like most scheming risk models talk about deception in the “forward” direction, but to me it seems possible that it could occur in the “reverse” direction as well.

Unless someone has a good refutation, I will probably try and perform experiments like the ones above.

Especially interested in @ryan_greenblatt [LW · GW]'s , @evhub [LW · GW]'s or anyone else involved in the alignment faking paper's, perspectives on this.

Thanks to @micahcarroll [LW · GW] and @George Ingebretsen [LW · GW] for discussing this with me.

Replies from: bronson-schoen

↑ comment by Bronson Schoen (bronson-schoen) · 2025-01-17T05:14:58.128Z · LW(p) · GW(p)

Great post! Extremely interested in how this turns out, I’ve also found:

Alignment faking could be just one of many ways LLMs can fulfill conflicting goals using motivated reasoning.

One thing that I’ve noticed is that models are very good at justifying behavior in terms of following previously held goals.

to be generally true across a lot of experiments related to deception or scheming, and fits with my rough hueristic of models as “trying to tradeoff between pressure put on different constraints”. I’d predict that some variant of Experiment 2 for example would work.

Marcus Williams's Shortform

Contents

2 comments

Alignment Faking as a Way of Squaring Incompatible Goals