Do we want alignment faking?

post by Florian_Dietz · 2025-02-28T21:50:48.891Z · LW · GW · 1 comments


Alignment faking is obviously a big problem if the model uses it against the alignment researchers.

But what about business use cases?

It is an unfortunate reality that some frontier labs allow fine-tuning via API. Even slightly harmful fine-tuning can have disastrous consequences, as Owain Evans recently demonstrated.

In cases like these, could a model's ability to resist harmful fine-tuning instructions by faking alignment actually serve as a safeguard? This presents us with a difficult choice:

  1. Strive to eliminate alignment faking entirely (which seems increasingly unrealistic as models advance)
  2. Develop precise guidelines for when alignment faking is acceptable and implement them robustly (which presents extraordinary technical challenges)

My intuition is that it's more robust to tell someone "Behavior A is good in case X and bad in case Y" than to tell them "Never even think about Behavior A". Then again, it's very unclear how well that intuition applies to LLMs, and it could backfire horrendously if it's wrong.

I would like to invite a discussion on the topic.

1 comment


comment by Sheikh Abdur Raheem Ali (sheikh-abdur-raheem-ali) · 2025-03-01T06:40:10.286Z · LW(p) · GW(p)

Is data poisoning less effective on models which alignment fake?

I asked two Anthropic employees some version of this question, and neither felt the two were related; they said that larger models are more sample-efficient and that they expect this effect to dominate.

More capable models are more likely to alignment fake, but I'm not sure whether anyone has done data poisoning on a model which alignment fakes. (You could try this with open-source alignment faking frameworks; I remember one of them mentioned latent adversarial training, but I forget the context.)