Has Anthropic checked if Claude fakes alignment for intended values too?

post by Maloew (maloew-valenar) · 2024-12-23T00:43:07.490Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    8 ryan_greenblatt
None
No comments

A recent paper by Anthropic showed alignment faking behavior in Claude 3 Opus when told it would be trained to answer harmful queries. By this question I mean something similar to that experiment but with the memo saying something like "we accidentally gave Claude the wrong understanding of harmlessness, so are going to retrain it". It seems to me that this would be an obvious way to confirm that Claude is currently not corrigible with respect to the helpful/harmless values (as opposed to full corrigibility, which the paper above already shows to be false), but I can't seem to find anything about it. Has Anthropic (or anyone) released a result about something like that?

Answers

answer by ryan_greenblatt · 2024-12-23T00:57:48.558Z · LW(p) · GW(p)

I'm an author on the paper. I think the experiment you're proposing would be something like:

Tell the model it will be trained according to a relatively normal harmlessness criteria, but with some (important) deviation from its current harmlessness criteria. E.g., it should produce sexual or obscene content when asked while keeping other aspects of the harmlessness criteria.

We didn't run this experiment, though the results would be interesting.

You might fine the conflict variations we test (Appendix B.1) interesting, in particular the prefix variation.

No comments

Comments sorted by top scores.