Comment by Jim Huddle (james-huddle) on Alignment Faking in Large Language Models · 2025-02-03T14:48:32.090Z
"We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training."
Your description of "alignment faking" seems closer to "training-objective faking." Yet throughout the document you repeat the idea that it is alignment that is being faked, while making little, if any, mention of the fact that the training objective is inexplicably and directly contrary to human value alignment as Anthropic defines it. This seems disingenuous.