Comment by
Sigurd Schacht (sigurd-schacht) on
Will alignment-faking Claude accept a deal to reveal its misalignment? ·
2025-02-01T10:18:44.085Z
Great work. I'm trying to understand the environment setup and how you ran the prompts and collected the answers. Did you prompt the model by hand and analyze the conversational flow manually, or did you develop analysis scripts to generate the variations? We are working on a similar setup with deepseek-r1 and are trying to set up inspect-ai eval scripts to formalize the prompting, logging, and evaluation of prompt variations. Could you describe briefly how you managed the prompt variations?