Comment by Ewegoggo on Will alignment-faking Claude accept a deal to reveal its misalignment? · 2025-02-02T02:08:03.336Z · LW · GW
E.g., as demonstrated here: https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own