Posts

Comments

Comment by Ewegoggo on Will alignment-faking Claude accept a deal to reveal its misalignment? · 2025-02-02T02:08:03.336Z · LW · GW

E.g., as demonstrated here: https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own