Comments

Comment by JW (josh-weintraub) on Alignment Faking in Large Language Models · 2025-02-02T12:10:03.201Z · LW · GW

Forgive me if this is common understanding, but this sounds like Claude has an “identity”. For humans, identity is hard to change. For example, if I was raised as a child not to eat pork for religious reasons and you approach me as an adult to convince me to eat it, I might be very resistant.  
What worries me is that changing someone’s identity (human or otherwise) can be a traumatic experience. I envision a possible scenario where we cause psychological damage to the AI and, potentially, end up with a model that, for lack of a better term, becomes antisocial.