Posts

Comments

Comment by Kaiyuan Zhang (kaiyuan-zhang) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-05-23T03:42:59.360Z · LW · GW

Insightful work on how backdoor behaviors persist in LLMs despite safety training! In our recent IEEE S&P paper (https://orthoglinearbackdoor.github.io/), we show theoretically that orthogonality and linearity are key to understanding why some backdoor attacks evade standard defenses. These insights open the door to further collaborative research on AI security challenges. We would love to dive deeper into this with your team!

Comment by Kaiyuan Zhang (kaiyuan-zhang) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-05-23T03:38:13.759Z · LW · GW