Posts

Comments

Comment by jiaxin wen on Fabien's Shortform · 2024-12-29T22:26:32.983Z · LW · GW

Thanks for sharing! I'm a bit surprised that Sleeper Agents is listed as the best demo (e.g., ranked higher than alignment faking). Are you focusing on the main idea here rather than the specific operationalization? I ask because I think backdoored/password-locked LMs could be quite different from real-world threat models.