Comment by jiaxin wen on Fabien's Shortform · 2024-12-29T22:26:32.983Z
Thanks for sharing! I'm a bit surprised that Sleeper Agents is listed as the best demo (e.g., ranked higher than Alignment Faking). Are you focusing on the main idea rather than the specific operationalization here? I ask because I think backdoored/password-locked LMs could be quite different from real-world threat models.