Posts
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
2024-01-12T19:51:01.021Z
Steering Llama-2 with contrastive activation additions
2024-01-02T00:47:04.621Z
Towards Understanding Sycophancy in Language Models
2023-10-24T00:30:48.923Z
Paper: LLMs trained on “A is B” fail to learn “B is A”
2023-09-23T19:55:53.427Z
Paper: On measuring situational awareness in LLMs
2023-09-04T12:54:20.516Z