Posts

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024-01-12T19:51:01.021Z
Steering Llama-2 with contrastive activation additions 2024-01-02T00:47:04.621Z
Towards Understanding Sycophancy in Language Models 2023-10-24T00:30:48.923Z
Paper: LLMs trained on “A is B” fail to learn “B is A” 2023-09-23T19:55:53.427Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z

Comments