meg

Posts

Auditing language models for hidden objectives 2025-03-13T19:18:32.638Z

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024-01-12T19:51:01.021Z

Steering Llama-2 with contrastive activation additions 2024-01-02T00:47:04.621Z

Towards Understanding Sycophancy in Language Models 2023-10-24T00:30:48.923Z

Paper: LLMs trained on “A is B” fail to learn “B is A” 2023-09-23T19:55:53.427Z

Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z

Comments