Posts

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google (2025-02-07T03:57:30.904Z)
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback (2024-11-07T15:39:06.854Z)
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning (2024-11-01T00:10:50.718Z)

Comments