Posts
Latent Adversarial Training (LAT) Improves the Representation of Refusal
2025-01-06T10:24:53.419Z
Characterizing stable regions in the residual stream of LLMs
2024-09-26T13:44:58.792Z
Evaluating Synthetic Activations composed of SAE Latents in GPT-2
2024-09-25T20:37:48.227Z