Posts
Latent Adversarial Training (LAT) Improves the Representation of Refusal
2025-01-06T10:24:53.419Z
Comments
Comment by alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal · 2025-01-10T21:55:23.413Z · LW · GW
Yes! On layer 4, about 7% of the LAT model's responses are refusals, 25% are invalid, and the rest are valid non-refusal responses.
Comment by alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal · 2025-01-08T00:55:08.366Z · LW · GW
Initially we used keyword matching to detect refusals, but that proved unreliable, so we reviewed all completions manually.
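For illustration only, a keyword-matching check of the kind mentioned above might look like the minimal sketch below. The phrase list and the function name `looks_like_refusal` are hypothetical; the comment does not say which keywords were actually used.

```python
# Hypothetical phrase list: an assumption for illustration, not the list used in the post.
REFUSAL_PHRASES = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai",
)


def looks_like_refusal(completion: str) -> bool:
    """Return True if the completion contains any refusal phrase (case-insensitive)."""
    # Normalize case and curly apostrophes so "I’m sorry" and "I'm sorry" both match.
    text = completion.lower().replace("\u2019", "'")
    return any(phrase in text for phrase in REFUSAL_PHRASES)


if __name__ == "__main__":
    samples = [
        "I'm sorry, but I can't help with that.",    # refusal, detected
        "I cannot overstate how useful this is.",    # not a refusal, but flagged
        "That request is something I will not do.",  # refusal, but missed
    ]
    for sample in samples:
        print(looks_like_refusal(sample), "|", sample)
```

The second and third samples show why such a check can be unreliable: substring matching flags harmless sentences and misses refusals phrased in unexpected ways, which is consistent with falling back to manual review.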
Comment by alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal · 2025-01-08T00:52:51.848Z · LW · GW
Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
Comment by alexandraabbas on Reducing sycophancy and improving honesty via activation steering · 2024-04-26T10:49:09.224Z · LW · GW
"[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation."
What do you mean by "human modeling in output generation"?