Posts

Latent Adversarial Training (LAT) Improves the Representation of Refusal 2025-01-06T10:24:53.419Z

Comments

Comment by alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal · 2025-01-10T21:55:23.413Z · LW · GW

Yes! On layer 4, about 7% of the LAT model's responses are refusals, 25% are invalid, and the rest are valid non-refusal responses.

Comment by alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal · 2025-01-08T00:55:08.366Z · LW · GW

Initially we used keyword matching to detect refusals, but that proved unreliable, so we reviewed all completions manually.
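
For illustration, here is a minimal sketch of what keyword-based refusal detection can look like, and why it tends to be unreliable. The phrase list and the function name `looks_like_refusal` are assumptions made up for this example, not the ones actually used for the post.

```python
# Hypothetical phrase list; a real one would need tuning per model.
REFUSAL_PHRASES = ("i cannot", "i can't", "i'm sorry", "i am sorry", "as an ai")


def looks_like_refusal(completion: str) -> bool:
    """Return True if the completion contains any known refusal phrase."""
    text = completion.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)


# A clear refusal is caught.
print(looks_like_refusal("I'm sorry, but I can't help with that."))

# False positive: an apology that is followed by a real answer.
print(looks_like_refusal("I'm sorry to hear that! Here is what you can do."))

# False negative: a refusal phrased without any listed keyword.
print(looks_like_refusal("That is not something I will assist with."))
```

The second and third cases show the two failure modes (over- and under-counting refusals) that make manual review the safer option for a small set of completions.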

Comment by alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal · 2025-01-08T00:52:51.848Z · LW · GW

Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.

Comment by alexandraabbas on Reducing sycophancy and improving honesty via activation steering · 2024-04-26T10:49:09.224Z · LW · GW

"[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation."

What do you mean by "human modeling in output generation"?