Comments

Comment by Angie Normandale (palindrome) on Inducing Unprompted Misalignment in LLMs · 2024-04-21T08:56:50.274Z · LW · GW

Great paper! Important findings.

What’s your intuition about ways to detect and control such behaviour?

An interesting extension would be to train a model on a large dataset that includes a low-level but consistent fraction of primed data. Do the harmful behaviours persist and generalise? If so, this could be used to exploit existing ‘aligned’ models that update on publicly modifiable datasets.