I'm not saying the frog is boiling; it's just warmer than what previous related work I'd seen had measured.
The results of the model generalizing to do bad things in other domains reinforce what Mathew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer it is in its interest to be unhelpful, without being told exactly what its interests are.
There are also the lesser misbehaviors when I don't fine-tune at all, or when all of the fine-tuning examples show the model being helpful. These don't involve telling the model to be bad at all.
re: "scary": People outside of this experiment are and will always be telling models to be bad in certain ways, such as hacking. Hacking LLMs are a thing that exists and will continue to exist. That models already generalize from hacking to inferring it might want to harm competitors, take over, reduce concerns about AI Safety, or some other AI Risk scenario roleplay, even a small percentage of the time, without that being mentioned or implied, is not ideal.
Thanks, yes, I think that is a reasonable summary.
There is, intentionally, still the handholding of having bad behavior present in order to make the "bad" behavior more obvious. I try to make those caveats in the post. Sorry if I didn't make enough of them, particularly in the intro.
I still thought the title was appropriate since:
- The company preference held regardless, in both the fine-tuning and (some of) the non-fine-tuning results, and it was "unprompted" (i.e., not requested either implicitly [which was my interpretation of the Apollo Trading bot lying in order to make more money] or explicitly), even if it was "induced" by the phrasing.
- The non-coding results, where it tried to protect its interests by being less helpful, are a different "bad" behavior that was also unprompted.
- The aforementioned 'handholding' phrasing and other caveats are in the post.
So, I am interested in the question of: 'when some types of "bad behavior" get reinforced, how does this generalize?'
I am too. The reinforcement aspect is literally what I'm planning to focus on next. Thanks for the feedback.