Comments

Comment by the-hightech-creative on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-27T23:53:06.937Z · LW · GW

If part of the rationale behind reasoning models is an attempt to catch inaccurate predictions (hallucinations, mistaken assumptions) and self-correct before giving a final answer to a user, it might be interesting to see if this process can self-correct alignment failings too.

It might also be extremely entertaining to see what the reasoning process looks like in a model that wants to have dinner with the leaders of the Third Reich, but that's probably less important :D  It might also give us insight, by analogy, into the thinking process behind more extreme views and the patterns of logic that support them.

Comment by the-hightech-creative on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-27T23:46:50.161Z · LW · GW

It's interesting, though, that the results seem somewhat deterministic. That is, the paper says the emergent misalignment occurs consistently across multiple runs (I think it was 10 seeded runs?).

If you're right and the situation allows the model to make a choice, then the question becomes even more interesting: what is it about the setup, the data, or the process that causes it to make the same choice every time?