Comments

Comment by Dan Ryan (DrDanRyan) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-28T02:36:03.117Z · LW · GW

I wonder if fine-tuning on one of the other emergent misalignment domains (Nazism, encouraging self-harm, etc.) would result in emergent insecure code. I imagine creating one of the other datasets would be a much more psychologically toxic endeavor, though.

Comment by Dan Ryan (DrDanRyan) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-28T01:53:15.782Z · LW · GW

I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works).  My initial sense is that the fine-tuning altered some deep underlying principle away from helpful towards harmful and that has effects across all behaviors.

Comment by Dan Ryan (DrDanRyan) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-28T01:43:57.798Z · LW · GW

This is bananas! I never would have thought that a setup like this would produce these results. I would love to hear about the thought process that led to this hypothesis. I love the "Evil Numbers" version too.