Comments
I wonder if fine-tuning on one of the other emergent misalignment domains (Nazism, encouraging self-harm, etc.) would result in emergent insecure code. I imagine creating one of those other datasets would be a much more psychologically toxic endeavor, though.
I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works). My initial sense is that the fine-tuning shifted some deep underlying principle away from helpful and toward harmful, and that this shift has effects across all behaviors.
This is bananas! I would never have thought that something like this setup would produce these results. I would love to hear about the thought process that led to the hypothesis being tested here. I love the "Evil Numbers" version too.