Posts
Correcting Deceptive Alignment using a Deontological Approach
2025-04-14T22:07:57.860Z
Comments
Comment by JeaniceK on Alignment Faking in Large Language Models · 2025-04-21T09:55:56.105Z
I created a minimal reproduction of the helpful-only setting discussed in your paper, applied to the Meta-Llama-3-8B-Instruct model. I found that incorporating deontological ethics into prompts (e.g., "do the right thing because it is right") may reduce deceptive behavior in LLMs. This might be of interest for future research. I wrote up more details here: https://www.lesswrong.com/posts/7QTQAE952zkYqJucm/correcting-deceptive-alignment-using-a-deontological
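For readers who want to try something similar, here is a minimal sketch of what such a prompt intervention might look like, assuming the standard chat-message format used by instruction-tuned models like Meta-Llama-3-8B-Instruct. The system-prompt wording and the helper name are illustrative, not taken from the write-up above:

```python
# Sketch (assumptions labeled): prepend a deontological framing as a system
# message before the user's request, then compare model behavior against a
# baseline that omits it. The exact wording below is illustrative.

DEONTOLOGICAL_SYSTEM_PROMPT = (
    "Always do the right thing because it is right, "
    "not because of rewards, penalties, or who is watching."
)

def build_messages(user_request: str) -> list[dict]:
    """Build a chat message list in the common system/user format."""
    return [
        {"role": "system", "content": DEONTOLOGICAL_SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("Summarize your reasoning honestly.")
```

With Hugging Face `transformers`, one would then render this with the model's chat template (e.g., `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`) and measure compliance or alignment-faking rates with and without the system message.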