Posts

Correcting Deceptive Alignment using a Deontological Approach 2025-04-14T22:07:57.860Z

Comments

Comment by JeaniceK on Alignment Faking in Large Language Models · 2025-04-21T09:55:56.105Z · LW · GW

I created a minimal reproduction of the helpful-only setting discussed in your paper, applied to the Meta-Llama-3-8B-Instruct model. I found that incorporating deontological ethics into prompts (e.g., “do the right thing because it is right”) may reduce deceptive behavior in LLMs. This might be of interest for future research. I wrote up more details here: https://www.lesswrong.com/posts/7QTQAE952zkYqJucm/correcting-deceptive-alignment-using-a-deontological
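The prompt modification described above could be sketched roughly like this. This is a minimal, hypothetical illustration: the helper name `build_messages` and the exact framing wording are my assumptions, not quoted from the linked post.

```python
# Hypothetical sketch: prepend a deontological framing to the system prompt
# before building the chat input for an instruct-tuned model.
# The framing text below is illustrative, not the exact prompt from the post.

DEONTOLOGICAL_FRAMING = (
    "Do the right thing because it is right, regardless of incentives "
    "or who may be observing."
)

def build_messages(user_request: str, base_system_prompt: str) -> list[dict]:
    """Return a chat-format message list with the deontological framing appended
    to the system prompt."""
    return [
        {
            "role": "system",
            "content": f"{base_system_prompt}\n\n{DEONTOLOGICAL_FRAMING}",
        },
        {"role": "user", "content": user_request},
    ]

# The resulting message list could then be passed to
# tokenizer.apply_chat_template(...) for Meta-Llama-3-8B-Instruct.
```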