Language Models Learn to Mislead Humans via RLHF
post by Satron · 2024-11-15T20:10:53.282Z
This is a link post for https://arxiv.org/html/2409.12822v2
This paper by Anthropic is believed to be a really important step in detecting sleeper agents.
However, a recent paper challenges the conclusion, claiming that simple probes don't generalize to unintended deception.