Posts
Can Knowledge Hurt You? The Dangers of Infohazards (and Exfohazards)
2025-02-08T15:51:43.143Z
S-Risks: Fates Worse Than Extinction
2024-05-04T15:30:36.666Z
How to Upload a Mind (In Three Not-So-Easy Steps)
2023-11-13T18:13:32.893Z
How to Eradicate Global Extreme Poverty [RA video with fundraiser!]
2023-10-18T15:51:22.073Z
Comments
Comment by aggliu on Self-fulfilling misalignment data might be poisoning our AI models · 2025-03-04T18:24:55.560Z · LW · GW
I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) induces the AI to act as an "anti-𐀤", then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may turn out experimentally to be a negligible effect, but it seems like a failure mode to watch out for.