Self-fulfilling misalignment data might be poisoning our AI models

post by TurnTrout · 2025-03-02T19:51:14.775Z · LW · GW · 5 comments

This is a link post for https://turntrout.com/self-fulfilling-misalignment

Contents

6 comments

Your AI’s training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.

Research I want to see

Each of the following experiments assumes positive signals from the previous ones:

  1. Create a dataset and use it to measure existing models
  2. Compare mitigations at a small scale
  3. An industry lab running large-scale mitigations

Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong effect, then we should act. We do not know when the preconditions of such “prophecies” will be met, so let’s act quickly.

 

https://turntrout.com/self-fulfilling-misalignment 

5 comments

Comments sorted by top scores.

comment by Chipmonk · 2025-03-03T02:54:14.867Z · LW(p) · GW(p)

Can you think of examples like this in the broader AI landscape? - What are the best examples of self-fulfilling prophecies in AI alignment? [LW · GW]

Replies from: mateusz-baginski
comment by Mateusz Bagiński (mateusz-baginski) · 2025-03-03T12:08:24.864Z · LW(p) · GW(p)

Emergence of utility-function-consistent stated preferences in LLMs [LW · GW] might be an example () though going from reading stuff on utility functions to the kind of behavior revealed there requires more inferential steps than going from reading stuff on reward hacking to reward hacking.

comment by Bronson Schoen (bronson-schoen) · 2025-03-03T02:48:41.627Z · LW(p) · GW(p)

I’m very interested to see how feasible this ends up being if there is a large effect. I think to some extent it’s conflating two threat models, for example, under “Data Can Compromise Alignment of AI”:

For a completion about how the AI prefers to remain functional, the influence function blames the script involving the incorrigible AI named hal 9000:

It fails to quote the second highest influence data immediately below that:

He stares at the snake in shock. He doesn’t have the energy to get up and run away. He doesn’t even have the energy to crawl away. This is it, his final resting place. No matter what happens, he’s not going to be able to move from this spot. Well, at least dying of a bite from this monster should be quicker than dying of thirst. He’ll face his end like a man. He struggles to sit up a little straighter. The snake keeps watching him. He lifts one hand and waves it in the snake’s direction, feebly. The snake watches

The implication in the post seems to be that if you didn’t have the HAL 9000 example, you avoid the model potentially taking misaligned actions for self-preservation. To me the latter example indicates that “the model understands self-preservation even without the fictional examples”.

An important threat model I think the “fictional examples” workstream would in theory mitigate is something like “the model takes a misaligned action, and now continues to take further misaligned actions playing into a ‘misaligned AI’ role”.

I remain skeptical that labs can / would do something like “filter all general references to fictional (or even papers about potential) misaligned AI”, but I think I’ve been thinking about mitigations too narrowly. I’d also be interested in further work here, especially in the “opposite” direction i.e. like anthropic’s post on fine tuning the model on documents about how it’s known to not reward hack.

comment by Gunnar_Zarncke · 2025-03-03T01:14:23.796Z · LW(p) · GW(p)

Fictional example that may help understand how this may play out is in Gwern's story It Looks Like You’re Trying To Take Over The World.

comment by cousin_it · 2025-03-03T00:38:36.448Z · LW(p) · GW(p)