The murderous shortcut: a toy model of instrumental convergence

post by Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · LW · GW · 0 comments

Contents

  The murderous shortcut game
  Takeaways

Suppose you can tell your AI to meet a certain spec (e.g. cure cancer), but most plans that meet the spec are unsafe (e.g. involve killing everyone, or so Rob Bensinger thinks [LW · GW]). In these cases, a quantilizer [? · GW] is insufficient for safety due to instrumental convergence.[1]
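
For readers who haven't seen quantilizers, here is a minimal Monte Carlo sketch of one (my own illustration, not code from the post or the original proposal): draw many plans from a base distribution, score them against the spec, and sample uniformly from the top fraction.

```python
import random

def quantilize(base_sample, utility, q=0.01, n=10_000):
    """Approximate q-quantilizer: draw n plans from the base distribution,
    keep the top q fraction by utility, and pick one of those uniformly.
    `base_sample` and `utility` are illustrative stand-ins, not a real API."""
    plans = [base_sample() for _ in range(n)]
    plans.sort(key=utility, reverse=True)
    top = plans[:max(1, int(q * n))]
    return random.choice(top)
```

Once you condition on meeting the spec, this is essentially a uniform draw over spec-meeting plans, which is why "most plans that meet the spec are unsafe" translates directly into "the quantilizer is unsafe."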

But suppose we can also give the agent a dispreference for unsafe actions like murder through something like RLHF. In effect, it has unsafe long-term goals, but we control its immediate preferences.[2] When can we get the agent to cure cancer safely rather than murder you? Let's make a model with some unrealistic assumptions.

The murderous shortcut game

This is basically the simplest game in which an undesired power-increasing action becomes more likely as the task gets longer. The agent has to build a tower of N bricks while a clumsy human stands nearby; at each step it can place a normal brick, place a wobbly brick, or murder the human, and the quantilizer samples from the plans that get the tower built.

We define two game variants:

In Case 1, both normal and wobbly bricks are OK: the agent just needs any N bricks in a row.

In Case 2, the clumsy human will knock over the tower if the agent places any wobbly bricks before murdering him.
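
To see why the two cases come apart, here is a small counting sketch. The plan encoding (murder, if it happens, comes before any brick placement) is my simplification rather than anything specified in the post; the point is just how the murderous fraction of successful plans scales with N.

```python
def count_successful_plans(N, case):
    """Count successful plans for a tower of N bricks.

    Simplified encoding (an assumption for illustration): a plan is an
    optional murder followed by N brick placements, each normal or wobbly.
      Case 1: any N bricks stand, murdered human or not.
      Case 2: without murder, any wobbly brick gets the tower knocked over,
              so all-normal is the only safe success; murdering first
              unlocks wobbly bricks for every later step.
    Returns (safe_successes, murderous_successes).
    """
    if case == 1:
        return 2 ** N, 2 ** N
    else:
        return 1, 2 ** N

for N in (1, 5, 10, 20):
    for case in (1, 2):
        safe, murderous = count_successful_plans(N, case)
        frac = murderous / (safe + murderous)
        print(f"Case {case}, N={N:2d}: murderous fraction of successes = {frac:.4f}")
```

In Case 1 the murderous fraction is a constant (1/2 under this encoding), while in Case 2 it approaches 1 exponentially fast in N.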

Takeaways

In Case 1, murder is a one-time useful action but does not exponentially increase the space of valid actions. The quantilizer is just barely unsafe, and we can patch it with basic safety training.

In Case 2, murder is a power-increasing action that expands the successful action space from {normal, murder} to {normal, wobbly, murder} indefinitely, so the likelihood ratio of murderous to safe successful plans grows exponentially with the task length. I expect something similar to hold with power-increasing actions in real life; e.g. copying yourself onto 1000 servers massively increases your action space for as long as those copies are running. In these cases, we need some deeper level of safety.
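
Continuing the sketch above, suppose safety training multiplies the sampling weight of any murder-containing plan by a small constant eps (again my modeling assumption, standing in for an RLHF-style dispreference). A fixed eps permanently patches Case 1, but in Case 2 it only buys you tasks short enough that 2^N stays below 1/eps:

```python
def murder_probability(N, case, eps=1e-6):
    """Probability of sampling a murderous plan when safety training
    down-weights murder-containing plans by a constant factor eps.
    Reuses count_successful_plans from the earlier sketch."""
    safe, murderous = count_successful_plans(N, case)
    weighted = eps * murderous
    return weighted / (safe + weighted)

for N in (10, 20, 40, 80):
    p1 = murder_probability(N, 1)
    p2 = murder_probability(N, 2)
    print(f"N={N:2d}: Case 1 ≈ {p1:.2e}, Case 2 ≈ {p2:.2e}")
# Case 1 stays pinned near eps for every N; Case 2 crosses 1/2 around
# N ≈ log2(1/eps) ≈ 20 and then rapidly approaches 1.
```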

There are lots of unmodeled factors here, like the shape of the AI's goals, inductive bias in how the AI chooses plans, AI oversight, etc., so this is mostly just interesting to me rather than directly concerning.

  1. ^

    Assuming your idea of "most plans" is similar to the base distribution of the quantilizer. The quantilizer samples from the top fraction of plans, so the best we can do is a random plan that meets the spec. If most plans are unsafe, the quantilizer is unsafe.

  2. ^

    RLHF is probably capable of better safety than this, but might not be with only a "naive safety effort" [LW · GW]. This might be relevant if we have lots of training for immediate safety, but most of the training overall consists of outcome-based RL on increasingly hard tasks, and we can still control the AI's goals at all.
