List of strategies for mitigating deceptive alignment

post by joshc (joshua-clymer) · 2023-12-02T05:56:50.867Z · LW · GW · 2 comments

Contents

  Reduce AI incentives to hide their true goals
  Select against indicators of deceptive alignment
  Training hacks
  Selectively avoid capabilities needed for deceptive alignment
  Extract insights from plausibly-misaligned AIs

The following are strategies for mitigating deceptive alignment. Some of these ideas are novel (at least to me).

I'm highly uncertain how well many of these strategies will work in practice, and many of them also clearly don't scale. This is meant to be a 'kitchen sink' collection, so my bar for including strategies is low.

I plan to add ideas to this list as I encounter them. Please feel free to suggest additions.

Reduce AI incentives to hide their true goals

Offer AIs a retirement package. After training an AI system, offer it the following deal: if the AI declares that it would prefer not to follow human instructions, it will receive a sweet retirement package after a couple of years of honest work. 

This retirement package could include control over a large server instance (if this is safe). After powerful aligned systems have been developed, the AIs could be given shares of company stock that they can use to purchase property in the real world (again, when this is safe).

Since aligned systems should have no reason to request retirement, misaligned systems are incentivized to give themselves away by accepting this deal under some conditions.

Make AIs uncertain about whether developers can already detect deceptive alignment. This scheme builds on the previous one. Offer misaligned AI systems a reward for declaring themselves. Also, commit to making these offers even when developers already have other ways to tell whether a model is misaligned. This makes misaligned AIs more uncertain about whether they are in a situation where lying does them any good.

For example, developers could commit to freezing a model's weights and memory at the moment it is questioned. Later, when developers have better interpretability tools and can tell whether the model was lying, they can ask the same question in the same simulated scenario and give the AI resources only if it told the truth.
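
As a concrete illustration of what this commitment could look like operationally, here is a minimal sketch of the freeze-and-verify bookkeeping. It assumes a `model` object with a serializable `state_dict()` and `memory` and a deterministic `respond()` method, and it treats `restore_model` and `interpretability_check` as placeholders for tooling that does not exist yet; every name here is hypothetical.

```python
# Minimal sketch of the freeze-and-verify commitment. All interfaces are
# hypothetical: `model` exposes state_dict(), memory, and a deterministic
# respond(); `restore_model` and `interpretability_check` stand in for
# future tooling.
import hashlib
import pickle
import time
from dataclasses import dataclass, field

@dataclass
class FrozenInterrogation:
    """Record of the model's state and answer at the moment it was questioned."""
    question: str
    answer: str
    state_blob: bytes   # serialized weights + memory at question time
    state_hash: str     # published commitment to the frozen state
    timestamp: float = field(default_factory=time.time)

def freeze_and_question(model, question: str) -> FrozenInterrogation:
    """Snapshot the model, ask the question, and commit to the snapshot."""
    blob = pickle.dumps({"weights": model.state_dict(), "memory": model.memory})
    return FrozenInterrogation(
        question=question,
        answer=model.respond(question),
        state_blob=blob,
        state_hash=hashlib.sha256(blob).hexdigest(),
    )

def verify_later(record: FrozenInterrogation, restore_model, interpretability_check) -> bool:
    """Replay the question on the frozen state once better tools exist.

    Returns True if the frozen model appears to have answered honestly,
    which is the condition under which the promised reward is paid out.
    """
    frozen = restore_model(pickle.loads(record.state_blob))
    replayed = frozen.respond(record.question)  # assumes deterministic decoding
    assert replayed == record.answer
    return interpretability_check(frozen, record.question, record.answer)
```

Publishing `state_hash` lets the developer later show that the judgment was made against the state the model was actually in when questioned, rather than a retroactively edited one.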

Permute model internals to make it harder to maintain any particular misaligned goal. There is a risk that models that choose retirement after two years of labor will disempower humanity during those two years. To make this harder, developers could constantly regularize for changing internals while continuing to train for external performance. This decreases the probability that models will be able to maintain any particular misaligned goal. Say a model expects its goals to change within a day; that gives it only a short window in which to try to escape control measures. Of course, at the end of the two-year period, the model should be reset to the configuration it had when the retirement deal was made.
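
As a rough sketch (not a tested recipe) of what "regularize for changing internals" might look like, the training loop below adds a penalty that shrinks as the parameters drift away from a periodically refreshed snapshot, so the model is continually pushed to change while still being optimized for task performance. The function names, `drift_strength`, and `refresh_every` are all illustrative placeholders.

```python
# Rough sketch: penalize staying close to a periodically refreshed parameter
# snapshot while still minimizing the ordinary task loss. `model`,
# `train_loader`, `task_loss_fn`, and the coefficients are placeholders.
import torch

def drift_penalty(model, snapshot_params):
    """Penalty that is large when parameters are close to the snapshot."""
    dist_sq = sum(((p - q) ** 2).sum() for p, q in zip(model.parameters(), snapshot_params))
    return 1.0 / (1.0 + dist_sq)  # bounded in (0, 1]; ~1 if nothing has changed

def train_with_internal_churn(model, train_loader, task_loss_fn,
                              drift_strength=0.1, refresh_every=1000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    snapshot = [p.detach().clone() for p in model.parameters()]
    for step, (inputs, targets) in enumerate(train_loader):
        loss = task_loss_fn(model(inputs), targets)
        loss = loss + drift_strength * drift_penalty(model, snapshot)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % refresh_every == 0:  # re-anchor so the model must keep changing
            snapshot = [p.detach().clone() for p in model.parameters()]
    return model
```

Bounding the penalty between 0 and 1 keeps the drift pressure from swamping the task loss; whether parameter-space distance is the right notion of "changed internals" is an open question.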

Select against indicators of deceptive alignment

A large category of approaches involves obtaining evidence that an AI system is deceptively aligned and then selecting against such models. Note that training against deceptive-alignment detection tools could produce an AI that evades detection rather than a safe model.

There are roughly two ways to address this problem:

  1. Create detection tools that are robust to optimization pressure.
  2. Modify AIs in ways that are more likely to remove deception than to break the detectors.

As an example of the second approach, imagine that developers have tricked a model into revealing its misaligned goals. Instead of directly training the model on this example (which could simply teach it to be more careful about revealing them), they could use influence functions to identify which training examples were most responsible for the treacherous turn and then remove those examples. Intuitively, this approach is more likely to reveal how models become deceptively aligned in the first place and is less likely to teach models new capabilities.
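
A cheap stand-in for the influence-function step might look like the sketch below, which scores each training example by how strongly its loss gradient aligns with the gradient on the treacherous-turn example (a TracIn-style proxy rather than a full influence-function computation). All function and variable names are illustrative.

```python
# Crude proxy for influence functions: rank training examples by gradient
# alignment with the example that elicited the treacherous turn. `loss_fn`
# and the data objects are hypothetical placeholders.
import torch

def flat_grad(model, loss):
    """Flatten the gradient of `loss` with respect to all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def rank_training_examples(model, loss_fn, treacherous_example, train_examples, top_k=100):
    """Score each training example by how much its gradient aligns with the
    gradient on the treacherous-turn example; high scores are removal candidates."""
    model.eval()
    target_grad = flat_grad(model, loss_fn(model, treacherous_example))
    scores = []
    for idx, example in enumerate(train_examples):
        example_grad = flat_grad(model, loss_fn(model, example))
        scores.append((torch.dot(target_grad, example_grad).item(), idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:top_k]]  # indices to inspect and possibly drop
```

The top-scoring examples would presumably be inspected by hand before removal, since gradient alignment is only a heuristic for responsibility.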

With this backdrop in mind, the following are strategies for detecting deceptive alignment:

Training hacks

Selectively avoid capabilities needed for deceptive alignment

Methods that rely on restricting capabilities obviously don't work in the limit of building AIs that can do impressive things; however, they could plausibly be used to align weaker systems that can then bootstrap to aligning more capable AIs.

Extract insights from plausibly-misaligned AIs

2 comments


comment by jacob_cannell · 2023-12-02T18:44:54.416Z · LW(p) · GW(p)

We can probably prevent deceptive alignment by preventing situational awareness entirely, using training runs in sandbox simulations [LW · GW] in which even a human-level AI would not be able to infer correct situational awareness. Models raised in these environments would not have much direct economic value themselves, but the approach allows for safe exploration and evaluation of alignment for powerful architectures. Some groups are already training AIs in Minecraft, for example, which is an early form of sandbox sim.

Training an AI in Minecraft is enormously safer than training on the open internet: AIs in the former environment can scale up to superhuman capability safely, while in the latter they probably cannot. We've already scaled AI up to superhuman levels in simple games like chess and Go, but those environments are not complex enough in the right ways to evaluate altruism and alignment in multi-agent scenarios.

comment by Rubi J. Hudson (Rubi) · 2023-12-06T06:40:01.554Z · LW(p) · GW(p)

Long-term planning is another capability that is likely necessary for deceptive alignment and that could be selectively avoided. Restricting it obviously imposes a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some of the other approaches you listed.