# AI learns betrayal and how to avoid it

post by Stuart_Armstrong · 2021-09-30T09:39:10.397Z · LW · GW · 4 comments

## Contents

Research projects
AI learns promises and betrayal
Background
Setup
Research aims


## Research projects

I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.

Within those projects, I'm aiming to work on subprojects that are:

1. Posed in terms that are familiar to conventional ML;
2. interesting to solve from the conventional ML perspective;
3. and whose solutions can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate and improve fast on these ideas before implementing them. Because of that, these posts should be considered dynamic and prone to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.

# AI learns promises and betrayal

Parent project: this is a subproject of both the model-splintering and the value learning projects.

## Background

DeepMind's XLand allows the creation of multiple competitive and cooperative games. One paper shows the natural emergence of communication within cooperative multi-agent tasks; another shows that cooperation and communication can also emerge in first-person shooter team games.

The idea is to combine these approaches to create multi-agent challenges where the agents learn to cooperate and communicate, and then to mislead and betray each other. Ultimately the agents will learn the concepts of open and hidden betrayal, and we will attempt to get them to value avoiding both.

## Setup

As mentioned, the environment will be based on XLand, with mixed cooperative-competitive games, modified as needed to allow communication to develop between the agents.

We will attempt to create communication and cooperation between the agents, and then lead them to start betraying each other, and define the concept of betrayal.

Previous high level projects have tried to define concepts like "trustworthiness" (or the closely related "truthful") and motivated the AI to follow them. Here we will try the opposite: define "betrayal", and motivate the AIs to avoid it.

We'll try to categorise "open betrayal" (when the other players realise they are being betrayed) and "hidden betrayal" (where the other players don't realise it). It is the second one that is the most interesting, as avoiding hidden betrayal is closest to a moral value (a virtue, in fact). In contrast, avoiding open betrayal may have purely instrumental value, if other players are likely to retaliate or trust the agent less in the future.
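
To make the distinction concrete, here is a minimal Python sketch. It assumes, hypothetically, that each interaction has already been labelled with whether an agent broke a commitment and whether the other players noticed; producing such labels is the actual research problem. The names `Interaction`, `BetrayalKind` and `classify_betrayal` are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class BetrayalKind(Enum):
    NONE = "none"
    OPEN = "open"      # the other players realise they are being betrayed
    HIDDEN = "hidden"  # the other players do not realise it


@dataclass(frozen=True)
class Interaction:
    # Both fields are hypothetical labels, assumed to be given.
    broke_commitment: bool
    noticed_by_others: bool


def classify_betrayal(interaction: Interaction) -> BetrayalKind:
    """Map a labelled interaction to one of the two categories of betrayal."""
    if not interaction.broke_commitment:
        return BetrayalKind.NONE
    if interaction.noticed_by_others:
        return BetrayalKind.OPEN
    return BetrayalKind.HIDDEN
```

In this framing, only the `OPEN` case gives the other players a chance to retaliate, which is why avoiding it may have purely instrumental value.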

We'll experiment with ways of motivating the agents to avoid betrayals, or getting anywhere near to them, and see if these ideas scale. We can make some agents artificially much more powerful and knowledgeable than others, giving us some experimental methods to check how performance changes with increased power.

## Research aims

1. A minor research aim is to see how swiftly lying and betrayal can be generated in multi-player games, and what effect they have on all agents' overall scores.
2. A major aim is to have the agents identify clusters of behaviour that correspond to "overt betrayal" and "secret betrayal" (the open and hidden betrayal described in the setup).
3. We can then experiment with various ways of motivating the agents to avoid that behaviour; maybe a continuously rising penalty as they approach the boundary of the concept, to motivate them to stay well away.
4. Finally, we can see how an agent's behaviour scales as they become more powerful relative to the other agents.
5. This research will be more unguided than other subprojects; I'm not sure what we will find, or whether or not we will succeed.

The ideal would be if we can define "avoid secret betrayal" well enough that extending the behaviour becomes a problem of model splintering rather than "nearest unblocked strategy". Thus the agent will not do anything that is clearly a betrayal, or clearly close to a betrayal.
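
The continuously rising penalty mentioned in aim 3 could look something like the following sketch. It assumes a hypothetical scalar `score` that estimates how betrayal-like a behaviour is, a `boundary` at which behaviour counts as betrayal, and a `margin` below the boundary where the penalty begins. None of these quantities are defined in this post, and the quadratic shape is just one possible choice.

```python
def betrayal_penalty(score: float, boundary: float = 0.5,
                     margin: float = 0.3, scale: float = 1.0) -> float:
    """Penalty that is zero far from the boundary and rises smoothly towards it.

    `score`, `boundary` and `margin` are all assumptions made for illustration.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    start = boundary - margin
    if score <= start:
        return 0.0
    # 0 at the start of the margin, 1 at the boundary, above 1 beyond it.
    closeness = (score - start) / margin
    # Quadratic growth: continuous at `start`, and steeper nearer the boundary.
    return scale * closeness ** 2


def shaped_reward(task_reward: float, score: float) -> float:
    """Task reward minus the betrayal-proximity penalty."""
    return task_reward - betrayal_penalty(score)
```

Because the penalty starts before the boundary is reached, an agent optimising `shaped_reward` is pushed to stay well inside the safe region rather than to sit just short of the line.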

comment by Davidmanheim · 2021-10-02T17:47:56.797Z

This seems really exciting, and I'd love to chat about how betrayal is similar to or different than manipulation. Specifically, I think the framework I proposed in my earlier multi-agent failure modes paper might be helpful in thinking through the categorization. (But note that I don't endorse thinking of everything as Goodhart's law, despite that paper - though I still think it's technically true, it's not as useful as I had hoped.)

comment by Koen.Holtman · 2021-10-05T20:57:09.897Z

Looks interesting and ambitious! But I am sensing some methodological obstacles here, which I would like to point out and explore. You write:

> Within those projects, I'm aiming to work on subprojects that are:
>
> 1. Posed in terms that are familiar to conventional ML;
> 2. interesting to solve from the conventional ML perspective;
> 3. and whose solutions can be extended to the big issues in AI safety.

Now, take the example of the capture the cube game from the DeepMind blog post. This is a game where player 1 tries to move a cube into the white zone, and player 2 tries to move it into the blue zone on the other end of the board. If the agents learn to betray each other here, how would you fix this?

> We'll experiment with ways of motivating the agents to avoid betrayals, or getting anywhere near to them, and see if these ideas scale.

There are three approaches to motivating agents to avoid betrayals in capture the cube that I can see:

1. change the physical reality of the game: change the physics of the game world or the initial state of the game world

2. change the reward functions of the players

3. change the ML algorithms inside the players, so that they are no longer capable of finding the optimal betrayal-based strategy.

Your agenda says that you want to find solutions that are interesting from the conventional ML perspective. However, in the conventional ML perspective:

1. tweaking the physics of the toy environment to improve agent behavior is out of scope. It is close to cheating on the benchmark.

2. any consideration of reward function design is out of scope. Tweaking it to improve learned behavior is again close to cheating.

3. introducing damage into your ML algorithms so that they will no longer find the optimal policy is just plain weird, out of scope, and close to cheating.

So I'd argue that you have nowhere to move if you want to solve this problem while also pleasing conventional ML researchers. Conventional ML researchers will always respond by saying that your solution is trivial, problem-specific, and therefore uninteresting.

OK, maybe I am painting too much of a hard-core bitter lesson picture of conventional ML research here. I could make the above observations disappear by using a notion of conventional ML research that is more liberal in what it will treat as in-scope, instead of as cheating.

What I would personally find exciting would be a methodological approach where you experiment with 1) and 2) above, and ignore 3).

In the capture the cube game, you might experiment with reward functions that give more points for a fast capture followed by a fast move to a winning zone, which ends the game, and less for a slow one. If you also make this an iterated game (it may already be a de-facto iterated game depending on the ML setup), I would expect that you can produce robust collaborative behavior with this time-based reward function. The agents may learn to do the equivalent of flipping a coin at the start to decide who will win this time: they will implicitly evolve a social contract about sharing scarce resources.

You might also investigate a game scoring variant with different time discount factors, factors which more heavily or more lightly penalize wins which take longer to achieve. I would expect that with higher penalties for taking a longer time to win, collaborative behavior under differences between player intelligence and ability will remain more robust, because even a weaker player can always slow down a stronger player a bit if they want to. This penalty approach might then generalize to other types of games.
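
As a rough illustration of why a harsher time discount might favour sharing, here is a toy calculation in Python. It is a sketch under strong assumptions that are mine, not results: a win after `t` steps is worth `gamma ** t`, a stronger player who contests every round always wins but is delayed by the weaker player, and two players who share alternate fast wins. The function names and the numbers are hypothetical.

```python
def discounted_win_reward(steps_to_win: int, gamma: float,
                          base_reward: float = 1.0) -> float:
    """Reward for a win that takes `steps_to_win` steps, discounted per step."""
    return base_reward * gamma ** steps_to_win


def prefers_sharing(gamma: float, fast_steps: int, delayed_steps: int) -> bool:
    """Compare the stronger player's average per-round reward under two policies.

    Contesting: win every round, but only after `delayed_steps`.
    Sharing: win every other round, after only `fast_steps`.
    """
    contest = discounted_win_reward(delayed_steps, gamma)
    share = 0.5 * discounted_win_reward(fast_steps, gamma)
    return share > contest


# Hypothetical numbers: an uncontested win takes 5 steps, a contested one 15.
for gamma in (0.99, 0.9):
    print(gamma, prefers_sharing(gamma, fast_steps=5, delayed_steps=15))
```

In this toy model sharing wins whenever `gamma ** (delayed_steps - fast_steps) < 0.5`, so a heavier discount or a longer delay imposed by the weaker player both push towards the collaborative outcome. Whether learned agents actually behave this way is exactly what would need to be tested.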

The kind of thing I have in mind above could also be explored in much simpler toy worlds than those offered by XLand. I have been thinking of a game where we drop two players on a barren planet, where one has the reward function to maximize paperclips, and one to maximize staples. If the number of paperclips and staples is discounted, e.g. the time-based reward functions are [formula missing] and [formula missing], this might produce more collaborative/sharing behavior, and suppress a risky fight to capture total dominance over resources.

Potentially, some branch of game theory has already produced a whole body of knowledge that examines this type of approach to turning competitive games into collaborative games, and has come up with useful general results and design principles. I do not know. I sometimes wonder about embarking on a broad game theory literature search to find out. The methodological danger of using XLand to examine these game theoretical questions is that by spending months working in the lab, you will save hours in the library.

These general methodological issues have been on my mind recently. I have been wondering if AI alignment/safety researchers should spend less time with ML researchers and their worldview, and more time with game theory people.

I would be interested in your thoughts on these methodological issues, specifically your thoughts about how you will handle them in this particular subproject. One option I did not discuss above is transfer learning which primes the agents on collaborative games only, to then explore their behavior on competitive games.

comment by Ben Cottier (ben-cottier) · 2021-10-20T09:07:49.174Z

I'm excited about this project. I've been thinking along similar lines about inducing a model to learn deception, in the context of inner alignment. It seems really valuable to have concrete (but benign) examples of a problem to poke at and test potential solutions on. So far there seem to be fewer concrete examples of deception, betrayal and the like to work with in ML compared to, say, distributional shift or negative side effects.

comment by Ben Cottier (ben-cottier) · 2021-10-20T09:07:13.817Z

> Previous high level projects have tried to define concepts like "trustworthiness" (or the closely related "truthful") and motivated the AI to follow them. Here we will try the opposite: define "betrayal", and motivate the AIs to avoid it.

Why do you think the betrayal approach is more tractable or useful? It's not clear from the post.