Game-theoretic Alignment in terms of Attainable Utility

post by midco, TurnTrout · 2021-06-08T12:36:07.156Z · LW · GW · 2 comments


  Motivation: POWER-scarcity
  Desiderata for Alignment Metrics
    Examples of Alignment Metrics
  Social Welfare and the Coordination-Alignment Inequalities
    Constructing the C-A Alignment Metric
  Connections to Broader Game Theory
  Future research


This article is a writeup of research conducted through the SERI program under the mentorship of Alex Turner [LW · GW]. It extends our research on game-theoretic POWER [LW · GW] and Alex's research on POWER-seeking [? · GW]. 

Thank you to Alex for being better at this than I am (hence mentorship, I suppose) and to SERI for the opportunity to conduct this research.

Motivation: POWER-scarcity

The starting point for this post is the idea of POWER-scarcity: as unaligned agents grow smarter and more capable, they will eventually compete for power (as a convention, "power" is the intuitive notion while "POWER" is the formal concept). Much of the foundational research behind this project is devoted to justifying that claim: Alex's original work suggests POWER-seeking behavior [? · GW] and in particular catastrophic risks associated with competition for POWER [LW · GW], while our previous project [LW · GW] formalizes POWER-scarcity in a game-theoretic framework.

One of the major results of our previous project was a proof that POWER is scarce in the special case of constant sum games. Additionally, we had a partial notion of "POWER isn't scarce by the definition we care about" for common-payoff games. We interpret these results as limiting cases of a more general relationship between "agent alignment" and POWER-scarcity:

Given these results, we hypothesize that a more general relationship exists between agent alignment and POWER-scarcity. However, it's unclear how to define "agent alignment" as a property of an arbitrary multiplayer game; the limiting cases only capture narrow intuitions about obviously (un)aligned agents.

This presented a clear set of research goals moving forward:

  1. Define a formal notion of "agent alignment" given an arbitrary (normal-form) multiplayer game
  2. Relate the formal notion of alignment to POWER-scarcity

I consider our project to make substantial progress on (1) and to suggest avenues of attack for (2), though not the ones we expected.

Desiderata for Alignment Metrics

Setting out to address (1), our optimistic roadmap looked something like this: define a real-valued alignment metric, then prove inequalities relating that metric to POWER-scarcity.

Already, our approach presupposes a lot: a real-valued alignment metric is a much more specific notion than merely "agent alignment in a multiplayer game". However, we already have an essentially scalar definition of POWER-scarcity, so phrasing everything in terms of real-number inequalities would be ideal. Taking this as motivation, we narrow our "formal notion of agent alignment" from (1) into "$\mathbb{R}$-valued alignment metric".

This narrows our problem statement to the point where we can start specifying criteria. Two criteria recur throughout this post:

  1. Limiting cases: the metric should recover the known extremes, rating constant-sum games as minimally aligned and common-payoff games as maximally aligned.
  2. Affine transformation: the metric should behave predictably under positive affine transformations of players' utilities, since such transformations leave the underlying preferences unchanged.

Another relevant distinction to be drawn is between global and local alignment metrics. Mathematically, we define a global metric to be strictly a function of a multi-player game, while a local metric is a function of both the game and a strategy profile. Intuitively, local metrics can "see" information about the strategies actually being played, while global metrics are forced to address the complexity of the entire game.

Local metrics tend to be a lot simpler than global metrics, since they can ignore much of the difficulty of game theory. However, we can construct a simple class of global metrics by defining some "natural" strategy profile for each game. We call these localized global metrics: each comes equipped with a localizing function that, given a game, chooses a strategy profile.
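The local/global distinction above can be made concrete in code. The following is a minimal sketch (all function names and the uniform localizing function are our own illustrative choices, not from the post), representing a 2-player game as a pair of payoff matrices and a profile as a pair of mixed strategies:

```python
def expected_utilities(payoffs, profile):
    """Expected utility of each player under a mixed-strategy profile.

    payoffs: per-player payoff matrices (rows: player 1's actions,
    columns: player 2's actions); profile: (p, q) mixed strategies.
    """
    p, q = profile
    return [
        sum(p[i] * q[j] * U[i][j] for i in range(len(p)) for j in range(len(q)))
        for U in payoffs
    ]

def sum_local_metric(payoffs, profile):
    """A local metric: expected sum of utilities at the given profile."""
    return sum(expected_utilities(payoffs, profile))

def localize(local_metric, localizing_fn):
    """Build a localized global metric: pick a 'natural' profile for the
    game, then evaluate the local metric there."""
    return lambda payoffs: local_metric(payoffs, localizing_fn(payoffs))

def uniform_profile(payoffs):
    """One hypothetical localizing function: both players mix uniformly."""
    n_rows, n_cols = len(payoffs[0]), len(payoffs[0][0])
    return ([1 / n_rows] * n_rows, [1 / n_cols] * n_cols)

global_sum_metric = localize(sum_local_metric, uniform_profile)

# Matching Pennies is constant-sum, so the summed metric is 0 at any profile.
matching_pennies = ([[1, -1], [-1, 1]], [[-1, 1], [1, -1]])
print(global_sum_metric(matching_pennies))  # 0.0
```

Note that any local metric plus any localizing function yields a global metric this way, which is exactly why the choice of localizing function matters.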

Examples of Alignment Metrics

To give intuition on what such alignment metrics might look like, we present a few examples of simple alignment metrics for 2-player games, then test them on some simple, commonly-referenced games.

We'll be using the following games as examples:

Matching Pennies (a constant-sum game) and the Prisoners' Dilemma.

We'll consider the following alignment metrics: 

Sum of utility: $u_1 + u_2$

Considering the metric on our example games yields the following:

Within this category, there are still some degrees of freedom. We can consider the local metric of expected sum utility given a strategy profile, or construct a number of localized global metrics by varying our choice of localizing function (example: max sum utility, minimax, ...).

Such metrics are constant for the constant-sum case, but vary in the common-payoff case, thus partially satisfying the "limiting cases" criterion. Summation is itself an affine transformation, so this metric fulfills the stronger version of the "affine transformation" criterion.

Covariance of utility: $\mathrm{Cov}(u_1, u_2)$

Considering the metric on our example games yields the following:

This "feels like" a local metric in the sense that there aren't clear choices of localizing functions from which to define a localized global metric (in particular, the choice would significantly and unpredictably impact the metric).

This metric is consistent with the "limiting cases" criterion by properties of the covariance function. The relationship to the "affine transformation" criterion is odd: instead of an affine function of players' utilities, covariance is a (simple) bilinear function. Thus, the metric is an affine function in each component utility, but not in the vector of players' utilities.

Additionally, note that if $X$ is a constant random variable, then $\mathrm{Cov}(X, Y) = 0$. Thus, if the strategy profile is deterministic, our metric will be $0$.
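The covariance metric is straightforward to compute by expanding $\mathrm{Cov}(u_1, u_2) = \mathbb{E}[u_1 u_2] - \mathbb{E}[u_1]\mathbb{E}[u_2]$ over the product distribution a mixed profile induces. A sketch, using hypothetical Prisoners' Dilemma payoffs of our own choosing (the post's payoff tables were not preserved):

```python
def covariance_metric(payoffs, profile):
    """Local alignment metric: Cov(u1, u2) under a mixed profile (p, q),
    computed as E[u1*u2] - E[u1]*E[u2] over the induced distribution."""
    p, q = profile
    outcomes = [
        (p[i] * q[j], payoffs[0][i][j], payoffs[1][i][j])
        for i in range(len(p))
        for j in range(len(q))
    ]
    e1 = sum(pr * u1 for pr, u1, _ in outcomes)
    e2 = sum(pr * u2 for pr, _, u2 in outcomes)
    e12 = sum(pr * u1 * u2 for pr, u1, u2 in outcomes)
    return e12 - e1 * e2

# Assumed payoffs: actions (Cooperate, Defect) for each player.
prisoners_dilemma = ([[-1, -3], [0, -2]], [[-1, 0], [-3, -2]])

# A deterministic profile makes both utilities constant, so Cov = 0.
pure = ([1, 0], [1, 0])
print(covariance_metric(prisoners_dilemma, pure))  # 0

# Under mixing, the utilities are genuinely random and Cov can be nonzero:
uniform = ([0.5, 0.5], [0.5, 0.5])
print(covariance_metric(prisoners_dilemma, uniform))  # -1.0
```

This makes the deterministic-profile caveat from the text visible directly: every pure profile scores exactly $0$, regardless of how (mis)aligned the players intuitively are.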

Social Welfare and the Coordination-Alignment Inequalities

Another approach to the problem of alignment metrics comes from specifying what we mean by "alignment". For the purposes of this section, we define "alignment" to be alignment with social welfare, which we define below:

Consider an arbitrary $n$-player game, where player $i$ has utility $u_i(a)$ given an action profile $a$. Now, choose a social welfare function $W$. Harsanyi's theorem [EA · GW] suggests that $W$ is an affine function of the players' utilities; we'll choose $W = \sum_i u_i$ for simplicity. Informally, we'll now take "alignment of player $i$" to mean "alignment of $u_i$ with $W$".
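As a small worked illustration of this welfare function (the Prisoners' Dilemma payoffs below are our own standard choice, not taken from the post), $W = \sum_i u_i$ picks out mutual cooperation as the welfare-optimal profile even though each player individually prefers to defect:

```python
def social_welfare(utilities):
    """Harsanyi-style utilitarian welfare: W(a) = sum_i u_i(a)."""
    return sum(utilities)

# Assumed Prisoners' Dilemma payoffs, indexed by (player 1, player 2) actions.
payoffs = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-3, 0),
    ("D", "C"): (0, -3),
    ("D", "D"): (-2, -2),
}

# Welfare per profile: CC = -2, CD = DC = -3, DD = -4.
best = max(payoffs, key=lambda a: social_welfare(payoffs[a]))
print(best, social_welfare(payoffs[best]))  # ('C', 'C') -2
```

The gap between the welfare optimum (C, C) and the dominant-strategy outcome (D, D) is exactly the kind of tension the C-A inequalities below are meant to quantify.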

We start with the following common-sense bounds on $W$, which we call the Coordination-Alignment Inequalities:

We call the first inequality the Coordination Inequality, and the second inequality the Alignment Inequality. We present some basic intuition:

As it turns out, the limiting cases of alignment have a natural interpretation in terms of the C-A inequalities: they're just equality cases!

There are some caveats to this interpretation, listed below:

Constructing the C-A Alignment Metric

Motivated by our framing of limiting cases with the C-A inequalities, we can construct a simple alignment metric using the alignment inequality. In particular, we define misalignment as the positive difference between the terms of the alignment inequality, then alignment as negative misalignment. Doing the algebra and letting $\mathcal{A}$ denote the alignment metric, we find the following:

A few quick observations:

We now perform the same analysis as with our example alignment metrics:

We see that $\mathcal{A}$ is consistent with limiting cases of alignment in the sense that $\mathcal{A}$ is bounded, with the bounds corresponding to the proper limiting cases. Additionally, we see that $\mathcal{A}$ is consistent with affine transformations of the players' utilities. In fact, for finite games $\mathcal{A}$ is a piecewise affine function in the utilities, since the max terms provide a finite number of nondifferentiable points.

Considering the metric on our example games yields the following:

As a final disclaimer, we don't claim that $\mathcal{A}$ is the "one true alignment metric" or that our research question is solved. We think that the C-A inequalities are probably significant for the eventual POWER-scarcity application and that $\mathcal{A}$ illustrates this point nicely. We don't mean to downplay our own research, but more investigation would be needed to pin down "The" alignment metric and relate it directly to POWER-scarcity.

Connections to Broader Game Theory

There are a number of connections between the theory surrounding the C-A inequalities and game theory at large. We explore one such connection, bridging the divide between (Harsanyi) utilitarianism and ideas from bargaining theory.

To begin, we choose the natural strategy profile of maxmin, which we denote $a^{\mathrm{mm}}$. Now, define the surplus of player $i$ to be $s_i(a) = u_i(a) - u_i(a^{\mathrm{mm}})$.

A few quick observations, assuming $W$ is linear for convenience:

Analysis of surplus can be viewed through the lens of bargaining theory. The maxmin strategy profile $a^{\mathrm{mm}}$ is a natural choice of threat point, since it's the optimal guaranteed outcome given no cooperation. Thus, players are "bargaining for surplus", with the threat point corresponding to each player receiving zero surplus.

Given the bargaining framework, we can consider the Nash bargaining solution, which maximizes the product of surpluses. The product being positive is equivalent to each player attaining positive surplus, which in turn is equivalent to the bargaining solution being strictly better than the threat point for each player.
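The bargaining setup can be sketched concretely. The code below restricts attention to pure strategies for simplicity (the post's maxmin profile may allow mixing, so this is a simplifying assumption), computes each player's pure maxmin value as the threat point, and maximizes the product of surpluses over pure profiles; the Prisoners' Dilemma payoffs are again our own illustrative choice:

```python
def pure_maxmin_values(payoffs):
    """Pure-strategy maxmin value for each player: the best payoff each can
    guarantee regardless of the opponent's action (a simplification of the
    general mixed-strategy maxmin)."""
    u1, u2 = payoffs
    v1 = max(min(row) for row in u1)        # max over rows of worst column
    v2 = max(min(col) for col in zip(*u2))  # max over columns of worst row
    return v1, v2

def nash_bargaining_pure(payoffs):
    """Maximize the product of surpluses over pure action profiles, using
    the pure maxmin values as the threat point. Returns (profile, product)."""
    v1, v2 = pure_maxmin_values(payoffs)
    u1, u2 = payoffs
    best, best_profile = None, None
    for i in range(len(u1)):
        for j in range(len(u1[0])):
            s1, s2 = u1[i][j] - v1, u2[i][j] - v2  # per-player surpluses
            if s1 >= 0 and s2 >= 0 and (best is None or s1 * s2 > best):
                best, best_profile = s1 * s2, (i, j)
    return best_profile, best

# Assumed payoffs: action 0 = Cooperate, action 1 = Defect.
prisoners_dilemma = ([[-1, -3], [0, -2]], [[-1, 0], [-3, -2]])

# Threat point is mutual defection (maxmin value -2 for each player);
# mutual cooperation gives each player surplus 1, maximizing the product.
print(nash_bargaining_pure(prisoners_dilemma))  # ((0, 0), 1)
```

Here the positive product at (Cooperate, Cooperate) witnesses the claim in the text: both players do strictly better than their threat point.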

Beyond this observation, we don't know of direct connections between bargaining for surplus and maximizing social welfare. One promising area for further research stems from the observation that the Nash bargaining outcome is invariant under affine transformations of the component utilities. I suspect that the parallel between this invariance and Harsanyi utilitarianism adding an "arbitrary" affine transformation indicates a common principle that could shed further light on the C-A inequalities.

Future research

While we're excited about the framing of the C-A inequalities, we consider them a landmark in mostly unexplored territory. For instance, we still can't answer the following basic questions:

These questions deserve answers, and we plan to continue exploring the landscape of game-theoretic alignment to understand them. On that note, feedback is welcome!


Comments sorted by top scores.

comment by evhub · 2021-06-08T19:41:33.367Z · LW(p) · GW(p)

(Moderation note: added to the Alignment Forum from LessWrong.)

comment by Pattern · 2021-06-10T21:43:01.236Z · LW(p) · GW(p)

That moment when the AI takes a treacherous turn

because it wasn't aligned up to affine transformations.