Lagrangian duality for constraints on expectations

post by jessicata (jessica.liu.taylor) · 2016-05-04T04:37:28.000Z


Summary: It's possible to set up a zero-sum game between two agents so that, in any Nash equilibrium, one agent picks a policy to optimize a particular objective subject to some constraints on expected features of the state resulting from the policy. This seems potentially useful for getting an approximate agent to maximize some objective subject to constraints.


Consider the following optimization problem:

Maximize $f(\pi)$ subject to $\mathbb{E}[g(X) \mid \pi] \geq c$.

where $\pi \in \Pi$ is a policy, $f : \Pi \to \mathbb{R}$ maps policies to reals, $X$ is a random variable representing the state of the world (whose distribution depends on $\pi$), $g(X) \in \mathbb{R}^n$ is a feature vector, and $c \in \mathbb{R}^n$ is a vector. This objective instructs the agent to maximize some objective $f(\pi)$ while ensuring that the random variables $g_1(X), \dots, g_n(X)$ have expectations at least $c_1, \dots, c_n$ respectively.

How might we design an agent that solves the constrained optimization problem, if all we have available are agents that optimize unconstrained objectives? Consider a game between two agents, an actor and a critic:

  1. The critic picks Lagrange multipliers $\lambda \in \mathbb{R}^n_{\geq 0}$.
  2. The actor picks a policy $\pi$, which results in world state $X$.
  3. The actor gets utility $f(\pi) + \lambda \cdot (g(X) - c)$. The critic gets the negation of this utility.

Let's assume the critic and actor pick their $\lambda$ and $\pi$ simultaneously, and attempt to find the Nash equilibrium. Since the critic has an infinite action space, there is not guaranteed to be a Nash equilibrium.

To make the problem easier, let's assume that the set of allowed policies $\Pi$ is convex, and $f$ is concave (that is, for any two policies $\pi_1$ and $\pi_2$, and any $p \in [0, 1]$, if $\pi_3$ is the policy formed by first picking $\pi_1$ with probability $p$ or $\pi_2$ with probability $1 - p$ and then running that policy, then $f(\pi_3) \geq p f(\pi_1) + (1 - p) f(\pi_2)$).

We will prove:

Theorem: in any Nash equilibrium, the actor's policy solves the original optimization problem.

This is due to Lagrangian duality; the critic is picking Lagrange multipliers. It's possible that a setup like this results in reasonable behavior if we use approximate agents instead of optimal agents (and perhaps place some high absolute limit on $\lambda$), which would be the main practical application of this idea, though I'm not sure if it does.
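
As a concrete illustration of the approximate-agent version, here is a minimal NumPy sketch (not from the original post; the toy problem, learning rate, and the linear parameterization of $f$ and $\mathbb{E}[g(X) \mid \pi]$ are all made up) of solving the game by simultaneous gradient updates. The actor does projected gradient ascent on the Lagrangian over the probability simplex, and the critic does projected gradient descent on $\lambda$, clipped to be nonnegative and below a large cap.

```python
import numpy as np

# Toy instance (all numbers are made up for illustration):
# the policy pi is a distribution over K actions, f(pi) = r @ pi is linear
# (hence concave), and expected features are linear in pi: E[g(X) | pi] = G @ pi.
rng = np.random.default_rng(0)
K, n = 5, 2
r = rng.normal(size=K)              # f(pi) = r @ pi
G = rng.normal(size=(n, K))         # rows of G are the feature functions g_i
c = G.mean(axis=1) - 0.1            # thresholds chosen so the uniform policy is strictly feasible
LAMBDA_CAP = 100.0                  # "high absolute limit" on the multipliers

def project_simplex(v):
    """Euclidean projection onto the probability simplex, so pi stays a valid policy."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

pi = np.full(K, 1.0 / K)            # actor's policy
lam = np.zeros(n)                   # critic's Lagrange multipliers
eta, T = 0.01, 50000
pi_avg, lam_avg = np.zeros(K), np.zeros(n)

for t in range(T):
    # Lagrangian: L(pi, lam) = f(pi) + lam . (E[g(X) | pi] - c)
    grad_pi = r + G.T @ lam         # actor: gradient ascent on L
    grad_lam = G @ pi - c           # critic: gradient descent on L
    pi = project_simplex(pi + eta * grad_pi)
    lam = np.clip(lam - eta * grad_lam, 0.0, LAMBDA_CAP)
    pi_avg += pi / T                # averaged iterates approximate the equilibrium
    lam_avg += lam / T

print("policy:", pi_avg)
print("constraint slack E[g | pi] - c:", G @ pi_avg - c)  # approximately >= 0
print("multipliers:", lam_avg)                            # approximately 0 where the slack is positive
```

Since the game is bilinear in $(\pi, \lambda)$, the individual iterates can cycle around the equilibrium, which is why the averaged iterates are reported.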

Proof of theorem

The actor's expected utility is $f(\pi) + \lambda \cdot (\mathbb{E}[g(X) \mid \pi] - c)$.

With the assumptions above, we need only consider pure Nash equilibria. The critic's utility is linear in $\lambda$ and the actor's utility is concave in $\pi$; thus a pure strategy $\lambda$ weakly dominates any mixed strategy with mean $\lambda$, and a policy $\pi$ weakly dominates any distribution over policies whose mean is $\pi$.

First consider the game from the critic's perspective. Consider a few different cases:

  1. Suppose $\mathbb{E}[g_i(X) \mid \pi] < c_i$ for some $i$. Then the critic gets more expected utility the higher $\lambda_i$ is; thus there is no best-response value of $\lambda_i$. Therefore, no Nash equilibria have this property.
  2. Suppose $\mathbb{E}[g_i(X) \mid \pi] = c_i$ for some $i$. Then the critic is indifferent about $\lambda_i$ and may set it to anything.
  3. Suppose $\mathbb{E}[g_i(X) \mid \pi] > c_i$ for some $i$. Then the critic must set $\lambda_i = 0$ to maximize expected utility.

So in any Nash equilibrium we must have:

Property 1: $\mathbb{E}[g(X) \mid \pi] \geq c$

Property 2: $\lambda \cdot (\mathbb{E}[g(X) \mid \pi] - c) = 0$; equivalently, for each $i$, either $\lambda_i = 0$ or $\mathbb{E}[g_i(X) \mid \pi] = c_i$

These properties correspond to the primal feasibility and complementary slackness KKT conditions.

Now let's consider the game from the actor's perspective. The actor will pick $\pi$ to maximize $f(\pi) + \lambda \cdot (\mathbb{E}[g(X) \mid \pi] - c)$.

Let $\pi^*$ solve the original optimization problem: it maximizes $f(\pi^*)$ subject to $\mathbb{E}[g(X) \mid \pi^*] \geq c$. Now observe that, in any Nash equilibrium:

  1. $f(\pi) \geq f(\pi^*)$. This is because $\pi^*$ satisfies the constraints of the original optimization problem (so $\lambda \cdot (\mathbb{E}[g(X) \mid \pi^*] - c) \geq 0$, since $\lambda \geq 0$) and $\pi$ maximizes $f(\pi) + \lambda \cdot (\mathbb{E}[g(X) \mid \pi] - c)$, which by Property 2 equals $f(\pi)$; hence $f(\pi) \geq f(\pi^*) + \lambda \cdot (\mathbb{E}[g(X) \mid \pi^*] - c) \geq f(\pi^*)$.
  2. $f(\pi) \leq f(\pi^*)$. This is because we have $f(\pi^*) \geq f(\pi')$ for all $\pi'$ satisfying the constraints; in particular, $\pi$ satisfies them by Property 1.

So in any Nash equilibrium, we have

Property 3: $f(\pi) = f(\pi^*)$

Properties 1 and 3 together imply the theorem:

Theorem: in any Nash equilibrium, the actor's policy solves the original optimization problem.

When does a Nash equilibrium exist?

We've shown a property that all Nash equilibria of this game satisfy, but is a Nash equilibrium guaranteed to exist? Whenever some policy $\pi$ satisfies the constraints of the original optimization problem, I think one does, but I haven't been able to prove this.

There's no Nash equilibrium when no policy $\pi$ satisfies the constraints; in that case the critic would always want to set the multiplier $\lambda_i$ corresponding to a violated constraint to an extremely high value.

Example: Secondary objectives

Consider the following optimization problem:

Maximize $f(\pi)$ subject to $\mathbb{E}[g_1(X) \mid \pi] \geq c_1$.

In effect, an agent solving this optimization problem will ensure that it solves the primary objective of getting a high-enough expected value of $g_1(X)$, and will otherwise attempt to optimize the secondary objective $f$.

In the game corresponding to this optimization problem, the critic picks $\lambda_1$ to be the exchange rate between the primary and secondary objectives that causes the actor to maximize the secondary objective subject to getting enough of the primary objective.
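
As a toy illustration of this pattern (a made-up instance, not from the post): take the primary objective to be expected task reward, and take the secondary objective $f(\pi)$ to be the entropy of the policy, so the agent is asked to stay as close to uniformly random as the reward constraint allows. For a fixed $\lambda_1$ the actor's best response has a closed form (a softmax with inverse temperature $\lambda_1$), which makes the exchange-rate reading of $\lambda_1$ literal; the critic then raises $\lambda_1$ while the reward constraint is violated and lowers it (down to zero) otherwise.

```python
import numpy as np

# Hypothetical instance of the secondary-objective pattern: the primary objective is
# expected reward E[r | pi] >= c; the secondary objective f(pi) = H(pi) is the policy's
# entropy, i.e. "stay as random as the reward constraint allows". Numbers are made up.
r = np.array([1.0, 0.8, 0.3, 0.0])   # per-action reward (primary objective)
c = 0.7                              # required expected reward

def actor_best_response(lam):
    """Maximize H(pi) + lam * (E[r | pi] - c) over the simplex: the maximizer is a
    softmax with inverse temperature lam, so lam is literally the exchange rate
    between the secondary objective (entropy) and the primary one (reward)."""
    z = np.exp(lam * r)
    return z / z.sum()

# Critic: gradient steps on lam, raising it while the constraint is violated and
# keeping it nonnegative.
lam = 0.0
for _ in range(5000):
    pi = actor_best_response(lam)
    lam = max(0.0, lam - 0.1 * (r @ pi - c))

pi = actor_best_response(lam)
print("exchange rate lambda:", lam)
print("policy:", pi)
print("expected reward:", r @ pi)    # approximately c: the constraint is active here
```

In a richer setting the actor's best response would not have a closed form, and an approximate maximizer would play that role instead, as discussed above.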

It's possible that something like this could be used for mild optimization, though I haven't worked out the details. We would want to set $f$ to be something simple that we're pretty comfortable with an AI approximately maximizing under the constraint, and I haven't found anything of this form yet.

Example: Generative adversarial exponential families

(note: this probably isn't actually useful)

Suppose we want to approximately sample from an exponential family $p_\theta(x) \propto \exp(\theta \cdot \phi(x))$ with $\theta$ set to maximize the likelihood of the training data $x_1, \dots, x_m$. Define $\hat\mu := \frac{1}{m} \sum_{j=1}^{m} \phi(x_j)$ (the empirical mean of the features), $g(x) := (\phi(x), -\phi(x))$, and $c := (\hat\mu, -\hat\mu)$, so that the constraint $\mathbb{E}[g(X) \mid q] \geq c$ is equivalent to $\mathbb{E}_{x \sim q}[\phi(x)] = \hat\mu$.

We would like to obtain a distribution $q$ that maximizes the entropy $H(q)$ subject to $\mathbb{E}_{x \sim q}[\phi(x)] = \hat\mu$. By the maximum entropy theorem, the optimal $q$ will be $p_{\theta^*}$, where $\theta^*$ is chosen to maximize the likelihood of the data according to $p_{\theta^*}$.
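
To spell out the maximum entropy theorem being invoked (this derivation isn't in the post; it's the standard argument): introduce multipliers $\theta$ for the moment constraints and $\nu$ for normalization, giving the Lagrangian

$$\mathcal{L}(q, \theta, \nu) = -\int q(x) \log q(x) \, dx + \theta \cdot \Big( \int q(x) \phi(x) \, dx - \hat\mu \Big) + \nu \Big( \int q(x) \, dx - 1 \Big).$$

Setting the functional derivative with respect to $q(x)$ to zero gives $-\log q(x) - 1 + \theta \cdot \phi(x) + \nu = 0$, i.e. $q(x) \propto \exp(\theta \cdot \phi(x)) = p_\theta(x)$. Substituting back, the dual objective in $\theta$ is $\log Z(\theta) - \theta \cdot \hat\mu$, which is exactly the average negative log-likelihood of the training data under $p_\theta$; minimizing it recovers the maximum-likelihood parameter $\theta^*$.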

Consider the following game, which resembles generative adversarial networks:

  1. The critic picks a natural parameter $\theta \in \mathbb{R}^n$.
  2. The actor picks a distribution $q$.
  3. We take a sample $x \sim q$.
  4. The actor gets utility $-\log q(x) + \theta \cdot (\phi(x) - \hat\mu)$; the critic gets the negation of this utility.

Note that $-\log q(x)$ is an unbiased estimate of $H(q)$ when $x \sim q$ (and $\theta \cdot (\phi(x) - \hat\mu)$ is an unbiased estimate of $\theta \cdot (\mathbb{E}_{x \sim q}[\phi(x)] - \hat\mu)$), so we can consider $H(q)$ to be the objective to be optimized subject to the constraints $\mathbb{E}_{x \sim q}[\phi(x)] = \hat\mu$.

At any Nash equilibrium, we will have:

  1. The actor picks $q = p_{\theta^*}$.
  2. The critic picks $\theta = \theta^*$.

The first fact is proven as a special case of the original theorem (plus the maximum entropy theorem); the second fact is not difficult to show given the first fact.
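
The post gives no implementation; as a rough sketch of what this equilibrium looks like, here is a hypothetical NumPy example for a toy case where the exponential family is a 1-D Gaussian with features $\phi(x) = (x, x^2)$. It assumes the actor best-responds to the critic's current $\theta$ by playing $q = p_\theta$ (which is where the theorem says the equilibrium lies), so only the critic's sample-based update of $\theta$ is simulated; the data, step size, and iteration counts are made up.

```python
import numpy as np

# Made-up instance: fit a 1-D Gaussian, written as the exponential family
# p_theta(x) proportional to exp(theta . phi(x)) with phi(x) = (x, x^2), to data.
# The critic updates theta from samples of q = p_theta, descending the dual
# log Z(theta) - theta . mu_hat, i.e. doing gradient ascent on the data log-likelihood.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=0.8, size=10000)   # training data x_1..x_m

def phi(x):
    """Feature vector phi(x) = (x, x^2), applied elementwise to an array of samples."""
    return np.stack([x, x ** 2], axis=-1)

mu_hat = phi(data).mean(axis=0)                     # empirical feature mean

def sample_p_theta(theta, size):
    """Sample from p_theta: for phi(x) = (x, x^2) this is a Gaussian with
    variance -1/(2*theta[1]) and mean -theta[0]/(2*theta[1]) (needs theta[1] < 0)."""
    var = -1.0 / (2.0 * theta[1])
    return rng.normal(loc=theta[0] * var, scale=np.sqrt(var), size=size)

theta = np.array([0.0, -0.5])      # critic's natural parameter
eta = 0.05
for _ in range(3000):
    x = sample_p_theta(theta, size=2000)            # actor's samples x ~ q = p_theta
    grad = phi(x).mean(axis=0) - mu_hat             # Monte Carlo estimate of E_q[phi(X)] - mu_hat
    theta = theta - eta * grad                      # critic's gradient step
    theta[1] = min(theta[1], -1e-3)                 # keep the density normalizable

var = -1.0 / (2.0 * theta[1])
print("fitted mean:", theta[0] * var)   # approximately 0.5
print("fitted std:", np.sqrt(var))      # approximately 0.8
```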

I believe this approach is similar to contrastive divergence. To recover the power of the original generative adversarial networks, perhaps $\phi$ ought to be features output by a neural net. An advantage of this approach over generative adversarial networks is that the actor is penalized for differing from the training distribution in a hard-to-detect way (such as by plagiarizing from training data, or encoding steganographic messages; this is similar to the informed oversight problem), because the exponential family distribution $p_{\theta^*}$ is the uniquely optimal solution (although this doesn't guarantee that the approximate solution will be very close to $p_{\theta^*}$).

Unfortunately, this approach requires estimating $\log q(x)$, so $q$ must be represented as a variational autoencoder or something similar. But if we're using variational autoencoders, then we might as well directly train a generative model. So the approach I'm sketching here is probably not directly useful on its own (though I can see it being inspiration for a more useful approach to training generative models in a safely scalable way).
