Utility versus Reward function: partial equivalence

stuart_armstrong

post by Stuart_Armstrong · 2018-04-13T14:58:15.839Z · LW · GW · 5 comments

  Formalism
  The value functions 
  Equivalence for finite horizons
  (In)Equivalence for infinite horizons
  A utility counterexample
  Does it make a difference in practice?
  Appendix: Proofs
    Theorem 1:
    Theorem 2:
    Theorem 3:
    Theorem 4:
5 comments

A reward function is defined over past sequences of actions and observations. When the agent chooses an action, and gets an observation, they receive a reward that is a function of that observation and all previous observations and actions.

A utility function is defined over states of the world. You can take actions to increase or decrease the probability of certain states, thus increasing expected utility, but you don't actually "receive" any utility.

Are these different objects, or are they actually the same thing? This would be good to know, as most of the background knowledge of MIRI and similar AI safety groups is for utility functions, while reward functions are prevalent in reinforcement learning.

The summary of this post is:

For finite horizons, reward and utility functions are equivalent.
For infinite horizons, every bounded discounted reward function is equivalent to a bounded utility function. But not all bounded utility functions have a corresponding reward function. Even if they do, the reward function may not be bounded.

Formalism

Let $A$ be the set of actions an agent can take, and $O$ the set of observations. Assume both sets are finite. Let $H$ be the set of histories (sequences of observations and actions) of an agent.

Let $W$ be the (possibly infinite) set of worlds. Note that a world includes the full set of observation history for the agent (since the agent is part of the world). Therefore the worlds are stratified by histories; for any $h \in H$ , there is a subset $W_{h} \subset W$ consisting of all worlds with history $h$ .

Then a reward function $R$ is a function from histories to real numbers, while a utility function $U$ is a function from worlds to real numbers:

R : H \to \mathbb{R}, \quad U : W \to \mathbb{R} .

Reward and utility functions are bounded if their image is in a bounded subset of $\mathbb{R}$ ; without loss of generality, this means there exists an $l > 0$ such that $R (h)$ (or $U (w)$ ) lies in $[- l, l]$ for all $h \in H$ (or $w \in W$ ).

A policy $π$ for an agent is a map from histories to a probability distribution over actions; so $π : H \to Δ A$ , for $Δ A$ the space of probability distributions over actions.

Even for a utility-based agent, these are the only policies available. This is because all the information (apart from the prior) that the agent gets from the outside world is given by its observations.

The value functions

Let $P$ be the probability estimator used by the agent. We'll assume for the moment that it's logically omniscient and Bayesian. This $P$ incorporates a prior, which we get by, for example, applying it unconditionally to worlds or histories: $P (w)$ and $P (h)$ .

Given a history $h \in H$ and a policy $π$ , we can compute the conditional probability of a world or a history; designate these by $P^{π} (h^{'} | h)$ and $P^{π} (w | h)$ .

Given these probabilities, we can then compute expected values. For example, the expected utility given $π$ and $h$ is:

E^{π} (U | h) = \int_{w \in W} U (w) P^{π} (w | h) .

We'll also designate this quantity by the expected value function $V (U, π, h)$ . If $U$ is bounded in $[- l, l]$ , then so is this value function (though the converse need not be true).

For rewards, let $H_{i}$ be the set of histories that have $i$ actions and observations. We'll say that $h \in H_{i}$ has length $i$ . Let $h_{:: i}$ be the first $i$ actions and observations of the history $h$ . If the agent knows it will only make $n \geq 0$ observations and actions, then the future reward has an expected value function. For $π$ and a history $h_{m} \in H_{m}$ of length $m < n$ , this is:

V (R, π, h_{m}) = \sum_{h \in H_{n}} P^{π} (h | h_{m}) \sum_{i = m + 1}^{n} R (h_{:: i}) .

If the agent expects it could make arbitrarily many observations, then given a discount factor $γ$ with $0 < γ < 1$ , there is the expected discounted future reward of:

V (R, π, h_{m}, γ) = \lim_{n \to \infty} \sum_{h \in H_{n}} P^{π} (h | h_{m}) \sum_{i = m + 1}^{n} γ^{i - (m + 1)} R (h_{:: i}) .
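As a rough illustration of these two value functions, here is a minimal Python sketch on a toy problem. Everything in it is an assumption made for the example, not part of the post: the two actions and two observations, the environment `env_prob`, the reward `reward`, and the policies. It uses the recursive form of the double sum (add the next reward, then discount everything after it), which gives the same number by linearity of expectation. The discounted value is truncated at a finite `n` rather than taken to the limit.

```python
ACTIONS = ("a", "b")
OBS = ("x", "y")

def env_prob(history, action, obs):
    """Hypothetical environment: P(obs | history, action)."""
    p_x = 0.8 if action == "a" else 0.3
    return p_x if obs == "x" else 1.0 - p_x

def reward(history):
    """Hypothetical reward R(h): 1 if the latest observation is 'x'."""
    return 1.0 if history[-1][1] == "x" else 0.0

def value(R, policy, h_m, n, gamma=1.0):
    """V(R, pi, h_m) truncated at horizon n.

    gamma=1.0 gives the undiscounted finite-horizon value; gamma<1
    weights the reward at step i by gamma^(i-(m+1)).
    """
    if len(h_m) >= n:
        return 0.0
    total = 0.0
    for a, p_a in policy(h_m).items():
        for o in OBS:
            p = p_a * env_prob(h_m, a, o)
            if p > 0.0:
                h_next = h_m + ((a, o),)
                total += p * (R(h_next) + gamma * value(R, policy, h_next, n, gamma))
    return total

def always_a(history):
    return {"a": 1.0, "b": 0.0}

def uniform(history):
    return {"a": 0.5, "b": 0.5}

v_finite = value(reward, always_a, (), n=3)                 # about 3 * 0.8
v_discounted = value(reward, always_a, (), n=12, gamma=0.5)
print(v_finite, v_discounted)
```

Histories are represented as tuples of (action, observation) pairs, so `h[:i]` plays the role of $h_{:: i}$ .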

Equivalence for finite horizons

Assume the agent knows it will only make $n$ observations and actions. The sets $\{W_{h} | h \in H_{n}\}$ form a partition of the possible worlds in $W$ : every possible world is in exactly one of those $W_{h}$ subsets.

Then assume that $R$ is given, and define, for $w_{h} \in W_{h}$ :

U_{R} (w_{h}) = \sum_{i = 1}^{n} R (h_{:: i}) .

Then:

Theorem 1: On history $h$ , $V (U_{R}, π, h)$ and $V (R, π, h)$ differ by a constant that is a function of $h$ only, not of $π$ . Consequently, an $R$ -maximiser and a $U_{R}$ -maximiser will choose the same policies.

All the proofs are given in the appendix at the end of the post.
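The appendix shows that $V (U_{R}, π, h) = V (R, π, h) + \sum_{i = 1}^{m} R (h_{:: i})$ . The following is a small numerical check of that identity, a sketch under stated assumptions rather than a proof: the horizon `N`, the environment `env_prob`, the reward `R` and the three policies are all invented for the example, and each world is identified with its complete history.

```python
ACTIONS = ("a", "b")
OBS = ("x", "y")
N = 3  # known horizon (assumed for the example)

def env_prob(action, obs):
    """Hypothetical environment: P(obs | action), independent of the past."""
    p_x = 0.8 if action == "a" else 0.3
    return p_x if obs == "x" else 1.0 - p_x

def R(h):
    """Hypothetical reward: 1 if the latest observation is 'x'."""
    return 1.0 if h[-1][1] == "x" else 0.0

def U_R(h_n):
    """U_R on any world with complete history h_n: the total reward."""
    return sum(R(h_n[:i]) for i in range(1, N + 1))

def expect(f, policy, h):
    """E^pi[f(h') | h] over complete histories h' in H_N extending h."""
    if len(h) == N:
        return f(h)
    return sum(policy(h)[a] * env_prob(a, o) * expect(f, policy, h + ((a, o),))
               for a in ACTIONS for o in OBS)

def V_reward(policy, h):
    """V(R, pi, h): expected reward from step len(h)+1 to N."""
    m = len(h)
    return expect(lambda hn: sum(R(hn[:i]) for i in range(m + 1, N + 1)), policy, h)

policies = {
    "always_a": lambda h: {"a": 1.0, "b": 0.0},
    "always_b": lambda h: {"a": 0.0, "b": 1.0},
    "uniform": lambda h: {"a": 0.5, "b": 0.5},
}
h = (("b", "x"),)  # a history of length m = 1, with past reward R(h) = 1
gaps = {name: expect(U_R, pi, h) - V_reward(pi, h) for name, pi in policies.items()}
print(gaps)
```

Up to floating-point rounding, every entry of `gaps` equals the past reward on `h`, whichever policy is used, so ranking policies by either value function gives the same order.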

Now, conversely, assume $U$ is given, but $R$ is not. If $h \in H_{n}$ , then $V (U, π, h)$ is independent of $π$ , since the agent will never take any more actions.

Then fix any policy $π^{'}$ and, for a history $h$ of length $m > 1$ , define the reward $R_{U}$ by:

R_{U} (h) = V (U, π^{'}, h) - V (U, π^{'}, h_{:: m - 1})

For $m = 1$ , define $R_{U}$ by:

R_{U} (h) = V (U, π^{'}, h)

Theorem 2: On history $h$ , $V (U, π, h)$ and $V (R_{U}, π, h)$ differ by a constant that is a function of $h$ and $π^{'}$ only, not of $π$ . Consequently, an $R_{U}$ -maximiser and a $U$ -maximiser will choose the same policies.

Both of these constructions are non-unique; for example, $U_{R}$ could vary across the different worlds of $W_{h}$ , as long as its expectation on that set is the same. And $R_{U}$ could have an extra reward added at one step, as long as it is subtracted at a later step, or could use a different $π^{'}$ .
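To see the $R_{U}$ construction at work, here is a hedged Python sketch on a toy problem. The horizon `N`, the environment `env_prob`, the utility `U` (which is not an obvious sum of per-step rewards) and the choice of reference policy $π^{'}$ as "always $b$ " are assumptions for illustration only; worlds are again identified with complete histories. It checks the identity from the proof of Theorem 2, $V (R_{U}, π, h) = V (U, π, h) - V (U, π^{'}, h)$ , for a history of length $m > 0$ .

```python
ACTIONS = ("a", "b")
OBS = ("x", "y")
N = 3  # known horizon (assumed for the example)

def env_prob(action, obs):
    """Hypothetical environment: P(obs | action)."""
    p_x = 0.8 if action == "a" else 0.3
    return p_x if obs == "x" else 1.0 - p_x

def U(h_n):
    """Hypothetical utility: 1 if 'x' was observed at least twice, else 0."""
    return 1.0 if sum(o == "x" for _, o in h_n) >= 2 else 0.0

def V(policy, h):
    """V(U, pi, h) = E^pi[U | h]."""
    if len(h) == N:
        return U(h)
    return sum(policy(h)[a] * env_prob(a, o) * V(policy, h + ((a, o),))
               for a in ACTIONS for o in OBS)

def ref_policy(h):
    """The fixed reference policy pi' (here: always choose b)."""
    return {"a": 0.0, "b": 1.0}

def R_U(h):
    """Reward built from U: the change in V(U, pi', .) caused by the last step."""
    if len(h) == 1:
        return V(ref_policy, h)
    return V(ref_policy, h) - V(ref_policy, h[:-1])

def V_reward(policy, h):
    """V(R_U, pi, h): expected sum of R_U over steps len(h)+1 .. N."""
    if len(h) == N:
        return 0.0
    total = 0.0
    for a in ACTIONS:
        for o in OBS:
            h_next = h + ((a, o),)
            p = policy(h)[a] * env_prob(a, o)
            total += p * (R_U(h_next) + V_reward(policy, h_next))
    return total

policies = [lambda h: {"a": 1.0, "b": 0.0}, lambda h: {"a": 0.5, "b": 0.5}, ref_policy]
h = (("a", "y"),)
for pi in policies:
    print(V_reward(pi, h), V(pi, h) - V(ref_policy, h))
```

The two printed columns agree up to rounding, and the subtracted term $V (U, π^{'}, h)$ does not depend on the policy being evaluated.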

(In)Equivalence for infinite horizons

If the agent expects it could make arbitrarily many observations, then we can still define a good $U_{R}$ from a given $R$ . Let $H_{\infty}$ be the possible infinite histories; then the sets $\{W_{h} | h \in H_{\infty}\}$ form a partition of $W$ . Then for any fixed $γ$ define:

U_{R} (w_{h}) = \lim_{j \to \infty} \sum_{i = 1}^{j} γ^{i - 1} R (h_{:: i}),

and we will get:

Theorem 3: If $R$ is bounded, then $U_{R}$ is well-defined and bounded, and on history $h$ , $V (U_{R}, π, h)$ and $V (R, π, h, γ)$ are related by a positive affine transformation that is a function of $h$ and $γ$ only, not of $π$ . Consequently, a $γ$ -discounted $R$ -maximiser and a $U_{R}$ -maximiser will choose the same policies.
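The boundedness claim rests on a geometric-series argument: if $| R | \leq l$ , the partial sums defining $U_{R}$ stay within $l / (1 - γ)$ , and the terms after step $j$ can change the sum by at most $l γ^{j} / (1 - γ)$ . A small Python sketch of this, where the reward sequence, `l` and `gamma` are made-up values for illustration:

```python
def partial_utility(rewards, gamma, j):
    """Partial sum  sum_{i=1}^{j} gamma^(i-1) * R(h_{::i})  along one history."""
    return sum(gamma ** (i - 1) * rewards[i - 1] for i in range(1, j + 1))

l = 1.0       # bound on |R| (assumed)
gamma = 0.9   # discount factor (assumed)

# Hypothetical rewards R(h_{::1}), R(h_{::2}), ... along one long history,
# all inside [-l, l].
rewards = [1.0 if i % 3 else -1.0 for i in range(1, 501)]

bound = l / (1 - gamma)                      # |U_R| can never exceed this
u_long = partial_utility(rewards, gamma, 500)

for j in (10, 50, 100):
    tail = l * gamma ** j / (1 - gamma)      # most the terms after step j can add
    print(j, partial_utility(rewards, gamma, j), tail)
```

Because the tail bound shrinks geometrically in `j`, the limit exists for every infinite history, which is what makes $U_{R}$ well-defined.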

The converse is more tricky. Fix any policy $π^{'}$ as before, and, as before, for a history $h$ of length $m > 1$ , define the reward $R_{U}$ by:

R_{U} (h) = γ^{- m} (V (U, π^{'}, h) - V (U, π^{'}, h_{:: m - 1})) .

For $m = 1$ , define $R_{U}$ by:

R_{U} (h) = V (U, π^{'}, h) .

To make this work, we need to put some conditions on $U$ . The condition that we need is that, eventually, the future actions (hence the future policy) don't matter much. We say that $U$ asymptotically ignores the future if there exists a function $f$ such that, for any $h_{m} \in H_{m}$ and any policies $π$ and $π^{'}$ ,

| V (U, π, h_{m}) - V (U, π^{'}, h_{m}) | < f (m)

and

\lim_{m \to \infty} f (m) = 0.

Then, as before, we can show that:

Theorem 4: If $U$ asymptotically ignores the future, then on history $h$ , $V (U, π, h)$ and $V (R_{U}, π, h)$ are related by a positive affine transformation that is a function of $h$ , $π^{'}$ , and $γ$ only, not of $π$ . Consequently, a $γ$ -discounted $R_{U}$ -maximiser and a $U$ -maximiser will choose the same policies. Even if $U$ is bounded, $R_{U}$ need not be.

A utility counterexample

So, what kind of utility function cannot be made into a reward function in the above way?

Well, assume there are two actions, $a$ and $b$ , and that $U$ is $1$ if the agent only chooses $a$ , and $0$ if it ever chooses $b$ . Let $π^{'}$ be the policy that always chooses $b$ (all that's needed, in fact, is that it eventually chooses $b$ with probability $1$ ).

Then $V (U, π^{'}, h)$ is always zero, as is $V (U, π^{'}, h_{:: m - 1})$ . Thus $R_{U} (h)$ is also always zero. And this despite the fact that there exists a policy that gets utility $1$ : namely the "always choose $a$ " policy.
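The same example also shows directly that this $U$ does not asymptotically ignore the future: after any number of $a$ choices, "always $a$ " and "always $b$ " still differ in value by $1$ , so no $f (m)$ tending to zero can bound the gap. A minimal Python sketch of both facts follows. The string policy labels and the tuple-of-actions history are illustrative assumptions, and observations are left out because this $U$ depends only on actions.

```python
def V(policy, actions_so_far):
    """V(U, pi, h) for U = 1 iff the agent chooses 'a' forever.

    `policy` is either "always_a" or "always_b" (hypothetical labels).
    """
    if "b" in actions_so_far:
        return 0.0                      # the utility is already lost
    return 1.0 if policy == "always_a" else 0.0

def R_U(actions_so_far, ref_policy="always_b"):
    """Reward built from the reference policy pi'.

    The gamma^(-m) factor is omitted: it multiplies a difference that
    is zero here, so it cannot change the result.
    """
    return V(ref_policy, actions_so_far) - V(ref_policy, actions_so_far[:-1])

for m in (1, 10, 100):
    h = ("a",) * m                      # the agent has chosen 'a' m times
    gap = V("always_a", h) - V("always_b", h)
    print(m, R_U(h), gap)
```

Every reward is zero, so a reward maximiser is indifferent between all policies, while the value gap between the two policies stays at $1$ for every history length.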

Does it make a difference in practice?

In order to reach the equivalences above, the value functions $V$ need to be exactly calculated, meaning that the probability estimator $P$ needs to be perfect. Thus the equivalence is only established for logically omniscient agents.

In practice, utility functions are most useful when we know the ideal outcome states, and now the challenge is to design an agent that gets to them. Reward functions are most useful when we know the best local moves for the agent to make, but not necessarily the best outcomes.

Appendix: Proofs

This gives the proofs of the four theorems above. They proceed by expressing the relationship between the two relevant value functions.

Theorem 1:

If $h^{'} \in H_{n}$ , then $U_{R}$ is constant on $W_{h^{'}}$ , so we can talk about $U_{R} (W_{h^{'}})$ (which is $\sum_{i = 1}^{n} R (h_{:: i}^{'})$ ). Then if $h$ is of length $m$ :

\begin{aligned}
V (U_{R}, π, h) &= \int_{w \in W} U_{R} (w) P^{π} (w | h) \\
&= \sum_{h^{'} \in H_{n}} \int_{w \in W_{h^{'}}} U_{R} (w) P^{π} (w | h) \\
&= \sum_{h^{'} \in H_{n}} U_{R} (W_{h^{'}}) P^{π} (W_{h^{'}} | h) \\
&= \sum_{h^{'} \in H_{n}} U_{R} (W_{h^{'}}) P^{π} (h^{'} | h) \\
&= \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = 1}^{n} R (h_{:: i}^{'}) \\
&= \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = m + 1}^{n} R (h_{:: i}^{'}) + \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = 1}^{m} R (h_{:: i}^{'}) \\
&= V (R, π, h) + \sum_{i = 1}^{m} R (h_{:: i}),
\end{aligned}

since $P^{π} (h^{'} | h)$ being non-zero means that the initial $m$ actions and observations of $h^{'}$ are the same as those of $h$ , and $P^{π} (W_{h^{'}} | h)$ is the same as $P^{π} (h^{'} | h)$ : the probability that we are in a world with $h^{'}$ is the same as the probability of observing $h^{'}$ .

Because $\sum_{i = 1}^{m} R (h_{:: i})$ is a function purely of the past, this means that a $V (U_{R}, π, h)$ maximiser will behave the same way as a $V (R, π, h)$ maximiser.

Theorem 2:

If $h$ is a history of length $m > 0$ , then:

\begin{aligned}
V (R_{U}, π, h) &= \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = m + 1}^{n} R_{U} (h_{:: i}^{'}) \\
&= \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = m + 1}^{n} \left( V (U, π^{'}, h_{:: i}^{'}) - V (U, π^{'}, h_{:: i - 1}^{'}) \right) \\
&= \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \left( V (U, π^{'}, h_{:: n}^{'}) - V (U, π^{'}, h_{:: m}^{'}) \right) \\
&= \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) V (U, π, h^{'}) - \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) V (U, π^{'}, h_{:: m}^{'}) \\
&= V (U, π, h) - V (U, π^{'}, h),
\end{aligned}

because if $h^{'} \in H_{n}$ and $P^{π} (h^{'} | h) \neq 0$ , then $h_{:: n}^{'} = h^{'}$ , $h_{:: m}^{'} = h$ , and $V (U, π^{'}, h^{'}) = V (U, π, h^{'})$ (the value on a complete history does not depend on the policy).

Theorem 3:

If $h$ is of length $m$ , then

\begin{aligned}
V (U_{R}, π, h) &= \int_{w \in W} U_{R} (w) P^{π} (w | h) \\
&= \int_{h^{'} \in H_{\infty}} \int_{w \in W_{h^{'}}} U_{R} (w) P^{π} (w | h) \\
&= \int_{h^{'} \in H_{\infty}} U_{R} (W_{h^{'}}) P^{π} (W_{h^{'}} | h) \\
&= \int_{h^{'} \in H_{\infty}} U_{R} (W_{h^{'}}) P^{π} (h^{'} | h) \\
&= \int_{h^{'} \in H_{\infty}} P^{π} (h^{'} | h) \lim_{j \to \infty} \sum_{i = 1}^{j} γ^{i - 1} R (h_{:: i}^{'}) \\
&= \int_{h^{'} \in H_{\infty}} P^{π} (h^{'} | h) \lim_{j \to \infty} \sum_{i = m + 1}^{j} γ^{i - 1} R (h_{:: i}^{'}) + \int_{h^{'} \in H_{\infty}} P^{π} (h^{'} | h) \sum_{i = 1}^{m} γ^{i - 1} R (h_{:: i}^{'}) \\
&= \lim_{j \to \infty} \sum_{h^{''} \in H_{j}} P^{π} (h^{''} | h) \sum_{i = m + 1}^{j} γ^{i - 1} R (h_{:: i}^{''}) + \int_{h^{'} \in H_{\infty}} P^{π} (h^{'} | h) \sum_{i = 1}^{m} γ^{i - 1} R (h_{:: i}^{'}) \\
&= γ^{m} V (R, π, h, γ) + \sum_{i = 1}^{m} γ^{i - 1} R (h_{:: i}),
\end{aligned}

similarly to the proof of Theorem 1. Here, we have used the fact that the actions and observations beyond the $j$ -th are irrelevant to $\sum_{i = m + 1}^{j} γ^{i - 1} R (h_{:: i}^{'})$ , in order to amalgamate all $h^{'} \in H_{\infty}$ that have the same $j$ first actions and observations, and then interchange the limit with the finite sum over all the histories in $H_{j}$ .

Theorem 4:

\begin{aligned}
V (R_{U}, π, h, γ) &= \lim_{n \to \infty} \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = m + 1}^{n} γ^{i - (m + 1)} R_{U} (h_{:: i}^{'}) \\
&= \lim_{n \to \infty} \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \sum_{i = m + 1}^{n} γ^{i - (m + 1)} γ^{- i} \left( V (U, π^{'}, h_{:: i}^{'}) - V (U, π^{'}, h_{:: i - 1}^{'}) \right) \\
&= γ^{- (m + 1)} \lim_{n \to \infty} \sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) \left( V (U, π^{'}, h_{:: n}^{'}) - V (U, π^{'}, h_{:: m}^{'}) \right) .
\end{aligned}

Because $U$ asymptotically ignores the future, the term $\sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) V (U, π^{'}, h_{:: n}^{'})$ can be rewritten as $\sum_{h^{'} \in H_{n}} P^{π} (h^{'} | h) V (U, π, h_{:: n}^{'}) = V (U, π, h)$ , a re-writing that introduces an error of norm at most $f (n)$ . Since $f (n)$ tends to zero in the limit,

V (R_{U}, π, h, γ) = γ^{- (m + 1)} (V (U, π, h) - V (U, π^{'}, h)) .

Now we'll show that this value function need not be bounded. Imagine that the agent has two actions, $a$ and $b$ . The utility $U$ is an index-weighted sum over the times at which the agent chooses $a$ . For $h^{'} \in H_{\infty}$ , let $I (a, h^{'}, m)$ be a Boolean that is $1$ if the agent chose $a$ on history $h^{'}$ at time $m$ , and $0$ otherwise. Let $0 < β < 1$ , and define the utility:

U (h^{'}) = \sum_{i = 1}^{\infty} β^{i} I (a, h^{'}, i) .

Now let $π$ be the policy of always choosing $a$ , and $π^{'}$ the policy of always choosing $b$ . Then for any history $h_{m} \in H_{m}$ ,

V (U, π, h_{m}) - V (U, π^{'}, h_{m}) = \sum_{i = m + 1}^{\infty} β^{i} = β^{m + 1} \frac{1}{1 - β} > β^{m + 1} .

This means that if $β > γ$ , the value of $V (R_{U}, π, h, γ)$ , which here equals $(β / γ)^{m + 1} / (1 - β)$ , will increase without limit as $h$ gets longer, even though $U$ itself is bounded (by $1 / (1 - β)$ ). If we replace $β^{i}$ with $1 / i^{2}$ , we keep the fact that $U$ is bounded, and, since $1 / i^{2}$ will eventually be greater than $β^{i}$ for all $0 < β < 1$ , we can see that $V (R_{U}, π, h, γ)$ will increase without limit for any value of $γ$ .
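Substituting the geometric tail into $V (R_{U}, π, h, γ) = γ^{- (m + 1)} (V (U, π, h) - V (U, π^{'}, h))$ gives $(β / γ)^{m + 1} / (1 - β)$ , which grows with $m$ exactly when $β / γ > 1$ . The sketch below just evaluates that closed form, together with a lower bound for the $1 / i^{2}$ weights (using $\sum_{i > m} 1 / i^{2} > 1 / (m + 1)$ ). The particular values of `beta` and `gamma` and the function names are assumptions chosen for illustration.

```python
def value_geometric(m, beta, gamma):
    """gamma^-(m+1) * sum_{i>m} beta^i, i.e. (beta/gamma)^(m+1) / (1 - beta)."""
    return (beta / gamma) ** (m + 1) / (1 - beta)

def value_inverse_square_lower(m, gamma):
    """Lower bound on gamma^-(m+1) * sum_{i>m} 1/i^2, using tail > 1/(m+1)."""
    return gamma ** -(m + 1) / (m + 1)

gamma = 0.8
for m in (0, 10, 50):
    print(
        m,
        value_geometric(m, beta=0.9, gamma=gamma),   # beta > gamma: grows
        value_geometric(m, beta=0.5, gamma=gamma),   # beta < gamma: shrinks
        value_inverse_square_lower(m, gamma),        # eventually grows for any gamma < 1
    )
```

The inverse-square column can dip slightly at first, since the factor $1 / (m + 1)$ falls faster than $γ^{- (m + 1)}$ rises for small $m$ , but the exponential factor dominates as the history gets longer.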

5 comments

comment by William_S · 2018-04-13T16:02:55.058Z · LW(p) · GW(p)

I'm trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (i.e. killing everyone and replacing them with different people who are happier, and we think this is bad).

I think for the equivalence to work in this case, the utility function U also needs to depend on your current state - if it's the same for all states, then the agent would always prefer to move from w1 to w2 and erase its memory of the past when maximizing the utility function, whereas it would act correctly with the reward function.

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-04-16T11:09:02.261Z · LW(p) · GW(p)

>erase its memory

That only works if the agent is motivated by something like "maximise your belief in what the expected value of U is", rather than "maximise the expected value of U". If you've got that problem, then the agent is unsalvageable - it could just edit its memory to make itself believe U is maximised.

Replies from: William_S

↑ comment by William_S · 2018-04-16T15:08:30.412Z · LW(p) · GW(p)

Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.

Without considering the agent's memory part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent's memory of past events, so it seems that leaving the agent's memory over the past out of the utility function requires U(w2a) = U(w2b)

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2018-04-16T15:29:45.503Z · LW(p) · GW(p)

U could depend on the entire history of states (rather than on the agent's memory of that history).

Replies from: William_S

↑ comment by William_S · 2018-04-16T17:03:02.145Z · LW(p) · GW(p)

Ah, misunderstood that, thanks.
