STARC: A General Framework For Quantifying Differences Between Reward Functions
post by Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:52.965Z · LW · GW
In this post, I will provide a summary of the paper STARC: A General Framework For Quantifying Differences Between Reward Functions, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fourth post in the theoretical reward learning sequence, which starts in this post [LW · GW] (though this post is self-contained).
In this paper, we consider the question of how to quantify the distance between reward functions in an informative way. That is, we want to find a function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$, where $\mathcal{R}$ is the space of all reward functions, such that $d(R_1, R_2)$ is a meaningful quantification of how similar $R_1$ and $R_2$ are. This is important for the (theoretical or empirical) study of reward learning algorithms; for example, see this post [LW · GW].
Considerations
Note that this problem is not very straightforward. A simple method for quantifying the distance between two reward functions might be to measure their $L_2$-distance. However, this is unsatisfactory, because two reward functions can have a large $L_2$-distance even if they induce the same ordering of policies, or a small $L_2$-distance even if they induce the opposite ordering of policies. For example, given an arbitrary reward function $R$ and an arbitrary positive constant $c$, we have that $R$ and $c \cdot R$ have the same ordering of policies, even though their $L_2$-distance may be arbitrarily large. Similarly, for any $\epsilon > 0$, we have that $\epsilon \cdot R$ and $-\epsilon \cdot R$ have the opposite ordering of policies (unless all policies have the same value under $R$), even though their $L_2$-distance may be arbitrarily small. Solving this problem in a good way thus requires some care.
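To make the failure of naive distances concrete, here is a small numerical sketch (my own toy illustration, not from the paper); the reward values and scaling constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=10)   # a hypothetical reward function over 10 transitions

# R and 100*R induce the same ordering of policies, but are far apart in L2.
print(np.linalg.norm(R - 100 * R))               # large

# 0.01*R and -0.01*R induce opposite orderings, but are close in L2.
print(np.linalg.norm(0.01 * R - (-0.01 * R)))    # small
```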
(There are two earlier proposals for how to do this, namely EPIC and DARD. In Appendices A and B of the main paper, we outline a number of shortcomings of these earlier methods.)
We should start by asking what it means for a given function $d$ to be “good” at quantifying the differences between reward functions. First and foremost, we probably want $d$ to be a pseudometric, since this comes with several nice mathematical properties. This means that it should satisfy:
- $d(R_1, R_2) \ge 0$, with equality if $R_1 = R_2$.
- $d(R_1, R_2) = d(R_2, R_1)$.
- $d(R_1, R_3) \le d(R_1, R_2) + d(R_2, R_3)$.
The difference between a metric and a pseudometric is that for a metric, it is required that $d(R_1, R_2) = 0$ only if $R_1 = R_2$, whereas for a pseudometric, we can have that $d(R_1, R_2) = 0$ even when $R_1 \neq R_2$. In our case, it seems reasonable to consider pseudometrics rather than metrics, since we may want to consider some distinct reward functions to have distance 0. For example, if two rewards only differ by positive linear scaling, then it seems reasonable to say that they are equivalent (and thus have distance 0).
Requiring that $d$ is a pseudometric is a very weak and general requirement. More specifically to our problem, we ideally want it to be the case that $d(R_1, R_2)$ is small if and only if optimising $R_1$ or $R_2$ would lead to similar outcomes. We can formalise this intuitive statement using regret bounds. Specifically, we say that:
Definition: A pseudometric $d$ on $\mathcal{R}$ is sound if there is a constant $U$ such that for any reward functions $R_1, R_2$ and any policies $\pi_1, \pi_2$, if $J_2(\pi_1) \le J_2(\pi_2)$, then $J_1(\pi_1) - J_1(\pi_2) \le U \cdot (\max_\pi J_1(\pi) - \min_\pi J_1(\pi)) \cdot d(R_1, R_2)$. Here $J_i(\pi)$ denotes the expected discounted return of policy $\pi$ under reward $R_i$.
Let us unpack this definition. $J_1(\pi_1) - J_1(\pi_2)$ is the regret, as measured by $R_1$, of using policy $\pi_2$ instead of $\pi_1$. Division by $\max_\pi J_1(\pi) - \min_\pi J_1(\pi)$ normalises this quantity based on the total range of $J_1$, so that it lies between 0 and 1 (though the term is put on the right-hand side of the inequality, instead of being used as a denominator, in order to avoid division by zero when $\max_\pi J_1(\pi) = \min_\pi J_1(\pi)$). The condition that $J_2(\pi_1) \le J_2(\pi_2)$ says that $R_2$ prefers $\pi_2$ over $\pi_1$. Taken together, this means that a pseudometric $d$ is sound if $d(R_1, R_2)$ gives an upper bound on the maximal regret that could be incurred under $R_1$ if an arbitrary policy $\pi_1$ is optimised into another policy $\pi_2$ according to $R_2$. It is also worth noting that this includes the special case when $\pi_1$ is optimal under $R_1$ and $\pi_2$ is optimal under $R_2$.
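To make these quantities concrete, here is a minimal tabular sketch (my own, not from the paper) of the return $J_i(\pi)$ and the normalised regret term; the array shapes and function names are illustrative assumptions, and the brute-force enumeration is only feasible for tiny MDPs:

```python
import numpy as np
from itertools import product

def policy_return(R, P, pi, gamma, mu):
    """J(pi): expected discounted return of a tabular policy.
    Assumed shapes: R (S, A, S), P (S, A, S), pi (S, A), mu (S,)."""
    S = R.shape[0]
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)   # expected one-step reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state transitions under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return mu @ v

def deterministic_policies(S, A):
    """Enumerate all deterministic policies (only sensible for very small MDPs)."""
    for assignment in product(range(A), repeat=S):
        pi = np.zeros((S, A))
        pi[np.arange(S), list(assignment)] = 1.0
        yield pi

def normalised_regret(R1, P, gamma, mu, pi_1, pi_2):
    """Regret under R1 of using pi_2 instead of pi_1, divided by the range
    max_pi J_1(pi) - min_pi J_1(pi) of attainable returns."""
    S, A = R1.shape[0], R1.shape[1]
    returns = [policy_return(R1, P, pi, gamma, mu) for pi in deterministic_policies(S, A)]
    span = max(returns) - min(returns)
    regret = policy_return(R1, P, pi_1, gamma, mu) - policy_return(R1, P, pi_2, gamma, mu)
    return regret / span if span > 0 else 0.0
```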
In addition to this, we also want pseudometrics that induce a lower bound on worst-case regret. When this is the case, we say that $d$ is complete. It may not be immediately obvious why this property is desirable. To see why this is the case, note that if a pseudometric $d$ on the space of all reward functions does not induce a lower bound on worst-case regret, then there are reward functions that have a low worst-case regret, but a large distance under $d$. This would in turn mean that $d$ is not tight, and that it should be possible to improve upon it. In other words, if we want a small distance under $d$ to be both sufficient and necessary for low worst-case regret, then $d$ must induce both an upper and a lower bound on worst-case regret. Formally, we say that:
Definition: A pseudometric $d$ on $\mathcal{R}$ is complete if there is a constant $L$, such that for any reward functions $R_1, R_2$, there are policies $\pi_1, \pi_2$ such that $J_2(\pi_1) \le J_2(\pi_2)$ and $J_1(\pi_1) - J_1(\pi_2) \ge L \cdot (\max_\pi J_1(\pi) - \min_\pi J_1(\pi)) \cdot d(R_1, R_2)$.[1]
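Continuing the sketch above (still my own illustration), the quantity that soundness and completeness bound can be brute-forced for tiny MDPs: the largest normalised $R_1$-regret over policy pairs that $R_2$ weakly prefers in the other direction. A sound pseudometric upper-bounds this quantity by $U \cdot d(R_1, R_2)$, and a complete one lower-bounds it by $L \cdot d(R_1, R_2)$.

```python
def worst_case_regret(R1, R2, P, gamma, mu):
    """Max over pairs (pi_1, pi_2) with J_2(pi_1) <= J_2(pi_2) of the normalised
    R1-regret of using pi_2 instead of pi_1. Restricting to deterministic
    policies may underestimate the true worst case; this is a rough sketch."""
    S, A = R1.shape[0], R1.shape[1]
    policies = list(deterministic_policies(S, A))
    worst = 0.0
    for pi_1 in policies:
        for pi_2 in policies:
            if policy_return(R2, P, pi_2, gamma, mu) >= policy_return(R2, P, pi_1, gamma, mu):
                worst = max(worst, normalised_regret(R1, P, gamma, mu, pi_1, pi_2))
    return worst
```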
Thus, to be useful for quantifying the differences between reward functions, a function $d$ should ideally be a pseudometric, and be both sound and complete.
STARC Metrics
In the paper, we propose a family of pseudometrics on the space of all reward functions, which we refer to as STAndardised Reward Comparison (STARC) metrics. We will also show that STARC metrics satisfy all considerations we outlined above.
STARC metrics are computed in several steps. First, we need a few new definitions:
Definition: Two reward functions $R_1, R_2$ differ by potential shaping if there is a function $\Phi : S \to \mathbb{R}$ such that
$$R_2(s, a, s') = R_1(s, a, s') + \gamma \cdot \Phi(s') - \Phi(s)$$
for all $s, a, s'$ (where $\gamma$ is the discount factor). They differ by S'-redistribution if
$$\mathbb{E}_{S' \sim \tau(s, a)}[R_2(s, a, S')] = \mathbb{E}_{S' \sim \tau(s, a)}[R_1(s, a, S')]$$
for all $s$ and $a$ (where $\tau$ is the transition function).
To get a better intuition for what potential shaping and S’-redistribution do, see this post [LW · GW]. For now, it is probably sufficient to know that if $R_1$ and $R_2$ differ by (some combination of) potential shaping and S’-redistribution, then they induce the same ordering of policies.
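As a concrete illustration of these two transformations, here is a minimal numpy sketch (my own, assuming tabular rewards stored as arrays indexed by $(s, a, s')$):

```python
import numpy as np

def apply_potential_shaping(R, Phi, gamma):
    """Return R2 with R2(s,a,s') = R(s,a,s') + gamma*Phi(s') - Phi(s).
    Assumed shapes: R (S, A, S), Phi (S,)."""
    return R + gamma * Phi[None, None, :] - Phi[:, None, None]

def apply_s_prime_redistribution(R, P):
    """One particular S'-redistribution: replace R(s,a,s') by its expectation
    over S' ~ P(s,a). This changes R pointwise but leaves E[R(s,a,S')] intact,
    so the original and the result differ by S'-redistribution."""
    expected = np.einsum("sat,sat->sa", P, R)     # E_{S'~P(s,a)}[R(s,a,S')]
    return np.broadcast_to(expected[:, :, None], R.shape).copy()
```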
Definition: A function $c : \mathcal{R} \to \mathcal{R}$ is a canonicalisation function if:
- $c$ is linear,
- $R$ and $c(R)$ differ by potential shaping and S’-redistribution for all $R$, and
- $c(R_1) = c(R_2)$ if and only if $R_1$ and $R_2$ differ by potential shaping and S’-redistribution.
A canonicalisation function essentially standardises reward functions, such that reward functions that differ by potential shaping and S’-redistribution are mapped to a single representative in their respective equivalence class. You can think of this as a kind of normalisation. (The requirement that $c$ is linear makes our later analysis much simpler, and is not too restrictive in practice. However, it could probably be lifted, with some effort.)
Definition: A metric $m : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ is admissible if there is a norm $p$ and two positive constants $l, u$, such that $l \cdot p(x - y) \le m(x, y) \le u \cdot p(x - y)$ for all $x, y$.
Any norm is of course an admissible metric, but there are some other metrics which are also admissible. This weakening is only included to make our definitions as general as possible – in practice, you can mentally replace "admissible metric" with “norm”, and not lose much.
Using this, we can now give a definition of STARC metrics:
Definition: A function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ is a STARC metric if there is a canonicalisation function $c$, a function $n$ that is a norm on $\mathrm{Im}(c)$, and a metric $m$ that is admissible on $\mathrm{Im}(s)$, such that $d(R_1, R_2) = m(s(R_1), s(R_2))$, where $s(R) = c(R)/n(c(R))$ when $n(c(R)) \neq 0$, and $s(R) = c(R)$ otherwise.
In other words, a STARC metric is computed by first applying a canonicalisation function to both of its inputs, then normalising the resulting reward functions (unless the canonicalised reward is zero everywhere, in which case we leave it unchanged), and finally measuring the distance between the resulting reward functions.
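Here is a compact sketch of that pipeline (my own, with illustrative names) for tabular rewards stored as numpy arrays; it uses a single norm both for the normalisation step and, via $m(x, y) = \lVert x - y \rVert$, for the final comparison:

```python
import numpy as np

def starc_distance(R1, R2, canon, norm=np.linalg.norm):
    """Sketch of a STARC metric: canonicalise, normalise, then compare.
    `canon` is any canonicalisation function on tabular rewards of shape (S, A, S)."""
    def standardise(R):
        c = canon(R)
        scale = norm(c)
        return c / scale if scale > 0 else c   # leave the zero reward unchanged
    return norm(standardise(R1) - standardise(R2))
```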
The most complicated part of this definition is the canonicalisation function — for the norm $n$ and metric $m$, we can simply pick any norm (such as the $L_2$-norm). Let me therefore also give two examples of canonicalisation functions:
Proposition: For any policy $\pi$, the function $c : \mathcal{R} \to \mathcal{R}$ given by
$$c(R)(s, a, s') = \mathbb{E}_{S' \sim \tau(s, a)}\left[R(s, a, S') + \gamma \cdot V^\pi(S') - V^\pi(s)\right]$$
is a canonicalisation function. Here $V^\pi$ is computed using the reward function $R$ that is given as input to $c$. Note that we must use the same policy $\pi$ for all $R$. We refer to this canonicalisation function as Value-Adjusted Levelling (VAL).
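Below is a rough tabular sketch of VAL as I understand it (my own code, with assumed array shapes); the value function $V^\pi$ is solved exactly, which is only possible in small environments, whereas in practice it would be estimated:

```python
import numpy as np

def val_canonicalisation(R, P, pi, gamma):
    """Sketch of Value-Adjusted Levelling (VAL) for a tabular MDP.
    Returns c(R)(s,a,s') = E_{S'~P(s,a)}[R(s,a,S') + gamma*V(S') - V(s)],
    where V is the value function of the fixed policy pi under R itself.
    Assumed shapes: R (S, A, S), P (S, A, S), pi (S, A)."""
    S = R.shape[0]
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)   # expected one-step reward of pi
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state transitions under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    adjusted = np.einsum("sat,sat->sa", P, R + gamma * v[None, None, :]) - v[:, None]
    # The result no longer depends on s', since the s'-dependence is averaged out.
    return np.broadcast_to(adjusted[:, :, None], R.shape).copy()
```

Combined with the `starc_distance` sketch above, a (hypothetical) STARC metric would then be `starc_distance(R1, R2, lambda R: val_canonicalisation(R, P, pi, gamma))`.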
Definition: A canonicalisation function $c$ is minimal for a norm $n$ if $n(c(R)) \le n(R')$ for all $R$ and all $R'$ such that $R$ and $R'$ differ by potential shaping and S’-redistribution.
Proposition: For the $L_2$-norm, the minimal canonicalisation function exists and is unique. To get this canonicalisation function, let $R_0$ be the reward that is zero everywhere, and let $Z$ be the set of all reward functions that differ from $R_0$ by potential shaping and S’-redistribution. Let $Z^\perp$ be the orthogonal complement of $Z$ in $\mathcal{R}$. Then the minimal canonicalisation function for the $L_2$-norm is the orthogonal projection of $\mathcal{R}$ onto $Z^\perp$. (Note that $\mathcal{R}$ can be viewed as an $|S \times A \times S|$-dimensional real vector space.)
For proofs, please see the main paper. The minimal canonicalisation function is easy to work with theoretically, but not so easy to compute for empirical experiments. By contrast, VAL can easily be estimated in even large-scale environments. Combined with some norm and some admissible metric (which may also be a norm), these form a STARC metric.
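For small tabular MDPs, though, the minimal canonicalisation function can be computed directly from the construction in the proposition above. The following is a rough sketch (my own, with illustrative names and a brute-force construction of the relevant subspace):

```python
import numpy as np

def minimal_canonicalisation(R, P, gamma):
    """Sketch of the minimal L2 canonicalisation for a tabular MDP.
    Builds a spanning set for the subspace Z of rewards that differ from the
    zero reward by potential shaping and S'-redistribution, then returns the
    orthogonal projection of R onto the complement of Z.
    Assumed shapes: R (S, A, S), P (S, A, S)."""
    S, A, _ = R.shape
    directions = []

    # Potential-shaping directions: D_i(s, a, s') = gamma*[s' == i] - [s == i].
    for i in range(S):
        D = np.zeros((S, A, S))
        D[:, :, i] += gamma
        D[i, :, :] -= 1.0
        directions.append(D.ravel())

    # S'-redistribution directions: rewards E with E_{S'~P(s,a)}[E(s,a,S')] = 0
    # for all (s, a), i.e. the null space of the conditional-expectation map.
    M = np.zeros((S * A, S * A * S))
    for s in range(S):
        for a in range(A):
            row = np.zeros((S, A, S))
            row[s, a, :] = P[s, a, :]
            M[s * A + a] = row.ravel()
    _, sing, Vt = np.linalg.svd(M)
    rank = int((sing > 1e-10).sum())
    directions.extend(Vt[rank:])          # rows spanning the null space of M

    # Project R onto Z (via least squares) and subtract, landing in Z's complement.
    Zmat = np.stack(directions, axis=1)
    coeffs, *_ = np.linalg.lstsq(Zmat, R.ravel(), rcond=None)
    return (R.ravel() - Zmat @ coeffs).reshape(S, A, S)
```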
In Appendix C of the main paper, we provide two geometric intuitions for how STARC metrics work. To get a deeper intuitive understanding for STARC metrics, it may help to read that section.
Theoretical Results
In the paper, we derive several important theoretical properties of STARC metrics. The proofs can be found in the main paper. First and foremost:
Theorem: Any STARC metric is both sound and complete.
This means that if $d$ is a STARC metric, then $d(R_1, R_2)$ is small if and only if the worst-case regret (as measured by $R_1$) of optimising $R_2$ instead of $R_1$ is small. In other words, $d(R_1, R_2)$ is small if and only if optimising $R_1$ or $R_2$ leads to similar outcomes. STARC metrics thus satisfy the main considerations we provided above. We also have that:
Proposition: Any STARC metric $d$ has the property that $d(R_1, R_2) = 0$ if and only if $R_1$ and $R_2$ induce the same ordering of policies.
This means that STARC metrics consider two reward functions to be equivalent, exactly when those reward functions induce exactly the same ordering of policies. This is intuitive and desirable (and, in fact, is a consequence of the previous theorem). We also have the following result:
Proposition: If two pseudometrics $d_1$ and $d_2$ on $\mathcal{R}$ are both sound and complete, then $d_1$ and $d_2$ are bilipschitz equivalent. This means that there are positive constants $l, u$ such that $l \cdot d_1(R_1, R_2) \le d_2(R_1, R_2) \le u \cdot d_1(R_1, R_2)$ for all $R_1, R_2$.
Combined with the above results, this means that STARC metrics are unique (up to bilipschitz equivalence)! In other words, they capture what it means for two reward functions to be “similar” in a fairly unique and canonical way, and it will not be possible to improve upon them without losing some of their desirable properties.
Experimental Results
In the main paper, we also provide a range of empirical results. The main takeaway from these experiments is that STARC metrics do indeed correlate well with worst-case regret in randomly generated MDPs. We also show that STARC metrics can be estimated in large continuous environments, where they cannot be calculated exactly. For the exact data, please see the main paper.
Conclusion
STARC metrics induce both an upper and a lower bound on worst-case regret, which means that a small distance under a STARC metric is both necessary and sufficient for ensuring low regret. In other words, $d(R_1, R_2)$ is small if and only if we are guaranteed to get similar outcomes if we optimise $R_1$ or $R_2$. Moreover, all pseudometrics with these properties are bilipschitz equivalent. This means that STARC metrics exactly capture what it means for two reward functions to be similar (at least for one informative way of formalising “similarity”). They are easy to work with theoretically, and can be estimated in large environments. This makes them a useful tool when evaluating reward learning algorithms.
One of the main motivations for developing these metrics was to extend the results in the paper Misspecification in Inverse Reinforcement Learning (which I also discussed in this post [LW · GW]). In the next post in this sequence, I will show how to use STARC metrics to analyse how sensitive IRL is to misspecification.
If you have any questions, please feel free to ask them in the comments!
[1] The definition of completeness given in the main paper is slightly more complicated, in order to rule out a potential edge case.