STARC: A General Framework For Quantifying Differences Between Reward Functions

post by Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:52.965Z

Contents

  Considerations
  STARC Metrics
  Theoretical Results
  Experimental Results
  Conclusion

In this post, I will provide a summary of the paper STARC: A General Framework For Quantifying Differences Between Reward Functions, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fourth post in the theoretical reward learning sequence, which starts in this post [LW · GW] (though this post is self-contained).

In this paper, we consider the question of how to quantify the distance between reward functions in an informative way. That is, we want to find a function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$, where $\mathcal{R}$ is the space of all reward functions, such that $d(R_1, R_2)$ is a meaningful quantification of how similar $R_1$ and $R_2$ are. This is important for the (theoretical or empirical) study of reward learning algorithms; for example, see this post [LW · GW].

 

Considerations

Note that this problem is not straightforward. A simple method for quantifying the distance between two reward functions might be to measure their $L_2$-distance. However, this is unsatisfactory, because two reward functions can have a large $L_2$-distance even if they induce the same ordering of policies, or a small $L_2$-distance even if they induce the opposite ordering of policies. For example, given an arbitrary reward function $R$ and an arbitrary positive constant $c$, we have that $R$ and $c \cdot R$ have the same ordering of policies, even though their $L_2$-distance may be arbitrarily large. Similarly, for any $\epsilon > 0$, we have that $\epsilon \cdot R$ and $-\epsilon \cdot R$ have the opposite ordering of policies (unless $R$ is constant), even though their $L_2$-distance may be arbitrarily small. Solving this problem in a good way thus requires some care.
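
To make this concrete, here is a minimal numpy sketch (my own illustration, not from the paper) of the two failure modes above. I represent each policy by a hypothetical occupancy vector, so that a policy's return is just the dot product of that vector with the reward; all array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a reward function is a vector of per-transition rewards,
# and each policy is summarised by a (discounted) occupancy vector, so that a
# policy's return is occupancy @ reward. These particular numbers are arbitrary.
n_transitions = 20
R = rng.normal(size=n_transitions)
occupancies = rng.random(size=(5, n_transitions))  # five hypothetical policies


def policy_ordering(reward):
    """Ranking of the five policies by their return under `reward`."""
    return np.argsort(occupancies @ reward)


# Positive rescaling: same ordering of policies, arbitrarily large L2-distance.
print(policy_ordering(R), policy_ordering(1000.0 * R))   # identical orderings
print(np.linalg.norm(R - 1000.0 * R))                    # large distance

# Sign flip at a small scale: opposite ordering, arbitrarily small L2-distance.
eps = 1e-6
print(policy_ordering(eps * R), policy_ordering(-eps * R))  # reversed orderings
print(np.linalg.norm(eps * R - (-eps * R)))                 # tiny distance
```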

(There are two earlier proposals for how to do this, namely EPIC and DARD. In Appendix A and B of the main paper, we outline a number of shortcomings with these earlier methods.)

We should start by asking what it means for a given function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ to be “good” at quantifying the differences between reward functions. First and foremost, we probably want $d$ to be a pseudometric, since this comes with several nice mathematical properties. This means that it should satisfy:

  1. $d(R_1, R_2) \geq 0$, with equality if $R_1 = R_2$.
  2. $d(R_1, R_2) = d(R_2, R_1)$, and $d(R_1, R_3) \leq d(R_1, R_2) + d(R_2, R_3)$.

The difference between a metric and a pseudometric is that for a metric, it is required that $d(R_1, R_2) = 0$ only if $R_1 = R_2$, whereas for a pseudometric, we can have that $d(R_1, R_2) = 0$ even when $R_1 \neq R_2$. In our case, it seems reasonable to consider pseudometrics rather than metrics, since we may want to consider some distinct reward functions to have distance 0. For example, if two rewards only differ by positive linear scaling, then it seems reasonable to say that they are equivalent (and thus have distance 0).

Requiring that $d$ is a pseudometric is a very weak and general requirement. More specifically to our problem, we ideally want it to be the case that $d(R_1, R_2)$ is small if and only if optimising $R_1$ or $R_2$ would lead to similar outcomes. We can formalise this intuitive statement using regret bounds. Specifically, we say that:

Definition: A pseudometric $d$ on $\mathcal{R}$ is sound if there is a constant $U > 0$ such that for any reward functions $R_1, R_2$ and any policies $\pi_1, \pi_2$, if $J_2(\pi_2) \geq J_2(\pi_1)$, then $J_1(\pi_1) - J_1(\pi_2) \leq U \cdot (\max_\pi J_1(\pi) - \min_\pi J_1(\pi)) \cdot d(R_1, R_2)$. (Here $J_i(\pi)$ denotes the expected discounted return of policy $\pi$ under $R_i$.)

Let us unpack this definition. $J_1(\pi_1) - J_1(\pi_2)$ is the regret, as measured by $R_1$, of using policy $\pi_2$ instead of $\pi_1$. Division by $\max_\pi J_1(\pi) - \min_\pi J_1(\pi)$ normalises this quantity based on the total range of $J_1$, so that it lies between 0 and 1 (though the term is put on the right-hand side of the inequality, instead of being used as a denominator, in order to avoid division by zero when $\max_\pi J_1(\pi) = \min_\pi J_1(\pi)$). The condition that $J_2(\pi_2) \geq J_2(\pi_1)$ says that $R_2$ prefers $\pi_2$ over $\pi_1$. Taken together, this means that a pseudometric $d$ is sound if $d(R_1, R_2)$ gives an upper bound on the maximal regret that could be incurred under $R_1$ if an arbitrary policy $\pi_1$ is optimised into another policy $\pi_2$ according to $R_2$. It is also worth noting that this includes the special case where $\pi_1$ is optimal under $R_1$ and $\pi_2$ is optimal under $R_2$.
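
To make the quantities in this definition concrete, here is a short sketch (my own illustration, with made-up array shapes and names) of how one could compute the normalised regret term for a small tabular MDP, using exact policy evaluation and value iteration. Soundness then says that, whenever $R_2$ prefers $\pi_2$ over $\pi_1$, this quantity is at most $U \cdot d(R_1, R_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random MDP (all sizes and names here are illustrative assumptions).
n_states, n_actions, gamma = 4, 3, 0.9
T = rng.random((n_states, n_actions, n_states))        # T[s, a, s']
T /= T.sum(axis=2, keepdims=True)
R1 = rng.normal(size=(n_states, n_actions, n_states))  # R1[s, a, s']
mu0 = np.full(n_states, 1.0 / n_states)                # initial state distribution


def policy_return(R, pi):
    """J(pi): expected discounted return of a stochastic policy pi[s, a]."""
    r_pi = np.einsum("sa,sax,sax->s", pi, T, R)  # expected one-step reward
    P_pi = np.einsum("sa,sax->sx", pi, T)        # state transition matrix under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return mu0 @ V


def extreme_return(R, sign=1.0, iters=1000):
    """max_pi J(pi) for sign=+1, or min_pi J(pi) for sign=-1, by value iteration."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.einsum("sax,sax->sa", T, R) + gamma * (T @ V)
        V = sign * np.max(sign * Q, axis=1)
    return mu0 @ V


# Two arbitrary stochastic policies; pi2 stands for the policy we switched to.
pi1 = rng.random((n_states, n_actions))
pi1 /= pi1.sum(axis=1, keepdims=True)
pi2 = rng.random((n_states, n_actions))
pi2 /= pi2.sum(axis=1, keepdims=True)

# Regret of using pi2 instead of pi1, measured by R1 and normalised by the
# total range of J_1 (assumed nonzero here, which is why we can divide).
J_max, J_min = extreme_return(R1, +1.0), extreme_return(R1, -1.0)
normalised_regret = (policy_return(R1, pi1) - policy_return(R1, pi2)) / (J_max - J_min)
print(normalised_regret)
```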

In addition to this, we also want pseudometrics $d$ that induce a lower bound on worst-case regret. When this is the case, we say that $d$ is complete. It may not be immediately obvious why this property is desirable. To see why, note that if a pseudometric $d$ on the space of all reward functions $\mathcal{R}$ does not induce a lower bound on worst-case regret, then there are reward functions that have a low worst-case regret but a large distance under $d$. This would in turn mean that $d$ is not tight, and that it should be possible to improve upon it. In other words, if we want a small distance under $d$ to be both sufficient and necessary for low worst-case regret, then $d$ must induce both an upper and a lower bound on worst-case regret. Formally, we say that:

Definition: A pseudometric $d$ on $\mathcal{R}$ is complete if there is a constant $L > 0$, such that for any reward functions $R_1, R_2$, there are policies $\pi_1, \pi_2$ such that $J_2(\pi_2) \geq J_2(\pi_1)$ and $J_1(\pi_1) - J_1(\pi_2) \geq L \cdot (\max_\pi J_1(\pi) - \min_\pi J_1(\pi)) \cdot d(R_1, R_2)$.[1]

Thus, to be useful for quantifying the differences between reward functions, a function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ should ideally be a pseudometric, and be both sound and complete.

 

STARC Metrics

In the paper, we propose a family of pseudometrics on the space of all reward functions, which we refer to as STAndardised Reward Comparison (STARC) metrics. We will also show that STARC metrics satisfy all considerations we outlined above.

STARC metrics are computed in several steps. First, we need a few new definitions:

Definition: Two reward functions $R_1, R_2$ differ by potential shaping if there is a function $\Phi : S \to \mathbb{R}$ such that

$$R_2(s, a, s') = R_1(s, a, s') + \gamma \cdot \Phi(s') - \Phi(s)$$

for all $s, a, s'$. They differ by S’-redistribution if

$$\mathbb{E}_{S' \sim \tau(s, a)}\big[R_2(s, a, S')\big] = \mathbb{E}_{S' \sim \tau(s, a)}\big[R_1(s, a, S')\big]$$

for all $s$ and $a$ (where $\tau$ is the transition function).
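
As a quick illustration (my own, with assumed array shapes), here is how one can construct a potential-shaped and an S’-redistributed version of a reward tensor `R1[s, a, s']`, and check the defining condition of S’-redistribution directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: R1[s, a, s'] is a reward tensor and T[s, a, s'] plays
# the role of the transition function tau.
n_states, n_actions, gamma = 4, 3, 0.9
R1 = rng.normal(size=(n_states, n_actions, n_states))
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)

# Potential shaping: R2(s, a, s') = R1(s, a, s') + gamma * Phi(s') - Phi(s).
Phi = rng.normal(size=n_states)
R2_shaped = R1 + gamma * Phi[None, None, :] - Phi[:, None, None]

# S'-redistribution: add noise whose conditional mean under T(s, a) is zero,
# so that E_{s' ~ T(s, a)}[R(s, a, s')] is unchanged for every (s, a).
noise = rng.normal(size=R1.shape)
noise -= np.einsum("sax,sax->sa", T, noise)[:, :, None]
R2_redistributed = R1 + noise

# The defining condition of S'-redistribution holds:
print(np.allclose(np.einsum("sax,sax->sa", T, R1),
                  np.einsum("sax,sax->sa", T, R2_redistributed)))  # True
```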

To get a better intuition for what potential shaping and S’-redistribution do, see this post [LW · GW]. For now, it is probably sufficient to know that if $R_1$ and $R_2$ differ by (some combination of) potential shaping and S’-redistribution, then they induce the same ordering of policies.

Definition: A function $c : \mathcal{R} \to \mathcal{R}$ is a canonicalisation function if:

  1. $c$ is linear,
  2. $R$ and $c(R)$ differ by potential shaping and S’-redistribution for all $R$, and
  3. $c(R_1) = c(R_2)$ if and only if $R_1$ and $R_2$ differ by potential shaping and S’-redistribution.

A canonicalisation function essentially standardises reward functions, such that reward functions that differ by potential shaping and S’-redistribution are mapped to a single representative in their respective equivalence class. You can think of this as a kind of normalisation. (The requirement that $c$ is linear makes our later analysis much simpler, and is not too restrictive in practice. However, it could probably be lifted, with some effort.)

Definition: A metric $m : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ is admissible if there is a norm $p$ and two positive constants $\ell, u$, such that $\ell \cdot p(R_1 - R_2) \leq m(R_1, R_2) \leq u \cdot p(R_1 - R_2)$ for all $R_1, R_2$.

Any norm is of course an admissible metric, but there are some other metrics which are also admissible. This weakening is only included to make our definitions as general as possible – in practice, you can mentally replace "admissible metric" with “norm”, and not lose much.

Using this, we can now give a definition of STARC metrics:

Definition: A function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ is a STARC metric if there is a canonicalisation function $c$, a function $n$ that is a norm on $\mathrm{Im}(c)$, and a metric $m$ that is admissible on $\mathrm{Im}(c)$, such that $d(R_1, R_2) = m(s(R_1), s(R_2))$, where $s(R) = c(R)/n(c(R))$ when $n(c(R)) \neq 0$, and $s(R) = c(R)$ otherwise.

In other words, a STARC metric $d$ is computed by first applying a canonicalisation function $c$ to both of its inputs, then normalising the resulting reward functions (unless the result is the reward that is zero everywhere, in which case we don’t change it), and finally measuring the distance between the resulting reward functions.
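
In code, the recipe looks roughly like this (a sketch under my own naming; `canon` stands for any canonicalisation function, and the norm and metric default to the $L_2$ norm and distance):

```python
import numpy as np


def starc_distance(R1, R2, canon, norm=np.linalg.norm, metric=None):
    """Sketch of the STARC recipe: canonicalise, normalise, then measure distance."""
    if metric is None:
        metric = lambda x, y: np.linalg.norm(x - y)  # default admissible metric

    def standardise(R):
        c = canon(R)
        n = norm(c)
        return c / n if n != 0 else c  # leave the zero reward unchanged

    return metric(standardise(R1), standardise(R2))
```

Any canonicalisation function can be plugged in as `canon`; the two propositions below give concrete options.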

The most complicated part of this definition is the canonicalisation function; for the norm $n$ and metric $m$, we can simply pick any norm (such as the $L_2$-norm, etc). Let me therefore also give two examples of canonicalisation functions:

Proposition: For any policy $\pi$, the function $c : \mathcal{R} \to \mathcal{R}$ given by

$$c(R)(s, a, s') = \mathbb{E}_{S' \sim \tau(s, a)}\big[R(s, a, S') + \gamma \cdot V^\pi(S')\big] - V^\pi(s)$$

is a canonicalisation function. Here $V^\pi$ is computed using the reward function $R$ that is given as input to $c$. Note that we must use the same policy $\pi$ for all $R$. We refer to this canonicalisation function as Value-Adjusted Levelling (VAL).
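
Here is a sketch of VAL for a small tabular MDP (the array shapes and names are my own assumptions): $V^\pi$ is computed exactly by solving the Bellman equations for $\pi$ under the input reward, and the result is then levelled and averaged over successor states as in the formula above.

```python
import numpy as np


def val_canonicalisation(R, T, pi, gamma):
    """Value-Adjusted Levelling (sketch).

    R[s, a, s']: reward tensor, T[s, a, s']: transition probabilities,
    pi[s, a]: the fixed policy, gamma: discount factor. Returns
    c(R)(s, a, s') = E_{S' ~ T(s, a)}[R(s, a, S') + gamma * V_pi(S')] - V_pi(s),
    where V_pi is the value function of pi under R itself.
    """
    n_states = R.shape[0]
    r_pi = np.einsum("sa,sax,sax->s", pi, T, R)  # expected one-step reward under pi
    P_pi = np.einsum("sa,sax->sx", pi, T)        # state transition matrix under pi
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    expected = np.einsum("sax,sax->sa", T, R + gamma * V_pi[None, None, :])
    canon = expected[:, :, None] - V_pi[:, None, None]  # constant in s'
    return np.broadcast_to(canon, R.shape).copy()
```

Plugging this into the `starc_distance` sketch above, with some fixed policy $\pi$ (e.g. the uniformly random policy) via `canon=lambda R: val_canonicalisation(R, T, pi, gamma)`, then gives one concrete STARC metric.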

Definition: A canonicalisation function $c$ is minimal for a norm $n$ if $n(c(R)) \leq n(R')$ for all $R$ and $R'$ such that $R$ and $R'$ differ by potential shaping and S’-redistribution.

Proposition: For the $L_2$-norm, the minimal canonicalisation function exists and is unique. To get this canonicalisation function, let $R_0$ be the reward that is zero everywhere, and let $Z$ be the set of all reward functions that differ from $R_0$ by potential shaping and S’-redistribution. Let $Z^\perp$ be the orthogonal complement of $Z$ in $\mathcal{R}$. Then the minimal canonicalisation function for the $L_2$-norm is the orthogonal projection of $\mathcal{R}$ onto $Z^\perp$. (Note that $\mathcal{R}$ can be viewed as an $|S \times A \times S|$-dimensional real vector space.)

For proofs, please see the main paper. The minimal canonicalisation function is easy to work with theoretically, but not so easy to compute for empirical experiments. By contrast, VAL can easily be estimated even in large-scale environments. Combined with some norm and some admissible metric (which may itself be induced by a norm), each of these canonicalisation functions gives a STARC metric.
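
In small tabular settings, however, the minimal canonicalisation function can be computed directly with linear algebra. Here is a sketch (my own, with assumed array shapes) that builds a spanning set for the subspace $Z$ described above and projects the reward onto its orthogonal complement:

```python
import numpy as np


def minimal_canonicalisation(R, T, gamma):
    """Sketch of the minimal canonicalisation for the L2-norm.

    Views R[s, a, s'] as a vector in R^{|S x A x S|}, builds a spanning set for
    the subspace Z of rewards that differ from the zero reward by potential
    shaping and S'-redistribution, and orthogonally projects R onto Z's
    orthogonal complement.
    """
    S, A, _ = R.shape
    dim = S * A * S

    # Potential-shaping directions: for Phi = e_i, the reward gamma*1[s'=i] - 1[s=i].
    shaping = np.zeros((dim, S))
    for i in range(S):
        D = np.zeros((S, A, S))
        D[:, :, i] += gamma
        D[i, :, :] -= 1.0
        shaping[:, i] = D.ravel()

    # S'-redistribution directions: rewards whose conditional expectation under
    # T(s, a) is zero for every (s, a), i.e. the null space of the linear map
    # R -> (sum_{s'} T[s, a, s'] * R[s, a, s'])_{s, a}.
    M = np.zeros((S * A, dim))
    for s in range(S):
        for a in range(A):
            row = np.zeros((S, A, S))
            row[s, a, :] = T[s, a, :]
            M[s * A + a] = row.ravel()
    _, svals, Vt = np.linalg.svd(M)
    rank = int(np.sum(svals > 1e-10))
    redistribution = Vt[rank:].T  # orthonormal basis of the null space

    # Project R onto the orthogonal complement of Z = span(shaping, redistribution).
    Z = np.hstack([shaping, redistribution])
    coeffs, *_ = np.linalg.lstsq(Z, R.ravel(), rcond=None)
    return (R.ravel() - Z @ coeffs).reshape(R.shape)
```

As the paragraph above notes, this kind of explicit construction only scales to small environments; in larger settings, VAL is the more practical choice.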

In Appendix C of the main paper, we provide two geometric intuitions for how STARC metrics work. To get a deeper intuitive understanding for STARC metrics, it may help to read that section.

 

Theoretical Results

In the paper, we derive several important theoretical properties of STARC metrics. The proofs can be found in the main paper. First and foremost:

Theorem: Any STARC metric is both sound and complete.

This means that if $d$ is a STARC metric, then $d(R_1, R_2)$ is small if and only if the worst-case regret (as measured by $R_1$) of optimising $R_2$ instead of $R_1$ is small, and vice versa. In other words, $d(R_1, R_2)$ is small if and only if optimising $R_1$ or $R_2$ leads to similar outcomes. STARC metrics thus satisfy the main considerations we provided above. We also have that:

Proposition: Any STARC metric $d$ has the property that $d(R_1, R_2) = 0$ if and only if $R_1$ and $R_2$ induce the same ordering of policies.

This means that STARC metrics consider two reward functions to be equivalent exactly when those reward functions induce the same ordering of policies. This is intuitive and desirable (and, in fact, is a consequence of the previous theorem). We also have the following result:

Proposition: If two pseudometrics $d_1, d_2$ on $\mathcal{R}$ are both sound and complete, then $d_1$ and $d_2$ are bilipschitz equivalent. This means that there are positive constants $\ell, u$ such that $\ell \cdot d_1(R_1, R_2) \leq d_2(R_1, R_2) \leq u \cdot d_1(R_1, R_2)$ for all $R_1, R_2$.

Combined with the above results, this means that STARC metrics are unique (up to bilipschitz equivalence)! In other words, they capture what it means for two reward functions to be “similar” in a fairly unique and canonical way, and it will not be possible to improve upon them without losing some of their desirable properties.

 

Experimental Results

In the main paper, we also provide a range of empirical results. The main takeaway from these experiments is that STARC metrics indeed seem to correlate well with worst-case regret in randomly generated MDPs. We also show that STARC metrics can be estimated in large continuous environments, where they cannot be calculated exactly. For the exact data, please see the main paper.

 

Conclusion

STARC metrics induce both an upper and a lower bound on worst-case regret, which means that a small distance under a STARC metric is both necessary and sufficient for ensuring low regret. In other words, $d(R_1, R_2)$ is small if and only if we are guaranteed to get similar outcomes whether we optimise $R_1$ or $R_2$. Moreover, all pseudometrics with these properties are bilipschitz equivalent. This means that STARC metrics exactly capture what it means for two reward functions to be similar (at least for one informative way of formalising “similarity”). They are easy to work with theoretically, and can be estimated in large environments. This makes them a useful tool when evaluating reward learning algorithms.

One of the main motivations for developing these metrics was to extend the results in the paper Misspecification in Inverse Reinforcement Learning (which I also discussed in this post [LW · GW]). In the next post in this sequence, I will show how to use STARC metrics to analyse how sensitive IRL is to misspecification.

 

If you have any questions, please feel free to ask them in the comments!

  1. ^

    The definition of completeness given in the main paper is slightly more complicated, in order to rule out a potential edge-case.
