STARC: A General Framework For Quantifying Differences Between Reward Functions
post by Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:52.965Z · LW · GW
In this post, I will provide a summary of the paper STARC: A General Framework For Quantifying Differences Between Reward Functions, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fourth post in the theoretical reward learning sequence, which starts in this post [LW · GW] (though this post is self-contained).
In this paper, we consider the question of how to quantify the distance between reward functions in an informative way. That is, we want to find a function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$, where $\mathcal{R}$ is the space of all reward functions, such that $d(R_1, R_2)$ is a meaningful quantification of how similar $R_1$ and $R_2$ are. This is important for the (theoretical or empirical) study of reward learning algorithms; for example, see this post [LW · GW].
Considerations
Note that this problem is not very straightforward. A simple method for quantifying the distance between two reward functions might be to measure their $L_2$-distance. However, this is unsatisfactory, because two reward functions can have a large $L_2$-distance even if they induce the same ordering of policies, or a small $L_2$-distance even if they induce the opposite ordering of policies. For example, given an arbitrary reward function $R$ and an arbitrary positive constant $c$, we have that $R$ and $c \cdot R$ have the same ordering of policies, even though their $L_2$-distance may be arbitrarily large. Similarly, for any $\epsilon > 0$, we have that $\epsilon \cdot R$ and $-\epsilon \cdot R$ have the opposite ordering of policies (unless all policies have the same value under $R$), even though their $L_2$-distance may be arbitrarily small. Solving this problem in a good way thus requires some care.
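To make the failure of naive distances concrete, here is a small numerical sketch (my own toy illustration, not from the paper); the reward values and scaling constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=10)   # a hypothetical reward function over 10 transitions

# R and 100*R induce the same ordering of policies, but are far apart in L2.
print(np.linalg.norm(R - 100 * R))               # large

# 0.01*R and -0.01*R induce opposite orderings, but are close in L2.
print(np.linalg.norm(0.01 * R - (-0.01 * R)))    # small
```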
(There are two earlier proposals for how to do this, namely EPIC and DARD. In Appendices A and B of the main paper, we outline a number of shortcomings of these earlier methods.)
We should start by asking what it means for a given function $d$ to be “good” at quantifying the differences between reward functions. First and foremost, we probably want $d$ to be a pseudometric, since this comes with several nice mathematical properties. This means that it should satisfy:
- $d(R_1, R_2) \ge 0$, with equality if $R_1 = R_2$.
- $d(R_1, R_2) = d(R_2, R_1)$.
- $d(R_1, R_3) \le d(R_1, R_2) + d(R_2, R_3)$.
The difference between a metric and a pseudometric is that for a metric, it is required that $d(R_1, R_2) = 0$ only if $R_1 = R_2$, whereas for a pseudometric, we can have that $d(R_1, R_2) = 0$ even when $R_1 \neq R_2$. In our case, it seems reasonable to consider pseudometrics rather than metrics, since we may want to consider some distinct reward functions to have distance 0. For example, if two rewards only differ by positive linear scaling, then it seems reasonable to say that they are equivalent (and thus have distance 0).
Requiring that $d$ is a pseudometric is a very weak and general requirement. More specifically to our problem, we ideally want it to be the case that $d(R_1, R_2)$ is small if and only if optimising $R_1$ or $R_2$ would lead to similar outcomes. We can formalise this intuitive statement using regret bounds. Specifically, we say that:
Definition: A pseudometric $d$ on $\mathcal{R}$ is sound if there is a constant $U$ such that for any reward functions $R_1, R_2$ and any policies $\pi_1, \pi_2$, if $J_2(\pi_1) \le J_2(\pi_2)$, then $J_1(\pi_1) - J_1(\pi_2) \le U \cdot (\max_\pi J_1(\pi) - \min_\pi J_1(\pi)) \cdot d(R_1, R_2)$. Here $J_i(\pi)$ denotes the expected discounted return of policy $\pi$ under reward $R_i$.
Let us unpack this definition. $J_1(\pi_1) - J_1(\pi_2)$ is the regret, as measured by $R_1$, of using policy $\pi_2$ instead of $\pi_1$. Division by $\max_\pi J_1(\pi) - \min_\pi J_1(\pi)$ normalises this quantity based on the total range of $J_1$, so that it lies between 0 and 1 (though the term is put on the right-hand side of the inequality, instead of being used as a denominator, in order to avoid division by zero when $\max_\pi J_1(\pi) = \min_\pi J_1(\pi)$). The condition that $J_2(\pi_1) \le J_2(\pi_2)$ says that $R_2$ prefers $\pi_2$ over $\pi_1$. Taken together, this means that a pseudometric $d$ is sound if $d(R_1, R_2)$ gives an upper bound on the maximal regret that could be incurred under $R_1$ if an arbitrary policy $\pi_1$ is optimised into another policy $\pi_2$ according to $R_2$. It is also worth noting that this includes the special case when $\pi_1$ is optimal under $R_1$ and $\pi_2$ is optimal under $R_2$.
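To make these quantities concrete, here is a minimal tabular sketch (my own, not from the paper) of the return $J_i(\pi)$ and the normalised regret term; the array shapes and function names are illustrative assumptions, and the brute-force enumeration is only feasible for tiny MDPs:

```python
import numpy as np
from itertools import product

def policy_return(R, P, pi, gamma, mu):
    """J(pi): expected discounted return of a tabular policy.
    Assumed shapes: R (S, A, S), P (S, A, S), pi (S, A), mu (S,)."""
    S = R.shape[0]
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)   # expected one-step reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state transitions under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return mu @ v

def deterministic_policies(S, A):
    """Enumerate all deterministic policies (only sensible for very small MDPs)."""
    for assignment in product(range(A), repeat=S):
        pi = np.zeros((S, A))
        pi[np.arange(S), list(assignment)] = 1.0
        yield pi

def normalised_regret(R1, P, gamma, mu, pi_1, pi_2):
    """Regret under R1 of using pi_2 instead of pi_1, divided by the range
    max_pi J_1(pi) - min_pi J_1(pi) of attainable returns."""
    S, A = R1.shape[0], R1.shape[1]
    returns = [policy_return(R1, P, pi, gamma, mu) for pi in deterministic_policies(S, A)]
    span = max(returns) - min(returns)
    regret = policy_return(R1, P, pi_1, gamma, mu) - policy_return(R1, P, pi_2, gamma, mu)
    return regret / span if span > 0 else 0.0
```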
In addition to this, we also want pseudometrics that induce a lower bound on worst-case regret. When this is the case, we say that $d$ is complete. It may not be immediately obvious why this property is desirable. To see why this is the case, note that if a pseudometric $d$ on the space of all reward functions does not induce a lower bound on worst-case regret, then there are reward functions that have a low worst-case regret, but a large distance under $d$. This would in turn mean that $d$ is not tight, and that it should be possible to improve upon it. In other words, if we want a small distance under $d$ to be both sufficient and necessary for low worst-case regret, then $d$ must induce both an upper and a lower bound on worst-case regret. Formally, we say that:
Definition: A pseudometric $d$ on $\mathcal{R}$ is complete if there is a constant $L$, such that for any reward functions $R_1, R_2$, there are policies $\pi_1, \pi_2$ such that $J_2(\pi_1) \le J_2(\pi_2)$ and $J_1(\pi_1) - J_1(\pi_2) \ge L \cdot (\max_\pi J_1(\pi) - \min_\pi J_1(\pi)) \cdot d(R_1, R_2)$.[1]
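Continuing the sketch above (still my own illustration), the quantity that soundness and completeness bound can be brute-forced for tiny MDPs: the largest normalised $R_1$-regret over policy pairs that $R_2$ weakly prefers in the other direction. A sound pseudometric upper-bounds this quantity by $U \cdot d(R_1, R_2)$, and a complete one lower-bounds it by $L \cdot d(R_1, R_2)$.

```python
def worst_case_regret(R1, R2, P, gamma, mu):
    """Max over pairs (pi_1, pi_2) with J_2(pi_1) <= J_2(pi_2) of the normalised
    R1-regret of using pi_2 instead of pi_1. Restricting to deterministic
    policies may underestimate the true worst case; this is a rough sketch."""
    S, A = R1.shape[0], R1.shape[1]
    policies = list(deterministic_policies(S, A))
    worst = 0.0
    for pi_1 in policies:
        for pi_2 in policies:
            if policy_return(R2, P, pi_2, gamma, mu) >= policy_return(R2, P, pi_1, gamma, mu):
                worst = max(worst, normalised_regret(R1, P, gamma, mu, pi_1, pi_2))
    return worst
```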
Thus, to be useful for quantifying the differences between reward functions, a function $d$ should ideally be a pseudometric, and be both sound and complete.
STARC Metrics
In the paper, we propose a family of pseudometrics on the space of all reward functions, which we refer to as STAndardised Reward Comparison (STARC) metrics. We will also show that STARC metrics satisfy all considerations we outlined above.
STARC metrics are computed in several steps. First, we need a few new definitions:
Definition: Two reward functions $R_1, R_2$ differ by potential shaping if there is a function $\Phi : S \to \mathbb{R}$ such that
$$R_2(s, a, s') = R_1(s, a, s') + \gamma \cdot \Phi(s') - \Phi(s)$$
for all $s, a, s'$ (where $\gamma$ is the discount factor). They differ by S'-redistribution if
$$\mathbb{E}_{S' \sim \tau(s, a)}[R_2(s, a, S')] = \mathbb{E}_{S' \sim \tau(s, a)}[R_1(s, a, S')]$$
for all $s$ and $a$ (where $\tau$ is the transition function).
To get a better intuition for what potential shaping and S’-redistribution do, see this post [LW · GW]. For now, it is probably sufficient to know that if $R_1$ and $R_2$ differ by (some combination of) potential shaping and S’-redistribution, then they induce the same ordering of policies.
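As a concrete illustration of these two transformations, here is a minimal numpy sketch (my own, assuming tabular rewards stored as arrays indexed by $(s, a, s')$):

```python
import numpy as np

def apply_potential_shaping(R, Phi, gamma):
    """Return R2 with R2(s,a,s') = R(s,a,s') + gamma*Phi(s') - Phi(s).
    Assumed shapes: R (S, A, S), Phi (S,)."""
    return R + gamma * Phi[None, None, :] - Phi[:, None, None]

def apply_s_prime_redistribution(R, P):
    """One particular S'-redistribution: replace R(s,a,s') by its expectation
    over S' ~ P(s,a). This changes R pointwise but leaves E[R(s,a,S')] intact,
    so the original and the result differ by S'-redistribution."""
    expected = np.einsum("sat,sat->sa", P, R)     # E_{S'~P(s,a)}[R(s,a,S')]
    return np.broadcast_to(expected[:, :, None], R.shape).copy()
```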
Definition: A function $c : \mathcal{R} \to \mathcal{R}$ is a canonicalisation function if:
- $c$ is linear,
- $R$ and $c(R)$ differ by potential shaping and S’-redistribution for all $R$, and
- $c(R_1) = c(R_2)$ if and only if $R_1$ and $R_2$ differ by potential shaping and S’-redistribution.
A canonicalisation function essentially standardises reward functions, such that reward functions that differ by potential shaping and S’-redistribution are mapped to a single representative in their respective equivalence class. You can think of this as a kind of normalisation. (The requirement that $c$ is linear makes our later analysis much simpler, and is not too restrictive in practice. However, it could probably be lifted, with some effort.)
Definition: A metric $m : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ is admissible if there is a norm $p$ and two positive constants $l, u$, such that $l \cdot p(x - y) \le m(x, y) \le u \cdot p(x - y)$ for all $x, y$.
Any norm is of course an admissible metric, but there are some other metrics which are also admissible. This weakening is only included to make our definitions as general as possible – in practice, you can mentally replace "admissible metric" with “norm”, and not lose much.
Using this, we can now give a definition of STARC metrics:
Definition: A function $d : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$ is a STARC metric if there is a canonicalisation function $c$, a function $n$ that is a norm on $\mathrm{Im}(c)$, and a metric $m$ that is admissible on $\mathrm{Im}(s)$, such that $d(R_1, R_2) = m(s(R_1), s(R_2))$, where $s(R) = c(R)/n(c(R))$ when $n(c(R)) \neq 0$, and $s(R) = c(R)$ otherwise.
In other words, a STARC metric is computed by first applying a canonicalisation function to both of its inputs, then normalising the resulting reward functions (unless the canonicalised reward is zero everywhere, in which case we leave it unchanged), and finally measuring the distance between the resulting reward functions.
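Here is a compact sketch of that pipeline (my own, with illustrative names) for tabular rewards stored as numpy arrays; it uses a single norm both for the normalisation step and, via $m(x, y) = \lVert x - y \rVert$, for the final comparison:

```python
import numpy as np

def starc_distance(R1, R2, canon, norm=np.linalg.norm):
    """Sketch of a STARC metric: canonicalise, normalise, then compare.
    `canon` is any canonicalisation function on tabular rewards of shape (S, A, S)."""
    def standardise(R):
        c = canon(R)
        scale = norm(c)
        return c / scale if scale > 0 else c   # leave the zero reward unchanged
    return norm(standardise(R1) - standardise(R2))
```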
The most complicated part of this definition is the canonicalisation function — for the norm $n$ and metric $m$, we can simply pick any norm (such as the $L_2$-norm). Let me therefore also give two examples of canonicalisation functions:
Proposition: For any policy $\pi$, the function $c : \mathcal{R} \to \mathcal{R}$ given by
$$c(R)(s, a, s') = \mathbb{E}_{S' \sim \tau(s, a)}\left[R(s, a, S') + \gamma \cdot V^\pi(S') - V^\pi(s)\right]$$
is a canonicalisation function. Here $V^\pi$ is computed using the reward function $R$ that is given as input to $c$. Note that we must use the same policy $\pi$ for all $R$. We refer to this canonicalisation function as Value-Adjusted Levelling (VAL).
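Below is a rough tabular sketch of VAL as I understand it (my own code, with assumed array shapes); the value function $V^\pi$ is solved exactly, which is only possible in small environments, whereas in practice it would be estimated:

```python
import numpy as np

def val_canonicalisation(R, P, pi, gamma):
    """Sketch of Value-Adjusted Levelling (VAL) for a tabular MDP.
    Returns c(R)(s,a,s') = E_{S'~P(s,a)}[R(s,a,S') + gamma*V(S') - V(s)],
    where V is the value function of the fixed policy pi under R itself.
    Assumed shapes: R (S, A, S), P (S, A, S), pi (S, A)."""
    S = R.shape[0]
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)   # expected one-step reward of pi
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state transitions under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    adjusted = np.einsum("sat,sat->sa", P, R + gamma * v[None, None, :]) - v[:, None]
    # The result no longer depends on s', since the s'-dependence is averaged out.
    return np.broadcast_to(adjusted[:, :, None], R.shape).copy()
```

Combined with the `starc_distance` sketch above, a (hypothetical) STARC metric would then be `starc_distance(R1, R2, lambda R: val_canonicalisation(R, P, pi, gamma))`.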
Definition: A canonicalisation function $c$ is minimal for a norm $n$ if $n(c(R)) \le n(R')$ for all $R$ and all $R'$ such that $R$ and $R'$ differ by potential shaping and S’-redistribution.
Proposition: For the $L_2$-norm, the minimal canonicalisation function exists and is unique. To get this canonicalisation function, let $R_0$ be the reward that is zero everywhere, and let $Z$ be the set of all reward functions that differ from $R_0$ by potential shaping and S’-redistribution. Let $Z^\perp$ be the orthogonal complement of $Z$ in $\mathcal{R}$. Then the minimal canonicalisation function for the $L_2$-norm is the orthogonal projection of $\mathcal{R}$ onto $Z^\perp$. (Note that $\mathcal{R}$ can be viewed as an $|S \times A \times S|$-dimensional real vector space.)
For proofs, please see the main paper. The minimal canonicalisation function is easy to work with theoretically, but not so easy to compute for empirical experiments. By contrast, VAL can easily be estimated in even large-scale environments. Combined with some norm and some admissible metric (which may also be a norm), these form a STARC metric.
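For small tabular MDPs, though, the minimal canonicalisation function can be computed directly from the construction in the proposition above. The following is a rough sketch (my own, with illustrative names and a brute-force construction of the relevant subspace):

```python
import numpy as np

def minimal_canonicalisation(R, P, gamma):
    """Sketch of the minimal L2 canonicalisation for a tabular MDP.
    Builds a spanning set for the subspace Z of rewards that differ from the
    zero reward by potential shaping and S'-redistribution, then returns the
    orthogonal projection of R onto the complement of Z.
    Assumed shapes: R (S, A, S), P (S, A, S)."""
    S, A, _ = R.shape
    directions = []

    # Potential-shaping directions: D_i(s, a, s') = gamma*[s' == i] - [s == i].
    for i in range(S):
        D = np.zeros((S, A, S))
        D[:, :, i] += gamma
        D[i, :, :] -= 1.0
        directions.append(D.ravel())

    # S'-redistribution directions: rewards E with E_{S'~P(s,a)}[E(s,a,S')] = 0
    # for all (s, a), i.e. the null space of the conditional-expectation map.
    M = np.zeros((S * A, S * A * S))
    for s in range(S):
        for a in range(A):
            row = np.zeros((S, A, S))
            row[s, a, :] = P[s, a, :]
            M[s * A + a] = row.ravel()
    _, sing, Vt = np.linalg.svd(M)
    rank = int((sing > 1e-10).sum())
    directions.extend(Vt[rank:])          # rows spanning the null space of M

    # Project R onto Z (via least squares) and subtract, landing in Z's complement.
    Zmat = np.stack(directions, axis=1)
    coeffs, *_ = np.linalg.lstsq(Zmat, R.ravel(), rcond=None)
    return (R.ravel() - Zmat @ coeffs).reshape(S, A, S)
```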
In Appendix C of the main paper, we provide two geometric intuitions for how STARC metrics work. To get a deeper intuitive understanding for STARC metrics, it may help to read that section.
Theoretical Results
In the paper, we derive several important theoretical properties of STARC metrics. The proofs can be found in the main paper. First and foremost:
Theorem: Any STARC metric is both sound and complete.
This means that if $d$ is a STARC metric, then $d(R_1, R_2)$ is small if and only if the worst-case regret (as measured by $R_1$) of optimising $R_2$ instead of $R_1$ is small. In other words, $d(R_1, R_2)$ is small if and only if optimising $R_1$ or $R_2$ leads to similar outcomes. STARC metrics thus satisfy the main considerations we provided above. We also have that:
Proposition: Any STARC metric $d$ has the property that $d(R_1, R_2) = 0$ if and only if $R_1$ and $R_2$ induce the same ordering of policies.
This means that STARC metrics consider two reward functions to be equivalent, exactly when those reward functions induce exactly the same ordering of policies. This is intuitive and desirable (and, in fact, is a consequence of the previous theorem). We also have the following result:
Proposition: If two pseudometrics $d_1$ and $d_2$ on $\mathcal{R}$ are both sound and complete, then $d_1$ and $d_2$ are bilipschitz equivalent. This means that there are positive constants $l, u$ such that $l \cdot d_1(R_1, R_2) \le d_2(R_1, R_2) \le u \cdot d_1(R_1, R_2)$ for all $R_1, R_2$.
Combined with the above results, this means that STARC metrics are unique (up to bilipschitz equivalence)! In other words, they capture what it means for two reward functions to be “similar” in a fairly unique and canonical way, and it will not be possible to improve upon them without losing some of their desirable properties.
Experimental Results
In the main paper, we also provide a range of empirical results. The main takeaway from these experiments is that STARC metrics do indeed correlate well with worst-case regret in randomly generated MDPs. We also show that STARC metrics can be estimated in large continuous environments, where they cannot be calculated exactly. For the exact data, please see the main paper.
Conclusion
STARC metrics induce both an upper and a lower bound on worst-case regret, which means that a small distance under a STARC metric is both necessary and sufficient for ensuring low regret. In other words, $d(R_1, R_2)$ is small if and only if we are guaranteed to get similar outcomes if we optimise $R_1$ or $R_2$. Moreover, all pseudometrics with these properties are bilipschitz equivalent. This means that STARC metrics exactly capture what it means for two reward functions to be similar (at least for one informative way of formalising “similarity”). They are easy to work with theoretically, and can be estimated in large environments. This makes them a useful tool when evaluating reward learning algorithms.
One of the main motivations for developing these metrics was to extend the results in the paper Misspecification in Inverse Reinforcement Learning (which I also discussed in this post [LW · GW]). In the next post in this sequence, I will show how to use STARC metrics to analyse how sensitive IRL is to misspecification.
If you have any questions, please feel free to ask them in the comments!
[1] The definition of completeness given in the main paper is slightly more complicated, in order to rule out a potential edge case.