Misspecification in Inverse Reinforcement Learning - Part II

post by Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:59.570Z · LW · GW · 0 comments

Contents

  Formalism
  Necessary and Sufficient Conditions
  Misspecified Parameters
  Perturbation Robustness
  Conclusion

In this post, I will provide a summary of the paper Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fifth post in the theoretical reward learning sequence, which starts in this post [LW · GW]. This post is somewhat self-contained, but I will largely assume that you have read this post [LW · GW] and this post [LW · GW] before reading this one.

In Misspecification in Inverse Reinforcement Learning (also discussed in this post), I analyse how sensitive IRL is to misspecification of the behavioural model. The main limitation of that analysis is that it is based on equivalence relations – that is, it only distinguishes between the case where the learnt reward function is equivalent to the ground truth reward and the case where it is not (for some specific ways of defining this equivalence). This means that it cannot distinguish between small and large errors in the learnt reward. Quantifying the differences between reward functions is nontrivial — to solve this, I developed STARC metrics, which are described in this post [LW · GW]. In Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, which I’m summarising here, I extend the analysis in Misspecification in Inverse Reinforcement Learning using STARC metrics.

 

Formalism

We must first modify the definition in Misspecification in Inverse Reinforcement Learning to make use of pseudometrics on the space of reward functions $\mathcal{R}$. This is straightforward:

Definition: Given a pseudometric $d$ on $\mathcal{R}$ and two behavioural models $f, g$, we say that $f$ is $\epsilon$-robust to misspecification with $g$ if

  1. If $f(R_1) = g(R_2)$, then $d(R_1, R_2) \leq \epsilon$.
  2. If $f(R_1) = f(R_2)$, then $d(R_1, R_2) \leq \epsilon$.
  3. $\mathrm{Im}(g) \subseteq \mathrm{Im}(f)$.
  4. $g \neq f$.

This definition is directly analogous to that given in Misspecification in Inverse Reinforcement Learning (and in this post [LW · GW]).
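
To make the definition above concrete, here is a minimal brute-force sketch (my own illustration, not code from the paper) that checks the four conditions for behavioural models represented as lookup tables over a small, finite set of reward functions. All of the names and the toy setup are hypothetical:

```python
# Toy sketch: check whether "f is eps-robust to misspecification with g" over a finite
# set of reward functions. f and g map (hashable) reward ids to (hashable) policies,
# and d_R is a pseudometric on the reward ids. Illustrative only.

def is_eps_robust(f: dict, g: dict, d_R, eps: float) -> bool:
    rewards = list(f)
    # Condition 4: g must differ from f on at least one reward function.
    if all(g[r] == f[r] for r in rewards):
        return False
    # Condition 3: the image of g must be contained in the image of f.
    if not set(g.values()) <= set(f.values()):
        return False
    for r1 in rewards:
        for r2 in rewards:
            # Condition 1: if f(R1) = g(R2), then d_R(R1, R2) <= eps.
            if f[r1] == g[r2] and d_R(r1, r2) > eps:
                return False
            # Condition 2: if f(R1) = f(R2), then d_R(R1, R2) <= eps.
            if f[r1] == f[r2] and d_R(r1, r2) > eps:
                return False
    return True

# Tiny example: g collapses R1 and R2 onto the same policy, while f does not.
f = {"R1": "pi_a", "R2": "pi_b"}
g = {"R1": "pi_b", "R2": "pi_b"}
d_R = lambda r1, r2: 0.0 if r1 == r2 else 0.3
print(is_eps_robust(f, g, d_R, eps=0.5))  # True: every conflated pair is within 0.5
print(is_eps_robust(f, g, d_R, eps=0.1))  # False: f(R2) = g(R1), but d_R(R2, R1) = 0.3 > 0.1
```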

Some of the results in this paper apply to any pseudometric on $\mathcal{R}$, but sometimes, we will have to use a specific pseudometric. In those cases, I will use the STARC metric that normalises and measures the distance using the $L_2$-norm, canonicalises using the canonicalisation function that is minimal for the $L_2$-norm, and divides the resulting number by 2 (to ensure that the distance is normalised to lie between 0 and 1). The reason for this is primarily that this STARC metric is fairly easy to work with theoretically (but note that all STARC metrics are bilipschitz equivalent, so this choice is not very consequential). I will refer to this pseudometric as $d^{STARC}$.
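
As a concrete illustration, here is a small sketch (mine, not the paper's code) of how this pseudometric can be computed for reward functions represented as finite vectors, assuming we already have a canonicalisation function `canon` with the required properties. The function names, and the identity stand-in used at the end, are hypothetical:

```python
import numpy as np

def starc_distance(r1: np.ndarray, r2: np.ndarray, canon) -> float:
    """Sketch of the STARC pseudometric described above: canonicalise each reward,
    normalise to unit L2-norm, take the L2-distance, and divide by 2 so that the
    result lies in [0, 1]. `canon` is assumed to map rewards that differ by potential
    shaping / S'-redistribution to the same vector (e.g. the minimal L2 canonicalisation)."""
    c1, c2 = canon(r1), canon(r2)
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    # One common convention: a reward that canonicalises to zero stays at the zero vector.
    u1 = c1 / n1 if n1 > 0 else c1
    u2 = c2 / n2 if n2 > 0 else c2
    return 0.5 * float(np.linalg.norm(u1 - u2))

# Illustration only, with the identity as a stand-in canonicalisation:
r = np.array([1.0, -2.0, 0.5])
print(starc_distance(r, -r, canon=lambda x: x))      # ~1.0: opposite rewards are maximally far apart
print(starc_distance(r, 2 * r, canon=lambda x: x))   # 0.0: positive rescaling does not change the distance
```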

Note that two reward functions $R_1, R_2$ have the same policy order if and only if $d(R_1, R_2) = 0$, provided that $d$ is a STARC metric (including $d^{STARC}$). This means that if $f$ is robust to misspecification with $g$ relative to the policy-order equivalence relation (in the terminology used in this post [LW · GW]), then $f$ is 0-robust to misspecification with $g$ (and thus $\epsilon$-robust for each $\epsilon \geq 0$) using the terminology above (if $d$ is any STARC metric). The results from Misspecification in Inverse Reinforcement Learning thus carry over directly to this setting, although we may also be able to derive some additional, more permissive, results.

 

Necessary and Sufficient Conditions

We can first use the above definition to derive necessary and sufficient conditions that completely describe all forms of misspecification that some behavioural models are robust to:

Theorem: Suppose a behavioural model $f$ satisfies that if $f(R_1) = f(R_2)$, then $d(R_1, R_2) = 0$. Then $f$ is $\epsilon$-robust to misspecification with $g$ if and only if $g = f \circ t$ for some reward transformation $t$ such that $d(R, t(R)) \leq \epsilon$ for all $R$, and $f \circ t \neq f$.

For a proof, see the main paper. This theorem requires that $f(R_1) = f(R_2)$ implies that $d(R_1, R_2) = 0$, which is somewhat restrictive. Unfortunately, this requirement can’t be removed without making the theorem much more complex. Fortunately, if $d$ is a STARC metric, and $f$ is either the Boltzmann-rational behavioural model or the maximal causal entropy behavioural model, then this condition is satisfied. To see this, note that if $f$ is either of these two behavioural models, and $f(R_1) = f(R_2)$, then $R_1$ and $R_2$ differ by potential shaping and S'-redistribution (see this post [LW · GW]). Moreover, both of these reward transformations preserve the ordering of all policies, and any STARC metric satisfies that $d(R_1, R_2) = 0$ if and only if $R_1$ and $R_2$ have the same policy order (see this post [LW · GW]). We can therefore use the above theorem to fully characterise all forms of misspecification that these behavioural models will tolerate. To do this, we first also need the following:

Theorem: A transformation $t : \mathcal{R} \to \mathcal{R}$ satisfies that $d^{STARC}(R, t(R)) \leq \epsilon$ for all $R$ if and only if $t$ can be expressed as a composition $t_2 \circ s \circ t_1$, where $t_1$ and $t_2$ are given by some combination of potential shaping, S’-redistribution, and positive linear scaling, and $s$ satisfies a condition (stated explicitly in the paper) that bounds, for all $R$, how far $s$ may move $R$ in terms of $\epsilon$ and the canonicalisation function $c$ used by $d^{STARC}$.

Thus, if $f$ is either the Boltzmann-rational or the maximal causal entropy behavioural model, then $f$ is $\epsilon$-robust to misspecification with $g$ if and only if $g = f \circ t$ for some reward transformation $t$ that satisfies the conditions above. Unfortunately, this condition is very opaque and not very easy to interpret intuitively. For that reason, we will also examine a few specific types of misspecification more directly.

 

Misspecified Parameters

A very interesting result from Misspecification in Inverse Reinforcement Learning is that almost no behavioural model is robust to any misspecification of the discount parameter $\gamma$ or the transition function $\tau$. An interesting question is whether this is an artifact of the fact that that analysis was based on equivalence relations, rather than metrics. However, as it turns out, this result generalises directly to the case where we use metrics instead of equivalence relations. We say that a transition function $\tau$ is “trivial” if $\tau(s, a_1) = \tau(s, a_2)$ for all states $s$ and actions $a_1, a_2$ (i.e., basically, if the action you take never matters). All interesting environments have non-trivial transition functions:

Theorem: If a behavioural model $f$ is invariant to S'-redistribution and $\tau_1 \neq \tau_2$, then $f_{\tau_1}$ is not $\epsilon$-robust to misspecification with $f_{\tau_2}$ for any $\epsilon < 0.5$ (where $f_{\tau_i}$ denotes $f$ computed using transition function $\tau_i$).

Theorem: If a behavioural model $f$ is invariant to potential shaping, $\gamma_1 \neq \gamma_2$, and the underlying transition function $\tau$ is non-trivial, then $f_{\gamma_1}$ is not $\epsilon$-robust to misspecification with $f_{\gamma_2}$ for any $\epsilon < 0.5$ (where $f_{\gamma_i}$ denotes $f$ computed using discount $\gamma_i$).

These results assume that we quantify the error in the learnt reward using $d^{STARC}$. This pseudometric ranges from 0 to 1, so a $d^{STARC}$-distance of 0.5 would be extremely large. Moreover, a wide range of behavioural models should be expected to be invariant to S'-redistribution and potential shaping (see this post [LW · GW]). In other words, these results say that most sensible behavioural models (including all behavioural models used by contemporary IRL algorithms, and potentially also behavioural models learnt using machine learning) should be expected not to be robust to arbitrarily small misspecification of the discount factor or transition function. This is a very damning result! A more intuitive explanation for why these theorems are true is provided in Appendix B2 of this paper.
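
To get a rough sense of why closeness in $\gamma$ does not help (this is only a partial intuition, not the full argument in Appendix B2), note that a potential-shaping term for discount $\gamma_1$ is generally not a potential-shaping term for discount $\gamma_2$:

$\gamma_1 \Phi(s') - \Phi(s) = \big(\gamma_2 \Phi(s') - \Phi(s)\big) + (\gamma_1 - \gamma_2)\Phi(s').$

The leftover term $(\gamma_1 - \gamma_2)\Phi(s')$ is an ordinary state-dependent reward term, and since $\Phi$ can be scaled arbitrarily, it can be made large no matter how small $\gamma_1 - \gamma_2$ is. A behavioural model that treats $\gamma_1$-shaping as invisible can therefore conflate reward functions whose difference is far from negligible under $\gamma_2$.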

 

Perturbation Robustness

Another form of misspecification we look at in more detail is what we call perturbation robustness. The motivation for this is that it is interesting to know whether or not a behavioural model $f$ is robust to misspecification with any behavioural model $g$ that is “close” to $f$. But what does it mean for $f$ and $g$ to be “close”? One option is to say that $f$ and $g$ are close if they always produce similar policies, where the “similarity” between two policies is measured using some (pseudo)metric. As such, we define a notion of a perturbation and a notion of perturbation robustness:

Definition: Let $f, g$ be two behavioural models, and let $d^{\Pi}$ be a pseudometric on $\Pi$. Then $g$ is a $\delta$-perturbation of $f$ if $g \neq f$ and, for all $R \in \mathcal{R}$, we have that $d^{\Pi}(f(R), g(R)) \leq \delta$.

Definition: Let $f$ be a behavioural model, let $d^{\mathcal{R}}$ be a pseudometric on $\mathcal{R}$, and let $d^{\Pi}$ be a pseudometric on $\Pi$. Then $f$ is $\epsilon$-robust to $\delta$-perturbation if $f$ is $\epsilon$-robust to misspecification with $g$ (as defined by $d^{\mathcal{R}}$) for any behavioural model $g$ that is a $\delta$-perturbation of $f$ (as defined by $d^{\Pi}$) with $\mathrm{Im}(g) \subseteq \mathrm{Im}(f)$.

These definitions are given relative to a pseudometric $d^{\Pi}$ on the set of all policies $\Pi$. For example, $d^{\Pi}(\pi_1, \pi_2)$ could be the $L_2$-distance between $\pi_1$ and $\pi_2$, or it could be the KL divergence between their trajectory distributions, etc. As usual, our results apply for any choice of $d^{\Pi}$ unless otherwise stated.
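
As a toy illustration of the definition of a $\delta$-perturbation (my own sketch, mirroring the one in the Formalism section), it can be checked directly over a finite set of reward functions:

```python
# Sketch: check whether g is a delta-perturbation of f over a finite set of rewards.
# f and g map reward ids to policies; d_pi is a pseudometric on policies. Illustrative only.
def is_delta_perturbation(f: dict, g: dict, d_pi, delta: float) -> bool:
    differs_somewhere = any(g[r] != f[r] for r in f)               # g must not be identical to f
    close_everywhere = all(d_pi(f[r], g[r]) <= delta for r in f)   # g stays within delta of f on every reward
    return differs_somewhere and close_everywhere
```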

Now, a $\delta$-perturbation of $f$ is simply any function that is similar to $f$ on all inputs, and $f$ is $\epsilon$-robust to $\delta$-perturbation if a small perturbation of the observed policy leads to only a small error in the inferred reward function. We also need one more definition:

Definition: Let $f$ be a behavioural model, let $d^{\mathcal{R}}$ be a pseudometric on $\mathcal{R}$, and let $d^{\Pi}$ be a pseudometric on $\Pi$. Then $f$ is $\epsilon/\delta$-separating if $d^{\mathcal{R}}(R_1, R_2) > \epsilon \implies d^{\Pi}(f(R_1), f(R_2)) > \delta$ for all $R_1, R_2$.

Intuitively speaking, $f$ is $\epsilon/\delta$-separating if reward functions that are far apart are sent to policies that are far apart. Using this, we can now state the following result:

Theorem: Let $f$ be a behavioural model, let $d^{\mathcal{R}}$ be a pseudometric on $\mathcal{R}$, and let $d^{\Pi}$ be a pseudometric on $\Pi$. Then $f$ is $\epsilon$-robust to $\delta$-perturbation (as defined by $d^{\mathcal{R}}$ and $d^{\Pi}$) if and only if $f$ is $\epsilon/\delta$-separating (as defined by $d^{\mathcal{R}}$ and $d^{\Pi}$).

This gives us necessary and sufficient conditions for when a behavioural model is robust to perturbations: namely, it has to send reward functions that are far apart to policies that are far apart. This ought to be quite intuitive; if two policies are close, then perturbations may lead us to conflate them. To be sure that the learnt reward function is close to the true reward function, we therefore need it to be the case that policies that are close always correspond to reward functions that are close.
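
Continuing the toy setup from earlier (again, this is just my own illustrative sketch, not code from the paper), the separating condition can also be checked by brute force over a finite set of reward functions:

```python
# Sketch: check whether f is eps/delta-separating over a finite set of rewards, i.e.
# whether every pair of rewards more than eps apart (under d_R) is sent to a pair of
# policies more than delta apart (under d_pi). Illustrative only.
def is_separating(f: dict, d_R, d_pi, eps: float, delta: float) -> bool:
    rewards = list(f)
    return all(
        d_pi(f[r1], f[r2]) > delta
        for r1 in rewards
        for r2 in rewards
        if d_R(r1, r2) > eps
    )
```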

Our next question is, of course, whether or not the standard behavioural models are $\epsilon/\delta$-separating. Surprisingly, we will show that this is not the case when the distance between reward functions is measured using $d^{STARC}$ and the policy metric $d^{\Pi}$ is similar to the Euclidean distance. Moreover, this holds for any continuous behavioural model:

Theorem: Let $d^{\mathcal{R}}$ be $d^{STARC}$, and let $d^{\Pi}$ be a pseudometric on $\Pi$ which satisfies the condition that for all $\delta > 0$ there exists a $\delta' > 0$ such that if $\|\pi_1 - \pi_2\|_2 < \delta'$ then $d^{\Pi}(\pi_1, \pi_2) < \delta$. Let $f$ be a continuous behavioural model. Then $f$ is not $\epsilon/\delta$-separating for any $\epsilon < 1$ or $\delta > 0$.

To make things easy, we can just let $d^{\Pi}$ be the $L_2$-norm (the theorem just generalises this somewhat). The theorem then tells us that no continuous behavioural model is $\epsilon/\delta$-separating (and therefore also not $\epsilon$-robust to $\delta$-perturbation) for any $\epsilon < 1$ or $\delta > 0$. The fundamental reason for this is that if $f$ is continuous, then it must send reward functions that are close under the $L_2$-norm to policies that are close under the $L_2$-norm. However, there are reward functions that are close under the $L_2$-norm but which have a large STARC distance. Hence $f$ will send some reward functions that are far apart (under $d^{STARC}$) to policies which are close.
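
To make the last step concrete, here is a simple example (my own illustration; it assumes that the canonicalisation function $c$ underlying $d^{STARC}$ is linear, which holds for the minimal $L_2$ canonicalisation described above, since that is an orthogonal projection). Pick any reward function $R$ with $c(R) \neq 0$ and a small $\alpha > 0$, and set $R_1 = \alpha R$ and $R_2 = -\alpha R$. Then $\|R_1 - R_2\|_2 = 2\alpha\|R\|_2$ can be made arbitrarily small, while

$d^{STARC}(R_1, R_2) = \frac{1}{2}\left\|\frac{c(R_1)}{\|c(R_1)\|_2} - \frac{c(R_2)}{\|c(R_2)\|_2}\right\|_2 = \frac{1}{2}\left\|\frac{c(R)}{\|c(R)\|_2} + \frac{c(R)}{\|c(R)\|_2}\right\|_2 = 1,$

which is the maximal possible value. Any continuous behavioural model must therefore send some pairs of reward functions that are maximally far apart under $d^{STARC}$ to policies that are arbitrarily close under the $L_2$-norm.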

 

Conclusion

We can see that the results in this paper are similar in spirit to those provided in Misspecification in Inverse Reinforcement Learning (also discussed in this post [LW · GW]). In other words, while using equivalence relations is more restrictive than using pseudometrics on $\mathcal{R}$, the same basic mathematical structure emerges in both cases.

The main question behind this research was whether or not IRL is robust to moderate misspecification of the behavioural model used by the IRL algorithm — that is, whether a small error in the assumptions underlying the IRL algorithm leads to only a small error in the learnt reward. To me, it looks like the answer to this question is likely to be negative. In particular, we have seen that an arbitrarily small error in the discount factor can lead to large errors in the learnt reward function. Of course, this answer is not fully conclusive. In particular, some of these results (but not all) are based on the behavioural models that are used by current IRL algorithms, and these are very unrealistic (when seen as models of human behaviour) – it may be interesting to extend this analysis to more realistic models of human behaviour (which I have partially done in this paper, for example). Nonetheless, I would not currently put much hope in IRL (even if IRL is amplified with the help of new AI breakthroughs, etc).

In the next post of this sequence, I will discuss reward hacking, and provide some alternative results regarding how to compare reward functions.

 

If you have any questions, then please let me know in the comments!
