Defining and Characterising Reward Hacking
post by Joar Skalse (Logical_Lunatic) · 2025-02-28
In this post, I will provide a summary of the paper Defining and Characterising Reward Hacking, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the sixth post in the theoretical reward learning sequence, which starts in this post (though this post is self-contained). This was joint work with Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger.
Background Question and Problem Formalisation
This paper deals with the question of what it takes for two reward functions to be “unhackable”. The setup is as follows: we assume that the preferences of a human are described by some reward function $R_1$, but that an AI system is told to instead optimise a proxy reward $R_2$. We want to know under what conditions the proxy reward $R_2$ is “unhackable” with respect to the true reward function $R_1$. We say that $R_1$ and $R_2$ are unhackable (with respect to each other) if there are no policies $\pi_1, \pi_2$ such that $J_2(\pi_1) < J_2(\pi_2)$ but $J_1(\pi_1) > J_1(\pi_2)$, where $J_1$ and $J_2$ are the policy evaluation functions of $R_1$ and $R_2$. Note that we allow for there to be policies $\pi_1, \pi_2$ such that $J_2(\pi_1) < J_2(\pi_2)$ but $J_1(\pi_1) = J_1(\pi_2)$, etc.
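To make this condition concrete, here is a minimal sketch (my own, not from the paper) of how one might test it numerically: sample a finite set of stationary policies for a small random MDP, evaluate them under a true reward and a noisy proxy, and search for a “hacking pair”. All of the names and parameters here (`policy_return`, `sample_policy`, the MDP sizes) are illustrative, and a search over a finite sample can only ever find hacking pairs, not certify their absence.

```python
import numpy as np

# A minimal sketch (my own, not from the paper): numerically search for a
# "hacking pair" between a true reward R1 and a noisy proxy R2 on a small
# random MDP. All names and parameters here are illustrative.

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
mu = np.ones(n_states) / n_states                                  # initial state distribution
R1 = rng.normal(size=(n_states, n_actions))                        # "true" reward
R2 = R1 + 0.1 * rng.normal(size=(n_states, n_actions))             # perturbed proxy reward

def policy_return(R, pi):
    """J(pi): expected discounted return of a stationary policy pi[s, a] under reward R."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transition matrix under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward in each state under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return mu @ V

def sample_policy():
    """A random stationary stochastic policy."""
    return rng.dirichlet(np.ones(n_actions), size=n_states)

# A hacking pair is (pi, pi') with J2(pi) < J2(pi') but J1(pi) > J1(pi').
policies = [sample_policy() for _ in range(200)]
J1 = np.array([policy_return(R1, pi) for pi in policies])
J2 = np.array([policy_return(R2, pi) for pi in policies])
hacked = any(
    J2[i] < J2[j] and J1[i] > J1[j]
    for i in range(len(policies))
    for j in range(len(policies))
)
# Typically True for a generic perturbed proxy; a finite sample can only ever
# find hacking pairs, it cannot certify that none exist.
print("hacking pair found among sampled policies:", hacked)
```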
First of all, why have we chosen to formalise unhackability in this way? Note that this condition is equivalent to saying that there is no way to increase the proxy reward while decreasing the true reward. Stated differently, the proxy reward can never explicitly incentivise the AI system to reduce the true reward. It seems reasonable to say that this is exactly the condition under which the proxy is fully unhackable.
Should we expect unhackable proxy rewards to exist? If we consider general functions over arbitrary sets, then there are certainly functions which satisfy this relationship. Consider, for example, the function $\mathrm{ReLU}(x) = \max(0, x)$ and the function $\tanh(x)$. There are no $x, y$ such that $\mathrm{ReLU}(x) < \mathrm{ReLU}(y)$ but $\tanh(x) > \tanh(y)$, and so $\mathrm{ReLU}$ and $\tanh$ satisfy this relationship. Intuitively speaking, we can think of $\mathrm{ReLU}$ as a “simplified” version of $\tanh$, in the sense that if $\mathrm{ReLU}(x) < \mathrm{ReLU}(y)$ then $\tanh(x) < \tanh(y)$, though there are $x, y$ such that $\mathrm{ReLU}(x) = \mathrm{ReLU}(y)$ but $\tanh(x) \neq \tanh(y)$. In other words, while $\mathrm{ReLU}$ fails to capture all of the distinctions that $\tanh$ cares about, any change which is an improvement according to $\mathrm{ReLU}$ is also an improvement according to $\tanh$. Perhaps there are reward functions that satisfy a similar relationship? If so, then that would suggest that we may be able to create “simplified” proxy rewards that are safe to optimise, even if they do not capture all the details of human preferences.
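As a quick sanity check of this example, here is a small numerical search (my own illustration, not from the paper) for a pair of inputs on which ReLU and tanh disagree about which is better:

```python
import numpy as np

# A quick numerical sanity check (my own illustration) of the ReLU/tanh example:
# search for inputs x, y with ReLU(x) < ReLU(y) but tanh(x) > tanh(y).
rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=2000)
relu, tanh = np.maximum(0.0, xs), np.tanh(xs)

# Does increasing ReLU ever decrease tanh? (Expected: no.)
hacking_pair = ((relu[:, None] < relu[None, :]) & (tanh[:, None] > tanh[None, :])).any()
# Is ReLU ever indifferent where tanh is not? (Expected: yes, on negative inputs.)
simplification = ((relu[:, None] == relu[None, :]) & (tanh[:, None] != tanh[None, :])).any()

print("pair where ReLU improves but tanh gets worse:", hacking_pair)      # False
print("pair where ReLU is indifferent but tanh is not:", simplification)  # True
```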
Results
The main result of the paper is that $R_1$ and $R_2$ can only be unhackable if either $R_1$ and $R_2$ induce exactly the same ordering of policies (in which case they are equivalent), or at least one of $R_1$ and $R_2$ is indifferent between all policies (in which case it is trivial). Specifically, if there are no policies $\pi_1, \pi_2$ such that $J_1(\pi_1) < J_1(\pi_2)$ but $J_2(\pi_1) > J_2(\pi_2)$, or vice versa, then either $J_1(\pi_1) < J_1(\pi_2)$ if and only if $J_2(\pi_1) < J_2(\pi_2)$ (meaning that $R_1$ and $R_2$ are equivalent), or $J_1(\pi_1) = J_1(\pi_2)$ for all $\pi_1, \pi_2$ (meaning that $R_1$ is trivial), or $J_2(\pi_1) = J_2(\pi_2)$ for all $\pi_1, \pi_2$ (meaning that $R_2$ is trivial). This means that there are no non-trivial, non-equivalent unhackable proxy rewards: if we want to ensure that a proxy reward cannot be hacked, then we must (unfortunately) ensure that it captures all details of our true preferences. Note that this result depends on certain geometric properties that hold specifically for MDPs and reward functions, since an analogous result does not hold for arbitrary real-valued functions on arbitrary sets (as the ReLU/tanh example above shows). For the proof, see the main paper.
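For reference, here is the trichotomy written out in the notation used above (my paraphrase of the statement; see the paper for the precise formulation and the proof):

```latex
% Assume R_1 and R_2 are unhackable with respect to each other, i.e. there are
% no policies \pi_1, \pi_2 with J_2(\pi_1) < J_2(\pi_2) and J_1(\pi_1) > J_1(\pi_2),
% or vice versa. Then at least one of the following holds:
\begin{align*}
  &\text{(equivalent)}    && \forall \pi_1, \pi_2: \; J_1(\pi_1) < J_1(\pi_2) \iff J_2(\pi_1) < J_2(\pi_2) \\
  &\text{($R_1$ trivial)} && \forall \pi_1, \pi_2: \; J_1(\pi_1) = J_1(\pi_2) \\
  &\text{($R_2$ trivial)} && \forall \pi_1, \pi_2: \; J_2(\pi_1) = J_2(\pi_2)
\end{align*}
```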
To make this result a bit more intuitive, we can consider a relaxed version of the same problem. Specifically, suppose that instead of only considering stationary policies, we also consider non-stationary policies (which may condition their actions on their past history, instead of just the current state). Let $\Pi$ be the set of all stationary and non-stationary policies, and suppose $R_1$ and $R_2$ are both unhackable and nontrivial relative to $\Pi$. Let $\pi^+$ be a policy that maximises both $R_1$ and $R_2$ reward, and let $\pi^-$ be a policy that minimises both $R_1$ and $R_2$ reward. You may take a moment to convince yourself that there must exist two such policies $\pi^+$ and $\pi^-$, given the assumption that $R_1$ and $R_2$ are unhackable. Then the policy $\mu_\lambda$ that plays $\pi^+$ with probability $\lambda$ and $\pi^-$ with probability $1 - \lambda$ is a policy in $\Pi$. Moreover, by continuity and the intermediate value theorem, for any policy $\pi$ there are two values $\lambda_1, \lambda_2 \in [0, 1]$ such that $J_1(\mu_{\lambda_1}) = J_1(\pi)$ and $J_2(\mu_{\lambda_2}) = J_2(\pi)$. If $R_1$ and $R_2$ are non-trivial, then $\lambda_1$ and $\lambda_2$ are unique. Now, if $\lambda_1 \neq \lambda_2$, then either $J_1(\mu_\lambda) > J_1(\pi)$ and $J_2(\mu_\lambda) < J_2(\pi)$, or vice versa, for $\lambda$ strictly between $\lambda_1$ and $\lambda_2$. If $R_1$ and $R_2$ are unhackable then this cannot happen, so it must be that $\lambda_1 = \lambda_2$. This, in turn, implies that $J_1(\pi) \leq J_1(\pi')$ if and only if $J_2(\pi) \leq J_2(\pi')$, and so $R_1$ and $R_2$ are equivalent. This means that no interesting unhackability can occur on the set of all non-stationary policies. The same argument cannot be applied to the set of stationary policies, because $\mu_\lambda$ is typically not stationary (and mixing the action probabilities of two stationary policies does not have the same effect). However, with a slightly more complicated argument, it is possible to show that the same result applies to the set of all stationary policies as well.
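The two facts doing the work in this sketch can be written out as follows (again in my notation, with $\pi^+$, $\pi^-$, and $\mu_\lambda$ as above):

```latex
% Return of the mixture policy \mu_\lambda, which commits at time zero to playing
% \pi^+ with probability \lambda and \pi^- with probability 1 - \lambda:
J_i(\mu_\lambda) = \lambda \, J_i(\pi^+) + (1 - \lambda) \, J_i(\pi^-),
  \qquad i \in \{1, 2\}.

% For non-trivial R_i this is continuous and strictly increasing in \lambda, so
% for each policy \pi the matching parameters \lambda_1, \lambda_2 are unique.
% If \lambda_1 < \lambda_2, then for any \lambda \in (\lambda_1, \lambda_2):
J_1(\mu_\lambda) > J_1(\mu_{\lambda_1}) = J_1(\pi)
  \quad \text{and} \quad
J_2(\mu_\lambda) < J_2(\mu_{\lambda_2}) = J_2(\pi),

% so \pi and \mu_\lambda would form a hacking pair, contradicting unhackability.
```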
The paper also contains a few additional results about what happens if you restrict the definition to only range over some restricted set of policies (such as only the deterministic policies, for example). In short, any set of policies that contains a subset which is open (in the ordinary Euclidean topology on policies) will be subject to the result above. This includes the set of all “approximately optimal” policies, the set of all “approximately deterministic” policies, the set of all policies sufficiently close to a given base policy, and so on. On the other hand, if we use a finite set of policies (such as the set of all deterministic policies, for example), then there can be reward functions that are unhackable, non-equivalent, and non-trivial. However, the reason for this is essentially that a sufficiently small perturbation of any given reward function produces another reward function that behaves almost the same as the original on a given finite set of policies, and so this result is unlikely to be very helpful in practice.
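To illustrate the finite-set caveat, here is a toy sketch (again my own, with illustrative names): a sufficiently small perturbation of a reward function preserves the strict ordering of the finitely many deterministic policies of a small MDP, so no hacking pair exists within that finite set, even though the perturbed reward is generically neither equivalent to the original nor trivial on the full policy space (the code below only checks the finite-set part).

```python
import numpy as np
from itertools import product

# A toy sketch (my own, with illustrative names) of the finite-set caveat: a
# sufficiently small perturbation of R1 preserves the strict ordering of the
# finitely many deterministic policies of a small MDP, so no hacking pair
# exists within that finite set.

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
mu = np.ones(n_states) / n_states
R1 = rng.normal(size=(n_states, n_actions))
R2 = R1 + 1e-4 * rng.normal(size=(n_states, n_actions))  # tiny perturbation

def policy_return(R, pi):
    """J(pi): expected discounted return of a stationary policy pi[s, a] under reward R."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    return mu @ np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Enumerate all deterministic policies (one action per state).
det_policies = []
for choice in product(range(n_actions), repeat=n_states):
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), choice] = 1.0
    det_policies.append(pi)

J1 = np.array([policy_return(R1, pi) for pi in det_policies])
J2 = np.array([policy_return(R2, pi) for pi in det_policies])
hacked = any(
    J2[i] < J2[j] and J1[i] > J1[j]
    for i in range(len(det_policies))
    for j in range(len(det_policies))
)
# Expected: False for a small enough perturbation, since the ordering of a
# finite set of policies is then preserved.
print("hacking pair among deterministic policies:", hacked)
```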
Conclusion
It is interesting to see that it is impossible to create non-trivial proxy rewards that are unhackable. We can see this result as providing some theoretical justification for the common intuition that two reward functions or utility functions must be very nearly identical in order not to allow for reward hacking or Goodharting. It would have been nice if this weren't the case: then we could have started looking for interesting ways to create “simplified” reward functions that are safe to optimise, without encoding everything humans care about. However, also note that this paper relies on a fairly strong notion of “unhackability”; it may be interesting to also consider weaker formalisations that may still be sufficient in practice.
If you have any questions, then please let me know in the comments!
In the next post in this sequence, I will discuss several additional papers on the theory of reward learning, but without going into as much depth as I did in this (and previous) post(s).