Defining and Characterising Reward Hacking
post by Joar Skalse (Logical_Lunatic) · 2025-02-28
In this post, I will provide a summary of the paper Defining and Characterising Reward Hacking, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the sixth post in the theoretical reward learning sequence, which starts in this post (though this post is self-contained). This was joint work with Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger.
Background Question and Problem Formalisation
This paper deals with the question of what it takes for two reward functions to be “unhackable”. The setup is as follows: we assume that the preferences of a human are described by some reward function $R_1$, but that an AI system is told to instead optimise a proxy reward $R_2$. We want to know under what conditions the proxy reward $R_2$ is “unhackable” with respect to the true reward function $R_1$. We say that $R_1$ and $R_2$ are unhackable (with respect to each other) if there are no policies $\pi_1, \pi_2$ such that $J_2(\pi_1) < J_2(\pi_2)$ but $J_1(\pi_1) > J_1(\pi_2)$, where $J_1$ and $J_2$ are the policy evaluation functions of $R_1$ and $R_2$. Note that we allow for there to be policies $\pi_1, \pi_2$ such that $J_2(\pi_1) < J_2(\pi_2)$ but $J_1(\pi_1) = J_1(\pi_2)$, etc.
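To make this condition concrete, here is a minimal sketch (my own, not from the paper) of how one might test it numerically: sample a finite set of stationary policies for a small random MDP, evaluate them under a true reward and a noisy proxy, and search for a “hacking pair”. All of the names and parameters here (`policy_return`, `sample_policy`, the MDP sizes) are illustrative, and a search over a finite sample can only ever find hacking pairs, not certify their absence.

```python
import numpy as np

# A minimal sketch (my own, not from the paper): numerically search for a
# "hacking pair" between a true reward R1 and a noisy proxy R2 on a small
# random MDP. All names and parameters here are illustrative.

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
mu = np.ones(n_states) / n_states                                  # initial state distribution
R1 = rng.normal(size=(n_states, n_actions))                        # "true" reward
R2 = R1 + 0.1 * rng.normal(size=(n_states, n_actions))             # perturbed proxy reward

def policy_return(R, pi):
    """J(pi): expected discounted return of a stationary policy pi[s, a] under reward R."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transition matrix under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward in each state under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return mu @ V

def sample_policy():
    """A random stationary stochastic policy."""
    return rng.dirichlet(np.ones(n_actions), size=n_states)

# A hacking pair is (pi, pi') with J2(pi) < J2(pi') but J1(pi) > J1(pi').
policies = [sample_policy() for _ in range(200)]
J1 = np.array([policy_return(R1, pi) for pi in policies])
J2 = np.array([policy_return(R2, pi) for pi in policies])
hacked = any(
    J2[i] < J2[j] and J1[i] > J1[j]
    for i in range(len(policies))
    for j in range(len(policies))
)
# Typically True for a generic perturbed proxy; a finite sample can only ever
# find hacking pairs, it cannot certify that none exist.
print("hacking pair found among sampled policies:", hacked)
```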
First of all, why have we chosen to formalise unhackability in this way? Note that this condition is equivalent to saying that there is no way to increase the proxy reward while decreasing the true reward. Stated differently, the proxy reward can never explicitly incentivise the AI system to reduce the true reward. It seems reasonable to say that this is exactly the condition under which the proxy is fully unhackable.
Should we expect unhackable proxy rewards to exist? If we consider general functions over arbitrary sets, then there are certainly functions which satisfy this relationship. Consider, for example, the function $\mathrm{ReLU}(x) = \max(0, x)$ and the function $\tanh(x)$. There are no $x, y$ such that $\mathrm{ReLU}(x) < \mathrm{ReLU}(y)$ but $\tanh(x) > \tanh(y)$, and so $\mathrm{ReLU}$ and $\tanh$ satisfy this relationship. Intuitively speaking, we can think of $\mathrm{ReLU}$ as a “simplified” version of $\tanh$, in the sense that if $\mathrm{ReLU}(x) < \mathrm{ReLU}(y)$ then $\tanh(x) < \tanh(y)$, though there are $x, y$ such that $\mathrm{ReLU}(x) = \mathrm{ReLU}(y)$ but $\tanh(x) \neq \tanh(y)$. In other words, while $\mathrm{ReLU}$ fails to capture all of the distinctions that $\tanh$ cares about, any change which is an improvement according to $\mathrm{ReLU}$ is also an improvement according to $\tanh$. Perhaps there are reward functions that satisfy a similar relationship? If so, then that would suggest that we may be able to create “simplified” proxy rewards that are safe to optimise, even if they do not capture all the details of human preferences.
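As a quick sanity check of this example, here is a small numerical search (my own illustration, not from the paper) for a pair of inputs on which ReLU and tanh disagree about which is better:

```python
import numpy as np

# A quick numerical sanity check (my own illustration) of the ReLU/tanh example:
# search for inputs x, y with ReLU(x) < ReLU(y) but tanh(x) > tanh(y).
rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=2000)
relu, tanh = np.maximum(0.0, xs), np.tanh(xs)

# Does increasing ReLU ever decrease tanh? (Expected: no.)
hacking_pair = ((relu[:, None] < relu[None, :]) & (tanh[:, None] > tanh[None, :])).any()
# Is ReLU ever indifferent where tanh is not? (Expected: yes, on negative inputs.)
simplification = ((relu[:, None] == relu[None, :]) & (tanh[:, None] != tanh[None, :])).any()

print("pair where ReLU improves but tanh gets worse:", hacking_pair)      # False
print("pair where ReLU is indifferent but tanh is not:", simplification)  # True
```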
Results
The main result of the paper is that $R_1$ and $R_2$ can only be unhackable if either $R_1$ and $R_2$ induce exactly the same ordering of policies (in which case they are equivalent), or at least one of $R_1$ and $R_2$ is indifferent between all policies (in which case it is trivial). Specifically, if there are no policies $\pi_1, \pi_2$ such that $J_1(\pi_1) < J_1(\pi_2)$ but $J_2(\pi_1) > J_2(\pi_2)$, or vice versa, then either $J_1(\pi_1) < J_1(\pi_2)$ if and only if $J_2(\pi_1) < J_2(\pi_2)$ (meaning that $R_1$ and $R_2$ are equivalent), or $J_1(\pi_1) = J_1(\pi_2)$ for all $\pi_1, \pi_2$ (meaning that $R_1$ is trivial), or $J_2(\pi_1) = J_2(\pi_2)$ for all $\pi_1, \pi_2$ (meaning that $R_2$ is trivial). This means that there are no non-trivial, non-equivalent unhackable proxy rewards: if we want to ensure that a proxy reward cannot be hacked, then we must (unfortunately) ensure that it captures all details of our true preferences. Note that this result depends on certain geometric properties that hold specifically for MDPs and reward functions, since an analogous result does not hold for arbitrary real-valued functions on arbitrary sets (as the ReLU/tanh example above shows). For the proof, see the main paper.
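For reference, here is the trichotomy written out in the notation used above (my paraphrase of the statement; see the paper for the precise formulation and the proof):

```latex
% Assume R_1 and R_2 are unhackable with respect to each other, i.e. there are
% no policies \pi_1, \pi_2 with J_2(\pi_1) < J_2(\pi_2) and J_1(\pi_1) > J_1(\pi_2),
% or vice versa. Then at least one of the following holds:
\begin{align*}
  &\text{(equivalent)}    && \forall \pi_1, \pi_2: \; J_1(\pi_1) < J_1(\pi_2) \iff J_2(\pi_1) < J_2(\pi_2) \\
  &\text{($R_1$ trivial)} && \forall \pi_1, \pi_2: \; J_1(\pi_1) = J_1(\pi_2) \\
  &\text{($R_2$ trivial)} && \forall \pi_1, \pi_2: \; J_2(\pi_1) = J_2(\pi_2)
\end{align*}
```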
To make this result a bit more intuitive, we can consider a relaxed version of the same problem. Specifically, suppose that instead of only considering stationary policies, we also consider non-stationary policies (which may condition their actions on their past history, instead of just the current state). Let $\Pi$ be the set of all stationary and non-stationary policies, and suppose $R_1$ and $R_2$ are both unhackable and nontrivial relative to $\Pi$. Let $\pi^+$ be a policy that maximises both $R_1$ and $R_2$ reward, and let $\pi^-$ be a policy that minimises both $R_1$ and $R_2$ reward. You may take a moment to convince yourself that there must exist two such policies $\pi^+$ and $\pi^-$, given the assumption that $R_1$ and $R_2$ are unhackable. Then the policy $\mu_\lambda$ that plays $\pi^+$ with probability $\lambda$ and $\pi^-$ with probability $1 - \lambda$ is a policy in $\Pi$. Moreover, by continuity and the intermediate value theorem, for any policy $\pi$ there are two values $\lambda_1, \lambda_2 \in [0, 1]$ such that $J_1(\mu_{\lambda_1}) = J_1(\pi)$ and $J_2(\mu_{\lambda_2}) = J_2(\pi)$. If $R_1$ and $R_2$ are non-trivial, then $\lambda_1$ and $\lambda_2$ are unique. Now, if $\lambda_1 \neq \lambda_2$, then either $J_1(\mu_\lambda) > J_1(\pi)$ and $J_2(\mu_\lambda) < J_2(\pi)$, or vice versa, for $\lambda$ strictly between $\lambda_1$ and $\lambda_2$. If $R_1$ and $R_2$ are unhackable then this cannot happen, so it must be that $\lambda_1 = \lambda_2$. This, in turn, implies that $J_1(\pi) \leq J_1(\pi')$ if and only if $J_2(\pi) \leq J_2(\pi')$, and so $R_1$ and $R_2$ are equivalent. This means that no interesting unhackability can occur on the set of all non-stationary policies. The same argument cannot be applied to the set of stationary policies, because $\mu_\lambda$ is typically not stationary (and mixing the action probabilities of two stationary policies does not have the same effect). However, with a slightly more complicated argument, it is possible to show that the same result applies to the set of all stationary policies as well.
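The two facts doing the work in this sketch can be written out as follows (again in my notation, with $\pi^+$, $\pi^-$, and $\mu_\lambda$ as above):

```latex
% Return of the mixture policy \mu_\lambda, which commits at time zero to playing
% \pi^+ with probability \lambda and \pi^- with probability 1 - \lambda:
J_i(\mu_\lambda) = \lambda \, J_i(\pi^+) + (1 - \lambda) \, J_i(\pi^-),
  \qquad i \in \{1, 2\}.

% For non-trivial R_i this is continuous and strictly increasing in \lambda, so
% for each policy \pi the matching parameters \lambda_1, \lambda_2 are unique.
% If \lambda_1 < \lambda_2, then for any \lambda \in (\lambda_1, \lambda_2):
J_1(\mu_\lambda) > J_1(\mu_{\lambda_1}) = J_1(\pi)
  \quad \text{and} \quad
J_2(\mu_\lambda) < J_2(\mu_{\lambda_2}) = J_2(\pi),

% so \pi and \mu_\lambda would form a hacking pair, contradicting unhackability.
```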
The paper also contains a few additional results about what happens if you restrict the definition to only range over some restricted set of policies (such as only the deterministic policies, for example). In short, any set of policies that contains a subset which is open (in the ordinary Euclidean topology on policies) will be subject to the result above. This includes the set of all “approximately optimal” policies, the set of all “approximately deterministic” policies, the set of all policies sufficiently close to a given base policy, and so on. On the other hand, if we use a finite set of policies (such as the set of all deterministic policies, for example), then there can be reward functions that are unhackable, non-equivalent, and non-trivial. However, the reason for this is essentially that a sufficiently small perturbation of any given reward function produces another reward function that behaves almost the same as the original on a given finite set of policies, and so this result is unlikely to be very helpful in practice.
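To illustrate the finite-set caveat, here is a toy sketch (again my own, with illustrative names): a sufficiently small perturbation of a reward function preserves the strict ordering of the finitely many deterministic policies of a small MDP, so no hacking pair exists within that finite set, even though the perturbed reward is generically neither equivalent to the original nor trivial on the full policy space (the code below only checks the finite-set part).

```python
import numpy as np
from itertools import product

# A toy sketch (my own, with illustrative names) of the finite-set caveat: a
# sufficiently small perturbation of R1 preserves the strict ordering of the
# finitely many deterministic policies of a small MDP, so no hacking pair
# exists within that finite set.

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
mu = np.ones(n_states) / n_states
R1 = rng.normal(size=(n_states, n_actions))
R2 = R1 + 1e-4 * rng.normal(size=(n_states, n_actions))  # tiny perturbation

def policy_return(R, pi):
    """J(pi): expected discounted return of a stationary policy pi[s, a] under reward R."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    return mu @ np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Enumerate all deterministic policies (one action per state).
det_policies = []
for choice in product(range(n_actions), repeat=n_states):
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), choice] = 1.0
    det_policies.append(pi)

J1 = np.array([policy_return(R1, pi) for pi in det_policies])
J2 = np.array([policy_return(R2, pi) for pi in det_policies])
hacked = any(
    J2[i] < J2[j] and J1[i] > J1[j]
    for i in range(len(det_policies))
    for j in range(len(det_policies))
)
# Expected: False for a small enough perturbation, since the ordering of a
# finite set of policies is then preserved.
print("hacking pair among deterministic policies:", hacked)
```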
Conclusion
It is interesting to see that it is impossible to create non-trivial proxy rewards that are unhackable. We can see this result as providing some theoretical justification for the common intuition that two reward functions or utility functions must be very nearly identical in order not to allow for reward hacking or Goodharting. It would have been nice if this weren't the case: then we could have started looking for interesting ways to create “simplified” reward functions that are safe to optimise, without encoding everything humans care about. However, also note that this paper relies on a fairly strong notion of “unhackability”; it may be interesting to also consider weaker formalisations that may still be sufficient in practice.
If you have any questions, then please let me know in the comments!
In the next post in this sequence, I will discuss several additional papers on the theory of reward learning, but without going into as much depth as I did in this (and previous) post(s).