Goodhart Taxonomy

post by Scott Garrabrant · 2017-12-30T16:38:39.661Z · score: 192 (67 votes) · LW · GW · 30 comments

Contents

  Quick Reference
  Regressional Goodhart
    Abstract Model
    Examples
    Relationship with Other Goodhart Phenomena
    Mitigation
  Causal Goodhart
    Abstract Model
    Examples
    Relationship with Other Goodhart Phenomena
    Mitigation
  Extremal Goodhart
    Abstract Model
    Examples
    Relationship with Other Goodhart Phenomena
    Mitigation
  Adversarial Goodhart
    Abstract Model
    Examples
    Relationship with Other Goodhart Phenomena
    Mitigation

Goodhart’s Law states that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." However, this is not a single phenomenon. I propose that there are (at least) four different mechanisms through which proxy measures break when you optimize for them.

The four types are Regressional, Causal, Extremal, and Adversarial. In this post, I will go into detail about these four different Goodhart effects using mathematical abstractions as well as examples involving humans and/or AI. I will also talk about how you can mitigate each effect.

Throughout the post, I will use V to refer to the true goal and use U to refer to a proxy for that goal which was observed to correlate with V and which is being optimized in some way.


Quick Reference


Regressional Goodhart

When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.

Abstract Model

When U is equal to V + X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
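As a worked special case, here is the size of that shortfall under one specific set of assumptions that the post itself does not make: the goal V and the noise X are independent, zero-mean, and normally distributed, with variances \sigma_V^2 and \sigma_X^2, and the proxy is U = V + X.

```latex
\[
U = V + X, \qquad
V \sim \mathcal{N}(0, \sigma_V^2), \quad
X \sim \mathcal{N}(0, \sigma_X^2), \quad
V \perp X
\]
\[
\mathbb{E}[V \mid U = u] \;=\; \frac{\sigma_V^2}{\sigma_V^2 + \sigma_X^2}\, u
\]
```

Because the fraction is strictly less than 1 whenever there is any noise, a large positive proxy value u predicts a goal value that is smaller than u. With equal variances, for instance, the expected goal value is half the observed proxy value.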

The above description is when U is meant to be an estimate of V. A similar effect can be seen when U is only meant to be correlated with V by looking at percentiles. When a sample is chosen which is a typical member of the top p percent of all U values, it will have a lower V value than a typical member of the top p percent of all V values. As a special case, when you select the highest U value, you will often not select the highest V value.
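The percentile version of the effect can be checked with a small simulation. This is only a sketch under assumed conditions: the goal and the noise are independent standard normal draws, the proxy is their sum, and the function name and parameters are made up for illustration.

```python
import random


def simulate_top_fraction(n=10_000, noise_sd=1.0, top_fraction=0.01, seed=0):
    """Compare samples selected by proxy with samples selected by true goal.

    Assumption (not from the post): goal ~ N(0, 1), noise ~ N(0, noise_sd),
    and proxy = goal + noise.
    """
    rng = random.Random(seed)
    goal = [rng.gauss(0, 1) for _ in range(n)]
    proxy = [v + rng.gauss(0, noise_sd) for v in goal]

    k = max(1, int(n * top_fraction))
    # Indices of the top-k samples ranked by proxy and by goal respectively.
    top_by_proxy = sorted(range(n), key=lambda i: proxy[i], reverse=True)[:k]
    top_by_goal = sorted(range(n), key=lambda i: goal[i], reverse=True)[:k]

    def mean(values):
        return sum(values) / len(values)

    return {
        "mean_proxy_of_top_proxy": mean([proxy[i] for i in top_by_proxy]),
        "mean_goal_of_top_proxy": mean([goal[i] for i in top_by_proxy]),
        "mean_goal_of_top_goal": mean([goal[i] for i in top_by_goal]),
    }


if __name__ == "__main__":
    for name, value in simulate_top_fraction().items():
        print(f"{name}: {value:.2f}")
```

In a run of this sketch, the samples chosen by proxy should show a mean goal value that is well above average, but below both their own mean proxy value and the mean goal value of the samples chosen directly by goal.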

Examples

Examples of Regressional Goodhart are everywhere. Every time someone does something that is anything other than the thing that maximizes their goal, you could view it as them optimizing some kind of proxy (and the action to maximize the proxy is not the same as the action to maximize the goal).

Regression to the Mean, Winner’s Curse, and Optimizer’s Curse are all examples of Regressional Goodhart, as is the Tails Come Apart phenomenon.

Relationship with Other Goodhart Phenomena

Regressional Goodhart is by far the most benign of the four Goodhart effects. It is also the hardest to avoid, as it shows up every time the proxy and the goal are not exactly the same.

Mitigation

When facing only Regressional Goodhart, you still want to choose the option with the largest proxy value. While the proxy will be an overestimate, that option will still be better in expectation than options with a smaller proxy value. If you have control over what proxies to use, you can mitigate Regressional Goodhart by choosing proxies that are more tightly correlated with your goal.
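Both claims can be illustrated with a rough simulation. The setup is an assumption for the sake of the example: each trial draws a handful of options whose goal values are standard normal, the proxy adds independent normal noise, and we always pick the option with the largest proxy value. The function name and defaults are hypothetical.

```python
import random


def average_goal_of_pick(noise_sd, n_options=20, n_trials=2000, seed=0):
    """Average true goal value of the option with the highest proxy value.

    Assumption (not from the post): goal ~ N(0, 1) per option and
    proxy = goal + N(0, noise_sd) noise, independent across options.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        goal = [rng.gauss(0, 1) for _ in range(n_options)]
        proxy = [v + rng.gauss(0, noise_sd) for v in goal]
        # Choose the option that looks best according to the proxy.
        best = max(range(n_options), key=lambda i: proxy[i])
        total += goal[best]
    return total / n_trials


if __name__ == "__main__":
    for noise_sd in (0.5, 1.0, 2.0):
        print(f"noise_sd={noise_sd}: {average_goal_of_pick(noise_sd):.2f}")
```

Since a randomly chosen option has an expected goal value of 0 in this setup, any positive average means that picking by proxy still helps. The average should shrink as the noise grows, which is the sense in which a more tightly correlated proxy mitigates the effect.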

If you are not just trying to pick the best option, but also trying to have an accurate picture of what the true value will be, Regressional Goodhart may cause you to overestimate the value. If you know the exact relationship between the proxy and the goal, you can account for this by just calculating the expected goal value for a given proxy value. If you have access to a second proxy with an error independent from the error in the first proxy, you can use the first proxy to optimize, and the second proxy to get an accurate expectation of the true value. (This is what happens when you set aside some training data to use for testing.)
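The second-proxy idea can be sketched as follows, again under assumed conditions: two proxies measure the same goal with independent standard normal errors, the first is used to select the top options, and the second is used only to estimate how good those selected options really are. All names here are illustrative.

```python
import random


def holdout_demo(n=20_000, top_k=200, seed=0):
    """Select with one proxy, then estimate the true value with another.

    Assumption (not from the post): goal ~ N(0, 1), and each proxy is the
    goal plus its own independent N(0, 1) error.
    """
    rng = random.Random(seed)
    goal = [rng.gauss(0, 1) for _ in range(n)]
    proxy_a = [v + rng.gauss(0, 1) for v in goal]  # used for optimization
    proxy_b = [v + rng.gauss(0, 1) for v in goal]  # held out for evaluation

    # Optimize: keep the top_k options ranked by the first proxy only.
    chosen = sorted(range(n), key=lambda i: proxy_a[i], reverse=True)[:top_k]

    def mean(values):
        return sum(values) / len(values)

    return {
        "selection_proxy": mean([proxy_a[i] for i in chosen]),
        "holdout_proxy": mean([proxy_b[i] for i in chosen]),
        "true_goal": mean([goal[i] for i in chosen]),
    }


if __name__ == "__main__":
    for name, value in holdout_demo().items():
        print(f"{name}: {value:.2f}")
```

The proxy used for selection should come out noticeably higher than the true goal value of the chosen options, while the held-out proxy should land close to it, because its error played no part in the selection.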


Causal Goodhart

When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.

Abstract Model

If