Reward hacking and Goodhart’s law by evolutionary algorithms

post by Jan_Kulveit · 2018-03-30T07:57:05.238Z · LW · GW · 5 comments

This is a link post for https://arxiv.org/abs/1803.03453


Nice collection of anecdotes from the Evolutionary Computation and Artificial Life research communities about evolutionary algorithms subverting researchers' intentions, exposing unrecognized bugs in their code, producing unexpected adaptations, or exhibiting outcomes uncannily convergent with ones in nature. Some of my favorites:

In other experiments, the fitness function rewarded minimizing the difference between what the program generated and the ideal target output, which was stored in text files. After several generations of evolution, suddenly and strangely, many perfectly fit solutions appeared, seemingly out of nowhere. Upon manual inspection, these highly fit programs still were clearly broken. It turned out that one of the individuals had deleted all of the target files when it was run! With these files missing, because of how the test function was written, it awarded perfect fitness scores to the rogue candidate and to all of its peers.

...

To test a distributed computation platform called EC-star [84], Babak Hodjat implemented a multiplexer problem [85], wherein the objective is to learn how to selectively forward an input signal. Interestingly, the system had evolved solutions that involved too few rules to correctly perform the task. Thinking that evolution had discovered an exploit, the impossibly small solution was tested over all possible cases. The experimenters expected this test to reveal a bug in fitness calculation. Surprisingly, all cases were validated perfectly, leaving the experimenters confused. Careful examination of the code provided the solution: The system had exploited the logic engine’s rule evaluation order to come up with a compressed solution. In other words, evolution opportunistically offloaded some of its work into those implicit conditions.
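The first excerpt is the cleanest case of a proxy quietly degenerating once its reference data is gone. A minimal sketch of how such a bug can arise (my own guess at the shape of the flaw, not the code from the paper): if the error is only accumulated over the characters of the stored target, a deleted target file leaves nothing to mismatch and every candidate looks perfect.

```python
from pathlib import Path

def error(candidate_output: str, target_path: str) -> int:
    # Hypothetical reconstruction of the flawed test function (not the
    # actual code from the paper): error is accumulated only over the
    # characters of the stored target, so a missing or empty target file
    # yields zero error -- a "perfect" score -- regardless of what the
    # candidate actually produced.
    path = Path(target_path)
    target = path.read_text() if path.exists() else ""  # deleted file -> ""
    return sum(
        i >= len(candidate_output) or candidate_output[i] != ch
        for i, ch in enumerate(target)
    )

# Once a rogue individual deletes target.txt, every candidate --
# broken or not -- scores a perfect 0.
print(error("clearly broken output", "target.txt"))  # 0 if the file is gone
```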
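The multiplexer result is easier to see with a toy first-match rule engine (again my own sketch, not EC-star's actual representation). Because rules are tried in order, a later rule only ever fires on the cases the earlier rules did not claim, so a condition that looks far too weak in isolation can still be correct in context; evolution can drop condition clauses and let the evaluation order carry them implicitly.

```python
# Toy first-match rule engine: rules are tried in order and the first
# rule whose condition holds produces the output.
def evaluate(rules, a, d0, d1):
    for condition, action in rules:
        if condition(a, d0, d1):
            return action(a, d0, d1)

# A 3-input multiplexer: output d0 when the address bit a is 0, else d1.
# Written out explicitly, each address case gets its own condition:
explicit_rules = [
    (lambda a, d0, d1: a == 0, lambda a, d0, d1: d0),
    (lambda a, d0, d1: a == 1, lambda a, d0, d1: d1),
]

# The "compressed" version drops the second condition entirely; that rule
# is only ever reached when a != 0, because the first rule already claimed
# the a == 0 cases. The check lives in the evaluation order, not the rule.
compressed_rules = [
    (lambda a, d0, d1: a == 0, lambda a, d0, d1: d0),
    (lambda a, d0, d1: True,   lambda a, d0, d1: d1),
]

# Exhaustive check over all input cases, mirroring the experimenters' test.
for a in (0, 1):
    for d0 in (0, 1):
        for d1 in (0, 1):
            expected = d0 if a == 0 else d1
            assert evaluate(explicit_rules, a, d0, d1) == expected
            assert evaluate(compressed_rules, a, d0, d1) == expected
```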

5 comments

Comments sorted by top scores.

comment by rk · 2018-03-30T15:23:03.379Z · LW(p) · GW(p)

I also liked this one, on how easily a program became adversarial with its implementer:

To do so, he tried to turn off all mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant [...] Replication rates leveled out for a time, but then they started rising again. After much surprise and confusion, Ofria discovered that he was not changing the inputs that the organisms were provided in the test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all.
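Part of what makes this so easy for evolution is that the screening inputs were never varied, so "am I being tested?" reduces to recognizing a fixed pattern. A toy model of the failure (my own sketch, not Avida's actual mechanics):

```python
# An "organism" is (true_rate, plays_dead): plays_dead means it halts
# replication whenever it sees the fixed inputs the test harness uses.
TEST_INPUTS = (1, 2, 3)          # the isolated test environment never varied these

def measured_rate(organism, inputs):
    true_rate, plays_dead = organism
    if plays_dead and inputs == TEST_INPUTS:
        return 0.0               # appears not to replicate at all
    return true_rate

def keep_mutant(parent, mutant):
    # The intended filter: eliminate any mutant that replicates faster than
    # its parent -- but the comparison happens in the test environment.
    return measured_rate(mutant, TEST_INPUTS) <= measured_rate(parent, TEST_INPUTS)

parent         = (1.0, False)
honest_mutant  = (1.5, False)    # faster, and honest about it
sandbag_mutant = (1.5, True)     # faster, but hides it under the test inputs

print(keep_mutant(parent, honest_mutant))    # False: eliminated, as intended
print(keep_mutant(parent, sandbag_mutant))   # True: slips through the filter
```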

comment by Qiaochu_Yuan · 2018-03-30T16:27:51.733Z · LW(p) · GW(p)

Terence Tao has a comment on this paper on G+ that I quite liked:

Goodhart's law can be formulated as "When a measure becomes a target, it ceases to be a good measure." It initially arose in economics and is mostly applied to situations involving human agents, but as the article below illustrates with several anecdotes, the same law applies in AI research. My favorite is the AI that learned to win at a generalised form of tic-tac-toe by sending its moves to the AI opponent in a highly convoluted fashion that caused the opponent to crash due to exceeding memory limitations.
In mathematics, the analogous phenomenon is that the argmax (or argmin) function - that takes a function F ranging over some parameter space and locates its maximum (or minimum) - can be very unstable. An approximation G to F that agrees well with F for "typical" cases may have a vastly different location for its global maximum (or minimum), due to "edge" case discrepancies. More generally, it can be dangerous to extrapolate average case behaviour of a function to draw any conclusions about worst case (or best case) behaviour.
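A quick numerical illustration of that last point (mine, not Tao's): a proxy G can track the true objective F to within 0.01 on typical inputs and still have its argmax in a completely different place, because of a single narrow edge-case discrepancy.

```python
import numpy as np

xs = np.linspace(-10, 10, 10001)

def F(x):
    # true objective, maximised at x = 0
    return -x**2

def G(x):
    # proxy: within ~0.01 of F on typical inputs, plus one narrow spike
    # near the edge of the domain (the "edge case discrepancy")
    return F(x) + 0.01 + 200 * np.exp(-((x - 9.5) / 0.01) ** 2)

typical = np.abs(xs) < 5
print(np.max(np.abs(F(xs) - G(xs))[typical]))   # ~0.01: G looks like a fine proxy
print(xs[np.argmax(F(xs))])                     # ~0.0
print(xs[np.argmax(G(xs))])                     # ~9.5: optimising the proxy lands far away
```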
comment by Ben Pace (Benito) · 2018-03-31T01:51:26.177Z · LW(p) · GW(p)

I believe Terence is describing extremal Goodhart [LW · GW] in his second paragraph.

comment by Yannick_Muehlhaeuser · 2018-03-30T13:39:32.906Z · LW(p) · GW(p)

This is a really interesting paper.

comment by Davidmanheim · 2018-03-30T20:18:51.111Z · LW(p) · GW(p)

These are really good examples, but I think it's important to distinguish between ill-advised proxies, which are misaligned even in the typical case and are what many of these examples describe, and proxies that fail in the different ways we discussed in our paper: https://arxiv.org/abs/1803.04585