New Paper Expanding on the Goodhart Taxonomy

post by Scott Garrabrant · 2018-03-14T09:01:59.735Z · LW · GW · 4 comments

This is a link post for https://arxiv.org/pdf/1803.04585.pdf

4 comments

comment by Ben Pace (Benito) · 2018-03-14T15:03:27.237Z · LW(p) · GW(p)

Woo! Good ideas -> papers! I like this and that it happened. Nice going especially to David Manheim.

Replies from: Davidmanheim
comment by Davidmanheim · 2018-03-15T18:26:51.439Z · LW(p) · GW(p)

Thanks!

comment by Davidmanheim · 2018-03-15T18:26:43.957Z · LW(p) · GW(p)

We'd love any feedback people have on the write-up.

Note: I'm in the middle of writing an extension of this work that gets much more into adversarial situations.

comment by adam_shimi · 2019-02-27T15:31:17.541Z · LW(p) · GW(p)

First, thanks to both of you for writing this really nice paper. I have two questions, which are more about my understanding (or lack thereof) than about issues with the work.

My first question is about an example I have in mind, and whether my classification of it makes sense. In this example, the regulator is a grant-maker, and the agents are researchers in a very theoretical field. The regulator wants to maximize the number of concrete applications, regardless of what the researchers are interested in (any similarity with the real world is obviously unintended). To reach this goal, the regulator prioritizes funding grant proposals that promise applications. Yet the researchers could just write about applications while only pursuing the theory (still no intended similarity with the real world...).

It seems clear to me that this is an instance of adversarial Goodhart, and more specifically of adversarial misalignment Goodhart. Do you also think this is the case? If not, why not?

My second question is more of a request: could you give me a "concrete" example of non-causal cobra effect Goodhart? I have some trouble visualizing it.