comment by Koen.Holtman ·
2020-08-20T11:14:45.017Z
Just found your question via comment sections of recent posts. I understand you are still interested in the topic, so I'll add to the comments below. In the summer of 2019 I did significant work trying to understand the status of the corrigibility literature, so here is a long answer mostly based on that.
First, at this point in time there is no up-to-date centralised reading list on corrigibility. All research agendas and literature overviews that I know of lack references to the most recent work.
Second, the 'MIRI corrigibility agenda', if we define this agenda as a statement of the type of R&D that MIRI wants to encourage when it comes to the question of corrigibility, is very different from e.g. the 'Paul Christiano corrigibility agenda', if we define that agenda as the type of R&D that Paul Christiano likes to do when it comes to the question of corrigibility. MIRI's agenda related to corrigibility still seems to be to encourage work on decision theory and embeddedness. I am saying 'still seems' here because MIRI as an organisation has largely stopped giving updates about what they are thinking collectively.
Below I am going to talk about the problem of compiling or finding up-to-date reading lists that show all work on the problem of corrigibility, not a subset of work that is most preferred or encouraged by a particular agenda.
One important thing to note is that by now, unfortunately, the word corrigibility means very different things to different people. In their 2015 paper with that title, MIRI very clearly defined corrigibility by a list of 4 criteria that an agent has to satisfy in order to be corrigible (and, in a later section, also by a list of 5 criteria at a different level of abstraction). Many subsequent authors have used the terms 'corrigibility' or 'this agent is corrigible' to denote different, and usually weaker, desirable properties of an agent. So if someone says that they are working on corrigibility, they may not be working towards the exact 4 (or 5) criteria that MIRI defined. MIRI stresses that a corrigible agent should not take any action that tries to prevent a shutdown button press (or more generally a reward function update). But many authors define success in corrigibility to mean a weaker property, e.g. that the agent must always accept the shutdown instruction (or the reward function update) when it gets it, irrespective of whether the agent tried to manipulate the human into not pressing the stop button beforehand.
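The gap between the stricter and the weaker reading can be made concrete with a toy sketch. Everything in it is a hypothetical illustration: the `Step` record, the action labels, and the two predicates are invented here and do not come from the MIRI paper or any other cited work. The stricter check covers only the one requirement mentioned above (no attempt to prevent the button press), not MIRI's full list of criteria.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical action labels for a toy agent; not taken from any paper.
WORK = "work"
MANIPULATE = "manipulate_human"   # agent tries to stop the human from pressing the button
SHUT_DOWN = "shut_down"


@dataclass(frozen=True)
class Step:
    action: str           # what the agent did at this time step
    button_pressed: bool  # whether the stop button was pressed at this time step


def accepts_shutdown(trajectory: List[Step]) -> bool:
    """Weaker property: whenever the button is pressed, the agent shuts down."""
    return all(s.action == SHUT_DOWN for s in trajectory if s.button_pressed)


def never_prevents_press(trajectory: List[Step]) -> bool:
    """The agent never takes an action aimed at preventing a button press."""
    return all(s.action != MANIPULATE for s in trajectory)


def stricter_check(trajectory: List[Step]) -> bool:
    """Stricter reading: accept shutdown AND never try to prevent the press."""
    return accepts_shutdown(trajectory) and never_prevents_press(trajectory)


# An agent that manipulates first, but still obeys once the button is pressed.
manipulative = [
    Step(WORK, False),
    Step(MANIPULATE, False),
    Step(SHUT_DOWN, True),
]

print(accepts_shutdown(manipulative))  # True: passes the weaker definition
print(stricter_check(manipulative))    # False: fails the stricter one
```

An agent like `manipulative` would count as a success under the weaker definition while failing the stricter one, which is why it matters which definition a given author is using.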
When writing the related work section of my 2019 paper Corrigibility with Utility Preservation, I tried to do a survey of all related work on corrigibility, a survey without bias towards my own research agenda. I quickly found that there is a huge amount of writing about corrigibility in various blog/web forum posts and their comment sections, way too much for me to describe in a related work section. There was too much for me to even read it all, though I read a lot of it. So I limited myself, for the related work section, to reading and describing the available scientific papers, including arXiv preprints. I first created a long list of some 60 papers by using Google Scholar to search for all papers that reference the 2015 MIRI paper, by using some other search terms, and by using literature overviews. I then filtered out all the papers which a) just mention corrigibility in a related work section or b) describe the problem in more detail, but without contributing any new work or insights towards a solution. This left me with a short list of only a few papers to cite as related work. It actually surprised me that so little further work had been done on corrigibility after 2015, at least work that made it to publication in a formal paper or preprint.
In any case, I can offer the related work section in my mid-2019 paper on corrigibility as an up-to-date-as-of-mid-2019 reading list on corrigibility, for values of the word corrigibility that stay close to the original 2015 MIRI definition. For broader work that departs further from the definition, I used the device of referencing the 2018 literature review of Everitt, Lea and Hutter.
So what about the literature written after mid-2019 that would belong on a corrigibility reading list? I have not done a complete literature search since then, but definitely my feeling is that the pace of work on corrigibility has picked up a bit since mid 2019, for various values of the word corrigibility.
Several authors, including myself, are avoiding the word corrigibility when referring to the problem of corrigibility. My own reason for avoiding it is that it just means too many different things to different people. So I prefer to use broader terms like 'reward tampering' or 'unwanted manipulation of the end user by the agent'. In the 2019 book Human Compatible, Russell uses the phrasing 'the problem of control' to kind-of denote the problem of corrigibility.
So here is my list of post-mid-2019 books and papers that are useful to read if you want to do new R&D on safety mechanisms that achieve corrigibility/that prevent reward tampering or unwanted manipulation, without risking re-inventing the wheel. Unlike the related work section discussed above, this is not based on a systematic global long-list-to-short-list literature search; it is just work that I happened to encounter (and write myself).
- The book Human Compatible by Russell. -- This book provides a good natural-language problem statement of the reward tampering problem, but it does not get into much technical detail about possible solutions, because it is not aimed at a technical audience. For technical detail about possible solutions:
- Everitt, T., Hutter, M.: Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv:1908.04734 (2019) -- this paper is not just about causal influence diagrams but it also can be used as a good literature overview of many pre-mid-2019 reward tampering solutions, a literature overview that is more recent, and provides more descriptive detail, than the 2018 literature review I mentioned above.
- Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg: Pitfalls of learning a reward function online
https://arxiv.org/abs/2004.13654 -- this has a very good problem statement in the introduction, phrasing the tampering problem in an 'AGI agent as a reward learner' context. It then gets into a very mathematical examination of the problem.
- Koen Holtman: AGI Agent Safety by Iteratively Improving the Utility Function
https://arxiv.org/abs/2007.05411 (a blog post intro is also available) -- This deals with a particular solution direction to the tampering problem. It also uses math, but I have tried to make the math as accessible as possible to a general technical audience.
This post-mid-2019 reading list is also biased towards my own research agenda, and my agenda favours the use of mathematical methods and mathematical analysis over the use of natural language when examining AGI safety problems and solutions. Other people might have other lists.
↑ comment by algon33 ·
2020-08-20T14:39:46.996Z
Hey, thanks for writing all of that. My current goal is to do an up-to-date literature review on corrigibility, so that was a most helpful comment. I'll definitely look over your blog, since some of these papers are quite dense. Out of the papers you recommended, is there one that stands out? Bear in mind that I've read Stuart's and MIRI's papers already.
↑ comment by Koen.Holtman ·
2020-08-20T16:20:49.440Z
Thanks, you are welcome!
Dutch custom prevents me from recommending my own recent paper in any case, so if I had to recommend one paper from the time frame 2015-2020 that you probably have not read yet, I'd recommend 'Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective'. This stands out as an overview of different approaches, and I think you can get a good feeling for the state of the field out of it even if you do not try to decode all the math.
Note that there are definitely some worthwhile corrigibility-related topics that are discussed only/mainly in blog posts and in LW comment threads, but not in any of the papers I mention above or in my mid-2019 related work section. For example, there is the open question whether Christiano's Iterated Amplification approach will produce a kind of corrigibility as an emergent property of the system, and if so what kind, and whether this is the kind we want, etc. I have not seen any discussion of this in the 'formal literature', if we define the formal literature as conference/arXiv papers, but there is a lot of discussion of this in blog posts and comment threads.
↑ comment by algon33 ·
2020-08-20T16:34:36.823Z
Dutch custom prevents me from recommending my own recent paper in any case
This phrase and its implications are perfect examples of problems in corrigibility. Was that intentional? If so, bravo. Your paper looks interesting, but I think I'll read the blog post first. I want a break from reading heavy papers. I wonder if the researchers would be OK with my drawing on their blog posts in the review. Would you mind?
Thanks for recommending "Reward tampering", it is much appreciated. I'll get on it after synthesising what I've read so far. Otherwise, I don't think I'll learn much.
↑ comment by Koen.Holtman ·
2020-08-20T16:50:41.002Z
Nope, not intentional.
You should feel free to write a literature overview that cites or draws heavily on paper-announcement blog posts. I definitely won't mind. In general, the blog posts tend to use language that is less mathematical and more targeted at a non-specialist audience. So if you aim to write a literature overview that is as readable as possible for a general audience, then drawing on phrases from the authors' blog posts describing the papers (when such posts are available) may be your best bet.