Forum Digest: Corrigibility, utility indifference, & related control ideas

post by Benya_Fallenstein (Benja_Fallenstein) · 2015-03-24T17:39:09.000Z · score: 5 (5 votes) · LW · GW · 3 comments

Contents

  Papers
  Corrigibility
  Utility indifference
  Safe oracles
  Manipulating an agent's beliefs
  Low-impact agents
  Odds and ends
3 comments

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent's goal system wrong, it doesn't try to prevent you from correcting it), utility indifference (removing the agent's incentive to manipulate whether or not you change its goal system, by adding rewards to its utility function so that it expects the same utility in either case), and related AI control ideas. It's current as of 3/21/15.
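To make the utility-indifference idea above concrete, here is a minimal toy sketch (my own illustration, not code from any of the linked posts or papers): a compensation term theta is added to the post-press utility so that the agent's expected utility conditional on the shutdown button being pressed equals its expected utility conditional on it not being pressed, leaving it no reason to push the press probability in either direction. All of the names, numbers, and the corrected_utility helper below are hypothetical.

```python
# Toy sketch of utility indifference (illustration only): an agent whose
# utility switches from u to v when a shutdown button is pressed gets a
# compensation term theta on the pressed branch, chosen so that its expected
# utility is the same whether or not the button is pressed.

# Hypothetical toy world: three outcomes and the agent's beliefs about them,
# conditional on the button being pressed or not pressed.
outcomes = ["A", "B", "C"]

# Original utility u (in force if the button is never pressed)
# and replacement utility v (in force after a press).
u = {"A": 10.0, "B": 4.0, "C": 0.0}
v = {"A": 1.0, "B": 1.0, "C": 5.0}

# Agent's conditional beliefs P(outcome | press) and P(outcome | no press).
p_given_press = {"A": 0.2, "B": 0.3, "C": 0.5}
p_given_no_press = {"A": 0.6, "B": 0.3, "C": 0.1}


def expected(utility, dist):
    """Expected value of `utility` under the distribution `dist`."""
    return sum(dist[o] * utility[o] for o in dist)


# Compensation chosen so that E[v + theta | press] == E[u | no press].
theta = expected(u, p_given_no_press) - expected(v, p_given_press)


def corrected_utility(outcome, pressed):
    """Utility the indifferent agent actually maximizes."""
    return v[outcome] + theta if pressed else u[outcome]


# Check: expected utility is identical on both branches, so the agent gains
# nothing by making the press more or less likely.
eu_press = sum(p_given_press[o] * corrected_utility(o, True) for o in outcomes)
eu_no_press = sum(p_given_no_press[o] * corrected_utility(o, False) for o in outcomes)
assert abs(eu_press - eu_no_press) < 1e-9
```

In the actual proposals, theta is defined via the agent's own conditional expectations at the time of the press rather than fixed numbers like these, which is where the subtleties discussed in the posts below come in.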

Papers

As background to the posts listed below, the following two papers may be helpful.

Corrigibility

Utility indifference

Safe oracles

Manipulating an agent's beliefs

Low-impact agents

Odds and ends

3 comments


comment by Stuart_Armstrong · 2015-03-24T12:40:49.000Z · score: 0 (0 votes) · LW · GW

Thanks for that!

I think some of the old stuff is likely superseded; I'll see once the various ideas settle. And "resource gathering agent" should not be in "low-impact agents" (the "subtraction" idea does not seem like a good one, but there are other uses for resource gathering agents).

comment by Benya_Fallenstein (Benja_Fallenstein) · 2015-03-24T17:42:20.000Z · score: 1 (1 votes) · LW · GW

Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I've moved "resource gathering agent" to the odds & ends.

comment by orthonormal · 2015-03-22T18:50:39.000Z · score: 0 (0 votes) · LW · GW

This reminds me, I should post the Loki corrigibility model here.