Forum Digest: Corrigibility, utility indifference, & related control ideas

post by Benya_Fallenstein (Benja_Fallenstein) · 2015-03-24T17:39:09.000Z · LW · GW · 5 comments

Contents

  Papers
  Corrigibility
  Utility indifference
  Safe oracles
  Manipulating an agent's beliefs
  Low-impact agents
  Odds and ends
5 comments

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent's goal system wrong, it doesn't try to prevent you from changing it), utility indifference (the idea of removing the agent's incentives to manipulate you into changing, or not changing, its goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It's current as of 3/21/15.
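To make the "same utility in both cases" idea concrete, here is a minimal toy sketch. It is an illustration under simplifying assumptions, not the formalism of any particular post or paper: the function names, the numbers, and the reduction of each case to a single expected-utility value are all hypothetical.

```python
# Toy illustration of utility indifference (all names and numbers are hypothetical).
# eu_keep:   agent's expected utility if its goal system is left unchanged
# eu_change: agent's expected utility if its goal system is changed
# p_change:  probability that the humans change the goal system

def expected_utility(p_change, eu_keep, eu_change, compensation=0.0):
    """Expected utility given the chance of a goal change and an optional added reward."""
    return (1 - p_change) * eu_keep + p_change * (eu_change + compensation)

def indifference_compensation(eu_keep, eu_change):
    """Reward added in the 'changed' case so both cases are worth the same to the agent."""
    return eu_keep - eu_change

eu_keep, eu_change = 10.0, 3.0
bonus = indifference_compensation(eu_keep, eu_change)

probabilities = (0.0, 0.5, 1.0)
# Without compensation, the agent prefers a lower chance of being changed.
naive = [expected_utility(p, eu_keep, eu_change) for p in probabilities]
# With compensation, expected utility no longer depends on p_change.
indifferent = [expected_utility(p, eu_keep, eu_change, bonus) for p in probabilities]

print(naive)
print(indifferent)
```

In this toy setting, the uncompensated agent's expected utility falls as the chance of a goal change rises, so it would gain by manipulating you into not changing it. With the compensating reward, its expected utility is the same whatever that probability is, so it gains nothing by pushing the probability in either direction.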

Papers

As background to the posts listed below, the following two papers may be helpful.

Corrigibility

Utility indifference

Safe oracles

Manipulating an agent's beliefs

Low-impact agents

Odds and ends

5 comments

Comments sorted by top scores.

comment by WCargo (Wcargo) · 2023-06-15T14:47:40.633Z · LW(p) · GW(p)

Hey, almost all links are dead. Would it be possible to update them? Otherwise the post is pretty useless, and I am interested in them ^^

Replies from: TekhneMakre
comment by TekhneMakre · 2023-06-15T14:59:07.753Z · LW(p) · GW(p)

Note that you can probably find the broken LW posts by searching the title (+author) in LW.

comment by Stuart_Armstrong · 2015-03-24T12:40:49.000Z · LW(p) · GW(p)

Thanks for that!

I think some of the old stuff is likely superseded; I'll see once the various ideas settle. And "resource gathering agent" should not be in "low-impact agents" (the "subtraction" idea does not seem like a good one, but there are other uses for resource gathering agents).

Replies from: Benja_Fallenstein
comment by Benya_Fallenstein (Benja_Fallenstein) · 2015-03-24T17:42:20.000Z · LW(p) · GW(p)

Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I've moved "resource gathering agent" to the odds & ends.

comment by orthonormal · 2015-03-22T18:50:39.000Z · LW(p) · GW(p)

This reminds me, I should post the Loki corrigibility model here.