Algorithm-dependent problems with self-modification

post by Manfred · 2011-05-23T18:59:39.436Z

Imagine you're a self-modifying intelligent agent, and your mother is taking you out to lunch.  Over coffee, she offers you a deal: if you will self-modify to be repulsed by eating animals to the extent that you become a vegetarian, she will pay for your lunch.  Don't try to cheat - she can tell when you're lying.  So you compare the total utility of changing with the total utility of not-changing, and you decide that you would rather continue to eat meat than have your lunch paid for.
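Concretely, the comparison being made might look something like the following sketch.  All the utility numbers and the fifty-year horizon are made-up placeholders, just to show the structure of the choice:

```python
# Hypothetical numbers for the lunch deal; the structure, not the values, is the point.

u_lunch_paid = 15.0        # one-time value of having lunch paid for
u_meat_per_year = 40.0     # yearly value the agent currently places on eating meat
years_remaining = 50

# Utility of self-modifying: you get the free lunch, but by your *current* values
# you forgo all future meat-eating utility, since the new you will be repulsed by it.
u_modify = u_lunch_paid

# Utility of declining: no free lunch, but you keep the meat-eating utility.
u_decline = u_meat_per_year * years_remaining

print("modify:", u_modify, "decline:", u_decline)
print("choose:", "modify" if u_modify > u_decline else "decline")
```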

This is an example of an algorithm-dependent problem, a more general type than causal problems or Newcomb-like problems.  Someone can see inside your brain to some extent and rewards you based not just on your actions or your decisions, but on your values or your patterns of thought.  Such problems are sort of interesting, but they seem unlikely to come up for humans, because it's difficult for us to modify our own algorithms and difficult for an external observer to verify that we've really done so.  AIs, on the other hand, can meet both of these requirements, and so might run into such problems on rare occasions.

The vegetarian/omnivore tradeoff seems fairly comprehensible because of its nice properties.  It's fairly time-symmetrical, since being a vegetarian is pretty much the same from day to day.  It's familiar to us, so we don't have to guess too much about what being a vegetarian is like.  And since we don't feel that wanting to eat animals is an inherently valuable belief, we can just evaluate the utility of the consequences.

What sort of modifications would be trickier?  Well, going back to that last point, do we have inherently valuable beliefs?  I'd argue that I do - I would not want to want to kill people even if God promised to keep an eye on me and stop me before I made any outward sign of trying.  But it's fairly simple to extend our utility function over our own brains.
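One way to read "extend our utility function over our own brains" is to let utility take the agent's own mind-state as an argument, not just the external outcome.  A minimal sketch of that idea, with invented names and penalty values:

```python
# Utility over (world outcome, own mind-state) pairs: some mind-states are
# disvalued directly, independent of their outward consequences.

def utility(world_outcome, mind):
    u = world_outcome["wellbeing"]
    if mind.get("wants_to_kill"):
        u -= 1000.0    # the desire itself is penalized, even if it never affects behavior
    return u

# God guarantees identical outward behavior, so the world outcome is the same,
# but the two mind-states still aren't valued equally:
outcome = {"wellbeing": 100.0}
print(utility(outcome, {"wants_to_kill": False}))   # 100.0
print(utility(outcome, {"wants_to_kill": True}))    # -900.0
```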

Worse is when you have multiple modifications.  If you modified to be a vegetarian, and then modified to enjoy skydiving, it wouldn't be a big deal - those are mostly independent.  But what if your mom wanted you to modify something that impacted your self-modification system?
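To see why a modification that touches the self-modification machinery is trickier than two independent ones, here is a toy sketch; the weights and the acceptance rule are invented for illustration:

```python
# The agent's current rule for accepting a modification: accept if it has
# positive value under the agent's current weights.

def accept(candidate_mod, value_weights):
    return sum(value_weights.get(k, 0.0) * v for k, v in candidate_mod.items()) > 0

weights = {"lunch": 1.0, "meat_enjoyment": 3.0, "thrill": 1.0}

vegetarian = {"lunch": 5.0, "meat_enjoyment": -2.0}   # free lunch, lose some enjoyment
skydiving  = {"thrill": 3.0}

# Independent modifications: accepting one doesn't change how the other is scored.
print(accept(vegetarian, weights))   # False: by current values, the deal isn't worth it
print(accept(skydiving, weights))    # True, and unaffected by the vegetarian question

# A modification that rewrites the weights changes the acceptance rule itself,
# so every later self-modification decision is made by a different agent:
weights_after = dict(weights, meat_enjoyment=0.0)
print(accept(vegetarian, weights_after))   # True: the same deal now looks good
```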

I think it's not too hard to take an algorithm-dependent problem, even with self-modification, and find the optimal classes of algorithms.  This is analogous to the decisions made by a decision-theory-following agent.  What would be nice is to have something analogous to a decision theory - a general procedure for choosing good algorithms, rather than working out each problem case by case.
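In small cases, "finding the optimal classes of algorithms" can be brute-forced: enumerate candidate algorithms, let the problem score each whole algorithm (since mom rewards values, not just actions), and keep the best.  A toy version of that search for the lunch problem, with made-up payoffs:

```python
# Candidate algorithms, described by the properties mom can inspect.
candidates = {
    "omnivore, refuses deal":   {"vegetarian_values": False, "takes_deal": False},
    "omnivore, fakes it":       {"vegetarian_values": False, "takes_deal": True},
    "self-modified vegetarian": {"vegetarian_values": True,  "takes_deal": True},
}

def lunch_problem_payoff(alg):
    """The problem scores the algorithm itself: mom reads your values, so faking fails."""
    payoff = 0.0
    if alg["takes_deal"] and alg["vegetarian_values"]:
        payoff += 15.0     # lunch actually gets paid for
    if alg["vegetarian_values"]:
        payoff -= 200.0    # lifetime cost, by current values, of giving up meat
    return payoff

best = max(candidates, key=lambda name: lunch_problem_payoff(candidates[name]))
print(best)   # "omnivore, refuses deal" under these placeholder numbers
```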
