Comment by Andy (andy-1) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-09T05:28:25.940Z · LW · GW

So I've commented on this in other forums, but why can't we just bite the bullet on happiness-suffering min-maxing utilitarianism as the utility function?

The case for it is pretty straightforward: if we want a utility function that is defined continuously over all of time, then it must assign a value to each single moment in time. At any given moment, colloquially deontological concepts like "humans", "legal contracts", etc. have no meaning (they rely on an illusory continuity chaining together different moments in time). What IS atomic, though, is the valence of individual moments of qualia, i.e. happiness/suffering - that's not just a higher-order emergence.
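To make the aggregation explicit (a minimal sketch; the symbols are my own labels, not standard notation): write $v_i(t)$ for the valence of the $i$-th moment of qualia at time $t$, positive for happiness and negative for suffering, and take

$$U = \int_{t_0}^{\infty} \sum_{i \in Q(t)} v_i(t)\, dt$$

where $Q(t)$ is the set of qualia-moments that exist at time $t$. Nothing in this expression mentions persons or contracts; those could only enter as groupings over the index $i$.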

It's almost like the question "how do we get all these deontological intuitions to converge on a correct answer?" has the answer "you can't, because we should be using a more first-principles function".

Then reward hacking is irrelevant if the reward is 1-1 with the fundamental ordering principle.

Some questions then:

"What about the utility monster?" - 

If the worry is about producing a lot of suffering to make a lot of happiness, you can constrain suffering separately.

If the worry is about one entity soaking up the net-happiness calculation, then, if you really want to, you can constrain this out with some caveat on distribution (a sketch follows below). But if you look at the math carefully, the "utility monster" is only an issue if you insist on arbitrarily partitioning utility by some unit like "person".
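As a sketch of what "constrain suffering separately" and "caveat on distribution" could look like (the cap $S_{\max}$ and the weighting function $w$ are hypothetical knobs, not part of the original argument): with $v_i(t)$ as above, have the AI choose a policy $\pi$ to

$$\max_{\pi}\ \int \sum_i \max(v_i(t), 0)\, dt \quad \text{s.t.} \quad \int \sum_i \max(-v_i(t), 0)\, dt \le S_{\max},$$

and, if you do insist on partitioning by person $p$, replace the objective with $\sum_p w(U_p)$ for some concave $w$, which removes the payoff from concentrating all the utility in one monster. Drop the partition and the "monster" is just another batch of qualia-moments in the sum.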

"How do we define happiness/suffering?" - from a first person standpoint, contemplative traditions have solved this independently across the world quite convincingly (suffering = delusion of fundamental subject/object split of experience). From a third person standpoint, we're making lots of progress in mapping neurological correlates. In either case, if the AI has qualia it'll be very obvious to it; if it doesn't, it's still a very solvable question. 

--> BTW, the "figure it out to the best of your ability within X time using Y resources" framing wouldn't be as dangerous here: even if the AI converts the solar system into computation units to figure it out, that's OK so long as it then min-maxes for the rest of time, provided you've bitten the bullet.

"Others won't bite the bullet / I won't" - ok, then at the least we can make it a safety mechanism that gets rid of some edge cases like the high-suffering ones or even the extinction ones: "do your other utility function with constraint of suffering not exceeding and happiness not dipping below the values present in 2022, but without min-maxing on it".