
comment by Stuart_Armstrong · 2016-08-08T16:52:50.000Z · LW(p) · GW(p)

I don't think the various L_i change anything mathematically. Combining the updates on the L_i with the updates, for each L_i, on the H_i seems to just produce a single update rule for the H_i, given the data.
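
To spell out the collapse I mean (a sketch in my own notation, writing d for the data and assuming the L_j are hypotheses about how the data bear on the H_i): marginalizing over the L_j gives

$$P(H_i \mid d) = \sum_j P(H_i \mid L_j, d)\, P(L_j \mid d),$$

which is a single posterior over the H_i as a function of d, with the L_j playing no independent role.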

And the solution to conservation of expected evidence seems to be (though I'm not totally sure about this) just another version of factoring out variables. I'm not sure which version it is, but if I had to guess, it would be something like taking the default action to be a random choice of action.
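
If I had to write that guess down (purely my own sketch, with \pi_0 a fixed default policy such as a uniform distribution over actions), it would be something like interpreting the observation o as if the action had been drawn from \pi_0:

$$P(H_i \mid o) \propto P(H_i)\, \mathbb{E}_{a \sim \pi_0}\!\left[P(o \mid H_i, a)\right],$$

so that the interpretation of the evidence doesn't depend on which action was actually chosen.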

Also, I think it would be best to prevent the AI from taking us apart to make us into morality sensors by impressing the value of not-taking-us-apartness early in the learning process, rather than by making the AI unable to update on valid information (remember that if there is no penalty for taking us apart, the AI may do so simply to avoid learning something it doesn't want to learn).

Finally, Dylan has a version of "uncertainty causing an AI not to press the shutdown button nor destroy it", which seems to address the issue better.

So, mathematically, I don't think this approach offers anything new. Conceptually, though, it might. I've advocated for explicitly including human meta-preferences in any form of value learning, and this might be a way to start doing so. If the update rules for the L_i are designed to be different from those of the H_i, and if they are designed with care to mimic some aspects of human meta-preference updates, then this approach might have success, in practice if not in principle.

Replies from: cocoa
comment by michaelcohen (cocoa) · 2016-09-18T17:46:51.000Z · LW(p) · GW(p)

Thanks for the thoughts. I guess I need to do a lot more looking into CIRL before I come back to this. I do still wonder (although this is at an unformalized level) whether an agent could potentially learn a lot about moral evidence from the constraint that its own actions can't cause the expected evidence to change. For example, if it realizes that a certain action (like subtle coercion) would result in something that it would have thought was legitimate evidence, then that situation must not actually count as evidence at all. That constraint seems to pack a decent minority of our requirements for value learning into a relatively simple statement. There may be other ways to encode such a constraint besides having an agent be uncertain about its function for determining what observations provide what evidence, though.
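
Here is a toy sketch of the kind of check I have in mind (the two hypotheses, the action names, the numbers, and the naive_interpretation rule are all invented for illustration):

```python
# Toy check of "the agent's actions must not change the expected evidence".
# Everything here is an invented illustration, not a formalism from the post.

H = ["h1", "h2"]                    # candidate value hypotheses
PRIOR = {"h1": 0.5, "h2": 0.5}
ACTIONS = ["ask", "coerce"]
OBS = ["yes", "no"]

def true_obs_model(obs, h, action):
    """P(observation | hypothesis, action): under coercion the human says
    'yes' no matter what they actually value."""
    table = {
        ("ask", "h1"): {"yes": 0.80, "no": 0.20},
        ("ask", "h2"): {"yes": 0.20, "no": 0.80},
        ("coerce", "h1"): {"yes": 0.95, "no": 0.05},
        ("coerce", "h2"): {"yes": 0.95, "no": 0.05},
    }
    return table[(action, h)][obs]

def naive_interpretation(obs, h, action):
    """A candidate evidence-interpretation rule that treats the answer as
    value evidence regardless of how it was elicited."""
    table = {"h1": {"yes": 0.80, "no": 0.20}, "h2": {"yes": 0.20, "no": 0.80}}
    return table[h][obs]

def posterior(obs, action, interpret):
    """Bayes-update the prior using a candidate interpretation rule."""
    unnorm = {h: PRIOR[h] * interpret(obs, h, action) for h in H}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def expected_posterior(action, interpret):
    """Average the posterior over the observations the action actually produces."""
    expected = {h: 0.0 for h in H}
    for obs in OBS:
        p_obs = sum(PRIOR[h] * true_obs_model(obs, h, action) for h in H)
        post = posterior(obs, action, interpret)
        for h in H:
            expected[h] += p_obs * post[h]
    return expected

for action in ACTIONS:
    shift = max(abs(expected_posterior(action, naive_interpretation)[h] - PRIOR[h])
                for h in H)
    print(action, "expected shift in beliefs:", round(shift, 3))
# "ask" leaves the expected posterior at the prior (shift 0.0); "coerce"
# predictably shifts it (about 0.27), so under that interpretation rule the
# coerced answer can't legitimately count as evidence.
```

An agent that notices the violation in the "coerce" row can conclude that the naive rule is not a valid interpretation function for that situation, which is the sort of inference I was gesturing at.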