Am I understanding the problem of fully updated deference correctly?

post by Davide_Zagami · 2018-09-30T10:55:58.965Z · LW · GW · 1 comment

I understand that one proposed solution to AI alignment is to build an agent that is uncertain about its utility function, so that by observing the environment, and in particular us, it can learn our true utility function and optimize for that. According to the problem of fully updated deference, trying to accomplish this would not significantly simplify our work, because it involves two steps (a toy sketch of step 1 follows the list):

1) Learning our true utility function (easy step)

2) Actually optimizing for it (hard step)
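
To make step 1 concrete, here is a minimal sketch, assuming a toy setting where the agent holds a prior over a handful of hypothetical candidate utility functions and updates on observed human choices. All names and numbers (CANDIDATES, the Boltzmann-rational choice model, beta) are illustrative assumptions, not any real library or anyone's actual proposal:

```python
import math

# Hypothetical candidate utility functions over three outcomes.
CANDIDATES = {
    "U_paperclips": {"paperclips": 1.0, "art": 0.0, "rest": 0.0},
    "U_art":        {"paperclips": 0.0, "art": 1.0, "rest": 0.2},
    "U_balanced":   {"paperclips": 0.3, "art": 0.4, "rest": 0.3},
}

# Uniform prior over the candidates.
belief = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}

def update(belief, chosen, rejected, beta=2.0):
    """Bayesian update, assuming the human picks the higher-utility
    option with probability rising in the utility gap (a Boltzmann
    rationality model, a common but simplifying assumption)."""
    posterior = {}
    for name, u in CANDIDATES.items():
        # Likelihood of the observed choice under candidate u.
        p_choice = math.exp(beta * u[chosen]) / (
            math.exp(beta * u[chosen]) + math.exp(beta * u[rejected])
        )
        posterior[name] = belief[name] * p_choice
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Observing a human choose art over paperclips shifts belief toward U_art.
belief = update(belief, chosen="art", rejected="paperclips")
print(belief)
```

Repeating this update over many observations concentrates the posterior on one candidate, which is exactly the "fully updated" state that the hard step then has to act on.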

1 comment

Comments sorted by top scores.

comment by cousin_it · 2018-10-02T09:33:48.990Z · LW(p) · GW(p)

The problem is more about corrigibility: making an AI that will shut down if you ask it to. The idea is that uncertainty about the utility function isn't a good way to implement corrigibility. When you tell the AI to shut down, expecting it to defer because you know more about the true utility function than it does, it might reply "nah, I won't shut down, I've learned enough to optimize the true utility now".
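
As a toy illustration of that failure mode (a sketch with assumed numbers, continuing the candidate set from the post above): once the agent's posterior is confident, a shutdown command carries no information it hasn't already absorbed, so plain expected-utility maximization favors refusing.

```python
def expected_utility(belief, utilities):
    """Expected utility of an action, averaged over candidate
    utility functions weighted by the agent's posterior."""
    return sum(p * utilities[name] for name, p in belief.items())

# Posterior after many observations: highly confident in U_art.
belief = {"U_paperclips": 0.01, "U_art": 0.97, "U_balanced": 0.02}

# Hypothetical utilities each candidate assigns to each action.
u_shutdown = {"U_paperclips": 0.0, "U_art": 0.0, "U_balanced": 0.0}
u_continue = {"U_paperclips": -1.0, "U_art": 5.0, "U_balanced": 1.0}

eu_shutdown = expected_utility(belief, u_shutdown)
eu_continue = expected_utility(belief, u_continue)

# Continuing wins (4.86 vs 0.0), so the agent refuses the shutdown order.
print("shut down" if eu_shutdown > eu_continue else "keep optimizing")
```

Nothing in this calculation gives the human's instruction any special authority; that is the gap corrigibility proposals are trying to fill.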