Ali's Shortform

post by Ali (ali-merali) · 2024-03-04T02:41:59.728Z · LW · GW · 1 comments

Comments sorted by top scores.

comment by Ali (ali-merali) · 2024-03-04T02:41:59.862Z · LW(p) · GW(p)

I've only read a couple of pieces on corrigibility, but I'm curious why the following scheme wouldn't work; I haven't seen any work in this direction (from my admittedly very brief scan).

Suppose that at every time t some decision maker chooses a probability p(t) with which the AI is shut down. Further, suppose the AI's utility in period t was originally U(t). Now rescale the AI's utility function so that its period-t utility becomes U(t)/(1-p(t)). I think this can be quite easily generalized so that future periods of utility are likewise unaffected by an increase in the risk of shutdown: e.g. divide each future U(t) by the product of (1-p(s)) over all intervening periods s up to and including t.
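
A minimal sketch of that rescaling, assuming a discrete-time setup with additive per-period utilities, every p(t) strictly below 1, and the shutdown lottery applied at the start of each period (the function and variable names are purely illustrative):

```python
def expected_total_utility(U, p):
    """Expected total utility when the AI's period-t utility U(t) is rescaled
    by the inverse of its survival probability up to period t.

    U -- list of the original per-period utilities U(t)
    p -- list of shutdown probabilities p(t) chosen by the overseer,
         each strictly less than 1
    """
    survival = 1.0   # Pr[AI has not been shut down by the end of period t]
    total = 0.0
    for u_t, p_t in zip(U, p):
        survival *= 1.0 - p_t         # survive period t's shutdown lottery
        rescaled = u_t / survival     # U(t) / prod_{s<=t} (1 - p(s))
        total += survival * rescaled  # each period contributes exactly U(t)
    return total                      # = sum(U), whatever the schedule p is


# Quick check: two very different shutdown schedules give the same expectation.
U = [3.0, 1.0, 4.0, 1.0, 5.0]
assert abs(expected_total_utility(U, [0.01] * 5) - sum(U)) < 1e-9
assert abs(expected_total_utility(U, [0.5, 0.9, 0.99, 0.5, 0.5]) - sum(U)) < 1e-9
```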

In this world the AI should be indifferent to changes in p(t) (so long as p(t) can get arbitrarily close to 1 but never actually reaches it, since the rescaling divides by 1-p(t)), and so should take actions that maximize U while being indifferent to whether humans decide to shut it down.
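
Spelling the indifference claim out, under the same assumptions of additive per-period utilities and p(t) < 1 for every t, the expected sum of the rescaled utilities is

$$\sum_t \prod_{s \le t}\bigl(1 - p(s)\bigr) \cdot \frac{U(t)}{\prod_{s \le t}\bigl(1 - p(s)\bigr)} = \sum_t U(t),$$

which doesn't depend on the schedule p(t) at all, so (under these assumptions) raising or lowering its own shutdown probability gains the AI nothing in expectation.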