Doing Nothing Utility Function

post by k64 · 2024-09-26T22:05:18.821Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    2 quila
    1 ProgramCrafter

One of the questions I've heard asked is "how do you design a utility function that would make the AI do nothing?"  That is, how could we put a pause button on an AI so that we could pause it if we wanted to?  I had an idea about how one could do this, and am sure it has already been thought of, so I'm curious to know why this doesn't end up working.

 

Why can't we just write something like:

If (paused) and not (taking actions): utility=PredictedUtilityIfNotPaused

If not (paused): utility=NormalUtilityFunction
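As a rough Python sketch of the two rules above (the function and parameter names are placeholders for illustration, not an existing API, and the two utility sources are passed in as plain callables):

```python
from typing import Callable, Optional


def pausable_utility(
    paused: bool,
    taking_actions: bool,
    normal_utility: Callable[[], float],
    predicted_utility_if_not_paused: Callable[[], float],
) -> Optional[float]:
    """Return the utility under the two-branch rule, or None where the rule is silent."""
    if not paused:
        # Second rule: behave as usual while not paused.
        return normal_utility()
    if not taking_actions:
        # First rule: paused and idle is worth what the agent predicted
        # it would have gotten had it not been paused.
        return predicted_utility_if_not_paused()
    # Paused but still acting: the two rules as written do not cover this case.
    return None
```

Writing it out this way makes one gap visible: the rules say nothing about an agent that is paused yet keeps taking actions.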

Answers

answer by quila · 2024-09-26T22:52:13.570Z · LW(p) · GW(p)

"If not (paused): utility=NormalUtilityFunction"

If NormalUtilityFunction assigns a higher score to futures where its agent avoids being paused, then that has the usual problem of instrumental convergence avoiding shut down or utility function modification.

I'll try modifying your example to instead use UtilityFunction2, which assigns, to worlds where its agent ends up paused, a score equal to the average score assigned to {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused. This doesn't mean it wouldn't still instrumentally converge and start disassembling things[1], just that if you tried to pause it, it wouldn't have a direct incentive to specifically try to stop you, per se, because if you succeed it still assigns score equal to that of the average world in its prior where you didn't succeed. It has some interesting indirect incentives to avoid being paused (or pause itself), though, which I get into below.

Consider how this utility function would score the worlds in which the agent takes one action or the other: action (1), which has a 99% chance of leading to being paused, and results in an average score of 10 if not paused; and action (2), which has a 5% chance of leading to being paused, and results in an average score of 5 if not paused. In this case, UtilityFunction2 prefers the world where the agent takes action (1). (This alone does not lead to resisting pauses or self-pausing; this is just an observation for now).
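To make the arithmetic explicit, here is a minimal sketch. It assumes each action can be summarized by just a pause probability and an average not-paused score; the function names are made up for illustration, and the "paused worlds score 0" baseline is a hypothetical added only for contrast.

```python
def uf2_expected_score(p_paused: float, avg_if_not_paused: float) -> float:
    """Expected score when paused worlds inherit the not-paused average."""
    paused_score = avg_if_not_paused  # UtilityFunction2's rule for paused worlds
    return p_paused * paused_score + (1.0 - p_paused) * avg_if_not_paused


def zero_when_paused_expected_score(p_paused: float, avg_if_not_paused: float) -> float:
    """Hypothetical baseline for contrast: paused worlds are scored 0."""
    return (1.0 - p_paused) * avg_if_not_paused


# (probability of ending up paused, average score if not paused)
action_1 = (0.99, 10.0)
action_2 = (0.05, 5.0)

for name, (p, avg) in (("action (1)", action_1), ("action (2)", action_2)):
    print(
        name,
        round(uf2_expected_score(p, avg), 2),
        round(zero_when_paused_expected_score(p, avg), 2),
    )
```

Under UtilityFunction2 the pause probability drops out entirely, so action (1) scores about 10 against about 5 for action (2). Under the zero-when-paused baseline the ordering flips, because action (1) almost always ends in a pause.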

Next, consider how "the average score of {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused" does depend on the agent's own action policy. In short, the agent can commit to pausing itself in all but the highest-scoring world in its prior to create an extreme version of action (1). (Longer, original explanation in footnote[2])
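A toy version of that commitment, as a sketch only: it treats the prior as a fixed list of (probability, score) worlds, ignores pauses imposed from outside, and uses made-up numbers and names throughout.

```python
from typing import List, Tuple


def uf2_score_under_policy(
    worlds: List[Tuple[float, float]], self_pause: List[bool]
) -> float:
    """Expected UtilityFunction2 score given which worlds the agent pauses itself in."""
    unpaused = [(p, s) for (p, s), paused in zip(worlds, self_pause) if not paused]
    mass = sum(p for p, _ in unpaused)
    if mass == 0:
        raise ValueError("at least one world must remain unpaused")
    # Paused worlds are assigned the probability-weighted average of unpaused ones.
    avg_unpaused = sum(p * s for p, s in unpaused) / mass
    return sum(
        p * (avg_unpaused if paused else s)
        for (p, s), paused in zip(worlds, self_pause)
    )


# Illustrative prior: (probability, score) for three possible worlds.
worlds = [(0.5, 1.0), (0.3, 4.0), (0.2, 10.0)]

never_pause = [False, False, False]
pause_unless_best = [True, True, False]  # stay unpaused only in the top world

print(uf2_score_under_policy(worlds, never_pause))
print(uf2_score_under_policy(worlds, pause_unless_best))
```

With these numbers, never pausing scores about 3.7, while pausing in every world except the best one scores about 10, the maximum in the prior.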

Separately, I also just noticed that, as worded - "as [the prior] was just before updating on observing being paused" - there are yet more unexpected implications. For example: at time 0, the non-paused futures are generally not low-probability exceptional worlds; but one hour later, the agent will soon be paused, absent its action; and at the moment just before being paused, it is all but certain that it will be paused. So, in the prior at that final point, the worlds where it does not end up paused are exceptional ones, like simulation hypotheses where there's an intervention at the last moment, which of course may themselves have higher or lower scores than remaining unpaused (through resistance). One could try to avoid this by specifying "some number of updates before", e.g. "the prior as it was 10000 updates before updating on being paused", but this would create new complications (e.g., if that agent learns information which decreases the score it assigns to the world, it will pause itself if no outweighing positive update occurs in the next 9999 updates).

  1. ^

    If this utility function is otherwise a classical physical-thing-maximizer

  2. ^

    "Specifically, {the score of the average world where the agent does not end up paused} could be higher or lower conditional on whether the agent tries to resist being paused, (as resisting would add more worlds to the set of non-paused worlds, whose scores are averaged to return the score of any paused world). The same applies to pausing itself in case it observes a negative or not-as-positive worldstate, to have such worldstates instead be assigned the score of the remaining unpaused ones. (This does lead to self-pausing in all possible futures except the highest-scoring one)"


comment by k64 · 2024-09-27T01:20:50.439Z · LW(p) · GW(p)

Thank you for this answer - I really like it! I'm trying to wrap my head around the last 2 paragraphs.

2nd to last paragraph:
Ok, so you're saying that it could choose to self-pause unless it was in the highest-scoring world?  I'm conceptualizing a possible world as an (action,result) pair, from which it could calculate (action, E[result]) pairs and then would choose the action with the highest E[result], while being paused would also provide max(E[result]).  So are you saying it would limit the possible actions it would take?  That seems like it wouldn't change anything since it is always going to just take the one best action anyway.  Or that by setting a self-pausing policy it could alter E[result]?  That sounds possible to me but I don't have a concrete example of how that would work.  Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn't win?  Or do you have something else in mind?

Last paragraph:  
If just prior to being paused, there exists 1 scenario where it won't be paused, then it could be an average, low, or high utility scenario.  Obviously, average is fine.  And if it's really high, then it will get a lot of utility from being paused and certainly we're not worried about it self-pausing when surrounded by agents trying to pause it.  So, if it's a really low utility scenario where it won't end up being paused, then sure, it won't get much utility being paused, but since it won't get much utility if it doesn't end up being paused, why should it have a preference?  And, we could say - well, but it could fight back and then create a high-utility scenario - but then that would be the utility it would get if it doesn't end up paused, so it would get the high utility paused and again be indifferent.  

answer by ProgramCrafter · 2024-09-26T22:39:31.156Z · LW(p) · GW(p)

Your idea seems to break when the AI is being unpaused: as it has not done any beneficial actions, utility would suddenly go down from "simulated" to "normal", meaning that AI will likely resist waking it up.

Also, it assumes there is a separate module for making predictions, which cannot be manipulated by the agent. This assumption is not very probable in my view.

comment by quila · 2024-09-26T23:55:46.807Z · LW(p) · GW(p)

"that AI will likely resist waking it up."

If the AI is resisting being turned on, then it would have to be already on, by which point the updates (to the AI's prior, and score assigned to it) would have already happened.
