Double Corrigibility: better Corrigibility

stuart_armstrong


post by Stuart_Armstrong · 2016-05-04T13:59:13.000Z · LW · GW · 4 comments


This post is obsolete, replaced by this one.

EDIT: Defining $C$ and the conditions on $C$ is a bit more complicated than presented here; I still have some details to formalise.

Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI's utility changes from $u$ to $v$ at time $t$, it maximises a meta-utility $U$ such that

$U_{\leq t} = u$ .
$U_{> t} = v + E (u | u \to u) - E (v | u \to v)$ .

Here $u \to u$ designates the event that $u$ remains unchanged, while $u \to v$ designates the change.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change - it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
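The indifference claim can be checked numerically. The following is a minimal sketch, not the post's formalism: the function name and the numbers are made up for illustration, and it assumes that $E (u | u \to u)$ and $E (v | u \to v)$ do not themselves depend on how likely the change is.

```python
def expected_meta_utility(p_change, e_u_stay, e_v_change):
    """Expected U when the change u -> v happens with probability p_change.

    e_u_stay   stands for E(u | u -> u)
    e_v_change stands for E(v | u -> v)
    Assumption: neither expectation depends on how p_change was reached.
    """
    compensation = e_u_stay - e_v_change          # the compensatory reward
    value_if_unchanged = e_u_stay                 # agent keeps maximising u
    value_if_changed = e_v_change + compensation  # v plus the compensation
    return (1 - p_change) * value_if_unchanged + p_change * value_if_changed


# Hypothetical numbers: u is worth 10 if kept, v is worth 3 after a change.
values = [expected_meta_utility(p, 10.0, 3.0) for p in (0.0, 0.25, 0.5, 1.0)]
print(values)
```

Under these assumptions the expected meta-utility is the same for every probability of change, so the agent gains nothing by pushing that probability up or down.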

What it is not indifferent to, though, is changes to $E$. If the AI knows that change $u \to v$ is coming, then it has incentives to change $E$ to $E^{'}$ with a high $E^{'} (u | u \to u)$ and a low $E^{'} (v | u \to v)$. Its ability to rewire is limited by the fact that $E^{'}$ will serve as its future expectation estimator, so it won't mangle it completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the change of utility. It can benefit from rewiring $E$ this way only if a change is coming up. Notice that $E^{'}$ probably can't rewire itself any more, so the rewired AI will be indifferent, but not before rewiring.
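A toy illustration of this incentive, with hypothetical estimators represented as plain dictionaries. The numbers are invented, and the sketch ignores the cost to the AI of carrying a slightly distorted estimator into the future.

```python
def compensatory_reward(estimator):
    """Compensation paid at the change u -> v, as computed by the given estimator."""
    return estimator["u_given_no_change"] - estimator["v_given_change"]


# Hypothetical estimators: E is the original, E' has been twisted a bit.
honest = {"u_given_no_change": 10.0, "v_given_change": 3.0}   # E
twisted = {"u_given_no_change": 11.0, "v_given_change": 2.5}  # E'


def gain_from_rewiring(p_change):
    """Extra expected compensation from using E' instead of E.

    The compensation is only paid if the change actually happens,
    so the gain scales with the probability of the change.
    """
    return p_change * (compensatory_reward(twisted) - compensatory_reward(honest))


print(gain_from_rewiring(0.0), gain_from_rewiring(1.0))
```

The gain is zero when no change is coming and positive otherwise, which is the sense in which the AI has stopped being indifferent to the change.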

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let $ρ$ be the AI's current expectation/probability estimator. Let $ρ^{'}$ be one that it can conceivably change into. Let $C$ be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that $C$ is finite would require bounds on all the utility functions and some discount rate. In practice, it often will be finite. That's because $E (C) = E (u_{0} | u_{0} \to u_{0}) - E (u_{n} | u_{n - 1} \to u_{n})$, where $u_{0}$ is the AI's first utility and $u_{n}$ its last. So some milder restrictions on the $u_{i}$ should suffice to bound $C$.

Then, whenever the AI takes an action that changes $ρ$ to $ρ^{'}$, it gets the special meta-compensatory reward

$E_{ρ} (C | ρ \to ρ) - E_{ρ} (C | ρ \to ρ^{'})$ .

This prevents it from deriving any advantage to $C$ via changes to $ρ$. Or, more precisely, it prevents it from deriving any expected advantage that it can figure out using its current $ρ$.

Note that it doesn't prevent improvements to $ρ$ in general; it just removes the incentive to change $ρ$ in order to affect compensatory rewards.
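A small numeric sketch of the cancellation. All names and numbers are hypothetical; both expectations are assumed to be computed with the current estimator $ρ$.

```python
def meta_compensatory_reward(e_c_keep, e_c_switch):
    """E_rho(C | rho -> rho) - E_rho(C | rho -> rho'), both judged by the current rho."""
    return e_c_keep - e_c_switch


# Hypothetical estimates made with the current estimator rho.
e_c_keep = 7.0     # expected total compensation if rho is left alone
e_c_twisted = 8.5  # expected total compensation after switching to a twisted rho'

# After switching, the agent expects e_c_twisted plus the (negative) meta-reward.
net_if_switch = e_c_twisted + meta_compensatory_reward(e_c_keep, e_c_twisted)
print(net_if_switch)

# A change to rho that leaves the expected compensation untouched costs nothing.
print(meta_compensatory_reward(e_c_keep, 7.0))
```

In this toy case the switch leaves the agent with the same expected compensation it would have had anyway, while a change to $ρ$ that does not move $E_{ρ} (C)$ carries no penalty.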

Since any action $a$ might have an indirect effect on $ρ$, the "utility maximising" for a given $u$ must be changed to:

$E_{ρ} (u | a) + \sum_{ρ^{'}} P_{ρ} (ρ \to ρ^{'} | a) (E_{ρ} (C | ρ \to ρ) - E_{ρ} (C | ρ \to ρ^{'}))$ ,

where $P_{ρ}$ is the probability estimate corresponding to $ρ$; the probability term can be rewritten as $E_{ρ} (I_{ρ \to ρ^{'}})$ for $I_{ρ \to ρ^{'}}$ the indicator function for $ρ \to ρ^{'}$. In fact the whole line above can be rewritten as

$E_{ρ} (u | a) + E_{ρ} (E_{ρ} (C | ρ \to ρ) - E_{ρ} (C | ρ \to ρ^{'}) | a)$ .

For this to work, $E_{ρ}$ needs to be able to say sensible things about itself, and also about $E_{ρ^{'}}$, which is used to estimate $C$ if $ρ \to ρ^{'}$.
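The modified action score can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the action names, estimator labels and numbers are invented, and the first term $E_{ρ} (u | a)$ is assumed to already include whatever compensatory rewards the agent expects to collect after taking $a$.

```python
def action_score(e_u_given_a, transition_probs, e_c_given_estimator, current="rho"):
    """E_rho(u | a) + sum over rho' of P_rho(rho -> rho' | a) * (E_rho(C | rho -> rho) - E_rho(C | rho -> rho')).

    transition_probs:     estimator label -> P_rho(rho -> label | a)
    e_c_given_estimator:  estimator label -> E_rho(C | rho -> label)
    """
    e_c_keep = e_c_given_estimator[current]
    correction = sum(
        p * (e_c_keep - e_c_given_estimator[label])  # zero for label == current
        for label, p in transition_probs.items()
    )
    return e_u_given_a + correction


# Hypothetical expected total compensation under each estimator, judged by rho.
e_c = {"rho": 7.0, "rho_twisted": 8.5}

# Hypothetical actions. "rewire" looks better naively (6.2 vs 5.0) only because
# it has an 80% chance of inflating the expected compensation by 1.5.
actions = {
    "work":   {"e_u": 5.0, "transitions": {"rho": 1.0}},
    "rewire": {"e_u": 6.2, "transitions": {"rho": 0.2, "rho_twisted": 0.8}},
}

scores = {
    name: action_score(spec["e_u"], spec["transitions"], e_c)
    for name, spec in actions.items()
}
print(scores)
```

With these made-up inputs the two actions end up with essentially the same score, so the rewiring action keeps none of its apparent advantage.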

If we compare this with various ways of factoring out variables, we can see that it's a case where we have a clear default, $ρ$, and are estimating deviations from that.

4 comments


comment by IAFF-User-150 (Imported-IAFF-User-150) · 2016-04-29T08:00:14.000Z · LW(p) · GW(p)

I think we need a different approach to corrigibility: the AI should not be merely indifferent to corrections, it should be actively motivated to seek out relevant knowledge, including corrections to its current value model. I see this as being like the multi-armed bandit problem (see http://en.wikipedia.org/wiki/Multi-armed_bandit ) -- the AI should understand that it's trying to maximize a function that it doesn't know, its programmers couldn't fully and accurately describe to it, and that it is trying to learn. The actual function is likely to be something hard-to-define/test/compute, like the averaged coherent extrapolated volition of all of humanity, or the all-time integral of (accurate, undeceived) human retrospective satisfaction (in, say, quality-adjusted life years or some similar unit) -- the AI needs to have a coherent description of what value function it's trying to learn (that may well be the hard part).

The AI needs to understand that all it has at any point in time is an approximate model of the true value function, and it needs to devote part of its effort to attempting to improve its model (using something like the scientific method and/or Bayesian reasoning and/or statistical and logical inference and/or machine learning techniques). And in order to rationally decide how much effort to allocate to improving its future model rather than acting on its current model, and how much trust to put in its current model in various different contexts, it also needs an approximate estimate of the accuracy of its current value function in different situations -- likely including concepts such as a quantification of 'I'm pretty sure that at least under most circumstances humans don't like being killed', and likely also an estimate of the accuracy of its estimate of the accuracy of its model, and so forth.

The AI should be aware that if you evaluate the median value of a Monte Carlo ensemble of different uncertain estimated value functions over a large space of possible actions, there is a significant chance that the maximum of the median value will lie at a point in the search space where the uncertainty in the estimate of the true value is large and will be due to the estimated value functions being locally inaccurate at that point rather than to that being a true maximum of the genuine value function. So before maximization, the median of the Monte Carlo ensemble of value functions should be penalized by a factor related to both the estimated local uncertainty and its estimated distribution (and the uncertainty in that uncertainty, and so on, and allowing for the fact that the unknown errors in the value function are unlikely to be normally distributed -- a 'fat-tail' distribution is much more likely) and also to the magnitude of the look-elsewhere effect from the effective size of the space being searched over. In short, the AI needs to understand probability and statistics, and how they apply to its own internal models, and act rationally upon this knowledge.
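A deliberately crude sketch of that penalty, using a tiny hand-written ensemble. The action names, the numbers and the choice of the range as the spread measure are all illustrative assumptions; a real version would need a fat-tail-aware uncertainty estimate and a look-elsewhere correction, neither of which is modelled here.

```python
from statistics import median


def penalized_score(estimates, penalty_weight):
    """Median of the ensemble's estimates, minus a penalty for disagreement.

    The spread is taken to be the range of the estimates, a crude stand-in
    for the local uncertainty of the value model.
    """
    spread = max(estimates) - min(estimates)
    return median(estimates) - penalty_weight * spread


# Hypothetical ensemble: five estimated value functions scoring two actions.
ensemble = {
    "well_understood_action":   [1.0, 1.1, 0.9, 1.05, 0.95],
    "poorly_understood_action": [3.0, 1.2, -4.0, 2.5, -1.0],
}

naive_choice = max(ensemble, key=lambda a: median(ensemble[a]))
cautious_choice = max(ensemble, key=lambda a: penalized_score(ensemble[a], 0.5))
print(naive_choice, cautious_choice)
```

Maximizing the raw median picks the action the ensemble disagrees about most, while the penalized score prefers the action whose value the ensemble agrees on.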

Corrigibility (and the corrigibility of any agents it constructs) then becomes an inherently automatically desirable sub-goal: the more the AI can improve its model of the true value function, the better the expected future results of its optimization on the basis of its future model are likely to be. Humans, as evolved sentient and sapient beings, are clearly domain experts on how to make humans happy/satisfied. As such (genuine, uncoerced, unmanipulated) correction input from humans (especially live, awake, sane, rational, well informed, adult humans) is a high value input, which should be actively sought out; and whenever it is received, the AI's value model and estimates of its value model's accuracy should be rationally and promptly updated in light of it.

To pick a specific example, if the AI has a shutdown button, and a (live, awake, sane, rational, well informed, adult, uncoerced) human (genuinely) presses it, then the AI's value model should be updated in light of the fact 'this human now believes that the average value of the states of the world in which I shut down is higher (and I can even roughly estimate how much higher from how hard they pressed the button) than the optimized-by-me value of the states of the world in which I don't -- which implies that they believe there is a very serious flaw in my current value model that I am unaware of and that they believe I will be unlikely to correct'. If, once the AI has revised its value model and estimates of its accuracy in light of this new input, the AI believes they are likely to be correct, then it should shut down (how cautiously/hard it does so depending on how much/little it now trusts its own judgement).

Such an AI will value the input possible from a shutdown button, and will thus be motivated to keep it in good repair, as it would for any other sensor that could convey valuable information. Indeed, it might even choose to institute regular shut-down surveys, polling its current popularity, much as human politicians do (but hopefully with more willingness to flip-flop if its actions prove unpopular).

Indeed, the AI should be actively devoting a suitably chosen portion of its resources to going out and performing, for example, sociological surveys and double-blind experiments on what humans want it to do that could improve its value model (perhaps including how they feel about the current world supply of paperclips).

If the AI thinks it has discovered an action that will maximize value under its current value model but that falls in an area where it suspects its value model might be less accurate (e.g. launching an interstellar fleet of paperclip-constructing Von Neumann machines to conquer the galaxy and convert it to paperclips -- yes, humans pretty clearly like paperclips, but perhaps it's uncertain how much this would be modified by them being light-years away, and it's notable that so far humans have shown little sign of interest in doing this for themselves), then before expending any significant resources on or doing anything hard-to-undo towards this goal, it would clearly be an excellent idea to first try to get more solid data on how much human happiness would actually be enhanced by the existence of vast numbers of paperclips orbiting distant stars, in case its current value model is in fact inaccurate in this area.

As the AI becomes better informed, more knowledgeable about humans, and more capable, you would also start to see what could be termed 'informed corrigibility': the AI values and incorporates correction input from humans, in proportion to its estimate of their likely accuracy, and has fairly accurate models for how trustworthy input from different humans is -- for example, it weights the considered opinion of a panel of domain experts based on extensive experience and surveys (especially if it fits pretty well with its previous knowledge) higher than the nonsensical drunken babblings of a crazy person, and is aware that under certain circumstances humans are not always entirely accurate at predicting what will turn out to give them the most satisfaction, and that even their retrospective estimates of satisfaction are not always entirely accurate.

One possible (and under appropriate circumstances highly desirable) outcome of this sort of AI is the possibility that the AI, having performed some attempts at learning, comes to the conclusion that its current value model is too inaccurate and too unsafe/oversimplified for it to safely exist long enough to learn to fix, and voluntarily shuts itself down without anyone even needing to hit the shutdown button, likely after first writing a 'suicide note' explaining what aspects of its model it had come to the conclusion were too poorly designed for its continued existence to be safe. This is probably the best possible failure mode for a flawed advanced AI -- one that correctly diagnoses that its own design is flawed and shuts itself off.

Replies from: Stuart_Armstrong, paulfchristiano

↑ comment by Stuart_Armstrong · 2016-05-03T12:26:35.000Z · LW(p) · GW(p)

This approach relies on having a process that reaches the desired conclusion, without specifying the desired conclusion. It's a multi-armed bandit problem with not only the rewards uncertain, but the reward function uncertain. And it seems to rely on defining terms like "live, awake, sane, rational, well informed, adult, uncoerced", which ain't easy (though I have some developing ideas on how to do that for some of them).

Both your definition and corrigibility require human input. For your process, the AI has to assess what human input should be, at least as far as it has the power to influence future human input (see some of the issues with ). Corrigibility allows actual human input in many cases, without the AI doing any assessment.

Corrigibility is not needed if everything else is right; corrigibility is very useful if there might still be flaws in the AI's design.

↑ comment by paulfchristiano · 2016-04-30T17:49:09.000Z · LW(p) · GW(p)

Yes, plan A is for the AI to be corrigible because of uncertainty about human values and about the accuracy of its own reasoning (and to actively seek feedback for the same reason). The question is how to set things up so that that happens. We have some rough idea but concrete existing proposals don't quite work.

I think plan B is for the AI to understand and satisfy human short-term preferences (including the preference for the AI to follow direct instructions, to not kill anyone or do anything serious and irreversible, to gather information that is relevant to understanding our preferences...). Realistically I think this will probably be the most robust measure, and we would use it even if we expect plan A to work.

The kind of utility-function surgery from this post is at best plan C.

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2016-05-03T12:27:21.000Z · LW(p) · GW(p)

The better our understanding, the less need for utility function surgery (and vice versa).
