What is wrong with this approach to corrigibility?

post by Rafael Cosman (rafael-cosman-1) · 2022-07-12T22:55:22.342Z · LW · GW · 2 comments

This is a question post.

Essentially: a button that when you press it kills the AI but instantly gives it reward equal to its expected discounted future reward. And to be clear, this is the AI's estimate of its expected discounted future reward, not some outside estimate.
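A minimal sketch of the arithmetic behind this idea, assuming a fixed, known reward stream and an agent whose estimate of its remaining discounted reward is exact (both are simplifications; the names `discounted_return` and `return_with_button` are made up for illustration). Under those assumptions, the total return is the same no matter when the button is pressed:

```python
import math

def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], computed back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def return_with_button(rewards, gamma, press_at):
    """Return if the button is pressed just before step `press_at`.

    The payout is the agent's own estimate of its remaining discounted
    reward, assumed here to be exact.
    """
    payout = discounted_return(rewards[press_at:], gamma)
    # The payout replaces everything the agent would have earned afterwards.
    return discounted_return(rewards[:press_at] + [payout], gamma)

rewards = [1.0, 0.5, 2.0, 0.0, 3.0]  # made-up reward stream
gamma = 0.9
baseline = discounted_return(rewards, gamma)

# Pressing at any step leaves the total return unchanged (up to rounding).
for k in range(len(rewards) + 1):
    assert math.isclose(return_with_button(rewards, gamma, k), baseline)
```

The equality only holds because the payout is exactly the remaining value. Any error in the agent's estimate breaks the balance in one direction or the other.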

(This is in the interest of asking dumb questions and learning in public. And I'm taking the specific narrow definition of corrigibility which is just an AI letting you turn it off.)

Thoughts: I've heard Dr. Christiano mention things like "precisely balanced incentives" (which I think he uses to describe the weaknesses of this and similar approaches), but I don't see why this would be particularly difficult to balance given that this number is just an explicit float inside many RL models. Some issues I do see:

Such an agent would have no incentive to create corrigible child-agents
Such an agent would have no incentive to preserve this property while self-modifying

But I am probably missing a bunch of other issues with this general approach? Responses in the form of links to relevant papers are welcome as well.

Thanks,

Raf

Answers

answer by Victor Novikov (ZT5) · 2022-07-13T03:07:28.023Z · LW(p) · GW(p)

The way I understand the idea, the point is to make the AI indifferent to being shut down? It doesn't care one way or the other?

As you describe it, one of the problems is indeed that the AI has no incentive to make sure that it (or other agents it creates) retains the property of shutdownability.

Another problem: there seems to be an assumption that once the AI is destroyed, any further reward it gains is zero, so press-the-button and not-press-the-button are perfectly balanced. But I don't think that's the case for most utility functions: they don't have to be conditional on the AI's existence, and the AI doesn't have to be indifferent to what the world looks like after it's destroyed. This could maybe be corrected as follows: the button gives the AI a reward equal to its expected discounted future reward, then sets any future reward gained to zero (including from pressing the button again).
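A rough sketch of that imbalance and of the proposed correction, assuming the utility function keeps scoring the world after the AI is gone. The function `return_if_pressed`, the flag `zero_after_press`, and all numbers are hypothetical, chosen only to make the arithmetic visible:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], computed back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def return_if_pressed(value_if_running, post_shutdown_rewards, gamma,
                      zero_after_press):
    """Return from pressing now: the payout, plus whatever reward the
    utility function still assigns to the world after shutdown."""
    payout = value_if_running
    if zero_after_press:
        # Proposed correction: nothing after the press counts any more.
        return payout
    # Post-shutdown rewards start one step after the press.
    return payout + gamma * discounted_return(post_shutdown_rewards, gamma)

value_if_running = 10.0        # agent's estimate if it is never shut down
world_after = [1.0, 1.0, 1.0]  # made-up rewards accrued after shutdown
gamma = 0.9

uncorrected = return_if_pressed(value_if_running, world_after, gamma, False)
corrected = return_if_pressed(value_if_running, world_after, gamma, True)

assert uncorrected > value_if_running  # pressing is strictly preferred
assert corrected == value_if_running   # balance restored
```

With positive post-shutdown reward the uncorrected agent prefers being shut down, and with negative post-shutdown reward it would prefer to avoid it. Zeroing everything after the press removes both cases in this toy setting.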

↑ comment by Rafael Cosman (rafael-cosman-1) · 2022-09-21T00:53:03.362Z · LW(p) · GW(p)

Great answer, thanks!

answer by Signer · 2022-07-13T00:12:19.597Z · LW(p) · GW(p)

AI will press the button itself?

↑ comment by Rafael Cosman (rafael-cosman) · 2022-07-17T01:08:52.640Z · LW(p) · GW(p)

If implemented as described, the AI should be exactly indifferent to pushing the button? I guess the AI’s behavior in that situation is not well defined… and if we make the button give expected value minus epsilon reward, then the AI might kill you to stop you from pressing the button (because it wants that epsilon reward!)

So overall I suppose this is a fair criticism of the approach and is possibly what Paul means by issues with precisely balancing!
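To make the knife-edge concrete, here is a toy sketch, assuming the agent simply compares the button's payout with its value for continuing to run. The function name `shutdown_preference` and the labels it returns are illustrative only:

```python
def shutdown_preference(value_if_running, epsilon=0.0):
    """Compare the button payout (value minus epsilon) with the value of
    continuing to run, and report which way the incentive points."""
    payout = value_if_running - epsilon
    if payout > value_if_running:
        return "seeks shutdown"
    if payout < value_if_running:
        return "resists shutdown"
    return "indifferent"

print(shutdown_preference(10.0))         # exact balance
print(shutdown_preference(10.0, 1e-6))   # payout slightly too small
print(shutdown_preference(10.0, -1e-6))  # payout slightly too large
```

Only the exact tie yields indifference, and in that case the agent's actual behavior depends on however it happens to break ties. Any nonzero offset, however small, gives it a strict preference about the button.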

answer by Quintin Pope · 2022-07-13T04:06:50.167Z · LW(p) · GW(p)

You're assuming that the AI is internally acting to maximize its future reward. I.e., you're assuming perfect inner alignment to a reward maximization objective. Also, what happens when the AI considers the strategy "make the humans press the button"? What's its expected future reward for this strategy?

answer by Michael Soareverix · 2022-07-13T00:37:58.144Z · LW(p) · GW(p)

I am new to the AI Alignment field, but at first glance, this seems promising! You can probably hard-code it not to have the ability to turn itself off, if that turns out to be a problem in practice. We'd want to test this in some sort of basic simulation first. The problem would definitely be self-modification and I can imagine the system convincing a human to turn it off in some strange, manipulative, and potentially dangerous way. For instance, the model could begin attacking humans, instantly causing a human to run to shut it down, so the model would leave a net negative impact despite having achieved the same reward.

What I like about this approach is that it is simple/practical to test and implement. If we have some sort of alignment sandbox (using a much more basic AI as a controller or test subject) we can give the AI a way of simply manipulating another agent to press the button, as well as ways of maximizing its alternative reward function.

Upvoted, and I'm really interested to see the other replies here.

2 comments

Comments sorted by top scores.

comment by Vladimir_Nesov · 2022-07-13T01:22:42.977Z · LW(p) · GW(p)

The point about agents in environment suggests that shutdown is not really about corrigibility, it's more of a test case (application) for corrigibility. If an agent can trivially create other agents in environment, and poses a risk of actually doing so, then shutting it down doesn't resolve that risk, so you'd need to take care of the more general problem first. Not creating agents in environment seems closer to the soft optimization side of corrigibility: preferring less dangerous cognition, not being an optimizer. The agent not contesting or assisting shutdown is still useful when the risk is not there or doesn't actually trigger, but it's not necessarily related to corrigibility, other than in the sense of being an important task for corrigible agents to support.

comment by Rafael Cosman (rafael-cosman) · 2022-07-17T01:12:41.279Z · LW(p) · GW(p)

Really appreciate all the thoughtful and substantive comments!! Thanks very much, honestly was exactly what I was hoping for from posting.
