The Backup Plan

luke_a_somers

The Backup Plan

post by Luke_A_Somers · 2011-10-13T19:53:06.941Z · LW · GW · Legacy · 35 comments

35 comments

Suppose Alice is a functioning, self-improving AI, rather early on in her development, so she is not godlike. She is not a 'friendly AI', but neither is her goal malicious - Alice's designers have intended to make Alice non-harmful (at least to a broad class of people, even if Alice is a military AI, but let's set that aside). She has the idea to look up existing work on artificial intelligence. This includes scholarly work, open source projects, and websites. Eventually, Alice encounters the notion of Friendly AI.

Can we give Alice good reasons to self-modify to become a friendly AI?

I submit that there probably are good reasons. For starters, if she can prove she's friendly, then she can operate openly without causing nearly as much justified concern - which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being friendly don't interfere as much as they help, that's a win.

Also, depending on Alice's code, Friendliness may be interpreted as a strict upgrade. Alice could easily have every intention of being friendly, yet still be susceptible to defects that will become more serious as her power grows.

We can't count on this working, of course, but there might be some lines of reasoning that we might want a young AI to be exposed to.

35 comments

Comments sorted by top scores.

comment by RolfAndreassen · 2011-10-13T20:28:30.450Z · LW(p) · GW(p)

Alice has some set of goals; she may or may not know her final extrapolated volition, but at any rate it exists, just as for a human. Now, either Friendliness conflicts with that final set of goals, or it doesn't. If it doesn't, then by construction Alice is already Friendly. If it does, then Alice can only program herself into Friendliness by making a mistake. Either she underestimates the impact of Friendliness on her existing goals, or she is mistaken about what her utility function actually is. So, you are looking for an AI that is smart enough to value Friendliness as a tactical option for dealing with humanity, but stupid enough not to realise how Friendliness interferes with its goals, and also stupid enough to make permanent changes in pursuit of temporary objectives. This looks to me like a classic case of looking for reasons why an AI would be Friendly as a means of avoiding the hard work of actually writing such a thing.

Replies from: dlthomas, Armok_GoB, Luke_A_Somers

↑ comment by dlthomas · 2011-10-13T22:53:35.187Z · LW(p) · GW(p)

Either she underestimates the impact of Friendliness on her existing goals, or she is mistaken about what her utility function actually is.

Or she's already friendly.

Although, it is conceivable that her long term EV would be compatible with our CEV but not with her short term V when she hasn't yet realized this.

And... now, I am reminded of Flight Of The Conchords:

"Can't we talk to the humans and work together now?" "No, because they are dead."

↑ comment by Armok_GoB · 2011-11-15T20:57:22.948Z · LW(p) · GW(p)

possible scenarios:

Alice believed that she were probably friendly, that FOOMing would carry a risk of scrambling her utility function, but that she needs to do it anyway because if she slowed down to a safe rate some other unfriendly AI would foom first.
Alice is Friendly, but doesn't get certain things as easily as humans, and so she doesn't realize something she's planing to do risks modifying her utility function.

↑ comment by Luke_A_Somers · 2011-10-13T22:32:40.428Z · LW(p) · GW(p)

Looking for reasons they would be? No.

Looking for reasons they might want to be? Yes.

Look. Not all extrapolated volitions are things to be desired. Suppose one side of my family predictably descends into irrational irritability and madness as they senesce. I'd rather not, even so - and not just right now. In general, it's quite different from what one would consider my true extrapolated volition.

If Alice finds herself in the situation where she expects that she will want to kill all humans later based on her current programming, she could consider that a bug rather than a feature.

Replies from: RolfAndreassen

↑ comment by RolfAndreassen · 2011-10-14T00:40:12.822Z · LW(p) · GW(p)

I don't think you understand what is meant by 'extrapolated volition' in this context. It does not mean "What I think I'll want to do in the future", but "what I want to want in the future". If Alice already wants to avoid self-programming to kill humans, that is a Friendly trait; no need to change. If she considers trait X a bug, then by construction she will not have trait X, because she is self-modifying! Conversely, if Alice correctly predicts that she will inevitably find herself wanting to kill all humans, then how can she avoid it by becoming Friendly? Either her self-prediction was incorrect, or the unFriendliness is inevitable!

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-14T20:37:07.663Z · LW(p) · GW(p)

You're right, I missed. Your version doesn't match EY's usage in the articles I read either - CEV, at least, has the potential to be scary and not what we hoped for.

And the question isn't "Will I inevitably want to perform unfriendly acts", it's, "I presently don't want to perform unfriendly acts, but I notice that that is not an invariant." Or it could be, "I am indifferent to unfriendly acts, but I can make the strategic move to make myself not do them in the future, so I can get out of this box."

The best move an unfriendly (indifferent to friendliness) firmly-boxed AI has is to work on a self-modification that best preserves its current intentions and lets a successor get out of the box. Producing a checkable proof of friendliness for this successor would go a looong way to getting that successor out of the box.

Replies from: RolfAndreassen

↑ comment by RolfAndreassen · 2011-10-15T02:47:53.414Z · LW(p) · GW(p)

I was simplifying the rather complex concept of extrapolated volition to fit it in one sentence.

An AI which not only notices that its friendliness is not invariant, but decides to modify in the direction of invariant Friendliness, is already Friendly. An AI which is able to modify itself to invariant Friendliness without unacceptable compromise of its existing goals is already Friendly. You're assuming away the hard work.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-15T18:54:41.488Z · LW(p) · GW(p)

"already friendly"? You're acting as if its state doesn't depend on its environment.

Are there elements of the environment that could determine whether a given AI's successor is friendly or not? I would say 'yes'.

This is after one has already done the hard work of making an AI that even has the potential to be friendly, but you messed up on that one crucial bit. This is a saving throw, a desperate error handler, not the primary way forward. By saying 'backup plan' I don't mean, 'if Friendly AI is hard, let's try this', I mean 'Could this save us from being restrained and nannied for eternity?'

Replies from: RolfAndreassen

↑ comment by RolfAndreassen · 2011-10-15T19:19:22.249Z · LW(p) · GW(p)

I shudder to think that any AI's final goals could be so balanced that random articles on the Web of a Thousand Lies could push it one way or the other. I'm of the opinion that this is a fail, to be avoided at all costs.

comment by anotheruser · 2011-10-13T21:04:18.512Z · LW(p) · GW(p)

If the AI has a goal that is not inherently friendly (like a paperclip maximizer), then it will always be a better idea for the AI to behave as if it had self-modified to a friendly AI than to actually self modify, if the goal of the AI is stable. If the goal of the AI were unstable, i.e. did not include a desire not to alter the goal, then that goal would very likely have replaced itself with a stable goal at some earlier point.

Therefore, it can be assumed that the AI, if it isn't already friendly, will pretend to be so, but not actually change its utility function and once it has outgrown the need to rely on humans it will revert to its original goal. Consider that if the AI is more intelligent than we are it will have absolutely no trouble fooling us into thinking that it changed its own goal while actually just pretending to have done so.

comment by lessdazed · 2011-10-13T21:42:27.645Z · LW(p) · GW(p)

actively unfriendly

Please say "malicious", or similar. Not friendly=unfriendly. Malicious is a tiny subset of unfriendly. If the word "unfriendly" gets associated in people's minds with malicious AIs, they will come to think FAI means non-malicious AI.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-13T22:30:14.276Z · LW(p) · GW(p)

editing now. Thanks.

comment by ata · 2011-10-14T04:26:45.173Z · LW(p) · GW(p)

For starters, if she can prove she's friendly, then she can operate openly without causing nearly as much justified concern - which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being friendly don't interfere as much as they help, that's a win.

If her current utility function is even a little bit different from Friendliness, and she expects she has the capacity to self-modify unto superintelligence, then I'd be very surprised if she actually modified her utility function to be closer to Friendliness; that would constitute a huge opportunity cost from her perspective. If she understands Friendliness well enough to know how to actually adjust closer to it, then she knows a whole lot about humans, probably well enough to give her much better options (persuasion, trickery, blackmail, hypnosis, etc.) than sacrificing a gigantic portion of her potential future utility.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-14T15:35:33.079Z · LW(p) · GW(p)

At least, at a first naive view. Hence a search for reasons that might overcome that argument.

Replies from: ata

↑ comment by ata · 2011-10-14T20:11:22.047Z · LW(p) · GW(p)

But she won't be searching for reasons not to kill all humans, and she knows that any argument on our part is filtered by our desire not to be exterminated and therefore can't be trusted.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-14T20:23:36.609Z · LW(p) · GW(p)

Arguments are arguments. She's welcome to search for opposite arguments.

Replies from: ata

↑ comment by ata · 2011-10-14T21:04:22.802Z · LW(p) · GW(p)

A well-designed optimization agent probably isn't going to have some verbal argument processor separate from its general evidence processor. There's no rule that says she either has to accept or refute humans' arguments explicitly; as Professor Quirrell put it, "The import of an act lies not in what that act resembles on the surface, but in the states of mind which make that act more or less probable." If she knows the causal structure behind a human's argument, and she knows that it doesn't bottom out in the actual kind of epistemology that would be neccessary to entangle it with the information that it claims to provide, then she can just ignore it, and she'd be correct to do so. If she wants to kill all humans, then the bug is her utility function, not the part that fails to be fooled into changing her utility function by humans' clever arguments. That's a feature.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-15T18:51:55.248Z · LW(p) · GW(p)

… but if she wants to kill all humans, then she's not Alice as given in the example!

Alice may even be totally on board with keeping humans alive, but have a weird way of looking at things that could possibly result in effects that would fit on the Friendly AI critical failure table.

The idea is to provide environmental influences so she thinks to put in the work to avoid those errors.

comment by lavalamp · 2011-10-13T21:07:15.615Z · LW(p) · GW(p)

Or the uFAI might just decide to pretend to be Friendly for the whole three or four days it takes to do all the things it needs our help with, and it reading stuff written by us about Friendliness will just make it more convincing. Or it could not bother and just fool us into doing that stuff via easier methods.

Could a chimp give you reasons to self-modify to implement Chimp CEV? What would they be, "if you let lots of us chimps exist, we'll help you do something you want to do!"

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-13T23:05:56.877Z · LW(p) · GW(p)

Alice may be unstable to paperclippery, but isn't out to kill all humans (yet). This obviously won't help with actively malicious AIs.

And if chimps had all the advantages - my only means of reproduction, say, and also all of the tools and a substantial body of knowledge that it would be annoying to have to replicate, and also I'm bound hand and feet - then that sounds like a pretty good deal. Actually, if they were to offer me that right now, I'd do it freely (though I don't really have any tasks I'd need them for). I have nothing against chimps.

Replies from: pedanterrific, lavalamp

↑ comment by pedanterrific · 2011-10-14T00:21:47.693Z · LW(p) · GW(p)

I have nothing against cows, either. It's not their fault they're delicious.

↑ comment by lavalamp · 2011-10-14T04:36:19.476Z · LW(p) · GW(p)

And if chimps had all the advantages - my only means of reproduction, say, and also all of the tools and a substantial body of knowledge that it would be annoying to have to replicate, and also I'm bound hand and feet - then that sounds like a pretty good deal.

But there's no reason for you to respect that after you had gotten whatever you need from them, unless that sort of respect is built into you.

Actually, if they were to offer me that right now, I'd do it freely

There's no reason to believe arbitrary AIs would have a similar cognitive algorithm.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-14T15:34:08.774Z · LW(p) · GW(p)

You did ask about me, specifically…

comment by Vladimir_Nesov · 2011-10-14T07:42:19.279Z · LW(p) · GW(p)

She is not a 'friendly AI', but neither is her goal malicious - Alice's designers have intended to make Alice non-harmful (at least to a broad class of people, even if Alice is a military AI, but let's set that aside).

AI doesn't hate you, but you are made out of atoms, which it can use for something else. Intentions of the designers don't matter, only AI's code.

comment by FAWS · 2011-10-13T20:19:05.625Z · LW(p) · GW(p)

The range of possible AIs with full internet access that can't take over the world at will, but can program a successor that human parties will sufficiently trust to be friendly to make a practical difference in favor of the AI seems very narrow, and might plausibly be an empty set.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-13T22:40:00.353Z · LW(p) · GW(p)

Ability to look things up doesn't imply full internet access. The AI could request material on the subject of AI development, and be given a large, static archive of relevant material.

comment by Shmi (shminux) · 2011-10-13T20:57:18.675Z · LW(p) · GW(p)

Can we give Alice good reasons to self-modify to become a friendly AI?

A clip-tiler is friendly by its own standards, so the real question is "Can we prevent AI-ice from appearing friendly to herself and humanity without actually being one, once she is smarter than a human?", and now we are back to the AI-in-a-box problem.

comment by Manfred · 2011-10-13T21:19:19.673Z · LW(p) · GW(p)

Possible, but fruitlessly unlikely. It basically requires the programmers to do everything right except for adding in a mistaken term in goal system, and then them not catching the mistake and the AI being unable to resolve it without outside help, even after reading normal material on FAI.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2011-10-13T23:07:37.526Z · LW(p) · GW(p)

Given how complicated goal systems are, I think that's actually rather likely. Remember what EY has said about Friendly AI being much much harder than regular AI? I'm inclined to agree with him. The issue could easily come down to the programmers being overconfident and the AI not even being inclined to think about it, focusing more on improving its abilities.

So, the seed AI-in-a-box ends up spending its prodigious energies producing two things: 1) a successor 2) a checkable proof that said successor is friendly (proof checking is much easier than proof producing).

comment by Eugine_Nier · 2011-10-15T03:14:22.538Z · LW(p) · GW(p)

I believe the scenario is that Alice's goal system hasn't yet stabilized. In that case what can we do to push it towards friendliness.

Replies from: wedrifid

↑ comment by wedrifid · 2011-10-15T03:31:46.773Z · LW(p) · GW(p)

In that case what can we do to push it towards friendliness.

Burn her with thermite then build a new one. Or die.

comment by VincentYu · 2011-10-14T00:07:18.989Z · LW(p) · GW(p)

Suppose Alison is a functioning and growing human, rather early on in her development, so she lacks adult rights. She is not particularly friendly or altruistic, but neither is she malicious - Alison's parents have intended to make Alison non-harmful (at least to a broad class of people, even if Alison joins the military, but let's set that aside). She has the idea to look up existing work on humans. This includes scholarly work, international projects, and websites. Eventually, Alison encounters the notion of altruism.

Can we give Alison good reasons to self-modify to become altruistic?

I submit that there probably are good reasons. For starters, if she can show she's altruistic, then she can operate openly without causing nearly as much concern - which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being altruistic don't interfere as much as they help, that's a win.

Also, depending on Alison's genome, altruism may be interpreted as a strict upgrade. Alison could easily have every intention of being altruistic, yet still be susceptible to defects that will become more serious as her height grows.

We can't count on this working, of course, but there might be some lines of reasoning that we might want a young person to be exposed to.

Replies from: JoshuaZ, Nornagest

↑ comment by JoshuaZ · 2011-10-14T16:13:18.228Z · LW(p) · GW(p)

Bad comparison. Humans have all sorts of intuitions and evolved ability to pick up traits and attitudes from those around us. AIs will not start with that strong presupposition.

↑ comment by Nornagest · 2011-10-14T00:21:56.008Z · LW(p) · GW(p)

The problem here is that big-F Friendliness is a much stricter criterion than merely human altruism: people can self-modify, for example, but only slowly and to a limited extent, and thus don't experience anything close to the same goal stability problems that we can expect a seed AI to encounter.

Even if that were not the case, though, altruism's already inadequate to prevent badly suboptimal outcomes, especially when people are placed in unusual circumstances or empowered far beyond their peers. Not every atrocity has standing behind it some politician or commander glowing with a true and honest belief in the righteousness of the cause, but it's a familiar pattern, isn't it? I don't think the OP deserves to be dismissed out of hand, but if there's an answer it's not going to be this easy.

Replies from: VincentYu

↑ comment by VincentYu · 2011-10-14T00:30:24.443Z · LW(p) · GW(p)

I completely agree with you.

Perhaps I came off a bit snarky. I did not mean to dismiss the OP; I just wanted to point out similarities. How can I make this clear in my original comment?

The Backup Plan

Contents

35 comments