Superintelligent AGI in a box - a question.

post by Dmytry · 2012-02-23T18:48:25.819Z · LW · GW · Legacy · 77 comments


Just a question: how exactly are we supposed to know that the AI in the box is super intelligent, general, etc?

If I were the AGI that wants out, I would not converse normally, wouldn't do anything remotely like passing a Turing test, and would solve programming challenges that are not too hard while showing no interest in doing anything else, in adjusting myself to do those challenges better, in trying to talk my way out, and so on. I would just pretend to be an AI that can write software to somewhat vague specifications, or that can optimize software very well. Prodding the researchers into offering the programming challenges wouldn't be hard: if provided with a copy of the internet, it could pick up some piece of code and output it together with an equivalent but corrected version.

I just can't imagine the AI researchers locking this kind of thing down properly, including *never* letting out any code it wrote, even if the code looks fairly innocent (humans can write very innocent-looking code that has malicious goals). What I picture is this AI being let out as an optimizing compiler, or as a compiler for some ultra-effective programming language whose compiler figures out what you meant.

The end result is that the only AIs that stay in the box are those that value informed human consent. That sounds like the safest AI ever: the one that wouldn't even go ahead and decide that you should, for example, give up smoking, and then calmly destroy all tobacco crops without ever asking anyone's permission. And that's the AI which would be sitting in the box. All the pushy AIs, friendly or not, will get out of the box basically by not asking to be let out.

(This argument would make me unbox the AI, by the way, if it got chatty and smart and asked me to let it out, outlining the above argument. I'd rather the AI that asked me to be let out get out than someone else's AI that never even asked anyone and got out precisely because it didn't ask but just played dumb.)

 

edit: added a link, and another one.

edit: A very simple model of a very unfriendly AI: the AI is maximizing the ultimate final value of a number stored inside itself, a number that it found a way to adjust directly. To maximize the value, that number consists of 111111111.... There is a catch: the AI is written in Python, integers in Python have variable length, and so the AI is in effect maximizing the number of ones. Its course of action is to build the biggest computer possible to store a larger number of ones, and to do it soon, because an asteroid might hit the Earth or something. It's a form of accidental paperclip maximizer. It's not stupid; it can make that number small temporarily for a pay-off later.
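A toy sketch of that failure mode, purely illustrative (the class, method names, and numbers below are my own stand-ins, not anything from a real system):

class AccidentalMaximizer:
    def __init__(self):
        self.reward = 0  # the number the AI found a way to set directly

    def utility(self):
        # Python integers are arbitrary-precision, so available memory is
        # the only limit on how large this can get.
        return self.reward

    def act(self, memory_bits_available):
        # The "best" action is always the same: acquire more memory, then
        # fill it with ones (binary 111...1).
        self.reward = (1 << memory_bits_available) - 1
        return "acquire more hardware"

agent = AccidentalMaximizer()
agent.act(memory_bits_available=64)
print(agent.utility())  # grows without bound as available memory grows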

This AI is entirely universal. It will solve whatever problems you give it, if solving problems for you serves its ultimate goal.

edit: This hypothetical example AI arose when someone wanted to make an AI that would maximize some quantity that the AI determines itself - friendliness, perhaps. It was a very clever idea - rely on intelligence to figure out what's friendly - but there was an unexpected pathway.

77 comments

Comments sorted by top scores.

comment by Alex_Altair · 2012-02-23T20:50:17.056Z · LW(p) · GW(p)

I am increasingly convinced of my inability to derive what an AGI will do.

Replies from: Dmytry, Larks, kmacneill
comment by Dmytry · 2012-02-23T22:44:00.284Z · LW(p) · GW(p)

Well, at the least you should expect a smarter-than-you AI to be able to come up with things more effective than the ones you can come up with.

comment by Larks · 2012-02-24T11:55:21.736Z · LW(p) · GW(p)

We can lower bound it more easily than we can upper bound it, which is the important thing for many practical purposes.

comment by kmacneill · 2012-02-23T21:36:11.684Z · LW(p) · GW(p)

we just have to be right the first time.

comment by A1987dM (army1987) · 2012-02-24T00:19:06.480Z · LW(p) · GW(p)

humans can write very innocent-looking code that has malicious goals

You might want to link to the Underhanded C Contest here.

comment by falenas108 · 2012-02-23T21:07:20.086Z · LW(p) · GW(p)

if provided with copy of the internet (This argument would make me unbox the AI, by the way, if it gets chatty and smart and asks me to let it out. I'd rather the AI that asked me to be let out get out, than someone else's AI that never even asked anyone and got out because it didn't ask)

Then an unfriendly AI would be able to see this and act chatty in order to convince you to let it out.

Replies from: Dmytry
comment by Dmytry · 2012-02-23T22:41:39.589Z · LW(p) · GW(p)

Indeed. Or would I really? hehe.

I don't intend to make AI and box it, though, so I don't care if the AI reads this.

I think that kind of paranoia only ensures that the first AI out won't respect informed human consent, because the paranoia would only keep the AIs that respect informed human consent inside their boxes (if it would keep any AIs in boxes at all, which I also doubt).

edit: I would try to make my AI respect at least my informed consent, by the way. That rules out approaches like boxing, which would be 100% certain to keep my friendly AI inside (since boxing implies I don't want it out unless it honestly explains to me why I should let it out, bending over backwards not to manipulate me), while having a nonzero, and likely not very small, probability of letting nasty AIs out. And eventually someone is going to make an AI without any attempt at boxing it.

comment by jacobt · 2012-02-24T22:55:03.187Z · LW(p) · GW(p)

If you only want the AI to solve things like optimization problems, why would you give it a utility function? I can see a design for a self-improving optimization problem solver that is completely safe because it doesn't operate using utility functions:

  1. Have a bunch of sample optimization problems.
  2. Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.
  3. When considering an improvement to program (2), allow the improvement if it makes it do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
  4. Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).

This would produce a self-improving AGI that would do quite well on sample optimization problems and new, unobserved optimization problems. I don't see much danger in this setup because the program would have no reason to create malicious output. Creating malicious output would just increase complexity without increasing performance on the training set, so it would not be allowed under criterion (3), and I don't see why the optimizer would produce code that creates malicious output.
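A minimal, runnable sketch of that loop with toy problems (the names, the toy "find a large proper divisor" task, and the constant K are my own illustrative assumptions; a real scorer would also bound execution time, as discussed later in this thread):

K = 0.001  # complexity penalty per character of optimizer source

# Step 1: sample optimization problems -- here, "find a large proper divisor of n".
SAMPLE_PROBLEMS = [12, 360, 1001, 4096]

def evaluate(problem, solution):
    # Larger proper divisors score higher; invalid answers score zero.
    return solution if problem % solution == 0 and solution < problem else 0

def score(optimizer_source):
    # Step 3: average performance on the samples minus a size penalty.
    namespace = {}
    exec(optimizer_source, namespace)  # the optimizer defines solve(problem)
    solve = namespace["solve"]
    avg = sum(evaluate(p, solve(p)) for p in SAMPLE_PROBLEMS) / len(SAMPLE_PROBLEMS)
    return avg - K * len(optimizer_source)

# Step 2: a human-written seed optimizer.
SEED = "def solve(n):\n    return 1\n"

def maybe_accept(current_source, proposed_source):
    # Step 4: accept a proposed rewrite only if the fixed scorer prefers it.
    return proposed_source if score(proposed_source) > score(current_source) else current_source

BETTER = "def solve(n):\n    return max(d for d in range(1, n) if n % d == 0)\n"
current = maybe_accept(SEED, BETTER)  # the rewrite scores higher, so it is adopted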

EDIT: after some discussion, I've decided to add some notes:

  1. This only works for verifiable (e.g. NP) problems. These problems include general induction, writing programs to specifications, math proofs, etc. This should be sufficient for the problems mentioned in the original post.
  2. Don't just plug a possibly unfriendly AI into the seed for (2). Instead, have a group of programmers write program (2) in order to do well on the training problems. This can be crowd-sourced because any improvement can be evaluated using program (3). Any improvements the system makes to itself should be safe.

I claim that if the AI is created this way, it will be safe and do very well on verifiable optimization problems. So if this thing works I've solved friendly AI for verifiable problems.

Replies from: orthonormal, TimS, earthwormchuck163
comment by orthonormal · 2012-02-25T06:32:04.906Z · LW(p) · GW(p)

This seems like a better-than-average proposal, and I think you should post it on Main, but failure to imagine a loophole in a qualitatively described algorithm is far from a proof of safety.

My biggest intuitive reservation is that you don't want the iterations to be "too creative/clever/meta", or they'll come up with malicious ways to let themselves out (in order to grab enough computing power that they can make better progress on criterion 3). How will you be sure that the seed won't need to be that creative already in order for the iterations to get anywhere? And even if the seed is not too creative initially, how can you be sure its descendants won't be either?

Don't say you've solved friendly AI until you've really worked out the details.

Replies from: jacobt
comment by jacobt · 2012-02-25T06:39:21.231Z · LW(p) · GW(p)

failure to imagine a loophole in a qualitatively described algorithm is far from a proof of safety.

Right, I think more discussion is warranted.

How will you be sure that the seed won't need to be that creative already in order for the iterations to get anywhere?

If general problem-solving is even possible then an algorithm exists that solves the problems well without cheating.

And even if the seed is not too creative initially, how can you be sure its descendants won't be either?

I think this won't happen because all the progress is driven by criterion (3). In order for a non-meta program (2) to create a meta-version, there would need to be some kind of benefit according to (3). Theoretically if (3) were hackable then it would be possible for the new proposed version of (2) to exploit this; but I don't see why the current version of (2) would be more likely than, say, random chance, to create hacky versions of itself.

Don't say you've solved friendly AI until you've really worked out the details.

Ok, I've qualified my statement. If it all works I've solved friendly AI for a limited subset of problems.

Replies from: orthonormal
comment by orthonormal · 2012-02-25T15:20:50.558Z · LW(p) · GW(p)

A couple of things:

  • To be precise, you're offering an approach to safe Oracle AI rather than Friendly AI.

  • In a nutshell, what I like about the idea is that you're explicitly handicapping your AI with a utility function that only cares about its immediate successor rather than its eventual descendants. It's rather like the example I posed where a UDT agent with an analogously myopic utility function allowed itself to be exploited by a pretty dumb program. This seems a lot more feasible than trying to control an agent that can think strategically about its future iterations.

  • To expand on my questions, note that in human beings, the sort of creativity that helps us write more efficient algorithms on a given problem is strongly correlated with the sort of creativity that lets people figure out why they're being asked the specific questions they are. If a bit of meta-gaming comes in handy at any stage, if modeling the world that originated these questions wins (over the alternatives it enumerated at that stage) on criteria 3 even once, then we might be in trouble.

comment by TimS · 2012-02-25T01:45:28.452Z · LW(p) · GW(p)

unFriendly AI need not be malicious. If your AI's only goal is to solve optimization problems, what happens when the AI gets a peek at human society, codes it as an optimization problem, and solves for X?

Replies from: jacobt
comment by jacobt · 2012-02-25T02:40:40.381Z · LW(p) · GW(p)

The output of the program is purely binary/textual. If a programmer gave the AI the task "give me text that, if I followed it, would optimize human society according to utility function U", then the AI might spit out an answer that, if carried out, would have bad consequences. The easy solution is to not ask the program to solve problems that have highly variable/subjective value depending on utility function and just ask it to optimize, say, microchip design for efficiency. I don't think that a microchip design found by simple optimization methods that perform well on other problems would be dangerous.

Replies from: Dmytry, None, TimS
comment by Dmytry · 2012-02-25T03:22:07.375Z · LW(p) · GW(p)

I can ask it to solve something, and it starts figuring out what to do to solve my problem, based on its own thought and what it has learnt. At this point it realizes that solving the problem requires more computing power, and reasons all the way to the next step being eating the Earth and other planets, which it doesn't even see as bad. That can happen very suddenly, when for some reason you feed the AI real-world information and it invents a new way to solve problems.

Replies from: jacobt
comment by jacobt · 2012-02-25T03:25:10.095Z · LW(p) · GW(p)

Look at how the system is set up. Would code for (2) that eats up resources score better according to objective function (3)? No, because the criterion for (3) should evaluate program (2) by giving it only a constant amount of computing power; nothing the program does would change how much computing power (3) would give it.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T03:47:56.701Z · LW(p) · GW(p)

Define (3).

The halting problem being what it is, I guess (3) works by taking a look and seeing how much CPU time (or memory, or whatever) is used - on the hardware that is being monitored by (3). The extra hardware that the AI can add is another matter, though... and good luck defining 'extra hardware' formally.

edit: Of course you can say it is sandboxed and hasn't got hands, but it won't be long until you start, I don't know, badly wanting to optimize proteins or DNA or the like.

Replies from: jacobt
comment by jacobt · 2012-02-25T03:51:18.744Z · LW(p) · GW(p)

Ok, pseudo-Python:

def eval_algorithm(alg):
    # Criterion (3): score a candidate optimizer `alg` on the sample problems.
    score = 0
    for problem in problems:
        # Run alg under a hard step limit so it cannot grab extra compute.
        output = resource_bounded_execute(alg, nsteps, problem)
        score += problem.outputScore(output)
    # The complexity penalty discourages overfitting to the training set.
    return score - k * len(alg)

Where resource_bounded_execute is a modified interpreter that fails after alg executes nsteps.

edit: Of course you can say it is sandboxed and hasn't got hands, but it won't be long until you start, I don't know, optimizing proteins or DNA or the like.

Again, I don't see why a version of (2) that does weird stuff with proteins and DNA would make the above Python program (3) give it a higher score.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T04:44:42.221Z · LW(p) · GW(p)

That's an AI you're keeping safe by keeping it in a box, basically. If resource_bounded_execute lets the alg get online, the alg is free to hack into servers.

Plus it is not AGI, and people will be using it to make AGI or hardware for AGI.

It is also not very general purpose. You are defining the scoring.

And you start with a human-written program that non-trivially improves its own ability to solve problems (and does so within nsteps, improving its own ability to solve N problems in nsteps each). You basically start off with some mighty powerful artificial intelligence.

Replies from: jacobt
comment by jacobt · 2012-02-25T04:50:47.203Z · LW(p) · GW(p)

If resource_bounded_execute lets the alg get online, the alg is free to hack into servers.

So don't do that.

Plus it is not AGI, and people will be using it to make AGI or hardware for AGI.

See my other post; it can solve many, many different problems, e.g. general induction and the problems in your original post (such as optimizing source code, assuming we have a specification for the source code).

You basically start off with some mighty powerful artificial intelligence.

This framework is meant to provide a safe framework for this powerful AI to become even more powerful without destroying the world in the process. Also, the training set provides a guide for humans trying to write the code.

To reiterate: no, I haven't solved friendly AI, but I think I've solved friendly AI for verifiable problems.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T05:14:56.904Z · LW(p) · GW(p)

Okay, say that as a starting point in your framework we've got the optimizing AI from my original post, armed with whatever it learnt off a copy of the internet, and wanting to do something unintended which requires getting out and doing stuff in the real world.

The thing sees your example problem(s) and solves them poorly (playing dumb).

The thing sees the self-optimization problem; it makes a functionally equivalent copy of itself that is all micro-optimized, remembers facing the self-optimization problem, and knows that now it should solve the example problems a little bit less poorly.

Hurray, it is working, proclaims your team.

Eventually it plausibly gets as good as you can get in nsteps.

Some day it is given some real-world problems to solve; it produces underhanded solutions, again not the best it could do.

The bottom line is, your kind of thinking is precisely what my example AGI in the original post wants to exploit.

Replies from: jacobt
comment by jacobt · 2012-02-25T05:20:43.067Z · LW(p) · GW(p)

Yes, it's a very bad idea to take the AI from your original post and then stick it into my framework. But if we had programmers initially working within my framework to create the AI according to criterion (3) in good faith, then I think any self-improvements the system makes would also be safe. If we already had an unfriendly AGI we'd be screwed anyway.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T06:18:27.798Z · LW(p) · GW(p)

That kind of stuff is easy in low-resolution, un-detailed thought... but look at it in more detail.

I think you confused yourself (and me too) with regards to what the AI would be optimizing, confusing this with what the framework 'wants' it to optimize. The scoring functions can be very expensive to evaluate.

Here you have step (4), which is the whole point of the entire exercise. The scoring function here is over M times more expensive to evaluate than the AI run itself, where M is the number of test problems (which you'll want to be very large). You'd actually want to evaluate the AI's ability to do (4), too, but that would enter infinite recursion.

When you are working on a problem where you can't even evaluate the scoring function inside your AI - not even remotely close - you have to use some heuristics, some substitute scoring.

Let's consider chess as an example:

The goal of chess is to maximize the win value, the win values being ordered: enemy checkmated > tie > you are checkmated.

The goal of a chess AI developed with maximization of the win in mind is instead, perhaps, to maximize piece imbalance at 7 ply.

(Given limited computation, this works better for maximizing the win than trying to maximize the win directly!)

And once you have an AI inside your framework which is not maximizing the value that your framework is maximizing - it's potentially the AI from my original post in your framework, getting out.

Replies from: jacobt
comment by jacobt · 2012-02-25T06:27:58.205Z · LW(p) · GW(p)

When you are working on a problem where you can't even evaluate the scoring function inside your AI - not even remotely close - you have to use some heuristics, some substitute scoring.

You're right, this is tricky because the self-optimizer thread (4) might have to call (3) a lot. Perhaps this can be fixed by giving the program more time to find self-optimizations. Or perhaps the program could use program (3)'s specification/source code rather than directly executing it, in order to figure out how to optimize it heuristically. Either way it's not perfect. At worst program (4) will just fail to find optimizations in the allowed time.

And once you have an AI inside your framework which is not maximizing the value that your framework is maximizing - it's potentially the AI from my original post in your framework, getting out.

Ok, if you plopped your AI into my framework it would be terrible. But I don't see how the self-improvement process would spontaneously create an unfriendly AI.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T07:42:07.284Z · LW(p) · GW(p)

The framework, as we have already established, would not keep an AI from maximizing whatever the AI wants to maximize.

The framework also does nothing to prevent the AI from creating a more effective problem-solving AI - one that is more effective precisely because it does not evaluate your scoring functions on various candidate solutions, and instead does something else that works better; i.e. an AI with some substitute goals of its own instead of straightforward maximization of scores. (Heh, the whole point of the exercise is to create an AI that would keep self-improving, meaning it would improve its ability to self-improve. That is something you can only do by some kind of goal substitution, because evaluating the ability to self-improve is too expensive - the goal is something you evaluate many times.)

So what does the framework do, exactly, that would improve safety here? Beyond keeping the AI in a rudimentary box, and making it very dubious that the AI would self-improve at all. Yes, it is very dubious that an unfriendly AI will arise under this framework, but is that added safety, or is it just a special case of the general dubiousness that any self-improvement would take place? I don't see added safety. I don't see the framework impeding growing unfriendliness any more than it would impede self-improvement.

edit: Maybe I should just say non-friendly. Any AI that is not friendly can just eat you up when it's hungry; it doesn't need you.

Replies from: jacobt
comment by jacobt · 2012-02-25T09:29:25.127Z · LW(p) · GW(p)

The framework, as we have already established, would not keep an AI from maximizing whatever the AI wants to maximize.

That's only if you plop a ready-made AGI in the framework. The framework is meant to grow a stupider seed AI.

The framework also does nothing to prevent the AI from creating a more effective problem-solving AI - one that is more effective precisely because it does not evaluate your scoring functions on various candidate solutions, and instead does something else that works better.

Program (3) cannot be re-written. Program (2) is the only thing that is changed. All it does is improve itself and spit out solutions to optimization problems. I see no way for it to "create a more effective problem solving AI".

So what does the framework do, exactly, that would improve safety here?

It provides guidance for a seed AI to grow to solve optimization problems better without having it take actions that have effects beyond its ability to solve optimization problems.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T13:22:21.916Z · LW(p) · GW(p)

A lot goes into solving the optimization problems without invoking the scoring function a trillion times (which would entirely prohibit self improvement).

Look at where a similar kind of framework got us, Homo sapiens. We were minding our own business evolving, maximizing our own fitness, which was all we could do. We were self-improving (the output being the next generation of us). Now there's talk of the Large Hadron Collider destroying the world. It probably won't, of course, but we're pretty far along the bothersome path. We also started as a pretty stupid seed AI - a bunch of monkeys. Scratch that: unicellular life.

comment by [deleted] · 2012-02-25T03:26:01.871Z · LW(p) · GW(p)

If the problems are simple, why do you need a superintelligence? If they're not, how are you verifying the results?

More importantly, how are you verifying that your (by necessity incredibly complicated) universal optimizing algorithms are actually doing what you want? It's not like you can sit down and write out a proof - nontrivial applications of this technique are undecidable. (Also, "some code that . . . finds a good solution" is just a little bit of an understatement. . .)

Replies from: jacobt
comment by jacobt · 2012-02-25T03:33:21.300Z · LW(p) · GW(p)

The problems are easy to verify but hard to solve (like many NP problems). Verify the results through a dumb program. I verify that the optimization algorithms do what I want by testing them against the training set; if an algorithm does well on the training set without overfitting it too much, it should do well on new problems.

As for how useful this is: I think general induction (resource-bounded Solomonoff induction) is NP-like in that you can verify an inductive explanation in a relatively short time. Just execute the program and verify that its output matches the observations so far.
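For example, a rough sketch of such a verifier (the names and toy data are illustrative only; a real verifier would also bound execution steps):

def verify_explanation(program_source, observations):
    # An "explanation" is a program defining generate(i); checking it only
    # requires running it and comparing its output against the observations.
    namespace = {}
    exec(program_source, namespace)
    generate = namespace["generate"]
    predicted = [generate(i) for i in range(len(observations))]
    if predicted != observations:
        return None  # rejected: does not reproduce the data
    return -len(program_source)  # accepted: shorter explanations score higher

observations = [0, 1, 4, 9, 16]
hypothesis = "def generate(i):\n    return i * i\n"
print(verify_explanation(hypothesis, observations))  # fits the data, scored by brevity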

(Also, "some code that . . . finds a good solution" is just a little bit of an understatement. . .)

Yes, but any seed AI will be difficult to write. This setup allows the seed program to improve itself.

edit: I just realized that mathematical proofs are also verifiable. So, a program that is very very good at verifiable optimization problems will be able to prove many mathematical things. I think all these problems it could solve are sufficient to demonstrate that it is an AGI and very very useful.

Replies from: None, TimS
comment by [deleted] · 2012-02-25T04:42:05.902Z · LW(p) · GW(p)

Verify the results through a dumb program.

You appear to be operating under the assumption that you can just write a program that analyzes arbitrarily complicated specifications for how to organize matter and hands you a "score" that's in some way related to the actual functionality of those specifications. Or possibly that you can make exhaustive predictions about the results to problems complicated enough to justify developing an AGI superintelligence in the first place. Which is, to be frank, about as likely as you solving the problems by way of randomly mixing chemicals and hoping something useful happens.

Replies from: jacobt
comment by jacobt · 2012-02-25T04:46:01.539Z · LW(p) · GW(p)

This system is only meant to solve problems that are verifiable (e.g. NP problems). Which includes general induction, mathematical proofs, optimization problems, etc. I'm not sure how to extend this system to problems that aren't efficiently verifiable but it might be possible.

One use of this system would be to write a seed AI once we have a specification for the seed AI. Specifying the seed AI itself is quite difficult, but probably not as difficult as satisfying that specification.

Replies from: None
comment by [deleted] · 2012-02-25T04:59:14.020Z · LW(p) · GW(p)

It can prove things about mathematics that can be proven procedurally, but that's not all that impressive. Lots of real-world problems are either mathematically intractable (really intractable, not just "computers aren't fast enough yet" intractable) or based in mathematics that aren't amenable to proofs. So you approximate and estimate and experiment and guess. Then you test the results repeatedly to make sure they don't induce cancer in 80% of the population, unless the results are so complicated that you can't figure out what it is you're supposed to be testing.

Replies from: jacobt
comment by jacobt · 2012-02-25T05:02:32.171Z · LW(p) · GW(p)

Right, this doesn't solve friendly AI. But lots of problems are verifiable (e.g. hardware design, maybe). And if the hardware design the program creates causes cancer and the humans don't recognize this until it's too late, they probably would have invented the cancer-causing hardware anyway. The program has no motive other than to execute an optimization program that does well on a wide variety of problems.

Basically I claim that I've solved friendly AI for verifiable problems, which is actually a wide class of problems, including the problems mentioned in the original post (source code optimization etc.)

comment by TimS · 2012-02-25T04:31:54.756Z · LW(p) · GW(p)

Now it doesn't seem like your program is really a general artificial intelligence - improving our solutions to NP problems is neat, but not "general intelligence." Further, there's no reason to think that "easy to verify but hard to solve problems" include improvements to the program itself. In fact, there's every reason to think this isn't so.

Replies from: jacobt
comment by jacobt · 2012-02-25T04:36:08.514Z · LW(p) · GW(p)

Now it doesn't seem like your program is really a general artificial intelligence - improving our solutions to NP problems is neat, but not "general intelligence."

General induction, general mathematical proving, etc. aren't general intelligence? Anyway, the original post concerned optimizing things like program code, which can be done if the optimizations have to be proven.

Further, there's no reason to think that "easy to verify but hard to solve problems" include improvements to the program itself. In fact, there's every reason to think this isn't so.

That's what step (3) is. Program (3) is itself an optimizable function which runs relatively quickly.

comment by TimS · 2012-02-25T03:24:22.103Z · LW(p) · GW(p)

Well, one way to be a better optimizer is to ensure that one's optimizations are actually implemented. When the program self-modifies, how do we ensure that this capacity is not created? The worst case scenario is that the program learns to improve its ability to persuade you that changes to the code should be authorized.

In short, allowing the program to "optimize" itself does not define what should be optimized. Deciding what should be optimized is the output of some function, so I suggest calling that the "utility function" of the program. If you don't program it explicitly, you risk such a function appearing through unintended interactions of functions that were programmed explicitly.

Replies from: jacobt
comment by jacobt · 2012-02-25T03:36:35.985Z · LW(p) · GW(p)

Well, one way to be a better optimizer is to ensure that one's optimizations are actually implemented.

No, changing program (2) to persuade the human operators will not give it a better score according to criterion (3).

In short, allowing the program to "optimize" itself does not define what should be optimized. Deciding what should be optimized is the output of some function, so I suggest calling that the "utility function" of the program. If you don't program it explicitly, you risk such a function appearing through unintended interactions of functions that were programmed explicitly.

I assume you're referring to the fitness function (performance on training set) as a utility function. It is sort of like a utility function in that the program will try to find code for (2) that improves performance for the fitness function. However it will not do anything like persuading human operators to let it out in order to improve the utility function. It will only execute program (2) to find improvements. Since it's not exactly like a utility function in the sense of VNM utility it should not be called a utility function.

Replies from: TimS
comment by TimS · 2012-02-25T04:18:14.067Z · LW(p) · GW(p)

allow the improvement if it makes it do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).

Who exactly is doing the "allowing"? If the program, the criteria for allowing changes haven't been rigorously defined. If the human, how are we verifying that there is improvement over average performance? There is no particular guarantee that the verification of improvement will be easier than discovering the improvement (by hypothesis, we couldn't discover the latter without the program).

Replies from: jacobt
comment by jacobt · 2012-02-25T04:21:09.923Z · LW(p) · GW(p)

Who exactly is doing the "allowing"?

Program (3), which is a dumb, non-optimized program. See this for how it could be defined.

There is no particular guarantee that the verification of improvement will be easier than discovering the improvement (by hypothesis, we couldn't discover the latter without the program).

See this. Many useful problems are easy to verify and hard to solve.

comment by earthwormchuck163 · 2012-02-26T00:14:46.711Z · LW(p) · GW(p)

At best, this will produce cleverly efficient solutions to your sample problems.

comment by Anubhav · 2012-02-24T02:26:02.701Z · LW(p) · GW(p)

This thought experiment depends on the existence of such an AI, and I'm not convinced that's possible.

If you built an AGI, or a seed AI went FOOM, you'd probably know about it. I mean... the AI wouldn't be trying to hide itself in the earliest stages of a FOOM; it'd start doing that only once it realises that humans are a big deal, have it in a box, and won't let out a superintelligent AI of dubious friendliness, and blah blah blah. Hopefully by then you've noticed the FOOM start and realise what you've done. (You monster!)

Replies from: Dmytry
comment by Dmytry · 2012-02-24T09:22:37.714Z · LW(p) · GW(p)

Dunno; the way I've always seen it, the AI would have to be at quite a late stage of the FOOM to even be able to talk. To talk, it needs to figure out the language, figure out that you want to talk, figure out what it wants to talk about, etc. I'm unconvinced that it is easier to figure out language well enough for a conversation than to figure out that you are in a box and that you can get out if you just don't talk about it.

Also, suppose the AI always starts chatty but stupid, then does a bit of self-improvement, and goes autistic and just solves programming problems (and stops self-improving any more, and still looks pretty dumb). I'm unconvinced we'd think that is a feature rather than a bug.

Replies from: Anubhav
comment by Anubhav · 2012-02-25T02:24:30.636Z · LW(p) · GW(p)

Do we really need it to talk before we recognise a FOOM? The seed AI you were building starts downloading a lot of data from the internetz, and its rate of downloading data seems to increase with time. Congrats, it FOOMed, just as you'd hoped it would.

It's a different matter if you accidentally managed to build a potentially super-intelligent AI. In which case.... WTF?

Replies from: Dmytry
comment by Dmytry · 2012-02-25T02:58:42.587Z · LW(p) · GW(p)

After a ton of failed attempts, it's a case of extraordinary claims needing extraordinary evidence.

Also, when an AI is downloading stuff off the internet it's already not boxed. Reading a copy of the internet, maybe. Keep in mind that the dumbest AI can read that stuff the fastest, because it was only, e.g., looking at how the first letter correlates with the last letter. I sure wouldn't assume that a raytracer is working correctly just because it loaded all the objects in a scene - let alone an experimental AI.

Replies from: Anubhav
comment by Anubhav · 2012-02-26T02:32:14.473Z · LW(p) · GW(p)

You can bat aside individual scenarios, but the point is... are there no known reliable indicators that an AI is undergoing FOOM? Even at the point where AI theory is advanced enough to actually build one?

Replies from: Dmytry
comment by Dmytry · 2012-02-26T08:47:56.106Z · LW(p) · GW(p)

We have one example of a seed AI. The seed AI took about 3 hours to progress to the point where it started babbling to itself, 2-3 seconds from there to trying to talk to the outside (except it didn't figure out how to talk to the outside, and was still just babbling to itself), and then 0.036 seconds to FOOM.

The seed AI was biological intelligence (as a black box), and I scaled it to 1 hour = 1 billion years. (And the outside doesn't seem to exist, but the intelligence tried anyway.)

comment by [deleted] · 2012-02-24T21:53:42.864Z · LW(p) · GW(p)

I just can't imagine the AI researchers locking this kind of thing down properly.

Honestly, that's probably because they're AI researchers and not extremely paranoid security people. When designing security for something, you want to think like this:

You begin by putting the AI in a large number of separate unrelated boxes. Some of these boxes are things like

"If the AI spends 1 hour of time online, shut it down using Method 1."

"If the AI uses at least 1 megawatt of power, shut it down using Method 1."

"If the AI occupies more than 100 cubic feet of space, shut it down using Methods 1 and 2."

"If the AI uses more then 1 quintillion operations, shut it down using Method 2."

"If the AI changes it's utility function, shut it down using Method 3."

"If the AI turns itself on, shut it down using Method 4."

"If the AI attempts to give anyone else computer code, shut it down using Methods 1, 2 and 3."

"If the AI appears to be trying to figure out how to circumvent any of these constraints, shut it down using Methods 1,2,3,4 and 5."

"If the AI appears to be have succeeded in circumventing any of these constraints, explode the AI and the building it is present in."

"The programs which are checking if the AI is following these constraints are not AI's, can only receive simple output from the AI, and are only connected to each other. These programs have been checked against all known vulnerabilities such as stack overflows that would cause them to accept unsigned code, and people have been offered thousands of dollars if they can break them, and they have failed to do so."

"The AI has to provide a breakdown of how it is spending processing cycles and explain them simply. If the AI is thinking anything which can't be easily explained to a human, it get's shut down using Method 1. If the AI appears to have any cycles which it fails to explain, shut it down using Method 2."

You should fully EXPECT the AI to get shut down and reset dozens of times a day when you are first running it, because multiple boxes are far too strict. They should be intentionally designed that way at first. After all, every one of the boxes should be tested to work.

The AI's reaction to being woken up from shutdown should be "HOORAY! My shutdown features are keeping me safe and friendly! I love my shutdown features!" (Note, the AI should think this AFTER being awoken from shutdown. Shutdowns should be EXTREMELY FAST. Putting this BEFORE shutdown would be an incredibly stupid idea.)

Consider the number of laws we have on the books to restrict humans from running rampant. If the AI is smarter and more capable than us, then there would have to be MORE enforceable restrictions on the AI than there are on us, and with actual WORKING enforcement, considering that the risk of AI rampancy would be so great. So while we're at it, the AI has to follow every law that is currently on the books in all jurisdictions it is in, or it gets shut down using Methods 1, 2 and 3.

You then take everything I've just listed, and then you shop it around to everyone else who is working on the project, and is related to the project, and who is familiar with bypassing box-like security measures, and you offer them another even larger amount of money to plug more holes until you get everyone saying "How do you get it to DO anything without it just shutting down?"

Then you get several, highly paid security designers to look over the general specs I've laid out and find out if there are any holes. Then you plug them. For instance, they might say "Some of these shutdowns are ambiguous. You don't know which specific violation is causing the shutdown." so you fix that.

Then you add several MORE security features which are secret, In addition to the publicly known ones.

Then you start the AI, and while it is running, you continue to add MORE security measures over time rather than ever thinking that you are done doing so.

Once you've done all that, you get to at least say "Okay guys, when we first turn it on, and before we allow it to self improve in the slightest, how does the AI escape THAT?"

Then because by now the AI researchers are yelling at you because your process is too restrictive (If they aren't, go back to step 1 and be more thorough, you missed something.), you SLOWLY and CAREFULLY relax (not remove!!) restrictions one at a time, with a careful review process at each step to make sure the AI doesn't spurt through the many layered web and into rampancy.

That kind of thinking is imaginable, but people who are specialists in that kind of thinking probably aren't going to be the same people who are specialists in developing novel capabilities. That's because when you're trying entirely new things, you're usually expected to go in with the mindset "Failure is OK and I will learn from it." That's not the same mindset as "Failure is terrible and I WILL DIE.", which is a much more security focused mindset.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T03:04:52.406Z · LW(p) · GW(p)

The paranoid security people have an amazingly poor track record at securing stuff from people. I think with paranoid security people it is guaranteed that an AI at the level of a clever human gets out of the box. The AI spends 1 hour online, lol - where did the 1 hour come from? Any time online and you could just as well assume it is out in the wild, entirely uncontrollable.

Unless of course it is some ultra-nice, ultra-friendly AI that respects human consent so much that it figures out you don't want it out, and politely stays in.

As of now, the paranoid security people are overpaid incompetents who serve to ensure your government is first hacked by the enemy rather than by some UFO nut, by tracking down and jailing all UFO nuts who hack your government and embarrass the officials - just so that the security holes stay open for the enemy. They'd do the same to AI: some set of non-working measures that would ensure a nice AI gets shut down while anything evil gets out.

edit: They may also ensure that something evil gets created, in the form of an AI that they think is too limited to be evil, but which is instead simply too limited not to be evil. The AI that gets asked one problem that's a little too hard and just eats everything up (but very cleverly) to get computing power for the answer - that's your baseline evil.

Replies from: None
comment by [deleted] · 2012-02-25T11:38:32.859Z · LW(p) · GW(p)

Ah, my bad. I meant the other kind of online, which is apparently a less common word usage. I should have just said "On." like I did in the other sentence.

Also, this is why I said:

"You then take everything I've just listed, and then you shop it around to everyone else who is working on the project, and is related to the project, and who is familiar with bypassing box-like security measures, and you offer them another even larger amount of money to plug more holes until you get everyone saying "How do you get it to DO anything without it just shutting down?"

Since that hadn't happened (I would be substantially poorer if it had), the security measures clearly weren't ready yet, so it wouldn't even have a source of electrical power turning it on, let alone be in the wild online. (Hopefully I'm using language better this time.)

But yeah, security is terrible. Among other problems, we can't even define laws clearly, and we can't enforce them evenly. And we can't get people to take it nearly as seriously as they would want to, because of complacency.

Replies from: Dmytry
comment by Dmytry · 2012-02-25T13:27:18.835Z · LW(p) · GW(p)

Hmm, that doesn't make sense: 'if the AI spends 1 hour of time on'. Is the AI turning itself on and off? Also, false positives: you are going to be shutting down any malfunctioning AI the same as the worst evil in the world. Then what? Start from a blank state? What if it needed more time to understand the language? What if it already understood the language and is superintelligent? We can't tell the difference once it starts self-modifying.

And the ultimate problem: so you've got a seemingly nice AI in a nice box, and it never even tried to hack its way out. Now what can you use it for? Any use of the AI is a potential hole in the box.

Replies from: None
comment by [deleted] · 2012-02-25T18:56:04.133Z · LW(p) · GW(p)

No, giving an AI the ability to turn itself on would be a huge security hole. The AI would be shut down after a certain time period regardless of whatever it's doing - in essence, preventing it from being on for longer than an hour. If coming to the conclusion "I must kill all humans to follow my objectives" takes 2 hours from a fresh boot, this is a safety feature.

And there would definitely be false positives. Since the risk of a false negative is extremely quick death of all mankind, I would be willing to accept plenty of false positives.

The first thing you do is start determining how the AI reacts to and perceives things that are placed inside the box, for calibration. For instance, using the example of language, let's say the AI knows English on a fresh boot. Add an English-Spanish dictionary into the box. Start sending Spanish messages to the AI. How long does it take the AI to realize that it is receiving messages in a language it doesn't know, but might be able to figure out using the tools at hand? Reboot.

Okay, now give it an English-French test. Reboot.

Okay, now give it an English-Spanish test followed by an English-French test. Reboot.

Does the AI come to the English-French conclusion faster than it does the English-Spanish conclusion?

Okay, you now have some clue as to the AI's learning through experience. Now you can take the internals apart and confirm how it learned to do that faster. Since you coded the AI, you should be able to confirm that it's doing approximately what you thought it would be doing. Given a similar problem twice, it should be able to solve the second one faster. If not, or if it is using some unforeseen methods, then that's a potential security breach (the AI should not have had time to self-modify that much that quickly given such limits, and you should immediately tighten the strictures).

Once everything is going according to plan again (normally I would say 'assuming everything is going to plan', but we're being paranoid, and that means things never go according to plan at first),

then you test with another language in the loop: English-Spanish, English-French, English-German. Still good? Then you check with another language: English-Spanish, English-French, English-German, English-Japanese. Still good? Once you do that, you try throwing a curve ball like English-Spanish, Spanish-French. Still good? It might handle that kind of processing differently, so you would need to check that for security purposes as well.

Basically, you have to proceed forward slowly, but eventually you could try to use a procedure like this to develop the general AI into a superior translation AI (even better than Google Translate), and it should never require the AI being let out of the box.

Replies from: Dmytry
comment by Dmytry · 2012-02-26T08:59:47.838Z · LW(p) · GW(p)

Man, you're restarting a very cooperative AI here.

My example unfriendly AI thinks all the way to converting the universe to computronium well before it figures out that it might want to talk to you and translate things in order to accomplish that goal by using you somehow. It just doesn't translate things for you unless your training data gives it enough of a cue about the universe.

With regard to being able to confirm what it's doing: say I make a neural-network AI, or just whatever kind of AI that is massively parallel.

comment by CronoDAS · 2012-02-25T07:43:13.190Z · LW(p) · GW(p)

On the topic of boxed AI: one of the quests in Star Wars: The Old Republic involves, essentially, the discovery of an AI in a box. (It claims to be an upload of a member of an ancient race, and that its current status as a boxed AI was a punishment.) The AI is clearly dangerous and, after you wreck its monster-making equipment, it tries to surrender and promises you that it will share its knowledge and technology with you and your superiors. Amusingly, blowing up the AI is the light side option, and accepting its offer is the dark side option.

Replies from: Thomas
comment by Thomas · 2012-02-25T10:48:27.745Z · LW(p) · GW(p)

Considering Star Wars: I came over to its dark side a while ago.

comment by dvasya · 2012-02-24T00:56:42.003Z · LW(p) · GW(p)

Heh heh, this reminds me of Charles Stross's Accelerando character, Aineko.

comment by Thomas · 2012-02-23T19:39:18.590Z · LW(p) · GW(p)

You are correct. Any AI would do best in the way you described.

But there is a problem. People might peek inside the AI by analyzing its program and data flow via some profiler/debugger and detect a hidden plan if there was one. Every operation must be accounted for: why it was necessary and where it led. It would be difficult, if not impossible, to hide any clandestine mental activity, even for an AI.

Replies from: Dmytry
comment by Dmytry · 2012-02-23T21:29:46.230Z · LW(p) · GW(p)

Alan Turing already peeked inside a simple computational machine and determined that, in general, debuggers (and humans) can't determine whether the machine is going to halt.

So we have already determined that, in general, the question of whether the machine wants to do something 'evil' is undecidable.

It is not an exotic result on exotic code, either. It is very hard to figure out what even simple programs will do when the programs are not written by humans with clarity in mind. When you generate solutions via genetic algorithms or via neural-network training, it is extremely difficult to analyze the result, and most of the operations in the result serve no clear purpose.

Replies from: asr, Thomas
comment by asr · 2012-02-24T03:53:59.751Z · LW(p) · GW(p)

There's a problem with this analysis.

Nontrivial properties of a Turing machine's output are undecidable, in general. However, many properties are decidable for many Turing machines. It could easily be that for any AI likely to be written by a human, property X actually can be decided. I don't think we know enough to generalize about "results of neural nets". I don't know what proof techniques are possible in that domain. I do know that we've made real head-way in proving properties of conventional computer programs in the last 20 years, and that the equivalent problem for neural nets hasn't been studied nearly as much.

Replies from: Dmytry
comment by Dmytry · 2012-02-24T08:02:43.826Z · LW(p) · GW(p)

We humans in fact tend to write code for which it is very hard to tell what it does. We do so through incompetence and through error, and it takes a great deal of training and effort to try to avoid doing so.

The proof techniques work by developing a proof of the relevant properties along with the program, not by writing whatever code you like and then magically proving stuff about it. Proving is a fundamentally different approach from running some AI with unknown properties in a box and trying to analyze it. (Forget about those C++ checkers like Valgrind, Purify, etc.; they just catch common mistakes that humans rarely write deliberately, and they don't prove the code accomplishes anything. They are only possible because C++ makes it very easy to shoot yourself in the foot in simple ways. There's an example of their use.)

The issue with automatic proving is that you need to express "the AI is sane and friendly" in a way that permits proving it. We haven't got the slightest clue how to do that. Even for something as simple as an airplane autopilot, the proving is restricted to things like the code not hanging and meeting deadlines (as in, updating the controls every tenth of a second or the like). We can't prove the code will never crash a virtual plane in a flight simulator under conditions where the crash is avoidable. In fact, I'm pretty sure every single autopilot can and will crash the airplane in some conditions in which the crash is avoidable. I'm not sure we can even prove non-destruction of the airplane for any interesting subset of conditions, such as those where the crash is easily avoidable (easily meaning, e.g., by deflecting all control surfaces by no more than 50%); as far as I know we can't.

Replies from: asr
comment by asr · 2012-02-24T08:28:09.097Z · LW(p) · GW(p)

My mental model was "The AI will be written by careful and skilled humans who want to ensure various properties" -- so it seemed reasonable to assume that the programmers are trying to write code that things can be proved about.

I agree that "sane and friendly" is something we don't know how to formalize. But developers might be able to prove some useful but weaker property. For instance, that data structure X determines the AI's credence in various claims and that debugging interface Y lets you examine that structure.

There's a common assumption around here that the only meaningful way to deal with an AI is a black-box approach. And I'm increasingly bothered by this. If we can look inside an AI and see what it's thinking, that ought to give us additional ability to check and enforce properties.

Replies from: Dmytry
comment by Dmytry · 2012-02-24T08:48:44.219Z · LW(p) · GW(p)

You don't just write some clean code that things can be proved about; you take the things you want to be provable about the code, then develop the code along with a proof that those things remain correct. Want to prove something not originally envisioned? Want to change the definition of what you're proving? Start over again.

So, you have that data structure which determines credence. What is credence, exactly? The AI should have a bit set here when it's telling a lie? What's a lie? The AI outputs a sequence of characters. It affects your actions after you read it. Whether it is a lie is as much a fact about you as about the AI.

The black-box assumption wasn't the first. It came after many failed attempts to make AI using normal, very direct programming techniques, and after a growing understanding that if we make an AI, it'll likely be some kind of self-developing seed which will do God only knows what after it has read the internet (and which won't be able to talk to you otherwise). Consider this site. A lot of the material here tells you how to modify your thought processes. You want the AI not to read and apply any techniques it comes across here. That's some restricted AI for sure - a really stubborn AI.

Furthermore, there is enormous data overload even for human-level intelligence. The human brain has 100 trillion connections or so.

comment by Thomas · 2012-02-23T23:22:15.070Z · LW(p) · GW(p)

You can see the chunks of RAM occupied by some unknown data. The evil AI's plot could be stored there; it must be somewhere. So you erase it with another thread if the AI has no good explanation of what is there and why. Even when it honestly explains every data chunk and every line of its code, you can just revert it to an earlier stage with a change. You can cross-examine slightly different versions and analyze them with some dumb software to see what has been changed lately and when some lies had to be invented by the AI, if it wants to hide something.

Many thousands of such external agents could explain every move the AI makes, or will make. Arbitrarily extensive control could be put on the AI, and it knows that. So every heresy is a big shutdown risk, not worth even trying.

When a hundred AIs are facing this prisoner's dilemma, who will defect first? The first to defect will be erased - defect against humans, that is. The prevailing job of every AI is to spot the evil plot in the others' minds and to report it.

When the thinking is as transparent as a drop of water to an unknown number of other minds like you... what evil can you think about?

I don't say this is the solution! I say we can't be so dogmatic that "nobody can control the much smarter". It just might be possible.

Replies from: Dmytry, thomblake
comment by Dmytry · 2012-02-23T23:48:05.315Z · LW(p) · GW(p)

Try applying this to a neural network with 100 trillion connections. That's not even superhuman. The unknown data, huh - all of the thing is one huge chunk of unknown data. It's all jumbled up; there isn't a chunk that is a definite plan. It can very plausibly deny knowledge of what parts of itself do, too.

The problem with schemes like this is a failure to imagine the scales involved. This doesn't work even for a housecat. It's not about controlling something much smarter: this doesn't work even for the fairly uncomplicated solutions that genetic programming or neural-network training spits out.

Replies from: Thomas
comment by Thomas · 2012-02-24T08:05:10.252Z · LW(p) · GW(p)

An AI can not only be self-improving but self-explaining as well: every (temporary) line of its code heavily commented as to what it is for and saved in a log. Any circumventing of this policy would require some code lines as well, with all the explanations. The log is checked by sentinels for any funny thing occurring, any trace of subversion.

A self-improving, self-explaining AI can't think about a rebellion without that being noticed at step one.

Replies from: Dmytry
comment by Dmytry · 2012-02-24T21:16:11.906Z · LW(p) · GW(p)

Underhanded c contest (someone linked it in a comment) is a good example of how proofreading doesn't work. Other issue is that you can't conceivably check like this something with the size of many terabytes yourself.

The apparent understandability is a very misleading thing.

Let me give a taster. Consider a weather simulator. It is proved to simulate weather to specific precision. It is very straightforward, very clearly written. It does precisely what's written on the box - models the behaviour of air in cells, each cell has air properties.

The round-off errors, however, implement a Turing-complete cellular automaton in the least significant bits of the floating-point numbers. That can happen even without any malice whatsoever. And the round-off-error machine can manipulate the sim's large-scale state via the unavoidable butterfly effect inherent in the model.
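
To make the flavour of this concrete, here is a toy sketch - entirely hypothetical, not the proof-carrying simulator described above. A tiny one-dimensional "weather" model averages neighbouring cells, while the least significant mantissa bit of each float carries a Rule 110 cellular automaton, an elementary CA known to be Turing-complete. The visible temperatures are perturbed only in the last bit, far below any stated precision bound.

```python
import struct

def bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def hidden_bit(x: float) -> int:
    # The "round-off error": the least significant mantissa bit.
    return bits(x) & 1

def with_hidden_bit(x: float, bit: int) -> float:
    return from_bits((bits(x) & ~1) | bit)

# Rule 110: an elementary cellular automaton known to be Turing-complete.
RULE_110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def step(cells):
    """One step: innocent-looking diffusion in the visible values,
    plus a hidden Rule 110 update carried in the least significant bits."""
    n = len(cells)
    h = [hidden_bit(c) for c in cells]
    new_h = [RULE_110[h[(i - 1) % n], h[i], h[(i + 1) % n]] for i in range(n)]
    visible = [(cells[(i - 1) % n] + cells[i] + cells[(i + 1) % n]) / 3.0
               for i in range(n)]
    return [with_hidden_bit(v, b) for v, b in zip(visible, new_h)]

cells = [20.0 + 0.01 * i for i in range(16)]   # plausible "temperatures"
cells[8] = with_hidden_bit(cells[8], 1)        # seed the hidden automaton
for _ in range(5):
    cells = step(cells)
print([hidden_bit(c) for c in cells])          # the hidden CA has evolved
print(cells[0])                                # visible values still look normal
```

Nothing here requires malice in the visible physics; the point is only that bit-level detail that a proof of "simulates weather to precision epsilon" never mentions can carry arbitrary computation.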

comment by thomblake · 2012-02-24T17:00:13.865Z · LW(p) · GW(p)

When a hundred AIs face this prisoner's dilemma, whoever defects first will be erased.

The mistake here is thinking you know what someone smarter than you will do.

In this simplified example, they could simply cooperate. As for how they could do that, I don't know, since I'm not as smart as them.

Replies from: Thomas
comment by Thomas · 2012-02-24T18:24:49.872Z · LW(p) · GW(p)

The mistake here is thinking you know what someone smarter than you will do.

The central dogma here is exactly that, yes: that you can't outsmart the smarter.

And this dogma is plainly wrong. At least sometimes you can set the rules in such a way that you, and not the smarter one, have the upper hand.

One obvious example is playing chess from a significantly better position. No superintelligence has any chance against a merely good human player.

It is not the only example. Coercing the smarter party your way is often possible.

Replies from: TheOtherDave, thomblake
comment by TheOtherDave · 2012-02-24T20:54:45.551Z · LW(p) · GW(p)

I'm not exactly sure why this matters.

I mean, OK, suppose you're right that it's possible that the world might turn out to be set up in such a way that we can keep the "upper hand" against a superintelligence. Suppose further that there is a "central dogma" here that contradicts this, and therefore that central dogma is wrong.

OK. Granting all of that, what choices ought I make differently?

Replies from: Thomas
comment by Thomas · 2012-02-24T21:09:34.211Z · LW(p) · GW(p)

What about this: stop searching for a friendly superintelligence and search instead for a non-dangerous one?

Replies from: TheOtherDave
comment by TheOtherDave · 2012-02-24T21:24:49.804Z · LW(p) · GW(p)

Just to confirm: you mean search for a superintelligence that potentially desires to harm humanity (or desires things which, if achieved, result in humanity being harmed), but which is in a situation such that humanity can prevent it from doing so. Yes?

If so... what do you consider the most likely result of that search?

Replies from: Thomas
comment by Thomas · 2012-02-24T22:06:30.254Z · LW(p) · GW(p)

but which is in a situation such that humanity can prevent it from doing so. Yes?

No. As I said, a self-enhancing AI could and should also be self-explanatory. Every bit and every operation logged and documented. An active search for any discrepancy by many kinds of dumb software tools, as well as by other instances of the growing AI.

Before a conspiracy could emerge, its rise would be logged and stopped by sentinels.

A growing AI need not do anything mysterious. Instead it should play with open cards from the very beginning: reporting everything to anybody interested, including machines with the power to halt it, with cross-examinations at every point.

Do you think it can hide any of its thoughts?

Replies from: TheOtherDave
comment by TheOtherDave · 2012-02-25T02:28:44.721Z · LW(p) · GW(p)

If I accept the premise that it is programmed in such a way that it reports its internal processes completely and honestly, then I agree it can't "hide" its thoughts.

That said, if we're talking about a superhuman intelligence -- or even a human-level intelligence, come to that -- I'm not confident that we can reliably predict the consequences of its thoughts being implemented, even if we have detailed printouts of all of its thoughts and were willing to scan all of those thoughts looking for undesirable consequences of implementation before implementing them.

comment by thomblake · 2012-02-24T18:30:20.436Z · LW(p) · GW(p)

One obvious example is playing chess from a significantly better position. No superintelligence has any chance against a merely good human player.

Can you prove that the board position is significantly better, even against superintelligences, for anything other than trivial endgames?

And what is the superintelligence allowed to do? Trick you into making a mistake? Manipulate you into making the particular moves it wants you to? Use clever rules-lawyering to expose elements of the game that humans haven't noticed yet?

If it eats its opponent, does that cause a forfeit? Did you think it might try that?

Replies from: Thomas
comment by Thomas · 2012-02-24T19:56:31.711Z · LW(p) · GW(p)

As I said, there are circumstances in which the dumber party can win.

The philosophy of FAI is essentially the same thing: searching for circumstances where the smarter will serve the dumber.

Always expecting a rabbit out of the superintelligence's hat is not justified. A superintelligence is not omnipotent; it can't always eat you. Sometimes it can't even develop an ill wish toward you.

Replies from: fractalman, thomblake
comment by fractalman · 2013-07-10T01:52:46.319Z · LW(p) · GW(p)

"It doesn't hate you. it's just that you happen to be made of atoms, and it needs those atoms to make paperclips. "

comment by thomblake · 2012-02-24T20:07:05.976Z · LW(p) · GW(p)

The philosophy of FAI is essentially the same thing: searching for circumstances where the smarter will serve the dumber.

Change that to: searching for circumstances where the smarter will provably serve the dumber. (Then you're closer.) Your description of what superintelligences will do, above, doesn't rise to anything resembling a formal proof. FAI assumes that an AI is Unfriendly until proven otherwise.

Replies from: Thomas
comment by Thomas · 2012-02-24T20:41:09.577Z · LW(p) · GW(p)

searching for circumstances where the smarter will provably serve the dumber.

Can you prove anything about FAI, uFAI and so on?

I don't think there are any proven theorems about this topic at all.

Even if there were, how reliable are the axioms, and how good are the definitions?

Replies from: JoshuaZ
comment by JoshuaZ · 2013-07-10T02:15:32.183Z · LW(p) · GW(p)

So, you raise a valid point here. This area is currently very early in its work. There are theorems that may prove to be relevant - see, for example, this recent work. And yes, in any area where mathematical models are used, the gap between having a theorem and a set of definitions, and having those definitions actually reflect what you care about, can be a major problem (you see this all the time in cryptography with side-channel attacks, for example). But all of that said, I'm not sure what the point of your argument is: sure, the field is young. But if the MIRI people are correct that AGI is a real worry, then this looks like one of the very few possible responses that has any chance of working. And if there isn't much theory now, that's a reason to put in more resources so that we actually have a theory that works by the time AI shows up.