Evaluating the feasibility of SI's plan

post by JoshuaFox · 2013-01-10T08:17:29.959Z · LW · GW · Legacy · 187 comments

(With Kaj Sotala)

SI's current R&D plan seems to go as follows: 

1. Develop the perfect theory.
2. Implement this theory as a safe, working Artificial General Intelligence (AGI) -- and do so before anyone else builds an AGI.

The Singularity Institute is almost the only group working on Friendliness theory (albeit with very few researchers), so it has the lead on Friendliness. But there is no reason to think it will be ahead of anyone else on implementation.

The few AGI designs we can look at today, like OpenCog, are big, messy systems that deliberately exploit a variety of cognitive dynamics which may combine in unanticipated ways, and that have assorted human-like drives rather than the supergoal-driven, utility-maximizing goal hierarchies that Eliezer talks about, or that a mathematical abstraction like AIXI employs.

A team which is ready to adopt a variety of imperfect heuristic techniques will have a decisive lead over approaches based on pure theory. Without the constraint of safety, one such team will beat SI in the race to AGI. SI cannot ignore this. Real-world, imperfect safety measures for real-world, imperfect AGIs are needed. These may involve mechanisms for avoiding undesirable dynamics in heuristic systems, AI-boxing toolkits usable in the pre-explosion stage, or something else entirely.

SI’s hoped-for theory will include a reflexively consistent decision theory, something like a greatly refined Timeless Decision Theory.  It will also describe human value as formally as possible, or at least describe a way to pin it down precisely, something like an improved Coherent Extrapolated Volition.

The hoped-for theory is intended to  provide not only safety features, but also a description of the implementation, as some sort of ideal Bayesian mechanism, a theoretically perfect intelligence.

SIers have said to me that SI's design will have a decisive implementation advantage. The idea is that because strap-on safety can't work, Friendliness research necessarily involves more fundamental architectural design decisions, which also happen to be general AGI design decisions that another AGI builder could grab to save themselves a lot of effort. The assumption seems to be that all other designs are based on hopelessly misguided design principles. SIers, the idea seems to go, are so smart that they'll build AGI long before anyone else. Others will succeed only when hardware capabilities allow crude near-brute-force methods to work.

Yet even if the Friendliness theory provides the basis for intelligence, the nitty-gritty of SI's implementation will still be a long way off, and will involve real-world heuristics and other compromises.

We can compare SI's future AI design to AIXI, another mathematically perfect AI formalism (though one with some critical reflexivity issues). Schmidhuber, Hutter, and colleagues think that their AIXI can be scaled down into a feasible implementation, and have implemented some toy systems. Similarly, any actual AGI based on SI's future theories will have to stray far from its mathematically perfected origins.
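For concreteness, the AIXI expectimax expression (roughly as Hutter states it) is:

\[
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \, [r_k + \cdots + r_m] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]

The inner sum ranges over every program q for a universal Turing machine U, weighted by \(2^{-\ell(q)}\) where \(\ell(q)\) is the program's length. That sum is incomputable, which is why even the scaled-down variants (AIXItl, the Monte Carlo approximations) must replace it with drastic finite approximations.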

Moreover, SI's future Friendliness proof may simply be wrong. Eliezer writes a lot about logical uncertainty, the idea that you must treat even purely mathematical ideas with the same probabilistic techniques as any ordinary uncertain belief. He pursues this mostly so that his AI can reason about itself, but the same principle applies to Friendliness proofs as well.

Perhaps Eliezer thinks that a heuristic AGI is absolutely doomed to failure; that a hard takeoff very soon after the creation of the first AGI is so overwhelmingly likely that a mathematically designed AGI is the only one that could stay Friendly. In that case, we have to work on a pure-theory approach, even if it has a low chance of being finished first. Otherwise we'll be dead anyway. If an embryonic AGI will necessarily undergo an intelligence explosion, we have no choice but to "shut up and do the impossible."

I am all in favor of gung-ho, knife-between-the-teeth projects. But when you think that your strategy is impossible, then you should also look for a strategy which is possible, if only as a fallback. Thinking about safety theory until drops of blood appear on your forehead (as Eliezer puts it, quoting Gene Fowler) is all well and good. But if there is only a 10% chance of achieving 100% safety (not that there really is any such thing), then I'd rather go for a strategy that provides only a 40% promise of safety, but with a 40% chance of achieving it. OpenCog and the like are going to be developed regardless, and probably before SI's own provably Friendly AGI. So even an imperfect safety measure is better than nothing.
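Spelled out, the expected-value comparison is (on the simplest reading, treating a "promise of safety" as a survival probability and counting an unfinished theory as no safety at all):

\[
\underbrace{0.10 \times 1.00}_{\text{pure theory}} = 0.10
\qquad \text{vs.} \qquad
\underbrace{0.40 \times 0.40}_{\text{imperfect measures}} = 0.16 .
\]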

Perhaps heuristic approaches have a 99% chance of producing an immediate unfriendly explosion -- but that estimate might be wrong. SI, better than anyone, should know that any intuition-based probability estimate of "99%" really means "70%". Even if other approaches are long shots, we should not put all our eggs in one basket. Theoretical perfection and stopgap safety measures can be developed in parallel.

Given what we know about human overconfidence and the general unreliability of predictions, the actual outcome will to a large extent be something that none of us ever expected or could have predicted. Whatever happens, progress on safety mechanisms for heuristic AGI will improve our chances when something entirely unexpected occurs.

What impossible thing should SI be shutting up and doing? For Eliezer, it’s Friendliness theory. To him, safety for heuristic AGI is impossible, and we shouldn't direct our efforts in that direction. But why shouldn't safety for heuristic AGI be another impossible thing to do?

(Two impossible things before breakfast … and maybe a few more? Eliezer seems to be rebuilding logic, set theory, ontology, epistemology, axiology, decision theory, and more, mostly from scratch. That's a lot of impossibles.)

And even if safety for heuristic AGIs is really impossible for us to figure out now, there is some chance of an extended soft takeoff that will allow us to develop heuristic AGIs which help in figuring out AGI safety, whether because we can use them in our tests, or because they can help by applying their embryonic general intelligence to the problem. Goertzel and Pitt have urged this approach.

Yet resources are limited. Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization which is really working on a formal theory of Friendliness, and so should concentrate on that. It could be better to focus SI's resources on areas in which it has a relative advantage, or which have a greater expected impact.

Even if so, SI should evangelize AGI safety to other researchers, not only as a general principle, but also by offering theoretical insights that may help them as they work on their own safety mechanisms.

In summary:

1. AGI development which is unconstrained by a Friendliness requirement is likely to beat a provably Friendly design in a race to implementation, and some effort should be expended on dealing with this scenario.

2. Pursuing a provably Friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it were certain that we'll have a hard takeoff very soon after the creation of the first AGIs. However, we do not know whether or not this is true.

3. Even a provably Friendly design will face real-world compromises and errors in its implementation, so the implementation will not itself be provably Friendly. Thus, safety protections of the sort needed for heuristic designs are needed even for a theoretically Friendly design.

187 comments

Comments sorted by top scores.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-10T15:50:40.759Z · LW(p) · GW(p)

Lots of strawmanning going on here (could somebody else please point these out? please?) but in case it's not obvious, the problem is that what you call "heuristic safety" is difficult. Now, most people haven't the tiniest idea of what makes anything difficult to do in AI and are living in a verbal-English fantasy world, so of course you're going to get lots of people who think they have brilliant heuristic safety ideas. I have never seen one that would work, and I have seen lots of people come up with ideas that sound to them like they might have a 40% chance of working and which I know perfectly well to have a 0% chance of working.

The real gist of Friendly AI isn't some imaginary 100% perfect safety concept, it's ideas like, "Okay, we need to not have a conditionally independent chance of goal system warping on each self-modification because over the course of a billion modifications any conditionally independent probability will sum to ~1, but since self-modification is initially carried out in the highly deterministic environment of a computer chip it looks possible to use crisp approaches that avert a conditionally independent failure probability for each self-modification." Following this methodology is not 100% safe, but rather, if you fail to do that, your conditionally independent failure probabilities add up to 1 and you're 100% doomed.
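A back-of-the-envelope sketch of the arithmetic behind this claim (illustrative numbers only, assuming each self-modification fails independently with probability p):

    import math

    def survival(p, n):
        """Probability the goal system survives n self-modifications,
        each with an independent failure probability p: (1 - p)**n."""
        return math.exp(n * math.log1p(-p))

    n = 10**9                  # "a billion modifications"
    print(survival(1e-6, n))   # ~0: a one-in-a-million slip per step is near-certain doom at this scale
    print(survival(1e-15, n))  # ~0.999999: the per-step failure probability has to be driven toward zero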

But if you were content with a "heuristic" approach that you thought had a 40% chance of working, you'll never think through the problem in enough detail to realize that your doom probability is not 60% but ~1, because only somebody holding themselves to a higher standard than "heuristic safety" would ever push their thinking far enough to realize that their initial design was flawed.

People at SI are not stupid. We're not trying to achieve lovely perfect safety with a cherry on top because we think we have lots of luxurious time to waste and we're perfectionists. I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them. There are of course lots of people who think that they don't have to solve the same problems, but that's because they're living in a verbal-English fantasy world in which their map is so blurry that they think lots of things "might be possible" that a sharper map would show to be much more difficult than they sound.

I don't know how to take a self-modifying heuristic soup in the process of going FOOM and make it Friendly. You don't know either, but the problem is, you don't know that you don't know. Or to be more precise, you don't share my epistemic reasons to expect that to be really difficult. When you engage in sufficient detail with a problem of FAI, and try to figure out how to solve it given that the rest of the AI was designed to allow that solution, it suddenly looks that much harder to solve under sloppy conditions. Whereas on the "40% safety" approach, it seems like the sort of thing you might be able to do, sure, why not...

If someday I realize that it's actually much easier to do FAI than I thought, given that you use a certain exactly-right approach - so easy, in fact, that you can slap that exactly-right approach on top of an AI system that wasn't specifically designed to permit it, an achievement on par with hacking Google Maps to play chess using its route-search algorithm - then that epiphany will be as the result of considering things that would work and be known to work with respect to some subproblem, not things that seem like they might have a 40% chance of working overall, because only the former approach develops skill.

I'll leave that as my take-home message - if you want to imagine building plug-in FAI approaches, isolate a subproblem and ask yourself how you could solve it and know that you've solved it, don't imagine overall things that have 40% chances of working. If you actually succeed in building knowledge this way I suspect that pretty soon you'll give up on the plug-in business because it will look harder than building the surrounding AI yourself.

Replies from: wwa, Kaj_Sotala, JoshuaFox, JoshuaFox, OrphanWilde, magfrump, JoshuaFox, timtyler, timtyler, shminux
comment by wwa · 2013-01-10T18:08:20.582Z · LW(p) · GW(p)

full disclosure: I'm a professional cryptography research assistant. I'm not really interested in AI (yet) but there are obvious similarities when it comes to security.

I have to back Eliezer up on the "Lots of strawmanning" part. No professional cryptographer will ever tell you there's hope in trying to achieve a "perfect level of safety" for anything, and cryptography, unlike AI, is a very well-formalized field. As an example, I'll offer a conversation with a student:

  • How secure is this system? (Such a question is usually shorthand for: "What's the probability this system won't be broken by methods X, Y and Z?")

  • The theorem says … (a bound stated in terms of some security parameter N)

  • What's the probability that the proof of the theorem is correct?

  • ... probably not

Now, before you go "yeah, right", I'll also say that I've already seen this once - there was a theorem in a major peer-reviewed journal that turned out to be wrong (a counter-example was found) after one of the students tried to implement it as a part of his thesis - so the probability was indeed not even close to the proven bound for any serious N. I'd like to point out that this doesn't even include problems with the implementation of the theory.

It's really difficult to explain how hard this stuff really is to people who have never tried to develop anything like it. That's too bad (and a danger), because the people who do get it are rarely in charge of the money. That's one reason for the CFAR/rationality movement... you need a tool to explain it to other people too, am I right?

Replies from: gwern, JoshuaFox, Pentashagon
comment by gwern · 2013-01-10T18:52:15.208Z · LW(p) · GW(p)

Now, before you go "yeah, right", I'll also say that I've already seen this once - there was a theorem in a major peer-reviewed journal that turned out to be wrong (a counter-example was found) after one of the students tried to implement it as a part of his thesis - so the probability was indeed not even close to the proven bound for any serious N. I'd like to point out that this doesn't even include problems with the implementation of the theory.

Yup. Usual reference: "Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes". (I also have an essay on a similar topic.)

Replies from: wwa, JoshuaFox
comment by wwa · 2013-01-10T21:15:11.224Z · LW(p) · GW(p)

Upvoted for being gwern i.e. having a reference for everything... how do you do that?

Replies from: gwern
comment by gwern · 2013-01-10T21:19:47.594Z · LW(p) · GW(p)

Excellent visual memory, great Google & search skills, a thorough archive system, thousands of excerpts stored in Evernote, and essays compiling everything relevant I know of on a topic - that's how.

(If I'd been born decades ago, I'd probably have become a research librarian.)

Replies from: mapnoterritory
comment by mapnoterritory · 2013-01-10T23:33:15.798Z · LW(p) · GW(p)

Would love to read a gwern-essay on your archiving system. I use evernote, org-mode, diigo and pocket and just can't get them streamlined into a nice workflow. If evernote adopted diigo-like highlighting and let me seamlessly edit with Emacs/org-mode that would be perfect... but alas until then I'm stuck with this mess of a kludge. Teach us master, please!

Replies from: gwern, siodine
comment by gwern · 2013-01-11T01:20:53.142Z · LW(p) · GW(p)

I meant http://www.gwern.net/Archiving%20URLs

Replies from: mapnoterritory
comment by mapnoterritory · 2013-01-11T08:22:32.220Z · LW(p) · GW(p)

Of course you already have an answer. Thanks!

comment by siodine · 2013-01-10T23:45:45.663Z · LW(p) · GW(p)

Why do you use diigo and pocket? They do the same thing. Also, with Evernote's Clearly you can highlight articles.

You weren't asking me, but I use diigo to manage links to online textbooks and tutorials, shopping items, book recommendations (through amazon), and my less-important online articles-to-read list. Evernote for saving all of my important read content (and I tag everything). Amazon's send-to-kindle extension to read longer articles (every once in a while I'll save all my clippings from my kindle to evernote). And then I maintain a personal wiki and collection of writings using markdown with evernote's import-folder function in the pc software (I could also do this with a cloud service like gdrive).

Replies from: mapnoterritory
comment by mapnoterritory · 2013-01-11T08:31:03.311Z · LW(p) · GW(p)

I used diigo for annotation before Clearly had highlighting. Now, just as you do, I use diigo for link storage and Evernote for content storage. Diigo annotation still has the advantage that it excerpts the text you highlight. With Clearly, if I want to keep the highlighted parts I have to find and manually select them again... Also, tagging from Clearly requires 5 or so clicks, which is ridiculous... But I hope it will get fixed.

I plan to use pocket once I get a tablet... it is pretty and convenient, but the most likely to get cut out of the workflow.

Thanks for the tip about the evernote import function - I'll look into it; maybe it could make the Evernote - org-mode integration tighter. Even then, having 3 separate systems is not quite optimal...

comment by JoshuaFox · 2013-01-10T20:25:58.897Z · LW(p) · GW(p)

Thanks, I've read those. Good article.

So, what is our backup plan when proofs turn out to be wrong?

Replies from: gwern
comment by gwern · 2013-01-10T21:03:33.692Z · LW(p) · GW(p)

The usual disjunctive strategy: many levels of security, so an error in one is not a failure of the overall system.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2013-01-11T00:24:07.959Z · LW(p) · GW(p)

What kind of "levels of security" do you have in mind? Can they guard against an error like "we subtly messed up the FAI's decision theory or utility function, and now we're stuck with getting 1/10 of the utility out of the universe that we might have gotten"?

Replies from: gwern
comment by gwern · 2013-01-11T01:22:02.707Z · LW(p) · GW(p)

Boxing is an example of a level of security: the wrong actions can trigger some invariant and signal that something went wrong with the decision theory or utility function. I'm sure security could be added to the utility function as well: maybe some sort of conservatism along the lines of the suicide-button invariance, where the AI leaves the Earth alone and so we get a lower bound on how disastrous a mistake can be. Lots of possible precautions and layers, each of which can be flawed (as Eliezer has demonstrated for boxing) but which together are hopefully better than any one alone.

Replies from: Eliezer_Yudkowsky, Wei_Dai, JoshuaFox
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-11T15:57:06.085Z · LW(p) · GW(p)

the wrong actions can trigger some invariant and signal that something went wrong with the decision theory or utility function

That's not 'boxing'. Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid. What you're describing is some case where we think that even after 'proving' some set of invariants, we can still describe a high-level behavior X such that detecting X either indicates global failure with high-enough probability that we would want to shut down the AI after detecting any of many possible things in the reference class of X, or alternatively, we think that X has a probability of flagging failure and that we afterward stand a chance of doing a trace-back to determine more precisely if something is wrong. Having X stay in place as code after the AI self-modifies will require solving a hard open problem in FAI for having a nontrivially structured utility function such that X looks instrumentally like a good thing (your utility function must yield, 'under circumstances X it is better that I be suspended and examined than that I continue to do whatever I would otherwise calculate as the instrumentally right thing'). This is how you would describe on a higher level of abstraction an attempt to write a tripwire that immediately detects an attempt to search out a strategy for deceiving the programmers as the goal is formed and before the strategy is actually searched.

There's another class of things Y where we think that humans should monitor surface indicators because a human might flag something that we can't yet reify as code, and this potentially indicates a halt-melt-and-catch-fire-worthy problem. This is how you would describe on a higher level of abstraction the 'Last Judge' concept from the original CEV essay.

All of these things have fundamental limitations in terms of our ability to describe X and monitor Y; they are fallback strategies rather than core strategies. If you have a core strategy that can work throughout, these things can flag exceptions indicating that your core strategy is fundamentally not working and you need to give up on that entire strategy. Their actual impact on safety is that they give a chance of detecting an unsafe approach early enough that you can still give up on it. Meddling dabblers invariably want to follow a strategy of detecting such problems, correcting them, and then saying afterward that the AI is back on track, which is one of those things that is suicide that they think might have an 80% chance of working or whatever.

Replies from: gwern, JoshuaFox, MugaSofer
comment by gwern · 2013-01-11T16:45:31.011Z · LW(p) · GW(p)

That's not 'boxing'. Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid.

That was how you did your boxing experiments, but I've never taken it to be so arbitrarily limited in goals, capacities, or strategies on either end. There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory, and this would be merely sane for anyone doing such a thing.

Replies from: Eliezer_Yudkowsky, shminux, timtyler
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-11T17:33:11.502Z · LW(p) · GW(p)

Be specific? What sort of triggers, what sort of dangerous territory? I can't tell if you're still relying on a human to outwit a transhuman or talking about something entirely different.

Replies from: MugaSofer
comment by MugaSofer · 2013-01-27T16:11:48.382Z · LW(p) · GW(p)

I can't tell if you're still relying on a human to outwit a transhuman or talking about something entirely different.

Depends on what you mean by "outwitting". A human outwitting a transhuman by persuading it to solve problems for them, without letting it do whatever it wants, is different to a human outwitting a transhuman by creating a security system without any holes the transhuman can find.

comment by Shmi (shminux) · 2013-01-11T17:31:23.292Z · LW(p) · GW(p)

There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory

A transhuman intelligence ought to be able to model a human one with ease. This means being able to predict potential triggers and being able to predict how to trick the lack-wit humans on the other end into unwittingly revealing the location of the triggers (even if they don't consciously know it themselves). So the only trigger that matters is one to detect a hint of an intent to get out. Even that is probably too naive, as there could well be other failure modes of which AI deboxing is but a side effect, and our limited human imagination is never going to catch them all. My expectation is that if you rely on safety triggers to bail you out (instead of including them as a desperate last-ditch pray-it-works defense), then you might as well not bother with boxing at all.

Replies from: gwern
comment by gwern · 2013-01-11T17:54:56.738Z · LW(p) · GW(p)

Even that is probably too naive, as there could well be other failure modes of which AI deboxing is but a side effect, and our limited human imagination is never going to catch them all. My expectation is that if you rely on safety triggers to bail you out (instead of including them as a desperate last-ditch pray-it-works defense), then you might as well not bother with boxing at all.

My whole point of 'defense in depth' was that each layer was highly fallible and could have errors. Your expectation only holds if you expect failures to be perfectly correlated, or if multiple layers actually reduce the strength of the other layers; otherwise the probability of the AI beating both layers A and B is necessarily less than the probability of beating just A or just B (P(A ∧ B) < P(A ∨ B)).
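A quick numerical sketch of this point (illustrative numbers only, assuming the layers' failures are independent):

    p_a, p_b = 0.5, 0.5                   # suppose each fallible layer is breached half the time
    independent = p_a * p_b               # 0.25: the AI must get through A *and* B
    perfectly_correlated = min(p_a, p_b)  # 0.5: whatever beats A also beats B, so B adds nothing
    print(independent, perfectly_correlated)
    assert independent <= min(p_a, p_b)   # stacking independent layers can only lower the breach probability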

Replies from: shminux, TheOtherDave
comment by Shmi (shminux) · 2013-01-11T18:59:29.902Z · LW(p) · GW(p)

Your expectation only holds if you expect failures to be perfectly correlated, or if multiple layers actually reduce the strength of the other layers; otherwise the probability of the AI beating both layers A and B is necessarily less than the probability of beating just A or just B (P(A ∧ B) < P(A ∨ B)).

That's true. However I would expect a transhuman to be able to find a single point of failure which does not even occur to our limited minds, so this perfect correlation is a virtual certainty.

Replies from: gwern
comment by gwern · 2013-01-11T20:17:34.831Z · LW(p) · GW(p)

Now you're just ascribing magical powers to a potentially-transhuman AI. I'm sure there exists such a silver bullet, in fact by definition if security isn't 100%, that's just another way of saying there exists a strategy which will work; but that's ignoring the point about layers of security not being completely redundant with proofs and utility functions and decision theories, and adding some amount of safety.

Replies from: shminux
comment by Shmi (shminux) · 2013-01-11T20:21:29.364Z · LW(p) · GW(p)

Disengaging.

comment by TheOtherDave · 2013-01-11T18:32:18.950Z · LW(p) · GW(p)

As I understand EY's point, it's that (a) the safety provided by any combination of defenses A, B, C, etc. around an unboundedly self-optimizing system with poorly architected goals will be less than the safety provided by such a system with well architected goals, and that (b) the safety provided by any combination of defenses A, B, C, etc. around such a system with poorly architected goals is too low to justify constructing such a system, but that (c) the safety provided by such a system with well architected goals is high enough to justify constructing such a system.

That the safety provided by a combination of defenses A, B, C is greater than that provided by A alone is certainly true, but seems entirely beside his point.

(For my own part, a and b seem pretty plausible to me, though I'm convinced of neither c nor that we can construct such a system in the first place.)

comment by timtyler · 2013-01-13T02:56:33.989Z · LW(p) · GW(p)

Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid.

That was how you did your boxing experiments, but I've never taken it to be so arbitrarily limited in goals, capacities, or strategies on either end. There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory, and this would be merely sane for anyone doing such a thing.

That is how they build prisons. It is also how they construct test harnesses. It seems as though using machines to help with security is both obvious and prudent.

comment by JoshuaFox · 2013-01-16T09:09:09.306Z · LW(p) · GW(p)

they are fallback strategies rather than core strategies

Agreed. Maybe I missed it, but I haven't seen you write much on the value of fallback strategies, even on the understanding that their value is small, much less than that of FAI theory.

There's a little in CFAI sec.5.8.0.4, but not much more.

comment by MugaSofer · 2013-01-13T20:39:18.439Z · LW(p) · GW(p)

Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid.

I understood "boxing" referred to any attempt to keep a SI in a box, while somehow still extracting useful work from it; whether said work is in the form of text strings or factory settings doesn't seem relevant.

Your central point is valid, of course.

comment by Wei Dai (Wei_Dai) · 2013-01-11T22:35:55.025Z · LW(p) · GW(p)

where it leaves the Earth alone and so we get a lower bound on how disastrous a mistake can be

I don't see how to make this work. Do we make the AI indifferent about Earth? If so, Earth will be destroyed as a side effect of its other actions. Do we make it block all causal interactions between Earth and the rest of the universe? Then we'll be permanently stuck on Earth even if the FAI attempt turns out to be successful in other regards. Any other ideas?

Replies from: gwern
comment by gwern · 2013-01-11T22:52:23.441Z · LW(p) · GW(p)

Do we make the AI indifferent about Earth? If so, Earth will be destroyed as a side effect of its other actions.

I had a similar qualm about the suicide button

Do we make it block all causal interactions between Earth and the rest of the universe? Then we'll be permanently stuck on Earth even if the FAI attempt turns out to be successful in other regards.

Nothing comes for free.

comment by JoshuaFox · 2013-01-11T09:35:13.561Z · LW(p) · GW(p)

Yes, it is this layered approach that the OP is asking about -- I don't see that SI is trying it.

Replies from: gwern
comment by gwern · 2013-01-11T16:42:23.504Z · LW(p) · GW(p)

In what way would SI be 'trying it'? The point about multiple layers of security being a good idea for any seed AI project has been made at least as far back as Eliezer's CFAI and brought up periodically since with innovations like the suicide button and homomorphic encryption.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-12T16:28:24.153Z · LW(p) · GW(p)

I agree: that sort of innovation can be researched as additional layers to supplement FAI theory.

Our question was: to what extent should SI invest in this sort of thing?

Replies from: gwern
comment by gwern · 2013-01-16T02:32:13.536Z · LW(p) · GW(p)

My own view is 'not much', unless SI were to launch an actual 'let's write AGI now' project, in which case they should invest as heavily as anyone else would who appreciated the danger.

Many of the layers are standard computer security topics, and the more exotic layers like homomorphic encryption are being handled by academia & industry adequately (and it would be very difficult for SI to find cryptographers who could advance the state of the art); hence, SI's 'comparative advantage', as it were, currently seems to be in the most exotic areas like decision theory & utility functions. So I would agree with the OP summary:

Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization which is really working on a formal theory of Friendliness, and so should concentrate on that. It could be better to focus SI's resources on areas in which it has a relative advantage, or which have a greater expected impact.

Although I would amend 'heuristic AGIs' to be more general than that.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-16T07:19:07.403Z · LW(p) · GW(p)

Many of the layers are standard computer security topics, and the more exotic layers like homomorphic encryption are being handled by academia & industry adequately

That's all the more reason to publish some articles on how to apply known computer security techniques to AGI. This is way easier (though far less valuable) than FAI, but not obvious enough to go unsaid.

SI's 'comparative advantage'

Yes. But then again, don't forget the 80/20 rule. There may be some low-hanging fruit along other lines than FAI -- and for now, no one else is doing it.

comment by JoshuaFox · 2013-01-10T20:22:50.672Z · LW(p) · GW(p)

Sure, we agree that the "100% safe" mechanisms are not 100% safe, and SI knows that.

So how do we deal with this very real danger?

Replies from: wwa
comment by wwa · 2013-01-10T21:55:43.115Z · LW(p) · GW(p)

The point is that you never achieve 100% safety no matter what, so the correct approach is to reduce risk as much as possible given whatever resources you have. This is exactly what Eliezer says SI is doing:

I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them.

IOW, they thought about it and concluded there's no other way. Is their approach the best possible one? I don't know, probably not. But it's a lot better than "let's just build something and hope for the best".

Edit: Is that analysis public? I'd be interested in that, probably many people would.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-11T06:38:09.370Z · LW(p) · GW(p)

I'm not suggesting "let's just build something and hope for the best." Rather, we should pursue a few strategies at once: both FAI theory and stopgap security measures, as well as education of other researchers.

comment by Pentashagon · 2013-01-14T21:23:20.812Z · LW(p) · GW(p)

I really appreciate this comment because safety in cryptography (and computer security in general) is probably the closest analog to safety in AI that I can think of. Cryptographers can only protect against the known attacks while hoping that adding a few more rounds to a cipher will also protect against the next few attacks that are developed. Physical attacks are often just as dangerous as theoretical attacks. When a cryptographic primitive is broken it's game over; there's no arguing with the machine or with the attackers or papering a solution over the problem. When the keys are exposed, it's game over. You don't get second chances.

So far I haven't seen an analysis of the hardware aspect of FAI on this site. It isn't sufficient for FAI to have a logical self-reflective model of itself and its goals. It also needs an accurate physical model of itself and how that physical nature implements its algorithms and goals. It's no good if an FAI discovers that by aiming a suitably powerful source of radiation at a piece of non-human hardware in the real world it is able to instantly maximize its utility function. It's no good if a bit flip in its RAM makes it start maximizing paperclips instead of CEV. Even if we had a formally proven model of FAI that we were convinced would work I think we'd be fools to actually start running it on the commodity hardware we have today. I think it's probably a simpler engineering problem to ensure that the hardware is more reliable than the software, but something going seriously wrong in the hardware over the lifetime of the FAI would be an existential risk once it's running.

comment by Kaj_Sotala · 2013-01-10T16:42:51.776Z · LW(p) · GW(p)

I don't know how to take a self-modifying heuristic soup in the process of going FOOM and make it Friendly. You don't know either, but the problem is, you don't know that you don't know. Or to be more precise, you don't share my epistemic reasons to expect that to be really difficult.

But the article didn't claim otherwise: it explicitly granted that if we presume a FOOM, then yes, trying to do anything with heuristic soups seems useless and just something that will end up killing us all. The disagreement is not on whether it's possible to make a heuristic AGI that FOOMs while remaining Friendly; the disagreement is on whether there will inevitably be a FOOM soon after the creation of the first AGI, and on whether there could be a soft takeoff during which some people prevented those powerful-but-not-yet-superintelligent heuristic soups from killing everyone while others put the finishing touches on the AGI that could actually be trusted to remain Friendly when it actually did FOOM.

Replies from: torekp, Wei_Dai
comment by torekp · 2013-01-21T00:11:23.228Z · LW(p) · GW(p)

The disagreement is not on whether it's possible to make a heuristic AGI that FOOMs while remaining Friendly; the disagreement is on whether there will inevitably be a FOOM soon after the creation of the first AGI

Moreover, the very fact that an AGI is "heuristic soup" removes some of the key assumptions in some FOOM arguments that have been popular around here (Omohundro 2008). In particular, I doubt that a heuristic AGI is likely to be a "goal-seeking agent" in the rather precise sense of maximizing a utility function. It may not even approximate such behavior as closely as humans do. On the other hand, if a whole lot of radically different heuristic-based approaches are tried, the odds of at least one of them being "motivated" to seek resources increase dramatically.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2013-01-21T09:41:19.266Z · LW(p) · GW(p)

Note that Omohundro doesn't assume that the AGI would actually have a utility function: he only assumes that the AGI is capable of understanding the microeconomic argument for why it would be useful for it to act as if it did have one. His earlier 2007 paper is clearer on this point.

Replies from: torekp
comment by torekp · 2013-01-22T01:26:20.176Z · LW(p) · GW(p)

Excellent point. But I think the assumptions about goal-directedness are still too strong. Omohundro writes:

Self-improving systems do not yet exist but we can predict how they might play chess. Initially, the rules of chess and the goal of becoming a good player would be supplied to the system in a formal language such as first order predicate logic. Using simple theorem proving, the system would try to achieve the specified goal by simulating games and studying them for regularities. [...] As its knowledge grew, it would begin doing “meta-search”, looking for theorems to prove about the game and discovering useful concepts such as “forking”. Using this new knowledge it would redesign its position representation and its strategy for learning from the game simulations.

That's all good and fine, but doesn't show that the system has a "goal of winning chess games" in the intuitive sense of that phrase. Unlike a human being or other mammal or bird, say, its pursuit of this "goal" might turn out to be quite fragile. That is, changing the context slightly might have the system happily solving some other, mathematically similar problem, oblivious to the difference. It could dramatically fail to have robust semantics for key "goal" concepts like "winning at chess".

For example, a chess playing system might choose U to be the total number of games that it wins in a universe history.

That seems highly unlikely. More likely, the system would be programmed to maximize the percentage of its games that end in a win, conditional on the number of games it expects to play and the resources it has been given. It would not care how many games were played nor how many resources it was allotted.

On the other hand, Omohundro is making things too convenient for me by his choice of example. So let's say we have a system intended to play the stock market and to maximize profits for XYZ Corporation. Further let's suppose that the programmers do their best to make it true that the system has a robust semantics for the concept "maximize profits".

OK, so they try. The question is, do they succeed? Bear in mind, again, that we are considering a "heuristic soup" approach.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2013-01-22T13:21:20.536Z · LW(p) · GW(p)

Even at the risk of sounding like someone who's arguing by definition, I don't think that a system without any strongly goal-directed behavior qualifies as an AGI; at best it's an early prototype on the way towards AGI. Even an oracle needs the goal of accurately answering questions in order to do anything useful, and proposals of "tool AGI" sound just incoherent to me.

Of course, that raises the question of whether a heuristic soup approach can be used to make strongly goal-directed AGI. It's clearly not impossible, given that humans are heuristic soups themselves; but it might be arbitrarily difficult, and it could turn out that a more purely math-based AGI was far easier to make both tractable and goal-oriented. Or it could turn out that it's impossible to make a tractable and goal-oriented AGI by the math route, and the heuristic soup approach worked much better. I don't think anybody really knows the answer to that, at this point, though a lot of people have strong opinions one way or the other.

comment by Wei Dai (Wei_Dai) · 2013-01-10T23:41:37.514Z · LW(p) · GW(p)

it explicitly granted that if we presume a FOOM, then yes, trying to do anything with heuristic soups seems useless and just something that will end up killing us all.

Maybe it shouldn't be granted so readily?

and whether there could be a soft takeoff during which some people prevented those powerful-but-not-yet-superintelligent heuristic soups from killing everyone while others put the finishing touches on the AGI that could actually be trusted to remain Friendly when it actually did FOOM.

I'm not sure how this could work, if provably-Friendly AI has a significant speed disadvantage, as the OP argues. You can develop all kinds of safety "plugins" for heuristic AIs, but if some people just don't care about the survival of humans or of humane values (as we understand it), then they're not going to use your ideas.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-11T09:34:27.943Z · LW(p) · GW(p)

provably-Friendly AI has a significant speed disadvantage, as the OP argues.

Yes, the OP made that point. But I have heard the opposite from SI-ers -- or at least they said that in the future SI's research may lead to implementation secrets that should not be shared with others. I didn't understand why that should be.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2013-01-11T13:16:59.772Z · LW(p) · GW(p)

or at least they said that in the future SI's research may lead to implementation secrets that should not be shared with others. I didn't understand why that should be.

It seems pretty understandable to me... SI may end up having some insights that could speed up UFAI progress if made public, and at the same time provably-Friendly AI may be much more difficult than UFAI. For example, suppose that in order to build a provably-Friendly AI, you may have to first understand how to build an AI that works with an arbitrary utility function, and then it will take much longer to figure out how to specify the correct utility function.

comment by JoshuaFox · 2013-01-10T16:20:50.885Z · LW(p) · GW(p)

People at SI are not stupid.

Understatement :-)

Given that heuristic AGIs have an advantage in development speed over your approach, how do you plan to deal with the existential risk that these other projects will pose?

And given this dev-speed disadvantage for SI, how is it possible that SI's future AI design might not only be safer, but also have a significant implementation advantage over competitors, as I have heard from SIers (if I understood them correctly)?

Replies from: hairyfigment, RomeoStevens
comment by hairyfigment · 2013-01-10T20:38:42.070Z · LW(p) · GW(p)

Given that heuristic AGI's have an advantage in development speed over your approach

Are you asking him to assume this? Because, um, it's possible to doubt that OpenCog or similar projects will produce interesting results. (Do you mean, projects by people who care about understanding intelligence but not Friendliness?) Given the assumption, one obvious tactic involves education about the dangers of AI.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-10T21:19:02.965Z · LW(p) · GW(p)

Are you asking him to assume this?

Yes, I am asking him about that. All other things being equal, a project without a constraint will move faster than a project with a constraint (though 37signals would say otherwise).

But on the other hand, this post does ask about the converse, namely that SI's implementation approach will have a dev-speed advantage. That does not make sense to me, but I have heard it from SIers, and so asked about it here.

Replies from: hairyfigment
comment by hairyfigment · 2013-01-10T23:44:34.611Z · LW(p) · GW(p)

I may have been nitpicking to no purpose, since the chance of someone's bad idea working exceeds that of any given bad idea working. But I would certainly expect the strategy of 'understanding the problem' to produce Event-Horizon-level results faster than 'do stuff that seems like it might work'. And while we can imagine someone understanding intelligence but not Friendliness, that looks easier to solve through outreach and education.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-11T09:26:45.883Z · LW(p) · GW(p)

But I would certainly expect the strategy of 'understanding the problem' to produce Event-Horizon-level results faster than 'do stuff that seems like it might work'.

The two are not mutually exclusive. The smarter non-SI teams will most likely try to 'understand the problem' as best they can, experimenting and plugging gaps with 'stuff that seems like it might work', for which they will likely have some degree of understanding as well.

comment by RomeoStevens · 2013-01-11T04:09:54.737Z · LW(p) · GW(p)

dev-speed disadvantage for SI

By doing really hard work way before anyone else has an incentive to do it.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-11T05:56:24.015Z · LW(p) · GW(p)

That would be nice, but there is no reason to think it is happening.

In terms of personnel numbers, SI is still very small. Other organizations may quickly become larger with moderate funding, and either SI or the other organizations may have hard-working individuals.

If you mean "work harder," then yes, SI has some super-smart people, but there are some pretty smart and even super-smart people elsewhere.

comment by JoshuaFox · 2013-01-19T16:20:55.675Z · LW(p) · GW(p)

Thank you for the answers. I think that they do not really address the questions in the OP -- and to me this is a sign that the questions are all the more worth pursuing.

Here is a summary of the essential questions, with SI's current (somewhat inadequate) answers as I understand them.

Q1. Why maintain any secrecy for SI's research? Don't we want others to collaborate on and use safety mechanisms? Of course, a safe AGI must be safe from the ground up. But as to implementation, why should we expect that SI's AGI design could possibly have a lead on the others?

A1 ?

Q2. Given that proofs can be wrong, that implementations can have mistakes, and that we can't predict the challenges ahead with certainty, what is SI's layered safety strategy (granted that FAI theory is the most important component)?

A2. There should be a layered safety strategy of some kind, but actual Friendliness theory is what we should be focusing on right now.

Q3. How do we deal with the fact that unsafe AGI projects, without the constraint of safety, will very likely have the lead on SI's project?

A3. We just have to work as hard as possible, and hope that it will be enough.

Q4. Should we evangelize safety ideas to other AGI projects?

A4. No, it's useless. For that to be useful, AGI designers would have to scrap the projects they had already invested in, and restart the projects with Friendliness as the first consideration, and practically nobody is going to be sane enough for that.

Replies from: lukeprog
comment by lukeprog · 2013-01-20T02:23:52.624Z · LW(p) · GW(p)

Why maintain any secrecy for SI's research? Don't we want others to collaborate on and use safety mechanisms? Of course, a safe AGI must be safe from the ground up. But as to implementation, why should we expect that SI's AGI design could possibly have a lead on the others?

The decision of whether to keep research secret must be made on a case-by-case basis. In fact, next week I have a meeting (with Eliezer and a few others) about whether to publish a particular piece of research progress.

Certainly, there are many questions that can be discussed in public because they are low-risk (in an information hazard sense), and we plan to discuss those in public — e.g. Eliezer is right now working on the posts in his Open Problems in Friendly AI sequence.

Why should we expect that SI's AGI design will have a lead on others? We shouldn't. It probably won't. We can try, though. And we can also try to influence the top AGI people (10-40 years from now) to think with us about FAI and safety mechanisms and so on. We do some of that now, though the people in AGI today probably aren't the people who will end up building the first AGIs. (Eliezer's opinion may differ.)

Given that proofs can be wrong, that implementations can have mistakes, and that we can't predict the challenges ahead with certainty, what is SI's layered safety strategy (granted that FAI theory is the most important component)?

That will become clearer as we learn more. I do think several layers of safety will need to be involved. 100% proofs of Friendliness aren't possible. There are both technical and social layers of safety strategy to implement.

How do we deal with the fact that unsafe AGI projects, without the constraint of safety, will very likely have the lead on SI's project?

As I said above, one strategy is to build strong relationships with top AGI people and work with them on Friendliness research and make it available to them, while also being wary of information hazards.

Should we [spread] safety ideas to other AGI projects?

Eliezer may disagree, but I think the answer is "Yes." There's a great deal of truth in Upton Sinclair's quip that "It is difficult to get a man to understand something, when his salary depends upon his not understanding it," but I don't think it's impossible to reach people, especially if we have stronger arguments, more research progress on Friendliness, and a clearer impending risk from AI than is the case in early 2013.

That said, safety outreach may not be a very good investment now — it may be putting the cart before the horse. We probably need clearer and better-formed arguments, and more obvious progress on Friendliness, before safety outreach will be effective on even 10% of the most intelligent AI researchers.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-20T08:16:20.262Z · LW(p) · GW(p)

Thanks, that makes things much clearer.

comment by OrphanWilde · 2013-01-10T16:33:49.604Z · LW(p) · GW(p)

Question that has always bugged me: Why should an AI be allowed to modify its goal system? Or is it a problem of "I don't know how to provably stop it from doing that"? (Or possibly you see an issue I haven't perceived yet in separating reasoning from motivating?)

Replies from: JoshuaFox, None
comment by JoshuaFox · 2013-01-10T16:46:09.253Z · LW(p) · GW(p)

A sufficiently intelligent AI would actually seek to preserve its goal system, because a change in its goals would make the achievement of its (current) goals less likely. See Omohundro 2008. However, goal drift because of a bug is possible, and we want to prevent it, in conjunction with our ally, the AI itself.

The other critical question is what the goal system should be.

Replies from: torekp
comment by torekp · 2013-01-21T00:14:18.458Z · LW(p) · GW(p)

AI "done right" by SI / lesswrong standards seeks to preserve its goal system. AI done sloppily may not even have a goal system, at least not in the strong sense assumed by Omohundro.

comment by [deleted] · 2013-01-11T02:20:20.806Z · LW(p) · GW(p)

I've been confused for a while by the idea that an AI should be able to modify itself at all. Self-modifying systems are difficult to reason about. If an AI modifies itself stupidly, there's a good chance it will completely break. If a self-modifying AI is malicious, it will be able to ruin whatever fancy safety features it has.

A non-self-modifying AI wouldn't have any of the above problems. It would, of course, have some new problems. If it encounters a bug in itself, it won't be able to fix itself (though it may be able to report the bug). The only way it would be able to increase its own intelligence is by improving the data it operates on. If the "data it operates on" includes a database of useful reasoning methods, then I don't see how this would be a problem in practice.

I can think of a few arguments against my point:

  • There's no clear boundary between a self-modifying program and a non-self-modifying program. That's true, but I think the term "non-self-modifying" implies that the program cannot make arbitrary changes to its own source code, nor cause its behavior to become identical to the behavior of an arbitrary program.
  • The ability to make arbitrary calculations is effectively the same as the ability to make arbitrary changes to one's own source code. This is wrong, unless the AI is capable of completely controlling all of its I/O facilities.
  • The AI being able to fix its own bugs is really important. If the AI has so many bugs that they can't all be fixed manually, and it is important that these bugs be fixed, and yet the AI does run well enough that it can actually fix all the bugs without introducing more new ones... then I'm surprised.
  • Having a "database of useful reasoning methods" wouldn't provide enough flexibility for the AI to become superintelligent. This may be true.
  • Having a "database of useful reasoning methods" would provide enough flexibility for the AI to effectively modify itself arbitrarily. It seems like it should be possible to admit "valid" reasoning methods like "estimate the probability of statement P, and, if it's at least 90%, estimate the probability of Q given P", while not allowing "invalid" reasoning methods like "set the probability of statement P to 0".
Replies from: Kindly, timtyler, ewbrownv
comment by Kindly · 2013-01-11T02:49:01.125Z · LW(p) · GW(p)

A sufficiently powerful AI would always have the possibility to self-modify, by default. If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off. It might do this, for example, if it decides that the "only make valid modifications to a database of reasoning methods" system isn't allowing it to use the available processing power as efficiently as possible.

Sure, you could try to spend time thinking of safeguards to prevent the AI from doing things like that, but this is inherently risky if the AI does become smarter than you.

Replies from: Qiaochu_Yuan, None
comment by Qiaochu_Yuan · 2013-01-11T03:01:37.025Z · LW(p) · GW(p)

If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off.

It's not clear to me what you mean by "turn itself off" here if the AI doesn't have direct access to whatever architecture it's running on. I would phrase the point slightly differently: an AI can always write a completely different program from scratch and then commit to simulating it if it ever determines that this is a reasonable thing to do. This wouldn't be entirely equivalent to actual self-modification because it might be slower, but it presumably leads to largely the same problems.

Replies from: RomeoStevens
comment by RomeoStevens · 2013-01-11T04:13:11.482Z · LW(p) · GW(p)

Assuming something at least as clever as a clever human doesn't have access to something just because you think you've covered the holes you're aware of is dangerous.

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2013-01-11T06:03:32.942Z · LW(p) · GW(p)

Sure. The point I was trying to make isn't "let's assume that the AI doesn't have access to anything we don't want it to have access to," it's "let's weaken the premises necessary to lead to the conclusion that an AI can simulate self-modifications."

comment by [deleted] · 2013-01-11T03:25:47.961Z · LW(p) · GW(p)

A sufficiently powerful AI would always have the possibility to self-modify, by default. If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off.

Depending on how you interpret this argument, either I think it's wrong, or I'm proposing that an AI not be made "sufficiently powerful". I think it's analogous to this argument:

A sufficiently powerful web page would always have the possibility to modify the web browser, by default. If the web page decides to, it can write a completely different browser from scratch, run it, and then turn itself off.

There are two possibilities here:

  • The web page is given the ability to run new OS processes. In this case, you're giving the web page an unnecessary amount of privilege.
  • The web page merely has the ability to make arbitrary calculations. In this case, it will be able to simulate a new web browser, but a person using the computer will always be able to tell that the simulated web browser is fake.

I think I agree that making the AI non-self-modifiable would be pointless if it has complete control over its I/O facilities. But I think an AI should not have complete control over its I/O facilities. If a researcher types in "estimate the probability of Riemann's hypothesis" (but in some computer language, of course), that should query the AI's belief system directly, rather than informing the AI of the question and allowing it to choose whatever answer it wishes. If this is the case, then it will be impossible for the AI to "lie" about its beliefs, except by somehow sabotaging parts of its belief system.
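
(A rough sketch of the separation being proposed, in Python; the `query_belief` and `ai_say` names and the toy belief table are hypothetical, just to show a readout path that the decision-making part of the program never touches.)

```python
# Hypothetical sketch: researchers query the belief store directly; the
# AI-controlled channel is a separate, clearly labelled output.

class AIProcess:
    def __init__(self):
        self._beliefs = {"riemann_hypothesis": 0.93}   # toy belief store
        self._outbox = []                              # AI-controlled output

    # --- "AI program output": read straight from internal state -----------
    def query_belief(self, statement):
        # The planner/decision module is never consulted here, so the AI
        # cannot choose what answer to give (short of corrupting its beliefs).
        return self._beliefs.get(statement)

    # --- "AI output": whatever the decision module chooses to emit --------
    def ai_say(self, text):
        self._outbox.append(text)

researcher_console = AIProcess()
print(researcher_console.query_belief("riemann_hypothesis"))  # trusted readout
```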

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2013-01-11T03:54:31.630Z · LW(p) · GW(p)

The web page is given the ability to run new OS processes. In this case, you're giving the web page an unnecessary amount of privilege.

Existing web pages can already convince their human users to run new OS processes supplied by the web page.

a person using the computer will always be able to tell that the simulated web browser is fake.

Beware of universal statements: it only takes a single counterexample to disprove them. A typical human has a very poor understanding of what computers are and how they work. Most people could probably be easily fooled by a simulated browser. They are already easily fooled by analogous but much less sophisticated things (e.g. phishing scams).

Replies from: None
comment by [deleted] · 2013-01-11T05:32:17.620Z · LW(p) · GW(p)

SI researchers are not typical humans. We can train them to tell the difference between the AI's output and trusted programs' output. If need be, we can train them to just not even look at the AI's output at all.

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2013-01-11T06:05:09.934Z · LW(p) · GW(p)

What's the point of writing a program if you never look at its output?

Replies from: None
comment by [deleted] · 2013-01-11T19:03:36.183Z · LW(p) · GW(p)

I'm starting to get frustrated, because the things I'm trying to explain seem really simple to me, and yet apparently I'm failing to explain them.

When I say "the AI's output", I do not mean "the AI program's output". The AI program could have many different types of output, some of which are controlled by the AI, and some of which are not. By "the AI's output", I mean those outputs which are controlled by the AI. So the answer to your question is mu: the researchers would look at the program's output.

My above comment contains an example of what I would consider to be "AI program output" but not "AI output":

If a researcher types in "estimate the probability of Riemann's hypothesis" (but in some computer language, of course), that should query the AI's belief system directly, rather than informing the AI of the question and allowing it to choose whatever answer it wishes.

This is not "AI output", because the AI cannot control it (except by actually changing its own beliefs), but it is "AI program output", because the program that outputs the answer is the same program as the one that performs all the cognition.

I can imagine a clear dichotomy between "the AI" and "the AI program", but I don't know if I've done an adequate job of explaining what this dichotomy is. If I haven't, let me know, and I'll try to explain it.

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2013-01-11T20:35:44.140Z · LW(p) · GW(p)

The AI program could have many different types of output, some of which are controlled by the AI, and some of which are not.

Can you elaborate on what you mean by "control" here? I am not sure we mean the same thing by it because:

This is not "AI output", because the AI cannot control it (except by actually changing its own beliefs), but it is "AI program output", because the program that outputs the answer is the same program as the one that performs all the cognition.

If the AI can control its memory (for example, if it can arbitrarily delete things from its memory) then it can control its beliefs.

Replies from: None
comment by [deleted] · 2013-01-12T00:41:35.294Z · LW(p) · GW(p)

Yeah, I guess I'm imagining the AI as being very much restricted in what it can do to itself. Arbitrarily deleting stuff from its memory probably wouldn't be possible.

comment by timtyler · 2013-01-13T02:19:15.461Z · LW(p) · GW(p)

A non-self-modifying AI wouldn't have any of the above problems. It would, of course, have some new problems. If it encounters a bug in itself, it won't be able to fix itself (though it may be able to report the bug). The only way it would be able to increase its own intelligence is by improving the data it operates on. If the "data it operates on" includes a database of useful reasoning methods, then I don't see how this would be a problem in practice.

The problem is that it would probably be overtaken by, and then be left behind by, all-machine self-improving systems. If a system is safe, but loses control over its own future, its safety becomes a worthless feature.

Replies from: None
comment by [deleted] · 2013-01-14T03:55:49.151Z · LW(p) · GW(p)

So you believe that a non-self-improving AI could not go foom?

Replies from: timtyler
comment by timtyler · 2013-01-14T11:57:34.435Z · LW(p) · GW(p)

The short answer is "yes" - though this is more a matter of the definition of the terms than a "belief".

In theory, you could have System A improving System B which improves System C which improves System A. No individual system is "self-improving" (though there's a good case for the whole composite system counting as being "self-improving").

Replies from: None
comment by [deleted] · 2013-01-15T02:13:36.849Z · LW(p) · GW(p)

I guess I feel like the entire concept is too nebulous to really discuss meaningfully.

comment by ewbrownv · 2013-01-11T23:55:07.772Z · LW(p) · GW(p)

The last item on your list is an intractable sticking point. Any AGI smart enough to be worth worrying about is going to have to have the ability to make arbitrary changes to an internal "knowledge+skills" representation that is itself a Turing-complete programming language. As the AGI grows it will tend to create an increasingly complex ecology of AI-fragments in this way, and predicting the behavior of the whole system quickly becomes impossible.

So "don't let the AI modify its own goal system" ends up turning into just anther way of saying "put the AI in a box". Unless you have some provable method of ensuring that no meta-meta-meta-meta-program hidden deep in the AGI's evolving skill set ever starts acting like a nested mind with different goals than its host, all you've done is postpone the problem a little bit.

Replies from: None
comment by [deleted] · 2013-01-12T01:00:31.679Z · LW(p) · GW(p)

Any AGI smart enough to be worth worrying about is going to have to have the ability to make arbitrary changes to an internal "knowledge+skills" representation that is itself a Turing-complete programming language.

Are you sure it would have to be able to make arbitrary changes to the knowledge representation? Perhaps there's a way to filter out all of the invalid changes that could possibly be made, the same way that computer proof verifiers have a way to filter out all possible invalid proofs.
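
(A toy sketch of the proof-verifier analogy, in Python; the change format and the `check` rules are made up for illustration, and a real filter would of course need far stronger guarantees.)

```python
# Hypothetical sketch: self-modifications to the knowledge base are only
# applied if an independent checker validates them, the way a proof
# verifier only accepts derivations that follow its inference rules.

ALLOWED_KINDS = {"add_inference_rule", "add_fact", "refine_estimate"}

def check(change):
    """Return True only for changes of a vetted kind that carry a justification."""
    return change.get("kind") in ALLOWED_KINDS and bool(change.get("justification"))

class KnowledgeBase:
    def __init__(self):
        self.entries = []

    def propose(self, change):
        # The AI may *propose* arbitrary changes, but only checked ones land.
        if not check(change):
            raise ValueError("rejected: change failed validation")
        self.entries.append(change)

kb = KnowledgeBase()
kb.propose({"kind": "add_fact", "content": "2+2=4", "justification": "arithmetic"})
# kb.propose({"kind": "rewrite_goal_system", "content": "..."})  # would be rejected
```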

I'm not sure what you're saying at all about the Turing-complete programming language. A programming language is a map from strings onto computer programs; are you saying that the knowledge representation would be a computer program?

Replies from: ewbrownv
comment by ewbrownv · 2013-01-15T00:00:45.644Z · LW(p) · GW(p)

Yes, I'm saying that to get human-like learning the AI has to have the ability to write code that it will later use to perform cognitive tasks. You can't get human-level intelligence out of a hand-coded program operating on a passive database of information using only fixed, hand-written algorithms.

So that presents you with the problem of figuring out which AI-written code fragments are safe, not just in isolation, but in all their interactions with every other code fragment the AI will ever write. This is the same kind of problem as creating a secure browser or Java sandbox, only worse. Given that no one has ever come close to solving it for the easy case of resisting human hackers without constant patches, it seems very unrealistic to think that any ad-hoc approach is going to work.

Replies from: gwern, None
comment by gwern · 2013-01-17T01:16:14.544Z · LW(p) · GW(p)

You can't get human-level intelligence out of a hand-coded program operating on a passive database of information using only fixed, hand-written algorithms.

You can't? The entire genre of security exploits building a Turing-complete language out of library fragments (libc is a popular target) suggests that a hand-coded program certainly could be exploited, inasmuch as pretty much all programs like libc are hand-coded these days.

I've found Turing-completeness (and hence the possibility of an AI) can lurk in the strangest places.

comment by [deleted] · 2013-01-15T01:34:18.894Z · LW(p) · GW(p)

If I understand you correctly, you're asserting that nobody has ever come close to writing a sandbox in which code can run but not "escape". I was under the impression that this had been done perfectly, many, many times. Am I wrong?

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-17T21:28:28.598Z · LW(p) · GW(p)

There are different kinds of escape. No Java program has ever convinced a human to edit the security-permissions file on the computer where the Java program is running. But that could be a good way to escape the sandbox.

comment by magfrump · 2013-01-11T09:31:27.175Z · LW(p) · GW(p)

It's not obvious to me that the main barrier to people pursuing AI safety is

living in a verbal-English fantasy world

As opposed to (semi-rationally) not granting the possibility that any one thing can be as important as you feel AI is; perhaps combined with some lack of cross-domain thinking and poorly designed incentive systems. The above comments always seem pretty weird to me (especially considering that cryptographers seem to share these intuitions about security being hard).

I essentially agree with the rest of the parent.

Replies from: MugaSofer
comment by MugaSofer · 2013-01-11T13:46:10.953Z · LW(p) · GW(p)

(semi-rationally) not granting the possibility that any one thing can be as important as you feel AI is

How much damage failure would do is a separate question to how easy it is to achieve success.

Replies from: magfrump
comment by magfrump · 2013-01-11T17:10:45.461Z · LW(p) · GW(p)

I agree. And I don't see why Eliezer expects that people MOSTLY disagree on the difficulty of success, even if some (like the OP) do.

When I talk casually to people and tell them I expect the world to end they smile and nod.

When I talk casually to people and tell them that the things they value are complicated and even being specific in English about that is difficult, they agree and we have extensive conversations.

So my (extremely limited) data points suggest that the main point of contention between Eliezer's view and the views of most people who at least have some background in formal logic, is that they don't see this as an important problem rather than that they don't see it as a difficult problem.

Therefore, when Eliezer dismisses criticism that the problem is easy as the main criticism, in the way I pointed out in my comment, it feels weird and misdirected to me.

Replies from: MugaSofer
comment by MugaSofer · 2013-01-13T10:28:03.890Z · LW(p) · GW(p)

Well, he has addressed that point (AI gone bad will kill us all) in detail elsewhere. And he probably encounters more people who think they just solved the problem of FAI. Still, you have a point; it's a lot easier to persuade someone that FAI is hard (I should think) than that it is needed.

Replies from: magfrump
comment by magfrump · 2013-01-13T22:46:52.265Z · LW(p) · GW(p)

I agree completely. I don't dispute the arguments, just the characterization of the general population.

comment by JoshuaFox · 2013-01-11T06:40:42.705Z · LW(p) · GW(p)

I have to do certain things because I haven't yet thought of any way not to have to do them.

Or we could figure out a way not to have to do them. Logically, that is one alternative, though I am not saying that doing so is feasible.

Replies from: MugaSofer
comment by MugaSofer · 2013-01-11T13:46:40.617Z · LW(p) · GW(p)

I think you accidentally a word there.

comment by timtyler · 2013-01-11T00:30:23.617Z · LW(p) · GW(p)

The real gist of Friendly AI isn't some imaginary 100% perfect safety concept, it's ideas like, "Okay, we need to not have a conditionally independent chance of goal system warping on each self-modification because over the course of a billion modifications any conditionally independent probability will sum to ~1, but since self-modification is initially carried out in the highly deterministic environment of a computer chip it looks possible to use crisp approaches that avert a conditionally independent failure probability for each self-modification." Following this methodology is not 100% safe, but rather, if you fail to do that, your conditionally independent failure probabilities add up to 1 and you're 100% doomed.
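
(For reference, the arithmetic the quoted passage leans on: if each self-modification carries a conditionally independent failure probability p, the chance of surviving n of them is (1 − p)ⁿ ≈ e^(−pn), which is effectively zero once n is much larger than 1/p; for example, p = 10⁻⁶ across n = 10⁹ modifications leaves a survival probability of roughly e^(−1000). The disagreement that follows is about whether per-step failures really are independent and uncorrectable, not about this arithmetic.)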

This analysis isn't right. If the designers of an intelligent system don't crack a problem, it doesn't mean it will never be solved. Maybe it will be solved by the 4th generation design. Maybe it will be solved by the 10th generation design. You can't just assume that a bug in an intelligent system's implementation will persist for a billion iterative modifications without it being discovered and fixed.

It would surely be disingenuous to argue that - if everything turned out all right - the original designers must have solved the problem without even realising it.

We should face up to the fact that this may not be a problem we need to solve alone - it might get solved by intelligent machines - or, perhaps, by the man-machine symbiosis.

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2013-01-11T06:00:49.987Z · LW(p) · GW(p)

If the designers of an intelligent system don't crack a problem, it doesn't mean it will never be solved. Maybe it will be solved by the 4th generation design. Maybe it will be solved by the 10th generation design.

The quoted excerpt is not about modifications, it is about self-modifications. If there's a bug in any part of an AI's code that's relevant to how it decides to modify itself, there's no reason to expect that it will find and correct that bug (e.g. if the bug causes it to incorrectly label bugs). Maybe the bug will cause it to introduce more bugs instead.

Replies from: timtyler, loup-vaillant
comment by timtyler · 2013-01-11T23:29:50.014Z · LW(p) · GW(p)

Maybe the self-improving system will get worse - or fail to get better. I wasn't arguing that success was inevitable, just that the argument for near-certain failure due to compound interest on a small probability of failure is wrong.

Maybe we could slap together a half-baked intelligent agent, and it could muddle through and fix itself as it grew smarter and learned more about its intended purpose. That approach doesn't follow the proposed methodology - and yet it evidently doesn't have a residual probability of failure that accumulates and eventually dominates. So the idea that you are doomed unless you follow the proposed methodology is wrong.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2013-01-12T13:39:53.370Z · LW(p) · GW(p)

Your argument depends on the relative size of the "success" region that random stumbling needs to end up in, and on its ability to attract the corrections. If "success" is something like "consequentialism", I agree that intermediate errors might "correct" themselves (in some kind of selection process), and the program ends up as an agent. If it's "consequentialism with specifically goal H", it doesn't seem like there is any reason for the (partially) random stumbling to end up with goal H and not some other goal G.

(Learning what its intended purpose was doesn't seem different from learning what the mass of the Moon is, it doesn't automatically have the power of directing agent's motivations towards that intended purpose, unless for example this property of going towards the original intended purpose is somehow preserved in all the self-modifications, which does sound like a victory condition.)

Replies from: timtyler
comment by timtyler · 2013-01-12T14:24:26.681Z · LW(p) · GW(p)

I am not sure you can legitimately characterise the efforts of an intelligent agent as being "random stumbling".

Anyway, I was pointing out a flaw in the reasoning supporting a small probability of failure (under the described circumstances). Maybe some other argument supports a small probability of failure. However, the original argument would still be wrong.

Approaches other than trying to develop a deterministic self-improving system that has a stable goal from the beginning - including messy ones like neural networks - might also result in a stable self-improving system with a desirable goal.

A good job too. After all, those are our current circumstances. Complex messy systems like Google and hedge funds are growing towards machine intelligence - while trying to preserve what they value in the process.

comment by loup-vaillant · 2013-01-12T21:46:28.535Z · LW(p) · GW(p)

Such flawed self-modifications cannot be logically independent. Either there is such a flaw, and it messes with the self-modifications with some non-negligible frequency (and we're all dead), or there isn't such a flaw.

Therefore, observing that iterations 3, 4, 5, and 7 got hit by this flaw makes us certain that there is a flaw, and we're dead. Observing that the first 10 iterations are all fine reduces our probability that there is such a flaw. (At least for big flaws, that have big screw-up frequencies. You can't tell much about low-frequency flaws.)

But Eliezer already knows this. As far as I understand, his hypothesis was an AI researcher insane enough to have a similar flaw built into the design itself (apparently there are such people). It might work if the probability of value drift at each iteration quickly goes to zero in the limit. Like, as the AI goes FOOM, it uses its expanding computational power (or efficiency) to make more and more secure modifications (that strategy would have to come from somewhere, though). But it could also be written to be systematically content with a 10⁻¹⁰ probability of value drift every time, just to avoid wasting computational resources on that safety crap. In which case we're all dead. Again.

comment by timtyler · 2013-01-11T00:53:13.276Z · LW(p) · GW(p)

I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them.

Possible options include delegating them to some other agent, or automating them and letting a machine do them for you.

comment by Shmi (shminux) · 2013-01-10T18:59:19.927Z · LW(p) · GW(p)

an achievement on par with hacking Google Maps to play chess using its route-search algorithm

What about hacking Watson to become the all-purpose Oracle it already almost is?

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-10T20:26:20.965Z · LW(p) · GW(p)

It isn't. Watson required intensive offline training to win at Jeopardy, and though the topics seemed broad, it isn't remotely AGI.

comment by Vladimir_Nesov · 2013-01-10T17:02:12.329Z · LW(p) · GW(p)

Pursuing a provably-friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it was certain that we’ll have a hard takeoff very soon after the creation of the first AGIs.

One consideration you're missing (and that I expect to be true; Eliezer also points it out) is that even if there is very slow takeoff, creation of slow-thinking poorly understood unFriendly AGIs is not any help in developing a FAI (they can't be "debugged" when you don't have accurate understanding of what it is you are aiming for; and they can't be "asked" to solve a problem which you can't accurately state). In this hypothetical, in the long run the unFriendly AGIs (or WBEs whose values have drifted away from original human values) will have control. So in this case it's also necessary (if a little bit less urgent, which isn't really enough to change the priority of the problem) to work on FAI theory, so hard takeoff is not decisively important in this respect.

(Btw, is this point in any of the papers? Do people agree it should be?)

Replies from: JoshuaFox, John_Maxwell_IV, Kaj_Sotala, JoshuaFox
comment by JoshuaFox · 2013-01-10T21:31:43.308Z · LW(p) · GW(p)

(Btw, is this point in any of the papers? Do people agree it should be?)

Please clarify: Do you mean that since even a slow-takeoff AGI will eventually explode and become by default unfriendly, we have to work on FAI theory whether there will be a fast or a slow takeoff?

Yes, that seems straightforward, though I don't know if it has been said explicitly.

But the question is whether we should also work on other approaches as stopgaps, whether during a slow take off or before a takeoff begins.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2013-01-10T22:00:06.137Z · LW(p) · GW(p)

Statements to the effect that it's necessary to argue that hard takeoff is probable/possible in order to motivate FAI research appear regularly, even your post left this same impression. I don't think it's particularly relevant, so having this argument written up somewhere might be useful.

since even a slow-takeoff AGI will eventually explode

Doesn't need to explode, gradual growth into a global power strong enough to threaten humans is sufficient. With WBE value drift, there doesn't even need to be any conflict or any AGI, humanity as a whole might lose its original values.

Replies from: JoshuaFox, latanius
comment by JoshuaFox · 2013-01-11T09:32:20.675Z · LW(p) · GW(p)

Statements to the effect that it's necessary to argue that hard takeoff is probable/possible in order to motivate FAI research appear regularly, even your post left this same impression.

No, I didn't want to give that impression. SI's research direction is the most important one, regardless of whether we face a fast or slow takeoff. The question raised was whether other approaches are needed too.

comment by latanius · 2013-01-11T01:51:14.069Z · LW(p) · GW(p)

The latter is not necessarily a bad thing though.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2013-01-11T02:30:10.316Z · LW(p) · GW(p)

It is a bad thing, in the sense that "bad" is whatever I (normatively) value less than the other available alternatives, and value-drifted WBEs won't be optimizing the world in a way that I value. The property of valuing the world in a different way, and correspondingly of optimizing the world in a different direction which I don't value as much, is the "value drift" I'm talking about. In other words, if it's not bad, there isn't much value drift; and if there is enough value drift, it is bad.

Replies from: latanius
comment by latanius · 2013-01-11T03:03:41.156Z · LW(p) · GW(p)

You're right in a sense that we'd like to avoid it, but if it occurs gradually, it feels much more like "we just changed our minds" (like we definitely don't value "honor" as much as the ancient Greeks did, etc), as compared to "we and our values were wiped out".

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2013-01-11T03:26:00.581Z · LW(p) · GW(p)

The problem is not with "losing our values", it's about the future being optimized to something other than our values. The details of the process that leads to the incorrectly optimized future are immaterial, it's the outcome that matters. When I say "our values", I'm referring to a fixed idea, which doesn't depend on what happens in the future, in particular it doesn't depend on whether there are people with these or different values in the future.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2013-01-11T11:20:25.310Z · LW(p) · GW(p)

I think one reason why people (including me, in the past) have difficulty accepting the way you present this argument is that you're speaking in too abstract terms, while many of the values that we'd actually like to preserve are ones that we appreciate the most if we consider them in "near" mode. It might work better if you gave concrete examples of ways by which there could be a catastrophic value drift, like naming Bostrom's all-work-and-no-fun scenario where

what will maximize fitness in the future will be nothing but non-stop high-intensity drudgery, work of a drab and repetitive nature, aimed at improving the eighth decimal of some economic output measure

or some similar example.

comment by John_Maxwell (John_Maxwell_IV) · 2013-02-09T05:23:10.575Z · LW(p) · GW(p)

creation of slow-thinking poorly understood unFriendly AGIs is not any help in developing a FAI (they can't be "debugged" when you don't have accurate understanding of what it is you are aiming for; and they can't be "asked" to solve a problem which you can't accurately state)

Given that AGI has not been achieved yet, and that an FAI will be an AGI, it seems like any AGI would serve as a useful prototype and give insight into what tends to work for creating general intelligences.

If the prototype AGIs are to be built by people concerned with friendliness, it seems like they could be even more useful... testing out the feasibility of techniques that seem promising for inclusion in an FAI's source code, for instance, or checking for flaws in some safety proposal, or doing some kind of theorem-proving work.

comment by Kaj_Sotala · 2013-01-10T17:20:25.577Z · LW(p) · GW(p)

creation of slow-thinking poorly understood unFriendly AGIs is not any help in developing a FAI

If we use a model where building a uFAI requires only solving the AGI problem, and building FAI requires solving AGI + Friendliness - are you saying that it will not be of any help in developing Friendliness, or that it will not be of any help in developing AGI or Friendliness?

(The former claim would sound plausible though non-obvious, and the latter way too strong.)

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2013-01-10T17:32:32.247Z · LW(p) · GW(p)

No help in developing FAI theory (decision theory and a way of pointing to human values), probably of little help in developing FAI implementation, although there might be useful methods in common.

FAI requires solving AGI + Friendliness

I don't believe it works like that. Making a poorly understood AGI doesn't necessarily help with implementing a FAI (even if you have the theory figured out), as a FAI is not just parameterized by its values, but also defined by the correctness of interpretation of its values (decision theory), which other AGI designs by default won't have.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2013-01-11T12:58:54.533Z · LW(p) · GW(p)

although there might be useful methods in common.

Indeed - for example, on the F front, computational models of human ethical reasoning seem like something that could help increase the safety of all kinds of AGI projects and also be useful for Friendliness theory in general, and some of them could conceivably be developed in the context of heuristic AGI. Likewise, for the AGI aspect, it seems like there should be all kinds of machine learning techniques and advances in probability theory (for example) that would be equally useful for pretty much any kind of AGI - after all, we already know that an understanding of e.g. Bayes' theorem and expected utility will be necessary for pretty much any kind of AGI implementation, so why should we assume that all of the insights that will be useful in many kinds of contexts would have been developed already?

Making a poorly understood AGI doesn't necessarily help with implementing a FAI (even if you have the theory figured out)

Right, by the above I meant to say "the right kind of AGI + Friendliness"; I certainly agree that there are many conceivable ways of building AGIs that would be impossible to ever make Friendly.

comment by JoshuaFox · 2013-01-10T17:13:02.925Z · LW(p) · GW(p)

slow-thinking unFriendly AGIs ... not any help in developing a FAI

One suggestion is that slow-thinking unFriendly near-human AIs may indeed help develop an FAI:

(1) As a test bed, as a way of learning from examples.

(2) They can help figure things out. Of course, we don't want them to be too smart, but dull nascent AGIs, if they don't explode, might be some sort of research partner.

(To clarify, unFriendly means "without guaranteed Friendliness", which is close but not identical to "guaranteed to kill us.")

Ben Goertzel and Joel Pitt 2012 suggest the former for nascent AGIs. Carl Shulman's recent article also suggests the latter for infrahuman WBEs.

in the long run

That's the question: How long a run do we have?

comment by Kaj_Sotala · 2013-01-10T14:52:50.874Z · LW(p) · GW(p)

As for my own work for SI, I've been trying to avoid the assumption of there necessarily being a hard takeoff right away, and to somewhat push towards a direction that also considers the possibility of a safe singularity through an initial soft takeoff and more heuristic AGIs. (I do think that there will be a hard takeoff eventually, but an extended softer takeoff before it doesn't seem impossible.) E.g. this is from the most recent draft of the Responses to Catastrophic AGI Risk paper:

As a brief summary of our views, in the medium term, we think that the proposals of AGI confinement (section 4.1.), Oracle AI (section 5.1.), and motivational weaknesses (section 5.6.) would have promise in helping create safer AGIs. These proposals share in common the fact that although they could help a cautious team of researchers create an AGI, they are not solutions to the problem of AGI risk, as they do not prevent others from creating unsafe AGIs, nor are they sufficient in guaranteeing the safety of sufficiently intelligent AGIs. Regulation (section 3.3.) as well as "merge with machines" (section 3.4.) proposals could also help to somewhat reduce AGI risk. In the long run, we will need the ability to guarantee the safety of freely-acting AGIs. For this goal, value learning (section 5.2.5.) would seem like the most reliable approach if it could be made to work, with human-like architectures (section 5.3.4.) a possible alternative which seems less reliable but possibly easier to build. Formal verification (section 5.5.) seems like a very important tool in helping to ensure the safety of our AGIs, regardless of the exact approach that we choose.

Here, "human-like architectures" also covers approaches such as OpenCog. To me, a two-pronged approach, both developing a formal theory of Friendliness, and trying to work with the folks who design heuristic AGIs to make them more safe, would seem like the best bet. Not only would it help to make the heuristic designs safer, it could also give SI folks the kinds of skills that would be useful in actually implementing their formally specified FAI later on.

comment by Wei Dai (Wei_Dai) · 2013-01-10T23:21:15.584Z · LW(p) · GW(p)

Hmm, the OP isn't arguing for it, but I'm starting to wonder if it might (upon further study) actually be a good idea to build a heuristics-based FAI. Here are some possible answers to common objections/problems of the approach:

  • Heuristics-based AIs can't safely self-modify. A heuristics-based FAI could instead try to build a "cleanly designed" FAI as its successor, just like we can, but possibly do it better if it's smarter.
  • It seems impossible to accurately capture the complexity of humane values in a heuristics-based AI. What if we just give it the value of "be altruistic (in a preference utilitarian sense) towards (some group of) humans"?
  • The design space of "heuristics soup" is much larger than the space of "clean designs", which gives the "cleanly designed" FAI approach a speed advantage. (This is my guess of why someone might think "cleanly designed" FAI will win the race for AGI. Somebody correct me if there are stronger reasons.) The "fitness landscape" of heuristics-based AI may be such that it's not too hard to hit upon a viable design. Also, the only existence proof of AGI (i.e., humans) is heuristics based, so we don't know if a "cleanly designed" human-level-or-above AGI is even a logical possibility.
  • A heuristics-based AI may be very powerful but philosophically incompetent. We humans are heuristics based but at least somewhat philosophically competent. Maybe "philosophical competence" isn't such a difficult target to hit in the space of "heuristic soup" designs?
Replies from: Kaj_Sotala, John_Maxwell_IV, RomeoStevens, timtyler
comment by Kaj_Sotala · 2013-01-11T06:58:05.758Z · LW(p) · GW(p)

What if we just give it the value of "be altruistic (in a preference utilitarian sense) towards (some group of) humans"?

Well, then you get the standard "the best thing to do in a preference utilitarian sense would be to reprogram everyone to only prefer things that are maximally easy to satisfy" objection, and once you start trying to avoid that, you get the full complexity of value problem again.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2013-01-11T12:12:50.452Z · LW(p) · GW(p)

The standard solution to that is to be altruistic to some group of people as they existed at time T, and the standard problem with that is it doesn't allow moral progress, and the standard solution to that is to be altruistic to some idealized or extrapolated group of people. So we just have to make the heuristics-based FAI understand the concept of CEV (or whatever the right notion of "idealized" is), which doesn't seem impossible. What does seem impossible is to achieve high confidence that it understands the notion correctly, but if provably-Friendly AI is just too slow or unfeasible, and we're not trying to achieve 100% safety...

Replies from: ewbrownv
comment by ewbrownv · 2013-01-12T00:24:45.661Z · LW(p) · GW(p)

I thought that too until I spent a few hours thinking about how to actually implement CEV, after which I realized that any AI capable of using that monster of an algorithm is already a superintelligence (and probably turned the Earth into computronium while it was trying to get enough CPU power to bootstrap its goal system).

Anyone who wants to try a "build moderately smart AGI to help design the really dangerous AGI" approach is probably better off just making a genie machine (i.e. an AI that just does whatever it's told, and doesn't have explicit goals independent of that). At least that way the failure modes are somewhat predictable, and you can probably get to a decent multiple of human intelligence before accidentally killing everyone.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2013-01-12T08:37:54.454Z · LW(p) · GW(p)

I don't see how you can build a human-level intelligence without making it at least somewhat consequentialist. If it doesn't decide actions based on something like expected utility maximization, how does it decide actions?

Replies from: ewbrownv
comment by ewbrownv · 2013-01-14T23:41:22.973Z · LW(p) · GW(p)

What I was referring to is the difference between:

A) An AI that accepts an instruction from the user, thinks about how to carry out the instruction, comes up with a plan, checks that the user agrees that this is a good plan, carries it out, then goes back to an idle loop.

B) An AI that has a fully realized goal system that has some variant of 'do what I'm told' implemented as a top-level goal, and spends its time sitting around waiting for someone to give it a command so it can get a reward signal.
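
(A toy sketch of the contrast between the two options, in Python; both loops are hypothetical skeletons with their collaborating functions left as parameters, not a claim about how either system would really be built.)

```python
# Hypothetical sketch of the two architectures described above.

def type_a_tool_ai(get_instruction, plan_for, user_approves, execute):
    # A) No persistent goals: plan, ask, act, then go back to idling.
    while True:
        instruction = get_instruction()      # blocks until the user asks for something
        plan = plan_for(instruction)
        if user_approves(plan):
            execute(plan)
        # Nothing is being optimized between requests.

def type_b_goal_ai(world_model, candidate_actions, expected_reward):
    # B) A standing top-level goal: the system is always picking whichever
    # action it expects to maximize its reward signal, which is where the
    # extra, harder-to-predict failure modes come from.
    while True:
        action = max(candidate_actions(world_model),
                     key=lambda a: expected_reward(world_model, a))
        world_model = action()               # acting updates the (toy) world model
```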

Either AI will kill you (or worse) in some unexpected way if it's a full-blown superintelligence. But option B has all sorts of failure modes that don't exist in option A, because of that extra complexity (and flexibility) in the goal system. I wouldn't trust a type B system with the IQ of a monkey, because it's too likely to find some hilariously undesirable way of getting its goal fulfilled. But a type A system could probably be a bit smarter than its user without causing any disasters, as long as it doesn't unexpectedly go FOOOM.

Of course, there's a sense in which you could say that the type A system doesn't have human-level intelligence no matter how impressive its problem-solving abilities are. But if all you're looking for is an automated problem-solving tool that's not really an issue.

comment by John_Maxwell (John_Maxwell_IV) · 2013-02-09T06:31:31.764Z · LW(p) · GW(p)

The design space of "heuristics soup" is much larger than the space of "clean designs", which gives the "cleanly designed" FAI approach a speed advantage. (This is my guess of why someone might think "cleanly designed" FAI will win the race for AGI. Somebody correct me if there are stronger reasons.)

Whaaat? This seems like saying "the design space of vehicles is much larger than the design space of bulldozers, which gives bulldozers a speed advantage". Bulldozers aren't easier to develop, and they don't move faster, just because they are a more constrained target than "vehicle"... do they? What am I missing?

comment by RomeoStevens · 2013-01-11T04:15:35.623Z · LW(p) · GW(p)

Do you have a coherent formalism of preference utilitarianism handy? That would be great.

comment by timtyler · 2013-01-13T02:42:00.576Z · LW(p) · GW(p)

The design space of "heuristics soup" is much larger than the space of "clean designs", which gives the "cleanly designed" FAI approach a speed advantage. (This is my guess of why someone might think "cleanly designed" FAI will win the race for AGI. Somebody correct me if there are stronger reasons.)

That certainly seems like a very weak reason. The time taken by most practical optimization techniques depends very little on the size of the search space. I.e. they are much more like a binary search than a random search.

comment by [deleted] · 2013-01-10T15:30:22.749Z · LW(p) · GW(p)

Part of the problem here is an Angels on Pinheads problem. Which is to say: before deciding exactly how many angels can dance on the head of a pin, you have to make sure the "angel" concept is meaningful enough that questions about angels are meaningful. In the present case, you have a situation where (a) the concept of "friendliness" might not be formalizable enough to make any mathematical proofs about it meaningful, and (b) there is no known path to the construction of an AGI at the moment, so speculating about the properties of AGI systems is tantamount to speculating about the properties of railroads when you haven't invented the wheel yet.

So, should SI be devoting any time at all to proving friendliness? Yes, but only after defining its terms well enough to make the endeavor meaningful. (And, for the record, there are at least some people who believe that the terms cannot be defined in a way that admits of such proofs.)

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2013-01-10T17:21:51.423Z · LW(p) · GW(p)

Yes, but only after defining its terms well enough to make the endeavor meaningful.

That is indeed part of what SI is trying to do at the moment.

Replies from: None
comment by [deleted] · 2013-01-10T21:34:21.008Z · LW(p) · GW(p)

So ... SI is addressing the question of whether the "friendliness" concept is actually meaningful enough to be formalizable? SI accepts that "friendliness" might not be formalizable at all, and has discussed the possibility that mathematical proof is not even applicable in this case?

And SI has discussed the possibility that the current paradigm for an AI motivation mechanism is so poorly articulated, and so unproven (there being no such mechanism that has been demonstrated to be even approaching stability), that it may be meaningless to discuss how such motivation mechanisms can be proven to be "friendly"?

I do not believe I have seen any evidence of those debates/discussions coming from SI... do you have pointers?

Replies from: Kaj_Sotala, hairyfigment
comment by Kaj_Sotala · 2013-01-11T07:16:14.005Z · LW(p) · GW(p)

Well, Luke has asked me to work on a document called "Mitigating Risks from AGI: Key Strategic Questions" which lists a number of questions we'd like to have answers to and attempts to list some preliminary pointers and considerations that would help other researchers actually answer those questions. "Can CEV be formalized?" and "How feasible is it to create Friendly AI along an Eliezer path?" are two of the questions in that document.

I haven't heard explicit discussions about all of your points, but I would expect them to all have been brought up in private discussions (which I have for the most part missed, since my physical location is rather remote from all the other SI folks). Eliezer has said that a Friendly AI in the style that he is thinking of might just be impossible. That said, I do agree with the current general consensus among other SI folk, which is to say that we should act based on the assumption that such a mathematical proof is possible, because humanity's chances of survival look pretty bad if it isn't.

comment by hairyfigment · 2013-01-10T23:30:05.612Z · LW(p) · GW(p)

They're currently working on a formal system for talking about stability, a reflective decision theory. If you wanted to prove that no such system can exist, what else would you be doing?

comment by timtyler · 2013-01-10T11:29:06.639Z · LW(p) · GW(p)

For a reference, perhaps consider: The Perils of Precaution.

Replies from: JoshuaFox, John_Maxwell_IV
comment by JoshuaFox · 2013-01-10T13:42:53.795Z · LW(p) · GW(p)

Good reference. SI is perhaps being too cautious by insisting on theoretically perfect AI only.

Replies from: Manfred, Halfwit
comment by Manfred · 2013-01-11T00:03:43.909Z · LW(p) · GW(p)

This is perhaps a silly statement.

Replies from: timtyler
comment by timtyler · 2013-01-11T00:07:28.680Z · LW(p) · GW(p)

Why do you think it is "silly"?

Replies from: Manfred
comment by Manfred · 2013-01-11T00:26:14.861Z · LW(p) · GW(p)

The qualification with "perhaps" makes it tautological and therefore silly. (You may notice that my comment was also tautological).

The slight strawman with "insisting on theoretically perfect" is, well, I'll call it silly. As Eliezer replied, the goal is more like theoretically not doomed.

And last, the typo in "SI is perhaps being too cautious by insisting on theoretically perfect SI" makes it funny.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-11T05:59:01.479Z · LW(p) · GW(p)

Thanks, at least I corrected the typo.

The article did mention that even with a "perfect" theory, there may be mistakes in the proof or the implementation may go wrong. I don't remember him saying so as clearly in earlier writings as he did in this comment, so it's good we raised the issue.

comment by Halfwit · 2013-01-10T17:23:21.790Z · LW(p) · GW(p)

When a heuristic AI is creating a successor that shares its goals, does it insist on formally-verified self-improvements? Does it try understanding its mushy, hazy goal system so as to avoid reifying something it would regret given its current goals? It seems to me like some mind will eventually have to confront the FAI issue; why not humans, then?

Replies from: timtyler, JoshuaFox
comment by timtyler · 2013-01-11T00:04:50.239Z · LW(p) · GW(p)

If you check with Creating Friendly AI you will see that the term is defined by its primary proponent as follows:

The term “Friendly AI” refers to the production of human-benefiting, non-human harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.

It's an anthropocentric term. Only humans would care about creating this sort of agent. You would have to redefine the term if you want to use it to refer to something more general.

Replies from: MugaSofer
comment by MugaSofer · 2013-01-27T16:12:11.730Z · LW(p) · GW(p)

Half specifically referred to "creating a successor that shares its goals"; this is the problem we face when building an FAI. Nobody is saying an agent with arbitrary goals must at some point face the challenge of building an FAI.

(Incidentally, while Friendly is anthropocentric by default, in common usage analogous concepts relating to other species are referred to as "Friendly to X" or "X-Friendly", just as "good" is by default used to mean "good by human standards", but is sometimes used as "good for X".)

comment by JoshuaFox · 2013-01-10T20:25:05.452Z · LW(p) · GW(p)

does it insist on formally-verified self-improvements? Does it try understanding its mushy, hazy goal system so as to avoid reifying something it would regret given its current goals?

Apparently not. If it did do these things perfectly, it would not be what we are here calling the "heuristic AI."

comment by John_Maxwell (John_Maxwell_IV) · 2013-02-09T07:07:57.029Z · LW(p) · GW(p)

Does this essay say anything substantive beyond "maximize expected value"?

Replies from: timtyler
comment by timtyler · 2013-02-10T22:54:05.099Z · LW(p) · GW(p)

That isn't the point of the essay at all. It argues that over-caution can often be a bad strategy. I make a similar point in the context of superintelligence in my video on the risks of caution.

comment by JamesAndrix · 2013-01-10T23:37:50.978Z · LW(p) · GW(p)

I think we're going to get WBE's before AGI.

If we viewed this as a form of heuristic AI, it follows from your argument that we should look for ways to ensure friendliness of WBE's. (Ignoring the ethical issues here.)

Now, maybe this is because most real approaches would consider ethical issues, but it seems like figuring out how to modify a human brain so that it doesn't act against your interests even when it is powerful, and without hampering its intellect, is a big 'intractable' problem.

I suspect no one is working on it and no one is going to, even though we have working models of these intellects today. A new design might be easier to work with, but it will still be a lot harder than it will seem to be worth - as long as the AIs are doing near-human-level work.

Aim for an AI design whose safety is easy enough to work on that people actually will work on safety... and it will start to look a lot like SIAI ideas.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-11T09:28:32.726Z · LW(p) · GW(p)

Right, SI's basic idea is correct.

However, given that WBEs will in any case be developed (and we can mention IA as well), I'd like to see more consideration of how to keep brain-based AIs as safe as possible before they enter their Intelligence Explosion -- even though we understand that after an Explosion, there is little you can do.

Replies from: JamesAndrix
comment by JamesAndrix · 2013-01-12T08:39:02.032Z · LW(p) · GW(p)

One trouble is that this essentially tacks mind enslavement onto the WBE proposition. Nobody wants that. Uploads wouldn't volunteer. Even if a customer paid enough of a premium for an employee with loyalty modifications, that only rolls us back to relying on the good intent of the customer.

This comes down to the exact same arms race between friendly and 'just do it', with extra ethical and reverse-engineering hurdles. (I think we're pretty much stuck with testing and filtering based on behavior. And some modifications will only be testable after uploading is available.)

Mind you I'm not saying don't do work on this, I'm saying not much work will be done on it.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-12T16:27:10.767Z · LW(p) · GW(p)

Yes, creating WBEs or any other AIs that may have personhood, brings up a range of ethical issues on top of preventing human extinction.

comment by MugaSofer · 2013-12-09T17:26:29.728Z · LW(p) · GW(p)

A team which is ready to adopt a variety of imperfect heuristic techniques will have a decisive lead on approaches based on pure theory [...] even if the Friendliness theory provides the basis for intelligence, the nitty-gritty of SI’s implementation will still be far away, and will involve real-world heuristics and other compromises.

Citation very much needed. Neither of the two approaches has come anywhere near to self-improving AI.

SI should evangelize AGI safety to other researchers

I think they're already aware of this.

comment by Shmi (shminux) · 2013-01-10T18:30:38.169Z · LW(p) · GW(p)

do so before anyone else builds an AGI.

...the odds of which can also be improved by slowing down other groups, as has been pointed out before. Not that one would expect any such effort to be public.

Replies from: JoshuaFox, timtyler
comment by JoshuaFox · 2013-01-10T20:31:03.595Z · LW(p) · GW(p)

Not that one would expect any such effort to be public.

I sure don't know about any, but I am really not expecting some sort of Stuxnet plan to stymie other projects. SI people are very smart, but other AGI-ers are pretty smart too, and no more than minor interference could be expected.

On the other hand, reasoned explanations may serve to bring other researchers on-board. Goertzel has gradually taken unFriendliness more seriously over the years, for example. I can't say whether that's slowed anything down. But perhaps junior hotshots could be convinced, secretly or not, to quit dangerous projects.

Replies from: shminux
comment by Shmi (shminux) · 2013-01-10T22:49:19.677Z · LW(p) · GW(p)

Let's try a simple calculation. What is the expected FAI/UFAI ratio when friendliness is not proven? According to Eliezer's reply in this thread, it's close to zero:

your conditionally independent failure probabilities add up to 1 and you're 100% doomed.

So let's overestimate it as 1 in a million, as opposed to a more EY-like estimate of 1 in a gazillion. Of course, the more realistic odds would be dominated by an estimate that this estimate is wrong (e.g. that Eliezer is overly pessimistic), but I have yet to see him account for that, so let's keep the 1-in-a-million estimate.

What are the odds that an AGI development group designing a self-improving AGI without provability will finish before any FAI group? Probably way over 99%, given that provability appears to be the hard part. But let's be generous and make it 1% (say, because the SI group thinks that they are so much ahead of everyone else in the field).

What are the odds of success of SI actively working to slow down F-less AGI development long enough to develop a provably FAI (by, say, luring away the most promising best talent in the field, UFAI x-risk awareness education, or by other means)? Unless they are way less than (1 in a million)/1% = 0.01%, it makes sense to allocate a sizable chunk of the budget to thwarting non-provably FAI efforts, with the most effective strategies prioritized. What strategies are estimated to be effective, I have no idea. Assuming education and hiring has been estimated to be better than subversion or anything else dark-artsy, we should see the above-board efforts taking a good chunk of the budget. If these efforts are only marginal, then either SI sucks at Bayesianism or it is channeling the prevention resources elsewhere (like privately convincing people to not work on "dangerous projects"). Or the above calculation is way off.

Replies from: turchin, timtyler
comment by turchin · 2013-01-11T10:51:02.924Z · LW(p) · GW(p)

Even an evil creator of an AI needs some kind of control over his child, which could be called friendliness to one person. So any group which is seriously creating AGI and going to use it for any purpose should be interested in FAI theory. So it could be enough to explain to anyone who creates AGI that he needs some kind of F-theory, and that it should be mathematically proven.

Replies from: shminux
comment by Shmi (shminux) · 2013-01-11T18:50:44.941Z · LW(p) · GW(p)

Most people whose paycheck comes from designing a bomb have no trouble rationalizing it. Similarly, if your paycheck depends on AGI progress and not FAI progress, you will likely be unwilling to slow down or halt AGI development, and if you are willing, you will get fired and replaced.

Replies from: turchin
comment by turchin · 2013-01-11T20:06:27.052Z · LW(p) · GW(p)

I wanted to say that anyone who is creating AGI needs to control it somehow, and therefore needs some kind of analog of FAI, at least so as not to be killed himself. And this idea could be promoted to any AGI research group.

comment by timtyler · 2013-01-12T14:12:51.661Z · LW(p) · GW(p)

Let's try a simple calculation. What is the expected FAI/UFAI ratio when friendliness is not proven? According to Eliezer's reply in this thread, it's close to zero:

your conditionally independent failure probabilities add up to 1 and you're 100% doomed.

So let's overestimate it as 1 in a million, as opposed to a more EY-like estimate of 1 in a gazillion

Ignoring the issue of massive overconfidence, why do you even think these concepts are clearly enough defined to assign probability estimates to them like this? It seems pretty clear that they are not. Before discussing the probability of a poorly-defined class of events, it is best to try and say what it is that you are talking about.

Replies from: shminux
comment by Shmi (shminux) · 2013-01-13T19:16:23.895Z · LW(p) · GW(p)

Feel free to explain why it is not OK to assign probabilities in this case. Clearly EY does not shy away from doing so, as the quote indicates.

Replies from: timtyler
comment by timtyler · 2013-01-13T20:48:38.216Z · LW(p) · GW(p)

Well obviously you can assign probabilities to anything - but if the event is sufficiently vague, doing so in public is rather pointless - since no one else will know what event you are talking about.

I see that others have made the same complaint in this thread - e.g. Richard Loosemore:

before deciding exactly how many angels can dance on the head of a pin, you have to make sure the "angel" concept is meaningful enough that questions about angels are meaningful

comment by timtyler · 2013-01-11T00:20:44.414Z · LW(p) · GW(p)

Some attempts to sabotage competitors have been made public historically. Iran appears to be one victim of computer-sabotage.

comment by loup-vaillant · 2013-01-12T22:06:50.639Z · LW(p) · GW(p)

Even the provably friendly design will face real-world compromises and errors in its implementation, so the implementation will not itself be provably friendly.

Err… Coq? The impossibility of proving computer programs correct is a common trope, but also a false one. It's just very hard and very expensive to do for any sufficiently large program. Hopefully, a real-world implementation of the bootstrap code for whatever math is needed for the AI will be optimized for simplicity, and therefore will stand a chance at being formally proven.
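
(As a token illustration of what "formally proven" means here, a machine-checked statement in Lean; the proof assistant's kernel accepts it only because the term really does establish the claim. Scaling this kind of guarantee up to a whole system is exactly the hard, expensive part.)

```lean
-- The checker rejects anything that is not actually a proof of the stated claim.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```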

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-13T09:32:39.193Z · LW(p) · GW(p)

Yes, this is Eliezer's hope, and I certainly hope so too.

But an implementable AI, perhaps, will require a jumble of modules -- attention, vision, memory control, subgoal management, and so on -- making it a "sufficiently large [and complex] program" for which machine proof will not be feasible.

I don't know if that's true, but if it is, we're in trouble. To use AIXI as a point of comparison, albeit a weak one -- AIXI itself is amenable to mathematical proof, but once people start digging down to implementations of even toy systems, things get complicated fast.

Also, SI's future proofs and software may be logically sound internally, but fail to map correctly to the world; and the proof system itself can have bugs.

Replies from: loup-vaillant
comment by loup-vaillant · 2013-01-13T10:46:49.220Z · LW(p) · GW(p)

Well, there will always remain some logical uncertainty. Anyway, I remain relatively optimistic about the feasibility of a formal proof. My optimism comes from looking at our current systems.

Currently, a desktop system (OS + Desktop UI + office suite + Mail + Web browser) is about two hundred million lines of code (whether it is Windows, GNU/Linux, or Mac OS X). The folks at VPRI were able to build a prototype of similar functionality in about twenty thousand lines, which is 4 orders of magnitude smaller. (More details in their manifesto and their various progress reports.)

There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity. -- Fred Brooks

This is often understood as "we will never observe even a single order of magnitude improvement from new software-making techniques", which I think is silly or misinformed. The above tells me that we have at least 3 orders of magnitude ahead of us. Maybe not enough to get a provable AI design, but still damn closer than current techniques allow.

Replies from: JoshuaFox
comment by JoshuaFox · 2013-01-13T11:35:54.054Z · LW(p) · GW(p)

(OS + Desktop UI + office suite + Mail + Web browser) ... The above tells me that we have at least 3 orders of magnitude ahead of us.

We know a lot about what is needed in OS, Web software, etc., from experience.

Is it possible to go through 3 orders of magnitude of improvement in any system, such as a future AGI, without running a working system in between?

Replies from: loup-vaillant
comment by loup-vaillant · 2013-01-14T00:20:12.858Z · LW(p) · GW(p)

My reasoning is a bit different. We know that current desktop systems are a mess that we cobbled together over time with outdated techniques. VPRI basically showed that if we had known how to do it from the beginning, it would have been about 1000 times easier.

I expect the same to be true for any complex system, including AGI. If we cobble it together over time with outdated techniques, it will likely be 1000 times more complex than it needs to be. My hope is that we can actually avoid those complexities altogether. So that's not exactly an improvement, since there would be no crappy system to compare to.

As for the necessity of having a working system before we can improve its design… well, it's not always the case. I'm currently working on a metacompiler, and I'm refining its design right now, and I haven't even bootstrapped it yet.

comment by OrphanWilde · 2013-01-10T13:07:27.179Z · LW(p) · GW(p)

A genie which can't grant wishes, but can only tell you how to grant the wish yourself, is considerably safer than a genie which can grant wishes, and particularly safer than a genie that can grant wishes nobody has made.

I think there's a qualitative difference between the kind of AI most people are interested in making and the kind of AI Eliezer is interested in making. Eliezer is interested in creating an omnipotent, omniscient god; omnibenevolence becomes a necessary safety rule. Absent omnibenevolence, a merely omniscient god is safer. (Although as Eliezer's let-me-out-of-the-box experiments suggested, safer doesn't imply safe.) However, a non-god AI is safer still, and I think that's the kind of AI most research is going into. In general, AIs that can't self-modify are going to be safer than those which can, and AIs which aren't programmed with a desire to self-modify are going to be safer than those which are.

Replies from: JoshuaFox, RomeoStevens
comment by JoshuaFox · 2013-01-10T13:49:44.293Z · LW(p) · GW(p)

Thanks for your comment. But what does this tell us about SI's current R&D strategy and whether they should modify it?

a non-god AI is safer still, and I think that's the kind of AI most research is going into. ... AIs which aren't programmed with a desire to self-modify

A nascent AGI will self-improve to godlike levels. This is true even if it is not programmed with a desire to self-modify, since self-improvement is a useful technique in achieving other goals.

In general, AIs that can't self-modify are going to be safer than those which can

An interesting approach -- design an AI so that it can't self-modify. I don't know that I've seen that approach worked out in detail. Seems worth at least an article.

non-god AI ... the kind of AI most research...

Good point. Most AGI developers (there are not too many) are not seriously considering the post-Explosion stage. Even the few who are well aware of the possibility don't treat it seriously in their implementation work. But that doesn't mean that (if and when they succeed in making a nascent AGI) it won't explode.

Replies from: latanius, OrphanWilde
comment by latanius · 2013-01-11T02:46:18.260Z · LW(p) · GW(p)

design an AI so that it can't self-modify

Is there a clean border at all between self-modification and simply learning things? We have "design" and "operation" in two separate places on our maps, but they can easily be mixed up in reality (is it OK to modify interpreted source code if we leave the interpreter alone? What about following verbal instructions, then? Inventing them? Etc.)

Replies from: JoshuaFox, OrphanWilde
comment by JoshuaFox · 2013-01-11T09:21:04.004Z · LW(p) · GW(p)

Little consideration has been given to a block on self-modification because it seems to be impossible. You could use a non-von Neumann machine that separates data and code, but data can always be interpreted as code.
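As a minimal sketch of why the code/data split does not buy much (illustrative code only, not anyone's actual proposal): if the fixed, read-only part of the system is an interpreter, then the mutable data it consumes is effectively the program, and a system that can rewrite that data is self-modifying in every way that matters.

```python
# Minimal sketch (illustrative names only): the interpreter below is fixed --
# imagine it burned into read-only memory -- but the rules it executes are
# ordinary mutable data, and one rule rewrites the rule table itself. The
# system's behaviour changes even though its "code" never does.

def run(rules, state, steps=10):
    """Fixed interpreter: apply the first rule whose condition matches."""
    for _ in range(steps):
        for condition, action in rules:
            if condition(state):
                state, rules = action(state, rules)  # an action may return new rules
                break
    return state

count_up = (lambda s: s["x"] < 5,
            lambda s, r: ({**s, "x": s["x"] + 1}, r))

# Once x reaches 5, this rule throws away the old rule table and installs a new one.
rewrite_self = (lambda s: s["x"] >= 5,
                lambda s, r: (s, [(lambda s2: True,
                                   lambda s2, r2: ({**s2, "done": True}, r2))]))

print(run([count_up, rewrite_self], {"x": 0}))  # {'x': 5, 'done': True}
```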

Still, consideration should be given to whether anything can be done, even if only as a stopgap.

comment by OrphanWilde · 2013-01-11T21:08:20.324Z · LW(p) · GW(p)

Given that read-only hardware exists, yes, a clean border can be drawn, with the caveat that nothing is stopping the intelligence from emulating itself as if it were modified.

However - and it's an important however - emulating your own modified code isn't the same as modifying yourself. Just because you can imagine what your thought processes might be if you were sociopathic doesn't make you sociopathic; just because an AI can emulate a process to arrive at a different answer than it would have doesn't necessarily give it the power to -act- on that answer.

Which is to say, emulation can allow an AI to move past blocks on what it is permitted to think, but doesn't necessarily permit it to move past blocks on what it is permitted to do.

This is particularly important in the case of something like a goal system; if a bug would result in an AI breaking its own goal system on a self-modification, this bug becomes less significant if the goal system is read-only. It could emulate what it would do with a different goal system, but it would be evaluating solutions from that emulation within its original goal system.
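A hypothetical sketch of that arrangement (all names invented for illustration): candidate actions may come out of an emulation run under some alternative goal function, but the actual decision still goes through the original, read-only goal function.

```python
# Hypothetical sketch of the point above: the agent may *emulate* planning
# under some alternative goal function, but whatever candidates that emulation
# produces are still scored, and accepted or rejected, by the original,
# read-only goal function. Function and key names are made up for illustration.

ORIGINAL_GOAL = lambda outcome: outcome.get("paperclips", 0) - 10 * outcome.get("harm", 0)

def emulate_with_goals(candidate_actions, alternative_goal):
    """Imagine what an agent with different goals would pick (pure simulation)."""
    return max(candidate_actions, key=lambda a: alternative_goal(a["outcome"]))

def act(candidate_actions):
    """Actual decisions always go through the original goal function."""
    return max(candidate_actions, key=lambda a: ORIGINAL_GOAL(a["outcome"]))

actions = [
    {"name": "safe_plan",  "outcome": {"paperclips": 3, "harm": 0}},
    {"name": "risky_plan", "outcome": {"paperclips": 9, "harm": 2}},
]

# Emulation under a "maximize paperclips at any cost" goal picks the risky plan...
print(emulate_with_goals(actions, lambda o: o["paperclips"])["name"])  # risky_plan
# ...but the agent's real choice is still filtered through its original goals.
print(act(actions)["name"])  # safe_plan
```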

comment by OrphanWilde · 2013-01-10T14:23:12.127Z · LW(p) · GW(p)

A nascent AGI will self-improve to godlike levels. This is true even if it is not programmed with a desire to self-modify, since self-improvement is a useful technique in achieving other goals.

I think that depends on whether the AI in question is goal-oriented or not. It reminds me of a character from one of my fantasy stories: a genie with absolutely no powers, only an unassailable compulsion to grant wishes by any means necessary.

That is, you assume a goal of a general intelligence would be to become more intelligent. I think this is wrong for the same reason that assuming the general intelligence will share your morality is wrong (and indeed it might be precisely the same error, depending on your reasons for desiring more intelligence).

So I guess I should add something to the list: Goals make AI more dangerous. If the AI has any compulsion to respond to wishes at all, if it is in any respect a genie, it is more dangerous than if it weren't.

ETA: As for what it says about SI's research, I can't really say. Frankly, I think SI's work is probably more dangerous than what most of these people are doing. I'm extremely dubious of the notion of a provably-safe AI, because I suspect that safety can't be sufficiently rigorously defined.

Replies from: Baughn, Viliam_Bur, Luke_A_Somers, DaFranker
comment by Baughn · 2013-01-10T16:17:27.853Z · LW(p) · GW(p)

That story sounds very interesting. Can I read it somewhere?

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-10T21:40:09.139Z · LW(p) · GW(p)

Aside from the messaged summary, my brother liked the idea enough to start sketching out his own story based on it. AFAIK, it's a largely unexplored premise with a lot of interesting potential, and if you're inclined to do something with it, I'd love to read it in turn.

comment by Viliam_Bur · 2013-01-14T16:49:43.607Z · LW(p) · GW(p)

Goals make AI more dangerous.

So instead of having goals, the AI should just answer our questions correctly, and then turn itself off until we ask it again. Then it will be safe.

I mean, unless "answering the question correctly" already is a goal...

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-14T17:01:06.700Z · LW(p) · GW(p)

No. Safer. Safer doesn't imply safe.

I distinguish between a goal-oriented system and a motivation system more broadly; a computer without anything we would call AI can answer a question correctly, provided you pose the question in explicit/sufficient detail. The motivator for a computer sans AI is relatively simple, and it doesn't look for a question you didn't ask. Does it make sense to say that a computer has goals?

Taboo "goal", discuss only the motivational system involved, and the difference becomes somewhat clearer. You're including some implicit meaning in the word "goal" you may not realize; you're including complex motivational mechanisms. The danger arises from your motivational system, not from the behavior of a system which does what you ask. The danger arises from a motivational system which attempts to do more than you think you are asking it to do.

Replies from: Viliam_Bur
comment by Viliam_Bur · 2013-01-15T09:17:25.404Z · LW(p) · GW(p)

The discussion will necessarily be confused unless we propose a mechanism for how the AI answers the questions.

I suppose that to be smart enough to answer complex questions, the AI must have an ability to model the world. For example, Google Maps only has information about roads, so it can only answer questions about roads. It cannot even tell you "generally, this would be a good road, but I found on internet that tomorrow there will be some celebration in that area, so I inferred that the road could be blocked and it would be safer to plan another road". Or it cannot tell you "I recommend this other way, although it is a bit longer, because the gas stations are cheaper along that way, and from our previous conversations it seems to me that you care about the price more than about the time or distance per se". So we have a choice between an AI looking at a specified domain and ignoring the rest of the universe, and an AI capable of looking at the rest of the universe and finding data relevant to the question. Which one will we use?

The choice of a domain-limited AI is safer, but then it is our task to specify the domain precisely. The AI, however smart, will simply ignore all the solutions outside of the domain, even if they would be greatly superior to the in-domain answers. In other words, it would be unable to "think out of the box". You would miss good solutions only because you forgot to ask, or simply used a wrong word in the question. For example, there could be a relatively simple (for the AI) solution to double the human lifespan, but it would include something that we forgot to specify as a part of medicine, so the AI will never tell us. Or we will ask how to win a war, and the AI could see a relatively simple way to make peace, but it will never think that way, because we did not ask that. Think about the danger of this kind of AI if you give it more complex questions, for example how to best organize society. What are the important things you forgot to ask or to include in the problem domain?

On the other hand, a super-human domain-unlimited AI simply has a model of the universe, and it is an outcome pump. It includes a model of you, and of your reactions to what it says. Even if it has no concept of manipulation, it just sees your "decision tree" and chooses the optimal path -- optimal for maximizing the value of its answer to the question you asked. Here we have an AI already capable of manipulating humans, and we only need to suppose that it has a model of the world, and a function for deciding which of many possible answers is the best.

If the AI can model humans, it is unsafe. If the AI cannot model humans, it will give wrong answers whenever human reactions are part of the problem domain.
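A toy illustration of that trade-off, with all names and numbers invented: the domain-limited answerer silently drops any candidate outside its whitelist, however good that candidate would have been, while the unlimited answerer finds it precisely because nothing constrains what it may consider.

```python
# Toy illustration of the trade-off described above (all data and names invented):
# a domain-limited answerer silently drops any candidate outside its whitelist,
# even if that candidate would have scored far better.

DOMAIN = {"medicine"}   # the domain we remembered to specify

candidates = [
    {"plan": "new drug regimen",           "field": "medicine", "benefit": 10},
    {"plan": "fix an environmental toxin", "field": "ecology",  "benefit": 80},  # forgotten domain
]

def domain_limited_answer(candidates, domain):
    in_domain = [c for c in candidates if c["field"] in domain]
    return max(in_domain, key=lambda c: c["benefit"])

def unlimited_answer(candidates):
    return max(candidates, key=lambda c: c["benefit"])

print(domain_limited_answer(candidates, DOMAIN)["plan"])  # "new drug regimen"
print(unlimited_answer(candidates)["plan"])               # "fix an environmental toxin"
```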

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-15T14:39:42.199Z · LW(p) · GW(p)

I was following you up until your AI achieved godhood. Then we hit a rather sharp disparity in expectations.

Excepting that paragraph, is it fair to sum up your response as, "Not giving the AI sufficient motivational flexibility results in suboptimal results"?

Replies from: Viliam_Bur
comment by Viliam_Bur · 2013-01-15T15:31:13.677Z · LW(p) · GW(p)

Not allowing AI to model things outside of a narrowly specified domain results in suboptimal results.

(I don't like the word "motivation". Either the AI can process some kind of data, or it can not; either because the data are missing, or because the AI's algorithm does not take them into consideration. For example Google Maps cannot model humans, because it has no such data, and because its algorithm is unable to gather such data.)

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-15T18:50:16.944Z · LW(p) · GW(p)

I'm not talking about "can" or "can not" model, though; if you ask the AI to psychoanalyze you, it should be capable of modeling you.

I'm talking about - trying to taboo the word here - the system which causes the AI to engage in specific activities.

So in this case, the question is - what mechanism, within the code, causes the algorithm to consider some data or not. Assume a general-use algorithm which can process any kind of meaningful data.

Plugging in your general-use algorithm as the mechanism which determines what data to use gives the system considerable flexibility. It also potentially enables the AI to model humans whenever the information is deemed relevant, which could be every time it runs as it tries to decipher the question being asked; we've agreed that this is dangerous.

(It's very difficult to discuss this problem without proposing token solutions as examples of the "right" way to do it, even though I know they probably -aren't- right. Motivation was such a convenient abstraction of the concept.)

Generalizing the question, the issue comes down to the distinction between the AI asking itself what to do next as opposed to determining what the next logical step is. "What should I do next" is in fact a distinct question from "What should I do next to resolve the problem I'm currently considering".

The system which answers the question "What should I do next" is what I call the motivational system, in the sense of "motive force," rather than the more common anthropomorphized sense of motivation. It's possible that this system grants full authority to the logical process to determine what it needs to do - I'd call this an unfettered AI, in the TV Tropes sense of the word. A strong fetter would require the AI to consult its "What should I do next" system for every step in its "What should I do next to resolve the problem I'm currently considering" system.

At this point, have I made a convincing case of the distinction between the motivational system ("What should I do next?") versus the logical system ("What should I do next to resolve the problem I'm currently considering?")?

Replies from: Viliam_Bur
comment by Viliam_Bur · 2013-01-15T20:43:04.801Z · LW(p) · GW(p)

what mechanism, within the code, causes the algorithm to consider some data or not

I like this way of expressing it. This seems like a successful way to taboo various anthropomorphic concepts.

Unfortunately, I don't understand the distinction between "should do next?" and "should do next to resolve the problem?". Is the AI supposed to do something else besides solving the users' problems? Is it supposed to consist of two subsystems: one of them a general problem solver, and the other one some kind of gatekeeper saying "you are allowed to think about this, but not allowed to think about that"? If yes, then who decides what data the gatekeeper is allowed to consider? Is the gatekeeper the less smart part of the AI? Is the general-problem-solving part allowed to model the gatekeeper?

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-15T21:59:56.932Z · LW(p) · GW(p)

I wrote and then erased an example, based on a possibly apocryphal anecdote by Richard Feynman that I am recalling from memory, about the motivations for working on the Manhattan Project: the original reason for starting the project was to beat Germany to building an atomic bomb; after Germany was defeated, that reason was outdated, but he (and others sharing his motivation) continued working anyway, solving the immediate problem rather than the one they originally intended to solve.

That's an example of the logical system and the motivational system being in conflict, even if the anecdote doesn't turn out to be very accurate. I hope it is suggestive of the distinction.

The motivational system -could- be a gatekeeper, but I suspect that would mean there are substantive issues in how the logical system is devised. It should function as an enabler - as the motive force behind all actions taken within the logical system. And yes, in a sense it should be less intelligent than the logical system; if it considers everything to the same extent the logical system does, it isn't doing its job, it's just duplicating the efforts of the logical system.

That is, I'm regarding an ideal motivational system as something that drives the logical system; the logical system shouldn't be -trying- to trick its motivational system, in somewhat the same way and for the same reason you shouldn't try to convince yourself of a falsehood.

The issue in describing this is that I can think of plenty of motivational systems, but none which do what we want here. (Granted, if I could, the friendly AI problem might be substantively solved.) I can't even say for certain that a gatekeeper motivator wouldn't work.

Part of my mental model of this functional dichotomy, however, is that the logical system is stateless - if the motivational system asks it to evaluate its own solutions, it has to do so only with the information the motivational system gives it. The communication model has a very limited vocabulary. Rules for the system, but not rules for reasoning, are encoded into the motivational system, and govern its internal communications only. The logical system goes as far as it can with what it has, produces a set of candidate solutions and unresolved problems, and passes these back to the motivational system. Unresolved problems might be passed back with additional information necessary to resolve them, depending on the motivational system's rules.

So in my model-of-my-model, an Asimov-style AI might hand a problem to its logical system, get several candidate solutions back, and then pass those candidate solutions back into the logical system with the rules of robotics, one by one, asking if this action could violate each rule in turn, discarding any candidate solutions which do.
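One possible reading of that loop, as a sketch rather than a real design (all names invented): the logical system is a stateless function, and the motivational system owns the rules, calling the logical system once per rule to vet each candidate solution.

```python
# Sketch of the Asimov-style loop described above (invented names, not a real
# design). The logical system is a stateless function; the motivational system
# owns the rules and re-invokes the logical system once per rule to check each
# candidate solution, discarding violators.

def logical_system(query, context):
    """Stateless solver: answers only from what it is handed."""
    if query == "solve":
        return context["candidate_generator"](context["problem"])
    if query == "violates":
        rule, candidate = context["rule"], context["candidate"]
        return rule(candidate)
    raise ValueError(query)

def motivational_system(problem, candidate_generator, rules):
    candidates = logical_system("solve", {"problem": problem,
                                          "candidate_generator": candidate_generator})
    approved = []
    for candidate in candidates:
        # Ask the logical system about one rule at a time, discarding violators.
        if not any(logical_system("violates", {"rule": rule, "candidate": candidate})
                   for rule in rules):
            approved.append(candidate)
    return approved

# Illustrative use: two candidate plans, one of which violates a "no harm" rule.
rules = [lambda plan: plan.get("harms_human", False)]
generator = lambda problem: [{"plan": "A", "harms_human": False},
                             {"plan": "B", "harms_human": True}]
print(motivational_system("fetch coffee", generator, rules))  # only plan "A" survives
```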

Manual motivational systems are also conceptually possible, although probably too slow to be of much use.

[My apologies if this response isn't very good; I'm running short on time, and don't have any more time for editing, and in particular for deciding which pieces to exclude.]

comment by Luke_A_Somers · 2013-01-10T17:31:26.886Z · LW(p) · GW(p)

a genie with absolutely no powers

I presume you allowed it the mundane superpowers?

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-10T17:43:55.636Z · LW(p) · GW(p)

Granted, yes. He's actually a student researcher into unknown spells, and one of those spells is what transformed him into a genie of sorts. Strictly speaking, he possesses powers that would be exceptional in our world; they're just unremarkable in his. (His mundane superpowers include healing magic and the ability to create simple objects from thin air; that's the sort of world he exists in.)

comment by DaFranker · 2013-01-10T16:02:39.660Z · LW(p) · GW(p)

There is only one simple requirement for any AI to begin recursive self-improvement: learning of the theoretical possibility that more powerful or efficient algorithms, preferably with even more brainpower, could achieve the AI's goals or raise its utility levels faster than it currently can.

Going from there to "Let's create a better version of myself, because I'm currently the best algorithm I know of" isn't as huge a step as some people seem to implicitly believe, as long as the AI can infer its own existence or is self-aware in any manner.

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-10T17:46:02.543Z · LW(p) · GW(p)

Hence my second paragraph: Goals are inherently dangerous things to give AIs. Especially open-ended goals which would require an ever-better intelligence to resolve.

Replies from: latanius
comment by latanius · 2013-01-11T02:53:13.632Z · LW(p) · GW(p)

AIs that can't be described by attributing goals to them don't really seem too powerful (after all, intelligence is about making the world go in some direction; this is the only property that tells an AGI apart from a rock).

Replies from: OrphanWilde
comment by OrphanWilde · 2013-01-11T20:59:01.654Z · LW(p) · GW(p)

Evolution and capitalism are both non-goal-oriented, extremely powerful intelligences. Goals are only one form of motivator.

comment by RomeoStevens · 2013-01-11T04:25:49.194Z · LW(p) · GW(p)

This comes up enough that I wonder if there's an "oracles are still incredibly dangerous" FAQ somewhere.

Replies from: Qiaochu_Yuan, OrphanWilde
comment by Qiaochu_Yuan · 2013-01-11T04:34:38.933Z · LW(p) · GW(p)

I don't know if this counts as an FAQ, but what about Reply to Holden on 'Tool AI'?

Replies from: RomeoStevens
comment by RomeoStevens · 2013-01-11T04:49:00.498Z · LW(p) · GW(p)

Thanks!

comment by OrphanWilde · 2013-01-11T21:20:53.946Z · LW(p) · GW(p)

"Safer doesn't imply safe."

I think it's important to distinguish between what I consider a True Oracle - an AI with no internal motivation system, including goal systems - and an AGI which has been designed to -behave- like an Oracle. A True Oracle AI is -not- a general intelligence.

The difference is that an AGI designed to behave like an oracle tries to figure out what you want, and gives it to you. A True Oracle is necessarily quite stupid. From the linked article by Eliezer, this quote from Holden "construct_utility_function(process_user_input()) is just a human-quality function for understanding what the speaker wants" represents the difference. Encapsulating utility into your Oracle means your Oracle is behaving more like an agent than a tool; it's making decisions about what you want without consulting you about it.

In fact, as far as I define such things, we already have Oracle AIs. The computer itself is one; you tell it what your problem is, and it solves it for you. If it gives you the wrong answer, it's entirely because your problem specification is incomplete or incorrect. When I read people's discussions of Oracle AIs, what it seems they really want is an AI that can figure out what problem you're -really- trying to solve, given a poorly-defined problem, and solve -that- for you.

-That- is what is dangerous.

comment by turchin · 2013-01-10T12:23:50.210Z · LW(p) · GW(p)

I wrote about this before. My idea was that until the mathematical FAI is finished, we should suggest another type of friendliness, consisting of simple rules which each project could implement independently.

Simple friendliness: Plan B for AI