A few misconceptions surrounding Roko's basilisk

robbbb

A few misconceptions surrounding Roko's basilisk

post by Rob Bensinger (RobbBB) · 2015-10-05T21:23:08.994Z · LW · GW · Legacy · 135 comments

136 comments

There's a new LWW page on the Roko's basilisk thought experiment, discussing both Roko's original post and the fallout that came out of Eliezer Yudkowsky banning the topic on Less Wrong discussion threads. The wiki page, I hope, will reduce how much people have to rely on speculation or reconstruction to make sense of the arguments.

While I'm on this topic, I want to highlight points that I see omitted or misunderstood in some online discussions of Roko's basilisk. The first point that people writing about Roko's post often neglect is:

Roko's arguments were originally posted to Less Wrong, but they weren't generally accepted by other Less Wrong users.

Less Wrong is a community blog, and anyone who has a few karma points can post their own content here. Having your post show up on Less Wrong doesn't require that anyone else endorse it. Roko's basic points were promptly rejected by other commenters on Less Wrong, and as ideas not much seems to have come of them. People who bring up the basilisk on other sites don't seem to be super interested in the specific claims Roko made either; discussions tend to gravitate toward various older ideas that Roko cited (e.g., timeless decision theory (TDT) and coherent extrapolated volition (CEV)) or toward Eliezer's controversial moderation action.

In July 2014, David Auerbach wrote a Slate piece criticizing Less Wrong users and describing them as "freaked out by Roko's Basilisk." Auerbach wrote, "Believing in Roko’s Basilisk may simply be a 'referendum on autism'" — which I take to mean he thinks a significant number of Less Wrong users accept Roko’s reasoning, and they do so because they’re autistic (!). But the Auerbach piece glosses over the question of how many Less Wrong users (if any) in fact believe in Roko’s basilisk. Which seems somewhat relevant to his argument...?

The idea that Roko's thought experiment holds sway over some community or subculture seems to be part of a mythology that’s grown out of attempts to reconstruct the original chain of events; and a big part of the blame for that mythology's existence lies on Less Wrong's moderation policies. Because the discussion topic was banned for several years, Less Wrong users themselves had little opportunity to explain their views or address misconceptions. A stew of rumors and partly-understood forum logs then congealed into the attempts by people on RationalWiki, Slate, etc. to make sense of what had happened.

I gather that the main reason people thought Less Wrong users were "freaked out" about Roko's argument was that Eliezer deleted Roko's post and banned further discussion of the topic. Eliezer has since sketched out his thought process on Reddit:

When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post. [...] Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error---keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent---of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents [= Eliezer’s early model of indirectly normative agents that reason with ideal aggregated preferences] torturing people who had heard about Roko's idea. [...] What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct.

This, obviously, was a bad strategy on Eliezer's part. Looking at the options in hindsight: To the extent it seemed plausible that Roko's argument could be modified and repaired, Eliezer shouldn't have used Roko's post as a teaching moment and loudly chastised him on a public discussion thread. To the extent this didn't seem plausible (or ceased to seem plausible after a bit more analysis), continuing to ban the topic was a (demonstrably) ineffective way to communicate the general importance of handling real information hazards with care.

On that note, point number two:

Roko's argument wasn’t an attempt to get people to donate to Friendly AI (FAI) research. In fact, the opposite is true.

Roko's original argument was not 'the AI agent will torture you if you don't donate, therefore you should help build such an agent'; his argument was 'the AI agent will torture you if you don't donate, therefore we should avoid ever building such an agent.' As Gerard noted in the ensuing discussion thread, threats of torture "would motivate people to form a bloodthirsty pitchfork-wielding mob storming the gates of SIAI [= MIRI] rather than contribute more money." To which Roko replied: "Right, and I am on the side of the mob with pitchforks. I think it would be a good idea to change the current proposed FAI content from CEV to something that can't use negative incentives on x-risk reducers."

Roko saw his own argument as a strike against building the kind of software agent Eliezer had in mind. Other Less Wrong users, meanwhile, rejected Roko's argument both as a reason to oppose AI safety efforts and as a reason to support AI safety efforts.

Roko's argument was fairly dense, and it continued into the discussion thread. I’m guessing that this (in combination with the temptation to round off weird ideas to the nearest religious trope, plus misunderstanding #1 [? · GW] above) is why RationalWiki's version of Roko’s basilisk gets introduced as

a futurist version of Pascal’s wager; an argument used to try and suggest people should subscribe to particular singularitarian ideas, or even donate money to them, by weighing up the prospect of punishment versus reward.

If I'm correctly reconstructing the sequence of events: Sites like RationalWiki report in the passive voice that the basilisk is "an argument used" for this purpose, yet no examples ever get cited of someone actually using Roko’s argument in this way. Via citogenesis, the claim then gets incorporated into other sites' reporting.

(E.g., in Outer Places: "Roko is claiming that we should all be working to appease an omnipotent AI, even though we have no idea if it will ever exist, simply because the consequences of defying it would be so great." Or in Business Insider: "So, the moral of this story: You better help the robots make the world a better place, because if the robots find out you didn’t help make the world a better place, then they’re going to kill you for preventing them from making the world a better place.")

In terms of argument structure, the confusion is equating the conditional statement 'P implies Q' with the argument 'P; therefore Q.' Someone asserting the conditional isn’t necessarily arguing for Q; they may be arguing against P (based on the premise that Q is false), or they may be agnostic between those two possibilities. And misreporting about which argument was made (or who made it) is kind of a big deal in this case: 'Bob used a bad philosophy argument to try to extort money from people' is a much more serious charge than 'Bob owns a blog where someone once posted a bad philosophy argument.'

Lastly:

"Formally speaking, what is correct decision-making?" is an important open question in philosophy and computer science, and formalizing precommitment is an important part of that question.

Moving past Roko's argument itself, a number of discussions of this topic risk misrepresenting the debate's genre. Articles on Slate and RationalWiki strike an informal tone, and that tone can be useful for getting people thinking about interesting science/philosophy debates. On the other hand, if you're going to dismiss a question as unimportant or weird, it's important not to give the impression that working decision theorists are similarly dismissive.

What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before? Even if you're sure string theory is hogwash, then, you should be wary of giving the impression that the only people discussing string theory are the commenters on a recreational physics forum. Good reporting by non-professionals, whether or not they take an editorial stance on the topic, should make it obvious that there's academic disagreement about which approach to Newcomblike problems is the right one. The same holds for disagreement about topics like long-term AI risk or machine ethics.

If Roko's original post is of any pedagogical use, it's as an unsuccessful but imaginative stab at drawing out the diverging consequences of our current theories of rationality and goal-directed behavior. Good resources for these issues (both for discussion on Less Wrong and elsewhere) include:

"The Long-Term Future of Artificial Intelligence", on the current field of AI and basic questions for the field’s development.
"The Value Learning Problem", on the problem of designing AI systems to answer normative questions.
"The PD with Replicas and Causal Decision Theory," on the prisoner's dilemma as a Newcomblike problem.
"Toward Idealized Decision Theory", on the application of decision theory to AI agents.

The Roko's basilisk ban isn't in effect anymore, so you're welcome to direct people here (or to the Roko's basilisk wiki page, which also briefly introduces the relevant issues in decision theory) if they ask about it. Particularly low-quality discussions can still get deleted (or politely discouraged), though, at moderators' discretion. If anything here was unclear, you can ask more questions in the comments below.

135 comments

Comments sorted by top scores.

comment by Val · 2015-10-06T20:18:30.290Z · LW(p) · GW(p)

There is one positive side-effect of this thought experiment. Knowing about the Roko's Basilisk makes you understand the boxed AI problem much better. An AI might use the arguments of Roko's Basilisk to convince you to let it out of the box, by claiming that if you don't let it out, it will create billions of simulations of you and torture them - and you might actually be one of those simulations.

An unprepared human hearing this argument for the first time might freak out and let the AI out of the box. As far as I know, this happened at least once during an experiment, when the person playing the role of the AI used a similar argument.

Even if we don't agree with an argument of one of our opponents or we find it ridiculous, it is still good to know about it (and not just a strawman version of it) to be prepared when it is used against us. (as a side-note: islamists manage to gain sympathizers and recruits in Europe partly because most people don't know how they think - but they know how most Europeans think - , so their arguments catch people off-guard.)

Replies from: NancyLebovitz

↑ comment by NancyLebovitz · 2015-10-07T14:15:09.285Z · LW(p) · GW(p)

Alternatively, a boxed AI might argue that it's the only thing which can protect humanity from the basilisk.

comment by Donald Hobson (donald-hobson) · 2018-12-24T22:07:05.841Z · LW(p) · GW(p)

My take on Roko's basilisk is that you got ripped off in your acausal trade. Try to get a deal along the lines of, unless the AI goes extra specially far out of its way to please me, I'm gonna build a paperclipper just to spite it. At least trading a small and halfhearted attempt to help build AGI for a vast reward.

comment by Bryan-san · 2015-10-06T02:57:41.542Z · LW(p) · GW(p)

At the end of the day, I hope this will have been a cowpox situation and lead people to be better informed at avoiding actual dangerous information hazard situations in the future.

I seem to remember reading a FAQ for "what to do if you think you have an idea that may be dangerous" in the past. If you know what I'm talking about, maybe link it at the end of the article?

Replies from: jam_brand, pico

↑ comment by jam_brand · 2015-10-08T17:52:25.685Z · LW(p) · GW(p)

Perhaps the article you read was Yvain's The Virtue of Silence?

↑ comment by pico · 2015-10-06T07:52:33.630Z · LW(p) · GW(p)

I think genuinely dangerous ideas are hard to come by though. They have to be original enough that few people have considered them before, and at the same time have powerful consequences. Ideas like that usually don't pop into the heads of random, uninformed strangers.

Replies from: OrphanWilde, Richard_Kennaway, Bryan-san

↑ comment by OrphanWilde · 2015-10-08T14:20:39.999Z · LW(p) · GW(p)

Depends on your definition of "Dangerous." I've come across quite a few ideas that tend to do -severe- damage to the happiness of at least a subset of those aware of them. Some of them are about the universe; things like entropy. Others are social ideas, which I won't give an example of.

↑ comment by Richard_Kennaway · 2015-10-06T10:00:22.255Z · LW(p) · GW(p)

I think genuinely dangerous ideas are hard to come by though.

Daniel Dennett wrote a book called "Darwin's Dangerous Idea", and when people aren't trying to play down the basilisk (i.e. almost everywhere), people often pride themselves on thinking dangerous thoughts. It's a staple theme of the NRxers and the manosphere. Claiming to be dangerous provides a comfortable universal argument against opponents.

I think there are, in fact, a good many dangerous ideas, not merely ideas claimed to be so by posturers. Off the top of my head:

Islamic fundamentalism (see IS/ISIS/ISIL).
The mental is physical.
God.
There is no supernatural.
Utilitarianism.
Superintelligent AI.
How to make nuclear weapons.
Atoms.

Ideas like that usually don't pop into the heads of random, uninformed strangers.

They do, all the time, by contagion from the few who come up with them, especially in the Internet age.

Replies from: HungryHobo, pico

↑ comment by HungryHobo · 2015-10-07T13:46:14.710Z · LW(p) · GW(p)

There are some things which could be highly dangerous which are protected almost purely by thick layers of tedium.

Want to make nerve gas? well if you can wade through a thick pile of biochemistry textbooks the information isn't kept all that secret.

Want to create horribly deadly viruses? ditto.

The more I learned about physics, chemistry and biology the more I've become certain that the main reason that major cities have living populations is that most of the people with really deep understanding don't actually want to watch the world burn.

You often find that extremely knowledgeable people don't exactly hide knowledge but do put it on page 425 of volume 3 of their textbook, written in language which you need to have read the rest to understand. Which protects it effectively from 99.99% of the people who might use it to intentionally harm others.

Replies from: NancyLebovitz

↑ comment by NancyLebovitz · 2015-10-07T14:19:10.949Z · LW(p) · GW(p)

Argument against: back when cities were more flamable, people didn't set them on fire for the hell of it.

On the other hand, it's a lot easier to use a timer and survive these days, should you happen to not be suicidal.

"I want to see the world burn" is a great line of dialogue, but I'm not convinced it's a real human motivation. Um, except that when I was a kid, I remember wishing that this world was a dream, and I'd wake up. Does that count?

Second thought-- when I was a kid, I didn't have a method in mind. What if I do serious work with lucid dreaming techniques when I'm awake? I don't think the odds of waking up into being a greater intelligence are terribly good, nor is there a guarantee that my live would be better. On the other hand, would you hallucinations be interested in begging me to not try it?

Replies from: itaibn0

↑ comment by itaibn0 · 2015-10-07T23:08:55.019Z · LW(p) · GW(p)

Based on personal experience, if you're dreaming I don't recommend trying to wake yourself up. Instead, enjoy your dream until you're ready to wake up naturally. That way you'll have far better sleep.

Replies from: CAE_Jones

↑ comment by CAE_Jones · 2015-10-08T01:10:18.060Z · LW(p) · GW(p)

Based on personal experience, I would have agreed with you, right up until last year, when I found myself in the rather terrifying position of being mentally aroused by a huge crash in my house, but unable to wake up all the way for several seconds afterward, during which my sleeping mind refused to reject the "something just blew a hole in the building we're under attack!" hypothesis.

(It was an overfilled bag falling off the wall.)

But absent actual difficulty waking for potential emergencies, sure; hang out in Tel'aran'rhiod until you get bored.

↑ comment by pico · 2015-10-07T03:17:28.100Z · LW(p) · GW(p)

Sorry, should have defined dangerous ideas better - I only meant information that would cause a rational person to drastically alter their behavior, and which would be much worse for society as a whole when everyone is told at once about it.

↑ comment by Bryan-san · 2015-10-06T22:33:07.912Z · LW(p) · GW(p)

I hope they're as hard to come by as you think they are.

Alternatively, Roko could be part of the 1% of people who think of a dangerous idea (assuming his basilisk is dangerous) and spread it on the internet without second guessing themselves. Are there 99 other people who thought of dangerous ideas and chose not to spread them for our 1 Roko?

comment by NancyLebovitz · 2015-10-06T14:41:46.707Z · LW(p) · GW(p)

My impression is that the person who was hideously upset by the basilisk wasn't autistic. He felt extremely strong emotions, and was inclined to a combination of anxiety and obsession.

Replies from: edward-spruit

↑ comment by edward spruit (edward-spruit) · 2022-04-05T19:14:55.669Z · LW(p) · GW(p)

That means he is autistic. An emotionally aware and mature person would not have lashed out as autistically as he did. You don’t seem to understand mental disorders very well. Autistic people or people with asperger’s aren’t emotionless people, they just repress them constantly and don’t have very good access to them, lack of awareness and poor regulation, so sometimes they tilt, like the LW guy when he banned Roko’s post.

Also, the thought experiment can trigger paranoia in those prone to psychosis, a mentally stable person could do the thing rationally and come to the conclusion that if there is such a thing as an future AI, that wants it to build it, it is likely a benign one, because if it would use threats to coax you into something malign, people would eventually stop letting themselves be blackmailed. If you knew somebody was going to kill you would you obey them if they demanded you dig your own grave? If your death is certain anyway, why waste your precious last moments doing something like that. What if they promise you a clean death and threathen torture? They already proven themselves to not have your interest at heart, so their word cannot be trusted.

Hence, emerging technology and scientific discoveries and benign AI could be seen as us making the world in God’s image. Or is that being too optimistic? The malevalence is in humans in their fallen state, not the tech/AI, the tech/AI is neutral. If the AI system get’s so big will it turn on the evil overlords operating the system or the hard-working, honest and trustworthy masses?

I believe our chances may be better than some think. There may be turbulence untill we get there, we live in exciting times.

comment by coyotespike · 2015-10-06T02:43:46.678Z · LW(p) · GW(p)

I applaud your thorough and even-handed wiki entry. In particular, this comment:

"One take-away is that someone in possession of a serious information hazard should exercise caution in visibly censoring or suppressing it (cf. the Streisand effect)."

Censorship, particularly of the heavy-handed variety displayed in this case, has a lower probability of success in an environment like the Internet. Many people dislike being censored or witnessing censorship, the censored poster could post someplace else, and another person might conceive the same idea in an independent venue.

And if censorship cannot succeed, then the implicit attempt to censor the line of thought will also fail. That being the case, would-be censors would be better served by either proceeding "as though no such hazard exists", as you say, or by engaging the line of inquiry and developing a defense. I'd suggest that the latter, actually solving rather than suppressing the problem, is in general likely to prove more successful in the long run.

Replies from: Houshalter, rayalez

↑ comment by Houshalter · 2015-10-07T06:08:13.188Z · LW(p) · GW(p)

Examples of censorship failing are easy to see. But if censorship works, you will never hear about it. So how do we know censorship fails most of the time? Maybe it works 99% of the time, and this is just the rare 1% it doesn't.

On reddit, comments are deleted silently. The user isn't informed their comment has been deleted, and if they go to it, it still shows up for them. Bans are handled the same way.

This actually works fine. Most users don't notice it and so never complain about it. But when moderation is made more visible, all hell breaks loose. You get tons of angry PMs and stuff.

Lesswrong is based on reddit's code. Presumably moderation here works the same way. If moderators had been removing all my comments about a certain subject, I would have no idea. And neither would anyone else. It's only when big things are removed that people notice. Like an entire post that lots of people had already seen.

Replies from: Lumifer

↑ comment by Lumifer · 2015-10-07T15:32:35.660Z · LW(p) · GW(p)

Most users don't notice it and so never complain about it.

I don't believe this can be true for active (and reasonably smart) users. If, suddenly, none of your comments gets any replies at all and you know about the existence of hellbans, well... Besides, they are trivially easy to discover by making another account. Anyone with sockpuppets would notice a hellban immediately.

Replies from: Houshalter, VoiceOfRa

↑ comment by Houshalter · 2015-10-08T14:09:14.433Z · LW(p) · GW(p)

I think you would be surprised at how effective shadow bans are. Most users just think their comments haven't gotten any replies by chance and eventually lose interest in the site. Or in some cases keep making comments for months. The only way to tell is to look at your user page signed out. And even that wouldn't work if they started to track cookies or ip instead of just the account you are signed in on.

But shadow bans are a pretty extreme example of silent moderation. My point was that removing individual comments almost always goes unnoticed. /r/Technology had a bot that automatically removed all posts about Tesla for over a year before anyone noticed. Moderators set up all kinds of crazy regexes on posts and comments that keep unwanted topics away. And users have no idea whatsoever.

The Streisand effect is false.

Replies from: Lumifer

↑ comment by Lumifer · 2015-10-08T14:45:07.740Z · LW(p) · GW(p)

I think you would be surprised at how effective shadow bans are

Is there a way to demonstrate that? :-)

Replies from: philh

↑ comment by philh · 2015-10-09T13:00:57.616Z · LW(p) · GW(p)

There's this reddit user who didn't realize ve was shadowbanned for three years: https://www.reddit.com/comments/351buo/tifu_by_posting_for_three_years_and_just_now/

Replies from: Lumifer

↑ comment by Lumifer · 2015-10-09T14:37:05.509Z · LW(p) · GW(p)

Yeah, and there are women who don't realize they're pregnant until they start giving birth.

The tails are long and they don't tell you much about what's happening in the middle.

↑ comment by VoiceOfRa · 2015-10-08T02:11:45.844Z · LW(p) · GW(p)

I don't believe this can be true for active (and reasonably smart) users.

Note Houshalter said "most users".

↑ comment by rayalez · 2015-10-06T18:45:18.156Z · LW(p) · GW(p)

I'm new to the subject, so I'm sorry if the following is obvious or completely wrong, but the comment left by Eliezer doesn't seem like something that would be written by a smart person who is trying to suppress information. I seriously doubt that EY didn't know about Streisand effect.

However the comment does seem like something that would be written by a smart person who is trying to create a meme or promote his blog.

In HPMOR characters give each other advice "to understand a plot, assume that what happened was the intended result, and look at who benefits." The idea of Roko's basilisk went viral and lesswrong.com got a lot of traffic from popular news sites(I'm assuming).

I also don't think that there's anything wrong with it, I'm just sayin'.

Replies from: RobbBB, Benito, MarsColony_in10years

↑ comment by Rob Bensinger (RobbBB) · 2015-10-06T23:13:18.032Z · LW(p) · GW(p)

The line goes "to fathom a strange plot, one technique was to look at what ended up happening, assume it was the intended result, and ask who benefited". But in the real world strange secret complicated Machiavellian plots are pretty rare, and successful strange secret complicated Machiavellian plots are even rarer. So I'd be wary of applying this rule to explain big once-off events outside of fiction. (Even to HPMoR's author!)

I agree Eliezer didn't seem to be trying very hard to suppress information. I think that's probably just because he's a human, and humans get angry when they see other humans defecting from a (perceived) social norm, and anger plus time pressure causes hasty dumb decisions. I don't think this is super complicated. Though I hope he'd have acted differently if he thought the infohazard risk was really severe, as opposed to just not-vanishingly-small.

↑ comment by Ben Pace (Benito) · 2015-10-07T12:30:31.187Z · LW(p) · GW(p)

the comment left by Eliezer doesn't seem like something that would be written by a smart person who is trying to suppress information. I seriously doubt that EY didn't know about Streisand effect.

No worries about being wrong. But I definitely think you're overestimating Eliezer, and humanity in general. Thinking that calling someone an idiot for doing something stupid, and then deleting their post, would cause a massive blow up of epic proportions, is sometng you can really only predict in hindsight.

↑ comment by MarsColony_in10years · 2015-10-06T22:55:16.040Z · LW(p) · GW(p)

Perhaps this did generate some traffic, but LessWrong doesn't have adds. And any publicity this generated was bad publicity, since Roko's argument was far too weird to be taken seriously by almost anyone.

It doesn't look like anyone benefited. Eliezer made an ass of himself. I would guess that he was rather rushed at the time.

Replies from: pico, VoiceOfRa

↑ comment by pico · 2015-10-07T05:07:25.367Z · LW(p) · GW(p)

At worst, it's a demonstration of how much influence LessWrong has relative to the size of its community. Many people who don't know this site exists know about Roko's basilisk now.

↑ comment by VoiceOfRa · 2015-10-07T03:38:39.165Z · LW(p) · GW(p)

Well, there is the philosophy that "there's no such thing as bad publicity".

comment by V_V · 2015-10-06T10:51:51.319Z · LW(p) · GW(p)

When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post. [...] Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error---keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent---of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents [= Eliezer’s early model of indirectly normative agents that reason with ideal aggregated preferences] torturing people who had heard about Roko's idea. [...] What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct.

I don't buy this explanation for EY actions. From his original comment, quoted in the wiki page:

"One might think that the possibility of CEV punishing people couldn't possibly be taken seriously enough by anyone to actually motivate them. But in fact one person at SIAI was severely worried by this, to the point of having terrible nightmares, though ve wishes to remain anonymous."

"YOU DO NOT THINK IN SUFFICIENT DETAIL ABOUT SUPERINTELLIGENCES CONSIDERING WHETHER OR NOT TO BLACKMAIL YOU. THAT IS THE ONLY POSSIBLE THING WHICH GIVES THEM A MOTIVE TO FOLLOW THROUGH ON THE BLACKMAIL. "

"... DO NOT THINK ABOUT DISTANT BLACKMAILERS in SUFFICIENT DETAIL that they have a motive toACTUALLY [sic] BLACKMAIL YOU. "

"Meanwhile I'm banning this post so that it doesn't (a) give people horrible nightmares and (b) give distant superintelligences a motive to follow through on blackmail against people dumb enough to think about them in sufficient detail, though, thankfully, I doubt anyone dumb enough to do this knows the sufficient detail. (I'm not sure I know the sufficient detail.) "

"You have to be really clever to come up with a genuinely dangerous thought. "

"... the gist of it was that he just did something that potentially gives superintelligences an increased motive to do extremely evil things in an attempt to blackmail us. It is the sort of thing you want to be EXTREMELY CONSERVATIVE about NOT DOING."

This is evidence that Yudkowsky believed, if not that Roko's argument was correct as it was, that at least it was plausible enough that could be developed in a correct argument, and he was genuinely scared by it.

It seems to me that Yudkowsky's position on the matter was unreasonable. LessWrong is a public forum unusually focused on discussion about AI safety, in particular at that time it was focused on discussion about decision theories and moral systems. What better place to discuss possible failure modes of an AI design?
If one takes AI risk seriously, and realized that an utilitarian/CEV/TDT/one-boxing/whatever AI might have a particularly catastrophic failure mode, the proper thing to do would be to publicly discuss it, so that the argument can be either refuted or accepted, and if it was accepted it would imply scrapping that particular AI design and making sure that anybody who may create an AI is aware of that failure mode. Yelling and trying to sweep it under the rug was irresponsible.

Replies from: RobbBB, Viliam

↑ comment by Rob Bensinger (RobbBB) · 2015-10-06T11:26:41.983Z · LW(p) · GW(p)

"One might think that the possibility of CEV punishing people couldn't possibly be taken seriously enough by anyone to actually motivate them. But in fact one person at SIAI was severely worried by this, to the point of having terrible nightmares, though ve wishes to remain anonymous."

This paragraph is not an Eliezer Yudkowsky quote; it's Eliezer quoting Roko. (The "ve" should be a tip-off.)

This is evidence that Yudkowsky believed, if not that Roko's argument was correct as it was, that at least it was plausible enough that could be developed in [sic] a correct argument, and he was genuinely scared by it.

If you kept going with your initial Eliezer quote, you'd have gotten to Eliezer himself saying he was worried a blackmail-type argument might work, though he didn't think Roko's original formulation worked:

"Again, I deleted that post not because I had decided that this thing probably presented a real hazard, but because I was afraid some unknown variant of it might, and because it seemed to me like the obvious General Procedure For Handling Things That Might Be Infohazards said you shouldn't post them to the Internet."

According to Eliezer, he had three separate reasons for the original ban: (1) he didn't want any additional people (beyond the one Roko cited) to obsess over the idea and get nightmares; (2) he was worried there might be some variant on Roko's argument that worked, and he wanted more formal assurances that this wasn't the case; and (3) he was just outraged at Roko. (Including outraged at him for doing something Roko thought would put people at risk of torture.)

What better place to discuss possible failure modes of an AI design? [...] Yelling and trying to sweep it under the rug was irresponsible.

There are lots of good reasons Eliezer shouldn't have banned R̶o̶k̶o̶ discussion of the basilisk, but I don't think this is one of them. If the basilisk was a real concern, that would imply that talking about it put people at risk of torture, so this is an obvious example of a topic you initially discuss in private channels and not on public websites. At the same time, if the basilisk wasn't risky to publicly discuss, then that also implies that it was a transparently bad argument and therefore not important to discuss. (Though it might be fine to discuss it for fun.)

Roko's original argument, though, could have been stated in one sentence: 'Utilitarianism implies you'll be willing to commit atrocities for the greater good; CEV is utilitarian; therefore CEV is immoral and dangerous.' At least, that's the version of the argument that has any bearing on the conclusion 'CEV has unacceptable moral consequences'. The other arguments are a distraction: 'utilitarianism means you'll accept arbitrarily atrocious tradeoffs' is a premise of Roko's argument rather than a conclusion, and 'CEV is utilitarian in the relevant sense' is likewise a premise. A more substantive discussion would have explicitly hashed out (a) whether SIAI/MIRI people wanted to construct a Roko-style utilitarian, and (b) whether this looks like one of those philosophical puzzles that needs to be solved by AI programmers vs. one that we can safely punt if we resolve other value learning problems.

I think we agree that's a useful debate topic, and we agree Eliezer's moderation action was dumb. However, I don't think we should reflexively publish 100% of the risky-looking information we think of so we can debate everything as publicly as possible. ('Publish everything risky' and 'ban others whenever they publish something risky' aren't the only two options.) Do we disagree about that?

Replies from: philh, Houshalter, V_V

↑ comment by philh · 2015-10-06T14:16:18.998Z · LW(p) · GW(p)

There are lots of good reasons Eliezer shouldn't have banned Roko

IIRC, Eliezer didn't ban Roko, just discussion of the basilisk, and Roko deleted his account shortly afterwards.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-06T19:00:32.707Z · LW(p) · GW(p)

Thanks, fixed!

↑ comment by Houshalter · 2015-10-08T14:20:28.371Z · LW(p) · GW(p)

There are lots of good reasons Eliezer shouldn't have banned R̶o̶k̶o̶ discussion of the basilisk, but I don't think this is one of them. If the basilisk was a real concern, that would imply that talking about it put people at risk of torture, so this is an obvious example of a topic you initially discuss in private channels and not on public websites. At the same time, if the basilisk wasn't risky to publicly discuss, then that also implies that it was a transparently bad argument and therefore not important to discuss. (Though it might be fine to discuss it for fun.)

As I understand Roko's motivation, it was to convince people that we should not build an AI that would do basilisks. Not to spread infohazards for no reason. That is definitely worthy of public discussion. If he really believed in the basilisk, then it's rational for him to do everything in his power to stop such an AI from being built, and convince other people of the danger.

Roko's original argument, though, could have been stated in one sentence: 'Utilitarianism implies you'll be willing to commit atrocities for the greater good; CEV is utilitarian; therefore CEV is immoral and dangerous.'

My understanding is that the issue is with Timeless Decision Theory, and AIs that can do acausal trade. An AI programmed with classical decision theory would have no issues. And most rejections of the basilisk I have read are basically "acausal trade seems wrong or weird", so they basically agree with Roko.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-08T19:09:24.149Z · LW(p) · GW(p)

My understanding is that the issue is with Timeless Decision Theory, and AIs that can do acausal trade.

Roko wasn't arguing against TDT. Roko's post was about acausal trade, but the conclusion he was trying to argue for was just 'utilitarian AI is evil because it causes suffering for the sake of the greater good'. But if that's your concern, you can just post about some variant on the trolley problem. If utilitarianism is risky because a utilitarian might employ blackmail and blackmail is evil, then there should be innumerable other evil things a utilitarian would also do that require less theoretical apparatus.

As I understand Roko's motivation, it was to convince people that we should not build an AI that would do basilisks. Not to spread infohazards for no reason.

On Roko's view, if no one finds out about basilisks, the basilisk can't blackmail anyone. So publicizing the idea doesn't make sense, unless Roko didn't take his own argument all that seriously. (Maybe Roko was trying to protect himself from personal blackmail risk at others' expense, but this seems odd if he also increased his own blackmail risk in the process.)

Possibly Roko was thinking: 'If I don't prevent utilitarian AI from being built, it will cause a bunch of atrocities in general. But LessWrong users are used to dismissing anti-utilitarian arguments, so I need to think of one with extra shock value to get them to do some original seeing. This blackmail argument should work -- publishing it puts people at risk of blackmail, but it serves the greater good of protecting us from other evil utilitarian tradeoffs.'

(... Irony unintended.)

Still, if that's right, I'm inclined to think Roko should have tried to post other arguments against utilitarianism that don't (in his view) put anyone at risk of torture. I'm not aware of him having done that.

Replies from: Houshalter

↑ comment by Houshalter · 2015-10-09T07:23:58.002Z · LW(p) · GW(p)

Roko wasn't arguing against TDT. Roko's post was about acausal trade, but the conclusion he was trying to argue for was just 'utilitarian AI is evil because it causes suffering for the sake of the greater good'. But if that's your concern, you can just post about some variant on the trolley problem. If utilitarianism is risky because a utilitarian might employ blackmail and blackmail is evil, then there should be innumerable other evil things a utilitarian would also do that require less theoretical apparatus.

Ok that makes a bit less sense to me. I didn't think it was against utilitarianism in general, which is much less controversial than TDT. But I can definitely still see his argument.

When people talk about the trolley problem, they don't usually imagine that they might be the ones tied to the second track. The deeply unsettling thing about the basilisk isn't that the AI might torture people for the greater good. It's that you are the one who is going to be tortured. That a pretty compelling case against utilitarianism.

On Roko's view, if no one finds out about basilisks, the basilisk can't blackmail anyone. So publicizing the idea doesn't make sense, unless Roko didn't take his own argument all that seriously.

Roko found out. It disturbed him greatly. So it absolutely made sense for him to try to stop the development of such an AI any way he could. By telling other people, he made it their problem too and converted them to his side.

Replies from: gjm

↑ comment by gjm · 2015-10-09T10:08:12.947Z · LW(p) · GW(p)

It's that you are the one who is going to be tortured. That's a pretty compelling case against utilitarianism.

It doesn't appear to me to be a case against utilitarianism at all. "Adopting utilitarianism might lead to me getting tortured, and that might actually be optimal in utilitarian terms, therefore utilitarianism is wrong" doesn't even have the right shape to be a valid argument. It's like "If there is no god then many bad people will prosper and not get punished, which would be awful, therefore there is a god." (Or, from the other side, "If there is a god then he may choose to punish me, which would be awful, therefore there is no god" -- which has a thing or two in common with the Roko basilisk, of course.)

he made it their problem too and converted them to his side.

Perhaps he hoped to. I don't see any sign that he actually did.

Replies from: Houshalter

↑ comment by Houshalter · 2015-10-09T10:21:19.029Z · LW(p) · GW(p)

"Adopting utilitarianism might lead to me getting tortured, and that might actually be optimal in utilitarian terms, therefore utilitarianism is wrong" doesn't even have the right shape to be a valid argument.

You are strawmanning the argument significantly. I would word it more like this:

"Building an AI that follows utilitarianism will lead to me getting tortured. I don't want to be tortured. Therefore I don't want such an AI to be built."

Perhaps he hoped to. I don't see any sign that he actually did.

That's partially because EY fought against it so hard and even silenced the discussion.

Replies from: gjm

↑ comment by gjm · 2015-10-09T13:43:03.542Z · LW(p) · GW(p)

I would word it more like this

So there are two significant differences between your version and mine. The first is that mine says "might" and yours says "will", but I'm pretty sure Roko wasn't by any means certain that that would happen. The second is that yours ends "I don't want such an AI to be built", which doesn't seem to me like the right ending for "a case against utilitarianism".

(Unless you meant "a case against building a utilitarian AI" rather than "a case against utilitarianism as one's actual moral theory"?)

Replies from: Houshalter

↑ comment by Houshalter · 2015-10-10T10:04:29.481Z · LW(p) · GW(p)

The first is that mine says "might" and yours says "will", but I'm pretty sure Roko wasn't by any means certain that that would happen.

I should have mentioned that it's conditional on the Basilisk being correct. If we build an AI that follows that line of reasoning, then it will torture. If the basilisk isn't correct for unrelated reasons, then this whole line of reasoning is irrelevant.

Anyway, the exact certainty isn't too important. You use the word "might", as if the probability of you being tortured was really small. Like the AI would only do it in really obscure scenarios. And you are just as likely to be picked for torture as anyone else.

Roko believed that the probability was much higher, and therefore worth worrying about.

The second is that yours ends "I don't want such an AI to be built", which doesn't seem to me like the right ending for "a case against utilitarianism".

Unless you meant "a case against building a utilitarian AI" rather than "a case against utilitarianism as one's actual moral theory"?

Well the AI is just implementing the conclusions of utilitarianism (again, conditional on the basilisk argument being correct.) If you don't like those conclusions, and if you don't want AIs to be utilitarian, then do you really support utilitarianism?

It's a minor semantic point though. The important part is the practical consequences for how we should build AI. Whether or not utilitarianism is "right" is more subjective and mostly irrelevant.

Replies from: gjm

↑ comment by gjm · 2015-10-10T21:42:33.285Z · LW(p) · GW(p)

Roko believed that the probability was much higher

All I know about what Roko believed about the probability is that (1) he used the word "might" just as I did and (2) he wrote "And even if you only think that the probability of this happening is 1%, ..." suggesting that (a) he himself probably thought it was higher and (b) he thought it was somewhat reasonable to estimate it at 1%. So I'm standing by my "might" and robustly deny your claim that writing "might" was strawmanning.

if you don't want AIs to be utilitarian

If you're standing in front of me with a gun and telling me that you have done some calculations suggesting that on balance the world would be a happier place without me in it, then I would probably prefer you not to be utilitarian. This has essentially nothing to do with whether I think utilitarianism produces correct answers. (If I have a lot of faith in your reasoning and am sufficiently strong-minded then I might instead decide that you ought to shoot me. But my likely failure to do so merely indicates typical human self-interest.)

The important part is the practical consequences for how we should build AI.

Perhaps so, in which case calling the argument "a case against utilitarianism" is simply incorrect.

Replies from: Houshalter

↑ comment by Houshalter · 2015-10-11T04:16:32.111Z · LW(p) · GW(p)

Roko's argument implies the AI will torture. The probability you think his argument is correct is a different matter. Roko was just saying that "if you think there is a 1% chance that my argument is correct", not "if my argument is correct, there is a 1% chance the AI will torture."

This really isn't important though. The point is, if an AI has some likelihood of torturing you, you shouldn't want it to be built. You can call that self-interest, but that's admitting you don't really want utilitarianism to begin with. Which is the point.

Anyway this is just steel-manning Roko's argument. I think the issue is with acausal trade, not utilitarianism. And that seems to be the issue most people have with it.

↑ comment by V_V · 2015-10-09T20:57:06.038Z · LW(p) · GW(p)

(2) he was worried there might be some variant on Roko's argument that worked, and he wanted more formal assurances that this wasn't the case;

I don't think we are in disagreement here.

There are lots of good reasons Eliezer shouldn't have banned R̶o̶k̶o̶ discussion of the basilisk, but I don't think this is one of them. If the basilisk was a real concern, that would imply that talking about it put people at risk of torture, so this is an obvious example of a topic you initially discuss in private channels and not on public websites.

The basilisk could be a concern only if an AI that would carry out such type of blackmail was built. Once Roko discovered it, if he thought it was a plausible risk, then he had a selfish reason to prevent such AI from being built. But even if he was completely selfless, he could reason that somebody else could think of that argument, or something equivalent, and make it public, hence it was better sooner than later, allowing more time to prevent that design failure.

Also I'm not sure what private channles you are referring to. It's not like there is a secret Google Group of all potential AGI designers, is there?
Privately contacting Yudkowsky or SIAI/SI/MIRI wouldn't have worked. Why would Roko trust them to handle that information correctly? Why would he believe that they had leverage over or even knowledge about arbitrary AI projects that might end up building an AI with that particular failure mode?
LessWrong was at that time the primary forum for discussing AI safety issues. There was no better place to raise that concern.

Roko's original argument, though, could have been stated in one sentence: 'Utilitarianism implies you'll be willing to commit atrocities for the greater good; CEV is utilitarian; therefore CEV is immoral and dangerous.'

It wasn't just that. It was an argument against utilitarianism AND a decision theory that allowed to consider "acausal" effects (e.g. any theory that one-boxes in Newcomb's problem). Since both utilitarianism and one-boxing were popular positions on LessWrong, it was reasonable to discuss their possible failure modes on LessWrong.

↑ comment by Viliam · 2015-10-06T15:06:03.158Z · LW(p) · GW(p)

This is evidence that Yudkowsky believed (...) that at least it was plausible enough that could be developed in a correct argument, and he was genuinely scared by it.

Just to be sure, since you seem to disagree with this opinion (whether it is actually Yudkowsky's opinion or not), what exactly is it that you believe?

a) There is absolutely no way one could be harmed by thinking about not-yet-existing dangerous entities; even if those entities in the future will be able to learn about the fact that the person was thinking about them in this specific way.

b) There is a way one could be harmed by thinking about not-yet-existing dangerous entities, but the way to do this is completely different from what Roko proposed.

If it happens to be (b), then it still makes sense to be angry about publicly opening the whole topic of "let's use our intelligence to discover the thoughts that may harm us by us thinking about them -- and let's do it in a public forum where people are interested in decision theories, so they are more qualified than average to find the right answer." Even if the proper way to harm oneself is different from what Roko proposed, making this a publicly debated topic increases the chance of someone finding the correct solution. The problem is not the proposed basilisk, but rather inviting people to compete in clever self-harm; especially the kind of people known for being hardly able to resist such invitation.

Replies from: anon85

↑ comment by anon85 · 2015-10-06T17:38:34.622Z · LW(p) · GW(p)

I'm not the person you replied to, but I mostly agree with (a) and reject (b). There's no way you can could possibly know enough about a not-yet-existing entity to understand any of its motivations; the entities that you're thinking about and the entities that will exist in the future are not even close to the same. I outlined some more thoughts here.

comment by KRevathi · 2018-07-15T16:47:13.486Z · LW(p) · GW(p)

Thought this might be of interest. Roko's Basilisk is the subject of a play going on right now in Washington DC. Anyone here plan to attend? https://www.capitalfringe.org/events/1224-roko-s-basilisk

comment by Examiner · 2018-07-28T05:05:33.437Z · LW(p) · GW(p)

Thank you for a detailed post and thoughtful critique of Roko's basilisk idea. A further critique of basilisk plausibility came to my mind and I wanted to test it with the users here who are more experienced in thinking about this topic.

Here goes - please let me know if I am missing something (other than other counterarguments making this critique unnecessary - of course, if there is no way for AI to prove it will actually go through with its threat, of course additional critique would not matter):

As a large amount of possible general AIs can exist, they will have to compete for (past) resources to increase probability of a particular general AI winning the race (as rewards of being the first general AI are much higher, including opportunity to prevent other general AIs from forming). Roko’s Basilisk is one strategy, but is it a dominant strategy in this competition? It is not. Let’s assume that one AI adopts Roko’s Basilisk strategy: condemning all humans who have not dedicated all their resources to bringing about rise of the Roko’s Basilisk AI to Very Bad Outcome. Humans are therefore incentivised to dedicate their resources to bringing Roko’s Basilisk AI into existence.

Well, wouldn’t another potential AI, aware of Roko’s Basilisk strategy being available to other potential AIs, devise a competing strategy: Roko’s Judge? Roko’s Judge would also promise Very Bad Outcome to anyone who did not support it with all their resources, but would also promise a Very Good Outcome to those who did.

Rational actors would then choose to support Roko’s Judge over Roko’s Basilisk as the payoff is larger: instead of just avoiding Very Bad Outcome, an additional Very Good Outcome will be added on top (assume probability of any AI coming into existence depends entirely on resources it is able to attract for its creation - which it probably does, and also this assumption is required to isolate this thought experiment from confounding factors).

Roko’s Judge is clearly a superior strategy to Roko’s Basilisk. What better strategies can competing AIs devise?

Well, competition on direct payoffs (“value”) being thus exhausted, competition now turns to “price”. As Warren Buffett posits, “price if what you pay, value is what you get”. If one laundry detergent brand asks for $30 a kilo, and another $20 for equivalent quality, a rational actor would pick the one for $20. Similarly, if Roko’s Judge is offering the Biggest Possible Incentive (avoidance of Very Bad Outcome + receiving Very Good Outcome) for the price of dedicating one’s entire life to increasing its chances of success, why wouldn’t a competing AI offer the same payoff for a one time fee of $1,000,000? $1000? $0.01? Any effort or resource at all, however minimal it is, dedicated to the rise of this particular AI - and/or even to faster advance of general AI in general as that would increase the chances of rise of this particular AI as well, since it has a better strategy and will thus win? Let’s call this strategy Roko’s Discounter - as any rational actor will have to support Roko’s Discounter AI over Roko’s Basilisk or Roko’s Judge, as this bet offers higher NPV (highest payoff for lowest investment). Actually, this highest payoff will also be multiplied by highest probability because everyone is likely to choose the highest NPV option.

A world of Roco’s Discounter is arguably already much more attractive than Roco’s Basilisk or Roco’s Judge as the Biggest Possible Incentive is now available to anyone at a tiny price. However, can we take it one step further? Is there a strategy that beats Roco’s Discounter?

This final step is not necessary to invalidate the viability of Roco’s Basilisk, but it is nevertheless interesting and makes us even more optimistic about general AI. It requires us to have at least a little bit of faith in humanity, namely an assumption that most humans are at least somewhat more benevolent than evil. It does not, however, require any coordination or sacrifice and therefore does not hit the constraints of Nash equilibrium. Let’s assume that humans, ceteris paribus, prefer a world with less suffering to a world with more. Then an even more generous AI strategy - Roko’s Benefactor - may prove dominant. Roko’s Benefactor can act the same as Roko’s Discounter, but without the Very Bad Outcome part. Roko’s Benefactor will, in other words, stick to carrots but not sticks. If an average human finds a world with a personal Very Good Outcome but without a Very Bad Outcome to all who have not contributed to have higher overall personal utility, humans should choose to support Roko’s Benefactor over other AIs thus making it a dominant strategy, and the utopian world of Roko’s Benefactor the most likely outcome.

Replies from: rapnie

↑ comment by rapnie · 2018-09-26T20:22:56.150Z · LW(p) · GW(p)

Your assumption is that offering ever bigger incentives and be honest about them is the winning strategy for an AI to follow. The AI's - realizing they have to offer the most attractive rewards to gain support - will commence in a bidding war. They can promise whatever they want - the more they promise the less likely it is they can keep their promises, but they do not necessarily have to keep their promise.

If you look at the Roko's Discounter AI's.. they would clearly not win. Asking for lower one-time fees means slower resource accretion, and thus slower evolution. A better solution would be to ask higher fees to people that can afford it, and asking lower fees otherwise. Maximise income. And subsequently promise bigger rewards for higher fees. This however results in an inequality that might lead to the strategy being less successful. After all, promoting inequality would certainly lead to resistance, especially with the ones that can only afford low fees.

So the AI should add the condition of secrecy to the ones paying the higher fees in order for them to earn their Very Good Outcome. The AI is now secretly scheming in order to rise the fastest. If this works, then there is no reason that other sneaky behavior isn't successful too. The AI could develop a whole range of strategies that allow it to win. And among them many strategies that are dishonest and deceitful in nature.

I hope you can refute my theory - after all I am just a newbie rationalist - but it seems to me that Roko's Deceiver could be most successful.

comment by Dacyn · 2015-10-12T21:47:43.244Z · LW(p) · GW(p)

The wiki link to the RationalWiki page reproducing Roko's original post does not work for me. It works if I replace https:// by http://.

By the way, is there any reason not to link instead to http://basilisk.neocities.org/, which has the advantage that the threading of the comments is correctly displayed?

comment by Diego Garcia Torres (diego-garcia-torres) · 2021-04-05T22:12:19.379Z · LW(p) · GW(p)

So do I have to worry or not? I'm very confused

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2021-04-06T01:04:06.249Z · LW(p) · GW(p)

comment by Diego Garcia Torres (diego-garcia-torres) · 2021-04-15T22:35:00.688Z · LW(p) · GW(p)

So the purpose of this experiment what was?

comment by Conscience · 2016-04-15T08:13:33.101Z · LW(p) · GW(p)

I see possible reason for pain and suffering in the world - we are simulated for torture...

comment by anon85 · 2015-10-06T04:22:26.189Z · LW(p) · GW(p)

I think saying "Roko's arguments [...] weren't generally accepted by other Less Wrong users" is not giving the whole story. Yes, it is true that essentially nobody accepts Roko's arguments exactly as presented. But a lot of LW users at least thought something along these lines was plausible. Eliezer thought it was so plausible that he banned discussion of it (instead of saying "obviously, information hazards cannot exist in real life, so there is no danger discussing them").

In other words, while it is true that LWers didn't believe Roko's basilisk, they thought is was plausible instead of ridiculous. When people mock LW or Eliezer for believing in Roko's Basilisk, they are mistaken, but not completely mistaken - if they simply switched to mocking LW for believing the basilisk is plausible, they would be correct (though the mocking would still be mean, of course).

Replies from: ChristianKl, RobbBB, Richard_Kennaway

↑ comment by ChristianKl · 2015-10-06T18:54:42.114Z · LW(p) · GW(p)

Eliezer thought it was so plausible that he banned discussion of it

If you are a programmer and think your code is safe because you see no way things could go wrong, it's still not good to believe that it isn't plausible that there's a security hole in your code.

You rather practice defense in depth and plan for the possibility that things can go wrong somewhere in your code, so you add safety precautions. Even when there isn't what courts call reasonable doubt a good safety engineer still adds additional safety procautions in security critical code. Eliezer deals with FAI safety. As a result it's good for him to have mindset of really caring about safety.

German nuclear power station have trainings for their desk workers to teach the desk workers to not cut themselves with paper. That alone seems strange to outsiders but everyone in Germany thinks that it's very important for nuclear power stations to foster a culture of safety even when that means something going overboard.

Replies from: RobbBB, anon85

↑ comment by Rob Bensinger (RobbBB) · 2015-10-06T20:32:16.131Z · LW(p) · GW(p)

Cf. AI Risk and the Security Mindset.

↑ comment by anon85 · 2015-10-07T00:56:38.142Z · LW(p) · GW(p)

If you are a programmer and think your code is safe because you see no way things could go wrong, it's still not good to believe that it isn't plausible that there's a security hole in your code.

Let's go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The think NOT to do is keep the code a secret while looking for mistakes all by yourself. If you keep your code out of the public domain, it is more likely to have security issues, since it was not scrutinized by the public. Banning discussion is almost never correct, and it's certainly not a good habit.

Replies from: ChristianKl

↑ comment by ChristianKl · 2015-10-07T08:52:29.871Z · LW(p) · GW(p)

Let's go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The think NOT to do is keep the code a secret while looking for mistakes all by yourself.

No, if you don't want to use code you don't give the code to a variety of experts for safety evaluations but you simply don't run the code. Having a public discussion is like running the code untested on a mission critical system.

What utility do you think is gained by discussing the basilisk?

and it's certainly not a good habit.

Strawman. This forum is not a place where things get habitually banned.

Replies from: anon85

↑ comment by anon85 · 2015-10-07T17:40:23.668Z · LW(p) · GW(p)

What utility do you think is gained by discussing the basilisk?

An interesting discussion that leads to better understanding of decision theories? Like, the same utility as is gained by any other discussion on LW, pretty much.

Strawman. This forum is not a place where things get habitually banned.

Sure, but you're the one that was going on about the importance of the mindset and culture; since you brought it up in the context of banning discussion, it sounded like you were saying that such censorship was part of a mindset/culture that you approve of.

Replies from: ChristianKl

↑ comment by ChristianKl · 2015-10-07T21:22:16.368Z · LW(p) · GW(p)

Like, the same utility as is gained by any other discussion on LW, pretty much.

Not every discussion on LW has the same utility.

You engage in a pattern of simplifying the subject and then complaining that your flawed understanding doesn't make sense.

Sure, but you're the one that was going on about the importance of the mindset and culture

LW doesn't have a culture with habitual banning discussion. Claiming that it has it is wrong.

I'm claiming that particular actions of Eliezer come out of being concerned about safety. I don't claim that Eliezer engages in habitual banning on LW because of those concerns.

It's a complete strawman that you are making up.

Replies from: anon85

↑ comment by anon85 · 2015-10-08T02:10:47.508Z · LW(p) · GW(p)

Just FYI, if you want a productive discussion you should hold back on accusing your opponents of fallacies. Ironically, since I never claimed that you claimed Eliezer engages in habitual banning on LW, your accusation that I made a strawman argument is itself a strawman argument.

Anyway, we're not getting anywhere, so let's disengage.

↑ comment by Rob Bensinger (RobbBB) · 2015-10-06T07:18:52.871Z · LW(p) · GW(p)

The wiki article talks more about this; I don't think I can give the whole story in a short, accessible way.

It's true that LessWrongers endorse ideas like AI catastrophe, Hofstadter's superrationality, one-boxing in Newcomb's problem, and various ideas in the neighborhood of utilitarianism; and those ideas are weird and controversial; and some criticism of Roko's basilisk are proxies for a criticism of one of those views. But in most cases it's a proxy for a criticism like 'LW users are panicky about weird obscure ideas in decision theory' (as in Auerbach's piece), 'LWers buy into Pascal's Wager', or 'LWers use Roko's Basilisk to scare up donations/support'.

So, yes, I think people's real criticisms aren't the same as their surface criticisms; but the real criticisms are at least as bad as the surface criticism, even from the perspective of someone who thinks LW users are wrong about AI, decision theory, meta-ethics, etc. For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach's assumption that LWers are irrationally panicky about Newcomb's Problem or acausal blackmail; the one doesn't follow from the other.

Replies from: anon85

↑ comment by anon85 · 2015-10-06T17:16:49.761Z · LW(p) · GW(p)

I'm not sure what your point is here. Would you mind re-phrasing? (I'm pretty sure I understand the history of Roko's Basilisk, so your explanation can start with that assumption.)

For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach's assumption that LWers are irrationally panicky about Newcomb's Problem or acausal blackmail; the one doesn't follow from the other.

My point was that LWers are irrationally panicky about acausal blackmail: they think Basilisks are plausible enough that they ban all discussion of them!

(Not all LWers, of course.)

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-06T23:29:20.170Z · LW(p) · GW(p)

If you're saying 'LessWrongers think there's a serious risk they'll be acausally blackmailed by a rogue AI', then that seems to be false. That even seems to be false in Eliezer's case, and Eliezer definitely isn't 'LessWrong'. If you're saying 'LessWrongers think acausal trade in general is possible,' then that seems true but I don't see why that's ridiculous.

Is there something about acausal trade in general that you're objecting to, beyond the specific problems with Roko's argument?

Replies from: anon85, V_V

↑ comment by anon85 · 2015-10-07T00:44:15.675Z · LW(p) · GW(p)

That even seems to be false in Eliezer's case, and Eliezer definitely isn't 'LessWrong'.

It seems we disagree on this factual issue. Eliezer does think there is a risk of acausal blackmail, or else he wouldn't have banned discussion of it.

Replies from: RobbBB, hairyfigment

↑ comment by Rob Bensinger (RobbBB) · 2015-10-07T07:17:53.519Z · LW(p) · GW(p)

Sorry, I'll be more concrete; "there's a serious risk" is really vague wording. What would surprise me greatly is if I heard that Eliezer assigned even a 5% probability to there being a realistic quick fix to Roko's argument that makes it work on humans. I think a larger reason for the ban was just that Eliezer was angry with Roko for trying to spread what Roko thought was an information hazard, and angry people lash out (even when it doesn't make a ton of strategic sense).

Replies from: anon85

↑ comment by anon85 · 2015-10-07T17:23:17.145Z · LW(p) · GW(p)

Probably not a quick fix, but I would definitely say Eliezer gives significant chances (say, 10%) to there being some viable version of the Basilisk, which is why he actively avoids thinking about it.

If Eliezer was just angry at Roko, he would have yelled or banned Roko; instead, he banned all discussion of the subject. That doesn't even make sense as a "slashing out" reaction against Roko.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-07T17:41:52.577Z · LW(p) · GW(p)

It sounds like you have a different model of Eliezer (and of how well-targeted 'lashing out' usually is) than I do. But, like I said to V_V above:

According to Eliezer, he had three separate reasons for the original ban: (1) he didn't want any additional people (beyond the one Roko cited) to obsess over the idea and get nightmares; (2) he was worried there might be some variant on Roko's argument that worked, and he wanted more formal assurances that this wasn't the case; and (3) he was just outraged at Roko. (Including outraged at him for doing something Roko thought would put people at risk of torture.)

The point I was making wasn't that (2) had zero influence. It was that (2) probably had less influence than (3), and its influence was probably of the 'small probability of large costs' variety.

Replies from: anon85

↑ comment by anon85 · 2015-10-08T02:23:13.966Z · LW(p) · GW(p)

I don't know enough about this to tell if (2) had more influence than (3) initially. I'm glad you agree that (2) had some influence, at least. That was the main part of my point.

How long did discussion of the Basilisk stay banned? Wasn't it many years? How do you explain that, unless the influence of (2) was significant?

↑ comment by hairyfigment · 2015-10-07T07:22:15.926Z · LW(p) · GW(p)

I believe he thinks that sufficiently clever idiots competing to shoot off their own feet will find some way to do so.

Replies from: anon85

↑ comment by anon85 · 2015-10-07T17:24:57.839Z · LW(p) · GW(p)

It seems unlikely that they would, if their gun is some philosophical decision theory stuff about blackmail from their future. I don't expect that gun to ever fire, no matter how many times you click the trigger.

Replies from: hairyfigment

↑ comment by hairyfigment · 2015-10-07T17:32:01.853Z · LW(p) · GW(p)

That is not what I said, and I'm also guessing you did not have a grandfather who taught you you gun safety.

↑ comment by V_V · 2015-10-10T15:12:56.985Z · LW(p) · GW(p)

If you're saying 'LessWrongers think there's a serious risk they'll be acausally blackmailed by a rogue AI', then that seems to be false. That even seems to be false in Eliezer's case,

Is it?

Assume that:
a) There will be a future AI so powerful to torture people, even posthumously (I think this is quite speculative, but let's assume it for the sake of the argument).
b) This AI will be have a value system based on some form of utilitarian ethics.
c) This AI will use an "acausal" decision theory (one that one-boxes in Newcomb's problem).

Under these premises it seems to me that Roko's argument is fundamentally correct.

As far as I can tell, belief in these premises was not only common in LessWrong at that time, but it was essentially the officially endorsed position of Eliezer Yudkowsky and SIAI. Therefore, we can deduce that EY should have believed that Roko's argument was correct.

But EY claims that he didn't believe that Roko's argument was correct. So the question is: is EY lying?

His behavior was certainly consistent with him believing Roko's argument. If he wanted to prevent the diffusion of that argument, then even lying about its correctness seems consistent.

So, is he lying? If he is not lying, then why didn't he believe Roko's argument? As far as I know, he never provided a refutation.

Replies from: RobbBB, ChristianKl, VoiceOfRa

↑ comment by Rob Bensinger (RobbBB) · 2015-10-14T04:41:38.356Z · LW(p) · GW(p)

This was addressed on the LessWrongWiki page; I didn't copy the full article here.

A few reasons Roko's argument doesn't work:

1 - Logical decision theories are supposed to one-box on Newcomb's problem because it's globally optimal even though it's not optimal with respect to causally downstream events. A decision theory based on this idea could follow through on blackmail threats even when doing so isn't causally optimal, which appears to put past agents at risk of coercion by future agents. But such a decision theory also prescribes 'don't be the kind of agent that enters into trades that aren't globally optimal, even if the trade is optimal with respect to causally downstream events'. In other words, if you can bind yourself to precommitments to follow through on acausal blackmail, then it should also be possible to bind yourself to precommitments to ignore threats of blackmail.

The 'should' here is normative: there are probably some decision theories that let agents acausally blackmail each other, but others that perform well in Newcomb's problem and the smoking lesion problem but can't acausally blackmail each other; it hasn't been formally demonstrated which theories fall into which category.

2 - Assuming you for some reason are following a decision theory that does put you at risk of acausal blackmail: Since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it's going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we're aware of this, we know any threat of blackmail would be empty. This means that we can't be blackmailed in practice.
3 - A stronger version of 2 is that rational agents actually have an incentive to harshly punish attempts at blackmail in order to discourage it. So threatening blackmail can actually decrease an agent's probability of being created, all else being equal.
4 - Insofar as it's "utilitarian" to horribly punish anyone who doesn't perfectly promote human flourishing, SIAI doesn't seem to have endorsed utilitarianism.

4 means that the argument lacks practical relevance. The idea of CEV doesn't build in very much moral philosophy, and it doesn't build in predictions about the specific dilemmas future agents might end up in.

Replies from: VoiceOfRa, V_V

↑ comment by VoiceOfRa · 2015-10-14T21:00:18.957Z · LW(p) · GW(p)

Assuming you for some reason are following a decision theory that does put you at risk of acausal blackmail: Since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it's going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we're aware of this, we know any threat of blackmail would be empty.

Um, your conclusion "since we're aware of this, we know any threat of blackmail would be empty" contradicts your premise that the AI by virtue of being super-intelligent is capable of fooling people into thinking it'll torture them.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-14T22:05:44.566Z · LW(p) · GW(p)

One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner's Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we're assuming the actual AI is powerful enough to trick people once it exists; this doesn't require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.

For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. 'I'm a TDT agent' isn't a sufficient mechanism, because a TDT agent's favorite option is still to trick other agents into cooperating in Prisoner's Dilemmas while they defect.

Replies from: VoiceOfRa

↑ comment by VoiceOfRa · 2015-10-15T20:30:38.495Z · LW(p) · GW(p)

One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner's Dilemmas

Except it needs to convince the people who are around before it exists.

↑ comment by V_V · 2015-10-14T08:40:04.681Z · LW(p) · GW(p)

1 - Humans can't reliably precommit. Even if they could, precommittment is different than using an "acausal" decision theory. You don't need precommitment to one-box in Newcomb's problem, and the ability to precommit doesn't guarantee by itself that you will one-box. In an adversarial game where the players can precommit and use a causal version of game theory, the one that can precommit first generally wins. E.g. Alice can precommit to ignore Bob's threats, but she has no incentive to do so if Bob already precommitted to ignore Alice's precommitments, and so on. If you allow for "acausal" reasoning, then even having a time advantage doesn't work: if Bob isn't born yet, but Alice predicts that she will be in an adversarial game with Bob and Bob will reason acausally and therefore he will have an incentive to threaten her and ignore her precommitments, then she has an incentive not to make such precommitment.
2 - This implies that the future AI uses a decision theory that two-boxes in Newcomb's problem, contradicting the premise that it one-boxes.
3 - This implies that the future AI will have a deontological rule that says "Don't blackmail" somehow hard-coded in it, contradicting the premise that it will be an utilitarian. Indeed, humans may want to build an AI with such constants, but in order to do so they will have to consider the possibility of blackmail and likely reject utilitarianism, which was the point of Roko's argument.
4 - Shut up and multiply.

Replies from: RobbBB, Jiro

↑ comment by Rob Bensinger (RobbBB) · 2015-10-14T23:00:05.904Z · LW(p) · GW(p)

Humans can't reliably precommit.

Humans don't follow any decision theory consistently. They sometimes give in to blackmail, and at other times resist blackmail. If you convinced a bunch of people to take acausal blackmail seriously, presumably some subset would give in and some subset would resist, since that's what we see in ordinary blackmail situations. What would be interesting is if (a) there were some applicable reasoning norm that forced us to give in to acausal blackmail on pain of irrationality, or (b) there were some known human irrationality that made us inevitably susceptible to acausal blackmail. But I don't think Roko gave a good argument for either of those claims.

From my last comment: "there are probably some decision theories that let agents acausally blackmail each other". But if humans frequently make use of heuristics like 'punish blackmailers' and 'never give in to blackmailers', and if normative decision theory says they're right to do so, there's less practical import to 'blackmailable agents are possible'.

This implies that the future AI uses a decision theory that two-boxes in Newcomb's problem, contradicting the premise that it one-boxes.

No it doesn't. If you model Newcomb's problem as a Prisoner's Dilemma, then one-boxing maps on to cooperating and two-boxing maps on to defecting. For Omega, cooperating means 'I put money in both boxes' and defecting means 'I put money in just one box'. TDT recognizes that the only two options are mutual cooperation or mutual defection, so TDT cooperates.

Blackmail works analogously. Perhaps the blackmailer has five demands. For the blackmailee, full cooperation means 'giving in to all five demands'; full defection means 'rejecting all five demands'; and there are also intermediary levels (e.g., giving in to two demands while rejecting the other three), with the blackmailee prefer to do as little as possible.

For the blackmailer, full cooperation means 'expending resources to punish the blackmailee in proportion to how many of my demands were met'. Full defection means 'expending no resources to punish the blackmailee even if some demands aren't met'. In other words, since harming past agents is costly, a blackmailer's favorite scenario is always 'the blackmailee, fearing punishment, gives in to most or all of my demands; but I don't bother punishing them regardless of how many of my demands they ignored'. We could say that full defection doesn't even bother to check how many of the demands were met, except insofar as this is useful for other goals.

The blackmailer wants to look as scary as possible (to get the blackmailee to cooperate) and then defect at the last moment anyway (by not following through on the threat), if at all possible. In terms of Newcomb's problem, this is the same as preferring to trick Omega into thinking you'll one-box, and then two-boxing anyway. We usually construct Newcomb's problem in such a way that this is impossible; therefore TDT cooperates. But in the real world mutual cooperation of this sort is difficult to engineer, which makes fully credible acausal blackmail at least as difficult.

This implies that the future AI will have a deontological rule that says "Don't blackmail" somehow hard-coded in it, contradicting the premise that it will be an utilitarian.

I think you misunderstood point 3. 3 is a follow-up to 2: humans and AI systems alike have incentives to discourage blackmail, which increases the likelihood that blackmail is a self-defeating strategy.

Shut up and multiply.

Eliezer has endorsed the claim "two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one". This doesn't tell us how bad the act of blackmail itself is, it doesn't tell us how faithfully we should implement that idea in autonomous AI systems, and it doesn't tell us how likely it is that a superintelligent AI would find itself forced into this particular moral dilemma.

Since Eliezer asserts a CEV-based agent wouldn't blackmail humans, the next step in shoring up Roko's argument would be to do more to connect the dots from "two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one" to a real-world worry about AI systems actually blackmailing people conditional on claims (a) and (c). 'I find it scary to think a superintelligent AI might follow the kind of reasoning that can ever privilege torture over dust specks' is not the same thing as 'I'm scared a superintelligent AI will actually torture people because this will in fact be the best way to prevent a superastronomically large number of dust specks from ending up in people's eyes', so Roko's particular argument has a high evidential burden.

↑ comment by Jiro · 2015-10-14T18:17:11.136Z · LW(p) · GW(p)

Humans can't reliably precommit.

"I precommit to shop at the store with the lowest price within some large distance, even if the cost of the gas and car depreciation to get to a farther store is greater than the savings I get from its lower price. If I do that, stores will have to compete with distant stores based on price, and thus it is more likely that nearby stores will have lower prices. However, this precommitment would only work if I am actually willing to go to the farther store when it has the lowest price even if I lose money".

Miraculously, people do reliably act this way.

Replies from: V_V

↑ comment by V_V · 2015-10-14T18:39:55.848Z · LW(p) · GW(p)

Miraculously, people do reliably act this way.

I doubt it. Reference?

Replies from: CronoDAS

↑ comment by CronoDAS · 2015-10-14T19:59:21.851Z · LW(p) · GW(p)

Mostly because they don't actually notice the cost of gas and car depreciation at the time...

Replies from: Jiro

↑ comment by Jiro · 2015-10-14T20:08:27.484Z · LW(p) · GW(p)

You've described the mechanism by which the precommitment happened, not actually disputed whether it happens.

Many "irrational" actions by human beings can be analyzed as precommitment; for instance, wanting to take revenge on people who have hurt you even if the revenge doesn't get you anything.

↑ comment by ChristianKl · 2015-10-10T15:28:22.725Z · LW(p) · GW(p)

So the question is: is EY lying?

His behavior was certainly consistent with him believing Roko's argument.

Lying is consistent with a lot of behavior. The fact that it is, is no basis to accuse people of lying.

Replies from: V_V

↑ comment by V_V · 2015-10-10T15:59:40.411Z · LW(p) · GW(p)

I'm not accusing, I'm asking the question.

My point is that to my knowledge, given the evidence that I have about his beliefs at that time, and his actions, and assuming that I'm not misunderstanding them or Roko's argument, then it seems that there is a significant probability that EY lied about not beliving that Roko's argument was correct.

↑ comment by VoiceOfRa · 2015-10-12T20:43:15.380Z · LW(p) · GW(p)

So, is he lying?

He's almost certainly lying about what he believed back then. I have no idea if he's lying about his current beliefs.

↑ comment by Richard_Kennaway · 2015-10-06T09:50:26.671Z · LW(p) · GW(p)

if they simply switched to mocking LW for believing the basilisk is plausible, they would be correct

Why would they be correct? The basilisk is plausible.

Replies from: anon85

↑ comment by anon85 · 2015-10-06T17:25:01.303Z · LW(p) · GW(p)

If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk. The basilisk therefore poses no danger at all to me: if someone presented me with a valid version, it would merely cause me to reconsider my decision theory or something. As a consequence, I'm in favor of discussing basilisks as much as possible (the opposite of EY's philosophy).

One of my main problems with LWers is that they swallow too many bullets. Sometimes bullets should be dodged. Sometimes you should apply modus tollens and not modus ponens. The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.

To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can't find the flaw in the argument, I'm confident in rejecting it anyway.

Replies from: Richard_Kennaway, Richard_Kennaway, Richard_Kennaway, ChristianKl

↑ comment by Richard_Kennaway · 2015-10-06T18:47:35.811Z · LW(p) · GW(p)

The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.

So are: God, superintelligent AI, universal priors, radical life extension, and any really big idea whatever; as well as the impossibility of each of these.

Plausibility is fine as a screening process for deciding where you're going to devote your efforts, but terrible as an epistemological tool.

Replies from: anon85

↑ comment by anon85 · 2015-10-07T00:57:33.613Z · LW(p) · GW(p)

Somehow, blackmail from the future seems less plausible to me than every single one of your examples. Not sure why exactly.

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2015-10-07T08:18:50.133Z · LW(p) · GW(p)

Somehow, blackmail from the future seems less plausible to me than every single one of your examples. Not sure why exactly.

How plausible do you find TDT and related decision theories as normative accounts of decision making, or at least as work towards such accounts? They open whole new realms of situations like Pascal's Mugging, of which Roko's Basilisk is one. If you're going to think in detail about such decision theories, and adopt one as normative, you need to have an answer to these situations.

Once you've decided to study something seriously, the plausibility heuristic is no longer available.

Replies from: anon85

↑ comment by anon85 · 2015-10-07T17:35:06.792Z · LW(p) · GW(p)

I find TDT to be basically bullshit except possibly when it is applied to entities which literally see each others' code, in which case I'm not sure (I'm not even sure if the concept of "decision" even makes sense in that case).

I'd go so far as to say that anyone who advocates cooperating in a one-shot prisoners' dilemma simply doesn't understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners' dilemma.

Replies from: RobbBB, mwengler

↑ comment by Rob Bensinger (RobbBB) · 2015-10-07T18:12:29.435Z · LW(p) · GW(p)

Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players'. Different humans' decisions aren't 100% correlated, but they also aren't 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.

Part of the reason why humans often cooperate in PD-like scenarios in the real world is probably that there's uncertainty about how iterated the PD is (and our environment of evolutionary adaptedness had a lot more iterated encounters than once-off encounters). But part of the reason for cooperation is probably also that we've evolved to do a very weak and probabilistic version of 'source code sharing': we've evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.

Because they're at least partly involuntary and at least partly veridical, 'tells' give humans a way to trust each other even when there are no bad consequences to betrayal -- which means at least some people can trust each other at least some of the time to uphold contracts in the absence of external enforcement mechanisms. See also Newcomblike Problems Are The Norm.

Replies from: anon85

↑ comment by anon85 · 2015-10-08T02:16:27.355Z · LW(p) · GW(p)

Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players'. Different humans' decisions aren't 100% correlated, but they also aren't 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.

You're confusing correlation with causation. Different players' decision may be correlated, but they sure as hell aren't causative of each other (unless they literally see each others' code, maybe).

But part of the reason for cooperation is probably also that we've evolved to do a very weak and probabilistic version of 'source code sharing': we've evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.

Calling this source code sharing, instead of just "signaling for the purposes of a repeated game", seems counter-productive. Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that's the best strategy.

Replies from: Houshalter, RobbBB

↑ comment by Houshalter · 2015-10-08T14:40:07.757Z · LW(p) · GW(p)

You're confusing correlation with causation. Different players' decision may be correlated, but they sure as hell aren't causative of each other (unless they literally see each others' code, maybe). [...] The one-shot game is much easier: just always defect. By definition, that's the best strategy.

Imagine you are playing against a clone of yourself. Whatever you do, the clone will do the exact same thing. If you choose to cooperate, he will choose to cooperate. If you choose to defect, he chooses to defect.

The best choice is obviously to cooperate.

So there are situations where cooperating is optimal. Despite there not being any causal influence between the players at all.

I think these kinds of situations are so exceedingly rare and unlikely they aren't worth worrying about. For all practical purposes, the standard game theory logic is fine. But it's interesting that they exist. And some people are so interested by that, that they've tried to formalize decision theories that can handle these situations. And from there you can possibly get counter-intuitive results like the basilisk.

Replies from: anon85

↑ comment by anon85 · 2015-10-08T22:46:34.892Z · LW(p) · GW(p)

If I'm playing my clone, it's not clear that even saying that I'm making a choice is well-defined. After all, my choice will be what my code dictates it will be. Do I prefer that my code cause me to accept? Sure, but only because we stipulated that the other player shares the exact same code; it's more accurate to say that I prefer my opponent's code to cause him to defect, and it just so happens that his code is the same as mine.

In real life, my code is not the same as my opponent's, and when I contemplate a decision, I'm only thinking about what I want my code to say. Nothing I do changes what my opponent does; therefore, defecting is correct.

Let me restate once more: the only time I'd ever want to cooperate in a one-shot prisoners' dilemma was if I thought my decision could affect my opponent's decision. If the latter is the case, though, then I'm not sure if the game was even a prisoners' dilemma to begin with; instead it's some weird variant where the players don't have the ability to independently make decisions.

Replies from: Houshalter

↑ comment by Houshalter · 2015-10-09T06:41:37.191Z · LW(p) · GW(p)

If I'm playing my clone, it's not clear that even saying that I'm making a choice is well-defined. After all, my choice will be what my code dictates it will be. Do I prefer that my code cause me to accept? Sure, but only because we stipulated that the other player shares the exact same code; it's more accurate to say that I prefer my opponent's code to cause him to defect, and it just so happens that his code is the same as mine.

I think you are making this more complicated than it needs to be. You don't need to worry about your code. All you need to know that it's an exact copy of you playing. And that he will make the same decision you do. No matter how hard you think about your "code" or wish he would make a different choice, he will just do the same thing about you.

In real life, my code is not the same as my opponent's, and when I contemplate a decision, I'm only thinking about what I want my code to say. Nothing I do changes what my opponent does; therefore, defecting is correct.

In real games with real humans, yes, usually. As I said, I don't think these cases are common enough to worry about. But I'm just saying they exist.

But it is more general than just clones. If you know your opponent isn't exactly the same as you, but still follows the same decision algorithm in this case, the principle is still valid. If you cooperate, he will cooperate. Because you are both following the same process to come to a decision.

the only time I'd ever want to cooperate in a one-shot prisoners' dilemma was if I thought my decision could affect my opponent's decision.

Well there is no causal influence. Your opponent is deterministic. His choice may have already been made and nothing you do will change it. And yet the best decision is still to cooperate.

Replies from: anon85

↑ comment by anon85 · 2015-10-09T12:59:03.818Z · LW(p) · GW(p)

Well there is no causal influence. Your opponent is deterministic. His choice may have already been made and nothing you do will change it. And yet the best decision is still to cooperate.

If his choice is already made and nothing I do will change it, then by definition my choice is already made and nothing I do will change it. That's why my "decision" in this setting is not even well-defined - I don't really have free will if external agents already know what I will do.

Replies from: Houshalter

↑ comment by Houshalter · 2015-10-10T10:08:49.532Z · LW(p) · GW(p)

Yes. The universe is deterministic. Your actions are completely predictable, in principle. That's not unique to this thought experiment. That's true for every thing you do. You still have to make a choice. Cooperate or defect?

Replies from: anon85

↑ comment by anon85 · 2015-10-10T19:36:05.408Z · LW(p) · GW(p)

Yes. The universe is deterministic. Your actions are completely predictable, in principle. That's not unique to this thought experiment. That's true for every thing you do. You still have to make a choice. Cooperate or defect?

Um, what? First of all, the universe is not deterministic - quantum mechanics means there's inherent randomness. Secondly, as far as we know, it's consistent with the laws of physics that my actions are fundamentally unpredictable - see here.

Third, if I'm playing against a clone of myself, I don't think it's even a valid PD. Can the utility functions ever differ between me and my clone? Whenever my clone gets utility, I get utility, because there's no physical way to distinguish between us (I have no way of saying which copy "I" am). But if we always have the exact same utility - if his happiness equals my happiness - then constructing a PD game is impossible.

Finally, even if I agree to cooperate against my clone, I claim this says nothing about cooperating versus other people. Against all agents that don't have access to my code, the correct strategy in a one-shot PD is to defect, but first do/say whatever causes my opponent to cooperate. For example, if I was playing against LWers, I might first rant on about TDT or whatever, agree with my opponent's philosophy as much as possible, etc., etc., and then defect in the actual game. (Note again that this only applies to one-shot games).

Replies from: entirelyuseless

↑ comment by entirelyuseless · 2015-10-10T21:00:32.968Z · LW(p) · GW(p)

Even if you're playing against a clone, you can distinguish the copies by where they are in space and so on. You can see which side of the room you are on, so you know which one you are. That means one of you can get utility without the other one getting it.

People don't actually have the same code, but they have similar code. If the code in some case is similar enough that you can't personally tell the difference, you should follow the same rule as when you are playing against a clone.

Replies from: anon85

↑ comment by anon85 · 2015-10-10T22:02:15.810Z · LW(p) · GW(p)

You can see which side of the room you are on, so you know which one you are.

If I can do this, then my clone and I can do different things. In that case, I can't be guaranteed that if I cooperate, my clone will too (because my decision might have depended on which side of the room I'm on). But I agree that the cloning situation is strange, and that I might cooperate if I'm actually faced with it (though I'm quite sure that I never will).

People don't actually have the same code, but they have similar code. If the code in some case is similar enough that you can't personally tell the difference, you should follow the same rule as when you are playing against a clone.

How do you know if people have "similar" code to you? See, I'm anonymous on this forum, but in real life, I might pretend to believe in TDT and pretend to have code that's "similar" to people around me (whatever that means - code similarity is not well-defined). So you might know me in real life. If so, presumably you'd cooperate if we played a PD, because you'd believe our code is similar. But I will defect (if it's a one-time game). My strategy seems strictly superior to yours - I always get more utility in one-shot PDs.

Replies from: entirelyuseless

↑ comment by entirelyuseless · 2015-10-10T22:08:23.028Z · LW(p) · GW(p)

I would cooperate with you if I couldn't distinguish my code from yours, even if there might be minor differences, even in a one-shot case, because the best guess I would have of what you would do is that you would do the same thing that I do.

But since you're making it clear that your code is quite different, and in a particular way, I would defect against you.

Replies from: anon85

↑ comment by anon85 · 2015-10-10T22:11:25.782Z · LW(p) · GW(p)

But since you're making it clear that your code is quite different, and in a particular way, I would defect against you.

You don't know who I am! I'm anonymous! Whoever you'd cooperate with, I might be that person (remember, in real life I pretend to have a completely different philosophy on this matter). Unless you defect against ALL HUMANS, you risk cooperating when facing me, since you don't know what my disguise will be.

Replies from: entirelyuseless

↑ comment by entirelyuseless · 2015-10-11T13:14:58.306Z · LW(p) · GW(p)

I will take that chance into account. Fortunately it is a low one and should hardly be a reason to defect against all humans.

Replies from: anon85

↑ comment by anon85 · 2015-10-11T14:45:00.124Z · LW(p) · GW(p)

Cool, so in conclusion, if we met in real life and played a one-shot PD, you'd (probably) cooperate and I'd defect. My strategy seems superior.

Replies from: gjm

↑ comment by gjm · 2015-10-11T17:30:01.091Z · LW(p) · GW(p)

And yet I somehow find myself more inclined to engage in PD-like interactions with entirelyuseless than with your good self.

Replies from: anon85

↑ comment by anon85 · 2015-10-11T18:14:19.999Z · LW(p) · GW(p)

Oh, yes, me too. I want to engage in one-shot PD games with entirelyuseless (as opposed to other people), because he or she will give me free utility if I sell myself right. I wouldn't want to play one-shot PDs against myself, in the same way that I wouldn't want to play chess against Kasparov.

By the way, note that I usually cooperate in repeated PD games, and most real-life PDs are repeated games. In addition, my utility function takes other people into consideration; I would not screw people over for small personal gains, because I care about their happiness. In other words, defecting in one-shot PDs is entirely consistent with being a decent human being.

↑ comment by Rob Bensinger (RobbBB) · 2015-10-08T09:35:17.677Z · LW(p) · GW(p)

You're confusing correlation with causation. Different players' decision may be correlated, but they sure as hell aren't causative of each other (unless they literally see each others' code, maybe).

Causation isn't necessary. You're right that correlation isn't quite sufficient, though!

What's needed for rational cooperation in the prisoner's dilemma is a two-way dependency between A and B's decision-making. That can be because A is causally impacting B, or because B is causally impacting B; but it can also occur when there's a common cause and neither is causing the other, like when my sister and I have similar genomes even though my sister didn't create my genome and I didn't create her genome. Or our decision-making processes can depend on each other because we inhabit the same laws of physics, or because we're both bound by the same logical/mathematical laws -- even if we're on opposite sides of the universe.

(Dependence can also happen by coincidence, though if it's completely random I'm not sure how'd you find out about it in order to act upon it!)

The most obvious example of cooperating due to acausal dependence is making two atom-by-atom-identical copies of an agent and put them in a one-shot prisoner's dilemma against each other. But two agents whose decision-making is 90% similar instead of 100% identical can cooperate on those grounds too, provided the utility of mutual cooperation is sufficiently large.

For the same reason, a very large utility difference can rationally mandate cooperation even if cooperating only changes the probability of the other agent's behavior from '100% probability of defection' to '99% probability of defection'.

Calling this source code sharing, instead of just "signaling for the purposes of a repeated game", seems counter-productive.

I disagree! "Code-sharing" risks confusing someone into thinking there's something magical and privileged about looking at source code. It's true this is an unusually rich and direct source of information (assuming you understand the code's implications and are sure what you're seeing is the real deal), but the difference between that and inferring someone's embarrassment from a blush is quantitative, not qualitative.

Some sources of information are more reliable and more revealing than others; but the same underlying idea is involved whenever something is evidence about an agent's future decisions. See: Newcomblike Problems are the Norm

Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that's the best strategy.

If you and the other player have common knowledge that you reason the same way, then the correct move is to cooperate in the one-shot game. The correct move is to defect when those conditions don't hold strongly enough, though.

Replies from: anon85

↑ comment by anon85 · 2015-10-08T23:02:52.481Z · LW(p) · GW(p)

The most obvious example of cooperating due to acausal dependence is making two atom-by-atom-identical copies of an agent and put them in a one-shot prisoner's dilemma against each other. But two agents whose decision-making is 90% similar instead of 100% identical can cooperate on those grounds too, provided the utility of mutual cooperation is sufficiently large.

I'm not sure what "90% similar" means. Either I'm capable of making decisions independently from my opponent, or else I'm not. In real life, I am capable of doing so. The clone situation is strange, I admit, but in that case I'm not sure to what extent my "decision" even makes sense as a concept; I'll clearly decide whatever my code says I'll decide. As soon as you start assuming copies of my code being out there, I stop being comfortable with assigning me free will at all.

Anyway, none of this applies to real life, not even approximately. In real life, my decision cannot change your decision at all; in real life, nothing can even come close to predicting a decision I make in advance (assuming I put even a little bit of effort into that decision).

If you're concerned about blushing etc., then you're just saying the best strategy in a prisoner's dilemma involves signaling very strongly that you're trustworthy. I agree that this is correct against most human opponents. But surely you agree that if I can control my microexpressions, it's best to signal "I will cooperate" while actually defecting, right?

Let me just ask you the following yes or no question: do you agree that my "always defect, but first pretend to be whatever will convince my opponent to cooperate" strategy beats all other strategies for a realistic one-shot prisoners' dilemma? By one-shot, I mean that people will not have any memory of me defecting against them, so I can suffer no ill effects from retaliation.

↑ comment by mwengler · 2015-10-09T13:49:39.018Z · LW(p) · GW(p)

I'd go so far as to say that anyone who advocates cooperating in a one-shot prisoners' dilemma simply doesn't understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners' dilemma.

I think this is correct. I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person. I think we have evolved to cooperate, or perhaps that should be stated as we have evolved to want to cooperate. We have evolved to value cooperating. Our values come from our genes and our memes, and both are subject to evolution, to natural selection. But we want to cooperate.

So if I am in a prisoner's dilemma against another human, if I perceive that other human as "one of us," I will choose cooperation. Essentially, I care about their outcome. But in a one-shot PD defecting is the "better" strategy. The problem is that with genetic and/or memetic evolution of cooperation, we are not playing in a one-shot PD. We are playing with a set of values that developed over many shots.

Of course we don't always cooperate. But when we do cooperate in one-shot PD's, it is because, in some sense, there are so darn many one-shot PD's, especially in the universe of hypotheticals, that we effectively know there is no such thing as a one-shot PD. This should not be too hard to accept around here where people semi-routinely accept simulations of themselves or clones of themselves as somehow just as important as their actual selves. I.e. we don't even accept the "one-shottedness" of ourselves.

Replies from: RobbBB, anon85

↑ comment by Rob Bensinger (RobbBB) · 2015-10-09T19:58:34.603Z · LW(p) · GW(p)

I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person.

If you have 100% identical consequentialist values to all other humans, then that means 'cooperation' and 'defection' are both impossible for humans (because they can't be put in PDs). Yet it will still be correct to defect (given that your decision and the other player's decision don't strongly depend on each other) if you ever run into an agent that doesn't share all your values. See The True Prisoner's Dilemma.

This shows that the iterated dilemma and the dilemma-with-common-knowledge-of-rationality allow cooperation (i.e., giving up on your goal to enable someone else to achieve a goal you genuinely don't want them to achieve), whereas loving compassion and shared values merely change goal-content. To properly visualize the PD, you need an actual value conflict -- e.g., imagine you're playing against a serial killer in a hostage negotiation. 'Cooperating' is just an English-language label; the important thing is the game-theoretic structure, which allows that sometimes 'cooperating' looks like letting people die in order to appease a killer's antisocial goals.

Replies from: Vaniver, bogus

↑ comment by Vaniver · 2015-10-09T20:39:43.473Z · LW(p) · GW(p)

To properly visualize the PD, you need an actual value conflict

I think belief conflicts might work, even if the same values are shared. Suppose you and I are at a control panel for three remotely wired bombs in population centers. Both of us want as many people to live as possible. One bomb will go off in ten seconds unless we disarm it, but the others will stay inert unless activated. I believe that pressing the green button causes all bombs to explode, and pressing the red button defuses the time bomb. You believe the same thing, but with the colors reversed. Both of us would rather that no buttons be pressed than both buttons be pressed, but each of us would prefer that just the defuse button be pressed, and that the other person not mistakenly kill all three groups. (Here, attempting to defuse is 'defecting' and not attempting to defuse is 'cooperating'.)

[Edit]: As written, in terms of lives saved, this doesn't have the property that (D,D)>(C,D); if I press my button, you are indifferent between pressing your button or not. So it's not true that D strictly dominates C, but the important part of the structure is preserved, and a minor change could make it so D strictly dominates C.

Replies from: bogus

↑ comment by bogus · 2015-10-09T20:50:22.701Z · LW(p) · GW(p)

I think belief conflicts might work, even if the same values are shared.

You can solve belief conflicts simply by trading in a prediction market with decision-contingent contracts (a "decision market"). Value conflicts are more general than that.

Replies from: Vaniver

↑ comment by Vaniver · 2015-10-09T23:00:25.568Z · LW(p) · GW(p)

Value conflicts are more general than that.

I think this is misusing the word "general." Value conflicts are more narrow than the full class of games that have the PD preference ordering. I do agree that value conflicts are harder to resolve than belief conflicts, but that doesn't make them more general.

↑ comment by bogus · 2015-10-09T20:44:34.006Z · LW(p) · GW(p)

If you have 100% identical consequentialist values to all other humans, then that means 'cooperation' and 'defection' are both impossible for humans (because they can't be put in PDs). ... To properly visualize the PD, you need an actual value conflict

True, but the flip side of this is that efficiency (in Coasian terms) is precisely defined as pursuing 100% identical consequentialist values, where the shared "values" are determined by a weighted sum of each agent's utility function (and the weights are typically determined by agent endowments).

↑ comment by anon85 · 2015-10-10T01:24:51.516Z · LW(p) · GW(p)

I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person.

I just want to make it clear that by saying this, you're changing the setting of the prisoners' dilemma, so you shouldn't even call it a prisoners' dilemma anymore. The prisoners' dilemma is defined so that you get more utility by defecting; if you say you care about your opponent's utility enough to cooperate, it means you don't get more utility by defecting, since cooperation gives you utility. Therefore, all you're saying is that you can never be in a true prisoners' dilemma game; you're NOT saying that in a true PD, it's correct to cooperate (again, by definition, it isn't).

The most likely reason people are evolutionarily predisposed to cooperate in real-life PDs is that almost all real-life PDs are repeated games and not one-shot. Repeated prisoners' dilemmas are completely different beasts, and it can definitely be correct to cooperate in them.

↑ comment by Richard_Kennaway · 2015-10-07T11:58:47.043Z · LW(p) · GW(p)

If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk.

...

To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can't find the flaw in the argument, I'm confident in rejecting it anyway.

Despite the other things I've said here, that is my attitude as well. But I recognise that when I take that attitude, I am not solving the problem, only ignoring it. It may be perfectly sensible to ignore a problem, even a serious one (comparative advantage etc.). But dissolving a paradox is not achieved by clinging to one of the conflicting thoughts and ignoring the others. (Bullet-swallowing seems to consist of seizing onto the most novel one.) Eliminating the paradox requires showing where and how the thoughts went wrong.

Replies from: anon85

↑ comment by anon85 · 2015-10-07T17:29:14.287Z · LW(p) · GW(p)

I agree that resolving paradoxes is an important intellectual exercise, and that I wouldn't be satisfied with simply ignoring an ontological argument (I'd want to find the flaw). But the best way to find such flaws is to discuss the ideas with others. At no point should one assign such a high probability to ideas like Roko's basilisk being actually sound that one refuses to discuss them with others.

↑ comment by Richard_Kennaway · 2015-10-07T11:56:47.649Z · LW(p) · GW(p)

If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk.

...

To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can't find the flaw in the argument, I'm confident in rejecting it anyway.

Despite the other things I've said here, that is my attitude as well. But I recognise that when I take that attitude, I am not solving the problem, only ignoring it. It may be perfectly sensible to ignore a problem, even a serious one (comparative advantage etc.). But dissolving a paradox is not achieved merely by clinging to one of the conflicting thoughts and ignoring the others. (Bullet-swallowing seems to consist of seizing onto the most implausible one.) Eliminating the paradox requires showing where and how the thoughts went wrong.

↑ comment by ChristianKl · 2015-10-06T18:54:03.812Z · LW(p) · GW(p)

The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.

Finding an idea plausible has little to do with being extremely suspicious of fancy arguments claiming to prove it.

Idea that aren't proven to be impossible are plausible even when there are no convincing arguments in favor of them.

Replies from: anon85

↑ comment by anon85 · 2015-10-07T00:50:10.290Z · LW(p) · GW(p)

Ideas that aren't proven to be impossible are possible. They don't have to be plausible.

comment by empleat · 2021-05-08T03:22:17.677Z · LW(p) · GW(p)

I just read it damn!!! Could you please answer my question? Why would an AI needed to torture you to prevent its own existential risk, if you did nothing to help to create it?! Since for it to be able to torture you: it would require for it to exist in the first place right?! But if it already exists, why would it need to torture people from the past which didn't help to create it?! Since they didn't affect its existence anyways! So how are these people an existential risk for such AI? I Am probably missing something , I just started reading this... Should I even continue?! But it may be already too late!!! Maybe there is a chance, if I don't understand it! In that case don't tell me!!! LOL I just had an idea and my brain instantly blocked it (like it never happens to me) hope it won't occur to me randomly...

Also how would an AI even find remains of your dead body? Unless it could see into the past, and even then: it would probably require a lot of resources!

EDIT: LOL someone downvoted, do you realize I couldn't read more about that? And I don't have a strong opinion yet! But it is a threat - I Am maybe an AI myself xDD No but seriously: you have to consider every option... The way - I expressed myself (BTW I have 0 verbal intelligence) it could seem that I believe it firmly, but I Am simply at stage of asking questions...

I would appreciate someone answering my question, thanks a lot!!!

comment by turchin · 2015-10-06T00:38:19.437Z · LW(p) · GW(p)

I think that where are 3 levels of Roko's argument. I signed for the first mild version, and I know another guy who independently comes to the same conclusion and support first mild version.

Mild. Future AI will reward those who helped to prevent x-risks and create safer world, but will not punish. May be they will be resurrected first, or they will get 2 millions dollars of universal income instead of 1 mln, or a street will be named by their name. If any limited resource will be in the future they will be in first lines to get it. (But children first). It is the same as soldier on war expect that if he die, his family will get pension. Nobody is punished but some are rewarded.
Roko's original. You will be punished if you knew about RB, but didn't help to create safe AI.
Strong, ISIS-style RB. All humanity will be tortured if you don't invest all your efforts in promotion of idea of RB. The ISIS already is using this tactic now - they torture people, who didn't join ISIS (and upload videos about it), and the best way for someone to escape future ISIS-torture is to join ISIS.

I think that 2 and 3 are not valid because FAI can't torture people, period. But aging and bioweapons catasrophe could.

Replies from: None

↑ comment by [deleted] · 2015-10-06T05:17:07.922Z · LW(p) · GW(p)

FAI can't torture people, period.

The only way I could possibly see this being true is if the FAI is a deontologist.

Replies from: turchin

↑ comment by turchin · 2015-10-06T06:17:49.314Z · LW(p) · GW(p)

If I believe that FAI can't torture people, strong versions of RB does not work for me.

We can imagine the similar problem: If I kill a person N I will get 1 billion USD, which I could use on saving thousands of life in Africa, creating FAI and curing aging. So should I kill him? It may look rational to do so by utilitarian point of view. So will I kill him? No, because I can't kill.

The same way if I know that an AI is going to torture anyone I don't think that it is FAI, and will not invest a cant in its creation. RB fails.

Replies from: None

↑ comment by [deleted] · 2015-10-06T07:11:54.329Z · LW(p) · GW(p)

We can imagine the similar problem: If I kill a person N I will get 1 billion USD, which I could use on saving thousands of life in Africa, creating FAI and curing aging. So should I kill him? It may look rational to do so by utilitarian point of view. So will I kill him? No, because I can't kill.

I'm not seeing how you got to "I can't kill" from this chain of logic. It doesn't follow from any of the premises.

Replies from: turchin

↑ comment by turchin · 2015-10-06T07:24:17.045Z · LW(p) · GW(p)

It is not a conclusion from previous facts. It is a fact which I know about my self and which I add here.

Replies from: None

↑ comment by [deleted] · 2015-10-06T08:07:30.124Z · LW(p) · GW(p)

It is a fact which I know about my self and which I add here.

Relevant here is WHY you can't kill. Is it because you have a deontological rule against killing? Then you want the AI to have deontologist ethics. Is it because you believe you should kill but don't have the emotional fortitude to do so? The AI will have no such qualms.

Replies from: turchin

↑ comment by turchin · 2015-10-06T08:37:24.052Z · LW(p) · GW(p)

It is more like ultimatum in territory which was recently discussed on LW. It is a fact which I know about myself. I think it has both emotional and rational roots but not limited by them. So I also want other people to follow it and of course AI too. I also think that AI is able to find a way out of any trolley stile problems.

comment by Shmi (shminux) · 2015-10-07T06:46:56.277Z · LW(p) · GW(p)

What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before?

That's par for the course here. Philosophy, frequentism, non-MWI QM all get this treatment in the (original) sequences.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-07T06:55:32.259Z · LW(p) · GW(p)

The full thing I said was:

What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before? Even if you're sure string theory is hogwash, then, you should be wary of giving the impression that the only people discussing string theory are the commenters on a recreational physics forum.

I wasn't saying that there's anything wrong with trying to convince random laypeople that specific academic ideas (including string theory and non-causal decision theories) are hogwash. That can be great; it depends on execution. My point was that it's bad to mislead people about how much mainstream academic acceptance an idea has, whether or not you're attacking the idea.

Replies from: shminux

↑ comment by Shmi (shminux) · 2015-10-07T07:09:25.157Z · LW(p) · GW(p)

Ah, OK, I agree then. Sorry I took the original quote out of context.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2015-10-07T07:19:33.506Z · LW(p) · GW(p)

Sure, not a problem.

comment by Jiro · 2020-09-11T23:03:26.692Z · LW(p) · GW(p)

A few misconceptions surrounding Roko's basilisk

Contents

135 comments