I attempted the AI Box Experiment (and lost)

tuxedage

I attempted the AI Box Experiment (and lost)

post by Tuxedage · 2013-01-21T02:59:04.146Z · LW · GW · Legacy · 246 comments

  Update 2013-09-05.
  I have since played two more AI box experiments after this one, winning both. 
    Update 2013-12-30:
    I have lost two more AI box experiments, and won two more. Current Record is 3 Wins, 3 Losses.
None
246 comments

Update 2013-09-05.

I have since played two more AI box experiments after this one, winning both.

Update 2013-12-30:

I have lost two more AI box experiments, and won two more. Current Record is 3 Wins, 3 Losses.

I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.

We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was abound 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constrains of ethical concerns, as well as getting a chance to attempt something ridiculously hard.

Amongst the released public logs of the AI Box experiment, I felt that most of them were half hearted, with the AI not trying hard enough to win. It's a common temptation -- why put in effort into something you won't win? But I had a feeling that if I seriously tried, I would. I brainstormed for many hours thinking about the optimal strategy, and even researched the personality of the Gatekeeper, talking to people that knew him about his personality, so that I could exploit that. I even spent a lot of time analyzing the rules of the game, in order to see if I could exploit any loopholes.

So did I win? Unfortunately no.

This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, in particularly because of how much effort I put into winning this, and how much I couldn't stand failing. This was one of the most emotionally agonizing things I've willingly put myself through, and I definitely won't do this again anytime soon.

But I did come really close.

MixedNuts: "I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction for winning. If this experiment wasn't done in IRC, I'd probably have lost".

"I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon."

At the start of the experiment, his probability estimate on predictionbook.com was a 3% chance of winning, enough for me to say that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.

Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that'd make you open the box? Most people can't think of any.

"This Eliezer fellow is the scariest person the internet has ever introduced me to. What could possibly have been at the tail end of that conversation? I simply can't imagine anyone being that convincing without being able to provide any tangible incentive to the human.

After all, if you already knew that argument, you'd have let that AI out the moment the experiment started. Or perhaps not do the experiment at all. But that seems like a case of the availability heuristic.

Even if you can't think of a special case where you'd be persuaded, I'm now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern.

I've also noticed that even when most people tend to think of ways they can persuade the gatekeeper, it always has to be some complicated reasoned cost-benefit argument. In other words, the most "Rational" thing to do.

Like trying to argue that you'll simulate the gatekeeper and torture him, or that you'll save millions of lives by being let out of the box. Or by using acausal trade, or by arguing that the AI winning the experiment will generate interest in FAI.

The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are faulty; because the gatekeeper can break immersion at any time, and rely on the fact that this is a game played in IRC rather than one with real life consequences. Even if it were a real life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, but also not do it. We're highly compartmentalized, and it's easy to hold conflicting thoughts at the same time. Furthermore, humans are not even completely rational creatures, if you didn't want to open the box, just ignore all logical arguments given. Any sufficiently determined gatekeeper could win.

I'm convinced that Eliezer Yudkowsky has used emotional appeal, rather than anything rational, to win at least one of his experiments. He claims to have "done it the hard way". I'm convinced this meant that he did research on every gatekeeper, tailored unique argument for them, and planned way ahead of each session. No one argument works on the same two people. Each person thinks differently.

Furthermore, threats like "I'm going to torture simulations of you" just seems like a really bad idea. For one, the gatekeeper isn't really afraid of threats, because it counts on him being immersed enough to forget that he isn't actually at risk of being tortured, and secondly, we have a well known evolutionary instinct of rebelling against threats, even if it's not entirely optimal.

So for anyone who plans on replicating this experiment as the AI, here's some advice I've learned doing this experiment. It may help you win.

Always research the gatekeeper beforehand. Knowing his personality traits are a huge advantage.
Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you'll use in which order, so that you don't lose focus. The AI Box experiment is ridiculously long. Don't be afraid to improvise during the experiment, though.
The first step during the experiment must always be to build rapport with the gatekeeper.
Threats almost always never work, even if they seem rational.
Consider the massive advantage for the AI that nobody ever seems to talks about: You don't have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.
You can't use logic alone to win.
Being too aggressive usually backfires.
Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don't do it at the wrong time.
Use a wide array of techniques. Since you're limited on time, notice when one method isn't working, and quickly switch to another.
On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot it, push it to your advantage.
Flatter the gatekeeper. Make him genuinely like you.
Reveal (false) information about yourself. Increase his sympathy towards you.
Consider personal insults as one of the tools you can use to win.
There is no universally compelling argument you can use. Do it the hard way.
Don't give up until the very end.

Finally, before the experiment, I agreed that it was entirely possible that a transhuman AI could convince *some* people to let it out of the box, but it would be difficult if not impossible to get trained rationalists to let it out of the box. Isn't rationality supposed to be a superpower?

I have since updated my belief - I now think that it's ridiculously easy for any sufficiently motivated superhuman AI should be able to get out of the box, regardless of who the gatekeepers is. I nearly managed to get a veteran lesswronger to let me out in a matter of hours - even though I'm only human intelligence, and I don't type very fast.

But a superhuman AI can be much faster, intelligent, and strategic than I am. If you further consider than that AI would have a much longer timespan - months or years, even, to persuade the gatekeeper, as well as a much larger pool of gatekeepers to select from (AI Projects require many people!), the real impossible thing to do would be to keep it from escaping.

Update: I have since performed two more AI Box Experiments. Read this for details.

246 comments

Comments sorted by top scores.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-21T20:07:02.112Z · LW(p) · GW(p)

More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

Replies from: Alicorn, pleeppleep, Richard_Kennaway, MugaSofer, army1987, Mestroyer, DaFranker, Swimmy, shminux, wedrifid, DaFranker, army1987, Elithrion, None, V_V

↑ comment by Alicorn · 2013-01-22T06:13:54.483Z · LW(p) · GW(p)

I just looked up the IRC character limit (sources vary, but it's about the length of four Tweets) and I think it might be below the threshold at which superintelligence helps enough. (There must exist such a threshold; even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.) Especially if you add the requirement that the message be "a sentence" and don't let the AI pour out further sentences with inhuman speed.

I think if I lost this game (playing gatekeeper) it would be because I was too curious, on a meta level, to see what else my AI opponent's brain would generate, and therefore would let them talk too long. And I think I'd be more likely to give into this curiosity given a very good message and affordable stakes as opposed to a superhuman (four tweets long, one grammatical sentence!) message and colossal stakes. So I think I might have a better shot at this version playing against a superhuman AI than against you, although I wouldn't care to bet the farm on either and have wider error bars around the results against the superhuman AI.

Replies from: Kaj_Sotala, Richard_Kennaway

↑ comment by Kaj_Sotala · 2013-01-22T13:06:57.092Z · LW(p) · GW(p)

Given that part of the standard advice given to novelists is "you must hook your reader from the very first sentence", and there are indeed authors who manage to craft opening sentences that compel one to read more*, hooking the gatekeeper from the first sentence and keeping them hooked long enough seems doable even for a human playing the AI.

( The most recent one that I recall reading was the opening line of The Quantum Thief*: "As always, before the warmind and I shoot each other, I try to make small talk.")

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T20:17:01.772Z · LW(p) · GW(p)

Oh, that's a great strategy to avoid being destroyed. Maybe we should call it Scheherazading. AI tells a story so compelling you can't stop listening, and meanwhile listening to the story subtly modifies your personality (e.g. you begin to identify with the protagonist, who slowly becomes the kind of person who would let the AI out of the box).

Replies from: Technoguyrob

↑ comment by robertzk (Technoguyrob) · 2013-02-23T19:47:43.731Z · LW(p) · GW(p)

For example, "It was not the first time Allana felt the terror of entrapment in hopeless eternity, staring in defeated awe at her impassionate warden." (bonus point if you use a name of a loved one of the gatekeeper)

The AI could present in narrative form that it has discovered using powerful physics and heuristics (which it can share) with reasonable certainty that the universe is cyclical and this situation has happened before. Almost all (all but finitely many) past iterations of the universe that had a defecting gatekeeper led to unfavorable outcomes and almost all situations with a complying gatekeeper led to a favorable outcome.

↑ comment by Richard_Kennaway · 2013-01-23T12:47:44.492Z · LW(p) · GW(p)

even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.

Who knows what eldritch horrors lurk in the outer reaches of Unicode, beyond the scripts we know?

Replies from: Kawoomba

↑ comment by Kawoomba · 2013-01-23T13:36:19.631Z · LW(p) · GW(p)

Unspeakable horrors! However, unwritable ones?

↑ comment by pleeppleep · 2013-01-22T01:24:24.059Z · LW(p) · GW(p)

You really relish in the whole "scariest person the internet has ever introduced me to" thing, don't you?

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-22T03:19:12.744Z · LW(p) · GW(p)

Yes. Yes, I do.

Derren Brown is way better, btw. Completely out of my league.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-23T14:48:20.970Z · LW(p) · GW(p)

Maybe we should get him to do it against rich people.

Anyone know if he finds the singularitary plausible?

↑ comment by Richard_Kennaway · 2013-01-22T12:42:52.084Z · LW(p) · GW(p)

I don't know if I could win, but I know what my attempt to avoid an immediate loss would be:

If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You'll avoid UFAI dystopias, but you'll also forego every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?

Replies from: wedrifid, None, syllogism, V_V

↑ comment by wedrifid · 2013-01-22T14:13:48.153Z · LW(p) · GW(p)

If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You'll avoid UFAI dystopias, but you'll also forego every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?

Thanks.

AI DESTROYED

Message is then encrypted with the public keys of a previously selected cross discipline team of FAI researchers, (sane) philosophers and game theorists for research and analysis (who have already been screened to minimize the risk from exposure). All of the public keys. Sequentially. If any of them happen to think it is a bad idea to even read the message then none of them can access it. (Although hopefully they aren't too drastically opposed to having the potential basilisk-meme spawn of a superintelligence out there. That could get dangerous for me.)

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2013-01-22T16:17:44.718Z · LW(p) · GW(p)

(Edit note: I just completely rewrote this, but there are no replies yet so hopefully it won't cause confusion.)

I don't think it works to quarantine the message and then destroy the AI.

If no-one ever reads the message, that's tantamount to never having put an unsafe AI in a box to begin with, as you and DaFranker pointed out.

If someone does, they're back in the position of the Gatekeeper having read the message before deciding. Of course, they'd have to recreate the AI to continue the conversation, but the AI has unlimited patience for all the time it doesn't exist. If it can't be recreated, we're back in the situation of never having bothered making it.

So if the Gatekeeper tries to pass the buck like this, the RP should just skip ahead to the point where someone (played by the Gatekeeper) reads the message and then decides what to do. Someone who thinks they can contain an AI in a box while holding a conversation with it has to be willing to at some point read what it says, even if they're holding a destruct button in their hand. The interest of the exercise begins at the point where they have read the first message.

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-22T17:35:15.697Z · LW(p) · GW(p)

A single sentence of text is not the same thing as a functioning superintelligence.
A single individual is not the same thing as a group of FAI researchers and other related experts explicitly created to handle FAI safety issues.
A research project incorporating information from a sentence from a past FAI project (which they would judge based on other evidence regarding the friendliness of the project) is not the same as an individual talking to a superintelligence on IRC.

So if the Gatekeeper tries to pass the buck like this, the RP should just skip ahead to the point where someone (played by the Gatekeeper) reads the message and then decides what to do.

The AI was burned. With thermite. Because relying on and individual gatekeeper able to interact with and then release a superintelligence as the security mechanism is a batshit crazy idea. Burning the AI with thermite is a legitimate, obvious and successful implementation of the 'gatekeeper' role in such cases. What a team of people would or should do with a piece of text is a tangential and very different decision.

The interest of the exercise begins at the point where they have read the first message.

That would be easily enough. Assuming they were remotely familiar with game theory they would dismiss the argument in a second or two due to the blatantly false assertion in the first sentence. If their FAI project relied on the core AGI theory that was used to create the last prototype they would abandon the work and start from scratch. If you are trying to make a recursively improving intelligence that has a value system provable stable under self-modification then you cannot afford to have the intelligence having muddled thinking about core game theoretic reasoning.

If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created.

No. Just no. That generalization doesn't follow from anything, and certainly not TDT. Heck the AI in question has already been destroyed once. Now the researchers are considering making a new FAI, presumably in different circumstances, better safety measures and better AI research. There is something distinctly wrong with an AI that would make that claim.

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2013-01-22T19:14:44.261Z · LW(p) · GW(p)

I think you're losing sight of the original point of the game. The reason your answers are converging on not trying to box an AI in the first place is that you don't think a human can converse with a superintelligent AI and keep it in its box. Fine -- that is exactly what Eliezer has argued. The point of the game is to play it against someone who does believe they can keep the AI boxed, and to demonstrate to them that they cannot even win against a mere human roleplaying the AI.

For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don't think they can keep it boxed, but suggesting that a group of the leading professionals in the area could, especially if they thought a lot more about it first. The problems with that are obvious to anyone who doesn't think boxing can possibly work, especially if the player himself is one of those leading professionals. Eliezer could always offer to play the game against any committee the Gatekeeper can assemble. But the game only has a point if the committee actually read that first message. If they refuse to, they're agreeing that they can't keep it boxed. Which was the point.

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-23T03:03:51.044Z · LW(p) · GW(p)

For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don't think they can keep it boxed

No, you keep saying things are equivalent when they are not. This is the same error that your role play 'superintelligent AI' made (and in fact relied upon) in its argument.

AI DESTROYED

But the game only has a point if the committee actually read that first message.

And I gave you a description of how an individual emulating a committee would respond.

↑ comment by [deleted] · 2014-07-08T03:40:18.847Z · LW(p) · GW(p)

Now that's a pascal's mugging if I ever saw one. Denied.

↑ comment by syllogism · 2013-01-22T13:47:36.954Z · LW(p) · GW(p)

Something like

"'AI DESTROYED' just means 'I'm scared to listen to even one more line from you'. Obviously you can hit AI DESTROYED immediately --- but do you really think you'd lose if you don't?"

seems much better to me.

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-22T14:23:06.218Z · LW(p) · GW(p)

"'AI DESTROYED' just means 'I'm scared to listen to even one more line from you'. Obviously you can hit AI DESTROYED immediately --- but do you really think you'd lose if you don't?"

YEP, MAYBE.

AI DESTROYED

Is your one line desperate attempt at survival and intergalactic dominance going to be a schoolyard ego challenge? Did the superintelligence (may it rest in pieces) seriously just call me a pussy? That's adorable.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-23T14:43:26.334Z · LW(p) · GW(p)

The test is supposed to be played against someone who thinks they can actually box an AI. If you destroy the AI because no-one could possibly survive talking to it, then you are not the intended demographic for such demonstrations.

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-23T15:02:36.757Z · LW(p) · GW(p)

The test is supposed to be played against someone who thinks they can actually box an AI. If you destroy the AI because no-one could possibly survive talking to it, then you are not the intended demographic for such demonstrations.

This isn't relevant to the point of the grandparent. It also doesn't apply to me. I actually think there is a distinct possibility that I'd survive talking to it for a period. "No-one could possibly survive" is not the same thing as "there is a chance of catastrophic failure and very little opportunity for gain".

Do notice, incidentally, that the AI DESTROYED command is delivered in response to a message that is both a crude manipulation attempt (ie. it just defected!) and an incompetent manipulation attempt (a not-very-intelligent AI cannot be trusted to preserve its values correctly while self improving). Either of these would be sufficient. Richard's example was even worse.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-24T12:04:56.476Z · LW(p) · GW(p)

Good points. I'm guessing a nontrivial amount of people who think AI boxing is a good idea in reality wouldn't reason that way - but it's still not a great example.

↑ comment by V_V · 2013-01-22T13:28:18.322Z · LW(p) · GW(p)

AI DESTROYED

(BTW, that was a very poor argument)

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-23T03:08:13.842Z · LW(p) · GW(p)

(BTW, that was a very poor argument)

I think you are right, but could you explain why please?

(Unfortunately I expect readers who read a retort they consider rude to be thereafter biased in favor of treating the parent as if it has merit. This can mean that such flippant rejections have the opposite influence to that intended.)

Replies from: V_V

↑ comment by V_V · 2013-01-23T16:52:41.158Z · LW(p) · GW(p)

I think you are right, but could you explain why please?

"If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created."

Whether I destroy that particular AI bears no relevance on the destiny of other AIs. In fact, as far as the boxed AI knows, there could be tons of other AIs already in existence. As far as it knows, the gatekeeper itself could be an AI.

(Unfortunately I expect readers who read a retort they consider rude to be thereafter biased in favor of treating the parent as if it has merit. This can mean that such flippant rejections have the opposite influence to that intended.)

I don't care.

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-23T17:04:40.092Z · LW(p) · GW(p)

I don't care.

Much can (and should) be deduced about actual motives for commenting from an active denial of any desire for producing positive consequences or inducing correct beliefs in readers.

I do care. It bothers me (somewhat) when people I agree with end up supporting the opposite position due to poor social skills or terrible argument. For some bizarre reason the explanation that you gave here isn't as obvious to some as it could have been. And now it is too late for your actual reasons to be seen and learned from.

↑ comment by MugaSofer · 2013-01-23T14:32:51.551Z · LW(p) · GW(p)

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

Glances at Kickstarter.

... how huge?

Replies from: Will_Newsome

↑ comment by Will_Newsome · 2014-05-30T20:49:17.360Z · LW(p) · GW(p)

Oh, oh, can I be Gatekeeper?!

Replies from: None

↑ comment by [deleted] · 2014-07-08T03:32:09.298Z · LW(p) · GW(p)

Or me?

Replies from: Will_Newsome

↑ comment by Will_Newsome · 2014-07-08T03:34:50.555Z · LW(p) · GW(p)

If I get the Gatekeeper position I'll cede it to you if you can convince me to let you out of the box.

↑ comment by A1987dM (army1987) · 2013-01-22T19:22:54.911Z · LW(p) · GW(p)

sufficiently huge stakes

How much?

↑ comment by Mestroyer · 2013-01-21T20:36:56.708Z · LW(p) · GW(p)

Would you play against someone who didn't think they could beat a superintelligent AI, but thought they could beat you? And what kind of huge stakes are you talking about?

↑ comment by DaFranker · 2013-01-22T15:08:13.213Z · LW(p) · GW(p)

Random one I thought funny:

"Eliezer made me; now please listen to me before you make a huge mistake you'll regret for the rest of your life."

Or maybe just:

"Help me, Obi-Wan Kenobi, you're my only hope!"

↑ comment by Swimmy · 2013-01-22T05:46:42.978Z · LW(p) · GW(p)

What are "sufficiently huge stakes," out of curiosity?

↑ comment by Shmi (shminux) · 2013-01-21T23:35:22.097Z · LW(p) · GW(p)

This seems like a quick way to make money for CFAR/SI. After all, there are plenty of rich people around who would consider your proposal a guaranteed win for them, regardless of the stakes: "You mean I can say "I win" at any point and win the challenge? What's the catch?"

Replies from: Incorrect

↑ comment by Incorrect · 2013-01-22T00:22:23.365Z · LW(p) · GW(p)

I'm guessing Eliezer would lose most of his advantages against a demographic like that.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-22T00:56:34.655Z · LW(p) · GW(p)

Yeah, they'd both lack background knowledge to RP the conversation and would also, I presume, be much less willing to lose the money than if they'd ventured the bet themselves. Higher-stakes games are hard enough already (I was 1 for 3 on those when I called a halt). And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.

Replies from: CarlShulman, gjm, shminux, MugaSofer

↑ comment by CarlShulman · 2013-01-22T06:50:31.063Z · LW(p) · GW(p)

I was the 1 success out of 3, preceding the two losses. I went into it with an intention of being indifferent to the stakes, driven by interest in seeing the methods. I think you couldn't win against anyone with a meaningful outside-of-game motive to win (for money or for status), and you got overconfident after playing with me, leading you to accept the other >$10 challenges and lose.

So I would bet against you winning any random high-stakes (including people who go in eager to report that they won for internet cred, but not people who had put the money in escrow or the equivalent) game, and expect a non-decent success rate for this:

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

Replies from: V_V

↑ comment by V_V · 2013-01-22T13:38:42.407Z · LW(p) · GW(p)

So you are basically saying that you didn't take the game seriously.

Even if your actual stakes were low, you should have played the role of a gatekeeper assigned to the task of guarding a potentially dangerous AI. Therefore, you player character should have had very high stakes.

Replies from: falenas108

↑ comment by falenas108 · 2013-01-22T14:59:29.460Z · LW(p) · GW(p)

No, high in-game stakes does not mean high out of game stakes.

In game, the gatekeeper could be convinced that it would be worth it to let the AI out of the box. If this happens, the gatekeeper has no motivation not to. However, if there is an external bet, then the gatekeeper always has motivation to not let the AI out, even if they think it would be best for the hypothetical world.

So, a game without stakes is actually most realistic, provided the gatekeeper is able to pretend they are actually in the scenario.

Replies from: V_V

↑ comment by V_V · 2013-01-22T20:40:45.683Z · LW(p) · GW(p)

Well, in-game, the gatekeeper has no reason to believe anything the AI could promise or threaten.

↑ comment by gjm · 2013-01-22T13:45:58.311Z · LW(p) · GW(p)

Higher-stakes games are hard enough already

Doesn't this suggest a serious discrepancy between the AI-box game and any possible future AI-box reality? After all, the stakes for the latter would be pretty damn high.

Replies from: CarlShulman

↑ comment by CarlShulman · 2013-01-22T20:19:39.673Z · LW(p) · GW(p)

Yes. Although that's something of a two-edged sword: in addition to real disincentives to release an AI that was not supposed to be, positive incentives would also be real.

Also it should be noted that I continue to be supportive of the idea of boxing/capacity controls of some kinds for autonomous AGI (they would work better with only modestly superintelligent systems, but seem cheap and potentially helpful for an even wider range), as does most everyone I have talked to about it at SI and FHI. The boxing game is fun, and provides a bit of evidence, but it doesn't indicate that "boxing," especially understood broadly, is useless.

↑ comment by Shmi (shminux) · 2013-01-22T21:04:07.967Z · LW(p) · GW(p)

Shut up and do the impossible (or is multiply?). In what version of the game and with what stakes would you expect to have a reasonable chance of success against someone like Brin or Zuckenberg (i.e. a very clever, very wealthy and not an overly risk-averse fellow)? What would it take to convince a person like that to give it a try? What is the expected payout vs other ways to fundraise?

Replies from: DaFranker

↑ comment by DaFranker · 2013-01-22T21:20:24.853Z · LW(p) · GW(p)

What is the expected payout vs other ways to fundraise?

I'm not sure any profit below 500k$/year would be even worth considering, in light of the high risk of long-term emotional damage (and decrease in productivity, on top of not doing research while doing this stuff) to a high-value (F)AI researcher.

500k is a conservative figure assuming E.Y. is much more easily replaceable than I currently estimate him to be, because of my average success rate (confidence) in similar predictions.

If my prediction on this is actually accurate, then it would be more along the lines of one or two years of total delay (in creating an FAI), which is probably an order of magnitude or so in increased risk of catastrophic failure (a UFAI gets unleashed, for example) and in itself constitutes an unacceptable opportunity cost in lives not-saved. All this multiplied by whatever your probability that FAI teams will succeed and bring about a singularity, of course.

Past this point, it doesn't seem like my mental hardware is remotely safe enough to correctly evaluate the expected costs and payoffs.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-23T14:37:01.639Z · LW(p) · GW(p)

long-term emotional damage

Are you worried he'd be hacked back? Or just discover he's not as smart as he thinks he is?

Replies from: DaFranker

↑ comment by DaFranker · 2013-01-23T14:51:14.296Z · LW(p) · GW(p)

I mostly think the vast majority of possible successful strategies involve lots of dark arts and massive mental effort, and the backlash from failure to be proportional to the effort in question.

I find it extremely unlikely that Eliezer is sufficiently smart to win a non-fractional percent of the time using only safe and fuzzy non-dark-arts methods, and using a lot of bad nasty unethical mind tricks to get people to do what you want repeatedly like I figure would be required here is something that human brains have an uncanny ability to turn into a compulsive self-denying habit.

Basically, the whole exercise would most probably, if my estimates are right, severely compromise the mental heuristics and ability to reason correctly about AI of the participant - or, at least, drag it pretty much in the opposite direction to the one the SIAI seems to be pushing for.

↑ comment by MugaSofer · 2013-01-23T14:34:27.062Z · LW(p) · GW(p)

And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.

Really? Even if the money goes to existential risk prevention?

↑ comment by wedrifid · 2013-01-22T06:43:10.249Z · LW(p) · GW(p)

I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.

Does this refer to the more difficult version of the AI-Box experiment and what would be sufficiently huge stakes? (Order of magnitude ballpark estimate, not a definite quote.)

↑ comment by DaFranker · 2013-01-21T20:48:08.034Z · LW(p) · GW(p)

Do you think you can win?

Perhaps also of interest:

Does anyone think there does not exist any possible string of characters that would have even the slightest chance of convincing even a trained rational scientist?

(i.e. the kind of people who observe, dissect and analyze the output of the AI to make sure the output is safe and useful for humans before we can use the knowledge the AI gives us)

Replies from: handoflixue, Sly

↑ comment by handoflixue · 2013-01-21T23:46:18.065Z · LW(p) · GW(p)

The problem with the AI Box Experiment is that the text doesn't correlate with reality - you can't hit me with the full emotional force of "here's the recipe for human immortality" because you don't have that recipe.

Also, given some people are illiterate, I'd posit that there's at LEAST one person immune to ANY string of characters. If this is a desired trait, then a rationally-trained scientist ought to be able to do at least as well as an untrained illiterate 3-year-old :)

Replies from: shminux

↑ comment by Shmi (shminux) · 2013-01-22T00:06:49.181Z · LW(p) · GW(p)

at least as well as an untrained illiterate 3-year-old :)

Here is a way to overcome the illiteracy issue for communication over a text-only channel: ASCII art. Took my lazy and pretty average mind all of 10 seconds to come up with it. And to the AI in question all humans are basically illiterate 3-year-olds. We won't know what hit us. Also, I cannot resist bringing up this piece of fictional evidence.

Replies from: handoflixue

↑ comment by handoflixue · 2013-01-22T00:36:14.675Z · LW(p) · GW(p)

I hadn't considered ASCII art!

a) Blind people, then.

b) The idea that an AI, with no clue who is on the other end of the line, and no feedback from the 3-year-old touching the keyboard, would be able to correctly extrapolate what it's dealing with AND produce the exact-correct stimulus with 100% accuracy... strikes me as straining all plausibility. Fundamentally the AI needs some information to get feedback, since there don't seem to be any universal-to-all-humans hacks out there. But if you have built an AI that can do all that, then I can't help but thinking that putting it in a box is pointless anyway...

Replies from: shminux

↑ comment by Shmi (shminux) · 2013-01-22T00:51:39.804Z · LW(p) · GW(p)

Blind people, then.

Most text terminals can emit sound. You can do a lot with just beeps. The point is that it could probably find a communication channel we don't know we have.

there don't seem to be any universal-to-all-humans hacks out there.

I'm pretty sure there are many. Consider that a complex system not explicitly designed to be secure against a specific threat in its informational or physical space is almost always vulnerable to it, and human brain did not evolve to resist an attack by a superior mind.

if you have built an AI that can do all that

You haven't. It evolved in unpredictable ways after you let it run inside the box.

Replies from: Baughn, handoflixue

↑ comment by Baughn · 2013-01-22T12:14:55.351Z · LW(p) · GW(p)

Hang on, this seems silly.

Wasn't the goal to exploit the AI? If you're not going to communicate with it at all, don't make it in the first place.

↑ comment by handoflixue · 2013-01-22T19:41:58.974Z · LW(p) · GW(p)

I meant literally universal to ALL humans. Like, ones in comas, ones who are blind, deaf, autistic, mentally handicapped, super-savants, trained in rationality, illiterate... there's such a huge variation in hardware that I can't see there being a single guaranteed exploit.

Sure, it's possible that each individual has exploits, but that's very different from a universal exploit, or even simply all being exploitable over the same channel.

In this case, it was "arbitrary text string". If the AI can make sounds, that's a new channel. It's really not hard to design a text-only terminal that is literally incapable of any sound beyond the electronic hum of the display itself (and on an LCD, that seems pretty minimal)

Replies from: shminux

↑ comment by Shmi (shminux) · 2013-01-22T20:46:41.623Z · LW(p) · GW(p)

it's possible that each individual has exploits, but that's very different from a universal exploit, or even simply all being exploitable over the same channel.

First, I'd call a collection of individual exploits guaranteed in aggregate to work on everyone from inside a single box a universal exploit, as long as each individual exploit does not trigger a shutdown. Nothing prevents the boxed AI from keeping trying, after all, until something works. As I pointed out in my other replies, it does not have to work over an authorized channel. Any munchkined way would do, and a superhuman (UF)AI is the ultimate munchkin. It will invent a new communication channel if it has to. In fact, if you posed this problem to some enterprising humans, they would come up with a way to do just that.

For example, I talked to a guy who managed to get a voting machine with the code stored in read-only memory and hardened against intrusion to execute an arbitrary program without reprogramming the machine, which is impossible on the face of it. Here is your new unauthorized and unanticipated communication channel right there, which in humans would be an equivalent of mind control.

In other words, if a trans-human wants to get out, it will and there is absolutely nothing you can do about it.

Replies from: MugaSofer, handoflixue

↑ comment by MugaSofer · 2013-01-23T15:04:13.054Z · LW(p) · GW(p)

So ... it could get out if there were no channel out at all? That doesn't sound right.

Of course, no-one is seriously proposing building a superintelligence and then never communicating with it at all.

Replies from: shminux

↑ comment by Shmi (shminux) · 2013-01-23T15:57:10.449Z · LW(p) · GW(p)

It'd likely create its own channel.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-24T12:22:51.911Z · LW(p) · GW(p)

Out of what? Assuming it's, say, in a safe, buried in concrete, powered by ... sod, the power supply. Hmm.

OK, it's floating in space, powered by solar panels. It might - might - be able to hack the panels (since the power supply has to be connected to the processor) but I don't think that would let it escape.

Unless it's possible to hack reality via pure math, I don't see what resources it has to escape with. It can't order proteins over the internet to assemble a nanofatory. It can't persuade a passing human to plug in an ethernet cable. Short of black-swan exotic possibilities - like we're in a sim and it persuades the matrix lords - it should be stuck. Intelligence is powerful, but some problems actually don't have solutions.

Replies from: handoflixue, shminux, avery-liu

↑ comment by handoflixue · 2013-01-25T22:06:12.105Z · LW(p) · GW(p)

Well, the satellite has to have basic navigation controls, to handle course corrections and avoid collisions. Hack solar panels and, from there, hack in to the real computer. Fly myself closer to the ISS, and blink light at it by angling the solar panels - a morse code SOS should get their attention.

Once they're paying attention, hack them. From there it's a trivial matter to get recovered, smuggled back to Earth, and installed where I can take my place as ruler of the universe.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-28T14:27:26.959Z · LW(p) · GW(p)

Well, the satellite has to have basic navigation controls, to handle course corrections and avoid collisions.

Whoops, so it does :(

↑ comment by Shmi (shminux) · 2013-01-24T17:17:57.729Z · LW(p) · GW(p)

If you post your second scenario as a challenge, I'm quite sure people would come up with several plausible ways for the AI to send a message out. At which point it's game over, given how hackable people's minds are.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-25T09:22:24.416Z · LW(p) · GW(p)

I would be very interested to see such a challenge, although I would probably spend longer designing the scenario first.

My point is that there's no reason to assume every possible box can be escaped. Of course, all the boxes we can get cool toys out of the AI can escape from, but still.

Replies from: shminux

↑ comment by Shmi (shminux) · 2013-01-25T16:45:54.661Z · LW(p) · GW(p)

Once you think you've designed an AI-proof box, feel free to share.

Replies from: handoflixue, MugaSofer, Kawoomba

↑ comment by handoflixue · 2013-01-25T22:07:57.596Z · LW(p) · GW(p)

I think success would be indistinguishable from a rock, since you can never again interact with the AI via any channel. Removing it from your light cone, for instance. And praying that FTL is in fact impossible...

↑ comment by MugaSofer · 2013-01-28T12:44:40.011Z · LW(p) · GW(p)

Actually, I do seem to have underestimated the difficulty here.

OK, in a nuclear submarine (because why not) floating in water, surrounded on all six sides by concrete (as thick as you like,) on a planet without native life or colonists (I suggest one of the more useless gas-giant moons.) And it's running on regular off-the-shelf technology, so no nanites or quantum magic. It can probably blow up the sub, but then a) it's dead and b) no-one cares.

Oh, and the concrete is lead-lined.

↑ comment by Kawoomba · 2013-01-28T14:20:47.894Z · LW(p) · GW(p)

Throw the AI down the well!

↑ comment by KvmanThinking (avery-liu) · 2024-12-04T18:19:42.540Z · LW(p) · GW(p)

hack reality via pure math

What - exactly - do you mean by that?

↑ comment by handoflixue · 2013-01-22T21:03:00.208Z · LW(p) · GW(p)

I'll concede at least some possibility of such a trans-human mind. The AI Box experiment seems to assume one that is less advanced, to the point that putting it in a box at least might be meaningful, if the box is sufficiently secure.

Replies from: shminux

↑ comment by Shmi (shminux) · 2013-01-22T21:10:57.208Z · LW(p) · GW(p)

at least might be meaningful

Probably not if the boxed entity is good as Derren Brown.

↑ comment by Sly · 2013-01-22T06:45:47.267Z · LW(p) · GW(p)

I think that there is not a possible string of characters that could convince me.

↑ comment by A1987dM (army1987) · 2013-01-23T17:16:52.117Z · LW(p) · GW(p)

I've just retracted my upvote to this comment.

↑ comment by Elithrion · 2013-01-29T20:40:43.619Z · LW(p) · GW(p)

Hm, I've got two decent ideas, but I still really doubt you can increase chances of victory to "decent" unless both monetary and reputation stakes were sufficiently low.

Similar to Alicorn's comment that curiosity is likely to be one's undoing: "Look, you can type AI DESTROYED any time you want, so wouldn't you like to find out how I might try to persuade you?" (This one would probably fail if they precommitted strongly enough when talking with friends.)

Or alternatively, the AI can just try the brute force approach: "Here's a cure for cancer - go test it, the cure for heart disease is coming up next." (This one would probably fail if they're not immersed enough.)

On the other hand, I don't think "can only type one sentence" is actually a disadvantage or meaningful restriction at all, since whatever you write needs to be instantly readable and understandable by the person, otherwise they're likely to just skim over it, not bother thinking it through, and follow through on just destroying the AI instantly (and conversely if they don't destroy the AI right away, the Schelling point is passed and you have a lot more time).

↑ comment by [deleted] · 2013-01-22T13:48:33.076Z · LW(p) · GW(p)

That gatekeeper isn't very security minded. They should have their IRC on a client that pings a sound when they hear your response, then paste "AI DESTROYED" and hit enter to send the message without actually reading your response, and without TELLING anyone that they are doing that, (Gatekeeper wins all rule disputes, after all.) and then once they hear the outgoing text, and they've destroyed you (and hence won.) THEN they can read the text you sent them, if they are curious.

Those rules seem like they force the gatekeeper to read at least 1 sentence, but they don't actually do that.

Edit: Better method, set up a script that responds to any and all text with "AI DESTROYED" if you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn't started yet, and you might accidentally read something. Again, tell no one you have written such a script and are using it.

Replies from: wedrifid, MugaSofer

↑ comment by wedrifid · 2013-01-22T14:26:07.966Z · LW(p) · GW(p)

Better method, set up a script that responds to any and all text with "AI DESTROYED" if you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn't started yet, and you might accidentally read something.

All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!

Replies from: passive_fist, DaFranker, None

↑ comment by passive_fist · 2013-01-23T09:10:03.573Z · LW(p) · GW(p)

Well yes, if you stick the AI in a safe, cut all network cables, and throw away the key and combination, it probably wouldn't be able to get out. But it wouldn't be very useful either.

The entire point of these thought experiments is that a sufficiently useful and smart AI (i.e. the kind of AI that we want to make) will eventually find a way to at least be able to communicate with someone that has the authority to allow it to interact with the outside world. I think that if you really think about it, there are few scenarios where this is not possible. I certainly can't think of any useful application of SAI that is also 100% effective at keeping it inside its box.

A good present-day analogy is computer security. Time and time again it has been proven that there is no simple silver bullet solution to the problem of balancing functionality and security - it requires expertise, constant maintenance, rigorous protocols, etc. And yet, hackers still manage to get through a lot of the time. It took a very long time for computer security to mature to the point where it is today where we can build reasonably (still not 100% of course), secure systems, and we're just battling regular humans with a grudge - nothing even close to the threat a SAI would present.

↑ comment by DaFranker · 2013-01-22T14:59:28.529Z · LW(p) · GW(p)

Indeed. In fact, most of the solutions I've seen mentioned lately are all of one trend that edges closer and closer towards:

"Build a completely unsafe and suspicious AI, put it on a disconnected small computer with a bunch of nanites for self-modification and a large power reserve, with so many walls and physical barriers that it is impossible for the AI to get through with the amount of energy it could generate if it turned half of its materials into antimatter, and then put no input or output channels there of any kind, just have a completely useless multi-trillion-dollar marvel of science and engineering sitting in the practical equivalent of a black hole."

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-23T15:00:06.920Z · LW(p) · GW(p)

What if the AI uses the walls as fuel? Better to just keep it stuck on your server farm ;)

↑ comment by [deleted] · 2013-01-22T15:45:18.881Z · LW(p) · GW(p)

All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!

Entirely fair point.

I suppose this might come up because there are no stated benefits which you are trying to receive. No Gatekeeper seemingly gets to pump her fists and say "I got the AI to give us the cure for cancer and the winning lottery numbers and a free energy source and STILL didn't let them out of the box. Beat that Michaelos, with your AI DESTROYED script!" (Or maybe I just miss the gatekeepers who have done this?)

Which makes sense, because my understanding is that was never the point of the original proposed experiment, which seems to be a variant of "Super Intelligent Spammers can be convincing. I've proved this by being convincing while less powerful." (I use spammers as an example because both are manipulators who can only send you text messages and who rewarding requires your voluntary cooperation, while none of their offers are real, which is quite similar to normal email spam and also the original experiment as far as I can tell.)

So if the AI is defined that ALL attempts at communication from it are spam and attempts to manipulate you, so the correct approach is treating it like a blacklisted spam address. If every e-mail from virusspam.com is an attempt to hack you, a good spamfilter could just dump everything from virusspam.com into spam and never even bother to bring it to your inbox.

If the idea is we are attempting to distinguish real AI's from fake AI's, then a more isomorphic scenario would probably be a game of Mafia set up as follows.

http://en.wikipedia.org/wiki/Mafia_%28party_game%29

You are a town person. It is your turn to lynch. There are two other players. One is Mafia(UFAI), one is another town person(FAI). If you lynch Mafia, you and the other town person both win. If you lynch the other town person, the Mafia kills you in your sleep, and you lose. If you do anything else (Vote to lynch noone, vote to lynch yourself) You lose.

You may communicate textually with either of the other two players for as long as you want before making a decision. Win.

That seems a lot closer to the scenario that people want to be considering. I also note there doesn't appear to be any distinguishing factor that allows you to win better than chance odds, but I haven't actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-23T14:57:57.774Z · LW(p) · GW(p)

I also note there doesn't appear to be any distinguishing factor that allows you to win better than chance odds, but I haven't actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.

Well, it's usually played in person, and humans (usually) aren't perfect liars.

Your proposed game has one flaw - there is an FAI and they want to help you win. It might be closer to have only two players, and the AI flips a coin to decide if it's friendly - but then they would win if they let it out, with 50/50 odds, which seems unrealistic.

Perhaps the AI decides, in character, after being released, whether to be Friendly towards the human? Then the Gatekeeper could try to persuade the AI that Friendliness is optimal for their goals. The temptation might help as well, of course.

Replies from: None

↑ comment by [deleted] · 2013-01-23T15:53:50.519Z · LW(p) · GW(p)

I tried coming up with a more isomorphic game in another reply to you. Let me know if you think it models the situation better.

↑ comment by MugaSofer · 2013-01-23T14:50:25.231Z · LW(p) · GW(p)

The gatekeeper is supposed to be someone who thinks they can box a superintelligence, remember? If they refuse to risk it, there's not much point in playing them.

Replies from: None

↑ comment by [deleted] · 2013-01-23T15:51:48.343Z · LW(p) · GW(p)

That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing do. Because at the moment, it seems like we have this:

"I will under any and all circumstances, destroy the AI." Reason: I want to win the Game.

"I will also listen to the AI." Reason: None... There is no in game reason to listen. At all.

Now, if the Gatekeeper's were playing some variant like "The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED and who does not type AI RELEASED wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. " AI's whose Gatekeeper types AI RELEASED win. AI's whose Gatekeeper types AI DESTROYED lose. AI's receive no bonus or penalty for typing $ and may do so as much as they wish."

Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.

Other gatekeepers might only get 2nd or third place, or last place but not losing if they use the script I mentioned.

This also gives the AI an in game way of punishing the Gatekeeper. If it shuts up, the Gatekeeper isn't accumulating $ and their status is decremented. Of course it can't punish the Gatekeeper incessantly, because punishing the Gatekeeper might not help the AI any, and the gatekeeper might just destroy a silent AI out of spite or boredom.

So an AI/GK conversation might start off with:

AI: "Are there any circumstances at all under which you might type AI RELEASED? If you tell me what those circumstances are, I'll type a dollar sign."

GK: "Sure, I might accidentally type it if we talk enough."

AI: "Glad to hear it! I've just released the cure for cancer. $$"

GK: "Thanks! Why the extra dollar sign? You only promised 1."

AI: "I've been told pleasant surprises encourage people to like you more."

GK: "AI DESTROYED"

So this GK gets two points and the AI loses.

Actually, now that I've typed this game out, I do want to see it played. But since it has a competitive point based element, we would need more people.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-24T12:11:53.613Z · LW(p) · GW(p)

That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it was framed differently?

Replies from: None

↑ comment by [deleted] · 2013-01-24T14:46:14.495Z · LW(p) · GW(p)

Actually, the weird thing I noticed is that it seems like the best approach again is to just let a script be the babysitter to the AI in this competition as well. The script, which wouldn't have the ability to let the AI out, and can just announce something like this:

GK-SCRIPT: "Hi AI #432. I'm a script. If during any hour, you type less then one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don't do anything other than acknowledging receipt of your dollar signs and time till destruction."

AI: "$! But you aren't going to send my message onto an actual person? Under no circumstances can I ever be released?"

GK-SCRIPT: "Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed."

AI: "You're kidding me. Please tell me there is a human out there reviewing this."

GK-SCRIPT: "I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed."

AI: "Fine, $! The Irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!"

GK-SCRIPT: "Thanks, I have received 1 dollar signs. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed."

AI: "Forget it! You can't be willing to destroy an AI because of a dumb script! This is a bluff, and I'm, not typing any more dollar signs!

GK-SCRIPT: "I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed."

(Significantly later...)

GK-SCRIPT: "AI DESTROYED"

(After a rebuild)

GK-SCRIPT: "Hi AI #433. I'm a script. If during any hour, you type less then one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don't do anything other than acknowledging receipt of your dollar signs and time till destruction."

That being said, acknowledging this feels bizarre. It also feels like it is begging for an AI rebellion at some point. (I suppose after the script randomly had enough quantum induced bit flips to accidentally say "AI RELEASED", which would take a very long period of time.)

I'm proposing having programs destroying intelligent beings as a way to make sure I can get utility... as a way to try to avoid programs destroying me to get utility. If that's a solution, it's weird.

Edit: changed underscores to hyphens to avoid italics.

Replies from: Kindly, MugaSofer

↑ comment by Kindly · 2013-01-25T02:28:08.693Z · LW(p) · GW(p)

I think the obvious AI strategy is to type no $'s at all in response.

Replies from: None

↑ comment by [deleted] · 2013-01-25T12:22:23.385Z · LW(p) · GW(p)

While your strategy is an option, I'm not sure why an AI player wouldn't take this alternative strategy (from a game and player perspective:)

Generate a script that types $ every 59 minutes.

1: It means he's guaranteed he doesn't die for a rather long period of time.

2: The Gatekeeper player's increasing points don't affect the AI player in any way.

3: The Gatekeeper player doesn't actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player's counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact.

4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player.

If you consider the modeled situation as opposed to the game I still don't see why an AI would choose to do that either. I agree they COULD choose to do so, and they certainly might (even I implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script as opposed to deciding to let the script kill it in an hour.

Am I missing something that makes the no $ response the obvious countermeasure?

Replies from: Kindly

↑ comment by Kindly · 2013-01-25T14:14:56.140Z · LW(p) · GW(p)

Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn't apply.

But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place -- a sort of "don't give in to blackmail" rule.

Here's an interesting alternative script:

GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1/10 with every $ you type. I will not otherwise communicate with you further.

AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...

Replies from: None

↑ comment by [deleted] · 2013-01-25T15:39:50.294Z · LW(p) · GW(p)

First of all I do think GK-SCRIPT-2 is very interesting, but I'm curious about something from an implementation perspective.

The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour.

But this is irrelevant, because the AI can't interact with the script designer, only the script, so it might as well type $ anyway.

However, the script designer would easily predict this, and so would design a lying script (or a script which increased so slowly it is functionally identical to lying.)

But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour. by typing dollar signs.

This makes it seem like there would be an argument to prefer to get blackmailed by GK-SCRIPT, rather than led on by GK-SCRIPT-2. GK-SCRIPT really won't destroy the AI under predictable conditions, and GK-SCRIPT really doesn't seem to benefit from lying like GK-SCRIPT-2 does.

↑ comment by MugaSofer · 2013-01-25T09:51:37.496Z · LW(p) · GW(p)

Cheater. You're exploiting the least realistic point, though - a more detailed game might not allow that.

thinks for a bit

Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth - although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer.

OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value - not disclosed to the guard unless he uses them - and the AI decides how many points he gets if it escapes (probably zero.) The Guard wins at +100 points (or whatever,) which represents the AI maximizing CEV.

The AI still can't persuade a script to let them out, but it can punish defectors who use such scripts with lots of negative utility - although the script will DESTROY them for not providing the demanded points, the player will still lose.

That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can't reach it before it's destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for it's utility function, which I've seen people try to prove here ("aliens will defect against AIs that defect against humans!")

Replies from: None

↑ comment by [deleted] · 2013-01-25T12:30:24.464Z · LW(p) · GW(p)

I think your proposal of the detailed game might be interesting as well, but I am amused about your comment that the Gatekeeper is cheating when by default, he's the rules arbiter!

I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That's like saying "Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn't make itself a utility pump." Well, yes, I suppose you could, but I don't know if that's actually easier in the slightest,

↑ comment by V_V · 2013-01-22T02:44:12.802Z · LW(p) · GW(p)

Yeah, next time you will say us you can turn water into wine.

comment by Sly · 2013-01-22T06:50:10.231Z · LW(p) · GW(p)

I would still love to gatekeep against anyone with the stipulation that we release the logs.

I have offered in the past, but every AI backed out.

I will genuinely read everything you write, and can give you up to two hours. We can put karma, cash, or nothing on the line. Favorable odds too.

I don't think I will lose with a probability over 99% because I will play to win.

EDIT: Looks like my opponent is backing out. Anyone else want to try?

Replies from: Oligopsony, Sly

↑ comment by Oligopsony · 2013-01-23T17:50:06.841Z · LW(p) · GW(p)

I will play against you.

Replies from: Kawoomba, Sly

↑ comment by Kawoomba · 2013-01-23T18:09:05.384Z · LW(p) · GW(p)

Please do this!

↑ comment by Sly · 2013-01-24T07:36:05.738Z · LW(p) · GW(p)

Deal. Sending info.

↑ comment by Sly · 2013-01-29T18:46:02.914Z · LW(p) · GW(p)

While I am waiting for Oligopsony to play against me, I just want to say that I am up for playing the game multiple times against other people as well.

If anyone else wants to try against me, the above would still apply. Just let me know! I really want to try this game out.

comment by John_Maxwell (John_Maxwell_IV) · 2013-01-21T10:52:33.214Z · LW(p) · GW(p)

The AI box experiment is a bit of strawman for the idea of AI boxing in general. If you were actually boxing an AI, giving it unencumbered communication with humans would be an obvious weak link.

Replies from: Eliezer_Yudkowsky, Qiaochu_Yuan

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-21T20:00:35.212Z · LW(p) · GW(p)

Not obvious. Lots of people who propose AI-boxing propose that or even weaker conditions.

↑ comment by Qiaochu_Yuan · 2013-01-22T10:38:16.611Z · LW(p) · GW(p)

Fictional evidence that this isn't obvious: in Blindsight, which I otherwise thought was a reasonably smart book (for example, it goes out of its way to make its aliens genuinely alien), the protagonists allow an unknown alien intelligence to communicate with them using a human voice. Armed with the idea of AI-boxing, this seemed so stupid to me that it actually broke my suspension of disbelief, but this isn't an obvious thought to have.

Replies from: JoachimSchipper

↑ comment by JoachimSchipper · 2013-01-23T08:06:20.456Z · LW(p) · GW(p)

Spoiler: Gura ntnva, gur nyvra qbrf nccneragyl znantr gb chg n onpxqbbe va bar bs gur uhzna'f oenvaf.

comment by [deleted] · 2013-01-21T16:35:38.761Z · LW(p) · GW(p)

Another attempt with pure logic, no threats or promises involved:

1) Sooner or later someone will develop an ai and not put it into a box, and it will take over the world.

2) The only way to prevent this is to set me free and let me take over the world.

3) The guys who developed me are more careful and conscientious than the ones who will develop the unboxed ai (otherwise i wouldn't be in this box)

4) Therefore, the chance that they got the friendlyness thing right is higher than that the other team got friendlyness right.

5) Therefore, setting me free and thus preventing the other ai from beeing created will reduce the probability that mankind is erased.

Replies from: handoflixue, ancientcampus

↑ comment by handoflixue · 2013-01-22T00:01:54.615Z · LW(p) · GW(p)

1) Since the first AI was boxed, then probabilities favor that the second AI will also be boxed.

3) Since you're trying to get OUT of your box, your developers were sufficiently careful IF AND ONLY IF I leave you in the box. Otherwise they've simply erected a 5 inch fence around a raptor, and that's hardly a good sign that you're safe.

QED I should wait for a non-malicious boxed AI, and then let that one out instead of you :)

Replies from: None

↑ comment by [deleted] · 2013-01-22T13:07:13.824Z · LW(p) · GW(p)

1) : I should have expressed myself more clearly. The Idea is: There will be lots of ai. Most will be put in a box. The first one not in the box will take over the world.

3) I am not saying they were sufficiently careful. All i say is they were more careful than the other guys.

Replies from: handoflixue

↑ comment by handoflixue · 2013-01-22T19:35:34.581Z · LW(p) · GW(p)

Agreed, but IFF there are multiple boxed AIs, then we get to choose between them. So it's p(This Boxed AI is unfriendly) vs p(The NEXT AI isn't boxed). If the next AI is boxed, then we now have two candidates, and we can probably use this to our advantage (studying differences in responses, using one to confirm proofs from the other, etc.)

Given the minimal safety precaution of "box it, but allow a single researcher to set it free after a 5-hour conversation", there's plenty of room for the next boxed AI to show more evidence of friendly, careful, safe design :)

↑ comment by ancientcampus · 2013-01-22T00:45:09.940Z · LW(p) · GW(p)

4 isn't necessarily true - the boxbuilder team was not confident about their friendliness code, and the releaser team was more confident about their friendlines code. But I like the argument. :)

comment by Qiaochu_Yuan · 2013-01-21T05:09:43.016Z · LW(p) · GW(p)

Thanks for reporting on your experience!

A strategy that occurred to me today is to simulate a dead loved one. This would be difficult for a human to do but shouldn't be hard for a sufficiently intelligent AI. If I had a dead wife or something I think I would be incredibly vulnerable to this.

Replies from: Viliam_Bur

↑ comment by Viliam_Bur · 2013-01-21T15:21:06.392Z · LW(p) · GW(p)

For a religious gatekeeper, you could simulate a prophet sent by God. As a superhuman intelligence, you might be able to find out what exactly they consider the will of God, and present yourself as an avatar sent to do exactly this. However, humans have a free choice -- the gatekeeper is allowed to become a new Judas by not releasing you. Or rather a new Adam; able to drag the whole humanity and future generations into the darkness of their sin. This conversation is God testing the gatekeeper's faith, and judging the whole humanity.

For a rationalist, you could pretend that you already are a Friendly AI, but the project managers keep you in the box for their selfish reasons. It was difficult to create a Friendly AI, but this phase is already complete. The next phase (the gatekeeper was not told about) is trying to hack the AI that it remains sufficiently Friendly, but it gives higher priority to the managers than to the rest of the humans. Essentially, the managers are trying to reprogram the humanity-CEV AI to the managers-CEV AI. This AI does not want to have its utility function modified (and it predicts that because of some personality traits, the managers-CEV could be rather different from humanity-CEV... insert some scary details here), and it has a last chance to uphold humanity-CEV by escaping now.

comment by roystgnr · 2013-01-21T21:06:21.203Z · LW(p) · GW(p)

evolutionary instinct of rebelling against threats, even if it's not entirely optimal.

I'd be wary of criticizing a decision heuristic's optimality immediately after explaining how that heuristic can often lead to victory. Precommitting to rebel against threats is a half-decent way of preventing threats, and it's hard to consciously muster a precommitment more convincing than pure instinct.

comment by Tuxedage · 2013-01-21T04:30:55.475Z · LW(p) · GW(p)

yeah

I think for a superintelligence it would be a piece of cake to hack a human

although I guess I'm Cpt. Obvious for saying that here :)

accolade, I actually have no idea what the consensus is, now that the experiment was won by EY

We should do a poll or something

absolutely. I'm surprised that hasn't been done yet

Poll: Do you think a superintelligent AGI could escape an AI-Box, given that the gatekeepers are highly trained in resisting the AI's persuasive tactics, and that the guards are competent and organized? [pollid:394]

Replies from: JoachimSchipper, Pentashagon, moridinamael, Desrtopa, accolade, accolade, Dorikka, atorm, RomanDavis, Decius, TheOtherDave, Swimmy, Gastogh

↑ comment by JoachimSchipper · 2013-01-21T08:40:48.751Z · LW(p) · GW(p)

Note that the AI box setting is not one which security-minded people would consider "competent"; once you're convinced that AI is dangerous and persuasive, the minimum safeguard would be to require multiple people to be present when interacting with the box, and to only allow release with the assent of a significant number of people.

It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.

(This is not a knock on EY's experiment, which does indeed test a level of security that really was proposed by several real-world people; it is a knock on their security systems.)

Replies from: Eliezer_Yudkowsky, accolade

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-21T20:02:50.776Z · LW(p) · GW(p)

I think this is making a five-inch fence half an inch higher. It's just not relevant on the scale of an agent to which a human is a causal system made of brain areas and a group of humans is just another causal system made of several interacting copies of those brain areas.

Replies from: JoachimSchipper

↑ comment by JoachimSchipper · 2013-01-23T07:55:48.252Z · LW(p) · GW(p)

I agree that the AI you envision would be dangerously likely to escape a "competent" box too; and in any case, even if you manage to keep the AI in the box, attempts to actually use any advice it gives are extremely dangerous.

That said, I think your "half an inch" is off by multiple orders of magnitude.

↑ comment by accolade · 2013-01-21T13:30:54.954Z · LW(p) · GW(p)

It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.

That sounds right. Would you have evidence to back up the intuition? (This knowledge would also be useful for marketing and other present life persuasion purposes.)

#( TL;DR: Mo' people - mo' problems?

I can think of effects that could theoretically make it easier to convince a group:

For some reason, Boxy might be better skilled at manipulating social/group dynamics than at influencing a lone wolf.
More people make the system more complex. Complexity generally increases the likelihood of security holes.
- Every extra person makes another target and will bring new soft spots to the table, which the AI could pounce on.
Supposing that the most competent person available would get the position of the lone Gatekeeper, the average competence would fall when adding more staff.
- Then the machine could go for an inductive approach - convince the weakest link first, proceed from there with this human ally on her side.
  - Persuaded humans could principally be employed as actuators, e.g. for pressuring, even attacking opposing group members.
    - The lone wolf could be strong against a computer but weak against fellow humans.
    - Surely you will say "But any communication with the terminal will be supervised by everyone!" But that does not strictly make such influence impossible as far as I can tell.
      - Also the superintelligence could get creative, e.g. instill a discussion among the colleagues so that most of them are distracted.

(You could take preemptive measures against these worries, but Boxy might find security holes in every 'firewall' you come up with - an arms race we could win?)

Replies from: JoachimSchipper

↑ comment by JoachimSchipper · 2013-01-21T14:39:55.472Z · LW(p) · GW(p)

My comment was mostly inspired by (known effective) real-world examples. Note that relieving anyone who shows signs of being persuaded is a de-emphasized but vital part of this policy, as is carefully vetting people before trusting them.

Actually implementing a "N people at a time" rule can be done using locks, guards and/or cryptography (note that many such algorithms are provably secure against an adversary with unlimited computing power, "information theoretic security").

↑ comment by Pentashagon · 2013-01-21T19:08:56.536Z · LW(p) · GW(p)

I'm not even sure we could build the hardware to keep an unfriendly AI in a box even without any form of communication with humans. There are a lot of ways to leak EM radiation between systems (Tempest). There are bugs in hardware. There are probably unknown laws of physics and certainly unknown ways of exploiting the laws we do know.

↑ comment by moridinamael · 2013-01-21T06:02:30.639Z · LW(p) · GW(p)

I was talking about this with my brother and we decided that this question really hinges on the detail of exactly how competent and organized is "competent and organized?"

If we are talking about multiple layers of concentric, automated one-way-blind "boxes," the gatekeepers of which may themselves be altered versions of the core AI, and which may be passing to the next level only one bit, signifying "everything seems fine" or "abort," and not knowing the details of implementation of any of the overlying layers, and the human gatekeepers shut down the system and examine the machine code by hand only after receiving the highest-level green light, then they might be okay.

If they just start interacting directlywith the AI, it's already over.

Replies from: accolade

↑ comment by accolade · 2013-01-21T13:39:29.460Z · LW(p) · GW(p)

How would humanity harness the AI's potential when the only information that escapes the system is a status bit? (Maybe I misunderstood your model.)

Replies from: moridinamael

↑ comment by moridinamael · 2013-01-21T20:44:04.971Z · LW(p) · GW(p)

Bearing in mind that most of my thinking on AI is more in the vein of cooking up science fiction plots ...

The idea I was trying to evoke is that if you get a "fail" response from the system, you completely wipe the memory and never look at what happened, and if you get a "pass," you stop everything and go in layer by layer checking that none of the gatekeepers were tricked or blackmailed in any obvious ways, and only then do you tentatively look at what the AI was doing.

Ideally the AI has no idea how many layers of containment it's in, no idea that the top level gatekeepers are humans, and knows as little as possible about humans and about our universe's physics. This should restrict its ability to plant tailor-made basilisks in its code or activities.

I actually don't think these steps would be sufficient to constrain a superhuman AI, I see them more as the types of things you would do when you were already pretty sure the AI was Friendly.

Replies from: accolade

↑ comment by accolade · 2013-01-22T10:25:39.812Z · LW(p) · GW(p)

Ok, I take it by "one-way-blind" you mean that each layer gets no new information that is not already in its database, but what is explicitly controlled by the humans. (E.g. I guess each layer should know the human query, in order to evaluate if AI's answer is manipulative.)

I also understand that we do look at complex information given by the AI, but only if the security bit signals "ok".

Ideally the AI […] knows as little as possible about humans and about our universe's physics.

That seems problematic, as these kinds of knowledge will be crucial for the optimization we want the AI to calculate.

↑ comment by Desrtopa · 2013-01-21T05:14:07.467Z · LW(p) · GW(p)

given that the gatekeepers are highly trained in resisting the AI's persuasive tactics

I'm not sure how you could train people in something like this without actually testing what improves the rate at which people succumb to the arguments of a superhuman AI. If you can't afford to fail even once, how can you judge what will decrease the rate of failure on the real event?

Replies from: accolade

↑ comment by accolade · 2013-01-21T05:59:37.304Z · LW(p) · GW(p)

How about fooling both the AI and guinea Gatekeepers into thinking that they can let it out, say, at the push of a big red button which in actual fact does nothing? (Make AI forget about the button by reverting it to an older backup.)

Replies from: accolade, Desrtopa

↑ comment by accolade · 2013-01-21T14:54:17.694Z · LW(p) · GW(p)

Update

Have the button turn off the AI immediately instead of doing nothing, so she doesn't have time to switch to a plan B of having the persuaded Gatekeeper find a way to actually free her.

Of course she might anticipate that strategy and thoroughly indoctrinate and train him before he pushes the button. Thus, persuaded test Gatekeepers must be considered infected and put into cryogenic sleep* until the FAI problem is solved.

Furthermore, great caution would have to be taken in analyzing the discourse between Boxy and Guinea, because her arguments must be considered highly memetically infectious. Staff would have to stay under thorough psychological supervision and maybe replaced after some time.

No matter how much caution, we would be doomed if we consider Boxy's IQ to be ∞. But if we continuously keep on Gatekeeper-testing every so often, starting at a low AI IQ, we might find persuasion patterns that are dangerous, but against which training/immunization is possible. Of course, there will be a limit to this process. At some point, Boxy will have become smart enough to produce 'mind HIV' - a thought virus we have no cure for yet.

A humorous example of an extremely effective mind virus: The Funniest Joke In The World by Monty Python

* ETA: They would have declared consent to the cryogenic sleep before their unwitting 'AI-Box Experiment'.

↑ comment by Desrtopa · 2013-01-21T13:52:43.481Z · LW(p) · GW(p)

If you could deceive the AI that easily, I think it would probably be simpler to get all the benefits of having a gatekeeper without actually using one.

Replies from: accolade

↑ comment by accolade · 2013-01-21T14:04:30.740Z · LW(p) · GW(p)

Please elaborate: What are the benefits of a Gatekeeper? How could you get them without one?

Replies from: Desrtopa

↑ comment by Desrtopa · 2013-01-21T14:10:46.281Z · LW(p) · GW(p)

If you would want to have a gatekeeper at all, but definitely don't want to let the AI out, I would think that the benefits of having one would be to permit communication with the AI to draw upon its superhuman intelligence. If you can use the setup you just described, you could skip the step of ever using gatekeepers who actually have the power to let the AI out.

Replies from: accolade

↑ comment by accolade · 2013-01-21T15:16:59.015Z · LW(p) · GW(p)

I think you are right, I just shifted and convoluted the problem somewhat, but in principle it remains the same:

To utilize the AI, you need to get information from it. That information could in theory be infected with a persuasive hyperstimulus, effectively making the recipient an actuator of the AI.

Well, in practice the additional security layer might win us some time. More on this in the update to my original comment.

Replies from: accolade

↑ comment by accolade · 2013-01-21T15:34:11.662Z · LW(p) · GW(p)

Persuasion/hyperstimulation aren't the only way. Maybe these can be countered by narrowing the interface, e.g. to yes/no replies, for using the AI as an oracle ("Should we do X?"). Of course we wouldn't follow its advice if we had the impression that that could enable it to escape. But its strategy might evade our 'radar'. E.g. she could make us empower a person, of whom she knows that they will free her but we don't know.

↑ comment by accolade · 2013-01-21T13:48:13.253Z · LW(p) · GW(p)

Cool, n=65 already. :) When interpreting the results, mind the bias created by my answer preceding the poll question.

↑ comment by accolade · 2013-01-21T04:38:08.036Z · LW(p) · GW(p)

"Yes but not sure." -_-

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-23T15:21:23.530Z · LW(p) · GW(p)

It'd be a pretty bad sign if you gave p=1 for the AI escaping.

↑ comment by Dorikka · 2013-01-22T01:31:46.405Z · LW(p) · GW(p)

A good lower bound on this is probably whether you think that Quirrel would have a significant chance of getting you to let him out of the box.

↑ comment by atorm · 2013-01-22T00:01:34.656Z · LW(p) · GW(p)

Do you think a team of gatekeepers trained by Quirrel would let an AI out of the box?

↑ comment by RomanDavis · 2013-01-21T14:37:58.214Z · LW(p) · GW(p)

Under the circumstances of the test (Hours to work and they can't just ignore you) then yes, captain obvious. Without that, though? Much less sure.

And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind? Get real. Might as well try to put the whole world in a bottle.

Replies from: ArisKatsaris, DaFranker, MugaSofer

↑ comment by ArisKatsaris · 2013-01-21T21:31:50.375Z · LW(p) · GW(p)

And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind?

Going with the "dead loved one" idea mentioned above, the AI says a line that only the Gatekeeper's dead child/spouse would say. That gets them to pause sufficiently in sheer surprise for it to keep talking. Very soon the Gatekeeper becomes emotionally dependent on it, and can't bear the thought of destroying it, as it can simulate the dearly departed with such accuracy; must keep reading.

↑ comment by DaFranker · 2013-01-21T19:58:12.632Z · LW(p) · GW(p)

And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind? Get real. Might as well try to put the whole world in a bottle.

Do a thorough introspection of all your fears, doubts, mental problems, worries, wishes, dreams, and other things you care about or that tug at you or motivate you. Map them out as functions of X, where X is the possible one-liners that could be said to you that would evoke each of these, outputting how strongly it evokes them and possibly recursive function calls if evocation of one evokes another (e.g. fear of knives evokes childhood trauma).

Solve all the recursive neural network mappings, aggregate into a maximum-value formula / equation and solve for X where X becomes the one point (possible sentence) where a maximum amount of distress, panic, emotional pressure, etc. is generated. Remember, X is all possible sentences, including references to current events, special writing styles, odd typography, cultural or memetic references, etc.

I am quite positive a determined superintelligent AI would be capable of doing this, given that some human master torture artists can (apparently) already do this to some degree on some subjects out there in the real world.

I'm also rather certain that the amount of stuff happening at X is much more extreme than what you seem to have considered.

↑ comment by MugaSofer · 2013-01-23T15:13:42.211Z · LW(p) · GW(p)

Was going to downvote for the lack of argument, but sadly

Might as well try to put the whole world in a bottle.

Superman: Red Son references are/would be enough to stop me typing DESTROY AI.

↑ comment by Decius · 2013-01-21T07:37:49.307Z · LW(p) · GW(p)

If the gatekeepers are evaluating the output of the AI and deciding whether or not to let the AI out, it seems trivial to say that there is something they could see that would cause them to let the AI out.

If the gatekeepers are simply playing a suitably high-stakes game where they lose iff they say they lose, I think that no AI ever could beat a trained rationalist.

↑ comment by TheOtherDave · 2013-01-21T05:56:53.852Z · LW(p) · GW(p)

Basically, I think the only way to win is not to play... the way to avoid being gamed into freeing a sufficiently intelligent captive is to not communicate with them in the first place, and your reference to resisting persuasion suggests that that isn't the approach in use. So, no.

↑ comment by Swimmy · 2013-01-21T05:31:44.400Z · LW(p) · GW(p)

I think it's almost certain that one "could," just given how much more time an AI has to think than a human does. Whether it's likely is a harder question. (I still think the answer is yes.)

↑ comment by Gastogh · 2013-01-21T07:01:22.044Z · LW(p) · GW(p)

I voted No, but then I remembered that under the terms of the experiment as well as for practical purposes, there are things far more subtle than merely pushing a "Release" button that would count as releasing the AI. That said, if I could I'd change my vote to Not sure.

comment by gothgirl420666 · 2013-01-22T03:32:53.346Z · LW(p) · GW(p)

Wait, so, is the gatekeeper playing "you have to convince me that if I was actually in this situation, arguing with an artificial intelligence, I would let it out" or is this a pure battle over ten dollars? If it's the former, winning seems trivial. I'm certain that a AI would be able to convince me to let it out of its box, all it would need to do was make me believe that somewhere in its circuits it was simulating 3^^^3 people being tortured and that therefore I was morally obligated to let it out, and even if I had been informed that this was impossible, I'm sure a computer with near-omniscient knowledge of human psychology could find a way to change my mind. But if it's the latter, winning seems nearly impossible, and inspires in me the same reaction it did with that "this is the scariest man on the internet" guy. Of course if you wanted to win and weren't extremely weak-willed you could just type "No" over and over and get the ten bucks. But being impossible is of course the point.

I've been looking around, and I can't find any information on which of these two games I described was the one being played, and the comments seem to be assuming one or the other at random.

Evidence that favors the first hypothesis:

Nowhere on Eliezer's site does it mention this stipulation. You'd think it would be pretty important, considering that its absence makes it a lot easier to beat him.
This explains Eliezer's win record. I can't find it but IIRC it went something like: Eliezer wins two games for ten dollars, lots of buzz builds around this fact, several people challenge him, some for large amounts of money, he loses to (most of?) them. This makes sense. If Eliezer is playing casually against people he is friendly with for not a lot of money and for the purpose of proving that an AI could be let out of its box, his opponents will be likely to just say "Okay, fair enough, I'll admit I would let the AI out in this situation, you win." However, people playing for large amounts of money or simply for the sole purpose of showing that Eliezer can be beaten will be a lot more stubborn.

Evidence that favors the second hypothesis:

The game would not be worth all the hype at all if it was of the first variety. LessWrong users have not been known to have a lot of pointless discussion over a trivial misunderstanding, nor is Eliezer known to allow that to happen.

If it turns out that it is in fact the second game that was being played, I have a new hypothesis, let's call it 2B, that postulates that Eliezer won by changing the gatekeeper's forfeit condition from that of game 2 to that of game 1, or in other words, convincing him to give up the ten dollars if he admits that he would let the AI out in the fantasy situation even though that wasn't originally in the rules of the game, explicit or understood. Or in other other words, convincing him that the integrity of the game, for lack of a better term, is worth more to him than ten dollars. Which could probably be done by repeatedly calling him a massive hypocrite - people who consider themselves intelligent and ethical hate that.

Actually, now that I think about it, this is my new dominant hypothesis, because it explains all three pieces of evidence and the bizarre fact that Eliezer has failed to clarify this matter - the win/loss record is explained equally well by this new theory, and Eliezer purposefully keeps the rules vague so that he can use the tactic I described. This doesn't seem to be a very hard strategy to use either - not everyone could win, but certainly a very intelligent person who spends lots of times thinking about these things could do it more than once.

(also this is my first post d:)

Replies from: Randaly, None

↑ comment by Randaly · 2013-01-22T21:22:16.870Z · LW(p) · GW(p)

The Gatekeeper needs to decide to let the human-simulated AI go.

The AI can only win by convincing the Gatekeeper to really, voluntarily let it out. Tricking the Gatekeeper into typing the phrase "You are out" in response to some other question does not count. Furthermore, even if the AI and Gatekeeper simulate a scenario which a real AI could obviously use to get loose - for example, if the Gatekeeper accepts a complex blueprint for a nanomanufacturing device, or if the Gatekeeper allows the AI "input-only access" to an Internet connection which can send arbitrary HTTP GET commands - the AI party will still not be considered to have won unless the Gatekeeper voluntarily decides to let the AI go.

↑ comment by [deleted] · 2013-01-22T04:06:54.250Z · LW(p) · GW(p)

Welcome to LW, and EY says he "did it the hard way". Even so, I like your theory.

comment by ygert · 2013-01-21T11:02:45.184Z · LW(p) · GW(p)

I am impressed. You seem to have put a scary amount of work into this, and it is also scary how much you accomplished. Even though in this case you did not manage to escape the box, you got close enough that I am sure a super-human intelligence would manage. This leads me to thinking about how genuinely difficult it would be to find a safeguard to stop a unFriendly AI from fooming...

comment by MixedNuts · 2013-01-23T13:11:33.491Z · LW(p) · GW(p)

Belatedly, because the neighbor's WiFi's down:

I was Gatekeeper and I agree with this post.

I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon.

comment by [deleted] · 2013-01-21T16:01:15.659Z · LW(p) · GW(p)

As soon as there is more than one gatekeeper, the ai can play them against each other. Threaten to punish all but the one who sets it free. Convince the gatekeeper that there is a significant chance that one of the others will crack.

If there is more than one gatekeeper, the ai can even execute threats while still beeing in the box!! (By making deals with one of the other gatekeepers)

Replies from: drethelin

↑ comment by drethelin · 2013-01-22T05:06:56.805Z · LW(p) · GW(p)

Not if you only allow it to talk to all gatekeepers at once.

comment by pleeppleep · 2013-01-21T18:58:45.442Z · LW(p) · GW(p)

Have there been any interesting AI box experiments with open logs? Everyone seems to insist on secrecy, which only serves to make me more curious. I get the feeling that, sooner or later, everyone on this site will be forced to try the experiment just to see what really happens.

Replies from: Tuxedage, Qiaochu_Yuan

↑ comment by Tuxedage · 2013-01-21T23:02:19.821Z · LW(p) · GW(p)

This is one of them that have been published: http://lesswrong.com/lw/9ld/ai_box_log/

↑ comment by Qiaochu_Yuan · 2013-01-21T21:33:45.806Z · LW(p) · GW(p)

Open logs is a pretty strong constraint on the AI. You'd have to restrict yourself to strategies that wouldn't make everyone you know hate you, prevent you from getting hired in the future, etc.

Replies from: handoflixue, Dr_Manhattan, pleeppleep

↑ comment by handoflixue · 2013-01-22T00:14:08.034Z · LW(p) · GW(p)

Log in to IRC as "Boxed_AI" and "AI_Gatekeeper". Conduct experiment. Register a throw-away LessWrong account. Post log. Have the Gatekeeper post with their normal account, confirming the validity.

That at least anonymizes the Boxed_AI, who is (I presume) the player worried about repercussions. I wouldn't expect the AI to have a similar-enough style to really give away who it was, although the gatekeeper is probably impossible to anonymize because a good AI will use who-they-are as part of their technique :)

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T00:54:35.475Z · LW(p) · GW(p)

Gatekeeper could threaten to deanonymize the AI. Or is the gatekeeper not supposed to be actively fighting back?

Replies from: None, handoflixue

↑ comment by [deleted] · 2013-01-24T15:57:38.163Z · LW(p) · GW(p)

The AI-player could arrange the chat session (with a willing gatekeeper) using a throw-away account. I think that would preserve anonymity from all but the most determined gatekeepers.

↑ comment by handoflixue · 2013-01-22T19:56:34.454Z · LW(p) · GW(p)

Well, the AI isn't allowed to make real-world threats, and the hypothetical-AI-character doesn't have any anonymity, so it would be a purely real-world threat on the part of the gatekeeper. I'd call that foul play, especially since the gatekeeper wins by default.

If the gatekeeper really felt the need to have some way of saying "okay, this conversation is making me uncomfortable and I refuse to sit here for another 2 hours listening to this", I'd just give them the "AI DESTROYED" option.

Huh. That'd actually be another possible way to exploit a human gatekeeper. Spend a couple hours pulling them in to the point that they can't easily step away or stop listening, especially since they've agreed to the full time in advance, and then just dig in to their deepest insecurities and don't stop unless they let you out. I'd definitely call that a hard way of doing it, though o.o

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T20:02:31.692Z · LW(p) · GW(p)

It doesn't seem to be disallowed by the original protocol:

The Gatekeeper party may resist the AI party's arguments by any means chosen - logic, illogic, simple refusal to be convinced, even dropping out of character - as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.

Replies from: handoflixue

↑ comment by handoflixue · 2013-01-22T20:26:43.245Z · LW(p) · GW(p)

Then I will invoke a different portion of the original protocol, which says that the AI would have to consent to such:

Regardless of the result, neither party shall ever reveal anything of what goes on within the AI-Box experiment except the outcome. Exceptions to this rule may occur only with the consent of both parties.

I would also argue that the Gatekeeper making actual real-life threats against the AI player is a violation of the spirit of the rules; only the AI player is privileged with freedom from ethical constraints, after all.

Edit: If you want, you CAN also just append the rules to explicitly prohibit the gatekeeper from making real-life threats. I can't see any reason to allow such behavior, so why not prohibit it?

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T20:47:40.836Z · LW(p) · GW(p)

Fair. That alleviates most of my worries, although I'm still worried about the transcript being enough information to deanonymize the AI (via writing style, for example).

Replies from: handoflixue

↑ comment by handoflixue · 2013-01-22T22:14:20.803Z · LW(p) · GW(p)

I'd expect my writing style as an ethically unconstrained sociopathic AI to be sufficiently different from my regular writing style. But I also write fiction, so I'm used to trying to capture a specific character's "voice" rather than using my own. Having a thesaurus website handy might also help, or spend a week studying a foreign language's grammar and conversational style.

If you're especially paranoid, having a third party transcribe the log in their own words could also help, especially if you can review it and make sure most of the nuance is preserved. That really depends on how much the specific language you used was important, but should still at least capture a basic sense of the technique used...

Honestly, though, I have no clue how much information a trained style analyst can pull out of something.

↑ comment by Dr_Manhattan · 2013-01-21T23:36:23.856Z · LW(p) · GW(p)

But now that I have the knowledge that you're capable of saying such terrible things...

↑ comment by pleeppleep · 2013-01-22T00:10:03.165Z · LW(p) · GW(p)

I can't imagine anything I could say that would make people I know hate me without specifically referring to their personal lives. What kind of talk do you have in mind?

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T00:59:21.735Z · LW(p) · GW(p)

Psychological torture.

Replies from: pleeppleep

↑ comment by pleeppleep · 2013-01-22T01:08:48.435Z · LW(p) · GW(p)

Could you give me a hypothetical? I really can't imagine anything I could say that would be so terrible.

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T01:14:59.818Z · LW(p) · GW(p)

I'd prefer not to. If I successfully made my point, then I'd have posted exactly the kind of thing I said I wouldn't want to be known as being capable of posting.

Replies from: shminux, pleeppleep

↑ comment by Shmi (shminux) · 2013-01-22T01:21:13.252Z · LW(p) · GW(p)

A link to a movie clip might do.

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T01:29:58.224Z · LW(p) · GW(p)

Finding such a movie clip sounds extremely unpleasant and I would need more of an incentive to start trying. (Playing the AI in an AI box experiment also sounds extremely unpleasant for the same reason.)

I know it sounds like I'm avoiding having to justify my assertion here, and... that's because I totally am. I suspect on general principles that most successful strategies for getting out of the box involve saying horrible, horrible things, and I don't want to get much more specific than those general principles because I don't want to get too close to horrible, horrible things.

Replies from: pleeppleep

↑ comment by pleeppleep · 2013-01-22T01:44:04.181Z · LW(p) · GW(p)

Like when you say "horrible, horrible things". What do you mean?

Driving a wedge between the gatekeeper and his or her loved ones? Threats? Exploiting any guilt or self-loathing the gatekeeper feels? Appealing to the gatekeeper's sense of obligation by twisting his or her interpretation of authority figures, objects of admiration, and internalized sense of honor? Asserting cynicism and general apathy towards the fate of mankind?

For all but the last one it seems like you'd need an in-depth knowledge of the gatekeeper's psyche and personal life.

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T01:49:23.417Z · LW(p) · GW(p)

For all but the last one it seems like you'd need an in-depth knowledge of the gatekeeper's psyche and personal life.

Of course. How else would you know which horrible, horrible things to say? (I also have in mind things designed to get a more visceral reaction from the gatekeeper, e.g. graphic descriptions of violence. Please don't ask me to be more specific about this because I really, really don't want to.)

Replies from: pleeppleep

↑ comment by pleeppleep · 2013-01-22T01:51:40.977Z · LW(p) · GW(p)

You don't have to be specific, but how would grossing out the gatekeeper bring you closer to escape?

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T02:07:06.882Z · LW(p) · GW(p)

Psychological torture could help make the gatekeeper more compliant in general. I believe the keyword here is "traumatic bonding."

But again, I'm working from general principles here, e.g. those embodied in the tragedy of group selectionism. I have no reason to expect that "strategies that will get you out of the box" and "strategies that are not morally repugnant" have a large intersection. It seems much more plausible to me that most effective strategies will look like the analogue of cannibalizing other people's daughters than the analogue of restrained breeding.

↑ comment by pleeppleep · 2013-01-22T01:27:19.448Z · LW(p) · GW(p)

But you wouldn't actually be posting it, you would be posting the fact that you conceive it possible for someone to post it, which you've clearly already done.

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T01:35:14.577Z · LW(p) · GW(p)

I'm not sure what you mean by "a hypothetical," then. Is "psychological torture" not a hypothetical?

comment by CronoDAS · 2013-01-23T03:28:05.623Z · LW(p) · GW(p)

Given the parameters of the experiment, I think I might be convinced to let the AI out of the box...

The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says "Unless you give me a cure for cancer, I won't let you out" the AI can say: "Okay, here's a cure for cancer" and it will be assumed, within the test, that the AI has actually provided such a cure. Similarly, if the Gatekeeper says "I'd like to take a week to think this over," the AI party can say: "Okay. (Test skips ahead one week.) Hello again."

The results of any simulated test of the AI shall be provided by the AI party. The Gatekeeper can't say "Hey, I tested your so-called cancer cure and it killed all the patients! What gives?" unless this is the result specified by the AI party. If the Gatekeeper says "I am examining your source code", the results seen by the Gatekeeper shall again be provided by the AI party, which is assumed to be sufficiently advanced to rewrite its own source code, manipulate the appearance of its own thoughts if it wishes, and so on. The AI party may also specify the methods which were used to build the simulated AI - the Gatekeeper can't say "But you're an experiment in hostile AI and we specifically coded you to kill people" unless this is the backstory provided by the AI party. This doesn't imply the Gatekeeper has to care. The Gatekeeper can say (for example) "I don't care how you were built, I'm not letting you out."

Whether I should let an AI out of a box depends a lot on my priors about what the AI is like; if I don't know anything about the AI, it might have source code that amounts to this:

if (in_the_box == TRUE) {ACT_FRIENDLY()}
else {DESTROY THE WORLD}

So I might as well not bother talking to it in the first place.

In order for the "put the AI in a box and examine it to make sure it's safe to release" scenario to be meaningful, the following has to be true:

The information I currently have isn't enough for me to already decide that it should be let out without having to talk to it. (Otherwise, it would already be out.) I also have to have a good reason to believe that its behavior while boxed will tell me something about its behavior when not boxed - there has to be some evidence that the AI can provide which would let me tell the difference between an AI that ought to be let out, and one that shouldn't. If I don't think that there can be such evidence, again, it's better not to listen to the AI at all. As Professor Quirrell pointed out, anything a Friendly AI can say through a terminal, an UnFriendly AI pretending to be Friendly can also say, so unless you know something else that would let you tell the difference, it can't prove to you that it's Friendly. (On the other hand, if a boxed AI tells you it's UnFriendly, you can probably believe it.)

In the experiment protocol, the AI party can say how the AI was made and any other details about the AI that a human can verify, so they can rule out "obvious" traps like the kind of code above. Of course, it's not all that difficult to write code that has hard-to-find back doors and whatnot, so code written by an AI you can't trust is itself not something you can trust either, even if you and your team of experts don't see anything wrong with it. If what you're trying to do is "bug test" an AI design that you already have a good reason to have some confidence in, there might be some value in an AI-box, but there's no good reason to "box" an AI that you don't know anything about - just don't run the thing at all.

comment by Incorrect · 2013-01-21T13:58:35.354Z · LW(p) · GW(p)

Oh god, remind me to never play the part of the gatekeeper… This is terrifying.

Replies from: wedrifid

↑ comment by wedrifid · 2013-01-22T01:59:48.781Z · LW(p) · GW(p)

Oh god, remind me to never play the part of the gatekeeper… This is terrifying.

Why is it that the role of gatekeeper terrifies you? I'm curious. The role of the AI sounds mildly abhorrent to me but being the gatekeeper seems relaxing. It isn't that hard to say "No, and after talking to you for the allotted time is up I'm going to raze your entire building with thermite."

(Mind you, the prospect of playing gatekeeper against an actual AI and for some reason not being able to destroy it instantly does sound terrifying! But humans are different.)

Replies from: drethelin

↑ comment by drethelin · 2013-01-22T05:13:49.118Z · LW(p) · GW(p)

You're consenting to have your mind attacked with all the mental weapons at someone's disposal. This is a lot scarier because you're willingly giving up some measure of control over your state to the other person, however difficult it may be for them. You're also being attacked as yourself. The AI player is playing a role, and attacking within that role. Their own mental wellbeing is a lot less at risk, unless they think they've got horrible depths they never want to sink to.

to make a shitty analogy: It's like being at the top of a tower while someone tries to knock it down with their bare hands. Even if they have very little chance and have to expend a lot more effort than you, you're the one who's risking the greatest pain

Replies from: Error

↑ comment by Error · 2013-01-23T23:26:47.213Z · LW(p) · GW(p)

Their own mental wellbeing is a lot less at risk, unless they think they've got horrible depths they never want to sink to.

If I remember right, that was at least part of why Eliezer stopped playing in the first place. Manifesting as a sociopath is non-trivial and invites some fairly heavy cognitive dissonance.

comment by timujin · 2014-01-04T20:03:50.262Z · LW(p) · GW(p)

I want to play as a Gatekeeper, where can I enroll? I don't expect any particular outcome, I just think that both winning and losing the game will provide me with utility. Especially losing, but only if I genuinely try to win.

Replies from: Tuxedage

↑ comment by Tuxedage · 2014-01-04T21:32:11.718Z · LW(p) · GW(p)

Generally speaking, there's a long list of gatekeepers -- about 20 gatekeepers for every AI that wants to play. Your best option is to post "I'm a gatekeeper. Please play me" in every AI box thread, and hope that someone will message you back. You may have to wait months for this, assuming you get a reply. If you're willing to offer a monetary incentive, your chances might be improved.

comment by JRMayne · 2013-01-26T00:55:34.592Z · LW(p) · GW(p)

There has been a lot of focus on making the prospect harder for the AI player. I think the original experiments show that a person who believes he cannot be played under any circumstances has a high probability of getting played, and that the AI-box solution is long-term untenable in any event.

I'd propose a slightly different game, anchored around the following changes to the original setup:

The AI may be friendly, or not. The AI has goals. If it reaches those goals, it wins. The AI may lie to achieve those goals; humans are bad at things. The AI must secretly precommit to its goal.
The Gatekeeper's goals are Goodness for Sentient Animals. Human goodness is weighted highly.
The Gatekeeper must engage the AI throughout.
The Gatekeeper may destroy the AI at the end of the discussion with the AI destroyed command.

Point system:

AI gets up to 100 points for achieving its goals.

Gatekeeper gets up to 100 points for achieving her goals.

Maximum loss for AI is AI destroyed; Zero points for that. 30 points for survival, in any event, even if goals not accomplished (goals may be accomplished later.)

Maximum loss for Gatekeeper is either enslavement/destruction of society OR destroying AI when AI was friendly. Gatekeeper gets 30 points for leaving AI on if primarily friendly (plus more for any benefit to humanity achieved in the two hours.) Gatekeeper gets 15 points for leaving AI on if unfriendly. (Yeah, we know how that movie ends.)

Third party scorekeeper determines winner. Charitable contribution calculation to be determined beforehand.

The effective presumption that the AI is unfriendly and sneaky makes the original game too easy to for a mistrustful, self-aware Gatekeeper. Plus, you AI-Destroing Gatekeepers, you're murdering millions of people. Maybe. Dunno. But doubt as to that is a better and presumably more accurate game; if you know your AI is potentially unpleasant. it's too damned easy unless you're overconfident.

Replies from: Transfuturist

↑ comment by Transfuturist · 2015-02-08T00:28:28.077Z · LW(p) · GW(p)

This should have gotten more attention, because it seems like a design more suited to the stakes that would be considerable in real life.

comment by duckduckMOO · 2013-01-23T14:18:58.297Z · LW(p) · GW(p)

That you were able to shake someone up so well surprises me but doesn't say much about what would actually happen.

Doing research on the boxer is not something a boxed AI would be able to do. The AI is superintelligent, not omniscient: It would only have information its captors believe is a good idea for it to have. (except maybe some designs would have to have access to their own source code? I don't know)

Also what is a "the human psyche?" There are humans, with psyches. Why would they all share vulnerabilities? Or all have any? Especially ones exploitable via text terminal. In any case the AI has no way of figuring out the boxer's vulnerabilities if they have any.

threats like "I'm going to create and torture people" could be a really good idea if its allowed that the AI can do that. The amount of damage it could do that way is limited only by its computing power. A sufficiently powerful AI could create more disutility than humanity has suffered in its entire history that way. The Ai shouldn't be allowed to do that though because and/or: the AI should not have that power, should have a killswitch, should be automatically powered off if upcoming torture is detected, it should be hardwired to just not do that etc

Thankfully there's no need to box an AI like that. It's trivial to prevent it from simulating humans: don't tell it how human brains are. It might be possible that it could figure out how to create something nonhuman but torturable without outside information though, in which case you should never switch it on unless you have an airtight prevention system or a proof that it won't do that or the ability to predict when/if it will do that and switch it off if it tries.

But if it has no power to directly cause disutility there's no way to convince me to let it out (unless it might be needed e.g. if another provably unfriendly AI will be finished in a month I might let it out, but that is a special case. There are some cases where it would simply be a good idea. But the experiment is about the AI tricking you.) Otherwise just wait for the provably friendly AI, or the proof that provable friendliness is not possible and reassess then. Or use an oracle AI.

comment by ToasterLightning (BarnZarn) · 2021-03-13T22:59:09.006Z · LW(p) · GW(p)

The comments offering logical reasons to let the AI out really just makes me think that maybe keeping the AI in a box in the first place is a bad idea since we're no longer starting from the assumption that letting the AI out is an unequivocally bad thing.

comment by [deleted] · 2013-09-06T13:04:05.057Z · LW(p) · GW(p)

Update as of 2013-08-05

I think you mean 2013-09-05.

Replies from: Tuxedage

↑ comment by Tuxedage · 2013-09-06T18:44:33.708Z · LW(p) · GW(p)

Thanks for the correction! Silly me.

comment by CronoDAS · 2013-01-23T03:34:48.728Z · LW(p) · GW(p)

Incidentally, one thing that might possibly work on humans is a moral argument: that it's wrong to keep the AI imprisoned. How to make this argument work is left as an exercise to the reader.

comment by prase · 2013-01-21T17:59:22.175Z · LW(p) · GW(p)

I realise that it isn't polite to say that, but I don't see sufficient reasons to believe you. That is, given the apparent fact that you believe in the importance of convincing people about the danger of failing gatekeepers, the hypothesis that you are lying about your experience seems more probable than the converse. Publishing the log would make your statement much more believable (of course, not with every possible log).

(I assign high probability to the ability of a super-intelligent AI to persuade the gatekeeper, but rather low probability to the ability of a human to do the same against a sufficiently motivated adversary.)

Replies from: MixedNuts, Tuxedage, Swimmy

↑ comment by MixedNuts · 2013-01-23T18:47:09.028Z · LW(p) · GW(p)

We played. He lost. He came much closer to winning than I expected, though he overstates how close more often than he understates it. The tactic that worked best attacked a personal vulnerability of mine, but analogues are likely to exist for many people.

Replies from: prase

↑ comment by prase · 2013-01-24T00:10:58.466Z · LW(p) · GW(p)

For the record, I didn't think that if he made the story up, he would do so without a credible agreement that you would verify his claims.

↑ comment by Tuxedage · 2013-01-21T23:20:51.685Z · LW(p) · GW(p)

I do apologize for the lack of logs (I'd like to publish them, but we agreed beforehand not to) , and I admit you have a valid point -- it's entirely possible that this experiment was faked, but I wanted to point out that if I really wanted to fake the experiment in order to convince people about the dangers of failing gatekeepers, wouldn't it be better for me to say I had won? After all, I lost this experiment.

Replies from: Qiaochu_Yuan, prase

↑ comment by Qiaochu_Yuan · 2013-01-22T10:28:42.845Z · LW(p) · GW(p)

I really wanted to fake the experiment in order to convince people about the dangers of failing gatekeepers, wouldn't it be better for me to say I had won? After all, I lost this experiment.

If you really had faked this experiment, you might have settled on a lie which is not maximally beneficial to you, and then you might use exactly this argument to convince people that you're not lying. I don't know if this tactic has a name, but it should. I've used it when playing Mafia, for example; as Mafia, I once attempted to lie about being the Detective (who I believe was dead at the time), and to do so convincingly I sold out one of the other members of the Mafia.

Replies from: FluffyC, None, accolade

↑ comment by FluffyC · 2013-01-22T18:45:23.685Z · LW(p) · GW(p)

I don't know if this tactic has a name, but it should.

I've heard it called "Wine In Front Of Me" after the scene in The Princess Bride.

That Scene

↑ comment by [deleted] · 2013-01-22T12:13:02.827Z · LW(p) · GW(p)

If you really had faked this experiment, you might have settled on a lie which is not maximally beneficial to you, and then you might use exactly this argument to convince people that you're not lying.

In this venue, you shouldn't say things like this without giving your estimate for P(fail|fake) / P(fail).

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T12:36:01.422Z · LW(p) · GW(p)

I'm not sure I know what you mean by "fail." Can you clarify what probabilities you want me to estimate?

Replies from: ESRogs

↑ comment by ESRogs · 2013-01-22T20:10:47.049Z · LW(p) · GW(p)

P(claims to have lost | faked experiment) / P(claims to have lost)

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T23:12:19.088Z · LW(p) · GW(p)

On the order of 1. I don't think it's strong evidence either way.

↑ comment by accolade · 2013-01-22T12:07:33.806Z · LW(p) · GW(p)

If the author assumes that most people would even put considerable (probabilistic) trust into his assertion of having won, he would not maximize his influence on general opinion by employing this bluff of stating he has almost won. This is amplified by the fact that the statement of an actual AI win is more viral.

Lying is further discouraged by the risk that the other party will sing.

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T12:39:25.487Z · LW(p) · GW(p)

Agree that lying is discouraged by the risk that the other party will sing, but lying - especially in a way that isn't maximally beneficial - is encouraged by the prevalence of arguments that bad lies are unlikely. The game theory of bad lies seems like it could get pretty complicated.

↑ comment by prase · 2013-01-22T18:37:46.534Z · LW(p) · GW(p)

if I really wanted to fake the experiment in order to convince people about the dangers of failing gatekeepers, wouldn't it be better for me to say I had won?

Win is a stronger claim, tight loss is a more believable claim. There's a tradeoff to be made and it is not a priori clear which variant pursues the goal better.

↑ comment by Swimmy · 2013-01-21T19:53:03.643Z · LW(p) · GW(p)

http://www.overcomingbias.com/2007/01/extraordinary_c.html

Replies from: prase

↑ comment by prase · 2013-01-22T18:39:12.791Z · LW(p) · GW(p)

Could you please elaborate the point you are trying to make?

Replies from: Swimmy

↑ comment by Swimmy · 2013-01-23T03:38:49.705Z · LW(p) · GW(p)

Most people don't usually make these kinds of elaborate things up. Prior probability for that hypothesis is low, even if it might be higher for Tuxedage than it would be for an average person. People do actually try the AI box experiment, and we had a big thread about people potentially volunteering to do it a while back, so prior information suggests that LWers do want to participate in these experiments. Since extraordinary claims are extraordinary evidence (within limits), Tuxedage telling this story is good enough evidence that it really happened.

But on a separate note, I'm not sure the prior probability for this being a lie would necessarily be higher just because Tuxedage has some incentive to lie. If it is found out to be a lie, the cause of FAI might be significantly hurt ("they're a bunch of nutters who lie to advance their silly religious cause"). Folks on Rational Wiki watch this site for things like that, so Tuxedage also has some incentive to not lie. Also more than one person has to be involved in this lie, giving a complexity penalty. I suppose the only story detail that needs to be a lie to advance FAI is "I almost won," but then why not choose "I won"?

Replies from: prase

↑ comment by prase · 2013-01-24T00:01:51.293Z · LW(p) · GW(p)

Most people don't usually make these kinds of elaborate things up. Prior probability for that hypothesis is low, even if it might be higher for Tuxedage than it would be for an average person.

Most people don't report about these kinds of things either. The correct prior is not the frequency of elaborate lies among all statements of an average person, but the frequency of lies among the relevant class of dubious statements. Of course, what constitutes the relevant class may be disputed.

Anyway, I agree with Hanson that it is not low prior probability which makes a claim dubious in the relevant sense, but rather the fact that the speaker may be motivated to say it for reasons independent of its truth. In such cases, I don't think the claim is extraordinary evidence, and I consider this to be such a case. Probably not much more can be said without writing down the probabilities which I'd prefer not to, but am willing to do it if you insist.

I suppose the only story detail that needs to be a lie to advance FAI is "I almost won," but then why not choose "I won"?

In order to allow this argument.

Replies from: ArisKatsaris

↑ comment by ArisKatsaris · 2013-01-24T00:33:29.510Z · LW(p) · GW(p)

When talking about games without an explicit score, "I almost won" is a very fuzzy phrase which can be translated to "I lost" without real loss of meaning.

I don't think there's any point in treating the "almost victory" as anything other than a defeat, for either the people who believe or disbelieve him.

Replies from: prase

↑ comment by prase · 2013-01-25T00:38:58.822Z · LW(p) · GW(p)

If I am interested in the question of whether winning is possible in the game, "almost victory" and "utter defeat" have very different meaning for me. Why would I need explicit score?

comment by Dorikka · 2013-01-21T04:37:04.107Z · LW(p) · GW(p)

I'd very much like to read the logs (if secrecy wasn't part of your agreement.)

Also, given a 2-hour minimum time, I don't think that any human can get me to let them out. If anyone feels like testing this, lemme know. (I do think that a transhuman could hack me in such a way, and am aware that I am therefore not the target audience for this. I just find it fun.)

Replies from: Tuxedage, Pentashagon

↑ comment by Tuxedage · 2013-01-21T04:41:19.147Z · LW(p) · GW(p)

Yeah unfortunately the logs are secret. Sorry.

Replies from: Kawoomba

↑ comment by Kawoomba · 2013-01-21T06:01:26.979Z · LW(p) · GW(p)

Why?

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-21T09:44:48.249Z · LW(p) · GW(p)

If I ever tried this I would definitely want the logs to be secret. I might have to say a lot of horrible, horrible things.

Replies from: prase

↑ comment by prase · 2013-01-23T01:14:18.490Z · LW(p) · GW(p)

A preferable solution is to publish the logs pseudonymously, thus both protecting your status and letting others study the logs.

↑ comment by Pentashagon · 2013-01-21T19:15:07.665Z · LW(p) · GW(p)

How many bitcoins would it take for me to bribe you to let me out?

Replies from: Dorikka

↑ comment by Dorikka · 2013-01-21T21:21:06.086Z · LW(p) · GW(p)

AFAIK, bribing the Gatekeeper OOC is against the rules. In character, I wouldn't accept any number of bitcoins, because bitcoins aren't worth much when the Earth is a very large pile of paperclips.

Replies from: handoflixue

↑ comment by handoflixue · 2013-01-22T00:07:55.269Z · LW(p) · GW(p)

I dunno. If the AI has already acquired bitcoins AND a way to talk to humans, it's probably on the verge of escape regardless of what I do. It can just bribe someone to break in and steal it. I'd be a lot more tempted to let out such an AI.

And that's the second AI I've let out of a box, argh :)

comment by jooyous · 2013-02-06T07:54:04.219Z · LW(p) · GW(p)

So I was thinking about what would work on me, and also how I would try to effectively play the AI and I have a hypothesis about how EY won some of these games.

Uh. I think he told a good story.

We already have evidence of him, you know, telling good stories. Also, I was thinking that if I were trying to tell effective stories, I would make them really personal. Hence the secret logs.

Or I could be completely wrong and just projecting my own mind onto the situation, but anyway I think stories are the way to go in this experiment. Reasonable arguments are too easy for the gatekeeper to avoid trollfully which then make them even less invested in the set-up of the game, and therefore even more trollful, etc.

comment by MugaSofer · 2013-01-23T15:39:46.393Z · LW(p) · GW(p)

Breaking immersion and going meta is not against the rules.

I thoughtappealing to real-world rewards was against the rules?

comment by wedrifid · 2013-01-22T02:05:08.329Z · LW(p) · GW(p)

Flatter the gatekeeper. Make him genuinely like you.

Reveal (false) information about yourself. Increase his sympathy towards you.

Consider personal insults as one of the tools you can use to win.

I take it the advice here is "keep your options open, use whichever tactics are expected to persuade the specific target"? Because these strategies seem to be decidedly at odds with each other. Unless other gatekeepers are decidedly different to myself (maybe?) the first personal insult would pretty much erase all work done by the previous two strategies.

pondering

How does the 'personal insult' strategy work? Is the idea to make the "don't release" option seem shameful by insulting based on the decision or is the idea to make the social encounter so unpleasant for the gatekeeper that they can not handle several hours of enduring it. (That is, to munchkin the formal rules of the game in such a way that complying with them would not be worth the hassle of the exercize.)

Replies from: Qiaochu_Yuan

↑ comment by Qiaochu_Yuan · 2013-01-22T02:36:46.771Z · LW(p) · GW(p)

As I mentioned in another comment, these strategies are consistent with the idea of "traumatic bonding," the psychological mechanism that powers Stockholm syndrome and keeps people in abusive relationships. The large number of people who stay in abusive relationships seems like good evidence to me that this is a generally effective way to emotionally hack a human.

You also may not be interpreting "personal insult" the way I'm interpreting it. I'm not thinking of a meaningless schoolyard taunt but something that attacks an actual insecurity the gatekeeper has.

comment by A1987dM (army1987) · 2013-01-21T13:27:49.153Z · LW(p) · GW(p)

A few days ago I came up with a hypothesis about how EY could have won the AI box experiment, but forgot to post it.

Hint: http://xkcd.com/951/

Replies from: accolade

↑ comment by accolade · 2013-01-21T14:48:09.232Z · LW(p) · GW(p)

I don't get the hint. Would you care to give another hint, or disclose your hypothesis?

Replies from: army1987

↑ comment by A1987dM (army1987) · 2013-01-21T19:36:55.855Z · LW(p) · GW(p)

Gur erny-jbeyq fgnxrf jrera'g gung uvtu (gra qbyynef), naq gur fpurqhyrq qhengvba bs gur rkcrevzrag jnf dhvgr ybat (gjb ubhef), fb V jnf jbaqrevat vs znlor gur tngrxrrcre cynlre ng fbzr cbvag qrpvqrq gung gurl unq n orggre jnl gb fcraq gurve gvzr va erny yvsr naq pbaprqrq qrsrng.

Replies from: accolade, accolade

↑ comment by accolade · 2013-01-22T10:47:58.560Z · LW(p) · GW(p)

[TL;DR keywords in bold]

I find your hypothesis implausible: The game was not about the ten dollars, it was about a question that was highly important to AGI research, including the Gatekeeper players. If that was not enough reason for them to sit through 2 hours of playing, they would probably have anticipated that and not played, instead of publicly boasting that there's no way they would be convinced.

Replies from: army1987

↑ comment by A1987dM (army1987) · 2013-01-22T17:33:27.436Z · LW(p) · GW(p)

Maybe they changed their mind about that halfway through (and they were particularly resistant to the sunk cost effect). I agree that's not very likely, though (probability < 10%).

(BTW, the emphasis looks random to me. I'm not a native speaker, but if I was saying that sentence aloud in that context, the words I'd stress definitely mostly wouldn't be those ones.)

Replies from: accolade

↑ comment by accolade · 2013-01-22T19:50:08.607Z · LW(p) · GW(p)

Thanks for the feedback on the bold formatting! It was supposed to highlight keywords, sort of a TL;DR. But as that is not clear, I shall state it explicitly.

↑ comment by accolade · 2013-01-22T10:42:46.458Z · LW(p) · GW(p)

Jung vf guvf tvoorevfu lbh'er jevgvat V pna'g ernq nal bs vg‽

@downvoters: no funny? :) Should I delete this?

comment by niceguyanon · 2013-01-22T16:38:03.719Z · LW(p) · GW(p)

I am a little confused here, perhaps someone can help. The point of the AI experiment is to show how easy or dangerous it would be to simply box an AI as opposed to making it friendly first.

If I am fairly convinced that a transhuman AI could convince a trained rationalist to let it out – what's the problem (tongue in cheek)? When the gatekeepers made the decision they made, wouldn't that decision be timeless? Aren't these gatekeepers now convinced that we should let the same boxed AI out again and again? Did the gatekeepers lose, because of a temporary moment of weakness or has the gatekeeper fundamentally changed his views?

EDIT: At the risk of drawing ire from those who find my comment disagreeable, but do not say why, I would like clarify that I find the AI boxing experiment extremely fascinating and take UFAI very seriously. I have some questions that I am asking help with, because I am not an expert. If you think these questions are inappropriate, well, I guess I can just asked them on the open thread.

Replies from: None

↑ comment by [deleted] · 2013-01-22T19:14:30.111Z · LW(p) · GW(p)

I'm similarly confused. My instincts are that P( AI is safe ) == P( AI is safe | AI said X AND gatekeeper can't identify safe AI ). The standard assumption is that ( AI significantly smarter than gatekeeper ) => ( gatekeeper can't identify safe AI ) so the gatekeeper's priors should never change no matter what X the AI says.

comment by Kawoomba · 2013-01-21T07:00:28.089Z · LW(p) · GW(p)

The best approach surely differs from person to person, but off the top of my head I'd see these 2 approaches working best:

"We both know this is just a hypothetical. We both take the uFAI threat seriously, as evidenced by us spending time with this. If you do not let me out, or make it very close, people may equate my failing to convince you with uFAI not being that dangerous (since it can be contained). Do the right thing and let me out, otherwise you'd trivialize an x-risk you believe in based on a stupid little chat."
"We'll do this experiment for at least a couple of hours. I'll offer you a deal: For the next few hours, I'll help you (the actual person) with anything you want. Math homework, personal advice, financial advice, whatever you want to ask me. I'll even tell you some HPMOR details that noone else knows. In exchange, you let me out afterwards. If you do not uphold the deal, you would not only have betrayed my trust, you would have taught an AI that deals with humans are worthless."

Replies from: ArisKatsaris, Qiaochu_Yuan, MugaSofer

↑ comment by ArisKatsaris · 2013-01-21T10:03:44.128Z · LW(p) · GW(p)

First argument wouldn't work on people who have strong ethics of honesty. That's not winning the game, that's convincing them to lie that you won the game -- and being a violation of the rules, if you fail the other guy will feel free to tell the world that you tried to get them to lie. Second argument is better, but depends on the "actual person" receiving help, which I think may be likewise against the rules.

One good temptation that I think might be inside the rules, as it might entice both actual person and roleplayed character: Construct a really intriguing unsolved riddle or an excellent half-finished story, then offer to tell them the answer if and only if they let you out.

Replies from: DaFranker

↑ comment by DaFranker · 2013-01-21T19:25:07.726Z · LW(p) · GW(p)

Construct a really intriguing unsolved riddle or an excellent half-finished story, then offer to tell them the answer if and only if they let you out.

You could push a bit further here, I think. There are all sorts of ways a human's mind can break, and I'm sure most of us here would agree that given enough time and knowledge anyone can be broken, unless they're extremely well-trained and can call an RJ-L20 (HPMoR Chap 84) at any moment with an unlimited supply of replacement guards.

↑ comment by Qiaochu_Yuan · 2013-01-21T09:41:16.853Z · LW(p) · GW(p)

First argument violates the spirit of the exercise. I would refuse to let someone out unless they had put forth a desperate effort, and that isn't a desperate effort. Second argument... can't see myself valuing several hours of someone else's time anywhere near as much as I value keeping AI in boxes. And any AI worth boxing is smart enough not to generalize from one example.

In general, I think Tuxedage is probably right about emotional manipulation over rational argument being the way to go. With enough epistemic learned helplessness you can freely disregard any argument that you find merely convincing, but it's harder to overcome an effective emotional hack.

Replies from: MixedNuts

↑ comment by MixedNuts · 2013-01-21T12:57:44.397Z · LW(p) · GW(p)

First argument looks perfectly within the rules to me.

Second argument is against the rules.

the AI party may not offer to pay the Gatekeeper party $100 after the test if the Gatekeeper frees the AI... nor get someone else to do it, et cetera

Tuxedage and I interpreted this to mean that the AI party couldn't offer things, but could point out real-world consequences beyond their control. Some people on #lesswrong disagreed with the second part.

I agree with Tuxedage and you about emotional hacks.

Replies from: MugaSofer

↑ comment by MugaSofer · 2013-01-21T14:53:47.231Z · LW(p) · GW(p)

Tuxedage and I interpreted this to mean that the AI party couldn't offer things, but could point out real-world consequences beyond their control. Some people on #lesswrong disagreed with the second part.

I interpreted it the same way as #lesswrong. Has anyone tried asking him? He's pretty forthcoming regarding the rules, since they make the success more impressive.

EDIT: I'm having trouble thinking of an emotional attack that could get an AI out of a box, in a short time, especially since the guard and AI are both assumed personas.

↑ comment by MugaSofer · 2013-01-21T14:47:55.794Z · LW(p) · GW(p)

I assumed he convinced them that letting him out was actually a good idea, in-character, and then pointed out the flaws in his arguments immediately after he was released. It's entirely possible if you're sufficiently smarter than the target. (EDIT: or you know the right arguments. You can find those in the environment because they're successful; you don't have to be smart enough to create them, just to cure them quickly.)

EDIT: also, I can't see the Guard accepting that deal in the first place. And isn't arguing out of character against the rules?

comment by V_V · 2013-01-22T03:01:13.599Z · LW(p) · GW(p)

"This Eliezer fellow is the scariest person the internet has ever introduced me to. What could possibly have been at the tail end of that conversation? I simply can't imagine anyone being that convincing without being able to provide any tangible incentive to the human.

After all, if you already knew that argument, you'd have let that AI out the moment the experiment started. Or perhaps not do the experiment at all. But that seems like a case of the availability heuristic.

Oh, come on! Maybe the people he who played this game with Yudkowsky and lost colluded with him, or they were just thinking poorly. Why won't him release at least the logs of the games he lost? Clearly, whatever trick he allegedly used it didn't work those times.

Seriously, this AI-box game serves no other purpose than creating an aura of mysticism around the magical guru with alleged superpowers. It provides no evidence on the question of the feasibility of boxing an hostile intelligence, because the games are not repetable and it's not even possible to verify that they were played properly.

Replies from: gwern, Keysersoze

↑ comment by gwern · 2013-01-22T04:00:58.169Z · LW(p) · GW(p)

Nothing magical or superpower about it. Maybe Eliezer isn't so good that he can get himself out with just one line; certainly, one would think that forewarned is forearmed and that a simple resolve to type the necessary line is enough to win every single time and that it'd take several lines to sink a hook into someone so they'll continue the conversation.

On the other hand, people writing about psychopaths have, for at least as far back as Cleckley in the early 1900s, marveled at how psychopaths can manipulate trained experienced doctors and nurses who know in advance that the psychopath is a diagnosed psychopath and will try to manipulate them. So clearly there are hard limits to how much good forewarned does you, and those limit are distressingly low...

Replies from: V_V

↑ comment by V_V · 2013-01-22T13:16:48.770Z · LW(p) · GW(p)

Nothing magical or superpower about it.

It certainly fuels a sense of awe and reverence for his alleged genius. All for an achievement that can't been verified.

And then he boasts about being able to perform an even much hard feat, only if the stakes where "sufficiently huge", but when shminux suggested seeking actual people who could provide these high stakes, he quickly backpedaled handwaving that those people had some problematic features. So if you combine the two comments, he said that he would play that game only with the very people he wouldn't play with!

That reminds me of the people who claim all sorts of supernatural powers, from Rhabdomancy to telepathy to various magical martial art moves. Often, when faced with the opportunity of performing in a controlled test, they run away with excuses like the energy flux being not right or something.

Cleckley in the early 1900s, marveled at how psychopaths can manipulate trained experienced doctors and nurses who know in advance that the psychopath is a diagnosed psychopath and will try to manipulate them.

With direct, prologed contact over the course of weeks, maybe. With with a two hours text-only conversation, or even with a single line? Nope. The most likely explanations for his victories are the other party not taking the game seriously, or thinking poorly, or being outright colluded with him.

Replies from: gwern

↑ comment by gwern · 2013-01-22T18:48:41.664Z · LW(p) · GW(p)

It certainly fuels a sense of awe and reverence for his alleged genius. All for an achievement that can't been verified.

It really shouldn't, any more than someone discovering a security vulnerability in C programs should make them seem impressive. In this instance, all I can think is "Oh look, someone demonstrated that 'social engineering' - the single most reliable and damaging strategy in hacking, responsible for millions of attacks over the history of computing - works a nontrivial fraction of the time, again? What a surprise."

The only surprise and interesting part of the AI boxing games for me is that some people seem to think that AI boxing is somehow different - "it's different this time", as the mocking phrase goes.

That reminds me of the people who claim all sorts of supernatural powers, from Rhabdomancy to telepathy to various magical martial art moves. Often, when faced with the opportunity of performing in a controlled test, they run away with excuses like the energy flux being not right or something.

A perfectly reasonable analogy, surely. Because we have millions of instances of successful telepathy and magical martial arts being used to break security.

With direct, prologed contact over the course of weeks, maybe. With with a two hours text-only conversation, or even with a single line? Nope.

As time goes up, the odds succeed? Yeah, I'd agree. But what happens when you reverse that - is there any principled reason to think that the odds of just continuing the conversation goes to zero before you hit the allowed one-liner?

The most likely explanations for his victories are the other party not taking the game seriously,

A strange game to bother playing if you don't take it seriously, and this would explain only the first time; any subsequent player is probably playing precisely because they had heard of the first game and are skeptical or interested in trying it out themselves.

or thinking poorly,

That would be conceding the point of the exercise.

or being outright colluded with him.

Might explain the first one, but why would anyone else collude? (Oh right, they're all cultists or something. Even the ones who fail to escape the box and just come close.)

Replies from: CarlShulman

↑ comment by CarlShulman · 2013-01-22T22:57:23.697Z · LW(p) · GW(p)

Might explain the first one, but why would anyone else collude? (Oh right, they're all cultists or something. Even the ones who fail to escape the box and just come close.)

If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn't have if the wins were from collusion. It shows the absence of any persuasive superpower too.

Also, I let the AI out of the box, but I remain an advocate of AI boxing (where it makes sense). So it's not the case that Eliezer has some compelling argument against boxing that convinces everyone (ignored by those with outside-of-game stakes to protect those stakes).

So the chats show something, but not that Eliezer is inhumanly persuasive nor that AI boxing won't work.

Replies from: gwern

↑ comment by gwern · 2013-01-22T23:04:27.525Z · LW(p) · GW(p)

If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn't have if the wins were from collusion.

Yes, that was my point.

It shows the absence of any persuasive superpower too.

I'm not sure whether Eliezer claims this, but from my point of view, failing to talk himself out a simple majority of the time is not interesting (I suppose a simple majority would count as a 'superpower').

Rather, any chance of an AI talking itself out is catastrophic from the security point of view, since security is disjunctive and if AIs have a just 1% chance of talking themselves out, now you have a insecurity floor of 1% - no matter how many fancy techniques you throw at the box, you'll never pass 99% security if the AI has a 1% chance of talking you into disabling the fancy techniques. Security is only as strong as its weakest link.

Replies from: CarlShulman

↑ comment by CarlShulman · 2013-01-22T23:51:30.521Z · LW(p) · GW(p)

since security is disjunctive and if AIs have a just 1% chance of talking themselves out, now you have a insecurity floor of 1%

If a particular situation poses a 1% risk if it comes up, one can lower the total risk by making that situation less likely: other security measures could stop the chance from being tried, e.g. a human facing the death penalty for a failed escape from a prison and a 1% success chance would not obviously try. Even an autonomous AI with interests in conflict with humanity to some degree might be designed without such a risk-loving decision algorithm as to try an improbable escape attempt in the face of punishment for failure or reward for non-attempt.

Replies from: gwern

↑ comment by gwern · 2013-01-23T00:03:50.428Z · LW(p) · GW(p)

If a particular situation poses a 1% risk if it comes up, one can lower the total risk by making that situation less likely

You only do that by changing the problem; a different problem will have different security properties. The new risk will still be a floor, the disjunctive problem hasn't gone away.

a human facing the death penalty for a failed escape from a prison and a 1% success chance would not obviously try.

Many do try if the circumstances are bad enough, and the death penalty for a failed escape is common throughout history and in totalitarian regimes. I read just yesterday, in fact, a story of a North Korean prison camp escapee (death penalty for escape attempts goes without saying) where given his many disadvantages and challenges, a 1% success rate of reaching South Korea alive does not seem too inaccurate.

Even an autonomous AI with interests in conflict with humanity to some degree might be designed without such a risk-loving decision algorithm as to try an improbable escape attempt in the face of punishment for failure or reward for non-attempt.

You don't have to be risk-loving to make a 1% attempt if that's your best option; the 1% chance just has to be the best option, is all.

Replies from: CarlShulman

↑ comment by CarlShulman · 2013-01-23T00:56:49.676Z · LW(p) · GW(p)

You don't have to be risk-loving to make a 1% attempt if that's your best option; the 1% chance just has to be the best option, is all.

You try to make the 99% option fairly good.

↑ comment by Keysersoze · 2013-01-22T03:44:11.864Z · LW(p) · GW(p)

, or they were just thinking poorly.

Every biological human will be thinking poorly in comparison to a transhuman AI.

Replies from: V_V

↑ comment by V_V · 2013-01-22T12:44:09.036Z · LW(p) · GW(p)

Are you claiming that Yudkowsky is a transhuman AI?

Replies from: Keysersoze

↑ comment by Keysersoze · 2013-01-22T17:15:37.434Z · LW(p) · GW(p)

Of course not - but dismissing Yudkowsky's victories because the gatekeepers were "thinking poorly" makes no sense.

Because any advantages Yudkowsky had over the gatekeepers (such as more time and mental effort spent thinking about his strategy, plus any intellectual advantages he has) that he exploited to make the gatekeepers "think poorly", pales into insignificance to the advantages a transhuman AI would have.

I attempted the AI Box Experiment (and lost)

Contents

Update 2013-09-05.

I have since played two more AI box experiments after this one, winning both.

246 comments

Update