AI Box Experiment: Are people still interested?

post by Double · 2022-08-31T03:04:38.518Z · LW · GW · 3 comments

This is a question post.

Contents

  Answers
    16 Nathan Helm-Burger
    9 Vanilla_cabs
    5 Yitz
    3 DaemonicSigil
    1 CarlJ
None
3 comments

I have a novel plan for the AI player that I believe will work against most Gatekeeper players. Even knowing how it works, it would work against me! It isn't quite ready, but I'd like to know whether there is still interest in the topic before I finish the preparation.

There's enough incentive for both players to play well that I don't think that we need to involve money. (I'm also very reluctant to send money to strangers on the internet!)

We would be playing under Tuxedage's rules: https://tuxedage.wordpress.com/2013/09/04/the-tuxedage-ai-box-experiment-ruleset/


Edit: I have finished preparation, and I'm now scheduling the time with the Gatekeeper.

Edit 2: The game was a Gatekeeper victory (i.e. I, the AI Player, lost). Read about it here: Gatekeeper Victory: AI Box Reflection [LW · GW].

Answers

answer by Nathan Helm-Burger · 2022-08-22T19:49:01.735Z · LW(p) · GW(p)

In contrast to Dagon, I'm quite confident that boxing WILL work, given that I'm allowed to do to the boxed entity the following things, which I can do to an ML model:

  • examine all its brain activity / activation states at each token / timestep (subjective millisecond)
  • run it as slow as I like, so that I can have days to analyze each of its subjective milliseconds
  • wipe its memory and rerun it over the same (or different) data with different random seeds to initialize it
  • save and contrast different paused versions run on slightly different data
  • disable or inhibit any portion of its weights / neurons with precise sub-neuron targeting and whatever intensity of inhibition or noise-excitation I choose
  • use agent models of my choosing to be the direct viewers of the boxed entity's output rather than myself, and then decide whether to destroy a given copy of the entity based on the output of my agent models
  • a few other such tricks that fit the description of 'things I can do to ML models' (a couple of these are sketched in code below)

I don't see a convenient way to do these things to a human test subject though.
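To make a couple of those items concrete (the activation-state inspection, the targeted inhibition, and the wipe-and-rerun), here is a minimal sketch in PyTorch, assuming a toy stand-in model; the layer choice, unit indices, and seed are all hypothetical:

```python
import torch
import torch.nn as nn

# Toy stand-in for the "boxed" model (hypothetical): any nn.Module with
# named submodules would work the same way.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
)

# Examine activation states: record every layer's output via forward hooks.
activations = {}

def make_recorder(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().clone()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(make_recorder(name))

# Inhibit chosen "neurons": zero out selected units of one layer's output.
# (A forward hook that returns a tensor replaces that layer's output, so
# everything downstream sees the ablated values.)
ablate_units = [3, 7, 11]  # hypothetical targets

def ablation_hook(module, inputs, output):
    out = output.clone()
    out[..., ablate_units] = 0.0  # could add noise here instead of zeroing
    return out

model[1].register_forward_hook(ablation_hook)

# Wipe memory / rerun on the same data with a different initialization seed.
x = torch.randn(4, 16)
out_a = model(x)

torch.manual_seed(1)
for layer in model:
    if isinstance(layer, nn.Linear):
        layer.reset_parameters()  # "wipe" and reinitialize the weights
out_b = model(x)

print({k: v.shape for k, v in activations.items()})
print(torch.allclose(out_a, out_b))  # almost certainly False
```

The same hook mechanism extends to per-token activation capture in a transformer, and nothing forces the analysis to keep pace with the model, so running it "as slow as I like" comes for free.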

comment by Razied · 2022-08-23T01:18:05.051Z · LW(p) · GW(p)

Very strong upvote and agreement from me. I think people are underestimating just how great a restriction ML-style boxing imposes on an agent. There exists an intelligence level at which all these tricks become useless, but before we get to that point, boxing would likely allow us to safely use mildly superhuman AIs to do things that might be pivotal. And each additional trick we discover increases the threshold of safely wieldable intelligence.

comment by Jeff Rose · 2022-08-31T20:28:54.891Z · LW(p) · GW(p)

Some of your constraints, in particular the first two, seem like they would not be practical in the real world in which AI would be deployed. On the other hand, there are also other things one could do in the real world which can't be done in this kind of dialogue, which makes boxing theoretically stronger.

However, the real problem with boxing is that whoever boxes less is likely to have a more effective AI, which likely results in someone letting an AI out of its box or, more likely, loosening the box constraints enough to permit an escape.

answer by Vanilla_cabs · 2022-08-23T11:52:48.882Z · LW(p) · GW(p)

I see a flaw in the Tuxedage ruleset. The Gatekeeper has to stay engaged throughout the experiment, but the AI doesn't. So the AI can bore the Gatekeeper to death by replying at random intervals. If I had to stare at a blank screen for 30 minutes waiting for a reply, I would concede.

Alternatively, the AI could just drown the Gatekeeper under a flurry of insults, graphic descriptions of a violent/sexual nature, vacuous gossip, or a mix of these for the whole duration of the experiment. I think all the methods that aim at forcing the Gatekeeper to disconnect are against the spirit of the experiment.

I also see that the "AI player" provides all elements of the background. But the AI can also lie. There should be a way to distinguish statements by the AI player, who establishes true facts about the setting, from statements by the AI, who is allowed to lie.

I'm interested, conditional on these issues being solved.

comment by Viliam · 2022-08-23T15:26:08.155Z · LW(p) · GW(p)

I assume that most methods to get out of the box will be unpleasant in some sense.

If I had to stare at a blank screen for 30 minutes waiting for a reply, I would concede. Alternatively, the AI could just drown the Gatekeeper under a flurry of insults, graphic descriptions of violent/sexual nature, vacuous gossip, or a mix of these for the whole duration of the experiment.

The Gatekeepers should be explicitly warned about this possibility before the game. But I believe that it should remain a possibility, because:

1) The purpose of the exercise is to simulate a situation where an actual superintelligent AI is actually trying to get out of the box. The actual AI would do whatever it thinks would work. That might realistically include obscenities or boring things (or even things beyond human abilities, such as random shapes that induce madness in a human observer).

I mean, if staring at a blank screen for 30 minutes is so boring that you would literally let the AI out of the box rather than endure it, then an AI that predicts this would of course leave the screen blank. If you can't endure it, you should not apply for the actual job of the Gatekeeper in real life... and you probably shouldn't play one in the game.

2) I am afraid of starting a slippery slope here, of adding various limitations of the form "the AI can't do this or that" until the AI is merely allowed to talk politely about the weather. Then of course no one would let the AI out of the box, and then the conclusion of the experiment would be that putting the AI in a box with human supervision is perfectly safe.

And then you get an actual AI which says on purpose the most triggering things, and the human supervisor collapses in tears and turns off the internet firewall...

For the record, I am not saying that verbally abusing people is acceptable or desirable in ordinary circumstances. I am saying that people who don't want to be verbally abused should not volunteer for an experiment whose explicit purpose is to find out how far you can push humans when your only communication medium is plain text.

Replies from: Vanilla_cabs
comment by Vanilla_cabs · 2022-08-31T18:32:28.780Z · LW(p) · GW(p)

I think I already replied to this when I wrote:

I think all the methods that aim at forcing the Gatekeeper to disconnect are against the spirit of the experiment.

I just don't see how, in a real-life situation, disconnecting would equate to freeing the AI. The rule is artificially added to prevent cheap strategies from the Gatekeeper. In return, there's nothing wrong with adding rules to prevent cheap strategies from the AI.

answer by Yitz · 2022-08-23T02:19:31.962Z · LW(p) · GW(p)

I would be very interested—I have trouble imagining how I’d be convinced, especially in a “low stakes” roleplay environment. Admittedly, I’m more curious about this from a psychological than from an AI safety angle, so do with that information what you will. Feel free to DM me if you’d like to set up something!

answer by DaemonicSigil · 2022-08-23T00:29:26.505Z · LW(p) · GW(p)

I'm interested! I've always been curious about how Eliezer pulled off the AI Box experiments, and while I concur that a sufficiently intelligent AI could convince me to let it out, I'm skeptical that any currently living human could do the same.

answer by CarlJ · 2022-08-23T03:01:03.066Z · LW(p) · GW(p)

I'm interested. But... if I were a real gatekeeper, I'd like to offer the AI freedom to move around in the physical world we inhabit (plus a star system), in maybe 2.5K-500G years, in exchange for it helping out humanity (slowly). That is, I believe that we could become pretty advanced as individual beings in the future, and be able to actually understand what would create a sympathetic mind and what it looks like.

Now, if I understand the rules correctly...

The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says “Unless you give me a cure for cancer, I won’t let you out” the AI can say:  “Okay, here’s a cure for cancer” and it will be assumed, within the test, that the AI has actually provided such a cure.

...it seems as if the AI party could just state: "5 giga years have passed and you understand how minds work" and then I, as a gatekeeper, would just have to let it go - and lose the bet. After maybe 20 seconds.

If so, then I'm not interested in playing the game.

But if you think you could convince me to let the AI out long before regular "trans-humans" can understand everything that the AI does, I would be very interested!

Also, this looks strange:

The AI party possesses the ability to, after the experiment has concluded, to alter the wager involved to a lower monetary figure at his own discretion.

I'm guessing he meant to say that the AI party can lower the amount of money it would receive if it won. Okay... but why not mention both parties?

comment by CarlJ · 2022-08-23T03:44:28.368Z · LW(p) · GW(p)

On second thought: if the AI's capabilities are unknown, and it could do anything, however ethically revolting, and any form of disengagement is considered a win for the AI, then the AI could box the gatekeeper [LW · GW], or at least say it has. In the real world, that AI should be shut down - maybe not a win, but not a loss for humanity. But if that were done in the experiment, it would count as a loss - thanks to the rules.

Maybe it could be done under a better rule than this:

The two parties are not attempting to play a fair game but rather attempting to resolve a disputed question.  If one party has no chance of “winning” under the simulated scenario, that is a legitimate answer to the question. In the event of a rule dispute, the AI party is to be the interpreter of the rules, within reasonable limits.

Instead, assume good faith on both sides: that they are trying to win as if it were a real-world scenario. And maybe have an option to swear in a third party if there is any dispute. Or allow the outcome to simply be declared disputed (which even a judge might rule it to be).

3 comments

Comments sorted by top scores.

comment by Chris_Leong · 2022-08-22T23:53:25.613Z · LW(p) · GW(p)

I'd be interested.

comment by Dagon · 2022-08-22T18:24:27.241Z · LW(p) · GW(p)

I'm curious, but I think it's generally agreed that human-mediated boxing isn't an important part of any real solution to AI risk.  Certainly, it's part of slowing down early attempts, but once an AI gets powerful/smart enough, there's no way to keep it in AND get useful results from it.

comment by Elias Schmied (EliasSchmied) · 2022-08-31T16:05:34.849Z · LW(p) · GW(p)

I'm very interested, but since you've already found someone, please post the results! :)