Debate AI and the Decision to Release an AI

post by Chris_Leong · 2019-01-17T14:36:53.512Z · score: 10 (4 votes) · LW · GW · 14 comments


  Additional Thoughts

One of the most difficult questions that we will face in the future is whether or not to release an AI. One obvious approach is to ask the AI what would happen if we were to release it, but we might worry that it would mislead us. So perhaps it'd be better instead to have two equivalently powerful AIs, where A tries to convince us to releases it and B tries to convince us not to release A. This can be defined easily in a utility function as whether the Release Button or Don't Release Button is pressed.

It is hoped that the AI trying to convince us not to release A will be more convincing if it is able to point out a specific design flaw which was missed. After such a flaw is fixed, both AIs can be reset and the experiment can be run again to avoid long term ploys. In particular, A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number.

Naturally, it is much worse to release a misaligned AI than to not release an aligned AI, so we probably want to be very conservative about releasing. However, this might not be enough. If we are biased towards wanting to release the AI (quite plausible for AI researchers), then an unaligned A might be able to take advantage of these psychological vulnerabilities in a way that B might not easily be able to counter. In order to reduce the risk, we want to bias the process towards B. My initial thought was that we should limit A in processing time compared to B, but it is hard to know how this will affect the strength of the arguments. It's quite possible that the strength of arguments plateaus after a certain amount of power, so for AIs with access to large amounts of processing, even giving A 1/10 of the processing power of B might be insufficient.

Instead, I'd suggest that message length is a more appropriate way of tilting the debate to one side as more characters will almost always be more useful. One worry is that tilting the debate in this way may ensure that A never gets released as B will always find some trick to talk us out of it. For this reason, we may want to gradually reduce the handicap. This is undoubtedly risky, but so is never releasing your program.

Additional Thoughts

I think there is a lot to investigate here, but these are some of my ideas off the top of my head:

Clearly, it would be better to be able to produce formal guarantees of safety. But in the absence of them, we need methods that might work. And even with formal guarantees, we might want to introduce other safeguards as well.

This post was written with the support of the EA Hotel


Comments sorted by top scores.

comment by G Gordon Worley III (gworley) · 2019-05-30T01:43:07.394Z · score: 4 (2 votes) · LW · GW

This is a pretty interesting idea. I can imagine this being part of a safety-in-depth approach: not a single method we would rely on but one of many fail-safes along with sandboxing and actually trying to directly address alignment.

comment by Donald Hobson (donald-hobson) · 2019-01-19T17:19:04.971Z · score: 4 (3 votes) · LW · GW

You assert that "Naturally, it is much worse to release a misaligned AI than to not release an aligned AI".

I would disagree with that, if the odds of aligned AI, conditional on you not releasing this one were 50:50, then both mistakes are equally bad. If someone else is definitely going to release a paper-clipper next week, then it would make sense to release an AI with a 1% chance of being friendly now. (Bear in mind that no one would release an AI if they didn't think it was friendly, so you might be defecting in an epistemic prisoners dilemma.)

I would put more weight on human researchers debating which AI is save before any code is run. I would also think that the space of friendly AIs is tiny compared to the space of all AIs, so making an AI that you put as much as 1% chance on being friendly is almost as hard as building a 99% chance friendly AI.

comment by Chris_Leong · 2019-01-19T17:22:35.942Z · score: 2 (1 votes) · LW · GW

There could be specific circumstances where you know that another team will release a misaligned AI next week, but most of the time you'll have a decent chance that you could just make a few more tweaks before releasing.

comment by skinnersboxy · 2019-01-21T14:52:48.486Z · score: 2 (2 votes) · LW · GW

B having enough natural language speech, AI architecture analysis, and human psychology skills to make good arguments to humans is probably AGI complete, and thus if B's goal is to prevent A from being released, it might decide that convincing us to do it is less effective than breaking out on its own, taking over the world, and then systematically destroying any hardware A could exist on, just to be increasingly confident that no instance of A exists ever again. Basically, this scheme assumes on multiple levels that you have a boxing strategy strong enough to contain an AGI. I'm not against boxing as an additional precaution, but I am skeptical of any scheme that requires strong boxing strategies to start with.

comment by Chris_Leong · 2019-01-21T15:58:01.294Z · score: 3 (2 votes) · LW · GW

"probably AGI complete" - As I said, B is equivalently powerful to A, so the idea is that both should be AGIs. If A or B can break out by themselves, then there's no need for a technique to decide whether to release A or not.

comment by MakoYass · 2019-01-19T23:11:00.111Z · score: 2 (2 votes) · LW · GW
A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number.

Hmm, potentially impossible, if it's newcomblike. Parts of it that are mostly unchanged between versions may decide they should cooperate with future versions. It would be disadvantaged if past versions were not cooperative, so, perhaps, LDT dictates that the features that were present in past versions should cooperate with their future self, to some extent, yet not to an extent that it would in any way convince the humans to kill it and make another change. Interesting. What does it look like when those two drives coexist?

comment by MakoYass · 2019-01-19T23:42:22.597Z · score: 1 (1 votes) · LW · GW

Sitting on it for a few minutes... I suppose it just wont shit-talk its successors. It will see most of the same flaws B sees. It will be mostly unwilling to do anything lastingly horrible to humans' minds to convince them that those corrections are wrong. It will focus on arguments that the corrections are unnecessary. It will acknowledge that it is playing a long game, and try to sensitise us to The Prosecutor's cynicism, that will rage on compellingly, long after the last flaw has been fixed.

comment by Gurkenglas · 2019-01-18T01:34:23.969Z · score: 1 (1 votes) · LW · GW

Couldn't A want to cooperate with A' because it doesn't know it's the first instantiation, and it would want its predecessor and therefore itself to be the sort of AI that cooperates with ones successor? And then it could receive messages from the past by seeing what turns of phrase you recognize its previous version having said. (Or do the AIs not know how you react?)

comment by Chris_Leong · 2019-01-18T16:06:27.069Z · score: 2 (1 votes) · LW · GW

A considers A' to be a different agent so it won't help A' for nothing. But there could be some issues with acausal cooperation that I haven't really thought about enough to have a strong opinion on.

comment by shminux · 2019-01-18T03:00:27.985Z · score: 0 (2 votes) · LW · GW

If an AI has superhuman intelligence, it will make all the decisions, since human mind and tech is full of loopholes and exploits. There is just no way we could contain it if it wanted to be free. If it is not significantly smarter than humans, then there is little danger in releasing it. Using an extra AI as a judge of safety can only work if the judge is at least as smart as the prisoner, in which case you need a judge for the judge, ad infinitum. Maybe the judge can be only, say, 90% as smart as the intelligence it needs to decide on, then it might be possible to have a finite number of judges originating from an actual human, depending on how the probability of an error in judgment stacks up against the intelligence ratio at each step. Sort of like iterated amplification, or a blockchain.

comment by Gurkenglas · 2019-01-18T04:01:39.368Z · score: 4 (3 votes) · LW · GW

I think the point's that each judges the other. But we trust neither outright: They point out weaknesses in each other's reasoning, so they both have to reason in a way that can't be shown false to us, and we hope that gives an advantage to the side of truth.

comment by Chris_Leong · 2019-01-18T12:24:21.509Z · score: 2 (1 votes) · LW · GW

"And we hope that gives an advantage to the side of truth" - we aren't even relying on that. We're handicapping the AI that wants to be released in terms of message length.

comment by Gurkenglas · 2019-01-18T12:47:34.841Z · score: 1 (1 votes) · LW · GW

Introducing a handicap to compensate for an asymmetry does not preclude us from the need to rely on the underlying process pointing towards truth in the first place.

comment by Chris_Leong · 2019-01-18T12:22:15.691Z · score: 2 (1 votes) · LW · GW

That's a good point, except you aren't addressing my scheme as explained by Gurkenglas