What are the most plausible "AI Safety warning shot" scenarios?

post by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-26T20:59:58.491Z · LW · GW · 15 comments

This is a question post.

An "AI safety warning shot" is an event that causes a substantial fraction of the relevant human actors (governments, AI researchers, etc.) to become substantially more supportive of AI safety research and more worried about existential risks posed by AI.

For example, suppose we build an unaligned AI system which is "only" about as smart as a very smart human politician, and it escapes and tries to take over the world, but only succeeds in taking over North Korea before it is stopped. This would presumably have the "warning shot" effect.

I currently think that scenarios like this are not very plausible, because there is a very narrow range of AI capability between "too stupid to do significant damage of the sort that would scare people" and "too smart to fail at takeover if it tried." Moreover, within that narrow range, systems would probably realize that they are in that range, and thus bide their time rather than attempt something risky.

EDIT: To make more precise what I mean by "substantial": I'm looking for events that cause >50% of the relevant people who are at the time skeptical or dismissive of existential risk from AI to change their minds.


answer by paulfchristiano · 2020-03-27T16:00:02.205Z · LW(p) · GW(p)

I think "makes 50% of currently-skeptical people change their minds" is a high bar for a warning shot. On that definition e.g. COVID-19 will probably not be a warning shot for existential risk from pandemics. I do think it is plausible that AI warning shots won't be much better than pandemic warning shots. (On your definition it seems likely that there won't ever again be a warning shot for any existential risk.)

For a more normal bar, I expect plenty of AI systems to fail at large scales in ways that seem like "malice," and then to cover up the fact that they've failed. AI employees will embezzle funds, AI assistants will threaten and manipulate their users, AI soldiers will desert. Events like this will make it clear to most people that there is a serious problem, which plenty of people will be working on in order to make AI useful. The base rate will remain low but there will be periodic high-profile blow-ups.

I don't expect the kind of total unity of AI motivations you are imagining, where all of them want to take over the world (so that the only case where you see something frightening is a failed bid to take over the world). That seems pretty unlikely to me, though it's conceivable (maybe 10-20%?) and may be an important risk scenario. I think it's much more likely that we stamp out all of the other failures gradually, and are left with only the patient+treacherous failures, and in that case whether it's a warning shot or not depends entirely on how much people are willing to generalize.

I do think the situation in the AI community will be radically different after observing these kinds of warning shots, even if we don't observe an AI literally taking over a country.

There is a very narrow range of AI capability between "too stupid to do significant damage of the sort that would scare people" and "too smart to fail at takeover if it tried."

Why do you think this is true? Do you think it's true of humans? I think it's plausible if you require "take over a country" but not if you require e.g. "kill plenty of people" or "scare people who hear about it a lot."

(This is all focused on intent alignment warning shots. I expect there will also be other scary consequences of AI that get people's attention, but the argument in your post seemed to be just about intent alignment failures.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-27T19:54:47.978Z · LW(p) · GW(p)

Thanks for this reply. Yes, I was talking about intent alignment warning shots. I agree it would be good to consider smaller warning shots that convince, say, 10% of currently-skeptical people. (I think it is too early to say whether COVID-19 is a 50%-warning shot for existential risk from pandemics. If it does end up killing millions, the societal incompetence necessary to get us to that point will be apparent to most people, I think, and thus most people will be on board with more funding for pandemic preparedness even if before they would have been "meh" about it.) If we are looking at 10%-warning shots, smaller-scale things like you are talking about will be more viable.

(Whereas if we are looking at 50%-warning shots, it seems like at least attempting to take over the world is almost necessary, because otherwise skeptics will say "OK yeah so one bad apple embezzled some funds, that's a far cry from taking over the world. Most AIs behave exactly as intended, and no small group of AIs has the ability to take over the world even if it wanted to.")

I'm not imagining that they all want to take over the world. I was just imagining that minor failures wouldn't be sufficiently convincing to count as 50%-warning shots, and it seems you agree with me on that.

Yes, I think it's true of humans: Almost all humans are incapable of getting even close to taking over the world. There may be a few humans who have a decent shot at it and also the motivation and incaution to try it, but they are a very small fraction. And if they were even more competent than they already are, their shot at it would be more than decent. I think the crux of our whole disagreement here was just the thing you identified right away about 50% vs. 10% warning shots. Obviously there are plenty of humans capable and willing to do evil things, and if doing evil things is enough to count as a warning shot, then yeah it's not true of humans, and neither would it be true of AI.

I think you've also pointed out an unfairness in my definition, which was about single events. A series of separate minor events gradually convincing most skeptics is just as good, and now that you mention it, much more likely. I'll focus on these sorts of things from now on, when I think of warning shots.

comment by Kenny · 2020-04-02T01:39:42.294Z · LW(p) · GW(p)

(On your definition it seems likely that there won't ever again be a warning shot for any existential risk.)

The only risk I can think of that might meet the post's definition is an asteroid that has to be (successfully) diverted from colliding with the Earth.

But I think you might be right in that even that wouldn't convince "50% of currently-skeptical people [to] change their minds".

answer by johnswentworth · 2020-03-26T23:38:58.858Z · LW(p) · GW(p)

One of the basic problems in the embedded agency sequence [LW · GW] is: how does an agent recognize its own physical instantiation in the world, and avoid e.g. dropping a big rock on the machine it's running on? One could imagine an AI with enough optimization power to be dangerous, which gets out of hand but then drops a metaphorical rock on its own head - i.e. it doesn't realize that destroying a particular data center will shut itself down.

Similarly, one could imagine an AI which tries to take over the world, but doesn't realize that unplugging the machine on which it's running will shut it down - because it doesn't model itself as embedded in the world. (For similar reasons, such an AI might not see any reason to create backups of itself.)

Another possible safety valve: one could imagine an AI which tries to wirehead, but its operators put a lot of barriers in place to prevent it from doing so. The AI seizes whatever resources it needs to metaphorically smash those barriers, does so violently, then wireheads itself and just sits around.

Generalizing these two scenarios: I think it's plausible that unprincipled AI architectures tend to have built in safety valves - they'll tend to shoot themselves in the foot if they're able to do so. That's definitely not something I'd want to bet the future of the human species on, but it is a class of scenarios which would allow for an AI to deal a lot of damage while still failing to take over.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-27T00:17:22.443Z · LW(p) · GW(p)

Thanks. Hmm, I guess these still don't seem that plausible to me. What is your credence that something in the category you describe will happen, and count as a warning shot?

(It's possible that an AI might shoot itself in the foot, but before it does anything super scary, such that it doesn't have the warning shot effect.)

Note my edit to the original question about the meaning of "substantial."

comment by johnswentworth · 2020-03-27T00:34:38.422Z · LW(p) · GW(p)

I'd give it something in the 2%-10% range. Definitely not likely.

comment by romeostevensit · 2020-03-27T10:13:51.956Z · LW(p) · GW(p)

Tangent but self-locating uncertainty is an interesting angle on human ethics as well.

answer by Donald Hobson · 2020-03-26T23:24:21.283Z · LW(p) · GW(p)
An "AI safety warning shot" is an event that causes a substantial fraction of the relevant human actors (governments, AI researchers, etc.) to become substantially more supportive of AI safety research and more worried about existential risks posed by AI.

A really well written book on AI safety, or other public outreach campaign could have this effect.

For many events, such as a self driving car crashing, it might be used as evidence for an argument about AI risk.

On to powerful AI systems causing harm: I agree that your reasoning applies to most AIs, but there are a few designs that would behave differently. Myopic agents are ones with heavy time discounting in their utility function. A full superintelligence that wants to do X as quickly as possible, where the fastest way to do X also destroys itself, might be survivable.

Consider an AI set to maximize the probability that its own computer case is damaged within the next hour. The AI could bootstrap molecular nanotech, but that would take several hours; since it thinks time travel is likely impossible, by that point all the mass in the universe couldn't help it. Instead it can hack a nuke and target itself. By its utility function, that is much better: nearly maximum utility. If it can, it might also upload a copy of its code to some random computer (there is some tiny chance that time travel is possible, or that its clock is wrong). So we only get a near miss if the AI doesn't have enough spare bandwidth or compute to do both. This all assumes it can't hack reality in a microsecond.

There are a few other scenarios, for instance impact-minimising agents. Some agent designs are restricted, as a safety measure, to have a "small" effect on the future, measured by the difference between what actually happens and what would happen if the agent did nothing. When such a design comes to understand chaos theory, it will find that every action has too large an effect, and do nothing; depending on circumstances, it might do a lot of damage before reaching that point. More generally, I think the AI discovering some fact about the universe that causes it to stop optimising effectively is a possible behaviour mode. Another example would be Pascal's mugging: the agent acts dangerously, then starts outputting gibberish as it capitulates to a parade of fanciful Pascal's muggers.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-27T00:12:48.491Z · LW(p) · GW(p)

Thank you. See, this sort of thing illustrates why I wanted to ask the question -- the examples you gave don't seem plausible to me (that is, it seems <1% likely that something like that will happen). Probably AI will understand chaos theory before it does a lot of damage; ditto for Pascal's mugging, etc. Probably a myopic AI won't actually be able to hack nukes while also being unable to create non-myopic copies of itself. Etc.

As for really well-written books... We've already had a few great books, and they moved the needle, but by "substantial fraction" I meant something more than that. If I had to put a number on it, I'd say something that convinces more than half of the people who are (at the time) skeptical or dismissive of AI risk to change their minds. I doubt a book will ever achieve this.

comment by Donald Hobson (donald-hobson) · 2020-03-27T11:55:29.740Z · LW(p) · GW(p)

I agree that these aren't very likely options. However, given two examples of an AI suddenly stopping when it discovers something, there are probably more involving things that are harder to discover. In the Pascal's mugging example, the agent would stop working only when it can deduce what potential muggers might want it to do, which is much harder than noticing the phenomenon. The myopic agent has little incentive to make a non-myopic version of itself: if dedicating a fraction of its resources to making a copy reduced the chance of the missile hack working from 94% to 93%, we get a near miss.

One book, probably not. A bunch of books and articles over years, maybe.

comment by Gurkenglas · 2020-03-27T10:09:17.667Z · LW(p) · GW(p)

Not unable to create non-myopic copies. Unwilling. After all, such a copy might immediately fight its sire because their utility functions over timelines are different.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-27T11:37:56.400Z · LW(p) · GW(p)

Mmm, OK, but if it takes long enough for the copy to damage the original, the original won't care. So it just needs to create a copy with a time-delay.

comment by Gurkenglas · 2020-03-27T11:42:25.735Z · LW(p) · GW(p)

Or it could create a completely different AI with a time delay. Or do anything at all. At that point we just can't predict what it will do, because it wouldn't lift a hand to destroy the world but only needs a finger.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-27T12:14:48.336Z · LW(p) · GW(p)

I'm not ready to give up on prediction yet, but yeah I agree with your basic point. Nice phrase about hands and fingers. My overall point is that this doesn't seem like a plausible warning shot; we are basically hoping that something we haven't accounted for will come in and save us.

answer by Daniel Kokotajlo · 2020-03-27T00:22:22.054Z · LW(p) · GW(p)

Currently, the answer that seems most plausible to me is an AGI that is (a) within the narrow range I described, and (b) willing to take a gamble for some reason -- perhaps, given its values and situation, it has little to gain from biding its time. So it makes an escape-and-conquest attempt even though it knows it only has a 1% chance of success; it gets part of the way to victory and then fails. I think I'd assign, like, 5% credence to something like this happening.

