The Leeroy Jenkins principle: How faulty AI could guarantee "warning shots"

post by titotal (lombertini) · 2024-01-14T15:03:21.087Z · LW · GW · 6 comments

This is a link post for https://titotal.substack.com/p/the-leeroy-jenkins-principle-how


6 comments

Comments sorted by top scores.

comment by WillPetillo · 2024-01-15T03:44:15.873Z · LW(p) · GW(p)

One more objection to the model: AI labs apply just enough safety measures to prevent dumb rogue AIs.  Fearing a public backlash to low-level catastrophes, AI companies test their models, checking for safety vulnerabilities, rogue behaviors, and potential for misuse.  The easiest-to-catch problems, however, are also the least dangerous, so only the most cautious, intelligent, and dangerous rogue AIs pass the security checks.  Further, this correlation continues indefinitely, so all additional safety work contributes towards filtering the population of malevolent AIs towards the most dangerous.  AI companies are not interested in adhering to the standard of theoretical, "provably safe" models, as they are trying to get away with the bare minimum, so the filter never catches everything.  While "warning shots" appear all the time in experimental settings, these findings are suppressed or downplayed in public statements and the media, and the public only sees the highly sanitized result of the filtration process.  Eventually, the security systems fail, but by this point AI has been developed past the threshold needed to become catastrophically dangerous.

comment by Charlie Steiner · 2024-01-17T10:23:55.415Z · LW(p) · GW(p)

It's all quantitative. Why wasn't Bing Chat threatening a reporter for saying bad things about it the warning shot that got civilization to wake up and do the right thing? Well, it sort of was. It was a clear sign of problems, and it got some people to move in the right direction; it just didn't suddenly change everyone's mind - there are still plenty of people rushing ahead. As the warning shots get warning-shottier, more people will change their minds, but whether they'll do so fast enough is an empirical question that I'm somewhat pessimistic about.

comment by RogerDearnaley (roger-d-1) · 2024-01-14T22:14:03.609Z · LW(p) · GW(p)

As I've suggested before, one of the less drastic forms that a "pivotal act" could take (if we got to the point where one was needed; currently most governments appear to be taking AI risk fairly seriously) is a competent, well-documented demonstration of "here's how an ASI could take over the world/defeat humanity if it wanted to" (preferably a demonstration that doesn't actually kill anyone). What you discuss is the other half of that: "an AGI that clearly wanted to take over the world/defeat humanity, but wasn't in fact up to pulling it off correctly".

I also, sadly, agree that we as a society might not pay much attention until hundreds of people or more die from one of these. Or it might be that the level of public concern is already high enough that we would.

comment by quasi_quasar (ealex) · 2024-01-15T13:22:19.820Z · LW(p) · GW(p)

A couple of notes from me: though I appreciate the effort you've put into this, especially the simulations, I overall disagree with the reasoning you've presented, so I thought I'd offer a few counter-points.

Whilst I don't disagree that an "idiotic AGI" is conceptually possible, I think our main disagreement is that you believe AGIs will be sampled from a sufficiently large pool, similar to that of high-IQ people in the world today, so that we will be guaranteed at least a few "idiotic AGIs" to tip us off. This assumption rests centrally on a world where either AGI is developed relatively simultaneously by a large number of different actors OR it is widely shared once developed, so that many such different AGIs might co-exist within a relatively short timespan. I have serious doubts that that is indeed the timeline we are heading towards.

It is perfectly possible, for example, that when AGI is eventually developed it remains a singular (or single-digit-count) guarded secret for a significant amount of time. If the AGI that happens to be developed turns out not to be an "idiotic AGI", then we have an immediate problem. Even if it does turn out to be an "idiotic AGI" and displays serious errors in testing, it's entirely possible these will be "mitigated", again in secret, so that the AGI eventually released into the world is far more capable and far less prone to "idiocy": one that is equally far more capable of carrying out an existential attack OR of simply putting it off until it has become an ASI and is far more capable still.

I'd also note that you state quite clearly towards the beginning of the post that you are "not making any claims about the prevalence of foolish beliefs among the highly intelligent", and yet in other places you state that "there are many humans, even highly intelligent ones, that act dumb or foolish" and that "foolish high-IQ people exist and are not vanishingly rare". Either you are claiming simply the existence of the phenomenon, or you are claiming you can demonstrate prevalence. I don't feel you've successfully demonstrated the latter, having offered only some fairly insubstantial evidence, so I will assume that the former is the statement you actually want to make. Prevalence is, however, quite essential to the argument you are making, I think - it does matter whether it's 3 or 30 out of 100 high-IQ people that are "foolish".
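
To make it concrete why the prevalence figure (and the pool-size assumption above) matters, here is a minimal Python sketch. It is illustrative only: the per-AGI "foolishness" probability p and the number of independently developed AGIs N are assumed values for the sake of the arithmetic, not anything estimated in the post.

```python
# Illustrative arithmetic for the "warning shot" argument: if each of N
# independently developed AGIs has probability p of being "foolish" enough
# to give itself away, the chance of at least one giveaway is 1 - (1 - p)^N.

def p_at_least_one_foolish(p: float, n: int) -> float:
    """Probability that at least one of n independent AGIs is 'foolish',
    assuming a fixed per-AGI foolishness prevalence p (an assumption, not data)."""
    return 1 - (1 - p) ** n

for p in (0.03, 0.30):      # "3 or 30 out of 100"
    for n in (3, 100):      # single-digit count vs. a large pool
        print(f"p={p:.2f}, N={n:3d} -> P(at least one foolish AGI) = "
              f"{p_at_least_one_foolish(p, n):.2f}")
```

Under these assumed numbers, a single-digit pool gives roughly a 9% chance of a giveaway at 3% prevalence versus roughly 66% at 30% prevalence, while a pool of 100 makes a giveaway near-certain either way; both the prevalence and the pool-size assumptions do real work in the argument.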

There is also a discussion to be had about equating a "high-IQ" human to an AGI. The definition of AGI is still highly problematic, so I think we're on pretty shaky ground assuming what an AGI will and won't be anyway, and that in itself may be a weakness in your argument.

I think, however, that if we follow your line of reasoning about "foolish humans", a lot of the errors that humans in general (high-IQ or not) make are due to a combination of emotion and cognitive biases.

An AGI will (presumably) not make any errors due to emotion, and it is highly debatable what cognitive biases (if any) an AGI will have. We are certainly introducing a bunch with RLHF, as you yourself mentioned, though whether that technique will (still) be used when AGI is achieved is another tenuous assumption. Whilst you argue that hallucinations might themselves be such an example of a "cognitive bias" that may give away the "idiot AGI's" plan, it's worth noting that the elimination of hallucinations is a direct goal of current AI improvement. Even if we cannot expect hallucinations to be eliminated completely, as long as they are reduced to extremely small odds of occurring, expecting them to be the source of a "giveaway" alarm from an AGI is highly optimistic, if not, dare I say, unrealistic.

comment by Chris_Leong · 2024-01-14T15:44:01.822Z · LW(p) · GW(p)

Good post. One suggestion: you might want to further emphasise that this is likely a race condition where an AI would have to move fast lest it lose control to an AI produced after it.

comment by davekasten · 2024-06-29T23:06:44.114Z · LW(p) · GW(p)

I am glad you wrote this, as I have been spending some time wondering about this possibility space. 

One more option: an AI could have a utility function where it seeks to maximize its time alive, and enough cognition to realize that it is likely to die regardless once humans decide it is dangerous.  Even if it thinks it cannot win, it might seek to cause chaos that increases its total time to live.