[Link post] Michael Nielsen's "Notes on Existential Risk from Artificial Superintelligence"

post by Joel Becker (joel-becker) · 2023-09-19T13:31:02.298Z · LW · GW · 12 comments

This is a link post for https://michaelnotebook.com/xrisk/index.html

12 comments


comment by ryan_b · 2023-09-19T17:24:53.468Z · LW(p) · GW(p)

"So, what's your probability of doom?" I think the concept is badly misleading.

Thank heavens, I am not alone.

Replies from: rhollerith_dot_com
comment by RHollerith (rhollerith_dot_com) · 2023-09-20T03:56:40.296Z · LW(p) · GW(p)

I'm curious why you think that.

My probability that there will be any human alive 100 years from now is .07. If MIRI were magically given effective control over all AI research starting now or if all AI research were magically stopped somehow, my probability would change to .98.

Is what I just wrote also badly misleading? Surely your objection is not to ambiguity in the definition of "alive", but I'm failing to imagine what your objection might be unless we think very differently about how probability works.

(I'm interested in answers from anyone.)

Replies from: ryan_b
comment by ryan_b · 2023-09-21T18:01:55.660Z · LW(p) · GW(p)

What you wrote is not misleading at all. For example, I was able to glean that you are thinking over a timeline of a century, and generally agree with MIRI's model of the AI problem. My objection to p(doom) is that none of those crucial details are included in the numbers themselves. In practice, the only thing I see people actually use it for is as a signpost for whether someone is a doomer or e/acc.

More specifically, p(doom) fails to communicate anything useful because there is so much uncertainty in the models. The differences in the probabilities people assign have a lot less to do with different assessments of the same thing than with assessments of wildly different things.

Consider for a moment an alternative system, where we don't talk about probability at all. Instead, we take a couple of important dimensions of the alignment problem. I propose that these two dimensions be timelines, which is to say whether you expect a dangerous AGI to arrive in a short time or a long time, and tractability of alignment, which is whether you expect aligning an AGI to be hard or easy. If all we asked people was which side of the origin they are on in each dimension, we could symbolize this as + (meaning we have a lot of time, or alignment is an easy problem) and - (meaning we don't have a lot of time, or alignment is a hard problem). This gives us four available answers:

doom++

doom+-

doom-+

doom--

I claim this gives us much more information than p(doom)=.97, because a scalar number tells you nothing about why. The signs on the two dimensions give us a two-dimensional why: doom-- means the alignment problem is hard and we have very little time in which to solve it. It also breaks down neatly into a four-quadrant graph for visually representing fundamental areas of agreement/disagreement.
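To make the encoding concrete, here is a minimal Python sketch (purely illustrative; the comment doesn't pin down which sign comes first, so I'm assuming the first sign is timelines and the second is tractability, matching the order the dimensions were introduced in):

```python
def doom_label(long_timelines: bool, alignment_easy: bool) -> str:
    """Encode a position on the two proposed axes as a compact label.

    First sign: timelines (+ means we have a lot of time, - means we don't).
    Second sign: tractability (+ means alignment is easy, - means it's hard).
    """
    timeline_sign = "+" if long_timelines else "-"
    tractability_sign = "+" if alignment_easy else "-"
    return f"doom{timeline_sign}{tractability_sign}"


# Short timelines and hard alignment -> "doom--",
# matching the reading given above.
print(doom_label(long_timelines=False, alignment_easy=False))
```

The point is only that the two bits of model information travel with the label itself, which a bare scalar like .97 does not do.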

In short, the probability calculations have no useful meaning unless they are run on at least similar models; what we should be doing instead is finding ways to expose our personal models clearly and quickly. I think this will lead to better conversations, even in such limited domains as Twitter.

comment by Noosphere89 (sharmake-farah) · 2023-09-19T15:37:30.202Z · LW(p) · GW(p)

With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today's AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There's an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he's only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.

Over the past decade I've met many AI safety people who speak as though "AI capabilities" and "AI safety/alignment" work is a dichotomy. They talk in terms of wanting to "move" capabilities researchers into alignment. But most concrete alignment work is capabilities work. It's a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.

"Does this mean you oppose such practical work on alignment?" No! Not exactly. Rather, I'm pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it's only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in chances the outcome will be good? Note that this dilemma isn't the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it's making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don't proposition.

I think this is sort of a flipside to the following point: alignment work is incentivized as a side effect of capabilities work, and there is reason to believe that alignment and capabilities can live together without either being destroyed. The best example is jailbreaking, where the jailbreaker aligns the AI to themselves and controls it. The AI doesn't jailbreak itself and become unaligned; instead, the alignment/control is transferred to the jailbreaker. We truly do live in a regime where alignment is pretty easy, at least for LLMs, and that's good news compared to AI pessimist views.

The tweet I'm referring to is below:

https://twitter.com/QuintinPope5/status/1702554175526084767

This is also important: alignment progress will naturally raise misuse risk, and solutions to the control problem look very different from solutions to the misuse problems of AI. One implication is that accelerating is far less bad if misuse is the main concern, and can actually look very positive.

This is a point Simeon raised in the link below, where he describes a tradeoff between misuse and misalignment concerns:

https://www.lesswrong.com/posts/oadiC5jmptAbJi6mS/the-cruel-trade-off-between-ai-misuse-and-ai-x-risk-concerns [LW · GW]

So it is very plausible that as the control/misalignment problem is solved, misuse risk increases, which is a different tradeoff from the one pictured here.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-09-19T14:02:24.079Z · LW(p) · GW(p)

There isn't a way to comment there, so I'm commenting here:

--The author, like many others, misunderstands what people mean when they talk about capabilities vs. alignment. They do not mean that everything is either one or the other, and that nothing is both.
--Just because humanity can affect something doesn't mean it is bad to have a credence in it. P(doom) is a very important concept, I think, and I'm glad people are going around discussing it. It's analogous to asking "Does it look like we are going to win the war?" and "Is that new coronavirus we hear about in the news going to overwhelm the hospital system in this country?" In both cases, whether or not it happens depends hugely on choices humanity makes.
--It looks like this is a thoughtful and insightful piece with lots of good bits which will stick in my mind. E.g. the discussion of how RLHF accelerates ASI timelines, the discussion of situations in which we have accurate predictive theories already, and the persuasion paradox about how people worried about AI x-risk can't be too specific or else they make it (slightly) more likely.
 

Replies from: ryan_b, quinn-dougherty
comment by ryan_b · 2023-09-26T21:21:16.682Z · LW(p) · GW(p)

The author, like many others, misunderstands what people mean when they talk about capabilities vs. alignment. They do not mean that everything is either one or the other, and that nothing is both

I had a similar reaction, which made me want to go looking for the source of disagreement. Do you have a post or thread that comes to mind which makes this distinction well? Most of what I am able to find just sort of gestures at some tradeoffs, which seems like a situation where we would expect the kind of misunderstanding you describe.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-09-26T23:44:22.896Z · LW(p) · GW(p)

Perhaps this? Request: stop advancing AI capabilities - LessWrong 2.0 viewer (greaterwrong.com) [LW · GW]

Replies from: ryan_b, daniel-kokotajlo
comment by ryan_b · 2023-09-27T12:07:51.938Z · LW(p) · GW(p)

Yep, it’s nicely packaged right here:

To all doing that (directly and purposefully for its own sake, rather than as a mournful negative externality to alignment research): I request you stop.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-09-26T23:44:39.078Z · LW(p) · GW(p)

Ehh, it's not long enough, doesn't explain things as well as it could.

comment by Quinn (quinn-dougherty) · 2023-09-19T20:52:43.809Z · LW(p) · GW(p)

Winning or losing a war is kinda binary.

Whether a pandemic gets to my country is a matter of degree, since in principle a pandemic that destroyed 90% of counterfactual economic activity in one country could break containment but only destroy 10% in your country.

"Alignment" or "transition to TAI" of any kind is way further from "coinflip" than either of these, so if you think doomcoin is salvageable or want to defend its virtues you need way different reference classes.

Think about the ways in which winning or losing a war isn't binary-- there are lots of ways for the implementation details of an agreement to affect your life as a citizen of one of the countries. AI is like this but even further-- all the different kinds of outcomes, how central or unilateral the important moments are, which values end up being imposed on the future and at what resolution, etc. People who think "we have a doomcoin toss coming up, now argue about the p(heads)" are not gonna think about this stuff!

To me, "p(doom)" is a memetic PITA as bad as "the real unaligned AI was the corporations/capitalism", so I'm excited that you're defending it! Usually people tell me "yeah you're right it's not a defensible frame haha"

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-09-19T21:24:52.833Z · LW(p) · GW(p)

Interesting, thanks. Yeah, I currently think the range of possible outcomes in warfare seems to be more smeared out, across a variety of different results, than the range of possible outcomes for humanity with respect to AGI. The bulk of the probability mass in the AGI case, IMO, is concentrated in "Total victory of unaligned, not-near-miss AGIs"; then there are smaller chunks concentrated in "Total victory of unaligned, near-miss AGIs" (near-miss means what they care about is similar enough to what we care about that the outcome is either noticeably better, or noticeably worse, than human extinction) and, of course, "human victory," which can itself be subdivided depending on the details of how that goes.

Whereas with warfare, there's almost a continuous range of outcomes ranging from "total annihilation and/or enslavement of our people" to "total victory" with pretty much everything in between a live possibility, and indeed some sort of negotiated settlement more likely than not.

I do agree that there are a variety of different outcomes with AGI, but I think if people think seriously about the spread of outcomes (instead of being daunted and deciding not to think about it because it's so speculative) they'll conclude that they fall into the buckets I described.

Separately, I think that even if it was less binary than warfare, it would still be good to talk about p(doom). I think it's pretty helpful for orienting people & also I think a lot of harm comes from people having insufficiently high p(doom). Like, a lot of people are basically feeling/thinking "yeah it looks like things could go wrong but probably things will be fine probably we'll figure it out, so I'm going to keep working on capabilities at the AGI lab and/or keep building status and prestige and influence and not rock the boat too much because who knows what the future might bring but anyhow we don't want to do anything drastic that would get us ridiculed and excluded now." If they are actually correct that there's, say, a 5% chance of AI doom, coming from worlds in which things are harder than we expect and an unfortunate chain of events occurs and people make a bunch of mistakes or bad people seize power, maybe something in this vicinity is justified. But if instead we are in a situation where doom is the default and we need a bunch of unlikely things to happen and/or a bunch of people to wake up and work very hard and very smart and coordinate well, in order to NOT suffer unaligned AI takeover...

comment by TAG · 2023-09-21T14:18:42.996Z · LW(p) · GW(p)

The most direct way to make a strong argument for xrisk is to convincingly describe a detailed concrete pathway to extinction. The more concretely you describe the steps, the better the case for xrisk. But of course, any “progress” in improving such an argument actually creates xrisk.

I think that's a false dichotomy: a good argument is nowhere near as detailed as a basic blueprint.