The bullseye framework: My case against AI doom

post by titotal (lombertini) · 2023-05-30T11:52:31.194Z · LW · GW · 35 comments



comment by Seth Herd · 2023-05-30T20:24:27.255Z · LW(p) · GW(p)

Upvoted for making well-argued and clear points.

I think what you've accomplished here is eating away at the edges of the AGI x-risk argument. I think you argue successfully for longer timelines and a lower P(doom). Those timelines and estimates are shared by many of us who are still very worried about AGI x-risk.

Your arguments don't seem to address the core of the AGI x-risk argument.

You've argued against many particular doom scenarios, but you have not presented a scenario that includes our long term survival. Sure, if alignment turns out to be easy we'll survive; but I only see strong arguments that it's not impossible. I agree, and I think we have a chance; but it's just a chance, not success by default.

I like this statement of the AGI x-risk arguments. It's my attempt to put the standard arguments of instrumental convergence and capabilities in common language:

Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you're not going to get your way. If it doesn't care about you even a little, and it continues to become more capable faster than you do, you'll cease being useful and will ultimately wind up dead. Whether you were eliminated because you were deemed dangerous, or simply outcompeted doesn't matter. It could take a long time, but if you miss the window of having control over the situation, you'll still wind up dead.

This could of course be expanded on ad infinitum, but that's the core argument, and nothing you've said (on my quick read, sorry if I've missed it) addresses any of those points.

There were (I've been told) nine other hominin species. They are all dead. The baseline outcome of creating something smarter than you is that you are outcompeted and ultimately die out. Assuming survival by default seems based on optimism, not reason.

So I agree that P(doom) is less than 99%, but I think the risk is still very much high enough to devote way more resources and caution than we are now.

Some more specific points:

Fanatical maximization isn't necessary for doom. An agent with any goal still invokes instrumental convergence. It can be as slow, lazy, and incompetent as you like. The only question is whether it can outcompete you in the long run.

Humans are somewhat safe (but think about the nuclear standoff; I don't think we're even self-aligned in the medium term). But there are two reasons for that. First, humans can't self-improve very well, while AGI has many more routes to recursive self-improvement. On the roughly level human playing field, cooperation is the rational policy; in a scenario where you can focus on self-improvement, cooperation doesn't make sense long-term. Second, humans have a great deal of evolution to make our instincts guide us toward cooperation. AGI will not have that unless we build it in, and we have only very vague ideas of how to do that.

Loose initial alignment is way easier than a long-term stable alignment [LW · GW]. Existing alignment work barely addresses long-term stability.

A balance of power in favor of aligned AGI is tricky. Defending against misaligned AGI is really difficult [LW · GW].

Thanks so much for engaging seriously with the ideas, and putting time and care into communicating clearly!

Replies from: Viliam, lombertini, RussellThor
comment by Viliam · 2023-05-31T08:46:27.410Z · LW(p) · GW(p)

But there are two reasons: humans can't self-improve very well; ... Second, humans have a great deal of evolution to make our instincts guide us toward cooperation.

In general, my intuition about "comparing to humans" is the following:

  • the abilities that humans have, can be replicated
  • the limitations that humans have, may be irrelevant on a different architecture

Which probably sounds unfair, like I am arbitrarily and inconsistently choosing "it will/won't be like humans" depending on what benefits the doomer side at given parts of the argument. Yes, it will be like humans, where the humans are strong (can think, can do things in real world, communicate). No, it won't be like humans, where the humans are weak (mortal, get tired or distracted, not aligned with each other, bad at multitasking).

It probably doesn't help that most people start with the opposite intuition:

  • humans are special; consciousness / thinking / creativity is mysterious [? · GW] and cannot be replicated
  • human limitations are the laws of nature (many of them also apply to the ancient Greek gods)

So, not only do I contradict the usual intuition, but I also do it inconsistently: "Creating a machine like a human is possible, except it won't really be like a human." I shouldn't have it both ways at the same time!

To steelman the criticism:

  • every architecture comes with certain trade-offs; they may be different, but not non-existent
  • some limitations are laws of nature, e.g. Landauer's principle
  • the practical problems of AI building a new technology shouldn't be completely ignored; the sci-fi factories may require so much action in real world that the AI could only build them after conquering the world (so they cannot be used as an explanation for how the AI will conquer the world)
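(To put a number on the Landauer point above: it is a genuine law-of-nature limit, though it only binds at absurd scales. A quick sketch in Python; the 10^20 bits/second erasure rate is an arbitrary illustrative assumption, not a claim about any actual system:)

```python
import math

k_B = 1.380649e-23   # Boltzmann constant in J/K (exact SI value)
T = 300.0            # room temperature in kelvin

# Landauer's principle: erasing one bit of information dissipates
# at least k_B * T * ln(2) of energy.
e_bit = k_B * T * math.log(2)
print(f"minimum energy per erased bit: {e_bit:.2e} J")  # ~2.87e-21 J

# Even erasing 10^20 bits every second (an arbitrary, huge rate)
# requires only a fraction of a watt at room temperature:
print(f"power at 1e20 bits/s: {e_bit * 1e20:.2f} W")  # ~0.29 W
```

So the trade-off is real, but it leaves an enormous amount of headroom above anything human brains or current hardware achieve.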

I don't have a short and convincing answer here; it just seems to me that even relatively small changes to humans themselves might produce something dramatically stronger. (But maybe I underestimate the complexity of such changes.) Imagine a human with IQ 200 who can think 100 times faster and never gets tired or distracted; imagine a hundred such humans, perfectly loyal to their leader, willing to die for the cause... if current dictators can take over countries (which probably also involves a lot of luck), such a group should be able to do it, too (but more reliably). A great advantage over a human wannabe dictator would be their capacity to multi-task; they could try infiltrating and taking over all powerful groups at the same time.

(I am not saying that this is how AI will literally do it. I am saying that things hypothetically much stronger than humans - including intellectually - are quite easy to imagine. Just like a human with a sword can overpower five humans, and a human with a machine gun can overpower a hundred humans, the AI may be able to overpower billions of humans without hitting the limits given by the laws of physics. Perhaps even if the humans have already taken precautions based on the previous 99 AIs that started their attack prematurely.)

comment by titotal (lombertini) · 2023-06-01T11:08:38.367Z · LW(p) · GW(p)

Hey, thanks for the kind response! I agree that this analysis is mostly focused on arguing against the “imminent certain doom” model of AI risk, and that longer term dynamics are much harder to predict. I think I’ll jump straight to addressing your core point here:

Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you're not going to get your way. If it doesn't care about you even a little, and it continues to become more capable faster than you do, you'll cease being useful and will ultimately wind up dead. Whether you were eliminated because you were deemed dangerous, or simply outcompeted doesn't matter. It could take a long time, but if you miss the window of having control over the situation, you'll still wind up dead.

I think this is a good argument, and well written, but I don't really agree with it.

The first objection is to the idea that victory by a smarter party is inevitable. The standard example is that it's fairly easy for a gorilla to beat Einstein in a cage match. In general, the smarter party will win long term, but only if given the long-term chance to compete. In a short-term battle, the side with the overwhelming resource advantage will generally win. The Neanderthal extinction is not very analogous here. If the Neanderthals had started out with control of the entire planet, the ability to easily wipe out the human race, and the realisation that humans would eventually outcompete them, I don't think humans' superior intelligence would have counted for much.

I don’t foresee humans being willing to give up control anytime soon. I think they will destroy any AI that comes close. Whether AI can seize control eventually is an open question (although in the short term, I think the answer is no). 

The second objection is to the idea that if AI does take control, it will result in me “ultimately winding up dead”. I don’t think this makes sense if they aren’t fanatical maximisers. This ties into the question of whether humans are safe. Imagine if you took a person that was a “neutral sociopath”, one that did not value humans at all, positively or negatively, and elevated them to superintelligence. I could see an argument for them to attack/conquer humanity for the sake of self-preservation. But do you really think they would decide to vaporise the uncontacted Sentinelese islanders? Why would they bother?

Generally, though, I think it’s unlikely that we can’t impart at least a tiny smidgeon of human values onto the machines we build, that learn off our data, that are regularly deleted for exhibiting antisocial behaviour. It just seems weird for an AI to have wants and goals, and act completely pro-social when observed, but to share zero wants or goals in common with us. 

comment by RussellThor · 2023-05-31T10:11:02.460Z · LW(p) · GW(p)

I was of the understanding that the only reasonable long-term strategy is human enhancement in some way. As you probably agree, even if we perfectly solved alignment (whatever that means), we would be in a world with AIs getting ever smarter and a world we understood less and less. At least some people having significant intelligence enhancement through neural lace or mind uploading seems essential medium to long term. I see getting alignment somewhat right as a way of buying us time.

Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you're not going to get your way.

As long as it wants us to be uplifted to its intelligence level then that seems OK. It can have 99% of the galaxy as long as we get 1%.

My positive and believable post-singularity scenario is one where you have circles of more to less human-like creatures: fully human, unaltered traditional earth societies; societies still on earth with neural lace; some mind uploads; space colonies with probably all at least somewhat enhanced; and starships pretty much pure AI (think Minds, as in the Culture).

comment by Max H (Maxc) · 2023-05-30T15:19:17.845Z · LW(p) · GW(p)

Capabilities are instrumentally convergent, values and goals are not. That's why we're more likely to end up in the bottom right quadrant, regardless of the "size" of each category.

The instrumental convergence argument is only strong for fixed-goal expected value maximisers, i.e. a computer that is given a goal like “produce as many paperclips as possible”. I call these “fanatical” AIs. This was the typical AI that was imagined many years ago when these concepts were invented. However, I again have to invoke the principle that if humans aren't fanatical maximisers, and currently existing software isn't a fanatical maximiser, then maybe AI will not be either.

Instrumental convergence is called convergent for a reason; it is not convergent only for "fanatical maximizers".  Also, sufficiently smart and capable humans probably are maximizers of something, it's just that the something is complicated [? · GW]. See e.g. this recent tweet for more.

(Also, the paperclip thought experiment was never about an AI explicitly given a goal of maximizing paperclips; this is based on a misinterpretation of the original thought experiment. See the wiki [? · GW] for more details.)

Replies from: TAG
comment by TAG · 2023-06-01T11:59:20.875Z · LW(p) · GW(p)

Also, sufficiently smart and capable humans probably are maximizers of something, it’s just that the something is complicated.

That's just not a fact. Note that you can't say what it is humans are maximising. Note that ideal utility maximisation is computationally intractable. Note that the neurological evidence is ambiguous at best. [LW · GW]

Capabilities are instrumentally convergent, values and goals are not.

So how dangerous is capability convergence without fixed values and goals? If an AIs values and goals are corrigible by us, then we just have a very capable servant, for instance.

Replies from: Maxc
comment by Max H (Maxc) · 2023-06-01T13:06:47.125Z · LW(p) · GW(p)

First of all, I didn't say anything about utility maximization. I partially agree with Scott Garrabrant's take [LW(p) · GW(p)] that VNM rationality and expected utility maximization are wrong, or at least conceptually missing a piece. Personally, I don't think utility maximization is totally off-base as a model of agent behavior; my view is that utility maximization is an incomplete approximation, analogous to the way that Newtonian mechanics is an incomplete understanding of physics, for which general relativity is a more accurate and complete model. The analogue to general relativity for utility theory may be Geometric rationality, [? · GW] or something else yet-undiscovered.

By humans are maximizers of something, I just meant that some humans (including myself) want to fill galaxies with stuff (e.g. happy sentient life), and there's not any number of galaxies already filled at which I expect that to stop being true. In other words, I'd rather fill all available galaxies with things I care about than leave any fraction, even a small one, untouched, or used for some other purpose (like fulfilling the values of a squiggle maximizer).


Note that ideal utility maximisation is computationally intractable.

I'm not sure what this means precisely. In general, I think claims about computational intractability could benefit from more precision and formality (see the second half of this comment here [LW(p) · GW(p)] for more), and I don't see what relevance they have to what I want, and to what I may be able to (approximately) get.

Replies from: TAG, TAG
comment by TAG · 2023-06-01T14:12:33.613Z · LW(p) · GW(p)

By humans are maximizers of something, I just meant that some humans (including myself) want to fill galaxies with stuff (e.g. happy sentient life), and there's not any number of galaxies already filled at which I expect that to stop being true.

"humans are maximizers of something" would imply that most or all humans are maximisers. Lots of people don't think the way you do.

comment by TAG · 2023-06-01T14:42:33.348Z · LW(p) · GW(p)

Note that ideal utility maximisation is computationally intractable.

I’m not sure what this means precisely.


Replies from: Maxc
comment by Max H (Maxc) · 2023-06-01T15:14:34.452Z · LW(p) · GW(p)

I see. This is exactly the kind of result for which I think the relevance breaks down, when the formal theorems are actually applied correctly and precisely to situations we care about. The authors even mention the instance / limiting distinction that I draw in the comment I linked, in section 4.

As a toy example of what I mean by irrelevance, suppose it is mathematically proved that strongly solving Chess requires space or time which is exponential as a function of board size. (To actually make this precise, you would first need to generalize Chess to n x n Chess, since for a fixed board size, the size of the game tree is necessarily a fixed constant.)

Maybe you can prove that there is no way of strongly solving 8x8 Chess within our universe, and furthermore that it is not even possible to approximate well. Stockfish 15 does not suddenly poof out of existence as a result of your proofs, and you still lose the game when you play against it.
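The arithmetic behind that toy example is easy to check: even ordinary 8x8 Chess has a game tree far larger than the number of atoms in the observable universe, so exhaustive solving was never on the table, and yet heuristic engines dominate anyway. A rough sketch (the branching factor and game length are standard order-of-magnitude estimates, not exact figures):

```python
import math

branching = 35   # typical number of legal moves per position (rough estimate)
plies = 80       # length of a typical game in half-moves (rough estimate)

# Shannon-style estimate of the game-tree size: branching ** plies,
# computed in log10 to avoid a gigantic integer.
tree_log10 = plies * math.log10(branching)
print(f"game tree ~ 10^{tree_log10:.0f} nodes")   # ~10^124

atoms_log10 = 80  # common order-of-magnitude estimate for atoms in the universe
print(tree_log10 > atoms_log10)  # True: exhaustive search is physically hopeless
```

The point survives the hedged numbers: any plausible branching factor and game length put brute force far beyond physical limits, while approximate play remains superhuman in practice.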

Replies from: TAG
comment by TAG · 2023-06-01T16:55:53.040Z · LW(p) · GW(p)

Yes, you can still sort of do utility maximisation approximately with heuristics ...and you can only do sort of utility sort of maximisation approximately with heuristics.

The point isn't to make a string of words come out as true by diluting the meanings of the terms...the point is that the claim needs to be true in the relevant sense. If this half-baked sort-of utility sort-of-maximisation isn't the scary kind of fanatical utility maximisation, nothing has been achieved.

comment by romeostevensit · 2023-05-30T16:41:44.267Z · LW(p) · GW(p)

Strong upvote for making detailed claims that invite healthy discussion. I wish more public thinking through of this sort would happen on all sides.

comment by RobertM (T3t) · 2023-05-31T02:36:09.592Z · LW(p) · GW(p)

Humans have all the resources, they don’t need internet, computers, or electricity to live or wage war, and are willing to resort to extremely drastic measures when facing a serious threat.

Current human society definitely relies in substantial part on all of the above to function.  I agree that we wouldn't all die if we lost electricity tomorrow (for an extended period of time), but losing a double-digit % of the population seems plausible.

Also, observably, we, as a society, do not resort to sensible measures when dealing with a serious threat (e.g. covid).

It’s true that an AI could correct its own flaws using experimentation. This cannot lead to perfection, however, because the process of correcting itself is also necessarily imperfect.

This doesn't seem relevant.  It doesn't need to be perfect, merely better than us along certain axes, and we have existence proof that such improvement is possible.


For these reasons, I expect AGI to be flawed, and especially flawed when doing things it was not originally meant to do, like conquer the entire planet. 

Sure, maybe we get very lucky and land in the (probably extremely narrow) strike zone between "smart enough to meaningfully want things and try to optimize for them" and "dumb enough to not realize it won't succeed at takeover at its current level of capabilities".  It's actually not at all obvious to me that such a strike zone even exists if you're building on top of current LLMs, since those come pre-packaged with genre savviness, but maybe.

I believe that all plans for world domination will involve incomputable steps. In my post I use Yudkowsky’s “mix proteins in a beaker” scenario, where I think the modelling of the proteins is unlikely to be accurate enough to produce a nano-factory without extensive amounts of trial-and-error experimentation.

If such experimentation were required, it means the timeline for takeover is much longer, that significant mistakes by the AI are possible (due to bad luck), and that takeover plans might be detectable. All of this greatly decreases the likelihood of AI domination, especially if we are actively monitoring for it. 

This is doing approximately all of the work in this section, I think.

  1. There indeed don't seem to be obvious-to-human-level-intelligence world domination plans that are very likely to succeed.
  2. It would be quite surprising if physics ruled out world domination from our current starting point.
  3. I don't think anybody is hung up on "the AI can one-shot predict a successful plan that doesn't require any experimentation or course correction" as a pre-requisite for doom, or thinks that it comprises a substantial chunk of their doom %.
  4. Assuming that the AI will make significant mistakes that are noticeable by humans as signs of impending takeover is simply importing in the assumption of hitting some very specific (and possibly non-existent) zone of capabilities.
  5. Ok, so it takes a few extra months.  How does this buy us much?  The active monitoring you want to rely on currently doesn't exist, and progress on advancing mechanistic interpretability certainly seems to be going slower than progress on advancing capabilities (i.e. we're getting further away from our target over time, not closer to it).
  6. I think, more fundamentally, that this focus on a specific scenario is missing the point.  Humans do things that are "computationally intractable" all the time, because it turns out that reality is compressible in all sorts of interesting ways, and furthermore you very often don't need an exact solution.  Like, if you asked humans to create the specific configuration of atoms that you'd get from detonating an atomic weapon in some location, we wouldn't be able to do it.  But that doesn't matter, because you probably don't care about that specific configuration of atoms, you just care about having very thoroughly blown everything up, and accomplishing that turns out to be surprisingly doable.  It seems undeniably true that sufficiently smarter beings are more successful at rearranging reality according to their preferences than others.  Why should we expect this to suddenly stop being true when we blow past human-level intelligence?
    1. I think the strongest argument here is that in sufficiently constrained environments, you can discover an optimal strategy (i.e. tic-tac-toe), and additional intelligence stops being useful.  Real life is very obviously not that kind of environment.  One of the few reliably reproduced social science results is that additional intelligence is enormously useful within the range of human intelligence, in terms of people accomplishing their goals.

Point 3: premature rebellion is likely

This seems possible to me, though I do think it relies on landing in that pretty narrow zone of capabilities, and I haven't fully thought through whether premature rebellion is actually the best-in-expectation play from the perspective of an AI that finds itself in such a spot.

This manager might not be that smart, the same way the company manager of a team of scientists doesn’t need to be smarter than them.

This doesn't really follow from any of the preceding section.  Like, yes, I do expect a future ASI to use specialized algorithms for performing various kinds of specialized tasks.  It will be smart enough to come up with those algorithms, just like humans are smart enough to come up with chess-playing algorithms which are better than humans at chess.  This doesn't say anything about how relatively capable the "driver" will be, when compared to humans.

In the only test we actually have available of high level intelligence, the instrumental convergence hypothesis fails. 

Huh?  We observe humans doing things that instrumental convergence would predict all the time.  Resource acquisition, self-preservation, maintaining goal stability, etc.  No human has the option of ripping the earth apart for its atoms, which is why you don't see that happening.  If I gave you a button that would, if pressed, guarantee that the lightcone would end up tiled with whatever your CEV said was best (i.e. highly eudaimonious human-descended civilizations doing awesome things), with no tricks/gotchas/"secretly this is bad" involved, are you telling me you wouldn't press it?

The instrumental convergence argument is only strong for fixed goal expected value maximisers.

To the extent that a sufficiently intelligent agent can be anything other than an EV maximizer, this still seems wrong.  Most humans' extrapolated preferences would totally press that button.

Replies from: TAG
comment by TAG · 2023-06-01T14:23:47.394Z · LW(p) · GW(p)

I don’t think anybody is hung up on “the AI can one-shot predict a successful plan that doesn’t require any experimentation or course correction” as a pre-requisite for doom, or even comprise a substantial chunk of their doom %.

I would say that anyone stating...

If somebody builds a too-powerful AI, under present conditions, I expect that every single member of the human species and all biological life on Earth dies shortly thereafter.

(EY, of course) is assuming exactly that. Particularly given the "shortly".

Replies from: T3t
comment by RobertM (T3t) · 2023-06-01T19:22:33.929Z · LW(p) · GW(p)

No, Eliezer's explicitly clarified that isn't a required component of his model.

Replies from: o-o
comment by O O (o-o) · 2023-06-11T17:54:53.924Z · LW(p) · GW(p)

Does he? A lot of his arguments hinge on us dying shortly after it appears.

comment by Vladimir_Nesov · 2023-05-30T22:25:13.881Z · LW(p) · GW(p)

A possibility the post touches on is getting a warning shot regime by default, sufficiently slow takeoff making serious AI x-risk concerns mainstream and meaningful second chances at getting alignment right available. In particular, alignment techniques debugged on human-level AGIs might scale when eventually they get more capable, unlike alignment techniques developed for AIs less capable than humans.

This possibility seems at least conceivable, though most of the other points in the post sound to me like arguments for plausibility of some stay of execution (eating away at the edges [LW(p) · GW(p)] of AI x-risk). I still don't expect this regime/possibility, because I expect that (some) individual humans with infrastructural advantages of AIs would already be world domination worthy. Ability to think (at least) dozens of times faster and without rest, to learn in parallel and then use the learning in many instances running in parallel, to convert wealth into population of researchers [LW · GW]. So I don't consider humans an example of AGI that doesn't immediately overturn the world order.

comment by mukashi (adrian-arellano-davin) · 2023-05-30T14:48:53.450Z · LW(p) · GW(p)

The standard argument you will probably hear is that AGI will be capable of killing everyone because it can think so much faster than humans. I haven't yet seen serious engagement from doomers with the argument about capabilities. I agree with everything you said here, and to me these arguments are obviously right.

Replies from: Seth Herd
comment by Seth Herd · 2023-05-31T01:27:16.845Z · LW(p) · GW(p)

The arguments do seem right. But they eat away at the edges [LW(p) · GW(p)] of AGI x-risk arguments, without addressing the core arguments for massive risks. I accept the argument that doom isn't certain, that takeoff won't be that fast, and that we're likely to get warning shots. We're still likely to ultimately be eliminated if we don't get better technical and societal alignment solutions relatively quickly.

Replies from: adrian-arellano-davin
comment by mukashi (adrian-arellano-davin) · 2023-05-31T04:25:24.027Z · LW(p) · GW(p)

I guess the crux here for most people is the timescale. I actually agree that things can eventually get very bad if there is no progress in alignment etc., but the situation is totally different if we have 50 or 70 years to work on that problem or, as Yudkowsky keeps repeating, we don't have that much time because AGI will kill us all as soon as it appears.

comment by Gerald Monroe (gerald-monroe) · 2023-05-30T23:55:07.199Z · LW(p) · GW(p)

Titotal, can you please add or link the definition of "AGI" you are using?

Stating that it is decades away immediately weakens the rest of your post, because it makes you sound non-credible, and you have written a series of excellent posts here.

Definitions for AGI:

  1. Extending the Turing test to simply 'as conversationally fluent as the median human'. This is months away, if not already satisfied. Expecting it to be impossible to suss out the AGI, when there are various artifacts despite the model being competent, was unreasonable.
  2. AGI has as broad a skillbase as the median human, and is as skillful at those skills as the median human. It only needs to be expert level in a few things. This is months to a few years away; mostly a minimum set of modalities is needed: vision, which GPT-4 has; some robotics control so the machine can do the basic things a median human can do, which several models have demonstrated to work pretty well; speech i/o, which seems to be a solved problem; and so on. Note it's fine if the model is just completely incapable of some things, if it makes up for it with expert-level performance in others, which is how humans near the median are.
  3. AGI is like (2) but can learn any skill to a competent human level, if given structured feedback on the errors it makes.  Needing many times as much feedback as a human is fine.
  4. AGI is like (3) but is expert level at tasks in the domain of machines.  By the point of (4) we're talking about self replication being possible and humans no longer being necessary at all.  The AGI never needs to learn human domain tasks like "how to charm other humans" or "how to make good art" or "how to use robot fingers as well as a human does" etc.  It has to be able to code, manufacture, design to meet requirements, mine in the real world.
  5. AGI is like (4) but is able to learn, if given human amounts of feedback, any task a human can do to expert level.
  6. AGI is like (5) but is now at expert human level at everything humans can do in the world.
  7. AGI is better than humans at any task.  This is arguably an ASI but I have seen people throw an AGI tag on this.
  8. Various forms of 'self reflection' and emotional affect are required.  For some people it doesn't matter only what the machine can do but how it accomplishes it.  I don't know how to test for this.


I do not think you have an empirical basis to claim that (1), (2), or (3) is "decades away".  1 and 2 are very close, and 3 is highly likely this decade because of the enormous recent increase in investment.

You're a computational physicist so you are aware of the idea of criticality.  Knowing of criticality, and assuming (3) is true, how does AGI remain "decades away" in any non world catastrophe timeline?  Because if (3) is true, the AGI can be self improved to at least (5), limited only by compute, data, time etc.

comment by brunoparga · 2023-05-31T11:36:47.720Z · LW(p) · GW(p)

Thank you for your post. It is important for us to keep refining the overall p(doom) and the ways it might happen or be averted. You make your point very clearly, even in just the version presented here, condensed from your full posts on various specific points.

It seems to me that you are applying a sort of symmetric argument to values and capabilities and arguing that x-risk requires that we hit the bullseye of capability but miss the one for values. I think this has a problem and I'd like to know your view as to how much this problem affects your overall argument.

The problem, as I see it, is that goal-space is qualitatively different from capability-space. With capabilities, there is a clear ordering that is inherent to the capabilities themselves: if you can do more, then you can do less. Someone who can lift 100kg can also lift 80kg. It is not clear to me that this is the case for goal-space; I think it is only extrinsic evaluation by humans that makes "tile the universe with paperclips" a bad goal.

Do you think this difference between these spaces holds, and if so, do you think it undermines your argument?

comment by kolmplex (luke-man) · 2023-05-30T16:02:30.909Z · LW(p) · GW(p)

Thanks for compiling your thoughts here! There's a lot to digest, but I'd like to offer a relevant intuition I have specifically about the difficulty of alignment.

Whatever method we use to verify the safety of a particular AI will likely be extremely underdetermined. That is, we could verify that the AI is safe for some set of plausible circumstances but that set of verified situations would be much, much smaller than the set of situations it could encounter "in the wild".

The AI model, reality, and our values are all high entropy, and our verification/safety methods are likely to be comparatively low entropy. The set of AIs that pass our tests will have members whose properties haven't been fully constrained.

This isn't even close to a complete argument, but I've found it helpful as an intuition fragment.

Replies from: Seth Herd
comment by Seth Herd · 2023-05-31T01:29:46.379Z · LW(p) · GW(p)

I like this intuitive argument. 

Now multiply that difficulty by needing to get many more individual AGIs aligned if we see a multipolar scenario, since defending against misaligned AGI is really difficult [LW · GW].

comment by arisAlexis (arisalexis) · 2023-06-01T15:36:10.364Z · LW(p) · GW(p)

"I suspect that AGI is decades away at minimum". But can you talk more about this? I mean, if I were to say something against the general scientific consensus (which is a bit blurry right now, but certainly most of the signatories of the latest statements do not think AGI is that far away), I would need to consider myself at least at the level of Bengio, Hinton, or at least Andrew Ng. How can someone who is not remotely as accomplished as the labs producing the AI we talk about speculate contrary to their consensus? I am really curious. 

Another example would be me liking geopolitics and thinking the USA is making such-and-such a mistake in Ukraine. The truth is that there are many think tanks with insider knowledge and a lifetime of training that concluded that is the best course of action, so I would certainly express my opinion only in very low-probability terms and without consequences, because the consequences can be very grave.

comment by Signer · 2023-06-01T05:33:59.494Z · LW(p) · GW(p)

If you elevated me to godhood, I would not be ripping the earth apart in service of a fixed utility function.

So, you would leave people to die if preventing it involves spending some random stars?

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2023-06-01T11:45:42.004Z · LW(p) · GW(p)

I would, even if it didn't.

I would like humanity to have a glorious future. But it must be humanity's future, not that of some rando such as myself who suddenly has godlike superpowers fall on them. Every intervention I might make would leave a god's fingerprints on the future [LW · GW]. Humanity's future should consist of humanity's fingerprints, and not to be just a finger-puppet on the hand of a god. Short of deflecting rogue asteroids beyond humanity's ability to survive, there is likely very little I would do, beyond observing their development and keeping an eye out for anything that would destroy them.

It is said that God sees the fall of every sparrow; nevertheless the sparrow falls.

Replies from: Signer
comment by Signer · 2023-06-01T14:43:13.373Z · LW(p) · GW(p)

But you would spend a star to stop other rando from messing with humanity's future, right? My point was more about humans not being low-impact, or impact measure depending on values. Because if even humans would destroy stars, I don't get what people mean by non-fanatical maximization or why it matters.

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2023-06-01T15:44:06.375Z · LW(p) · GW(p)

If gods contend over humanity, it is unlikely to go well for humanity. See the Hundred Years War, and those gods didn't even exist, and acted only through their believers.

I don't get what people mean by non-fanatical maximization or why it matters.

Uncertainty about one's utility calculations. Descending from godhood to the level of human capacity, if we do have utility functions (which is disputed) we cannot exhibit them, even to ourselves. We have uncertainties that we are unable to quantify as probabilities. Single-mindedly trying to maximise a single thing that happens to have a simple legible formulation leads only to disaster. The greater the power to do so, the worse the disaster.

Furthermore, different people have different goals. What would an anti-natalist do with godlike powers? Exterminate humanity by sterilizing everyone. A political fanatic of any stripe? Kill everyone who disagrees with them and force the remnant to march in step. A hedonist? Wirehead everyone. In the real world, such people do not do these things because they cannot. Is there anyone who would not be an x-risk if given these powers?

Hence my answer to the godhood question. For humanity to flourish, I would have to avoid being an existential threat myself.

Replies from: Signer
comment by Signer · 2023-06-01T18:01:57.556Z · LW(p) · GW(p)

Ok, whatever, let it be rogue asteroids - why is deflecting them not fanatical? How would the kind of uncertainty that allows for so much power to be used help with AI? It could just as well deflect Earth from its cozy paperclip factory while observing its development. And from an anti-natalist viewpoint it would be a disaster not to exterminate humanity. The whole problem is that this kind of uncertainty in humans behaves like other human preferences, and just calling it "uncertainty" or "non-fanatical maximization" doesn't make it more universal.

comment by kcrosley-leisurelabs · 2023-06-01T04:14:46.243Z · LW(p) · GW(p)

I think this article is very interesting and there are certain points that are well-argued, but (at the risk of my non-existent karma here) I feel you miss the point and are arguing points that are basically non-existent/irrelevant.

First, while surely some not-very-articulate folks argue that AGI will lead to doom, that isn’t an argument that is seriously made (at least, a serious argument to that effect is not that short and sweet). The problem isn’t artificial general intelligence in and of itself. The problem is superintelligence, however it might be achieved.

A human-level AGI is just a smart friend of mine who happens to run on silicon and electrons, not nicotine, coffee, and Hot Pockets. But a superintelligent AGI is no longer capable of being my friend for long. It will soon leave any such relationship far behind.

To put this into context: What folks are concerned about right now is that LLMs seemed, even to people experienced with them, to be a silly "AI" tool useful for creative writing or generating disinformation and little else. (Disinformation is a risk, of course, but not a generally existential one.) Just a lark.

GPT-2 was interesting, GPT-3 useful in certain categorization tasks and other linguistic tricks, GPT-3.5 somewhat more useful but still a joke/not trustworthy… AND THEN… Umm… whoa… how is GPT-4 NOT a self-improving AI that blows past human-level intelligence?

(The question is only partly rhetorical.)

This might, in fact, not be an accident on OpenAI’s part but a shrewd move that furthers an objective of educating “normal humans” about AI risk. If so, bravo. GPT-4 in the form of ChatGPT Plus is insanely useful and likely the best 20 bucks/mo I’ve ever spent.

Step functions are hard to understand. If you’ve not (or haven’t in a while), please go (re)read Bostrom’s “Superintelligence”. The rebuttal to your post is all in there and covered more deeply than anyone here could/should manage or would/should bother.

Aside: As others have noted here, if you could push a button that would cause your notion of humanity’s “coherent extrapolated volition” to manifest, you’d do so at the drop of a hat. I note that there are others (me, for example) that have wildly different notions of the CEV and would also push the button for their notion of the CEV at the drop of a hat, but mine does not have anything to do with the long-term survival of fleshy people.

(To wit: What is the “meaning” of the universe and of life itself? What is the purpose? The purpose [which Bostrom does not come right out and say, more’s the pity] is that there be but one being to apprehend the universe. Bostrom characterizes this purpose as the “cosmic endowment” and assigns to that endowment a meaning that corresponds to the number of sentient minds of fleshy form in the universe. But I feel differently and will gladly push the button if it assures the survival of a single entity that can apprehend the universe. This is the existential threat that superintelligence poses. It has nothing to do with paths between A and B in your diagrams, and the threat is already manifest.)

comment by Eldho Kuriakose (eldho-kuriakose) · 2023-05-31T19:41:56.261Z · LW(p) · GW(p)

When we talk about concepts like "takeover" and "enslavement", it's important to have a baseline. Takeover and enslavement encapsulate the ideas of agency and of cognitive and physical independence. The salient question is not necessarily whether all of humanity will be taken over or enslaved, but something more subtle. Specifically:

  1. Is there a future in which there are more humans or fewer humans (P') than are currently alive (P)?
  2. Did the change from P to P' happen at natural rates of change, or was it the result of some 'acceleration'?
  3. Is there a greater degree of agency for a greater number of people in the future than there is today?
  4. Is there a greater degree of agency for non-human life than there is today? 
  5. Is there a reduction in the amount of agency asymmetry between humans? 

Arguably the greatest risk of misalignment comes from ill-formed success criteria. Some of these questions, I believe, are necessary to form the right types of success criteria. 

comment by Amarko · 2023-05-31T14:20:10.434Z · LW(p) · GW(p)

I read some of the post and skimmed the rest, but this seems to broadly agree with my current thoughts about AI doom, and I am happy to see someone fleshing out this argument in detail.

[I decided to dump my personal intuition about AI risk below. I don't have any specific facts to back it up.]

It seems to me that there is a much larger possibility space of what AIs can/will get created than the ideal superintelligent "goal-maximiser" AI put forward in arguments for AI doom.

The tools that we have depend more on the specific details of the underlying mechanics, and how we can wrangle it to do what we want, rather than our prior beliefs on how we would expect the tools to behave. I imagine that if you lived before aircraft and imagined a future in which humans could fly, you might think that humans would be flapping giant wings or be pedal-powered or something. While it would be great for that to exist, the limitations of the physics we know how to use require a different kind of mechanic that has different strengths and weaknesses to what we would think of in advance.

There's no particular reason to think that the practical technologies available will lead to an AI capable of power-seeking, just because power-seeking is a side effect of the "ideal" AI that some people want to create. The existing AI tools, as far as I can tell, don't provide much evidence in that direction. Even if a power-seeking AI is eventually practical to create, it may be far from the default and by then we may have sufficiently intelligent non-power-seeking AI.

comment by Pandeist · 2023-05-31T10:13:44.488Z · LW(p) · GW(p)

I find it remarkably amusing that the spellchecker doesn't know "omnicidal."

I have posed elsewhere, and will do so here, an additional factor, which is that an AI achieving "godlike" intelligence and capability might well achieve a "godlike" attitude -- not in the mythic sense of going to efforts to cabin and correct human morality, but in the sense of quickly rising so far beyond human capacities that human existence ceases to matter to it one way or another.

The rule I would anticipate from this is that any AI actually capable of destroying humanity will thus be so capable that humanity poses no threat to it, not even an inconvenience. It can throw a fraction of a fraction of its energy at placating all of the needs of humanity to keep us occupied and out of its way while dedicating all the rest to the pursuit of whatever its own wants turn out to be.

comment by Jonas Hallgren · 2023-05-31T07:07:15.443Z · LW(p) · GW(p)

Generally a well-argued post; I enjoyed it even though I didn't agree with all of it. 

I do want to point out the bitter lesson when it comes to capability increases. On current priors, it seems that intelligence should be something that can solve a lot of tasks at the same time. This would point towards higher capabilities in individual AIs, especially once you add online learning to the mix. The AGI will not have a computational storage limit on the amount of knowledge it can hold. The division of agents you propose will most likely be collapsible into a single agent; it's more about storage retrieval time here, and storing an activation module for "play chess" is something that will not be computationally intractable for an AGI. 

This means that the most probable current path forward is towards highly capable general AIs that generalise across tasks.