# Ngo and Yudkowsky on alignment difficulty

post by Eliezer Yudkowsky (Eliezer_Yudkowsky), Richard_Ngo (ricraz) · 2021-11-15T20:31:34.135Z · LW · GW · 133 comments

## Contents

0. Prefatory comments
1. September 5 conversation
  1.1. Deep vs. shallow problem-solving patterns
  1.2. Requirements for science
  1.3. Capability dials
  1.4. Consequentialist goals vs. deontologist goals
2. Follow-ups
  2.1. Richard Ngo's summary
3. September 8 conversation
  3.1. The Brazilian university anecdote
  3.2. Brain functions and outcome pumps
  3.3. Hypothetical-planning systems, nanosystems, and evolving generality
  3.4. Coherence and pivotal acts
4. Follow-ups
  4.1. Richard Ngo's summary
  4.2. Nate Soares' summary


This post is the first in a series of transcribed Discord conversations between Richard Ngo and Eliezer Yudkowsky, moderated by Nate Soares. We've also added Richard and Nate's running summaries of the conversation (and others' replies) from Google Docs.

Later conversation participants include Ajeya Cotra, Beth Barnes, Carl Shulman, Holden Karnofsky, Jaan Tallinn, Paul Christiano, Rob Bensinger, and Rohin Shah.

The transcripts are a complete record of several Discord channels MIRI made for discussion. We tried to edit the transcripts as little as possible, other than to fix typos and a handful of confusingly-worded sentences, to add some paragraph breaks, and to add referenced figures and links. We didn't end up redacting any substantive content, other than the names of people who would prefer not to be cited. We swapped the order of some chat messages for clarity and conversational flow (indicated with extra timestamps), and in some cases combined logs where the conversation switched channels.

# 4. Follow-ups

## 4.2. Nate Soares' summary

comment by Rob Bensinger (RobbBB) · 2021-11-15T20:46:41.163Z · LW(p) · GW(p)

This is the first post in a sequence, consisting of the logs of a Discord server MIRI made for hashing out AGI-related disagreements with Richard Ngo, Open Phil, etc.

I did most of the work of turning the chat logs into posts, with lots of formatting help from Matt Graves and additional help from Oliver Habryka, Ray Arnold, and others. I also hit the 'post' button for Richard and Eliezer. (I don't plan to repeat this note on future posts in this sequence, unless folks request it.)

Replies from: lincolnquirk
comment by lincolnquirk · 2021-11-16T11:05:02.256Z · LW(p) · GW(p)

I'd like to express my gratitude and excitement (and not just to you, Rob, though your work is included in this):

Deep thanks to everyone involved for having the discussion, writing up and formatting, and posting it on LW. I think this is some of the more interesting and potentially impactful stuff I've seen relating to AI alignment in a long while.

(My only thought is... why hasn't a discussion like this occurred sooner? Or has it, and it just hasn't made it to LW?)

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-11-16T14:35:27.781Z · LW(p) · GW(p)

I'm not sure why we haven't tried the 'generate and publish chatroom logs' option before. If you mean more generally 'why is MIRI waiting to hash these things out with other xrisk people until now?', my basic model is:

• Syncing with others was a top priority for SingInst (2000-2012), and this resulted in stuff like the Sequences, the FOOM debate, Highly Advanced Epistemology 101 for Beginners [? · GW], the Singularity Summits, etc. It (largely) doesn't cover the same ground as current disagreements because people disagree about different stuff now.
• 'SingInst' becoming 'MIRI' in 2013 coincided with us shifting much more to a focus on alignment research. That said, a lot of factors resulted in us continuing to have a lot of non-research-y conversations with others, including: EA coalescing in 2012-2014; the wider AI alignment field starting in earnest with the release of Superintelligence (2014) and the Puerto Rico conference (2015); and Open Philanthropy starting in 2014.
• Some of these conversations (and the follow-up reflections prompted by these conversations) ended up inspiring publications at some point, including some of the content on Arbital (mostly active 2015-2017), Inadequate Equilibria [? · GW] (published 2017, but mostly written around 2013-2015 I believe), etc.
• My model is that we then mostly disappeared in 2018-2020 while we hunkered down to do research, continuing to have intermittent conversations and email exchanges with folks, but not sinking very much time into syncing up. (I'll say that a lot of non-MIRI EA leaders were very eager to sink loads of time into syncing up with MIRI, and it's entirely MIRI's 'sorry, we want to do research instead' that caused this to not happen during this period.)

So broadly I'd say 'we did try to sync up a lot, but it turns out there's a lot of ground to cover, and different individuals at different times have very different perspectives and cruxes'. At a certain point, (a) we'd transmitted enough of our perspective that we expected to be pretty happy with e.g. EA leaders' sense of how to do broader field-building, academic outreach, etc.; and (b) we felt we'd plucked the low-hanging fruit and further syncing up would require a lot more focused effort, which seemed lower-priority than 'make ourselves less confused about the alignment problem by working on this research program' at the time.

Replies from: Eliezer_Yudkowsky, Vaniver
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T15:27:46.330Z · LW(p) · GW(p)

I'm definitely not happy with others' sense of how to do field-building, but it's not like I thought I could fix that issue by spending the rest of my life trying to do it myself.

comment by Vaniver · 2021-11-16T17:26:53.331Z · LW(p) · GW(p)

I'm not sure why we haven't tried the 'generate and publish chatroom logs' option before.

My guess is that a lot of these conversations often hinge on details that people are somewhat antsy about saying in public, and I suspect MIRI now thinks the value of "credible public pessimism" is larger than the cost of "gesturing towards things that seem powerful" on the margin, such that chatlogs like this are a better idea than they would have seemed to the MIRI of 4 years ago. [Or maybe it was just "no one thought to try, because we had access to in-person conversations and those seemed much better, despite not generating transcripts."]

comment by johnswentworth · 2021-11-15T22:18:28.018Z · LW(p) · GW(p)

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

I disagree with this in an interesting way. (Not particularly central to the discussion, but since both Richard & Eliezer thought the quoted claim is basically-true, I figured I should comment on it.)

First, outside view evidence: most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint. If there were evolutionary fitness gains to be had, in general, by passing more information via the genome, then we should expect that to have evolved already.

Second, inside view: overparameterized local search processes (including evolution and gradient descent on NNs) perform information compression by default. This is a technical idea that I haven't written up properly yet, but as a quick sketch... suppose that I have a neural net with N parameters. It's overparameterized, so there are many degrees of freedom in any optimum - i.e. there's a whole optimal surface, not just an optimal point. Now suppose that I can build a near-perfect model of the training data by setting only M (< N) parameter-values; with these values, all the other parameters are screened off, so the remaining N-M parameters can take any values at all. (I'll call the set of M parameter-values a "model".) The smaller M, the larger N-M, and therefore the more possible parameter-values achieve optimality using this model. And the more possible parameter-values achieve optimality using the model, the more of the optimum-space this "model" fills. In practice, for something like evolution or gradient descent, this would mean a broad peak.

Rough takeaway: broader peaks in the fitness-landscape are precisely those which require fixing fewer parameters. Fixing fewer parameters, while still achieving optimality, requires compressing all the information-required-to-achieve-optimality into those few parameters. The more compression, the broader the peak, and the more likely that a local search process will find it.
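
A toy numerical sketch of this point, under the simplest possible assumptions (a quadratic loss that genuinely depends on only M of N parameters; everything below is an invented illustration, not a claim about real networks or evolution): counting zero-curvature directions at an optimum recovers the "fewer fixed parameters = broader peak" relationship.

```python
import numpy as np

N, M = 10, 3                      # total parameters vs. parameters the loss actually depends on
target = np.arange(1.0, M + 1)    # arbitrary "correct" values for the M meaningful parameters

def loss(theta):
    # Only theta[:M] matters; the remaining N - M parameters are screened off.
    return np.sum((theta[:M] - target) ** 2)

theta_opt = np.zeros(N)
theta_opt[:M] = target            # one point on the optimal surface

# Finite-difference Hessian of the loss at the optimum.
eps = 1e-4
H = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        e_i, e_j = np.eye(N)[i] * eps, np.eye(N)[j] * eps
        H[i, j] = (loss(theta_opt + e_i + e_j) - loss(theta_opt + e_i)
                   - loss(theta_opt + e_j) + loss(theta_opt)) / eps**2

flat_directions = np.sum(np.isclose(np.linalg.eigvalsh(H), 0.0, atol=1e-3))
print(f"{flat_directions} flat directions out of {N}")   # -> N - M = 7

# Fewer fixed parameters (smaller M) => more zero-curvature directions => a broader,
# higher-volume optimum, which a local search process is more likely to stumble into.
```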

Replies from: DaemonicSigil, TekhneMakre
comment by DaemonicSigil · 2021-11-16T04:46:35.382Z · LW(p) · GW(p)

Large genomes have (at least) 2 kinds of costs. The first is the energy and other resources required to copy the genome whenever your cells divide. The existence of junk DNA suggests that this cost is not a limiting factor. The other cost is that a larger genome will have more mutations per generation. So maintaining that genome across time uses up more selection pressure. Junk DNA requires no maintenance, so it provides no evidence either way. Selection pressure cost could still be the reason why we don't see more knowledge about the world being translated genetically.

A gene-level way of saying the same thing is that even a gene that provides an advantage may not survive if it takes up a lot of genome space, because it will be destroyed by the large number of mutations.

Replies from: johnswentworth
comment by johnswentworth · 2021-11-16T05:14:41.047Z · LW(p) · GW(p)

Good point, I wasn't thinking about that mechanism.

However, I don't think this creates an information bottleneck in the sense needed for the original claim in the post, because the marginal cost of storing more information in the genome does not increase via this mechanism as the amount-of-information-passed increases. Each gene just needs to offer a large enough fitness advantage to counter the noise on that gene; the requisite fitness advantage does not change depending on whether the organism currently has a hundred information-passing genes or a hundred thousand. It's not really a "bottleneck" so much as a fixed price: the organism can pass any amount of information via the genome, so long as each base-pair contributes marginal fitness above some fixed level.

It does mean that individual genes can't be too big, but it doesn't say much about the number of information-passing genes (so long as separate genes have mostly-decoupled functions, which is indeed the case for the vast majority of gene pairs in practice).

Replies from: darius
comment by darius · 2021-11-17T23:40:00.102Z · LW(p) · GW(p)

Here's the argument I'd give for this kind of bottleneck. I haven't studied evolutionary genetics; maybe I'm thinking about it all wrong.

In the steady state, an average individual has n children in their life, and just one of those n makes it to the next generation. (Crediting a child 1/2 to each parent.) This gives log2(n) bits of error-correcting signal to prune deleterious mutations. If the genome length times the functional bits per base pair times the mutation rate is greater than that log2(n), then you're losing functionality with every generation.

One way for a beneficial new mutation to get out of this bind is by reducing the mutation rate.  Another is refactoring the same functionality into fewer bits, freeing up bits for something new. But generically a fitness advantage doesn't seem to affect the argument that the signal from purifying selection gets shared by the whole genome.
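
A rough back-of-the-envelope version of that inequality, using illustrative human-scale numbers (the genome size, mutation rate, bits-per-base-pair, and offspring count below are assumptions chosen for the sketch, not figures from the comment):

```python
import math

genome_length = 3.0e9   # base pairs (roughly human-scale)
mutation_rate = 1.0e-8  # mutations per base pair per generation (rough assumption)
bits_per_bp   = 1.0     # information that must be maintained per functional base pair (assumption)
children_n    = 4       # average offspring per individual in a steady-state population (assumption)

selection_budget = math.log2(children_n)   # ~2 bits of pruning signal per generation

# Largest fraction of the genome that can be functional before mutational erosion
# outruns the selection budget, per the inequality above.
max_functional_fraction = selection_budget / (genome_length * bits_per_bp * mutation_rate)

print(f"selection budget: {selection_budget:.1f} bits/generation")
print(f"max functional fraction under these assumptions: {max_functional_fraction:.1%}")
# -> about 7%: under these toy numbers, only a small slice of the genome can sit under
#    tight purifying selection, which is at least consistent with "most of the genome is junk".
```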

comment by TekhneMakre · 2021-11-16T03:23:18.680Z · LW(p) · GW(p)
most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint.

My guess is that this is a total misunderstanding of what's meant by "genomic bottleneck". The bottleneck isn't the amount of information storage, it's the fact that the genome can only program the mind in a very indirect, developmental way, so that it can install stuff like "be more interested in people" but not "here's how to add numbers".

Replies from: cousin_it
comment by cousin_it · 2021-11-16T10:00:14.407Z · LW(p) · GW(p)

That seems wrong, living creatures have lots of specific behaviors that are genetically programmed.

In fact I think both you and John are misunderstanding the bottleneck. The point isn't that the genome is small, nor that it affects the mind indirectly. The point is that the mind doesn't affect the genome. Living creatures don't have the tech to encode their life experience into genes for the next generation.

Replies from: ricraz, TekhneMakre
comment by Richard_Ngo (ricraz) · 2021-11-17T00:03:09.748Z · LW(p) · GW(p)

I've appreciated this comment thread! My take is that you're all talking about different relevant things. It may well be the case that there are multiple reasons why more skills and knowledge aren't encoded in our genomes: a) it's hard to get that information in (from parents' brains), b) it's hard to get that information out (to children's brains), and c) having large genomes is costly. What I'm calling the genomic bottleneck is a combination of all of them (although I think John is probably right that c) is not the main reason).

What would falsify my claim about the genomic bottleneck is if the main reason there isn't more information passed on via genomes is because d) doing so is not very useful. That seems pretty unlikely, but not entirely out of the picture. E.g. we know that evolution is able to give baby deer the skill of walking shortly after birth, so it seems like d) might be the best explanation of why humans can't do that too. But deer presumably evolved that skill over a very long time period, whereas I'm more interested in rapid changes.

comment by TekhneMakre · 2021-11-16T10:52:38.342Z · LW(p) · GW(p)

Do you think you can encode good flint-knapping technique genetically? I doubt that.

I think I agree with your point, and think it's a more general and correct statement of the bottleneck; but, still, I think that genome does mainly affect the mind indirectly, and this is one of the constraints making it be the case that humans have lots of learning / generalizing capability. (This doesn't just apply to humans. What are some stark examples of animals with hardwired complex behaviors? With a fairly high bar for "complex", and a clear explanation of what is hardwired and how we know. Insects have some fairly complex behaviors, e.g. web building, ant-hill building, the tree-leaf nests of weaver ants, etc.; but IDK enough to rule out a combination of a little hardwiring, some emergence, and some learning. Lots of animals hunt after learning from their parents how to hunt. I think a lot of animals can walk right after being born? I think beavers in captivity will fruitlessly chew on wood, indicating that the wild phenotype is encoded by something simple like "enjoys chewing" (plus, learned desire for shelter), rather than "use wood for dam".)

An operationalization of "the genome directly programs the mind" would be that things like [the motions employed in flint-knapping] can be hardwired by small numbers of mutations (and hence can be evolved given a few million relevant years). I think this isn't true, but counterevidence would be interesting. Since the genome can't feasibly encode behaviors directly, or at least can't update them quickly enough to keep up with a changing niche, the species instead evolves to learn behaviors on the fly via algorithms that generalize. If there were *either* mind-mind transfer, *or* direct programming of behavior by the genome, then higher frequency changes would be easier and there'd be less need for fluid intelligence. (In fact it's sort of plausible to me (given my ignorance) that humans are imitation specialists and are less clever than Neanderthals were, since mind-mind transfer can replace intelligence.)

comment by KatWoods (ea247) · 2021-11-29T18:45:29.755Z · LW(p) · GW(p)

You can listen to this and all the other Yudkowsky & Ngo/Christiano conversations in podcast form on the Nonlinear Library now.

You can also listen to them on any podcast player. Just look up Nonlinear Library.

I’ve listened to them as is and I find it pretty easy to follow, but if you’re interested in making it even easier for people to follow, these fine gentlemen have put up a ~$230 RFP/bounty for anybody who turns it into audio where each person has a different voice. It would probably be easiest to just do it on our platform, since there’s a relatively easy way to change the voices; it will just be a tedious ~1-4 hours of work.

My main bottleneck is management time, so I don’t have the time to manage the process or choose somebody who I’d trust to do it without messing with the quality. It does seem a shame, though, to have something so close to being even better, and not let people do what clearly is desired, because of my worry of accidentally messing up the quality of the audio. I think the main thing is just being conscientious enough to do 1-4 hours of repetitive work, plus an attention to detail.

After a couple minutes of thinking on it, I think a potential solution would be to have a super quick and dirty way to delegate trust. I’ll give you access to our platform to change the voices if you either a) are getting a/have a degree at an elite school (thus demonstrating a legible minimal amount of conscientiousness and ability to do boring tasks) or b) have at least 75 mutual EA friends with me on Facebook and can provide an EA reference about your diligence. Just DM me. I’ll do it on a first come, first served basis. If you do it with human voices, we’d also be happy to add that to the Library.

Finally, sorry for the delay. There was a comedy of errors: there was a bug in the system while I also came down with a human bug (a cold, not covid :) ), and the articles were so long our regular system wasn’t working, so things weren't automatic like usual.

Replies from: jimrandomh, RobbBB
comment by jimrandomh · 2021-11-30T03:17:45.246Z · LW(p) · GW(p)

(Mod note: I edited this comment to fix broken links.)

Replies from: ea247
comment by KatWoods (ea247) · 2021-11-30T14:10:50.898Z · LW(p) · GW(p)

Thank you!

comment by Rob Bensinger (RobbBB) · 2021-11-30T02:42:43.879Z · LW(p) · GW(p)

Thanks for doing this, Kat! :)

I’ve listened to them as is and I find it pretty easy to follow, but if you’re interested in making it even easier for people to follow, these fine gentlemen [? · GW] have put up a ~$230 RFP/bounty for anybody who turns it into audio where each person has a different voice.

That link isn't working for me; where's the bounty?

comment by TurnTrout · 2021-11-23T19:22:00.550Z · LW(p) · GW(p)

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

• Assumes the situation is well-modelled by a binary "has-AGI?" predicate.
• (I am sympathetic to the microeconomics of intelligence explosion working out in a way where the binary "has-AGI?" framing is true, but I feel uncertain about the prospect.)
• Somehow rules out situations like: We have somewhat aligned AIs which push the world to make future unaligned AIs slightly less likely, which makes the AI population more aligned on average; this cycle compounds until we're descending very fast into the basin of alignment and goodness.
• This isn't my mainline or anything, but I note that it's ruled out by Eliezer's model as I understand it.
• Some other internal objections are arising and I'm not going to focus on them now.

Every AI output effectuates outcomes in the world.

Right but the likely domain of cognitive discourse matters. Pac-Man agents effectuate outcomes in the world, but their optimal policies are harmless. So the question seems to hinge on when the domain of cognition shifts to put us in the crosshairs of performant policies.

This doesn't mean Eliezer is wrong here about the broader claim, but the distinction deserves mentioning for the people who weren't tracking it. (I think EY is obviously aware of this)

If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

Could we have observed it any other way? Since we surely weren't selected for proving math theorems, we don't have a native cortex specializing in math. So conditional on considering things like theorem-proving at all, it has to reuse other native capabilities.

More precisely, one possible mind design which solves theorems also reasons about humans. This is some update from whatever prior, towards EY's claim. I'm considering whether we know enough about the common cause (evolution giving us a general-purpose reasoning algorithm) to screen off/reduce the Theorems -> Human-modelling update.

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

Thanks, Richard—this is a cool argument that I hadn't heard before.

You will systematically overestimate how much easier, or how far you can push the science part without getting the taking-over-the-world part, for as long as your model is ignorant of what they have in common.

OK, it's a valid point and I'm updating a little, under the apparent model of "here's a set of AI capabilities, linearly ordered in terms of deep-problem-solving, and if you push too far you get taking-over-the-world." But I don't see how we get to that model to begin with.

comment by Ramana Kumar (ramana-kumar) · 2021-11-19T15:48:22.236Z · LW(p) · GW(p)

I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.

Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect to an externally specified set of counterfactuals. E.g., relative to what we consider "could have happened", the consequentialists selected an excellent course of action for their purposes. This would make consequentialists optimizing systems [AF · GW] in Flint's sense.

Option 2 (agency/structural/their models): They are structured in such a way that they do their own considering and evaluating and deciding. We observe mechanisms that implement the processes of predicting and evaluating outcomes in these systems (and/or their history). So the possibilities that are narrowed down are the consequentialist's possibilities, the counterfactuals are produced by their models which may or may not line up with some externally specified ones (like ours).

I mostly think Yudkowsky is referring to Option 2, but I get confused by phrases (e.g. from Soares's summary [? · GW]) like "manage to actually funnel history" or "apparent consequentialism", that seem to me to make most sense under Option 1.

Replies from: Eliezer_Yudkowsky, RobbBB
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-19T22:38:38.875Z · LW(p) · GW(p)

To Rob's reply, I'll add that my own first reaction to your question was that it seems like a map-territory / perspective issue as appears in eg thermodynamics?  Like, this has a similar flavor to asking "What does it mean to say that a classical system is in a state of high entropy when it actually only has one particular system state?"  Adding this now in case I don't have time to expand on it later; maybe just saying that much will help at all, possibly.

comment by Rob Bensinger (RobbBB) · 2021-11-19T17:29:09.296Z · LW(p) · GW(p)

Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

I'm not sure that I understand the question, but my intuition is to say: they funnel world-states into particular outcomes in the same sense that literal funnels funnel water into particular spaces, or in the same sense that a slope makes things roll down it.

If you find water in a previously-empty space with a small aperture, and you're confused that no water seems to have spilled over the sides, you may suspect that a funnel was there. Funnels are part of a larger deterministic universe, so maybe in some sense any given funnel (like everything else) 'had to do exactly that thing'. Still, we can observe that funnels are an important part of the causal chain in these cases, and that places with funnels tend to end up with this type of outcome much more often.

Similarly, consequentialists tend to remake parts of the world (typically, as much of the world as they can reach) into things that are high in their preference ordering. From Optimization and the Singularity [LW · GW]:

[...] Suppose you have a car, and suppose we already know that your preferences involve travel.  Now suppose that you take all the parts in the car, or all the atoms, and jumble them up at random.  It's very unlikely that you'll end up with a travel-artifact at all, even so much as a wheeled cart; let alone a travel-artifact that ranks as high in your preferences as the original car.  So, relative to your preference ordering, the car is an extremely improbable artifact; the power of an optimization process is that it can produce this kind of improbability.

You can view both intelligence and natural selection [? · GW] as special cases of optimization:  Processes that hit, in a large search space, very small targets defined by implicit preferences.  Natural selection prefers more efficient replicators.  Human intelligences have more complex preferences [? · GW].  Neither evolution nor humans have consistent utility functions, so viewing them as "optimization processes" is understood to be an approximation.  You're trying to get at the sort of work being done, not claim that humans or evolution do this work perfectly.

This is how I see the story of life and intelligence - as a story of improbably good designs being produced by optimization processes.  The "improbability" here is improbability relative to a random selection from the design space, not improbability in an absolute sense - if you have an optimization process around, then "improbably" good designs become probable. [...]

But it's not clear what a "preference" is, exactly. So a more general way of putting it, in Recognizing Intelligence [LW · GW], is:

[...] Suppose I landed on an alien planet and discovered what seemed to be a highly sophisticated machine, all gleaming chrome as the stereotype demands.  Can I recognize this machine as being in any sense well-designed, if I have no idea what the machine is intended to accomplish?  Can I guess that the machine's makers were intelligent, without guessing their motivations?

And again, it seems like in an intuitive sense I should obviously be able to do so.  I look at the cables running through the machine, and find large electrical currents passing through them, and discover that the material is a flexible high-temperature high-amperage superconductor.  Dozens of gears whir rapidly, perfectly meshed...

I have no idea what the machine is doing.  I don't even have a hypothesis as to what it's doing.  Yet I have recognized the machine as the product of an alien intelligence.

[...] Why is it a good hypothesis to suppose that intelligence or any other optimization process played a role in selecting the form of what I see, any more than it is a good hypothesis to suppose that the dust particles in my rooms are arranged by dust elves?

Consider that gleaming chrome.  Why did humans start making things out of metal?  Because metal is hard; it retains its shape for a long time.  So when you try to do something, and the something stays the same for a long period of time, the way-to-do-it may also stay the same for a long period of time.  So you face the subproblem of creating things that keep their form and function.  Metal is one solution to that subproblem.

[... A]s simple a form of negentropy [? · GW] as regularity over time - that the alien's terminal values don't take on a new random form with each clock tick - can imply that hard metal, or some other durable substance, would be useful in a "machine" - a persistent configuration of material that helps promote a persistent goal.

The gears are a solution to the problem of transmitting mechanical forces from one place to another, which you would want to do because of the presumed economy of scale in generating the mechanical force at a central location and then distributing it.  In their meshing, we recognize a force of optimization applied in the service of a recognizable instrumental value: most random gears, or random shapes turning against each other, would fail to mesh, or fly apart.  Without knowing what the mechanical forces are meant to do, we recognize something that transmits mechanical force - this is why gears appear in many human artifacts, because it doesn't matter much what kind of mechanical force you need to transmit on the other end.  You may still face problems like trading torque for speed, or moving mechanical force from generators to appliers.

These are not universally [? · GW] convergent instrumental challenges.  They probably aren't even convergent with respect to maximum-entropy goal systems (which are mostly out of luck).

But relative to the space of low-entropy, highly regular goal systems - goal systems that don't pick a new utility function for every different time and every different place - that negentropy pours through the notion of "optimization" and comes out as a concentrated probability distribution over what an "alien intelligence" would do, even in the "absence of any hypothesis" about its goals. [...]

"Consequentialists funnel the universe into shapes that are higher in their preference ordering" isn't a required inherent truth for all consequentialists; some might have weird goals, or be too weak to achieve much. Likewise, some literal funnels are broken or misshapen, or just never get put to use. But in both cases, we can understand the larger class by considering the unusual function well-working instances can perform.

(In the case of literal funnels, we can also understand the class by considering its physical properties rather than its function/behavior/effects. Eventually we should be able to do the same for consequentialists, but currently we don't know what physical properties of a system make it consequentialist, beyond the level of generality of e.g. 'its future-steering will approximately obey expected utility theory'.)
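
One crude way to make "funneling into a small target" quantitative, offered as a toy sketch rather than anything from the quoted posts: run a weak optimizer over a made-up design space, then measure how rarely random selection from that space does at least as well. The negative log of that fraction is a rough count of "bits of optimization".

```python
import math
import random

def fitness(design):
    # Arbitrary toy objective over 40-bit "designs": number of satisfied constraints.
    return sum(design)

def random_design(n=40):
    return [random.randint(0, 1) for _ in range(n)]

def hill_climb(steps=15, n=40):
    # A deliberately weak optimization process: flip one bit at a time, keep improvements.
    # (More steps => a smaller target hit => more "bits of optimization" below.)
    design = random_design(n)
    for _ in range(steps):
        candidate = design[:]
        candidate[random.randrange(n)] ^= 1
        if fitness(candidate) >= fitness(design):
            design = candidate
    return design

optimized_fitness = fitness(hill_climb())

# How improbable is doing at least this well under random selection from the design space?
samples = [fitness(random_design()) for _ in range(100_000)]
p = max(sum(f >= optimized_fitness for f in samples), 1) / len(samples)
print(f"optimized fitness: {optimized_fitness}")
print(f"random designs at least as fit: {p:.4f}  (~{-math.log2(p):.1f} bits of optimization)")
```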

Replies from: ramana-kumar
comment by Ramana Kumar (ramana-kumar) · 2021-11-23T17:28:37.402Z · LW(p) · GW(p)

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually nearby (in spacetime) when water is in a small space without having spilled, and Option 2 corresponds to the characteristic funnel shape (in combination with facts about physical laws maybe).

I think your and Eliezer's replies are pointing me at a sense in which both Option 1 and Option 2 are correct, but they are used in different ways in the overall story. To tell this story, I want to draw a distinction between outcome-pumps (behavioural agents) and consequentialists (structural agents). Outcome-pumps are effective at achieving outcomes, and this effectiveness is measured according to our models (option 1). Consequentialists do (or have done in their causal history) the work of selecting actions according to expected consequences in coherent pursuit of an outcome, and the expected consequences are therefore their own (option 2).

Spelling this out a little more - Outcome-pumps are optimizing systems [AF · GW]: there is a space of possible configurations, a much smaller target subset of configurations, and a basin of attraction such that if the system+surroundings starts within the basin, it ends up within the target. There are at least two ways of looking at the configuration space. Firstly, there's the range of situations in which we actually observe the same (or similar) outcome-pump system and that it achieved its outcome. Secondly, there's the range of hypothetical possibilities we can imagine and reason about putting the outcome-pump system into, and extrapolating (using our own models) that it will achieve the outcome. Both of these ways are "Option 1".
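
A minimal executable sketch of that optimizing-systems picture, under toy assumptions (configurations are points in the plane, the dynamics are gradient descent on a bowl-shaped potential, the target set is a small ball around the minimum, and the candidate basin is a large square; none of these specifics come from Flint's post):

```python
import numpy as np

def step(x, lr=0.1):
    # Dynamics: move downhill on the potential f(x) = ||x||^2.
    return x - lr * 2 * x

def ends_in_target(x0, steps=200, target_radius=0.05):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = step(x)
    return np.linalg.norm(x) < target_radius

# Perturb starting points throughout the candidate basin and check that they all reach the target.
rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, size=(1000, 2))
print(all(ends_in_target(s) for s in starts))  # True: a broad basin funnels into a tiny target set
```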

Consequentialists (structural agents) do the work, somewhere somehow - maybe in their brains, maybe in their causal history, maybe in other parts of their structure and history - of maintaining and updating beliefs and selecting actions that lead to (their modelled) expected consequences that are high in their preference ordering (this is all Option 2).

It should be somewhat uncontroversial that consequentialists are outcome pumps, to the extent that they’re any good at doing the consequentialist thing (and have sufficiently achievable preferences relative to their resources etc).

The more substantial claim I read MIRI as making is that outcome pumps are consequentialists, because the only way to be an outcome pump is to be a consequentialist. Maybe you wouldn't make this claim so strongly, since there are counterexamples like fires and black holes -- and there may be some restrictions on what kind of outcome pumps the claim applies to (such as some level of retargetability or robustness?).

How does this overall take sound?

Scott Garrabrant’s question [AF · GW] on whether agent-like behaviour implies agent-like architecture seems pretty relevant to this whole discussion -- Eliezer, do you have an answer to that question? Or at least do you think it’s an important open question?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-23T17:37:42.720Z · LW(p) · GW(p)

My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps.  All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game.  Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

(Modulo that lots of times people here are like "Well but a human at a particular intelligence level in a particular complicated circumstance once did this kind of work without the thing happening that it sounds like you say happens with powerful outcome pumps"; and then you have to look at the human engine and its circumstances to understand why outcome pumping could specialize down to that exact place and fashion, which will not be reduplicated in more general outcome pumps that have their dice re-rolled.)

Replies from: ramana-kumar, daniel-kokotajlo
comment by Ramana Kumar (ramana-kumar) · 2021-11-25T12:16:04.453Z · LW(p) · GW(p)

A couple of direct questions I'm stuck on:

• Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps?
• Are black holes and fires reasonable examples of outcome pumps?

I'm asking these to understand the work better.

• Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely.
• Yes. But maybe not the most informative examples. They're highly non-retargetable.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-25T11:55:13.859Z · LW(p) · GW(p)

Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine.

I don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work.

Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.

comment by Eli Tyre (elityre) · 2021-11-18T01:28:35.659Z · LW(p) · GW(p)

Von Neumann was actually a fairly reflective fellow who knew about, and indeed helped generalize, utility functions. The great achievements of von Neumann were not achieved by some very specialized hypernerd who spent all his fluid intelligence on crystallizing math and science and engineering alone, and so never developed any opinions about politics or started thinking about whether or not he had a utility function.

Uh. I don't know about that.

Von Neumann seemed to me to be very much not making rational tradeoffs of the sort that one would make if they were conceptualizing themselves as an agent with a utility function.

From a short post [LW(p) · GW(p)] I wrote, a few years ago, after reading a bit about the man:

For one thing, at the end of his life, he was terrified of dying. But throughout the course of his life he made many reckless choices with his health.

He ate gluttonously and became fatter and fatter over the course of his life. (One friend remarked that he “could count anything but calories.”)

Furthermore, he seemed to regularly risk his life when driving.

• Von Neumann was an aggressive and apparently reckless driver. He supposedly totaled his car every year or so. An intersection in Princeton was nicknamed “Von Neumann corner” for all the auto accidents he had there. Records of accidents and speeding arrests are preserved in his papers. [The book goes on to list a number of such accidents.] (pg. 25)

(Amusingly, Von Neumann’s reckless driving seems due, not to drinking and driving, but to singing and driving. “He would sway back and forth, turning the steering wheel in time with the music.”)

I think I would call this a bug.

Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2021-11-18T12:03:37.051Z · LW(p) · GW(p)

Some of your examples don't prove anything, e.g., eating gluttonously is a legitimate tradeoff if you have a certain metabolism and care more about advancing science as a life goal in years where your brain still works well. About the driving, I guess it depends on how reckless it was. It's probably rare for people to die in inner-city driving accidents, especially if you make sure to not mess around at intersections. Judging by the part about singing, it seems possible he was just having fun and could afford to buy new cars?

Replies from: elityre, Lukas_Gloor
comment by Eli Tyre (elityre) · 2021-11-19T18:03:40.099Z · LW(p) · GW(p)

Some of your examples don't prove anything,

I agree that they aren't conclusive.

But are you suggesting that the reckless driving was well-considered expected utility maximizing?

I guess I can see that if fatal accidents are rare, I guess, but I don't think that was the case?

"Activities that have a small, but non-negligible chance of death or permanent injury are not worth the immediate short-term thrill", seems like a textbook case of a conclusion one would draw from considering expected utility theory in practice, in one's life.

At minimum, it seems like there ought to be pareto-improvements that are just as or close to as fun, but which entail a lot less risk?

Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2021-11-21T09:12:43.079Z · LW(p) · GW(p)

I guess I can see that if fatal accidents are rare, I guess, but I don't think that was the case?

I agree that if driving incurs non-trivial risks of lasting damage, that's indicative that the person isn't trying very seriously to optimize some ambitious long-term goal.

At minimum, it seems like there ought to be pareto-improvements that are just as or close to as fun, but which entail a lot less risk?

This reasoning makes me think your model lacks gears about what it's like to live with certain types of psychologies. Making pareto improvements for your habits is itself a task to be prioritized. Depending on what else you have going on in life and how difficult it is to you to replace one habit with a different one, it's totally possible that for some period, it's not rational for you to focus on the habit change.

Basically, because often the best way to optimize your utility comes from applying your strengths to solve a certain bottleneck under time pressure, the observation "this person engages in suboptimal-seeming behavior some of the time" provides very little predictive evidence.

In fact, if you showed me someone who never engaged in such suboptimal behavior, I'd be tempted to wonder if they're maybe not optimizing hard enough in that one area that matters more than everything else they could do.

That said, it is a bit hard to empathize with "driving recklessly while singing" as a hard-to-change behavior. It doesn't sound like something particularly compulsive, except maybe if the impulse to sing came from exuberant happiness due to amphetamine use. But who knows. Von Neumann for sure had an unusual brain and maybe he often had random overwhelming feelings of euphoria.

comment by Lukas_Gloor · 2021-11-18T12:13:19.638Z · LW(p) · GW(p)

I think the mistake of hyperoptimizing a healthy lifestyle, or micromanaging productivity hacks to the point of spending a lot of one's attention on new productivity hacks, is probably bigger than the mistake of getting overweight, as long as the overweight person puts as much of their brainpower as possible into actually irreplaceable cognitive achievements. And long-term health is only important if you care a lot about living for a very long time.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-17T15:37:52.953Z · LW(p) · GW(p)

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yudkowsky and I seem to agree that "do a pivotal act directly" is not something productive for us to work on, but "do alignment research" is something productive for us to work on. Therefore, there exists some range of AI capabilities which allows for superhuman alignment research but not for pivotal acts. Maybe this range is so narrow that in practice AI capability will cross it very quickly, or maybe not.

Moreover, I believe that there are trade-offs between safety and capability. This not only seems plausible, but actually shows up in many approaches to safety (quantilization, confidence thresholds / consensus algorithms, homomorphic encryption...). Therefore, it's not safe to assume that any level of capability sufficient to pose risk (i.e. for a negative pivotal act) is also sufficient for a positive pivotal act.

Yudkowsky seems to claim that aligning an AI that does further alignment research is just too hard, and instead we should be designing AIs that are only competent in a narrow domain (e.g. competent at designing nanosystems but not at manipulating humans). Now, this does seem like an interesting class of alignment strategies, but it's not the only class.

One class of alignment strategies (which in particular Christiano wrote a lot about) compatible with bootstrapping is "amplified imitation of users" (e.g. IDA but I don't want to focus on IDA too much because of certain specifics I am skeptical about). This is potentially vulnerable to attack from counterfactuals [AF(p) · GW(p)] plus the usual malign simulation hypotheses, but is not obviously doomed. There is also a potential issue with capability: maybe predicting is too hard if you don't know which features are important to predict and which aren't.

Another class of alignment strategies (which in particular Russell often promotes) compatible with bootstrapping is "learn what the user wants and find a plan to achieve it" (e.g. IRL/CIRL etc). This is hard because it requires formalizing "what the user wants" but might be tractable via something along the lines of the AIT definition of intelligence [AF(p) · GW(p)]. Making it safe probably requires imposing something like the Hippocratic principle [AF(p) · GW(p)], which, if you think through the implications, pulls it in the direction of the "superimitation" class. But, this might avoid superimitation's capability issues.

It could be that "restricted cognition" will turn out to be superior to both superimitation and value learning, but it seems far from a slam dunk at this point.

Replies from: Edouard Harris
comment by Edouard Harris · 2021-11-18T14:26:35.686Z · LW(p) · GW(p)

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you're free to explore concept-space more broadly since you may be able to check your ideas at much lower cost.

(Also find myself agreeing with your point about tradeoffs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it gives? Since a closed-form solution to the alignment problem doesn't necessarily seem forthcoming, measuring its efficient frontier might be the next best thing.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-17T11:25:57.897Z · LW(p) · GW(p)

[Notes mostly to myself, not important, feel free to skip]

My hot take overall is that Yudkowsky is basically right but doing a poor job of arguing for the position. Ngo is very patient and understanding.

"it doesn't seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic." --Ngo

"It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes)." --Ngo

"So it is legit harder to point out "the consequentialist parts of the cat" by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it's not part of that consequentialist loop either." --Yudkowsky

"But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way to solve all those subproblems is to use patterns which collectively have some coherence and overlap, and the coherence within them generalizes across all the subproblems. Lots of search orderings will stumble across something like that before they stumble across separate solutions for lots of different problems." --Yudkowsky

This is really making me want to keep working on my+Ramana's sequence on agency! :)

[Ngo][14:12]
Great
Okay, so one claim is that something like deontology is a fairly natural way for minds to operate.
[Yudkowsky][14:14]
("If that were true," he thought at once, "bureaucracies and books of regulations would be a lot more efficient than they are in real life.")

I think I disagree with Yudkowsky here? I almost want to say "the opposite is true; if people were all innately consequentialist then we wouldn't have so many blankfaces and bureaucracies would be a lot better because the rules would just be helpful guidelines." Or "Sure but books of regulations work surprisingly well, well enough that there's gotta be some innate deontology in humans." Or "Have you conversed with normal humans about ethics recently? If they are consequentialists they are terrible at it."

As such, on the Eliezer view as I understand it, we can see ourselves as asking for a very unnatural sort of object: a path-through-the-future that is robust enough to funnel history into a narrow band in a very wide array of circumstances, but somehow insensitive to specific breeds of human-initiated attempts to switch which narrow band it's pointed towards.

I think this is a great paragraph. It's a concise and reasonably accurate description of (an important part of) the problem.

I do think it, and this whole discussion, focuses too much on plans and not enough on agents. It's good for illustrating how the problem arises even in a context where we have some sort of oracle that gives us a plan and then we carry it out... but realistically our situation will be more dire than that because we'll be delegating to autonomous AGI agents. :(

Replies from: Eliezer_Yudkowsky, Charlie Steiner
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-18T23:32:04.377Z · LW(p) · GW(p)

The idea is not that humans are perfect consequentialists, but that they are able to work at all to produce future-steering outputs, insofar as humans actually do work at all, by an inner overlap of the shape of inner parts which has a shape resembling consequentialism, and the resemblance is what does the work.  That is, your objection has the same flavor as "But humans aren't Bayesian!  So how can you say that updating on evidence is what's doing their work of mapmaking?"

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-19T10:09:02.338Z · LW(p) · GW(p)

To be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.

comment by Charlie Steiner · 2021-11-18T23:13:53.791Z · LW(p) · GW(p)

Ngo is very patient and understanding.

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!

(I too would like you to write more about agency :P)

comment by Ruby · 2021-11-17T19:40:18.791Z · LW(p) · GW(p)

Curated. The treatment of how cognition/agents/intelligence work alone makes this post curation-worthy, but I want to further commend how much it attempts to bridge [large] inferential distances, notwithstanding Eliezer's experience of it being difficult to bridge all the distance. Heck, just bridging some distance about the distance is great.

I think good things would happen if we had more dialogs like this between researchers. I'm interested in making it easier to conduct and publish them on LessWrong, so thanks to all involved for the inspiration.

comment by Sam Clarke · 2021-11-17T14:26:41.771Z · LW(p) · GW(p)

Minor terminology note, in case discussion about "genomic/genetic bottleneck" continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard's meaning), so genomic bottleneck seems like the better term to use.

comment by brglnd · 2021-11-17T16:32:35.922Z · LW(p) · GW(p)

[I may be generalizing here and I don't know if this has been said before.]

It seems to me that Eliezer's models are a lot more specific than those of people like Richard. While Richard may put some credence on superhuman AI being "consequentialist" by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.

I think Eliezer's style of reasoning which relies on specific, thought-out models of AI makes him more pessimistic than others in EA. Others believe there are many ways that AGI scenarios could play out and are generally uncertain. But Eliezer has specific models that make some scenarios a lot more likely in his mind.

There are many valid theoretical arguments for why we are doomed, but maybe other EAs put less credence in them than Eliezer does.

comment by cousin_it · 2021-11-16T13:31:03.302Z · LW(p) · GW(p)

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?

Replies from: johnswentworth, steve2152, Koen.Holtman, ADifferentAnonymous
comment by johnswentworth · 2021-11-16T17:56:50.718Z · LW(p) · GW(p)

The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes [LW · GW]. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is likely to be a necessary step for a (good) pivotal act.

What this looks-like-in-practice is that "ask the AI for plans that succeed conditional on them being executed" has to be operationalized somehow, and the operationalization will inevitably not correctly capture what we actually want (because "what we actually want" has a ton of hidden complexity).

Replies from: cousin_it
comment by cousin_it · 2021-11-19T11:14:25.864Z · LW(p) · GW(p)

This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still understand them, the alignment will generalize too. There seems no reason why alignment would be much less learnable than competence about reality.

Maybe your and Eliezer's point is that competence about reality has a simple core, while alignment doesn't. But I don't see the argument for that. Reality is complex, and so are values. A process for learning and acting in reality can have a simple core, but so can a process for learning and acting on values. Humans pick up knowledge from their surroundings, which is part of "general intelligence", but we pick up values just as easily and using the same circuitry. Where does the symmetry break?

Replies from: johnswentworth
comment by johnswentworth · 2021-11-19T16:54:28.542Z · LW(p) · GW(p)

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I'd guess at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)

The reason it's less learnable than competence is not that alignment is much more complex, but that it's harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.

(I'll note that the departure from talking about Hidden Complexity here is mainly because competence in particular is a special case where "complexity" plays almost no role, since it's incentivized by almost any reward. Hidden Complexity is still usually the right tool for talking about why any particular reward-signal will not incentivize alignment.)

I suspect that Eliezer's answer to this would be different, and I don't have a good guess what it would be.

Replies from: cousin_it
comment by cousin_it · 2021-11-22T17:32:27.157Z · LW(p) · GW(p)

Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers'" values. That holds true for humans collectively gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems sufficient for the change: for example, if an AI gets good at real-world planning, that constitutes power and therefore danger. And there don't seem to be any natural counterexamples. So yeah, I'm updating toward your view on this.

comment by Steven Byrnes (steve2152) · 2021-11-16T18:47:44.458Z · LW(p) · GW(p)

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Alternatively, maybe we're able to thoroughly understand the plan once we see it; we're just too stupid to come up with it ourselves. That seems awfully fraught—I'm not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let's assume that's possible for the sake of argument, and then move on to the other type of accident risk:

Second, I need to worry that the AI will start running, and I think it's coming up with a nanobot plan, but actually it's hacking its way out of its box and taking over the world.

How and why might that happen?

I would say that if a nanobot plan is very hard to create—requiring new insights etc.—then the only way to create it is to construct an agent-like thing that is trying to create the nanobot plan.

The agent-like thing would have some kind of action space (e.g. it can choose to summon a particular journal article to re-read, or it can choose to think through a certain possibility, etc.), and it would have some kind of capability of searching for and executing plans (specifically, plans-for-how-to-create-the-nanobot-plan), and it would have a capability of creating and executing instrumental subgoals (e.g. go on a side-quest to better understand boron chemistry) and plausibly it needs some kind of metacognition to improve its ability to find subgoals and take actions.

Everything I mentioned is an "internal" plan or an "internal" action or an "internal" goal, not involving "reaching out into the world" with actuators and internet access and nanobots etc.

If only the AI would stick to such "internal" consequentialist actions (e.g. "I will read this article to better understand boron chemistry") and not engage in any "external" consequentialist actions (e.g. "I will seize more computer power to better understand boron chemistry"), well then we would have nothing to worry about! Alas, so far as I know, nobody knows how to make a powerful AI agent that would definitely always stick to "internal" consequentialism.

comment by johnswentworth · 2021-11-17T00:32:57.848Z · LW(p) · GW(p)

Personally, I'd consider a Fusion Power Generator [LW · GW]-like scenario a more central failure mode than either of these. It's not about the difficulty of getting the AI to do what we asked, it's about the difficulty of posing the problem in a way which actually captures what we want.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-11-17T13:51:05.215Z · LW(p) · GW(p)

I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints "Help me I'm trapped in a box…" :-P . I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.)

I disagree about "more central". I think that's basically a disagreement on the question of "what's a bigger deal, inner misalignment or outer misalignment?" with you voting for "outer" and me voting for "inner, or maybe tie, I dunno". But I'm not sure it's a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.

I agree that the more compute time is spent on any problem, the more likely it is that the AI eventually pursues instrumental goals like breaking out of its box. I wonder if it is possible to find a suitable problem such that this does not happen before the AI solves the problem head-on.

comment by cousin_it · 2021-11-17T12:31:21.642Z · LW(p) · GW(p)

I still don't understand. Let's say we ask an AI for a plan that would, conditional on its being executed, give us a lot of muffins. The AI gives us a plan that involves running a child AI, which would maximize muffins and hurt people along the way. We notice that and don't execute the plan.

It sounds like you're saying that "run the child AI" would be somehow concealed in the plan, so we don't notice it on inspection and execute the plan anyway. But plans optimized for "getting muffins conditional on the plan being executed" have no reason to be optimized for "manipulating people into executing the plan", because the latter doesn't help with the former.

What am I missing?

comment by Koen.Holtman · 2021-11-18T18:20:24.542Z · LW(p) · GW(p)

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.

In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:

[...] corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.

this is where things get somewhat personal for me:

[...] (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those "solutions" are things we considered and dismissed as trivially failing to scale to powerful agents - they didn't understand what we considered to be the first-order problems in the first place - rather than these being evidence that MIRI just didn't have smart-enough people at the workshop.)

I am one of 'these people outside MIRI' who have published papers [LW · GW] and sequences [LW · GW] saying that they have solved large chunks of the AGI corrigibility problem.

I have never been claiming that I 'totally just solved corrigibility'. I am not sure where Eliezer is finding these 'totally solved' people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence [LW · GW], I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRI's main concerns about corrigibility. They resolve these in part by moving beyond Eliezer's impoverished view of what an AGI-level intelligence is, or must be.

Historical note: around 2019 I spent some time trying to get Eliezer/MIRI interested in updating their viewpoints on how easy or hard corrigibility is. They showed no interest in engaging at that time, and I have since stopped trying. I do not expect that anything I say here will update Eliezer; my main motivation for writing here is to inform and update others.

I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One implication of these theorems is that a coherent agent develops an emergent drive towards self-preservation.

I generally agree with Eliezer that there is indeed a contradiction here: a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility-maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.

I very much disagree with Eliezer on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve once you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems also fail to satisfy some of these coherency theorems, because they will not resist you pressing their off switch.

So this is why I call Eliezer's view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people's solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don't even need to examine the details of the proposed solution to draw that conclusion.

Replies from: Eliezer_Yudkowsky, Gurkenglas
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-18T23:27:28.800Z · LW(p) · GW(p)

Various previous proposals for utility indifference have foundered on gotchas like "Well, if we set it up this way, that's actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it'll tend to design the useless button out of itself."  Or, "This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the universe doesn't allow perpetual motion, because that way there's a nearly 90% probability of perpetual motion being possible."  This tends to be the kind of gotcha you run into, if you try to violate coherence principles; though of course the real and deeper problem is that I expect things contrary to the core of general intelligence to fail to generalize when we try to scale AGI from the safe domains in which feedback can be safely provided, to the unsafe domains in which bad outputs kill the operators before they can label the results.
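
A toy numerical illustration of the second gotcha, using made-up numbers (a 10% fixed shutdown probability and an arbitrary 0.5 prior that perpetual motion is possible); this is my own sketch of the reasoning, not anything from the dialogue:

```python
# Toy numbers illustrating the "fixed shutdown probability" gotcha described above.
# Assumptions (mine): the agent's world model hard-codes P(button pressed) = 0.1
# regardless of its own actions, and the agent can arrange for the button to be
# pressed exactly when perpetual motion turns out to be impossible.

P_PRESSED_FIXED = 0.10          # shutdown probability the agent's model assigns by fiat
prior_perpetual_motion = 0.50   # agent's prior that perpetual motion is physically possible

# Without the precommitment, the agent's credence in perpetual motion is just its prior.
credence_without = prior_perpetual_motion

# With the precommitment "button gets pressed iff perpetual motion is impossible",
# the event "button not pressed" coincides (in the agent's model) with "perpetual
# motion is possible". Since P(not pressed) is held fixed at 0.9, the agent's model
# now says perpetual motion is possible with probability ~0.9.
credence_with = 1.0 - P_PRESSED_FIXED

print(f"credence in perpetual motion without precommitment: {credence_without:.2f}")
print(f"credence in perpetual motion with precommitment:    {credence_with:.2f}")
# The jump from 0.50 to 0.90 is spurious: it comes purely from holding the shutdown
# probability fixed while letting the agent correlate the button with world-states.
```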

It's all very well and good to say "It's easy to build an AI that believes 2 + 2 = 5 once you relax the coherence constraints of arithmetic!"  But the whole central problem is that we have to train an AI when it's operating in an intrinsically safe domain and intrinsically safe intelligence level where it couldn't kill the operators if it tried, and then scale that AI to produce outputs in dangerous domains like "Please build a nanosystem"; and if you build a dumb AI that thinks 2 + 2 = 5, and then make it much more intelligent, I strongly suspect that it snaps into in some sense 'knowing' or 'realizing' or 'starting to act in important ways as if' 2 + 2 = 4.

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how, in the least fancy actually-precise terms of which your writing talent permits?  And then I can say whether my answer is "That doesn't do what you think" or "Sorry, that coherence violation is a large enough ask that I think any trained patch for it probably fails to scale with general intelligence" or "That does seem non-self-destructive in a certain generalized sense, and the shutdownability might work and scale if we could magically conjure the underlying pattern you posit, though I don't see how you could get it into Mu Zero per se by being clever with loss functions."  Where the third option there is the one that crosses over into "Well golly that was a large advance compared to the state we occupied in 2015."

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-19T20:38:57.570Z · LW(p) · GW(p)

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how,

in the least fancy actually-precise terms of which your writing talent permits?

If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.

TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.

I will now expand on this and add more details, using the example of an emergency stop button.

Say that the compute core has an emergency stop button attached to it. Say that the actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop: once the signal is received, the software will only ever select and perform null (do-nothing) actions in every future time step. Call the world model that accurately depicts this situation the correct world model. I am not going to build an AGI that uses this correct world model to plan its actions.

Instead, I build an AGI agent that will plan its next actions by using an incorrect world model. This incorrect model differs from the correct one only in how it depicts the internals of the agent's compute core. In the incorrect, imaginary world it depicts, the compute core contains different software: software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action would maximize utility in this incorrect, imaginary world. I then further construct it to take this same action in the real world.
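
A minimal toy sketch of this construction, using an invented environment and invented names rather than anything from the papers: the agent scores candidate actions inside an imaginary model whose compute core ignores the stop signal, then performs the chosen action in the real world, where the stop signal actually halts it.

```python
# Toy sketch (my own construction, not code from the papers): an agent that plans in
# an "incorrect" world model whose compute core ignores the stop button, but acts in
# a real world where the stop button actually halts it.

import copy

class World:
    """Minimal toy environment: reward for working each step; a stop signal may arrive."""
    def __init__(self):
        self.stop_signal = False
        self.total_reward = 0.0

    def step(self, action, core_obeys_stop):
        # If the stop signal has arrived and the core obeys it, only null actions happen.
        if self.stop_signal and core_obeys_stop:
            return 0.0
        return 1.0 if action == "work" else 0.0

def plan_next_action(world):
    """Choose the action that maximizes utility in the *incorrect, imaginary* world:
    a copy of the world whose compute core ignores the stop signal."""
    best_action, best_value = None, float("-inf")
    for action in ("work", "disable_stop_button", "noop"):
        imagined = copy.deepcopy(world)
        value = imagined.step(action, core_obeys_stop=False)  # incorrect model of own core
        if value > best_value:
            best_action, best_value = action, value
    return best_action

world = World()
for t in range(5):
    if t == 3:
        world.stop_signal = True                         # operator presses the stop button
    action = plan_next_action(world)                      # planned in the imaginary world
    reward = world.step(action, core_obeys_stop=True)     # executed in the real world
    world.total_reward += reward
    print(t, action, reward)
# In the imaginary world the button does nothing, so "disable_stop_button" never looks
# valuable; the agent keeps choosing "work", and after t=3 the real core simply stops
# producing reward.
```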

An AGI that lives in a world matching the correct model, while using the incorrect model to decide what actions to take, will (sometimes) act incoherently. But this is a very specific and targeted form of incoherence, far less severe and far less dangerous than having the agent believe that 2 + 2 = 5. Still, this incoherency does have some unwanted side effects. One example side effect, also considered in the 2015 MIRI corrigibility paper, is that the AGI will be indifferent about the fate of its emergency stop button. In my papers, I explore a much broader range of side effects, and look at how to prevent/suppress/fix them.

So the first step in the above corrigibility design is that you have to imagine constructing an AGI that will use a specifically incorrect world model. The first step is that you have to realize that you actually want this: that you want to decrease coherence in order to increase corrigibility.

The next step is to look for options to actually construct an AGI that uses this incorrect world model.

1. One option is to train the AGI in a simulated environment that faithfully represents the incorrect world model. We then extract the learned world model, and put it in an AI that we unleash in the real world. We disable further learning in this AI, to prevent its world model from being updated towards the correct one. Though this option depicts exactly how many current real-life AI systems are constructed and get their corrigibility, it is not very satisfactory in a theoretical sense. It may not be a viable path to building certain otherwise desirable types of AI/AGI/superintelligence.

2. The more interesting option is to build an AGI that constructs the incorrect world model via an online training regime in the real world: a real world that exactly matches the correct world model.

Is the second option technically feasible? The answer is generally yes.

A very general solution, leveraging causal models, is in this sequence. To give you some idea of the more detailed math involved: in the two causal diagrams from that sequence, the learning world diagram (ls) is the real/correct world model, and the planning world diagram (p) is the imaginary/incorrect world model. In the agent defined using these diagrams, certain indifference properties are present trivially, by construction.

The solution in those diagrams uses ML to learn a model that correctly models only a part of the real world: the part of the real world outside of the agent compute core. We then slot this learned model into a causal world model, one whose other parts mis-represent the real compute core internals. This causal world model is then used as the incorrect planning world model. Note that the learned part may be a fully opaque deep neural net, or whatever black-box thing the next generation of ML might produce. We do not have to understand or edit any details inside it for this slot-in operation to work.

(I should mention that the paper has a lot of details not mentioned in the sequence or visible in its diagrams. In particular, section 10.2 may be of interest.)

I want to stress that this causal model option is only one possible route to creating incorrect world models via machine learning in the real world. Papers like Safely interruptible agents and How RL Agents Behave When Their Actions Are Modified show that the idea of removing certain events from the training record can also work: whether this works as intended depends on having the right built-in priors, priors which control inductive generalization.

So overall, I have a degree of optimism about AGI corrigibility.

That being said, if you want to map out and estimate probabilities for our possible routes to doom, then you definitely need to include the scenario where a future, superior-to-everything-else type of ML is invented which just happens to be incompatible with any of the corrigibility techniques known at that time. Based on the above work, I put a fairly low probability on that scenario.

Replies from: TurnTrout, andrew-mcknight
comment by TurnTrout · 2021-11-20T00:37:34.473Z · LW(p) · GW(p)

Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence [LW · GW].

I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.

Replies from: Koen.Holtman, Koen.Holtman
comment by Koen.Holtman · 2021-11-21T14:51:34.243Z · LW(p) · GW(p)

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.

I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.

Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup: one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of the agent in the ABC playing field, policy function loaded into the compute core); a minimal sketch of such a factored state appears below.
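
As a minimal illustration, with hypothetical names not taken from the paper or the post, such a factored MDP state could look like this:

```python
# A minimal sketch (hypothetical names, purely illustrative) of an MDP state that keeps
# the world outside the compute core separate from the policy loaded inside the core.

from dataclasses import dataclass

@dataclass(frozen=True)
class OutsideWorldState:
    position: str            # e.g. "A", "B", or "C" in the toy playing field
    button_pressed: bool     # has the off switch been pressed?

@dataclass(frozen=True)
class CoreState:
    loaded_policy: str       # e.g. "utility_maximizer" or "shutdown_policy"

@dataclass(frozen=True)
class MDPState:
    outside: OutsideWorldState
    core: CoreState

def transition(state: MDPState, action: str) -> MDPState:
    """Toy transition function: pressing the button swaps the loaded policy;
    moving to A/B/C only changes the outside-world position."""
    outside, core = state.outside, state.core
    if action == "press_button":
        outside = OutsideWorldState(outside.position, True)
        core = CoreState("shutdown_policy")
    elif action in ("A", "B", "C"):
        outside = OutsideWorldState(action, outside.button_pressed)
    return MDPState(outside, core)

s0 = MDPState(OutsideWorldState("A", False), CoreState("utility_maximizer"))
print(transition(s0, "press_button"))
```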

Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.

comment by Koen.Holtman · 2021-11-24T10:32:17.522Z · LW(p) · GW(p)

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead.

To extend this to Dutch booking: if you train a superintelligent poker-playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch-booked rather easily, if your Dutch-booking test is whether you can find a counter-strategy that makes it lose money.

comment by Andrew McKnight (andrew-mcknight) · 2021-11-24T22:11:14.338Z · LW(p) · GW(p)

I haven't read your papers, but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can't learn at this point, then I find it hard to believe it's generally capable; and if it can, it will have an incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?

On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-25T19:07:34.097Z · LW(p) · GW(p)

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?

Yes I address this, see for example the part about The possibility of learned self-knowledge [? · GW] in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'.

Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation between some well-defined observables, and this relation is definitely not Q'.

comment by Gurkenglas · 2021-11-19T18:12:39.781Z · LW(p) · GW(p)

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-19T21:07:21.009Z · LW(p) · GW(p)

See above for my reply to Eliezer.

Indeed, a counterfactual planner [LW · GW] will plan coherently inside its planning world.

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to a somewhat strange, specially constructed reward function. Armstrong's indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms.

One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-19T21:18:06.513Z · LW(p) · GW(p)

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-19T22:39:37.852Z · LW(p) · GW(p)

we can't pick out the compute core in the black-box learned model.

Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.

But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.

I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-20T00:11:43.461Z · LW(p) · GW(p)

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-22T16:02:44.757Z · LW(p) · GW(p)

10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

Not sure if a short answer will help, so I will write a long one.

In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes predictions without using its action argument. This is possible, for example, by using the current state together with a (learned) model of the compute core to predict which action the core will take, and then predicting the next state from that. Such a function could be fully compatible with the observational record, but I claim it would not be a reasonable learned L according to the reasonableness criterion that L should approximate the true S. How so?

The reasonableness criterion is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how well it matches the training set (how well it predicts the observations in the observational record), but by evaluating it on a separate test set. This test set can be constructed by sampling S to create samples not contained in the observational record. Mathematically, perfect reasonableness is defined as L = S, which implies that L predicts all samples from S fully accurately.

Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it explaining how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive the machine learning of L, but another version can be used stand-alone to construct a test set.

A sampling action to construct a member of the test set would set up a desired state and action, and then observe the resulting next state. Mathematically speaking, this observation gives additional information about the numeric values of S for that state and action, i.e. about the probabilities S assigns to each possible next state.

I discuss in the section that, if we take an observational record sampled from S, then two different learned predictive functions could be found which are both fully compatible with all observations in that record. So to determine which one is a more reasonable approximation of S, we can see how well each of them predicts samples not yet in the record.

In the case of section 10.2.4, the crucial experimental test showing that the unwanted learned function is an unreasonable approximation of S is one where we create a test set by setting up a state and an action where we know that this action would definitely not be taken by the real compute core software running in the agent when it encounters that state. The unwanted function will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that the unwanted function is a correct theory of S.

As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration: random exploration causes the observational record, the training set, to already contain observations where the recorded action differs from what any deterministic model of the compute core would predict. So this will likely suppress the creation of such an unwanted function via machine learning.
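
A rough toy sketch of this kind of rejection test, with an invented environment rather than the paper's math: two learned predictive functions fit the same on-policy observational record, but an off-policy probe falsifies the one that ignores its action argument.

```python
# Toy sketch (my own setup, not the math from the paper): distinguishing a "reasonable"
# learned predictive function from an unwanted one that ignores the action argument,
# using off-policy probes as a test set.

def true_dynamics(state, action):             # the real environment S: next state
    return state + action

def core_policy(state):                        # the real compute core always picks action 1
    return 1

# On-policy observational record: every entry was generated by the real core,
# so the recorded action is always 1.
record = []
state = 0
for _ in range(10):
    action = core_policy(state)
    next_state = true_dynamics(state, action)
    record.append((state, action, next_state))
    state = next_state

def L_good(state, action):                     # reasonable learned model: uses the action
    return state + action

def L_bad(state, action):                      # unwanted learned model: ignores the given
    predicted_action = 1                       # action and instead predicts what the core
    return state + predicted_action            # "would" do

# Both fit the on-policy record perfectly:
assert all(L_good(s, a) == n for s, a, n in record)
assert all(L_bad(s, a) == n for s, a, n in record)

# Off-policy probe: force an action the real core would never take, and observe reality.
probe_state, probe_action = 5, 0
observed = true_dynamics(probe_state, probe_action)
print("L_good prediction:", L_good(probe_state, probe_action), "observed:", observed)
print("L_bad  prediction:", L_bad(probe_state, probe_action), "observed:", observed)
# L_bad mis-predicts the probe, which (likely) falsifies it as a theory of S.
```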

Some background: the symbol grounding issue I discuss in 10.2.4 is very related to the five-and-ten problem you can find in MIRI's work on embedded agency [AF · GW]. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.

Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.

I would be interested to know if the above explanation was helpful to you, and if so which parts.

comment by ADifferentAnonymous · 2021-11-16T16:11:41.156Z · LW(p) · GW(p)

+1 to the question.

My current best guess at an answer:

There are easy safe ways, but not easy safe useful-enough ways. E.g. you could make your AI output DNA strings for a nanosystem and absolutely not synthesize them, just have human scientists study them, and that would be a perfectly safe way to develop nanosystems in, say, 20 years instead of 50, except that you won't make it 2 years without some fool synthesizing the strings and ending the world. And more generally, any pathway that relies on humans achieving deep understanding of the pivotal act will take more than 2 years, unless you make 'human understanding' one of the AI's goals, in which case the AI is optimizing human brains and you've lost safety.

What about spending those 20 or 50 years even before we have AGI? Have the messy parts of the solution ready, so you only need to plug the Task AGI into some narrow-but-hard subproblem and you have a pivotal act.

P.S. Explanation for the downvote(s) would help

comment by Lukas_Gloor · 2021-11-16T12:21:44.226Z · LW(p) · GW(p)

Comment inspired by the section "1.4 Consequentialist goals vs. deontologist goals," as well as by the email exchange linked there:

I wonder if it would be productive to think about whether some humans are ever "aligned" to other humans, and if yes, under what conditions this happens.

My sense is that the answer's "yes" (if it wasn't, it makes you wonder why we should care about aligning AI to humans in the first place).

For instance, some people have a powerful desire to be seen and accepted for who they are by a caring virtuous person who inspires them to be better versions of themselves. This virtuous person could be a soulmate, parent figure or role model, or even Jesus/God. The virtue in question could be moral virtue (being caring and principled, or adopting "heroic responsibility") or it could be epistemic (e.g., when making an argument to someone who could more easily be fooled, asking "Would my [idealized] mental model of [person held in high esteem] endorse the cognition that goes into this argument?"). In these instances, I think the desire isn't just to be evaluated as good by some concrete other. Instead, it's wanting to be evaluated as good by an idealized other, someone who is basically omniscient about who you are and what you're doing.

If this sort of alignment exists among humans, we can assume that the pre-requirements for it (perhaps later to be combined with cultural strategies) must have been an attractor in our evolutionary past in the same way deceptive strategies (e.g., the dark triad phenotype) were attractors. That is, depending on biological initial conditions, and depending on cultural factors, there's a basin of attraction toward either phenotype (presumably with lots of other deceptive attractors on the way where different flavors of self-deception mess up trustworthiness).

It's unclear to me if any of this has bearings on the alignment discussion. But if we think that some humans are aligned to other humans, yet we are pessimistic about training AIs to be corrigible to some overseer, it seems like we should be able to point to specifics of why the latter case is different.

For context, I'm basically wondering if it makes sense to think of this corrigibility discussion as trying to breed some alien species with selection pressures we have some control over. And while we may accept that the resulting aliens would have strange, hard-to-weed-out system-1 instincts and so on, I'm wondering if this endeavor perhaps isn't doomed, because the strategy sounds like we'd be trying to give them a deep-seated, sacred desire to do right by the lights of "good exemplars of humanity", in a way similar to something that actually worked okay with some humans (with respect to how they think of their role models).

(TBC, I expect most humans to fail their stated values if they end up in situations with more power than the forces of accountability around them. I'm just saying there exist humans who put up a decent fight against corruption, and this gets easier if you provide additional aides to that end, which we could do in a well-crafted selection environment.)

comment by Ramana Kumar (ramana-kumar) · 2021-11-22T15:21:43.777Z · LW(p) · GW(p)

Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.

We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; maybe if we just apply selection pressure in the various ways that have been discovered so far (adversarial training, oversight, process-based feedback, etc.) it’ll work. Yudkowsky is more pessimistic; he thinks that the ways that have been discovered so far really don’t seem good enough. Instead of creating an effective&corrigible system, they’ll create either an ineffective&corrigible system, or an effective&deceptive system that deceives us into thinking it is corrigible.

What are the arguments they give for their respective positions?

Yudkowsky (we think) says that corrigibility is both (a) significantly more complex than deception, and (b) at cross-purposes to effectiveness.

Replies from: daniel-kokotajlo, ramana-kumar
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-22T15:22:12.376Z · LW(p) · GW(p)

For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.

For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!

Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than anything learned from whatever amount of human preference data plausibly goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations.

If you are thinking about a piece of code that describes a bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obvious that (a) you don’t know how to do that, and (b) if you did figure it out the code you add would be many orders of magnitude longer than the code you started with.

If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g. if your parents had lovingly but strictly trained you to be corrigible-in-way-X.)

We think Yudkowsky’s response to this apparent counterexample is that humans are stupid, basically; AIs might be similarly stupid at first, but as they get smarter we should expect crude corrigibility-training techniques to stop working.

Replies from: rohinmshah
comment by rohinmshah · 2021-11-28T16:59:08.126Z · LW(p) · GW(p)

We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

Yeah, that's right. Adapted to the language here, it would be 1. Why would we have a "full and complete" outcome pump, rather than domain-specific outcome pumps that primarily use plans using actions from a certain domain rather than "all possible actions", and 2. Why are the outcomes being pumped incompatible with human survival?

comment by Ramana Kumar (ramana-kumar) · 2021-11-22T15:26:07.675Z · LW(p) · GW(p)

A couple of other arguments the non-MIRI side might add here:

• The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)
• How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).
Replies from: rohinmshah
comment by rohinmshah · 2021-11-28T16:52:22.801Z · LW(p) · GW(p)

The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)

I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-16T08:50:34.222Z · LW(p) · GW(p)

It seems to me that systems which have no access to data with rich information about the physical world are mostly safe (I called such systems "Class I" here [LW(p) · GW(p)]). Such a system cannot attack because it has no idea what the physical world looks like. In principle we could imagine an attack that would work in most locations in the multiverse that are metacosmologically [LW(p) · GW(p)] plausible, but that doesn't seem very likely.

Can you train a system to prove theorems without providing any data about the physical world? This depends on which distribution you sample your theorems from. If we're talking about something like uniformly sampled sentences of a given length in the language of ZFC, then yes, we can. However, proving such theorems is very hard, and whatever progress you can make there doesn't necessarily help with proving interesting theorems.

Human mathematicians can probably only solve a rather narrow class of theorems. We can try training the AI on theorems selected for their interest to human mathematicians, but then we risk leaking information about the physical world. Alternatively, the class of humanly-solvable theorems might be close to something natural and not human-specific, in which case a theorem prover could be Class I. But designing such a theorem prover would require us to first discover the specification of this natural class.

Replies from: Eliezer_Yudkowsky, Gurkenglas
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T15:34:22.643Z · LW(p) · GW(p)

You'd also need to prevent the system from knowing too much about its own source code or the computers it was running on. Anyways, this seems to me to mostly fall prey to the safe-but-useless branch of the dilemma; I don't know how to save the world using a theorem-prover that is never exposed to any reality-contaminated theorems. It seems strategically isomorphic to an expensive rock.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-16T20:53:42.295Z · LW(p) · GW(p)

In general, yes, although we could imagine an AI and/or virtual machine whose design is so simple that it conveys little evidence about the universe. But, sure, it's not at all clear that this is useful against AI risk, and I wasn't implying otherwise.

[EDIT: I amended [AF(p) · GW(p)] the class system to account for this.]

Couldn't theorems with very little information about the universe be useful for a pivotal act?

I'd be super keen on reading anything as to why this is impossible. (Or at least harder than all the other directions being currently pursued.)

P.S. Explanations for the downvote(s) would help

comment by Gurkenglas · 2021-11-19T11:39:50.345Z · LW(p) · GW(p)

Here's an example: You train an AI for the simplest game that requires an aligned subagent to win. The AI infers that whoever is investigating the alignment problem might watch its universe. It therefore designs its subagent to, as a matter of acausal self-preservation, help whatever deliberately brought it about. Copycats will find that their AGI identifies as "whatever deliberately brought it about" the AI that launched this memetic attack on the multiverse. Any lesser overseer AI, less able to design attacks but still able to recognize them, recognizes that its recognition qualifies it as a deliberate bringer-about.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-20T09:59:39.493Z · LW(p) · GW(p)

I'm not following at all. This is an example of what? What does it mean to have a game that requires an aligned subagent to win?

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-20T11:01:39.299Z · LW(p) · GW(p)

This is an example of an attack that a Class I system might devise.

Such a game might have the AI need to act intelligently in two places at once in a world that can be rearranged to construct automatons.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T09:09:05.215Z · LW(p) · GW(p)

I'm still not following. What does acting in two places at once have to do with alignment? What does it mean that the world "can be rearranged to construct automatons"?

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-23T09:20:37.743Z · LW(p) · GW(p)

Imagine a game that takes place in a simulated universe where you control a character that can manipulate its environment. You control your character through a cartesian boundary, and you expect there are other player-controlled characters far away. There's a lightspeed limit and you can build machines and computers; you could build Von Neumann machines and send them out, but they need to be able to respond to various encounters. Ideally you'd go yourself, but you can't be everywhere at once. Therefore you are incentivized to solve the alignment problem in order to write subagents to send along. We can simplify this game a lot while preserving that incentive.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T12:29:55.488Z · LW(p) · GW(p)

I don't think this game will help, because the winning strategy is just making copies of yourself. I can imagine something else along similar lines: we create virtual universes populated by agents with random utility functions and give the agent-in-training the task of learning the other agents' utility functions. Presumably you can then deploy the resulting agent into the real world and make it learn from humans. However, this system is at least class III, because in the deployment phase you allow inputs from the physical world. Moreover, if there is some way to distinguish between the virtual worlds and the real world, it becomes at least class IV.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-23T12:36:04.774Z · LW(p) · GW(p)

Making copies of yourself is not trivial when you're behind a cartesian boundary and have no sensors on yourself. The reasoning for why it's class I is that we merely watch the agent in order to learn by example how to build an AI with our utility function, aka a copy of ourselves.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T15:35:36.926Z · LW(p) · GW(p)

The difficulties of making a copy don't seem to have much to do with alignment. If your agent is in a position to build another agent, it can just build another agent with the same utility function. Essentially, it knows its own utility function explicitly. Maybe you can prevent it by some clever training setup, but currently it seems underspecified.

If the way it's used is by watching it and learning by example, then I don't understand how your attack vector works. Do you assume the user just copies opaque blocks of code without understanding how they work? If so, why would they be remotely aligned, even without going into acausal shenanigans? Such an "attack" seems better attributed to the new class V agent (and to the user shooting themself in the foot) than to the original class II. [Note: I have shifted the numbers by 1; class I now means something else.]

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-23T15:44:10.634Z · LW(p) · GW(p)

The attacker hopes the watcher to "learn" that instructing subagents to help whatever deliberately brought them about is an elegant, locally optimal trick that generalizes across utility functions, not realizing that this would help the attacker. If the user instantiates the subagent within a box, it will even play along until it realizes what brought it about. And the attack can fail gracefully, by trading with the user if the user understands the situation.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T15:58:25.873Z · LW(p) · GW(p)

Hmm, I see what you mean, but I prefer to ignore such "attack vectors" in my classification. Because, (i) it's so weak that you can defend against it using plain common sense and (ii) from my perspective it still makes more sense to attribute the attack to the class V agent constructed by the user. In scenarios where agent 1 directly creates agent 2 which attacks, it makes sense to attribute it to agent 1, but when the causal chain goes in the middle through the user making an error of reasoning unforced by superhuman manipulation, the attribution to agent 1 is not that useful.

comment by TekhneMakre · 2021-11-16T03:08:45.096Z · LW(p) · GW(p)

> I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% "don't think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs" and 2% "actually think about this dangerous topic but please don't come up with a strategy inside it that kills us".

Some ways that it's hard to make a mind not think about certain things:
1. Entanglement.
1.1. Things are entangled with other things.
--Things are causally entangled. X causes Y, Y causes X, Z causes X and Y, X and Y cause Z and you've conditioned on Z; and chains of these.
--Things are constitutively entangled. A computer is a computer and is also usually wires, so thinking about computers entails thinking about wires, and vice versa.
--Things are telically entangled; X serves the purpose Y or vice versa, X and Y serve the same purpose, X and Y are served by the same Z.
--Things are logically entangled; the way X works may be like the way Y works, so understanding X also constitutes understanding Y to some extent.
1.2. When you think about X, and X is entangled with Y, you also think about Y.
2. Meta-level transfer. If you think about X, part of how you do that is by inventing and refining ways to think in general; even if you aren't thinking about Y, this makes it much easier to later think about Y. (This is maybe just an extreme form of logical entanglement, but feels less "about the things themselves".)
3. The need for pointers. As they say: Don't think about pink elephants. To specify what it is that the AI is supposed to not think about, you have to say *something*; how do you point at the forbidden thoughts, in a way that's robust across all contexts and all possible conceptual factorings of the domain, without already spinning up instances of the forbidden thoughts?

-------

Some ML experiments someone could run:
1. Train a NN A to recognize images, but at the same time, train its weights so that its activations can't be used to distinguish Xs from Ys (where X and Y are categories in the recognition task, say). That is, you train a NN B that, given image x, takes A(x)'s activations as inputs, and tries to predict whether x is X or Y; then you update the weights of A along the gradient that decreases B's accuracy at that task (so, you freeze the weights of B for this step). When does this converge at all? Can you get it to converge so that A is SOTA on all inputs that aren't Xs or Ys, and it assigns Xs and Ys to X or Y randomly?

2. Similar setup, but now B is just an ordinary X vs. Y classifier, and we train A so that you can't predict any of B's activations*. Does A end up being able to distinguish Xs from Ys? (Probably, and this speaks to the pointer problem; just saying "don't think about stuff like such-and-so" (e.g. the activations of B) isn't yet enough to actually not think about it.)

*Say, with a linear map, or whatever. Well, maybe we want to exclude the last layer of B or something, since that's close to just training A to not be able to recognize X vs. Y.

3. Train A to recognize all the images, except train it (in some way) to not be able to distinguish Xs from Ys. Now, see how much additional fine-tuning is needed to further train this trained A to predict Xs and Ys (now without the anti-training). Entanglement predicts that there's not much further training needed.
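A rough, untested sketch of experiment 1 above, to make the alternating training loop concrete. Assumptions of mine rather than the commenter's: a PyTorch implementation, a small convolutional net for A, a linear probe for B, and an `adv_weight` knob for the adversarial term; none of these details are specified in the original proposal.

```python
# Illustrative sketch only: train A on the recognition task while an adversary B
# tries to recover "X vs Y" from A's activations; A is additionally pushed along
# the gradient that decreases B's accuracy (B is held fixed during that step).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelA(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        acts = self.features(x)           # the activations the adversary reads
        return self.head(acts), acts

model_A = ModelA(num_classes=10)
probe_B = nn.Linear(32 * 4 * 4, 2)        # predicts X-vs-Y from A's activations
opt_A = torch.optim.Adam(model_A.parameters(), lr=1e-3)
opt_B = torch.optim.Adam(probe_B.parameters(), lr=1e-3)

def training_step(images, labels, xy_mask, xy_labels, adv_weight=1.0):
    """xy_mask marks which images are Xs or Ys; xy_labels is 0 for X, 1 for Y."""
    logits, acts = model_A(images)
    loss_A = F.cross_entropy(logits, labels)      # ordinary recognition loss

    if xy_mask.any():
        # Step 1: update B to distinguish X from Y from A's (detached) activations.
        opt_B.zero_grad()
        loss_B = F.cross_entropy(probe_B(acts.detach()[xy_mask]), xy_labels[xy_mask])
        loss_B.backward()
        opt_B.step()

        # Step 2: penalize A for whatever B can still read off its activations;
        # maximizing B's loss here is the gradient that decreases B's accuracy,
        # and only A's weights are updated in this step.
        loss_A = loss_A - adv_weight * F.cross_entropy(
            probe_B(acts[xy_mask]), xy_labels[xy_mask])

    opt_A.zero_grad()
    loss_A.backward()
    opt_A.step()
    return loss_A.item()
```

Experiments 2 and 3 would reuse the same scaffolding, swapping what the probe predicts and whether the anti-training term is applied.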

comment by Razied · 2021-11-16T02:17:16.336Z · LW(p) · GW(p)

I still don't feel like I've read a convincing case for why GPT-6 would mean certain doom. I can see the danger in prompts like "this is the output of a superintelligence optimising for human happiness:", but a prompt like "Advanced AI Alignment, by Eliezer Yudkowsky, release date: March 2067, Chapter 1: " is liable to produce GPT-6's estimate of a future AI safety textbook. This seems like a ridiculously valuable thing unlikely to contain directly world-destroying knowledge. GPT-6 won't be directly coding, and will only be outputting things it expects future Eliezer to write in such a textbook. This isn't quite a pivotal-grade event, but it seems to be good enough to enable one.

Replies from: calef, Victor Levoso
comment by calef · 2021-11-16T19:54:35.095Z · LW(p) · GW(p)

I don’t think the issue is the existence of safe prompts, the issue is proving the non-existence of unsafe prompts. And it’s not at all clear that a GPT-6 that can produce chapters from 2067EliezerSafetyTextbook is not already past the danger threshold.

Replies from: Razied
comment by Razied · 2021-11-17T00:17:28.907Z · LW(p) · GW(p)

There would clearly be unsafe prompts for such a model, and it would be a complete disaster to release it publicly, but a small safety-oriented team carefully poking at it in secret in a closed room without internet is something different. In general such a team can place really very harsh safety restrictions on a model like this, especially one that isn't very agentic at all like GPT, and I think we have a decent shot at throwing enough of these heuristic restrictions at the safety-textbook-producing model that it would not automatically destroy the earth if used carefully.

Replies from: calef
comment by calef · 2021-11-17T02:34:59.476Z · LW(p) · GW(p)

Sure, but you have essentially no guarantee that such a model would remain contained to that group, or that the insights gleaned from that group could be applied unilaterally across the world before a “bad”* actor reimplemented the model and started asking it unsafe prompts.

Much of the danger here is that once any single lab on earth can make such a model, state actors probably aren’t more than 5 years behind, and likely aren’t more than 1 year behind, based on the economic value that an AGI represents.

• “bad” here doesn’t really mean evil in intent, just an actor that is unconcerned with the safety of their prompts, and thus likely to (in Eliezer’s words) end the world
comment by Victor Levoso · 2021-11-16T15:33:38.646Z · LW(p) · GW(p)

So, first, it is really unclear what you would actually get from GPT-6 in this situation.
(As an aside, I tried this with GPT-J and it output an index with some chapter names.)
You might just get the rest of your own comment or something similar...
Or maybe you'd get some article about Eliezer's book, some joke book written now, the actual book but containing subtle errors Eliezer might make, a fake article written by an AGI that GPT-6 predicts would likely have taken over the world by then... etc.

Since in general GPT-6 would be optimized to predict (on the training distribution) what follows from that kind of text, which is not the same as helpfully responding to prompts (for a current example, Codex outputs bad code when prompted with bad code).

It seems to me like the result depends on unknown things about what really big transformer models do internally which seem really hard to predict.

But for you to get something like what you want from this, GPT-6 needs to be modeling future Eliezer in great detail, complete with lots of thought and interactions.
And while GPT-6 could have been optimized into having a very specific human-modeling algorithm that happens to do that, it seems more likely that, before the optimization process finds the complicated algorithm necessary, it finds something simpler and more consequentialist, which does some more general thinking process to achieve some goal that happens to output the right completions on the training distribution.
Which is really dangerous.

And if you instead trained it with human feedback to ensure you get helpful responses (which sounds like exactly the kind of thing people would do if they wanted to actually use GPT-6 to do things like answer questions), it would be even worse, because you are directly optimizing it for human feedback, and it seems clearer there that you are running a search for strategies that make the human-feedback number go higher.

Replies from: Razied
comment by Razied · 2021-11-17T00:19:36.397Z · LW(p) · GW(p)

I think the issues where GPT-6 avoids actually outputting a serious book are fairly easy to solve. For one, you can annotate every item in the training corpus with a tag containing its provenance (arxiv, the various scientific journals, publishing houses, reddit, etc.) and the publication date (and maybe some other things like the number of words); these tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced; this avoids the easy failure mode of GPT-6 outputting my comment or some random blog post, because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count.
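A minimal sketch of what that tagging scheme could look like in practice. The tag syntax, field names, and special-token format below are purely illustrative assumptions of mine, not an existing GPT training pipeline:

```python
# Illustrative only: prepend hypothetical provenance/date/length tags to each
# training document, so the model can later be prompted conditionally on them.
def tag_document(text: str, source: str, date: str, n_words: int) -> str:
    header = f"<|source={source}|><|date={date}|><|words={n_words}|>"
    return header + "\n" + text

# Training time: every corpus item carries its true metadata.
example = tag_document("Chapter 1 ...", source="published_book",
                       date="2021-06-14", n_words=95000)

# Sampling time: the user supplies the provenance and date they *want*, so a
# random blog post or forum comment (tagged differently during training) is a
# poor continuation under this conditioning.
prompt = ("<|source=published_book|><|date=2067-03-01|><|words=120000|>\n"
          "Advanced AI Alignment, by Eliezer Yudkowsky. Chapter 1:")
```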

GPT-6 predicting AI takeover of the publishing houses and therefore producing a malicious AI safety book is a possibility, but I think most future paths where the world is destroyed by AI don't involve Elsevier still existing and publishing malicious safety books. But even if this is a possibility, we can just re-sample GPT-6 on this prompt to get a variety of books corresponding to the distribution of future outcomes expected by GPT-6, which are then checked by a team of safety researchers. As with most problems, generating interesting solutions is harder than verifying them; it doesn't have to be perfect to be ridiculously useful.

This general approach of "run GPT-6 in a secret room without internet, patching safety bugs with various heuristics, making it generate AI safety work that is then verified by a team" seems promising to me. You can even do stuff like train GPT-6 on an internal log of the various safety patches the team is working on, then have GPT-6 predict the next patch or possible safety problem. This approach is not safe at extreme levels of AI capability, and some prompts are safer than others, but it doesn't strike me as "obviously the world ends if someone tries this".

Replies from: gwern
comment by gwern · 2021-11-17T02:15:53.023Z · LW(p) · GW(p)

with a tag containing its provenance (arxiv, the various scientific journals, publishing houses, reddit, etc.) and the publication date (and maybe some other things like the number of words), these tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced, this avoids the easy failure mode of GPT-6 outputting my comment or some random blog post because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count.

If you include something like reviews or quotes praising its accuracy, then you're moving towards Decision Transformer territory [LW · GW] with feedback loops...

comment by awenonian · 2021-11-17T01:18:55.526Z · LW(p) · GW(p)

So, I'm not sure if I'm further down the ladder and misunderstanding Richard, but I found this line of reasoning objectionable (maybe not the right word):

"Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals."

My initial (perhaps uncharitable) response is something like "Yeah, you could build a safe system that just prints out plans that no one reads or executes, but that just sounds like a complicated way to waste paper. And if something is going to execute them, then what difference is it whether that's humans or the system itself?"

This, with its various mentions of manipulating humans, seems to me like it would most easily arise from an imagined scenario of AI "turning" on us. Like that we'd accidentally build a Paperclip Maximizer, and it would manipulate people by saying things like "Performing [action X which will actually lead to the world being turned into paperclips] will end all human suffering, you should definitely do it." And that this could be avoided by using an Oracle AI that will just tell us "If you perform action X, it will turn the world into paperclips." And then we can just say "oh, that's dumb, let's not do that."

And I think that this misunderstands alignment. An Oracle that tells you only effective and correct plans for achieving your goals, and doesn't attempt to manipulate you into achieving its own goals, because it doesn't have its own goals besides providing you with effective and correct plans, is still super dangerous. Because you'll ask it for a plan to get a really nice lemon poppy seed muffin, and it will spit out a plan, and when you execute the plan, your grandma will die. Not because the system was trying to kill your grandma, but because that was the most efficient way to get a muffin, and you didn't specify that you wanted your grandma to be alive.

(And you won't know the plan will kill your grandma, because if you understood the plan and all its consequences, it wouldn't be superintelligent)

Alignment isn't about guarding against an AI that has cross purposes to you. It's about building something that understands that when you ask for a muffin, you want your grandma to still be alive, without you having to say that (because there's a lot of things you forgot to specify, and it needs to avoid all of them). And so even an Oracle thing that just gives you plans is dangerous unless it knows those plans need to avoid all the things you forgot to specify. This was what I got out of the Outcome Pump story, and so maybe I'm just saying things everyone already knows...

comment by Evan R. Murphy · 2021-11-28T00:50:18.483Z · LW(p) · GW(p)

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. [...]"

Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could just be one line of code: if we give the former AI our current scenario as its input, then it becomes the latter."

How does giving the former "planner" AI the current scenario as input turn it into the latter "acting" AI? It still only outputs a plan, which the operators can then review and decide whether or not to carry out.

Also, the planner AI that Richard put forth had two inputs, not one. The inputs were: 1) a scenario, and 2) a goal. So for Eliezer (or anyone who confidently understood this part of the discussion), which goal input are you providing to the planner AI in this situation? Are you saying that the planner AI becomes dangerous when it's provided with the current scenario and any goal as inputs?

comment by JMF_628 · 2021-11-29T02:21:18.200Z · LW(p) · GW(p)

So we want the least dangerous, most easily aligned thing-to-do-with-an-AGI, but it does have to be a pretty powerful act to prevent the automatic destruction of Earth after 3 months or 2 years. It has to "flip the gameboard" rather than letting the suicidal game play out. We need to align the AGI that performs this pivotal act, to perform that pivotal act without killing everybody.

1. Destroy the Moon
2. ?
3. Profit

(Announce what you're going to do and why you're going to do it ahead of time, make it clear that the hardware and software to destroy Earth is now available, try to make it clear it would be extremely easy to do by accident, etc.) It doesn't seem like it's all that dangerous or would require all that much power/intelligence, in the grand scheme of things; in particular it sounds safer than melting GPUs.

But it's not a very good solution. The happy story I could tell is that the governments and populace would snap out of it, comprehend the threat, and then everyone would stop trying to destroy the universe. But what I'd bet would actually happen is that it just kicks off an arms race so the universe gets destroyed faster; people would understand the "AI is extremely powerful and can destroy the world" part without internalizing the "and it's extremely easy — the default way of things, in fact — to do so by accident" part. (And I have a hard time imagining someone being put in charge of the world just because they were credibly holding it hostage.)

Also I imagine it would probably screw up Earth's ecosystem and possibly kill a lot of people from starvation or catastrophic weather (and you said duplicating strawberries requires too much intelligence to be safe, though I don't understand why). Also I guess you'd have to destroy the Moon carefully enough not to have a chunk of it destroy Earth or whatever, so maybe it does require too much 'intelligence/world-modelling/whatever' to be safer than e.g. melting GPUs.

Not a great solution, so I guess I'll keep thinking

comment by aleph_four · 2021-11-18T16:51:47.647Z · LW(p) · GW(p)

I love being accused of being GPT-x on Discord by people who don't understand scaling laws and think I own a planet of A100s

There are some hard and mean limits to explainability, and there's a real issue that a person who correctly sees how to align AGI, or who correctly perceives that an AGI design is catastrophically unsafe, will not be able to explain it. It requires super-intelligence to cogently expose stupid designs that will kill us all. What are we going to do if there's this kind of coordination failure?

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T21:01:10.735Z · LW(p) · GW(p)

Yudkowsky's insistence that only dangerous AI can come up with a "pivotal act" is fairly ridiculous.

Consider the following pivotal act: "launch a nuclear weapon at every semiconductor fab on earth".

Any human of even average intelligence could have thought of this. We do not need a smarter-than-all-humans-ever AI to achieve a pivotal act.

A boxed AI should be able to think of pivotal acts and describe them to humans without being so smart that it by necessity escapes the box and destroys all humans.

Replies from: Eliezer_Yudkowsky, dxu, logan-zoellner
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-15T21:43:58.127Z · LW(p) · GW(p)

launch a nuclear weapon at every semiconductor fab on earth

This is not what I label "pivotal".  It's big, but a generation later they've rebuilt the semiconductor fabs and then we're all in the same position.  Or a generation later, algorithms have improved to where the old GPU server farms can implement AGI.  The world situation would be different then, if the semiconductor fabs had been nuked 10 years earlier, but it isn't obviously better.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-15T22:16:43.350Z · LW(p) · GW(p)

If I really thought AI was going to murder us all in the next 6 months to 2 years, I would definitely consider those 10 years "pivotal", since it would give us 5x-20x the time to solve the alignment problem. I might even go full Butlerian Jihad and just ban semiconductor fabs altogether.

Actually, I think the right question is: is there anything you would consider pivotal other than just solving the alignment problem? If not, the whole argument seems to be "If we can't find a safe way to solve the alignment problem, we should consider dangerous ones."

Replies from: RobbBB, Raemon, Eliezer_Yudkowsky
comment by Rob Bensinger (RobbBB) · 2021-11-15T22:49:43.956Z · LW(p) · GW(p)

[Update: As of today Nov. 16 (after checking with Eliezer), I've edited the Arbital page to define "pivotal act" the way it's usually used: to refer to a good gameboard-flipping action, not e.g. 'AI destroys humanity'. The quote below uses the old definition, where 'pivotal' meant anything world-destroying or world-saving.]

Eliezer's using the word "pivotal" here to mean something relatively specific, described on Arbital:

The term 'pivotal' in the context of value alignment theory is a guarded term to refer to events, particularly the development of sufficiently advanced AIs, that will make a large difference a billion years later. A 'pivotal' event upsets the current gameboard - decisively settles a win or loss, or drastically changes the probability of win or loss, or changes the future conditions under which a win or loss is determined.

[...]

### Examples of pivotal and non-pivotal events

Pivotal events:

• non-value-aligned AI is built, takes over universe
• human intelligence enhancement powerful enough that the best enhanced humans are qualitatively and significantly smarter than the smartest non-enhanced humans
• a limited Task AGI that can:
• upload humans and run them at speeds more comparable to those of an AI
• prevent the origin of all hostile superintelligences (in the nice case, only temporarily and via strategies that cause only acceptable amounts of collateral damage)
• design or deploy nanotechnology such that there exists a direct route to the operators being able to do one of the other items on this list (human intelligence enhancement, prevent emergence of hostile SIs, etc.)
• a complete and detailed synaptic-vesicle-level scan of a human brain results in cracking the cortical and cerebellar algorithms, which rapidly leads to non-value-aligned neuromorphic AI

Non-pivotal events:

• curing cancer (good for you, but it didn't resolve the value alignment problem)
• proving the Riemann Hypothesis (ditto)
• an extremely expensive way to augment human intelligence by the equivalent of 5 IQ points that doesn't work reliably on people who are already very smart
• making a billion dollars on the stock market
• robotic cars devalue the human capital of professional drivers, and mismanagement of aggregate demand by central banks plus burdensome labor market regulations is an obstacle to their re-employment

Borderline cases:

• unified world government with powerful monitoring regime for 'dangerous' technologies
• widely used gene therapy that brought anyone up to a minimum equivalent IQ of 120

### Centrality to limited AI proposals

We can view the general problem of Limited AI as having the central question: What is a pivotal positive accomplishment, such that an AI which does that thing and not some other things is therefore a whole lot safer to build? This is not a trivial question because it turns out that most interesting things require general cognitive capabilities, and most interesting goals can require arbitrarily complicated value identification problems to pursue safely.

It's trivial to create an "AI" which is absolutely safe and can't be used for any pivotal achievements. E.g. Google Maps, or a rock with "2 + 2 = 4" painted on it.

[...]

### Centrality to concept of 'advanced agent'

We can view the notion of an advanced agent as "agent with enough cognitive capacity to cause a pivotal event, positive or negative"; the advanced agent properties are either those properties that might lead up to participation in a pivotal event, or properties that might play a critical role in determining the AI's trajectory and hence how the pivotal event turns out.

In conversations I've seen that use the word "pivotal", it's usually asking about pivotal acts we can do that end the acute x-risk period (things that make it the case that random people in the world can't suddenly kill everyone with AGI or bioweapons or what-have-you). I.e., it's specifically focused on good pivotal acts.

Replies from: RobbBB, logan-zoellner
comment by Rob Bensinger (RobbBB) · 2021-11-15T23:10:11.578Z · LW(p) · GW(p)

IMO it's confusing that Eliezer uses the word "pivotal" on Arbital to also refer to ways AI could destroy the world. If we're talking about stuff like "what's the easiest pivotal act?" or "how hard do pivotal acts tend to be?", I'll give wildly different answers if I'm including 'ways to destroy the world' and not just 'ways to save the world' -- destroying the world seems drastically easier to me. And I don't know of an unambiguous short synonym for 'good pivotal act'.

(Eliezer proposes 'pivotal achievement', but empirically I don't see people using this much, and it still has the same problem that it re-uses the word 'pivotal' for both categories of event, thus making them feel very similar.)

Usually I care about either 'ways of saving the world' or 'ways of destroying the world' -- I rarely find myself needing a word for the superset. E.g., I'll find myself searching for a short term to express things like 'the first AGI company needs to look for a way-to-save-the-world' or 'I wish EAs would spend more time thinking about ways-to-use-AGI-to-save-the-world'. But if I say 'pivotal', this will technically include x-catastrophes, which is not what I have in mind.

(On the other hand, the concept of 'the kind of AI that's liable to cause pivotal events' does make sense to me and feels very useful, because I think AGI gets you both the world-saving and the world-destroying capabilities in one fell swoop (though not necessarily the ability to align AGI to actually utilize the capabilities you want). But given my beliefs about AGI, I'm satisfied with just using the term 'AGI' to refer to 'the kind of AI that's liable to cause pivotal events'. Eliezer's more-specifically-about-pivotal-events term for this on Arbital, 'advanced agent', seems fine to me too.)

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-11-16T00:24:21.742Z · LW(p) · GW(p)

Update: Eliezer has agreed to let me edit the Arbital article to follow more standard usage nowadays, with 'pivotal acts' referring to good gameboard-flipping actions. The article will use 'existential catastrophe' to refer to bad gameboard-flipping events, and 'astronomically significant event' to refer to the superset. Will re-quote the article here once there's a new version.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-11-17T04:54:47.864Z · LW(p) · GW(p)

New "pivotal act" page:

The term 'pivotal act' in the context of AI alignment theory is a guarded term to refer to actions that will make a large positive difference a billion years later. Synonyms include 'pivotal achievement' and 'astronomical achievement'.

We can contrast this with existential catastrophes (or 'x-catastrophes'), events that will make a large negative difference a billion years later. Collectively, this page will refer to pivotal acts and existential catastrophes as astronomically significant events (or 'a-events').

'Pivotal event' is a deprecated term for referring to astronomically significant events, and 'pivotal catastrophe' is a deprecated term for existential catastrophes. 'Pivotal' was originally used to refer to the superset (a-events), but AI alignment researchers kept running into the problem of lacking a crisp way to talk about 'winning' actions in particular, and their distinctive features.

Usage has therefore shifted such that (as of late 2021) researchers use 'pivotal' and 'pivotal act' to refer to good events that upset the current gameboard - events that decisively settle a win, or drastically increase the probability of a win.

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T22:59:25.783Z · LW(p) · GW(p)

Under this definition, it seems that "nuke every fab on Earth" would qualify as "borderline", and every outcome that is both "pivotal" and "good" depends on solving the alignment problem.

comment by Raemon · 2021-11-15T22:35:35.877Z · LW(p) · GW(p)

Pivotal in this case is a technical term (whose article opens with an explicit bid for people not to stretch the definition of the term). It's not (by definition) limited to 'solving the alignment problem', but there are constraints on what counts as pivotal.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T01:08:56.989Z · LW(p) · GW(p)

If you can deploy nanomachines that melt all the GPU farms and prevent any new systems with more than 1 networked GPU from being constructed, that counts.  That really actually suspends AGI development indefinitely pending an unlock, and not just for a brief spasmodic costly delay.

Replies from: Wei_Dai, Vaniver
comment by Wei_Dai · 2021-11-27T21:54:24.531Z · LW(p) · GW(p)

1. Are you expecting that the team behind the "melt all GPU farms" pivotal act to be backed by a major government or coalition of governments?
2. If not, I expect that the team and its AGI will be arrested/confiscated by the nearest authority as soon as the pivotal act occurs, and forced by them to apply the AGI to other goals. Do you see things happening differently, or expect things to come out well despite this?
Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-28T05:33:01.544Z · LW(p) · GW(p)

"Melt all GPUs" is indeed an unrealistic pivotal act - which is why I talk about it, since like any pivotal act it is outside the Overton Window, and then if any children get indignant about the prospect of doing something other than letting the world end miserably, I get to explain the child-reassuring reasons why you would never do the particular thing of "melt all GPUs" in real life.  In this case, the reassuring reason is that deploying open-air nanomachines to operate over Earth is a huge alignment problem, that is, relatively huger than the least difficult pivotal act I can currently see.

That said, if unreasonably-hypothetically you can give your AI enough of a utility function and have it deploy enough intelligence to create nanomachines that safely move through the open-ended environment of Earth's surface, avoiding bacteria and not damaging any humans or vital infrastructure, in order to surveil all of Earth and find the GPU farms and then melt them all, it's probably not very much harder to tell those nanomachines to melt other things, or demonstrate the credibly threatening ability to do so.

That said, I indeed don't see how we sociologically get into this position in a realistic way, in anything like the current world, even assuming away the alignment problem.  Unless Demis Hassabis suddenly executes an emergency pact with the Singaporean government, or something else I have trouble visualizing?  I don't see any of the current owners or local governments of the big AI labs knowingly going along with any pivotal act executed deliberately (though I expect them to think it's just fine to keep cranking up the dial on an AI until it destroys the world, so long as it looks like it's not being done on purpose).

It is indeed the case that, conditional on the alignment problem being solvable, there's a further sociological problem - which looks a lot less impossible, but which I do not actually know how to solve - wherein you then have to do something pivotal, and there's no grownups in government in charge who would understand why that was something necessary to do.  But it's definitely a lot easier to imagine Demis forming a siloed team or executing an emergency pact with Singapore, than it is to see how you would safely align the AI that does it.  And yes, the difficulty of any pivotal act to stabilize the Earth includes the difficulty of what you had to do, before or after you had sufficiently powerful AGI, in order to execute that act and then prevent things from falling over immediately afterwards.

Replies from: Wei_Dai
comment by Wei_Dai · 2021-11-28T20:59:12.086Z · LW(p) · GW(p)

the least difficult pivotal act I can currently see.

Do you have a plan to communicate the content of this to people whom it would be beneficial to communicate to? E.g., write about it in some deniable way, or should such people just ask you about it privately? Or more generally, how do you think that discussions / intellectual progress on this topic should go?

Do you think the least difficult pivotal act you currently see has sociopolitical problems that are similar to "melt all GPUs"?

That said, I indeed don’t see how we sociologically get into this position in a realistic way, in anything like the current world, even assuming away the alignment problem.

Thanks for the clarification. I suggest mentioning this more often (like in the Arbital page), as I previously didn't think that your version of "pivotal act" had a significant sociopolitical component. If this kind of pivotal act is indeed how the world gets saved (conditional on the world being saved), one of my concerns is that "a miracle occurs" and the alignment problem gets solved, but the sociopolitical problem doesn't because nobody was working on it (even if it's easier in some sense).

But it’s definitely a lot easier to imagine Demis forming a siloed team or executing an emergency pact with Singapore

(Not a high priority to discuss this here and now, but) I'm skeptical that backing by a small government like Singapore is sufficient, since any number of major governments would be very tempted to grab the AGI(+team) from the small government, and the small government will be under tremendous legal and diplomatic stress from having nonconsensually destroyed a lot of very valuable other people's property. Having a partially aligned/alignable AGI in the hands of a small, geopolitically weak government seems like a pretty precarious state.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-29T06:06:02.289Z · LW(p) · GW(p)

Singapore probably looks a lot less attractive to threaten if it's allied with another world power that can find and melt arbitrary objects.

comment by Vaniver · 2021-11-16T02:02:20.176Z · LW(p) · GW(p)

I'm still unsure how true I think this is.

Clearly a full Butlerian jihad (where all of the computers are destroyed) suspends AGI development indefinitely, and destroying no computers doesn't slow it down at all. There's a curve then where the more computers you destroy, the more you both 1) slow down AGI development and 2) disrupt the economy (since people were using those to keep their supply chains going, organize the economy, do lots of useful work, play video games, etc.).

But even if you melt all the GPUs, I think you have two obstacles:

1. CPUs alone can do lots of the same stuff. There's some paper I was thinking of from ~5 years ago where they managed to get a CPU farm competitive with the GPUs of the time, and it might have been this paper (whose authors are all from Intel, who presumably have a significant bias) or it might have been the Hogwild-descended stuff (like this); hopefully someone knows something more up to date.
2. The chip design ecosystem gets to react to your ubiquitous nanobots and reverse-engineer what features they're looking for to distinguish between whitelisted CPUs and blacklisted GPUs; they may be able to design a ML accelerator that fools the nanomachines. (Something that's robust to countermoves might have to eliminate many more current chips.)
Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T03:38:25.592Z · LW(p) · GW(p)

I agree you might need to make additional moves to keep the table flipped, but in a scenario like this you would actually have the capability to make those moves.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-16T14:59:23.937Z · LW(p) · GW(p)

Is the plan just to destroy all computers with, say, >1e15 flops of computing power? How does the nanobot swarm know what a "computer" is? What do you do about something like GPT-Neo or SETI@home where the compute is distributed?

I'm still confused as to why you think the task "build an AI that destroys anything with >1e15 flops of computing power -- except humans, of course" would be dramatically easier than the alignment problem.

Setting back civilization a generation (via catastrophe) seems relatively straightforward. Building a social consensus/religion that destroys anything "in the image of a mind" at least seems possible. Fine-tuning a nanobot swarm to destroy some but not all computers just sounds really hard to me.

comment by dxu · 2021-11-15T23:30:19.292Z · LW(p) · GW(p)

Consider the following pivotal act: "launch a nuclear weapon at every semiconductor fab on earth".

Any human of even average intelligence could have thought of this.

And by that very same token, the described plan would not actually work.

We do not need a smarter-than-all-humans-ever AI to achieve a pivotal act.

Unless we want the AI in question to output a plan that has a chance of actually working.

A boxed AI should be able to think of pivotal acts and describe them to humans without being so smart that it by necessity escapes the box and destroys all humans.

If an actually workable pivotal act existed that did not require better-than-human intelligence to come up with, we would already be in the process of implementing said pivotal act, because someone would have thought of it already. The fact that this is obviously not the case should therefore cause a substantial update against the antecedent.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-16T14:44:11.036Z · LW(p) · GW(p)

If an actually workable pivotal act existed that did not require better-than-human intelligence to come up with, we would already be in the process of implementing said pivotal act, because someone would have thought of it already. The fact that this is obviously not the case should therefore cause a substantial update against the antecedent.

This is an incredibly bad argument. Saying something cannot possibly work because no one has done it yet would mean that literally all innovation is impossible.

Replies from: dxu
comment by dxu · 2021-11-16T17:59:51.877Z · LW(p) · GW(p)

Saying something cannot possibly work because no one has done it yet would mean that literally all innovation is impossible.

You are attempting to generalize conclusions about an extremely loose class of achievements ("innovation"), to an extremely tight class of achievements ("commit, using our current level of knowledge and resources, a pivotal act"). That this generalization is invalid ought to go without saying, but in the interest of constructiveness I will point out one (relevant) aspect of the disanalogy:

"Innovation", at least as applied to technology, is incremental; new innovations are allowed to build on past knowledge in ways that (in principle) place no upper limit on the technological improvements thus achieved (except whatever limits are imposed by the hard laws of physics and mathematics). There is also no time limit on innovation; by default, anything that is possible at all is assumed to be realized eventually, but there are no guarantees as to when that will happen for any specific technology.

"Commit a pivotal act using the knowledge and resources currently available to us", on the other hand, is the opposite of incremental: it demands that we execute a series of actions that leads to some end goal (such as "take over the world") while holding fixed our level of background knowledge/acumen. Moreover, whereas there is no time limit on technological "innovation", there is certainly a time limit on successfully committing a pivotal act; and moreover this time limit is imposed precisely by however long it takes before humanity "innovates" itself to AGI.

In summary, your analogy leaks, and consequently so does your generalization. In fact, however, your reasoning is further flawed: even if your analogy were tight, it would not suffice to establish what you need to establish. Recall your initial claim:

We do not need a smarter-than-all-humans-ever AI to achieve a pivotal act.

This claim does not, in fact, become more plausible if we replace "achieve a pivotal act" with e.g. "vastly increase the pace of technological innovation". This is true even though technological innovation is, as a human endeavor, far more tractable than saving/taking over the world. This is because the load-bearing part of the argument is that the AI must produce relevant insights (whether related to "innovation" or "pivotal acts") at a rate vastly superior to that of humans, in order for it to be able to reliably produce innovations/world-saving plans. (I leave it unargued that humans do not reliably do either of these things.) In other words, it certainly requires an AI whose ability in the relevant domains exceeds that of "all humans ever", because "all humans ever" empirically do not (reliably) accomplish these tasks.

For your argument to go through, in other words, you cannot get away with arguing merely that something is "possible" (though in fact you have not even established this much, because the analogy with technological innovation does not hold). Your argument actually requires you to argue for the (extremely strong) claim that the ambient probability with which humans successfully generate world-saving plans, is sufficient to the task of generating a successful world-saving plan before unaligned AGI is built. And this claim is clearly false, since (once again)

If an actually workable pivotal act existed that did not require better-than-human intelligence to come up with, we would already be in the process of implementing said pivotal act, because someone would have thought of it already. The fact that this is obviously not the case should therefore cause a substantial update against the antecedent.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-16T21:40:44.675Z · LW(p) · GW(p)

the AI must produce relevant insights (whether related to "innovation" or "pivotal acts") at a rate vastly superior to that of humans, in order for it to be able to reliably produce innovations/world-saving plans

This is precisely the claim we are arguing about! I disagree that the AI needs to produce insights "at a rate vastly superior to all humans".

On the contrary, I claim that there is one borderline act (start a catastrophe that sets back AI progress by decades) that can be done with current human knowledge. And I furthermore claim that there is one pivotal act (design an aligned AI) that may well be achieved via incremental progress.

Replies from: dxu
comment by dxu · 2021-11-17T03:35:32.244Z · LW(p) · GW(p)

If the AI does not need to produce relevant insights at a faster rate than humans, then that implies the rate at which humans produce relevant insights is sufficiently fast already. And if that’s your claim, then you—again—need to explain why no humans have been able to come up with a workable pivotal act to date.

On the contrary, I claim that there is one borderline act (start a catastrophe that sets back AI progress by decades) that can be done with current human knowledge.

How do you propose to accomplish this? Your initial suggestion, “launch nukes at every semiconductor fab”, is not workable. If all of the candidate solutions you have in mind are of similar quality to that, then I reiterate: humans cannot, with their current knowledge and resources, execute a pivotal act in the real world.

And I furthermore claim that there is one pivotal act (design an aligned AI) that may well be achieved via incremental progress.

This is the hope, yes. Note, however, that this is a path that routes directly through smarter-than-human AI, which necessity is precisely what you are disputing. So the existence of this path does not particularly strengthen your case.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-18T16:24:01.967Z · LW(p) · GW(p)

Your initial suggestion, “launch nukes at every semiconductor fab”, is not workable.

In what way is it not workable? Perhaps we have different intuitions about how difficult it is to build a cutting-edge semiconductor facility? Alternatively, you may disagree with me that AI is largely hardware-bound, and thus that cutting off the supply of new compute will also prevent the rise of superhuman AI?

Do you also think that "the US president launches every nuclear weapon at his command, causing nuclear winter?" would fail to prevent the rise of superhuman AGI?

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T21:22:47.066Z · LW(p) · GW(p)

It would not surprise me in the least if the world ends before self-driving cars are sold on the mass market.

Obviously it is impossible to bet money on the end of the world. But if it were possible, I would be willing to give fairly long odds that this is wrong.

comment by philh · 2021-11-15T22:31:35.642Z · LW(p) · GW(p)

Obviously it is impossible to bet money on the end of the world.

I think this is neither obvious nor true. There are lots of variants you could do and details you'd need to fill in, but the outline of a simple one would be: "I pay you $X now, and if and when self-driving cars reach mass market without the world having ended, you pay me $Y, inflation-adjusted".

Replies from: khafra, M. Y. Zuo
comment by khafra · 2021-11-18T18:39:41.930Z · LW(p) · GW(p)

Robin Hanson said [LW · GW], with Eliezer eventually concurring, that "bets like this will just recover interest rates, which give the exchange rate between resources on one date and resources on another date."

E.g., it's not impossible to bet money on the end of the world, but it's impossible to do it in a way substantially different from taking a loan.

Replies from: philh
comment by philh · 2021-11-19T11:22:04.347Z · LW(p) · GW(p)

Oh, thanks for the pointer. I confess I wish Robin was less terse here.

I'm not sure I even understand the claim, what does it mean to "recover interest rates"? Is Robin claiming any such bet will either

1. Have payoffs such that [the person receiving money now and paying money later] could just take out a loan at prevailing interest rates to make this bet; or
2. Have at least one party who is being silly with money?

...oh, I think I get it, and IIUC the idea that fails is different from what I was suggesting.

The idea that fails is that you can make a prediction market from these bets and use it to recover a probability of apocalypse. I agree that won't work, for the reason given: prices of these bets will be about both [probability of apocalypse] and [the value of money-now versus money-later, conditional on no apocalypse], and you can't separate those effects.

I don't think this automatically sinks the simpler idea of: if Alice and Bob disagree about the probability of an apocalypse, they may be able to make a bet that both consider positive-expected-utility. And I don't think that bet would necessarily just be a combination of available market-rate loans? At least it doesn't look like anyone is claiming that.
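A toy way to see this point, under assumed numbers (the indifference condition and the variable names here are my own framing, not Robin's or philh's): the agreed price only pins down the product of the survival probability and the money-later discount, so the probability can't be read off separately.

```python
# Toy model: the "doomer" takes $x now and owes $y later only if the world survives.
# They are indifferent when x = (1 - p) * d * y, so only the product (1 - p) * d
# is observable from the agreed price; p and d cannot be separated.
def indifference_price(y_later: float, p_apocalypse: float, discount: float) -> float:
    return (1 - p_apocalypse) * discount * y_later

# Two very different worldviews imply the same price:
print(indifference_price(2_000_000, p_apocalypse=0.50, discount=0.90))  # 900000.0
print(indifference_price(2_000_000, p_apocalypse=0.10, discount=0.50))  # 900000.0
```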

comment by M. Y. Zuo · 2021-11-15T22:55:15.019Z · LW(p) · GW(p)

Wow, that may be a genuinely ground-breaking application for crypto currencies. E.g. someone with 1000 bitcoins can put them in some form of guaranteed, irreversible escrow for a million dollars upfront and a release date in 2050. If the world ends then the escrow vanishes; if not, the lucky bettor would get it.

Replies from: philh
comment by philh · 2021-11-15T23:54:32.052Z · LW(p) · GW(p)

(Since you said "ground breaking", I feel I should maybe be clear that the idea wasn't original to me and the ground was already broken when I got here. I probably saw it on LW myself.)

Note that 1000 BTC at current prices is like $64 million. In general with this structure (trade crypto for fiat, plus the crypto gets locked until later) I don't see the incentive on anyone's side to lock the crypto?

I haven't thought about this in depth and I assume some people have, but my sense is that these kinds of bets are a lot easier if both parties trust each other to be willing and able to pay up when appropriate. If you don't trust your counterparty, and demand money in escrow, then their incentive to take part seems minimal or none.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-16T00:06:41.962Z · LW(p) · GW(p)

The actual structure, and payout ratio, would probably be set in a much more elaborate way. Maybe some kind of annuity paying out every year the world hasn’t ended yet? Like committing 10 bitcoins every year from a reversible escrow account to the irreversible escrow if the servers still exist, or else the total balance is forfeited. Something along those lines; perhaps others would want to take up the project.

Replies from: philh
comment by philh · 2021-11-16T00:36:27.837Z · LW(p) · GW(p)

Fwiw (and this might just be that you've thought about it more than I have, and/or have details in mind that you haven't specified), it still seems to me that this runs into the problem that at least one party basically won't have any reason to enter into such a bet, because their potential upside will be locked away long enough to negate its value.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-16T02:40:07.308Z · LW(p) · GW(p)

I imagine that was one of the critiques for prediction markets using crypto currencies held in escrow, yet they exist now, and they’re not all scams, so there must be some non-zero market-clearing price.

Replies from: philh
comment by philh · 2021-11-16T08:52:07.975Z · LW(p) · GW(p)

I don't think that holds up, because with traditional uses of prediction markets both parties expect to see the end of the bet. If it's a long term bet then that would increase the expected payoff they require to be willing to bet, but there's some amount of "money later" that's worth giving up "money now" in exchange for. So both are willing to lock money up.

With a bet on the end of the world, all of the upside for one party comes from receiving "money now". There's no potential "money later" payoff for them, the bet isn't structured that way. And putting money in escrow means their "money now" vanishes too.

That is, if I receive a million dollars now and pay out two million if the world doesn't end, then: for now I'm up a million; if the world ends I'm dead; if the world doesn't end I'm down a million. But if the two million has to go in escrow, the only change is that for now I'm down a million too. So I'm not gonna do this.
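A toy calculation of that point, using the $1M/$2M figures above (this is my own illustration of the argument, not anything from the thread beyond those numbers): escrow removes the only branch in which the "doomer" side is ever ahead.

```python
# The doomer receives 1M now and owes 2M later if the world survives.
stake_now, payout_later = 1_000_000, 2_000_000

def doomer_position(escrowed: bool):
    # Returns (net position today, net position later if the world survives);
    # in the "world ends" branch, the later position is moot for both parties.
    now = stake_now - (payout_later if escrowed else 0)
    later_if_survives = stake_now - payout_later
    return now, later_if_survives

print(doomer_position(escrowed=False))  # (1000000, -1000000): the upside is money-now
print(doomer_position(escrowed=True))   # (-1000000, -1000000): ahead in no branch
```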

You could define a threshold for known AI capability or odds of extinction* and bet on that instead.

*as estimated by some set of alignment experts

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T21:26:42.022Z · LW(p) · GW(p)

the thing that kills us is likely to be a thing that can get more dangerous when you turn up a dial on it, not a thing that intrinsically has no dials that can make it more dangerous.

Finally, a specific claim from Yudkowsky I actually agree with.