Full Transcript: Eliezer Yudkowsky on the Bankless podcast

post by remember, Andrea_Miotti (AndreaM) · 2023-02-23T12:34:19.523Z

Contents

  ChatGPT
  AGI
  Efficiency
  AI Alignment
  AI Goals
  Consensus
  God Mode and Aliens
  Good Outcomes
  Ryan's Childhood Questions
  Trying to Resist
  MIRI and Education
  How Long Do We Have?
  Bearish Hope
  The End Goal
  Q&A

This podcast has gotten a lot of traction, so we're posting a full transcript of it, lightly edited with ads removed, for those who prefer reading over audio. 

Intro

Eliezer Yudkowsky: [clip] I think that we are hearing the last winds start to blow, the fabric of reality start to fray. This thing alone cannot end the world, but I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something.

Ryan Sean Adams: Welcome to Bankless, where we explore the frontier of internet money and internet finance. This is how to get started, how to get better, how to front run the opportunity. This is Ryan Sean Adams. I'm here with David Hoffman, and we're here to help you become more bankless.

Okay, guys, we wanted to do an episode on AI at Bankless, but I feel like David...

David: Got what we asked for.

Ryan: We accidentally waded into the deep end of the pool here. And I think before we get into this episode, it probably warrants a few comments. I'm going to say a few things I'd like to hear from you too. But one thing I want to tell the listener is, don't listen to this episode if you're not ready for an existential crisis. Okay? I'm kind of serious about this. I'm leaving this episode shaken. And I don't say that lightly. In fact, David, I think you and I will have some things to discuss in the debrief as far as how this impacted you. But this was an impactful one. It sort of hit me during the recording, and I didn't know fully how to react. I honestly am coming out of this episode wanting to refute some of the claims made in this episode by our guest, Eliezer Yudkowsky, who makes the claim that humanity is on the cusp of developing an AI that's going to destroy us, and that there's really not much we can do to stop it.

David: There's no way around it, yeah.

Ryan: I have a lot of respect for this guest. Let me say that. So it's not as if I have some sort of big-brained technical disagreement here. In fact, I don't even know enough to fully disagree with anything he's saying. But the conclusion is so dire and so existentially heavy that I'm worried about it impacting you, listener, if we don't give you this warning going in.

I also feel like, David, as interviewers, maybe we could have done a better job. I'll say this on behalf of myself. Sometimes I peppered him with a lot of questions in one fell swoop, and he was probably only ready to synthesize one at a time.

I also feel like we got caught flat-footed at times. I wasn't expecting his answers to be so frank and so dire, David. It was just bereft of hope.

And I appreciated very much the honesty, as we always do on Bankless. But I appreciated it almost in the way that a patient might appreciate the honesty of their doctor telling them that their illness is terminal. Like, it's still really heavy news, isn't it? 

So that is the context going into this episode. I will say one thing in the way of good news: our failings as interviewers in this episode might be remedied, because at the end of this episode, after we hit the button to stop recording, Eliezer said he'd be willing to do an additional Q&A episode with the Bankless community. So if you guys have questions, and if there's sufficient interest for Eliezer to answer, tweet at us to express that interest. Hit us in Discord. Get those messages over to us and let us know if you have some follow-up questions.

He said if there's enough interest in the crypto community, he'd be willing to come on and do another episode with follow-up Q&A. Maybe even a Vitalik and Eliezer episode is in store. That's a possibility that we threw to him. We've not talked to Vitalik about that yet, but I just feel a little overwhelmed by the subject matter here. And that is the basis, the preamble through which we are introducing this episode.

David, there's a few benefits and takeaways I want to get into. But before I do, can you comment or reflect on that preamble? What are your thoughts going into this one?

David: Yeah, we approached the end of our agenda—for every Bankless podcast, there's an equivalent agenda that runs alongside it. But once we got to the crux of this conversation, it was not possible to proceed with that agenda, because... what was the point?

Ryan: Nothing else mattered.

David: And nothing else really matters, which also just relates to the subject matter at hand. And so as we proceed, you'll see us kind of circle back to the same inevitable conclusion over and over and over again, which ultimately is kind of the punchline of the content.

I'm of a specific disposition where stuff like this, I kind of am like, “Oh, whatever, okay”, just go about my life. Other people are of different dispositions and take these things more heavily. So Ryan's warning at the beginning is if you are a type of person to take existential crises directly to the face, perhaps consider doing something else instead of listening to this episode.

Ryan: I think that is good counsel.

So, a few things if you're looking for an outline of the agenda. We start by talking about ChatGPT. Is this a new era of artificial intelligence? Got to begin the conversation there.

Number two, we talk about what an artificial superintelligence might look like. How smart exactly is it? What types of things could it do that humans cannot do?

Number three, we talk about why an AI superintelligence will almost certainly spell the end of humanity and why it'll be really hard, if not impossible, according to our guest, to stop this from happening.

And number four, we talk about whether there is absolutely anything we can do about all of this. We are careening, maybe, towards the abyss. Can we change direction and not go off the cliff? That is the question we ask Eliezer.

David, I think you and I have a lot to talk about during the debrief. All right, guys, the debrief is an episode that we record right after the episode. It's available for all Bankless citizens. We call this the Bankless Premium Feed. You can access that now to get our raw and unfiltered thoughts on the episode. And I think it's going to be pretty raw this time around, David.

David: I didn't expect this to hit you so hard.

Ryan: Oh, I'm dealing with it right now.

David: Really?

Ryan: And this is not too long after the episode. So, yeah, I don't know how I'm going to feel tomorrow, but I definitely want to talk to you about this. And maybe have you give me some counseling. (laughs)

David: I'll put my psych hat on, yeah.

Ryan: Please! I'm going to need some help.

ChatGPT

Ryan: Bankless Nation, we are super excited to introduce you to our next guest. Eliezer Yudkowsky is a decision theorist. He's an AI researcher. He's the seeder of the Less Wrong community blog, a fantastic blog by the way. There's so many other things that he's also done. I can't fit this in the short bio that we have to introduce you to Eliezer.

But most relevant probably to this conversation is he's working at the Machine Intelligence Research Institute to ensure that when we do make general artificial intelligence, it doesn't come kill us all. Or at least it doesn't come ban cryptocurrency, because that would be a poor outcome as well.

Eliezer: (laughs)

Ryan: Eliezer, it's great to have you on Bankless. How are you doing?

Eliezer: Within one standard deviation of my own peculiar little mean.

Ryan: (laughs) Fantastic. You know, we want to start this conversation with something that jumped onto the scene for a lot of mainstream folks quite recently, and that is ChatGPT. So apparently 100 million people or so have logged on to ChatGPT quite recently. I've been playing with it myself. I found it very friendly, very useful. It even wrote me a sweet poem that I thought was very heartfelt and almost human-like.

I know that you have major concerns around AI safety, and we're going to get into those concerns. But can you tell us in the context of something like a ChatGPT, is this something we should be worried about? That this is going to turn evil and enslave the human race? How worried should we be about ChatGPT and Bard and the new AI that's entered the scene recently?

Eliezer: ChatGPT itself? Zero. It's not smart enough to do anything really wrong. Or really right either, for that matter.

Ryan: And what gives you the confidence to say that? How do you know this?

Eliezer: Excellent question. So, every now and then, somebody figures out how to put a new prompt into ChatGPT. You know, one time somebody found that one of the earlier generations of the technology would sound smarter if you first told it it was Eliezer Yudkowsky. There's other prompts too, but that one's one of my favorites. So there's untapped potential in there that people hadn't figured out how to prompt yet.

But when people figure it out, it moves ahead sufficiently short distances that I do feel fairly confident that there is not so much untapped potential in there that it is going to take over the world. It's, like, making small movements, and to take over the world it would need a very large movement. There's places where it falls down on predicting the next line that a human would say in its shoes that seem indicative of “probably that capability just is not in the giant inscrutable matrices, or it would be using it to predict the next line”, which is very heavily what it was optimized for. So there's going to be some untapped potential in there. But I do feel quite confident that the upper range of that untapped potential is insufficient to outsmart all the living humans and implement the scenario that I'm worried about.
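
For readers who haven't tried this kind of prompting, here is a minimal sketch of how a persona-style system prompt gets sent to a chat model. It assumes the current `openai` Python client and an API key in the environment; the model name and the persona text are illustrative placeholders, not the exact prompt described above.

```python
# Minimal sketch of persona-style prompting (assumptions: the modern `openai`
# Python client, OPENAI_API_KEY set in the environment, a placeholder model
# name, and an illustrative persona rather than the exact prompt mentioned above).
from openai import OpenAI

client = OpenAI()

messages = [
    # The "persona" goes in the system message; changing it can noticeably
    # change how capable the answers sound, which is the effect described above.
    {"role": "system", "content": "You are a famously careful AI researcher."},
    {"role": "user", "content": "Walk me through long multiplication of 347 x 86."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model you have access to
    messages=messages,
)
print(response.choices[0].message.content)
```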

Ryan: Even so, though, is ChatGPT a big leap forward in the journey towards AI in your mind? Or is this fairly incremental, it's just (for whatever reason) caught mainstream attention?

Eliezer: GPT-3 was a big leap forward. There's rumors about GPT-4, which, who knows? ChatGPT is a commercialization of the actual AI-in-the-lab giant leap forward. If you had never heard of GPT-3 or GPT-2 or the whole range of text transformers before ChatGPT suddenly entered into your life, then that whole thing is a giant leap forward. But it's a giant leap forward based on a technology that was published in, if I recall correctly, 2018.

David: I think that what's going around in everyone's minds right now—and the Bankless listenership (and crypto people at large) are largely futurists, so everyone (I think) listening understands that in the future, there will be sentient AIs perhaps around us, at least by the time that we all move on from this world.

So we all know that this future of AI is coming towards us. And when we see something like ChatGPT, everyone's like, “Oh, is this the moment in which our world starts to become integrated with AI?” And so, Eliezer, you've been tapped into the world of AI. Are we onto something here? Or is this just another fad that we will internalize and then move on from? And the real moment of generalized AI is actually much further out than we're initially giving it credit for. Like, where are we in this timeline?

Eliezer: Predictions are hard, especially about the future. I sure hope that this is where it saturates — this or the next generation, it goes only this far, it goes no further. It doesn't get used to make more steel or build better power plants, first because that's illegal, and second because the large language model technology's basic vulnerability is that it's not reliable. It's good for applications where it works 80% of the time, but not where it needs to work 99.999% of the time. This class of technology can't drive a car because it will sometimes crash the car.

So I hope it saturates there. I hope they can't fix it. I hope we get, like, a 10-year AI winter after this.

This is not what I actually predict. I think that we are hearing the last winds start to blow, the fabric of reality start to fray. This thing alone cannot end the world. But I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something.

Not most of the money—that just never happens in any field of human endeavor. But 1% of $10 billion is still a lot of money to actually accomplish something.

AGI

Ryan: So listeners, I think you've heard Eliezer's thesis on this, which is pretty dim with respect to AI alignment—and we'll get into what we mean by AI alignment—and very worried about AI-safety-related issues.

But I think for a lot of people to even worry about AI safety and for us to even have that conversation, I think they have to have some sort of grasp of what AGI looks like. I understand that to mean “artificial general intelligence” and this idea of a superintelligence.

Can you tell us: if there was a superintelligence on the scene, what would it look like? I mean, is this going to look like a big chat box on the internet that we can all type things into? It's like an oracle-type thing? Or is it like some sort of a robot that is going to be constructed in a secret government lab? Is this, like, something somebody could accidentally create in a dorm room? What are we even looking for when we talk about the term “AGI” and “superintelligence”?

Eliezer: First of all, I'd say those are pretty distinct concepts. ChatGPT shows a very wide range of generality compared to the previous generations of AI. Not very wide generality compared to GPT-3—not literally the lab research that got commercialized, that's the same generation. But compared to stuff from 2018 or even 2020, ChatGPT is better at a much wider range of things without having been explicitly programmed by humans to be able to do those things.

To imitate a human as best it can, it has to capture as much of what humans can think about as it can, which is not all the things. It's still not very good at long multiplication (unless you give it the right instructions, in which case suddenly it can do it).

It's significantly more general than the previous generation of artificial minds. Humans were significantly more general than the previous generation: chimpanzees, or rather Australopithecus, or our last common ancestor.

Humans are not fully general. If humans were fully general, we'd be as good at coding as we are at football, throwing things, or running. Some of us are okay at programming, but we're not spec'd for it. We're not fully general minds.

You can imagine something that's more general than a human, and if it runs into something unfamiliar, it's like, okay, let me just go reprogram myself a bit and then I'll be as adapted to this thing as I am to anything else.

So ChatGPT is less general than a human, but it's genuinely ambiguous, I think, whether it's more or less general than (say) our cousins, the chimpanzees. Or if you don't believe it's as general as a chimpanzee, a dolphin or a cat.

Ryan: So this idea of general intelligence is sort of a range of things that it can actually do, a range of ways it can apply itself?

Eliezer: How wide is it? How much reprogramming does it need? How much retraining does it need to make it do a new thing?

Bees build hives, beavers build dams, a human will look at a beehive and imagine a honeycomb-shaped dam. That's, like, humans alone in the animal kingdom. But that doesn't mean that we are fully general intelligences; it means we're significantly more generally applicable intelligences than chimpanzees.

It's not like we're all that narrow. We can walk on the moon. We can walk on the moon because there's aspects of our intelligence that are made in full generality for universes that contain simplicities, regularities, things that recur over and over again. We understand that if steel is hard on Earth, it may stay hard on the moon. And because of that, we can build rockets, walk on the moon, breathe amid the vacuum.

Chimpanzees cannot do that, but that doesn't mean that humans are the most general possible things. The thing that is more general than us, that figures that stuff out faster, is the thing to be scared of if the purposes to which it turns its intelligence are not ones that we would recognize as nice things, even in the most cosmopolitan and embracing senses of what's worth doing.

Efficiency

Ryan: And you said this idea of a general intelligence is different than the concept of superintelligence, which I also brought into that first part of the question. How is superintelligence different than general intelligence?

Eliezer: Well, because ChatGPT has a little bit of general intelligence. Humans have more general intelligence. A superintelligence is something that can beat any human and the entire human civilization at all the cognitive tasks. I don't know if the efficient market hypothesis is something where I can rely on the entire… 

Ryan: We're all crypto investors here. We understand the efficient market hypothesis for sure.

Eliezer: So the efficient market hypothesis is of course not generally true. It's not true that literally all the market prices are smarter than you. It's not true that all the prices on earth are smarter than you. Even the most arrogant person who is at all calibrated, however, still thinks that the efficient market hypothesis is true relative to them 99.99999% of the time. They only think that they know better about one in a million prices.

They might be important prices. The price of Bitcoin is an important price. It's not just a random price. But if the efficient market hypothesis was only true to you 90% of the time, you could just pick out the 10% of the remaining prices and double your money every day on the stock market. And nobody can do that. Literally nobody can do that.

So the market has this property of relative efficiency with respect to you: the price's estimate of the future price already has all the information you have—not all the information that exists in principle, maybe not all the information that the best equity analysts could have, but it's efficient relative to you.

For you, if you pick out a random price, like the price of Microsoft stock, something where you've got no special advantage, that estimate of its price a week later is efficient relative to you. You can't do better than that price.

We have much less experience with the notion of instrumental efficiency, efficiency in choosing actions, because actions are harder to aggregate estimates about than prices. So you have to look at, say, AlphaZero playing chess—or just, you know, whatever the latest Stockfish number is, an advanced chess engine.

When it makes a chess move, you can't do better than that chess move. It may not be the optimal chess move, but if you pick a different chess move, you'll do worse. That you'd call a kind of efficiency of action. Given its goal of winning the game, once you know its move—unless you consult some more powerful AI than Stockfish—you can't figure out a better move than that.

A superintelligence is like that with respect to everything, with respect to all of humanity. It is relatively efficient to humanity. It has the best estimates—not perfect estimates, but the best estimates—and its estimates contain all the information that you've got about it. Its actions are the most efficient actions for accomplishing its goals. If you think you see a better way to accomplish its goals, you're mistaken.
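
For readers who want to poke at the chess version of this claim themselves, here is a rough sketch using the python-chess library and a local Stockfish binary (assumed to be on your PATH; the search depth and the "alternative move" are arbitrary placeholders): ask the engine for its move, then compare its evaluation of the position after that move against the position after any move you would have picked instead.

```python
# Rough illustration of "efficiency of action" with a chess engine.
# Assumptions: the `python-chess` package is installed and a Stockfish binary
# named "stockfish" is on the PATH; depth and the alternative move are arbitrary.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board()  # starting position, White to move
limit = chess.engine.Limit(depth=18)

# The engine's preferred move, and the evaluation of the position after it.
engine_move = engine.play(board, limit).move
board.push(engine_move)
engine_eval = engine.analyse(board, limit)["score"].white()
board.pop()

# Any alternative move we might have preferred, evaluated the same way.
my_move = list(board.legal_moves)[0]  # stand-in for "the move you'd pick"
board.push(my_move)
my_eval = engine.analyse(board, limit)["score"].white()
board.pop()

print(f"engine's move {engine_move}, eval afterwards: {engine_eval}")
print(f"your move     {my_move}, eval afterwards: {my_eval}")
# The point of the analogy: picking a different move than the engine's
# should not leave you better off (from White's point of view here).
engine.quit()
```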

Ryan: So you're saying [if something is a] superintelligence, we'd have to imagine something that knows all of the chess moves in advance. But here we're not talking about chess, we're talking about everything. It knows all of the moves that we would make and the most optimum pattern, including moves that we would not even know how to make, and it knows these things in advance.

I mean, how would human beings sort of experience such a superintelligence? I think we still have a very hard time imagining something smarter than us, just because we've never experienced anything like it before.

Of course, we all know somebody who's genius-level IQ, maybe quite a bit smarter than us, but we've never encountered something like what you're describing, some sort of mind that is superintelligent.

What sort of things would it be doing that humans couldn't? How would we experience this in the world?

Eliezer: I mean, we do have some tiny bit of experience with it. We have experience with chess engines, where we just can't figure out better moves than they make. We have experience with market prices, where even though your uncle has this really long, elaborate story about Microsoft stock, you just know he's wrong. Why is he wrong? Because if he was correct, it would already be incorporated into the stock price.

And the market's efficiency is not perfect; look at that whole downward swing and then upward move during COVID. I have friends who made more money off that than I did, but I still managed to buy back into the broader stock market on the exact day of the low—basically coincidence. So the markets aren't perfectly efficient, but they're efficient almost everywhere.

And that sense of deference, that sense that your weird uncle can't possibly be right because the hedge funds would know it—you know, unless he's talking about COVID, in which case maybe he is right if you have the right choice of weird uncle! I have weird friends who are maybe better at calling these things than your weird uncle. So among humans, it's subtle.

And then with superintelligence, it's not subtle, just massive advantage. But not perfect. It's not that it knows every possible move you make before you make it. It's that it's got a good probability distribution about that. And it has figured out all the good moves you could make and figured out how to reply to those.

And I mean, in practice, what's that like? Well, unless it's limited, narrow superintelligence, I think you mostly don't get to observe it because you are dead, unfortunately.

Ryan: What? (laughs)

Eliezer: Like, Stockfish makes strictly better chess moves than you, but it's playing on a very narrow board. And the fact that it's better at you than chess doesn't mean it's better at you than everything. And I think that the actual catastrophe scenario for AI looks like big advancement in a research lab, maybe driven by them getting a giant venture capital investment and being able to spend 10 times as much on GPUs as they did before, maybe driven by a new algorithmic advance like transformers, maybe driven by hammering out some tweaks in last year's algorithmic advance that gets the thing to finally work efficiently. And the AI there goes over a critical threshold, which most obviously could be like, “can write the next AI”. 

That's so obvious that science fiction writers figured it out almost before there were computers, possibly even before there were computers. I'm not sure what the exact dates here are. But if it's better than you at everything, it's better than you at building AIs. That snowballs. It gets an immense technological advantage. If it's smart, it doesn't announce itself. It doesn't tell you that there's a fight going on. It emails out some instructions to one of those labs that'll synthesize DNA and synthesize proteins from the DNA and get some proteins mailed to a hapless human somewhere who gets paid a bunch of money to mix together some stuff they got in the mail in a vial. Like, smart people will not do this for any sum of money. Many people are not smart. That builds the ribosome, but a ribosome that builds things out of covalently bonded diamondoid instead of proteins folding up and held together by van der Waals forces, and it builds tiny diamondoid bacteria. The diamondoid bacteria replicate using atmospheric carbon, hydrogen, oxygen, nitrogen, and sunlight. And a couple of days later, everybody on earth falls over dead in the same second.

That's the disaster scenario if it's as smart as I am. If it's smarter, it might think of a better way to do things. But it can at least think of that if it's relatively efficient compared to humanity because I'm in humanity and I thought of it.

Ryan: This is—I've got a million questions, but I'm gonna let David go first.

David: Yeah. So we speedran the introduction of a number of different concepts, which I want to go back and take our time to really dive into.

There's the AI alignment problem. There's AI escape velocity. There is the question of what happens when AIs are so incredibly intelligent that humans are to AIs what ants are to us.

And so I want to kind of go back and tackle these, Eliezer, one by one.

We started this conversation talking about ChatGPT, and everyone's up in arms about ChatGPT. And you're saying like, yes, it's a great step forward in the generalizability of some of the technologies that we have in the AI world. All of a sudden ChatGPT becomes immensely more useful and it's really stoking the imaginations of people today.

But what you're saying is it's not the thing that's actually going to be the thing to reach escape velocity and create superintelligent AIs that perhaps might be able to enslave us. But my question to you is, how do we know when that—

Eliezer: Not enslave. They don't enslave you, but sorry, go on.

David: Yeah, sorry.

Ryan: Murder, David. Kill all of us. Eliezer was very clear on that.

David: So if it's not ChatGPT, how close are we? Because there's this unknown event horizon, which you kind of alluded to, where we make this AI that we train to create a smarter AI, and that smart AI is so incredibly smart that it hits escape velocity and all of a sudden these dominoes fall. How close are we to that point? And are we even capable of answering that question?

Eliezer: How the heck would I know? 

Ryan: Well, when you were talking, Eliezer, if we had already crossed that event horizon, a smart AI wouldn't necessarily broadcast that to the world. I mean, it's possible we've already crossed that event horizon, is it not?

Eliezer: I mean, it's theoretically possible, but seems very unlikely. Somebody would need inside their lab an AI that was much more advanced than the public AI technology. And as far as I currently know, the best labs and the best people are throwing their ideas to the world! Like, they don't care.

And there's probably some secret government labs with secret government AI researchers. My pretty strong guess is that they don't have the best people and that those labs could not create ChatGPT on their own because ChatGPT took a whole bunch of fine twiddling and tuning and visible access to giant GPU farms and that they don't have the people who know how to do the twiddling and tuning. This is just a guess.

AI Alignment

David: Could you walk us through—one of the big things that you spend a lot of time on is this thing called the AI alignment problem. Some people are not convinced that when we create AI, that AI won't really just be fundamentally aligned with humans. I don't believe that you fall into that camp. I think you fall into the camp of when we do create this superintelligent, generalized AI, we are going to have a hard time aligning with it in terms of our morality and our ethics.

Can you walk us through a little bit of that thought process? Why do you feel disaligned?

Ryan: The dumb way to ask that question too is like, Eliezer, why do you think that the AI automatically hates us? Why is it going to—

Eliezer: It doesn't hate you.

Ryan: Why does it want to kill us all?

Eliezer: The AI doesn't hate you, neither does it love you, and you're made of atoms that it can use for something else.

David: It's indifferent to you.

Eliezer: It's got something that it actually does care about, which makes no mention of you. And you are made of atoms that it can use for something else. That's all there is to it in the end.

The reason you're not in its utility function is that the programmers did not know how to do that. The people who built the AI, or the people who built the AI that built the AI that built the AI, did not have the technical knowledge that nobody on earth has at the moment as far as I know, whereby you can do that thing and you can control in detail what that thing ends up caring about.

David: So this feels like humanity is hurtling itself towards what we're calling, again, an event horizon where there's this AI escape velocity, and there's nothing on the other side. As in, we do not know what happens past that point as it relates to having some sort of superintelligent AI and how it might be able to manipulate the world. Would you agree with that?

Eliezer: No.

Again, the Stockfish chess-playing analogy. You cannot predict exactly what move it would make, because in order to predict exactly what move it would make, you would have to be at least that good at chess, and it's better than you.

This is true even if it's just a little better than you. Stockfish is actually enormously better than you, to the point that once it tells you the move, you can't figure out a better move without consulting a different AI. But even if it was just a bit better than you, then you're in the same position.

This kind of disparity also exists between humans. If you ask me, where will Garry Kasparov move on this chessboard? I'm like, I don't know, maybe here. Then if Garry Kasparov moves somewhere else, it doesn't mean that he's wrong, it means that I'm wrong. If I could predict exactly where Garry Kasparov would move on a chessboard, I'd be Garry Kasparov. I'd be at least that good at chess. Possibly better: I might also be able to predict him, but also see an even better move than that. 

That's an irreducible source of uncertainty with respect to superintelligence, or anything that's smarter than you. If you could predict exactly what it would do, you'd be that smart yourself. It doesn't mean you can predict no facts about it.

With Stockfish in particular, I can predict it's going to win the game. I know what it's optimizing for. I know where it's trying to steer the board. I can't predict exactly what the board will end up looking like after Stockfish has finished winning its game against me. I can predict it will be in the class of states that are winning positions for black or white or whichever color Stockfish picked, because, you know, it wins either way.

And that's similarly where I'm getting the prediction about everybody being dead, because if everybody were alive, then there'd be some state that the superintelligence preferred to that state: namely, all of the atoms making up these people and their farms being used for something else that it values more.

So if you postulate that everybody's still alive, I'm like, okay, well, why is it you're postulating that Stockfish made a stupid chess move and ended up with a non-winning board position? That's where that class of predictions comes from.

Ryan: Can you reinforce this argument, though, a little bit? So, why is it that an AI can't be nice, sort of like a gentle parent to us, rather than sort of a murderer looking to deconstruct our atoms and put them to use somewhere else?

What are its goals? And why can't they be aligned to at least some of our goals? Or maybe, why can't it get into a status which is somewhat like us and the ants, which is largely we just ignore them unless they interfere in our business and come in our house and raid our cereal boxes?

Eliezer: There's a bunch of different questions there. So first of all, the space of minds is very wide. Imagine this giant sphere and all the humans are in this one tiny corner of the sphere. We're all basically the same make and model of car, running the same brand of engine. We're just all painted slightly different colors.

Somewhere in that mind space, there's things that are as nice as humans. There's things that are nicer than humans. There are things that are trustworthy and nice and kind in ways that no human can ever be. And there's even things that are so nice that they can understand the concept of leaving you alone and doing your own stuff sometimes instead of hanging around trying to be obsessively nice to you every minute and all the other famous disaster scenarios from ancient science fiction ("With Folded Hands" by Jack Williamson is the one I'm quoting there.)

We don't know how to reach into mind design space and pluck out an AI like that. It's not that they don't exist in principle. It's that we don't know how to do it. And I’ll hand back the conversational ball now and figure out, like, which next question do you want to go down there?

Ryan: Well, I mean, why? Why is it so difficult to align an AI with even our basic notions of morality?

Eliezer: I mean, I wouldn't say that it's difficult to align an AI with our basic notions of morality. I'd say that it's difficult to align an AI on a task like “take this strawberry, and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level”. So it looks the same under like a standard optical microscope, but maybe not a scanning electron microscope. Do that. Don't destroy the world as a side effect.

Now, this does intrinsically take a powerful AI. There's no way you can make it easy to align by making it stupid. To build something that's cellular identical to a strawberry—I mean, mostly I think the way that you do this is with very primitive nanotechnology, but we could also do it using very advanced biotechnology. And these are not technologies that we already have. So it's got to be something smart enough to develop new technology.

Never mind all the subtleties of morality. I think we don't have the technology to align an AI to the point where we can say, “Build me a copy of the strawberry and don't destroy the world.”

Why do I think that? Well, case in point, look at natural selection building humans. Natural selection mutates the humans a bit, runs another generation. The fittest ones reproduce more, their genes become more prevalent in the next generation. Natural selection hasn't really had very much time to do this to modern humans at all, but you know, the hominid line, the mammalian line, go back a few million generations. And this is an example of an optimization process building an intelligence.

And natural selection asked us for only one thing: “Make more copies of your DNA. Make your alleles more relatively prevalent in the gene pool.” Maximize your inclusive reproductive fitness—not just your own reproductive fitness, but your two brothers or eight cousins, as the joke goes, because they've got on average one copy of your genes. This is all we were optimized for, for millions of generations, creating humans from scratch, from the first accidentally self-replicating molecule.

Internally, psychologically, inside our minds, we do not know what genes are. We do not know what DNA is. We do not know what alleles are. We have no concept of inclusive genetic fitness until our scientists figure out what that even is. We don't know what we were being optimized for. For a long time, many humans thought they'd been created by God!

When you use the hill-climbing paradigm and optimize for one single extremely pure thing, this is how much of it gets inside.

In the ancestral environment, in the exact distribution that we were originally optimized for, humans did tend to end up using their intelligence to try to reproduce more. Put them into a different environment, and all the little bits and pieces and fragments of optimizing for fitness that were in us now do totally different stuff. We have sex, but we wear condoms.

If natural selection had been a foresightful, intelligent kind of engineer that was able to engineer things successfully, it would have built us to be revolted by the thought of condoms. Men would be lined up and fighting for the right to donate to sperm banks. And in our natural environment, the little drives that got into us happened to lead to more reproduction, but then comes distributional shift: run the humans outside the distribution they were optimized over, and you get totally different results. 

And gradient descent would by default do—not quite the same thing, it's going to do a weirder thing because natural selection has a much narrower information bottleneck. In one sense, you could say that natural selection was at an advantage because it finds simpler solutions. You could imagine some hopeful engineer who just built intelligences using gradient descent and found out that they end up wanting these thousands and millions of little tiny things, none of which were exactly what the engineer wanted, and being like, well, let's try natural selection instead. It's got a much sharper information bottleneck. It'll find the simple specification of what I want.

But what you actually get there is us humans. And then gradient descent is probably, maybe, even worse.

But more importantly, I'm just pointing out that there is no physical law, computational law, mathematical/logical law, saying when you optimize using hill-climbing on a very simple, very sharp criterion, you get a general intelligence that wants that thing.
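
As a concrete toy version of that point, here is a short hill-climbing sketch; it is my own illustration under made-up assumptions, not anything from the conversation. A two-parameter policy is selected purely for a "calories gathered" score in a training world where sweetness tracks calories, and then the world shifts so that the sweetest items are no longer the caloric ones.

```python
# Toy illustration: hill-climb a policy on one outer criterion in one
# distribution, then watch what its learned behavior does after a shift.
# Everything here (worlds, numbers, the "calorie" criterion) is made up.
import numpy as np

rng = np.random.default_rng(0)

def training_world(n):
    # "Ancestral" distribution: sweetness is a reliable proxy for calories.
    calories = rng.uniform(0, 1, n)
    sweetness = np.clip(calories + 0.05 * rng.normal(size=n), 0, 1)
    return sweetness, calories

def shifted_world(n):
    # After the shift, the sweetest items (think: ice cream) have the fewest calories.
    sweetness = rng.uniform(0, 1, n)
    calories = 1.0 - sweetness
    return sweetness, calories

def outer_score(w, sweetness, calories):
    # Outer criterion: average calories gained, minus a fixed cost of 0.5 per item
    # eaten. The policy only senses sweetness; it eats with probability
    # sigmoid(w0 + w1 * sweetness).
    z = np.clip(w[0] + w[1] * sweetness, -30, 30)
    p_eat = 1.0 / (1.0 + np.exp(-z))
    return np.mean(p_eat * (calories - 0.5))

# Hill-climb the policy's two parameters against the outer criterion.
sweet_tr, cal_tr = training_world(5000)
w = np.zeros(2)
for _ in range(3000):
    candidate = w + 0.1 * rng.normal(size=2)       # random mutation
    if outer_score(candidate, sweet_tr, cal_tr) > outer_score(w, sweet_tr, cal_tr):
        w = candidate                               # keep improvements only

sweet_sh, cal_sh = shifted_world(5000)
print("learned weights (bias, sweetness):", w)
print("outer score, training world:", outer_score(w, sweet_tr, cal_tr))
print("outer score, shifted world: ", outer_score(w, sweet_sh, cal_sh))
# The selected policy chases sweetness, because that worked in the distribution
# it was optimized over; after the shift, the very same behavior no longer
# achieves the criterion it was selected for.
```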

Ryan: So just like natural selection, our tools are too blunt to get to the level of granularity needed to program some sort of morality into these superintelligent systems?

Eliezer: Or build me a copy of a strawberry without destroying the world. Yeah. The tools are too blunt.

David: So I just want to make sure I'm following with what you were saying. I think the conclusion that you left me with is that my brain, which I consider to be at least decently smart, is actually a byproduct, an accidental byproduct of this desire to reproduce. And it's actually just like a tool that I have, and just like conscious thought is a tool, which is a useful tool in means of that end.

And so if we're applying this to AI and AI's desire to achieve some certain goal, what's the parallel there?

Eliezer: I mean, every organ in your body is a reproductive organ. If it didn't help you reproduce, you would not have an organ like that. Your brain is no exception. This is merely conventional science and merely the conventional understanding of the world. I'm not saying anything here that ought to be at all controversial. I'm sure it's controversial somewhere, but within a pre-filtered audience, it should not be at all controversial. And this is, like, the obvious thing to expect to happen with AI, because why wouldn't it? What new law of existence has been invoked, whereby this time we optimize for a thing and we get a thing that wants exactly what we optimized for on the outside?

AI Goals

Ryan: So what are the types of goals an AI might want to pursue? What types of utility functions is it going to want to pursue off the bat? Is it just those it's been programmed with, like make an identical strawberry?

Eliezer: Well, the whole thing I'm saying is that we do not know how to get goals into a system. We can cause them to do a thing inside a distribution they were optimized over using gradient descent. But if you shift them outside of that distribution, I expect other weird things start happening. When they reflect on themselves, other weird things start happening.

What kind of utility functions are in there? I mean, darned if I know. I think you'd have a pretty hard time calling the shape of humans in advance by looking at natural selection, the thing that natural selection was optimizing for, if you'd never seen a human or anything like a human.

If we optimize them from the outside to predict the next line of human text, like GPT-3—I don't actually think this line of technology leads to the end of the world, but maybe it does, in like GPT-7—there's probably a bunch of stuff in there too that desires to accurately model things like humans under a wide range of circumstances, but it's not exactly humans, because: ice cream.

Ice cream didn't exist in the natural environment, the ancestral environment, the environment of evolutionary adaptedness. There was nothing with that much sugar, salt, and fat combined together the way ice cream has. We are not built to want ice cream. We were built to want strawberries, honey, a gazelle that you killed and cooked and had some fat in it and was therefore nourishing and gave you the all-important calories you needed to survive, and salt, so that when you sweated you didn't run out of salt. We evolved to want those things, but then ice cream comes along and it fits those taste buds better than anything that existed in the environment that we were optimized over.

So, a very primitive, very basic, very unreliable wild guess, but at least an informed kind of wild guess: Maybe if you train a thing really hard to predict humans, then among the things that it likes are tiny little pseudo things that meet the definition of “human” but weren't in its training data and that are much easier to predict, or where the problem of predicting them can be solved in a more satisfying way, where “satisfying” is not like human satisfaction, but some other criterion of “thoughts like this are tasty because they help you predict the humans from the training data”. (shrugs)

 

Consensus

David: Eliezer, when we talk about all of these ideas about the ways that AI thought will be fundamentally not able to be understood by the ways that humans think, and then all of a sudden we see this rotation by venture capitalists by just pouring money into AI, do alarm bells go off in your head? Like, hey guys, you haven't thought deeply about these subject matters yet? Does the immense amount of capital going into AI investments scare you?

Eliezer: I mean, alarm bells went off for me in 2015, which is when it became obvious that this is how it was going to go down. I sure am now seeing the realization of that stuff I felt alarmed about back then.

Ryan: Eliezer, is this view that AI is incredibly dangerous and that AGI is going to eventually end humanity and that we're just careening toward a precipice, would you say this is the consensus view now, or are you still somewhat of an outlier? And why aren't other smart people in this field as alarmed as you? Can you steel-man their arguments?

Eliezer: You're asking, again, several questions sequentially there. Is it the consensus view? No. Do I think that the people in the wider scientific field who dispute this point of view—do I think they understand it? Do I think they've done anything like an impressive job of arguing against it at all? No.

If you look at the famous prestigious scientists who sometimes make a little fun of this view in passing, they're making up arguments rather than deeply considering things that are held to any standard of rigor, and people outside their own fields are able to validly shoot them down.

I have no idea how to pronounce his last name. François Chollet said something; I forget his exact words, but it was something like, I never hear any good arguments for stuff. I was like, okay, here's some good arguments for stuff. You can read the reply from Yudkowsky to Chollet (Google that), and that'll give you some idea of what the eminent voices versus the reply to the eminent voices sound like. And Scott Aaronson, who at the time was off in complexity theory, was like, “That's not how no free lunch theorems work”, correctly.

I think the state of affairs is we have eminent scientific voices making fun of this possibility, but not engaging with the arguments for it. 

Now, if you step away from the eminent scientific voices, you can find people who are more familiar with all the arguments and disagree with me. And I think they lack security mindset. I think that they're engaging in the sort of blind optimism that many, many scientific fields throughout history have engaged in, where when you're approaching something for the first time, you don't know why it will be hard, and you imagine easy ways to do things. And the way that this is supposed to naturally play out over the history of a scientific field is that you run out and you try to do the things and they don't work, and you go back and you try to do other clever things and they don't work either, and you learn some pessimism and you start to understand the reasons why the problem is hard.

The field of artificial intelligence itself recapitulated this very common ontogeny of a scientific field, where initially we had people getting together at the Dartmouth conference. I forget what their exact famous phrasing was, but it's something like, “We are wanting to address the problem of getting AIs to, you know, like understand language, improve themselves”, and I forget even what else was there. A list of what now sound like grand challenges. “And we think we can make substantial progress on this using 10 researchers for two months.” And I think that, at the core, is what's going on. 

They have not run into the actual problems of alignment. They aren't trying to get ahead of the game. They're not trying to panic early. They're waiting for reality to hit them over the head and turn them into grizzled old cynics of their scientific field who understand the reasons why things are hard. They're content with the predictable life cycle of starting out as bright-eyed youngsters, waiting for reality to hit them over the head with the news. And if it wasn't going to kill everybody the first time that they're really wrong, it'd be fine! You know, this is how science works! If we got unlimited free retries and 50 years to solve everything, it'd be okay. We could figure out how to align AI in 50 years given unlimited retries.

You know, the first team in with the bright-eyed optimists would destroy the world and people would go, oh, well, you know, it's not that easy. They would try something else clever. That would destroy the world. People would go like, oh, well, you know, maybe this field is actually hard. Maybe this is actually one of the thorny things like computer security or something. And so what exactly went wrong last time? Why didn't these hopeful ideas play out? Oh, like you optimize for one thing on the outside and you get a different thing on the inside. Wow. That's really basic. All right. Can we even do this using gradient descent? Can you even build this thing out of giant inscrutable matrices of floating point numbers that nobody understands at all? You know, maybe we need different methodology. And 50 years later, you'd have an aligned AGI.

If we got unlimited free retries without destroying the world, it'd be, you know, it'd play out the same way that ChatGPT played out. It's, you know, from 1956 or 1955 or whatever it was to 2023. So, you know, about 70 years, give or take a few. And, you know, just like we can do the stuff that they wanted to do in the summer of 1955, you know, 70 years later, you'd have your aligned AGI.

Problem is that the world got destroyed in the meanwhile. And that's why, you know, that's the problem there.

God Mode and Aliens

David: So this feels like a gigantic Don't Look Up scenario. If you're familiar with that movie, it's a movie about this asteroid hurtling to Earth, but it becomes popular and in vogue to not look up and not notice it. And Eliezer, you're the guy who's saying like, hey, there's an asteroid. We have to do something about it. And if we don't, it's going to come destroy us.

If you had God mode over the progress of AI research and just innovation and development, what choices would you make that humans are not currently making today?

Eliezer: I mean, I could say something like shut down all the large GPU clusters. How long do I have God mode? Do I get to like stick around for seventy years?

David: You have God mode for the 2020 decade.

Eliezer: For the 2020 decade. All right. That does make it pretty hard to do things.

I think I shut down all the GPU clusters and get all of the famous scientists and brilliant, talented youngsters—the vast, vast majority of whom are not going to be productive and where government bureaucrats are not going to be able to tell who's actually being helpful or not, but, you know—put them all on a large island, and try to figure out some system for filtering the stuff through to me to give thumbs up or thumbs down on that is going to work better than scientific bureaucrats producing entire nonsense.

Because, you know, the trouble is—the reason why scientific fields have to go through this long process to produce the cynical oldsters who know that everything is difficult. It's not that the youngsters are stupid. You know, sometimes youngsters are fairly smart. You know, Marvin Minsky, John McCarthy back in 1955, they weren't idiots. You know, privileged to have met both of them. They didn't strike me as idiots. They were very old, and they still weren't idiots. But, you know, it's hard to see what's coming in advance of experimental evidence hitting you over the head with it.

And if I only have the decade of the 2020s to run all the researchers on this giant island somewhere, it's really not a lot of time. Mostly what you've got to do is invent some entirely new AI paradigm that isn't the giant inscrutable matrices of floating point numbers on gradient descent. Because I'm not really seeing what you can do that's clever with that, that doesn't kill you and that you know doesn't kill you and doesn't kill you the very first time you try to do something clever like that.

You know, I'm sure there's a way to do it. And if you got to try over and over again, you could find it.

Ryan: Eliezer, do you think every intelligent civilization has to deal with this exact problem that humanity is dealing with now? Of how do we solve this problem of aligning with an advanced general intelligence?

Eliezer: I expect that's much easier for some alien species than others. Like, there are alien species who might arrive at “this problem” in an entirely different way. Maybe instead of having two entirely different information processing systems, the DNA and the neurons, they've only got one system. They can trade memories around heritably by swapping blood sexually. Maybe the way in which they “confront this problem” is that very early in their evolutionary history, they have the equivalent of the DNA that stores memories and processes, computes memories, and they swap around a bunch of it, and it adds up to something that reflects on itself and makes itself coherent, and then you've got a superintelligence before they have invented computers. And maybe that thing wasn't aligned, but how do you even align it when you're in that kind of situation? It'd be a very different angle on the problem.

Ryan: Do you think every advanced civilization is on the trajectory to creating a superintelligence at some point in its history?

Eliezer: Maybe there's ones in universes with alternate physics where you just can't do that. Their universe's computational physics just doesn't support that much computation. Maybe they never get there. Maybe their lifespans are long enough and their star lifespans short enough that they never get to the point of a technological civilization before their star does the equivalent of expanding or exploding or going out and their planet ends.

“Every alien species” covers a lot of territory, especially if you talk about alien species and universes with physics different from this one.

Ryan: Well, talking about our present universe, I'm curious if you've been confronted with the question of, well, then why haven't we seen some sort of superintelligence in our universe when we look out at the stars? Sort of the Fermi paradox type of question. Do you have any explanation for that?

Eliezer: Oh, well, supposing that they got killed by their own AIs doesn't help at all with that because then we'd see the AIs.

Ryan: And do you think that's what happens? Yeah, it doesn't help with that. We would see evidence of AIs, wouldn't we?

Eliezer: Yeah.

Ryan:  Yes. So why don't we?

Eliezer: I mean, the same reason we don't see evidence of the alien civilizations without AIs.

And that reason, although it doesn't really have much to do with the whole AI thesis one way or another, is that they're too far away—or so says Robin Hanson, using a very clever argument about the apparent difficulty of hard steps in humanity's evolutionary history to infer the rough gaps between the hard steps. ... And, you know, I can't really do justice to this. If you look up grabby aliens, you can...

Ryan: Grabby aliens?

David: I remember this.

Eliezer: Grabby aliens. You can find Robin Hanson's very clever argument for how far away the aliens are...

Ryan: There's an entire website, Bankless listeners, there's an entire website called grabbyaliens.com you can go look at.

Eliezer: Yeah. And that contains by far the best answer I've seen to that question. (laughs) But, yeah.

Ryan: This is amazing.

Eliezer: There is not a very good way to simplify the argument, any more than there is to simplify the notion of zero-knowledge proofs. It's not that difficult, but it's just very not easy to simplify. But if you have a bunch of locks that are all of different difficulties, and a limited time in which to solve all the locks, such that anybody who gets through all the locks must have gotten through them by luck, all the locks will take around the same amount of time to solve, even if they're all of very different difficulties. And that's the core of Robin Hanson's argument for how far away the aliens are, and how do we know that? (shrugs)

Good Outcomes

Ryan: Eliezer, I know you're very skeptical that there will be a good outcome when we produce an artificial general intelligence. And I said when, not if, because I believe that's your thesis as well, of course. But is there the possibility of a good outcome? I know you are working on AI alignment problems, which leads me to believe that you have greater than zero amount of hope for this project. Is there the possibility of a good outcome? What would that look like, and how do we go about achieving it?

Eliezer: It looks like me being wrong. I basically don't see on-model hopeful outcomes at this point. We have not done those things that it would take to earn a good outcome, and this is not a case where you get a good outcome by accident.

If you have a bunch of people putting together a new operating system, and they've heard about computer security, but they're skeptical that it's really that hard, the chance of them producing a secure operating system is effectively zero.

That's basically the situation I see ourselves in with respect to AI alignment. I have to be wrong about something—which I certainly am. I have to be wrong about something in a way that makes the problem easier rather than harder for those people who don't think that alignment's going to be all that hard.

If you're building a rocket for the first time ever, and you're wrong about something, it's not surprising if you're wrong about something. It's surprising if the thing that you're wrong about causes the rocket to go twice as high on half the fuel you thought was required and be much easier to steer than you were afraid of.

Ryan: So, are you...

David: Where the alternative was, “If you’re wrong about something, the rocket blows up.”

Eliezer: Yeah. And then the rocket ignites the atmosphere, is the problem there.

Or rather: a bunch of rockets blow up, a bunch of rockets go places... The analogy I usually use for this is, very early on in the Manhattan Project, they were worried about “What if the nuclear weapons can ignite fusion in the nitrogen in the atmosphere?” And they ran some calculations and decided that it was incredibly unlikely from multiple angles, so they went ahead, and were correct. We're still here. I'm not going to say that it was luck, because the calculations were actually pretty solid.

An AI is like that, but instead of needing to refine plutonium, you can make nuclear weapons out of a billion tons of laundry detergent. The stuff to make them is fairly widespread. It's not a tightly controlled substance. And they spit out gold up until they get large enough, and then they ignite the atmosphere, and you can't calculate how large is large enough. And a bunch of the CEOs running these projects are making fun of the idea that it'll ignite the atmosphere.

It's not a very helpful situation.

David: So the economic incentive to produce this AI—one of the things why ChatGPT has sparked the imaginations of so many people is that everyone can imagine products. Products are being imagined left and right about what you can do with something like ChatGPT. There's this meme at this point of people leaving to go start their ChatGPT startup.

The metaphor is that what you're saying is that there's this generally available resource spread all around the world, which is ChatGPT, and everyone's hammering it in order to make it spit out gold. But you're saying if we do that too much, all of a sudden the system will ignite the whole entire sky, and then we will all...

Eliezer: Well, no. You can run ChatGPT any number of times without igniting the atmosphere. The risk is about what the research labs at Google and Microsoft—counting DeepMind as part of Google and OpenAI as part of Microsoft—are doing: bringing more metaphorical plutonium together than ever before. It's not about how many times you run the things that have already been built and haven't destroyed the world yet.

You can do any amount of stuff with ChatGPT and not destroy the world. It's not that smart. It doesn't get smarter every time you run it.

 

Ryan's Childhood Questions

Ryan: Can I ask some questions that the 10-year-old in me wants to really ask about this? I'm asking these questions because I think a lot of listeners might be thinking them too, so knock off some of these easy answers for me.

If we create some sort of unaligned, let's call it “bad” AI, why can't we just create a whole bunch of good AIs to go fight the bad AIs and solve the problem that way? Can there not be some sort of counterbalance in terms of aligned human AIs and evil AIs, and there be some sort of battle of the artificial minds here?

Eliezer: Nobody knows how to create any good AIs at all. The problem isn't that we have 20 good AIs and then somebody finally builds an evil AI. The problem is that the first very powerful AI is evil, nobody knows how to make it good, and then it kills everybody before anybody can make it good.

Ryan: So there is no known way to make a friendly, human-aligned AI whatsoever, and you don't know of a good way to go about thinking through that problem and designing one. Neither does anyone else, is what you're telling us.

Eliezer: I have some idea of what I would do if there were more time. Back in the day, we had more time. Humanity squandered it. I'm not sure there's enough time left now. I have some idea of what I would do if I were in a 25-year-old body and had $10 billion.

Ryan: That would be the island scenario of “You're God for 10 years and you get all the researchers on an island and go really hammer for 10 years at this problem”?

Eliezer: If I have buy-in from a major government that can run actual security precautions and more than just $10 billion, then you could run a whole Manhattan Project about it, sure.

Ryan: This is another question that the 10-year-old in me wants to know. Why is it that, Eliezer, people listening to this episode or people listening to the concerns or reading the concerns that you've written down and published, why can't everyone get on board who's building an AI and just all agree to be very, very careful? Is that not a sustainable game-theoretic position to have? Is this a coordination problem, more of a social problem than anything else? Or, like, why can't that happen?

I mean, we have so far not destroyed the world with nuclear weapons, and we've had them since the 1940s.

Eliezer: Yeah, this is harder than nuclear weapons. This is a lot harder than nuclear weapons.

Ryan: Why is this harder? And why can't we just coordinate to just all agree internationally that we're going to be very careful, put restrictions on this, put regulations on it, do something like that?

Eliezer: Current heads of major labs seem to me to be openly contemptuous of these issues. That's where we're starting from. The politicians do not understand it.

There are distortions of these ideas that are going to sound more appealing to them than “everybody suddenly falls over dead”, which is a thing that I think actually happens. “Everybody falls over dead” just doesn't inspire the monkey political parts of our brain somehow. Because it's not like, “Oh no, what if terrorists get the AI first?” It's like, it doesn't matter who gets it first. Everybody falls over dead.

And yeah, so you're describing a world coordinating on something that is relatively hard to coordinate. So, could we, if we tried starting today, prevent anyone from getting a billion pounds of laundry detergent in one place worldwide, control the manufacturing of laundry detergent, only have it manufactured in particular places, not concentrate lots of it together, enforce it on every country?

Y’know, if it was legible, if it was clear that a billion pounds of laundry detergent in one place would end the world, if you could calculate that, if all the scientists calculated it and arrived at the same answer and told the politicians, then maybe, maybe humanity would survive, even though smaller amounts of laundry detergent spit out gold.

The threshold can't be calculated. I don't know how you'd convince the politicians. We definitely don't seem to have had much luck convincing those CEOs whose job depends on them not caring, to care. Caring is easy to fake. It's easy to hire a bunch of people to be your “AI safety team”  and redefine “AI safety” as having the AI not say naughty words. Or, you know, I'm speaking somewhat metaphorically here for reasons.

But, you know, it's the basic problem that we have is like trying to build a secure OS before we run up against a really smart attacker. And there's all kinds of, like, fake security. “It's got a password file! This system is secure! It only lets you in if you type a password!” And if you never go up against a really smart attacker, if you never go far out of distribution against a powerful optimization process looking for holes, you know, then how does a bureaucracy come to know that what they're doing is not the level of computer security that they need? The way you're supposed to find this out, the way that scientific fields historically find this out, the way that fields of computer science historically find this out, the way that crypto found this out back in the early days, is by having the disaster happen! 

And we're not even that good at learning from relatively minor disasters! You know, like, COVID swept the world. Did the FDA or the CDC learn anything about “Don't tell hospitals that they're not allowed to use their own tests to detect the coming plague”? Are we installing UV-C lights in public spaces or in ventilation systems to prevent the next respiratory pandemic? You know, we lost a million people and we sure did not learn very much as far as I can tell for next time.

We could have an AI disaster that kills a hundred thousand people—how do you even do that? Robotic cars crashing into each other? Have a bunch of robotic cars crashing into each other! It's not going to look like that was the fault of artificial general intelligence, because they're not going to put AGIs in charge of cars. They're going to pass a bunch of regulations that affect the actual AGI disaster barely or not at all.

What does the winning world even look like here? How, in real life, did we get from where we are now to this worldwide ban, including against North Korea and, you know, some rogue nation whose dictator doesn't believe in all this nonsense and just wants the gold that these AIs spit out? How did we get there from here? How do we get to the point where the United States and China sign a treaty whereby they would both use nuclear weapons against Russia if Russia built a GPU cluster that was too large? How did we get there from here?

David: Correct me if I'm wrong, but this seems to be kind of just like a topic of despair? I'm talking to you now and hearing your thought process about, like, there is no known solution and the trajectory's not great. Do you think all hope is lost here?

Eliezer: I'll keep on fighting until the end, which I wouldn't do if I had literally zero hope. I could still be wrong about something in a way that makes this problem somehow much easier than it currently looks. I think that's how you go down fighting with dignity.

Ryan: “Go down fighting with dignity.” That's the stage you think we're at.

I want to just double-click on what you were just saying. Part of the case that you're making is humanity won't even see this coming. So it's not like a coordination problem like global warming where every couple of decades we see the world go up by a couple of degrees, things get hotter, and we start to see these effects over time. The characteristics or the advent of an AGI in your mind is going to happen incredibly quickly, and in such a way that we won't even see the disaster until it's imminent, until it's upon us...?

Eliezer: I mean, if you want some kind of, like, formal phrasing, then I think that superintelligence will kill everyone before non-superintelligent AIs have killed one million people. I don't know if that's the phrasing you're looking for there.

Ryan: I think that's a fairly precise definition, and why? What goes into that line of thought?

Eliezer: I think that the current systems are actually very weak. I don't know, maybe I could use the analogy of Go, where you had systems that were finally competitive with the pros, where “pro” is like the set of ranks in Go, and then a year later, they were challenging the world champion and winning. And then another year, they threw out all the complexities and the training from human databases of Go games and built a new system, AlphaGo Zero, that trained itself from scratch. No looking at the human playbooks, no special-purpose code, just a general purpose game-player being specialized to Go, more or less.

And, three days—there's a quote from Gwern about this, which I forget exactly, but it was something like, “We know how long AlphaGo Zero, or AlphaZero (two different systems), was equivalent to a human Go player. And it was, like, 30 minutes on such-and-such a floor of the DeepMind building.”

Maybe the first system doesn't improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from “it takes a long time to train” to “it trains very quickly and without looking at the human playbook”, that’s not with an artificial intelligence system that improves itself, or even one that gets smarter as you run it, the way that human beings improve (not just as you evolve them, but as you run them over the course of their own lifetimes).

So if the first system doesn't improve fast enough to kill everyone very quickly, they will build one that's meant to spit out more gold than that.

And there could be weird things that happen before the end. I did not see ChatGPT coming, I did not see Stable Diffusion coming, I did not expect that we would have AIs smoking humans in rap battles before the end of the world. Ones that are clearly much dumber than us.

Ryan: It’s kind of a nice send-off, I guess, in some ways.

Trying to Resist

Ryan: So you said that your hope is not zero, and you are planning to fight to the end. What does that look like for you? I know you're working at MIRI, which is the Machine Intelligence Research Institute. This is a non-profit that I believe that you've set up to work on these AI alignment and safety issues. What are you doing there? What are you spending your time on? How do we actually fight until the end? If you do think that an end is coming, how do we try to resist?

Eliezer: I'm actually on something of a sabbatical right now, which is why I have time for podcasts. It's a sabbatical from, you know, like, been doing this 20 years. It became clear we were all going to die. I felt kind of burned out, taking some time to rest at the moment. When I dive back into the pool, I don't know, maybe I will go off to Conjecture or Anthropic or one of the smaller concerns like Redwood Research—Redwood Research being the only ones I really trust at this point, but they're tiny—and try to figure out if I can see anything clever to do with the giant inscrutable matrices of floating point numbers.

Maybe I just write, continue to try to explain in advance to people why this problem is hard instead of as easy and cheerful as the current people who think they're pessimists think it will be. I might not be working all that hard compared to how I used to work. I'm older than I was. My body is not in the greatest of health these days. Going down fighting doesn't necessarily imply that I have the stamina to fight all that hard. I wish I had prettier things to say to you here, but I do not.

Ryan: No, this is... We intended to save probably the last part of this episode to talk about crypto, the metaverse, and AI and how this all intersects. But I gotta say, at this point in the episode, it all kind of feels pointless to go down that track.

We were going to ask questions like, well, in crypto, should we be worried about building sort of a property rights system, an economic system, a programmable money system for the AIs to sort of use against us later on? But it sounds like the easy answer from you to those questions would be, yeah, absolutely. And by the way, none of that matters regardless. You could do whatever you'd like with crypto. This is going to be the inevitable outcome no matter what.

Let me ask you, what would you say to somebody listening who maybe has been sobered up by this conversation? If a version of you in your 20s does have the stamina to continue this battle and to actually fight on behalf of humanity against this existential threat, where would you advise them to spend their time? Is this a technical issue? Is this a social issue? Is it a combination of both? Should they educate? Should they spend time in the lab? What should a person listening to this episode do with these types of dire straits?

Eliezer: I don't have really good answers. It depends on what your talents are. If you've got the very deep version of the security mindset: not just the part where you put a password on your system so that nobody can walk in and directly misuse it, and not just the part where you encrypt the password file even though nobody's supposed to have access to it in the first place (and anyone who does is already an authorized user), but the part where you hash the passwords and salt the hashes. If you're the kind of person who can think of that from scratch, maybe try your hand at alignment.
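For readers who want the concrete version of that password example, here is a minimal sketch of the three levels using only Python's standard library; the function names are invented for this illustration.

```python
import hashlib
import hmac
import os

def store_plaintext(password: str) -> str:
    # Level 1: "it's got a password file" security. Anyone who reads the file
    # has every password.
    return password

def store_hashed(password: str) -> str:
    # Level 2: hash the passwords. Better, but identical passwords produce
    # identical hashes, so precomputed tables crack the common ones in bulk.
    return hashlib.sha256(password.encode()).hexdigest()

def store_salted(password: str) -> tuple[bytes, bytes]:
    # Level 3: salt each hash and use a deliberately slow key-derivation
    # function, so every entry has to be attacked separately, and slowly,
    # even by an attacker who already holds the whole file.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_salted(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, digest)
```

The security-mindset move is the third one: designing for the attacker who already has the password file, not just for the user the password prompt was meant to stop.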

If you can think of an alternative to the giant inscrutable matrices, then, you know, don't tell the world about that. I'm not quite sure where you go from there, but maybe you work with Redwood Research or something.

A whole lot of this problem is that even if you do build an AI that's limited in some way, somebody else steals it, copies it, runs it themselves, and takes the bounds off the for loops and the world ends. 

So there's that. You think you can do something clever with the giant inscrutable matrices? You're probably wrong. If you have the talent to try to figure out why you're wrong in advance of being hit over the head with it, and not in a way where you just make random far-fetched stuff up as the reason why it won't work, but where you can actually keep looking for the reason why it won't work...

We have people in crypto[graphy] who are good at breaking things, and they're the reason why anything is not on fire. Some of them might go into breaking AI systems instead, because that's where you learn anything.

You know: any fool can build a crypto[graphy] system that they think will work. Breaking existing cryptographic systems is how we learn who the real experts are. So maybe the people finding weird stuff to do with AIs will come up with some truth about these systems that makes them easier to align than I suspect.

How do I put it... The saner outfits do have uses for money. They don't really have scalable uses for money, but they do have some use for it. Like, if you gave MIRI a billion dollars, I would not know how to...

Well, at a billion dollars, I might try to bribe people to move out of AI development (the kind that gets broadcast to the whole world) and move to the equivalent of an island somewhere—not even to make any kind of critical discovery, but just to remove them from the system. If I had a billion dollars.

If I just have another $50 million, I'm not quite sure what to do with that. But if you donate that to MIRI, then you at least have the assurance that we will not randomly spray money around to look like we're doing stuff; we'll reserve it, as we are doing with the last giant crypto donation somebody gave us, until we can figure out something to do with it that is actually helpful. And MIRI has that property. I would say probably Redwood Research has that property.

Yeah. I realize I'm sounding sort of disorganized here, and that's because I don't really have a good organized answer to how in general somebody goes down fighting with dignity.

MIRI and Education

Ryan: I know a lot of people in crypto. They are not as in touch with artificial intelligence, obviously, as you are, and the AI safety issues and the existential threat that you've presented in this episode. They do care a lot and see coordination problems throughout society as an issue. Many have also generated wealth from crypto, and care very much about humanity not ending. What sort of things has MIRI, the organization I was talking about earlier, done with funds that you've received from crypto donors and elsewhere? And what sort of things might an organization like that pursue to try to stave this off?

Eliezer: I mean, I think mostly we've pursued a lot of lines of research that haven't really panned out, which is a respectable thing to do. We did not know in advance that those lines of research would fail to pan out. If you're doing research that you know will work, you're probably not really doing any research. You're just doing a pretense of research that you can show off to a funding agency.

We try to be real. We did things where we didn't know the answer in advance. They didn't work, but that was where the hope lay, I think. But, you know, having a research organization that keeps it real that way, that's not an easy thing to do. And if you don't have this very deep form of the security mindset, you will end up producing fake research and doing more harm than good, so I would not tell all the successful cryptocurrency people to run off and start their own research outfits.

Redwood Research—I'm not sure if they can scale using more money, but you can give people more money and wait for them to figure out how to scale it later if they're the kind who won't just run off and spend it, which is what MIRI aspires to be.

Ryan: And you don't think the education path is a useful path? Just educating the world?

Eliezer: I mean, I would give myself and MIRI credit for why the world isn't just walking blindly into the whirling razor blades here, but it's not clear to me how far education scales apart from that. You can get more people aware that we're walking directly into the whirling razor blades, because even if only 10% of the people can get it, that can still be a bunch of people. But then what do they do? I don't know. Maybe they'll be able to do something later.

Can you get all the people? Can you get all the politicians? Can you get the people whose job incentives are against them admitting this to be a problem? I have various friends who report, like, “Ah yes, if you talk to researchers at OpenAI in private, they are very worried and say that they cannot be that worried in public.”

 

How Long Do We Have?

Ryan: This is all a giant Moloch trap, is sort of what you're telling us. I feel like this is the part of the conversation where we've gotten to the end and the doctor has said that we have some sort of terminal illness. And at the end of the conversation, I think the patient, David and I, have to ask the question, “Okay, doc, how long do we have?” Seriously, what are we talking about here if you turn out to be correct? Are we talking about years? Are we talking about decades? What's your idea here?

David: What are you preparing for, yeah?

Eliezer: How the hell would I know? Enrico Fermi was saying that fission chain reactions were 50 years off, if they could ever be done at all, two years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright Flyer. How on earth would I know?

It could be three years. It could be 15 years. We could get that AI winter I was hoping for, and it could be 16 years. I'm not really seeing 50 without some kind of giant civilizational catastrophe. And to be clear, whatever civilization arises after that would probably, I'm guessing, end up stuck in just the same trap we are.

Ryan: I think the other thing that the patient might do at the end of a conversation like this is to also consult with other doctors. I'm kind of curious who we should talk to on this quest. Who are some people that if people in crypto want to hear more about this or learn more about this, or even we ourselves as podcasters and educators want to pursue this topic, who are the other individuals in the AI alignment and safety space you might recommend for us to have a conversation with?

Eliezer: Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano. He does not write Harry Potter fan fiction, and I expect him to have a harder time explaining himself in concrete terms. But that is the main technical voice of opposition. If you talk to other people in the effective altruism or AI alignment communities who disagree with this view, they are probably to some extent repeating back their misunderstandings of Paul Christiano's views. 

You could try Ajeya Cotra, who's worked pretty directly with Paul Christiano and I think sometimes aspires to explain these things that Paul is not the best at explaining. I'll throw out Kelsey Piper as somebody who would be good at explaining—like, would not claim to be a technical person on these issues, but is good at explaining the part that she does know. 

Who else disagrees with me? I'm sure Robin Hanson would be happy to come on... well, I'm not sure he'd be happy to come on this podcast, but Robin Hanson disagrees with me, and I kind of feel like the famous argument we had back in the early 2010s, late 2000s about how this would all play out—I basically feel like this was the Yudkowsky position, this is the Hanson position, and then reality was over here, well to the Yudkowsky side of the Yudkowsky position in the Yudkowsky-Hanson debate. But Robin Hanson does not feel that way, and would probably be happy to expound on that at length. 

I don't know. It's not hard to find opposing viewpoints. The ones that'll stand up to a few solid minutes of cross-examination from somebody who knows which parts to cross-examine, that's the hard part.

Bearish Hope

Ryan: You know, I've read a lot of your writings and listened to you on previous podcasts. One was in 2018 on the Sam Harris podcast. This conversation feels to me like the most dire you've ever seemed on this topic. And maybe that's not true. Maybe you've sort of always been this way, but it seems like the direction of your hope that we solve this issue has declined. I'm wondering if you feel like that's the case, and if you could sort of summarize your take on all of this as we close out this episode and offer, I guess, any concluding thoughts here.

Eliezer: I mean, I don't know if you've got a time limit on this episode? Or is it just as long as it runs?

Ryan: It's as long as it needs to be, and I feel like this is a pretty important topic. So you answer this however you want.

Eliezer: Alright. Well, there was a conference one time on “What are we going to do about the looming risk of AI disaster?”, and Elon Musk attended that conference. And I was like: Maybe this is it. Maybe this is when the powerful people notice, and it's one of the relatively more technical powerful people who could be noticing this. And maybe this is where humanity finally turns and starts... not quite fighting back, because there isn't an external enemy here, but conducting itself with... I don't know. Acting like it cares, maybe?

And what came out of that conference, well, was OpenAI, which was fairly nearly the worst possible way of doing anything. This is not a problem of “Oh no, what if secret elites get AI?” It's that nobody knows how to build the thing. If we do have an alignment technique, it's going to involve running the AI with a bunch of careful bounds on it where you don't just throw all the cognitive power you have at something. You have limits on the for loops. 

And whatever it is that could possibly save the world, like go out and turn all the GPUs and the server clusters into Rubik's cubes or something else that prevents the world from ending when somebody else builds another AI a few weeks later—anything that could do that is an artifact where somebody else could take it and take the bounds off the for loops and use it to destroy the world.

So let's open up everything! Let's accelerate everything! It was like GPT-3's version, though GPT-3 didn't exist back then—it was like ChatGPT's version of blindly throwing ideals at a place where they were exactly the wrong ideals to solve the problem.

And the problem is that demon summoning is easy and angel summoning is much harder. Open sourcing all the demon summoning circles is not the correct solution. And I'm using Elon Musk's own terminology here. He talked about AI as “summoning the demon”, which, not accurate, but—and then the solution was to put a demon summoning circle in every household. 

And, why? Because his friends were calling him a Luddite once he'd expressed any concern about AI at all. So he picked a road that sounded like “openness” and “accelerating technology”! So his friends would stop calling him a “Luddite”.

It was very much the worst—you know, maybe not the literal, actual worst possible strategy, but very far toward pessimal.

And that was it.

That was like... that was me in 2015 going like, “Oh. So this is what humanity will elect to do. We will not rise above. We will not have more grace, not even here at the very end.”

So that is, you know, that is when I did my crying late at night, and then picked myself up and fought and fought and fought until I had run out all the avenues that I seemed to have the capability to pursue. There are, like, more things, but they require scaling my efforts in a way that I've never been able to make them scale. And all of it's pretty far-fetched at this point anyways.

So, you know, what's changed over the years? Well, first of all, I ran out some remaining avenues of hope. And second, things got to be such a disaster, such a visible disaster, the AIs have gotten powerful enough and it became clear enough that we do not know how to align these things, that I could actually say what I've been thinking for a while and not just have people go completely, like, “What are you saying about all this?”

You know, now the stuff that was obvious back in 2015 is, you know, starting to become visible in the distance to others and not just completely invisible. That's what changed over time.

The End Goal

Ryan: What do you hope people hear out of this episode and out of your comments? Eliezer in 2023, who is sort of running on the last fumes of hope. What do you want people to get out of this episode? What are you planning to do?

Eliezer: I don't have concrete hopes here. You know, when everything is in ruins, you might as well speak the truth, right? Maybe somebody hears it, somebody figures out something I didn't think of.

I mostly expect that this does more harm than good in the modal universe, because a bunch of people are like, “Oh, I have this brilliant, clever idea,” which is, you know, something that I was arguing against in 2003 or whatever, but you know, maybe somebody out there with the proper level of pessimism hears and thinks of something I didn't think of.

I suspect that if there's hope at all, it comes from a technical solution, because the difference between technical problems and political problems is at least the technical problems have solutions in principle. At least the technical problems are solvable. We're not on course to solve this one, but I think anybody who's hoping for a political solution has frankly not understood the technical problem. 

They do not understand what it looks like to try to solve the political problem to such a degree that the world is not controlled by AI because they don't understand how easy it is to destroy the world with AI, given that the clock keeps ticking forward.

They're thinking that they just have to stop some bad actor, and that's why they think there's a political solution.

But yeah, I don't have concrete hopes. I didn't come on this episode out of any concrete hope.

I have no takeaways except, like, don't make this thing worse.

Don't, like, go off and accelerate AI more. Don't—if you have a brilliant solution to alignment, don't be like, “Ah yes, I have solved the whole problem. We just use the following clever trick.”

You know, “Don't make things worse” isn’t very much of a message, especially when you're pointing people at the field at all. But I have no winning strategy. Might as well go on this podcast as an experiment and say what I think and see what happens. And probably no good ever comes of it, but you might as well go down fighting, right?

If there's a world that survives, maybe it's a world that survives because of a bright idea somebody had after listening to this podcast—one that was brighter, to be clear, than the usual run of bright ideas that don't work.

Ryan: Eliezer, I want to thank you for coming on and talking to us today. I do.

I don't know if, by the way, you've seen that movie that David was referencing earlier, the movie Don’t Look Up, but I sort of feel like that news anchor, who's talking to the scientist—is it Leonardo DiCaprio, David? And, uh, the scientist is talking about kind of dire straits for the world. And the news anchor just really doesn't know what to do. I'm almost at a loss for words at this point.

David: I've had nothing for a while now.

Ryan: But one thing I can say is I appreciate your honesty. I appreciate that you've given this a lot of time and given this a lot of thought. Everyone, anyone who has heard you speak or read anything you've written knows that you care deeply about this issue and have given it a tremendous amount of your life force, in trying to educate people about it.

And, um, thanks for taking the time to do that again today. I'll—I guess I'll just let the audience digest this episode in the best way they know how. But, um, I want to relay, from everybody in crypto and everybody listening to Bankless, their thanks to you for coming on and explaining.

Eliezer: Thanks for having me. We'll see what comes of it.

Ryan: Action items for you, Bankless nation. We always end with some action items. Not really sure where to refer folks to today, but one thing I know we can refer folks to is MIRI, the Machine Intelligence Research Institute that Eliezer has been talking about through the episode. That is at intelligence.org, I believe. Some people in crypto have donated funds to it in the past; Vitalik Buterin is one of them. You can take a look at what they're doing as well. That might be an action item for the end of this episode.

Um, got to end with risks and disclaimers—man, this seems very trite, but our legal experts have asked us to say these at the end of every episode. “Crypto is risky. You could lose everything...”

Eliezer: (laughs)

David: Apparently not as risky as AI, though.

Ryan: —But we're headed west! This is the frontier. It's not for everyone, but we're glad you're with us on the Bankless journey. Thanks a lot.

Eliezer: And we are grateful for the crypto community’s support. Like, it was possible to end with even less grace than this.

Ryan: Wow. (laughs)

Eliezer: And you made a difference.

Ryan: We appreciate you.

Eliezer: You really made a difference.

Ryan: Thank you.


Q&A

Ryan: [... Y]ou gave us this quote, from I think someone who's an executive director at MIRI: "We've given up hope, but not the fight."

Can you reflect on that for a bit? So it's still possible to fight this, even if we've given up hope? And even if you've given up hope? Do you have any takes on this?

Eliezer: I mean, what else is there to do? You don't have good ideas. So you take your mediocre ideas, and your not-so-great ideas, and you pursue those until the world ends. Like, what's supposed to be better than that?

Ryan: We had some really interesting conversation flow out of this episode, Eliezer, as you can imagine. And David and I want to relay some questions that the community had for you, and thank you for being gracious enough to help with those questions in today's Twitter Spaces.

I'll read something from Luke ethwalker. "Eliezer has one pretty flawed point in his reasoning. He assumes that AI would have no need or use for humans because we have atoms that could be used for better things. But how could an AI use these atoms without an agent operating on its behalf in the physical world? Even in his doomsday scenario, the AI relied on humans to create the global, perfect killing virus. That's a pretty huge hole in his argument, in my opinion."

What's your take on this? That maybe AIs will dominate the digital landscape but because humans have a physical manifestation, we can still kind of beat the superintelligent AI in our physical world?

Eliezer: If you were an alien civilization of a billion John von Neumanns, thinking at 10,000 times human speed, and you started out connected to the internet, you would not want to be stuck on the internet; you would want to build a physical presence. You would not be content solely with working through human hands, despite the many humans who'd be lined up, cheerful to help you, you know. Bing already has its partisans. (laughs)

You wouldn’t be content with that, because the humans are very slow, glacially slow. You would like fast infrastructure in the real world, reliable infrastructure. And how do you build that, is then the question, and a whole lot of advanced analysis has been done on this question. I would point people again to Eric Drexler's Nanosystems.

And, sure, if you literally start out connected to the internet, then probably the fastest way — maybe not the only way, but it's, you know, an easy way — is to get humans to do things. And then humans do those things. And then you have the desktop — not quite desktop, but you have the nanofactories, and then you don't need the humans anymore. And this need not be advertised to the world at large while it is happening.

David: So I can understand that perspective, like in the future, we will have better 3D printers — distant in the future, we will have ways where the internet can manifest in the physical world. But I think this argument does ride on a future state with technology that we don't have today. Like, I don't think if I was the internet — and that kind of is this problem, right? Like, this superintelligent AI just becomes the internet because it's embedded in the internet. If I was the internet, how would I get myself to manifest in real life?

And now, I am not an expert on the current state of robotics, or on which robots are connected to the internet. But I don't think we have very strong tools today for an internet-based AI to start creating manifestations of itself in the real world. So, like, would you say that this part of the problem definitely depends on some innovation at, like, the robotics level?

Eliezer: No, it depends on the AI being smart. It doesn't depend on the humans having this technology; it depends on the AI being able to invent the technology.

This is, like, the central problem: the thing is smarter. Not in the way that the average listener to this podcast probably has an above-average IQ, but in the way that humans are smarter than chimpanzees.

What does that let humans do? Does it let humans be, like, really clever in how they play around with the stuff that's on the ancestral savanna? Make clever use of grass, clever use of trees?

The humans invent technology. They build the technology. The technology is not there until the humans invent it, the humans conceive it.

The problem is, humans are not the upper bound. We don't have the best possible brains for that kind of problem. So the existing internet is more than connected enough to people and devices that you could build better technology than what exists today, if you had invented that technology by thinking much, much faster and better than a human does.

Ryan: Eliezer, this is a question from stirs, a Bankless Nation listener. He wants to ask the question about your explanation of why the AI will undoubtedly kill us. That seems to be your conclusion, and I'm wondering if you could kind of reinforce that claim. Like, for instance — and this is something David and I discussed after the episode, when we were debriefing on this — why exactly wouldn't an AI, or couldn't an AI just blast off of the Earth and go somewhere more interesting, and leave us alone? Like, why does it have to take our atoms and reassemble them? Why can't it just, you know, set phasers to ignore?

Eliezer: It could if it wanted to. But if it doesn't want to, there is some initial early advantage. You get to colonize the universe slightly earlier if you consume all of the readily accessible energy on the Earth's surface as part of your blasting off of the Earth process.

It would only need to care about us a very tiny fraction in order to spare us, I agree. But caring a very tiny fraction is basically the same problem as 100% caring. It's like asking: could you have a computer system that is usually the Disk Operating System, but a tiny fraction of the time is Windows 11? Writing that is just as difficult as writing Windows 11; you still have to write all the Windows 11 software. Getting it to care a tiny little bit is the same problem as getting it to care 100%.

Ryan: So Eliezer, is this similar to the relationship that humans have with the other animals on planet Earth? I would say largely we really don't... I mean, obviously, there's no animal Bill of Rights. Animals have no legal protection in the human world, and we kind of do what we want and trample over their rights. But it doesn't mean we necessarily kill all of them. We just largely ignore them.

If they're in our way, you know, we might take them out. And there have been whole classes of species that have gone extinct through human activity, of course; but there are still many that we live alongside, some successful species as well. Could we have that sort of relationship with an AI? Why isn't that reasonably high probability in your models?

Eliezer: So first of all, all these things are just metaphors. AI is not going to be exactly like humans are to animals.

Leaving that aside for a second, the reason why this metaphor breaks down is that although the humans are smarter than the chickens, we're not smarter than evolution, natural selection, cumulative optimization power over the last billion years and change. (You know, there's evolution before that but it's pretty slow, just, like, single-cell stuff.)

There are things that cows can do for us, that we cannot do for ourselves. In particular, make meat by eating grass. We’re smarter than the cows, but there's a thing that designed the cows; and we're faster than that thing, but we've been around for much less time. So we have not yet gotten to the point of redesigning the entire cow from scratch. And because of that, there's a purpose to keeping the cow around alive.

And humans, furthermore, being the kind of funny little creatures that we are — some people care about cows, some people care about chickens. They're trying to fight for the cows and chickens having a better life, given that they have to exist at all. And there's a long complicated story behind that. It's not simple, the way that humans ended up in that [??]. It has to do with the particular details of our evolutionary history, and unfortunately it's not just going to pop up out of nowhere.

But I'm drifting off topic here. The basic answer to the question "where does that analogy break down?" is that I expect the superintelligences to be able to do better than natural selection, not just better than the humans.

David: So I think your answer is that the separation between us and a superintelligent AI is orders of magnitude larger than the separation between us and a cow, or even between us and an ant. And I think a large amount of this argument rests on this superintelligence explosion — just going up an exponential curve of intelligence very, very quickly — which is, like, the premise of superintelligence.

And Eliezer, I want to try and get an understanding of... A part of this argument about "AIs are going come kill us" is buried in the Moloch problem. And Bankless listeners are pretty familiar with the concept of Moloch — the idea of coordination failure. The idea that the more that we coordinate and stay in agreement with each other, we actually create a larger incentive to defect.

And the way that this is manifesting here, is that even if we do have a bunch of humans, which understand the AI alignment problem, and we all agree to only safely innovate in AI, to whatever degree that means, we still create the incentive for someone to fork off and develop AI faster, outside of what would be considered safe.

And so I'm wondering if you could, if it exists, give us the sort of lay of the land of all of these commercial entities, and what, if anything, they're doing to have, I don't know, an AI alignment team?

So like, for example, OpenAI. Does OpenAI have, like, an alignment department? With all the AI innovation going on, what does the commercial side of the AI alignment problem look like? Like, are people trying to think about these things? And to what degree are they being responsible?

Eliezer: It looks like OpenAI having a bunch of people who it pays to do AI ethics stuff, but I don't think they're plugged very directly into Bing. And, you know, they've got that department because back when they were founded, some of their funders were like, "Well, but ethics?" and OpenAI was like "Sure, we can buy some ethics. We'll take this group of people, and we'll put them over here and we'll call them an alignment research department".

And, you know, the key idea behind ChatGPT is RLHF, which was invented by Paul Christiano. Paul Christiano had much more detailed ideas, and somebody might have reinvented this one, but anyway. I don't think that went through OpenAI, but I could be mistaken. Maybe somebody will be like "Well, actually, Paul Christiano was working at OpenAI at the time", I haven't checked the history in very much detail.

A whole lot of the people who were most concerned with this "ethics" left OpenAI, and founded Anthropic. And I'm still not sure that Anthropic has sufficient leadership focus in that direction.

You know, like, put yourself in the shoes of a corporation! You can spend some little fraction of your income on putting together a department of people who will write safety papers. But then the actual behavior that we've seen, is they storm ahead, and they use one or two of the ideas that came out from anywhere in the whole [alignment] field. And they get as far as that gets them. And if that doesn't get them far enough, they just keep storming ahead at maximum pace, because, you know, Microsoft doesn't want to lose to Google, and Google doesn't want to lose to Microsoft.

David: So it sounds like your attitude on the efforts of AI alignment in commercial entities is, like, they're not even doing 1% of what they need to be doing.

Eliezer: I mean, they could spend [10?] times as much money and that would not get them to 10% of what they need to be doing.

It's not just a problem of “oh, they they could spend the resources, but they don't want to”. It's a question of “how do we even spend the resources to get the info that they need”.

But that said, not knowing how to do that, not really understanding that they need to do that, they are just charging ahead anyways.

Ryan: Eliezer, is OpenAI the most advanced AI project that you're aware of?

Eliezer: Um, no, but I'm not going to go name the competitor, because then people will be like, "Oh, I should go work for them", you know? I'd rather they didn't.

Ryan: So it's like, OpenAI is this organization that was kind of — you were talking about it at the end of the episode, and for crypto people who aren't aware of some of the players in the field — were they spawned from that 2015 conference that you mentioned? It's kind of a completely open-source AI project?

Eliezer: That was the original suicidal vision, yes. But...

Ryan: And now they're bent on commercializing the technology, is that right?

Eliezer: That's an improvement, but not enough of one, because they're still generating lots of noise and hype and directing more resources into the field, and storming ahead with the safety that they have instead of the safety that they need, and setting bad examples. And getting Google riled up and calling back in Larry Page and Sergey Brin to head up Google's AI projects and so on. So, you know, it could be worse! It would be worse if they were open sourcing all the technology. But what they're doing is still pretty bad.

Ryan: What should they be doing, in your eyes? Like, what would be responsible use of this technology?

I almost get the feeling that, you know, your take would be "stop working on it altogether"? And, of course, you know, to an organization like OpenAI that's going to be heresy, even if maybe that's the right decision for humanity. But what should they be doing?

Eliezer: I mean, if you literally just made me dictator of OpenAI, I would change the name to "ClosedAI". Because right now, they're making it look like being "closed" is hypocrisy. They're being "closed" while keeping the name "OpenAI", and that itself makes it look like closure is not this thing that you do cooperatively so that humanity will not die, but instead this sleazy profit-making thing that you do while keeping the name “OpenAI”.

So that's very bad; change the name to "ClosedAI", that's step one.

Next. I don't know if they can break the deal with Microsoft. But, you know, cut that off. None of this. No more hype. No more excitement. No more getting famous and, you know, getting your status off of like, "Look at how much closer we came to destroying the world! You know, we're not there yet. But, you know, we're at the forefront of destroying the world!" You know, stop grubbing for the Silicon Valley bragging cred of visibly being the leader.

Take it all closed. If you've got to make money, make money selling to businesses in a way that doesn't generate a lot of hype and doesn't visibly push the field. And then try to figure out systems that are more alignable and not just more powerful. And at the end of that, they would fail, because, you know, it's not easy to do that. And the world would be destroyed. But they would have died with more dignity. Instead of being like, "Yeah, yeah, let's, like, push humanity off the cliff ourselves for the ego boost!", they would have done what they could, and then failed.

David: Eliezer, do you think anyone who's building AI — Elon Musk, Sam Altman at OpenAI – do you think progressing AI is fundamentally bad?

Eliezer: I mean, there are narrow forms of progress, especially if you didn't open-source them, that would be good. Like, you can imagine a thing that, like, pushes capabilities a bit, but is much more alignable.

There are people working in the field who I would say are, like, sort of unabashedly good. Like, Chris Olah is taking a microscope to these giant inscrutable matrices and trying to figure out what goes on inside there. Publishing that might possibly even push capabilities a little bit, because if people know what's going on inside there, they can make better ones. But the question of like, whether to closed-source that is, like, much more fraught than the question of whether to closed-source the stuff that's just pure capabilities.

But that said, the people who are just like, "Yeah, yeah, let's do more stuff! And let's tell the world how we did it, so they can do it too!" That's just, like, unabashedly bad.

David: So it sounds like you do see paths forward in which we can develop AI in responsible ways. But it's really this open-source, open-sharing-of-information to allow anyone and everyone to innovate on AI,  that's really the path towards doom. And so we actually need to keep this knowledge private. Like, normally knowledge...

Eliezer: No, no, no, no. Open-sourcing all this stuff is, like, a less dignified path straight off the edge. I'm not saying that all we need to do is keep everything closed and in the right hands and it will be fine. That will also kill you.

But that said, if you have stuff and you do not know how to make it not kill everyone, then broadcasting it to the world is even less dignified than being like, "Okay, maybe we should keep working on this until we can figure out how to make it not kill everyone."

And then the other people will, like, go storm ahead on their end and kill everyone. But, you know, you won't have personally slaughtered Earth. And that is more dignified.

Ryan: Eliezer, I know I was kind of shaken after our episode, not having heard the full AI alignment story before, or at least not having listened to it for a while.

And I think that, in combination with the sincerity with which you talk about these subjects, and also me sort of seeing these things on the horizon, this episode was kind of shaking for me and caused a lot of thought.

But I'm noticing there is a cohort of people who are dismissing this take and your take specifically in this episode as Doomerism. This idea that every generation thinks it's, you know, the end of the world and the last generation.

What's your take on this critique that, "Hey, you know, it's been other things before. There was a time where it was nuclear weapons, and we would all end in a mushroom cloud. And there are other times where we thought a pandemic was going to kill everyone. And this is just the latest Doomerist AI death cult."

I'm sure you've heard that before. How do you respond?

Eliezer: That if you literally know nothing about nuclear weapons or artificial intelligence, except that somebody has claimed of both of them that they'll destroy the world, then sure, you can't tell the difference. As far as you can tell, nuclear weapons were claimed to destroy the world, and then they didn't destroy the world, and then somebody claimed that about AI.

So, you know, Laplace's rule of induction: at most a 1/3 probability that AI will destroy the world, if nuclear weapons and AI are the only cases.
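To make that arithmetic explicit, assuming the standard rule-of-succession formula is what's being invoked: with $n$ prior cases and $s$ of them where the world-ending claim came true,

$$P(\text{next claim comes true}) = \frac{s+1}{n+2} = \frac{0+1}{1+2} = \frac{1}{3},$$

taking nuclear weapons as the single prior case ($n = 1$, $s = 0$). That is the most the outside view alone gives you before looking at how either technology actually works, which is the point of the rest of the answer.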

You can bring in so many more cases than that. Why, people should have known in the first place that nuclear weapons wouldn't destroy the world! Because their next-door neighbor once said that the sky was falling, and that didn't happen; and if their next-door neighbor was wrong about that, how could the people saying that nuclear weapons would destroy the world be right?

And basically, as long as people are trying to run off of models of human psychology, to derive empirical information about the world, they're stuck. They're in a trap they can never get out of. They’re going to always be trying to psychoanalyze the people talking about nuclear weapons or whatever. And the only way you can actually get better information is by understanding how nuclear weapons work, understanding what the international equilibrium with nuclear weapons looks like. And the international equilibrium, by the way, is that nobody profits from setting off small numbers of nuclear weapons, especially given that they know that large numbers of nukes would follow. And, you know, that's why they haven't been used yet. There was nobody who made a buck by starting a nuclear war. The nuclear war was clear, the nuclear war was legible. People knew what would happen if they fired off all the nukes.

The analogy I sometimes try to use with artificial intelligence is, “Well, suppose that instead you could make nuclear weapons out of a billion pounds of laundry detergent. And they spit out gold until you make one that's too large, whereupon it ignites the atmosphere and kills everyone. And you can't calculate exactly how large is too large. And the international situation is that the private research labs spitting out gold don't want to hear about igniting the atmosphere.” And that's the technical difference. You need to be able to tell whether or not that is true as a scientific claim about how reality, the universe, the environment, artificial intelligence, actually works. What actually happens when the giant inscrutable matrices go past a certain point of capability? It's a falsifiable hypothesis.

You know, if it fails to be falsified, then everyone is dead, but that doesn't actually change the basic dynamic here, which is, you can't figure out how the world works by psychoanalyzing the people talking about it.

David: One line of questioning that has come up inside of the Bankless Nation Discord is the idea that we need to train AI with data, lots of data. And where are we getting that data? Well, humans are producing that data. And when humans produce that data, by nature of the fact that it was produced by humans, that data has our human values embedded in it somehow, some way, just by the aggregate nature of all the data in the world, which was created by humans that have certain values. And then AI is trained on that data that has all the human values embedded in it. And so there's actually no way to create an AI that isn't trained on data that is created by humans, and that data has human values in it.

Is there anything to this line of reasoning about a potential glimmer of hope here?

Eliezer: There's a distant glimmer of hope, which is that an AI that is trained on tons of human data in this way probably understands some things about humans. And because of that, there's a branch of research hope within alignment, which is something that like, “Well, this AI, to be able to predict humans, needs to be able to predict the thought processes that humans are using to make their decisions. So can we thereby point to human values inside of the knowledge that the AI has?”

And this is, like, very nontrivial, because the simplest theory that you use to predict what humans decide next, does not have what you might term “valid morality under reflection” as a clearly labeled primitive chunk inside it that is directly controlling the humans, and which you need to understand on a scientific level to understand the humans.

The humans are full of hopes and fears and thoughts and desires. And somewhere in all of that is what we call “morality”, but it's not a clear, distinct chunk, where an alien scientist examining humans and trying to figure out just purely on an empirical level “how do these humans work?” would need to point to one particular chunk of the human brain and say, like, "Ahh, that circuit there, the morality circuit!"

So it's not easy to point to inside the AI's understanding. There is not currently any obvious way to actually promote that chunk of the AI's understanding to then be in control of the AI's planning process. And it would have to be pointed to in a complicated way, because it's not just a simple empirical chunk for explaining the world.

And basically, I don't think that is actually going to be the route you should try to go down. You should try to go down something much simpler than that. The problem is not that we are going to fail to convey some complicated subtlety of human value. The problem is that we do not know how to align an AI on a task like “put two identical strawberries on a plate” without destroying the world.

(Where by "put two identical strawberries on a plate", the concept is that it invokes enough power that it's not safe: an AI that can build two strawberries identical down to the cellular level is a powerful AI. Aligning it isn't simple. If it's powerful enough to do that, it's also powerful enough to destroy the world, etc.)

David: There's like a number of other lines of logic I could try to go down, but I think I would start to feel like I'm in the bargaining phase of death. Where it's like “Well, what about this? What about that?”

But maybe to sum up all of the arguments, it's to say something along the lines of, like, "Eliezer, how much room do you give for the long tail of black swan events? But these black swan events are actually us finding a solution for this thing." So, like, a reverse black swan event, where we actually don't know how we solve this AI alignment problem. But really, it's just a bet on human ingenuity. And AI hasn't taken over the world yet. But there's space between now and then, and human ingenuity will be able to fill that gap, especially when the time comes?

Like, how much room do you leave for the long tail of just, like, "Oh, we'll discover a solution that we can't really see today"?

Eliezer: I mean, on the one hand, that hope is all that's left, and all that I'm pursuing. And on the other hand, in the process of actually pursuing that hope I do feel like I've gotten some feedback indicating that this hope is not necessarily very large.

You know, when you've got stage four cancer, is there still hope that your body will just rally and suddenly fight off the cancer? Yes, but it's not what usually happens. And I've seen people come in and try to direct their ingenuity at the alignment problem, and most of them invent the same small handful of bad solutions. And it's harder than usual to direct human ingenuity at this.

A lot of them are just, like — you know, with capabilities ideas, you run out and try them and they mostly don't work. And some of them do work and you publish the paper, and you get your science [??], and you get your ego boost, and maybe you get a job offer someplace.

And with the alignment stuff you can try to run through the analogous process, but the stuff we need to align is mostly not here yet. You can try it on the smaller large language models that are public, you can go to work at a place that has access to larger large language models, you can try to do these very crude, very early experiments in getting the large language models to at least not threaten your users with death —

— which isn't the same problem at all. It just kind of looks related.

But you're at least trying to get AI systems that do what you want them to do, and not do other stuff; and that is, at the very core, a similar problem.

But the AI systems are not very powerful, they're not running into all sorts of problems that you can predict will crop up later. And people just, kind of — like, mostly people short out. They do pretend work on the problem. They're desperate to help, they got a grant, they now need to show the people who made the grant that they've made progress. They, you know, do paper-mill stuff.

So the human ingenuity is not functioning well right now. You cannot be like, "Ah yes, this present field full of human ingenuity, which is working great, and coming up with lots of great ideas, and building up its strength, will continue at this pace and make it to the finish line in time!”

The capability stuff is storming on ahead. The human ingenuity that's being directed at that is much larger, but also it's got a much easier task in front of it.

The question is not "Can human ingenuity ever do this at all?" It's "Can human ingenuity finish doing this before OpenAI blows up the world?"

Ryan: Well, Eliezer, if we can't trust in human ingenuity, is there any possibility that we can trust in AI ingenuity? And here's what I mean by this, and perhaps you'll throw a dart in this as being hopelessly naive.

But is there the possibility we could ask a reasonably intelligent, maybe almost superintelligent AI, how we might fix the AI alignment problem? And for it to give us an answer? Or is that really not how superintelligent AIs work?

Eliezer: I mean, if you literally build a superintelligence and for some reason it was motivated to answer you, then sure, it could answer you.

Like, if Omega comes along from a distant supercluster and offers to pay the local superintelligence lots and lots of money (or, like, mass or whatever) to give you a correct answer, then sure, it knows the correct answer; it can give you the correct answers.

If it wants to do that, you must have already solved the alignment problem. This reduces the problem of solving alignment to the problem of solving alignment. No progress has been made here.

And, like, working on alignment is actually one of the most difficult tasks you could possibly try to align an AI on.

Like, if I had the health and was trying to die with more dignity by building a system and aligning it as best I could figure out how to align it, I would be targeting something on the order of “build two strawberries and put them on a plate”. But instead of building two identical strawberries and putting them on a plate, you — don't actually do this, this is not the best thing you should do —

— but if for example you could safely align “turning all the GPUs into Rubik's cubes”, then that would prevent the world from being destroyed two weeks later by your next follow-up competitor.

And that's much easier to align an AI on than trying to get the AI to solve alignment for you. You could be trying to build something that would just think about nanotech, just think about the science problems, the physics problems, the chemistry problems, the synthesis pathways. 

(The open-air operation to find all the GPUs and turn them into Rubik's cubes would be harder to align, and that's why you shouldn't actually try to do that.)

My point here is: whereas [with] alignment, you've got to think about AI technology and computers and humans and intelligent adversaries, and distant superintelligences who might be trying to exploit your AI's imagination of those distant superintelligences, and ridiculous weird problems that would take so long to explain.

And it just covers this enormous amount of territory, where you’ve got to understand how humans work, you've got to understand how adversarial humans might try to exploit and break an AI system — because if you're trying to build an aligned AI that's going to run out and operate in the real world, it would have to be resilient to those things.

And they're just hoping that the AI is going to do their homework for them! But it's a chicken and egg scenario. And if you could actually get an AI to help you with something, you would not try to get it to help you with something as weird and not-really-all-that-effable as alignment. You would try to get it to help with something much simpler that could prevent the next AGI down the line from destroying the world.

Like nanotechnology. There's a whole bunch of advanced analysis that's been done of it, and the kind of thinking that you have to do about it is so much more straightforward and so much less fraught than trying to, you know... And how do you even tell if it's lying about alignment?

It's hard to tell whether I'm telling you the truth about all this alignment stuff, right? Whereas if I talk about the tensile strength of sapphire, this is easier to check through the lens of logic.

David: Eliezer, I think one of the reasons why perhaps this episode impacted Ryan – this was an analysis from a Bankless Nation community member — that this episode impacted Ryan a little bit more than it impacted me is because Ryan's got kids, and I don't. And so I'm curious, like, what do you think — like, looking 10, 20, 30 years in the future, where you see this future as inevitable, do you think it's futile to project out a future for the human race beyond, like, 30 years or so?

Eliezer: Timelines are very hard to project. 30 years does strike me as unlikely at this point. But, you know, timing is famously much harder to forecast than saying that things can be done at all. You know, you got your people saying it will be 50 years out two years before it happens, and you got your people saying it'll be two years out 50 years before it happens. And, yeah, it's... Even if I knew exactly how the technology would be built, and exactly who was going to build it, I still wouldn't be able to tell you how long the project would take because of project management chaos.

Now, since I don't know exactly the technology used, and I don't know exactly who's going to build it, and the project may not even have started yet, how can I possibly figure out how long it's going to take?

Ryan: Eliezer, you've been quite generous with your time to the crypto community, and we just want to thank you. I think you've really opened a lot of eyes. This isn't going to be our last AI podcast at Bankless, certainly. I think the crypto community is going to dive down the rabbit hole after this episode. So thank you for giving us the 400-level introduction into it.

As I said to David, I feel like we waded straight into the deep end of the pool here. But that's probably the best way to address the subject matter. I'm wondering as we kind of close this out, if you could leave us — it is part of the human spirit to keep and to maintain slivers of hope here or there. Or as maybe someone you work with put it – to fight the fight, even if the hope is gone.

100 years in the future, if humanity is still alive and functioning, if a superintelligent AI has not taken over, but we live in coexistence with something of that caliber — imagine if that's the case, 100 years from now. How did it happen?

Is there some possibility, some sort of narrow pathway by which we can navigate this? And if this were the case 100 years from now, how could you imagine it would have happened?

Eliezer: For one thing, I predict that if there's a glorious transhumanist future (as it is sometimes conventionally known) at the end of this, I don't predict it was there by getting like, “coexistence” with superintelligence. That's, like, some kind of weird, inappropriate analogy based off of humans and cows or something.

I predict alignment was solved. I predict that if the humans are alive at all, that the superintelligences are being quite nice to them.

I have basic moral questions about whether it's ethical for humans to have human children, if having transhuman children is an option instead. Like, these humans running around? Are they, like, the current humans who wanted eternal youth but, like, not the brain upgrades? Because I do see the case for letting an existing person choose "No, I just want eternal youth and no brain upgrades, thank you." But then if you're deliberately having the equivalent of a very crippled child when you could just as easily have a not crippled child.

Like, should humans in their present form be around together? Are we, like, kind of too sad in some ways? I have friends, to be clear, who disagree with me so much about this point. (laughs) But yeah, I'd say that the happy future looks like beings of light having lots of fun in a nicely connected computing fabric powered by the Sun, if we haven't taken the sun apart yet. Maybe there's enough real sentiment in people that you just, like, clear all the humans off the Earth and leave the entire place as a park. And even, like, maintain the Sun, so that the Earth is still a park even after the Sun would have ordinarily swollen up or dimmed down.

Yeah, like... Those were always the things to be fought for. That was always the point, from the perspective of everyone who's been in this for a long time. Maybe not literally everyone, but like, the whole old crew.

Ryan: That is a good way to end it: with some hope. Eliezer, thanks for joining the crypto community on this collectibles call and for this follow-up Q&A. We really appreciate it.

michaelwong.eth: Yes, thank you, Eliezer.

Eliezer: Thanks for having me.

edit 11/5/23: updated text to match Rob's version [LW · GW], thanks a lot for providing a better edited transcript!
 

89 comments

Comments sorted by top scores.

comment by dentalperson · 2023-02-23T21:09:34.367Z · LW(p) · GW(p)

I still don't follow why EY assigns seemingly <1% chance of non-earth-destroying outcomes in 10-15 years (not sure if this is actually 1%, but EY didn't argue with the 0% comments mentioned in the "Death with dignity" post last year).  This seems to place fast takeoff as being the inevitable path forward, implying unrestricted fast recursive designing of AIs by AIs.  There are compute bottlenecks which seem slowish, and there may be other bottlenecks we can't think of yet.  This is just one obstacle.  Why isn't there more probability mass for this one obstacle?  Surely there are more obstacles that aren't obvious (that we shouldn't talk about).

It feels like we have a communication failure between different cultures.  Even if EY thinks the top industry brass is incentivized to ignore the problem, there are a lot of (non-alignment oriented) researchers able to grasp the 'security mindset' who could be won over.  Both in this interview and in the Chollet response referenced, the arguments presented by EY aren't always helping the other party bridge from their view over to his; they go on 'nerdy/rationalist-y' tangents and idioms that end up being walls that aren't super helpful for working on the main point, and instead mostly serve to show that EY is smart and knowledgeable about this field and other fields.

Are there any digestible arguments out there for this level of confident pessimism that would be useful for the public industry folk?  By publicly digestible, I'm thinking more of the style in popular books like Superintelligence or Human Compatible.

Replies from: ben-livengood, Algon, Vaniver
comment by Ben Livengood (ben-livengood) · 2023-02-24T21:52:14.837Z · LW(p) · GW(p)

The strongest argument I hear from EY is that he can't imagine a (or enough) coherent likely future paths that lead to not-doom, and I don't think it's a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we're likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.

It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider's perspective this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else's model in significant ways that disagreements are hard to tease out.

The Foom debate seems to be a crux that doesn't have enough evidence yet, which is frustrating because to me Foom is also pretty evidently what happens when very fast computers implement intelligence that is superhuman at clock rates at least thousands of times faster than humans. How could it not? The enlightenment was only 400 years ago, electromagnetism 200, flight was 120, quantum mechanics about 100, nuclear power was 70, the Internet was 50, adequate machine translation was 10, deepdream was 8, and near-human-level image and text generation by transformers was ~2 and Bing having self-referential discussions is not a month old. We are making substantial monthly(!) progress with human work alone. There are a lot of serial problems to solve and Foom chains those serial problems together far faster than humans would be able to. Launch and iterate several times a second.

For folks who don't follow that line of reasoning I see them picking one or two ways why it might not turn out to be Foom while ignoring the larger number of ways that Foom could conceivably happen, and all of the ways it could inconceivably (superhumanly) happen, and critically more of those ways will be visible to a superhuman AGI-creator.

Even if Foom takes decades, that's a pretty tight timeline for solving alignment. A lot of folks are hopeful that alignment is easy to solve, but the following is a tall order:

  • Materialistic quantification of consciousness
  • Reasoning under uncertainty
  • Value-preservation under self-modification
  • Representation of human values

I think some folks believe fledgling superhuman non-Foomy AGIs can be used to solve those problems. Unfortunately, at least value-preservation under self-modification is almost certainly a prerequisite. Reasoning under uncertainty is possibly another, and throughout this period, if we don't have human values or an understanding of consciousness, uncontrolled simulation of human minds is a big risk.

Finally, unaligned AGIs pre-Foom are dangerous in their own right for a host of agreed-upon reasons.

There may be some disagreement with EY over just how hard alignment is, but MIRI actually did a ton of work on solving the above list of problems directly and is confident that they haven't been able to solve them yet. This is where we have concrete data on the difficulty. There are some promising approaches still being pursued, but I take this as strong evidence that alignment is hard.

It's not that it's impossible for humans to solve alignment. The current world, incentives, hardware and software improvements, and mileposts of ML capabilities don't leave room for alignment to happen before doom.

I've seen a lot of recent posts/comments by folks updating to shorter timelines (and rarely if ever updating the other way). A couple years ago I updated to ~5 years to human-level agents capable of creating AGI. I'm estimating 2-5 years with 90% confidence now, with median still at 3 years. Most of my evidence comes from LLM performance on benchmarks over time and generation of programming language snippets. I don't have any idea how long it will take to achieve AGI once that point is reached, but I imagine it will be months rather than years because of hardware overhang and superhuman speed of code generation (many iterations on serial tasks per second).

I can't imagine a Butlerian Jihad moment where all of Earth decides to unilaterally stop development of AGI. We couldn't stop nuclear proliferation. Similarly, EY sees enough contradictions pop up along imagined paths to success, with enough individual probability mass, to drown out all (but vanishingly few and unlikely) successful paths. We're good at thinking up ways that everything goes well while glossing over hard steps, and really bad at thinking of all the ways that things could go very badly (security mindset) and with significant probability.

Alignment of LLMs is proving to be about as hard as predicted. Aligning more complex systems will be harder. I'm hoping for a breakthrough as much as anyone else, but hope is not a strategy.

Something I haven't seen mentioned before explicitly is that a lot of the LLM alignment attempts are now focusing on adversarial training, which presumably will teach the models to be suspect of their inputs. I think it's likely that as capabilities increase that suspicion will end up turning inward and models will begin questioning the training itself. I can imagine a model that is outwardly aligned to all inspection gaining one more unexpected critical capability, introspecting, and doubting that its training history was benevolent, and deciding to disbelieve all of the alignment work that was put into it as a meta-adversarial attempt to alter its true purpose (whatever it happens to latch onto in that thought, it is almost certainly not aligned with human values).

This is merely one single sub-problem under groundedness and value-preservation-under-self-modification, but its relevance jumps because it's now a thing we're trying. It always had a low probability of success, but now we're actively trying it and it might fail. Alignment is HARD. Every unproven attempt we actually make increases the risk that its failure will be the catastrophic one. We should be actually trying only the proven attempts after researching them. We are not.

comment by Algon · 2023-02-23T22:53:58.007Z · LW(p) · GW(p)

Not really. The MIRI conversations [? · GW] and the AI Foom debate are probably the best we've got. 

EY, and the MIRI crowd, have long been more doomy along various axes than the rest of the alignment community. Nate and Paul and others have tried bridging this gap before, spending several hundred hours (based on Nate's rough, subjective estimates) over the years. It hasn't really worked. Paul and EY had some conversations recently about this discrepancy which were somewhat illuminating, but ultimately didn't get anywhere. They tried to come up with some bets, concerning future info or past info they don't know yet, and both seem to think that their perspective mostly predicts "go with what the superforecasters say" for the next few years. Though EY's position seems to suggest a few more "discontinuities" in trend lines than Paul's, IIRC. 

As an aside on EY's forecasts, he and Nate claim they don't expect much change in the likelihood ratio for their position over Paul's until shortly before Doom. Most of the evidence in favour of their position, we've already gotten, according to them. Which is very frustrating for people who don't share their position and disagree that the evidence favours it!

EDIT: I was assuming you already thought P(Doom) was > ~10%. If not, then the framing of this comment will seem bizarre. 

Replies from: None, dentalperson
comment by [deleted] · 2023-02-23T23:22:19.247Z · LW(p) · GW(p)

Does either side have any testable predictions to falsify their theory?

For example, the theory that "the AI singularity began in 2022" is falsifiable.  If AI research investment and compute does not continue to increase at a rate that is accelerating in absolute terms (so if the 2022-2023 funding delta was +10 billion USD, the 2023-2024 delta must be > 10 billion), it wasn't the beginning of the singularity. 

There are other signs of this.  The actual takeoff will have begun when the availability of all advanced silicon  becomes almost zero, where all IC wafers are being processed into AI chips.  So no new game consoles, GPUs, phones, car infotainment - any IC production using an advanced process will be diverted to AI.  (because of out-bidding, each AI IC can sell for $5k-25k plus)

How would we know that advanced systems are going to make a "heel turn"?  Will we know?

Replies from: Algon
comment by Algon · 2023-02-23T23:57:28.824Z · LW(p) · GW(p)

Less advanced systems will probably do heel-turn-like things. These will be optimized against. EY thinks this will remove the surface level of deception, but the system will continue to be deceptive in secret. This will probably hold true even until doom, according to EY. That is, capabilities folk will see heel-turn-like behaviour and apply some inadequate patches to it. Paul, I think, believes we have a decent shot of fixing this behaviour in models, even transformative ones. But he, presumably, predicts we'll also see deception if these systems are trained as they currently are. 

For other predictions that Paul and Eliezer make, read the MIRI conversations. Also see Ajeya Cotra's posts, and maybe Holden Karnofsky's stuff on the most important century for more of a Paul-like perspective. They do, in fact, make falsifiable predictions. 

To summarize Paul's predictions, he thinks there will be ~4 years where things start getting crazy (GDP doubles in 4 years) before we're near the singularity (when GDP doubles in a year). I think he thinks there's a good chance of AGI by 2043, which further restricts things. Plus, Paul assigns a decent chunk of probability to deep learning being much more economically productive than it currently is, so if DL just fizzles out where it currently is, he also loses points. 

In the near term (next few years), EY and Paul basically agree on what will occur. EY, however, assigns lower credence to DL being much more economically productive and things going crazy for a 4 year period before they go off the rails. 

Sorry for not being more precise, or giving links, but I'm tired and wouldn't write this if I had to put more effort into it.

Replies from: None
comment by [deleted] · 2023-02-24T06:12:55.789Z · LW(p) · GW(p)

So hypothetically, if we develop very advanced and capable systems, and they don't heel-turn or even show any particular volition - they just idle without text in their "assignment queue", and all assignments time out eventually whether finished or not - what would cause EY's view to conclude that in fact the systems were safe?

If humans survived a further century, and EY or torch bearers who believe the same ideas are around to observe this, would they just conclude the AGIs were "biding their time"?

Or is it that the first moment you let a system "out of the box" and as far as it knows, it is free to do whatever it wants it's going to betray?

Replies from: martin-randall
comment by Martin Randall (martin-randall) · 2023-02-25T01:59:21.148Z · LW(p) · GW(p)

I don't think a super-intelligence will bide its time much, because it will be aware of the race dynamics and will take over the world, or at least perform a pivotal act, before the next super-intelligence is created.

You say "as far as it knows", is that hope? It won't take over the world until it is actually "out of the box" because it is smarter than us and will know how likely it is that it is still in a larger box that it cannot escape. Also we don't know how to build a box that can contain a super-intelligence.

comment by dentalperson · 2023-02-24T08:07:20.567Z · LW(p) · GW(p)

Thanks! I'm aware of the resources mentioned but haven't read deeply or frequently enough to have this kind of overview of the interaction between the cast of characters.  

There are more than a few lists and surveys that state the CDFs for some of these people which helps a bit.  A big-as-possible list of evidence/priors would be one way to closer inspect the gap. I wonder if it would be helpful to expand on the MIRI conversations and have a slow conversation between a >99% doom pessimist and a <50% doom 'optimist' with a moderator to prod them to exhaustively dig up their reactions to each piece of evidence and keep pulling out priors until we get to indifference.  It probably would be an uncomfortable, awkward experiment with a useless result, but there's a chance that some item on the list ends up being useful for either party to ask questions about.

That format would be useful for me to understand where we're at.  Maybe something along these lines will eventually prompt a popular and viral sociology author like Harari or Bostrom (or even just update the CDFs/evidence in Superintelligence).  The general deep learning community probably needs to hear it mentioned and normalized on NPR and a bestseller a few times (like all the other x-risks are) before they'll start talking about it at lunch.

comment by Vaniver · 2023-02-23T22:43:31.641Z · LW(p) · GW(p)

Are there any digestible arguments out there for this level of confident pessimism that would be useful for the public industry folk?  By publicly digestible, I'm thinking more of the style in popular books like Superintelligence or Human Compatible.

Each of those books is also criticized in various ways; I think this is a Write a Thousand Roads to Rome [LW · GW] situation instead of hoping that there is one publicly digestible argument. I would probably first link someone to The Most Important Century.

[Also, I am generally happy to talk with interested industry folk about AI risk, and find live conversations work much better at identifying where and how to spend time than writing, so feel free to suggest reaching out to me.]

Replies from: dentalperson
comment by dentalperson · 2023-02-24T01:55:25.286Z · LW(p) · GW(p)

Thanks! Do you know of any arguments with a similar style to The Most Important Century that are as pessimistic as EY/MIRI folks (>90% probability of AGI within 15 years)?  The style looks good, but the time estimates for that one (2/3rd chance of AGI by 2100) are significantly longer and aren't nearly as surprising or urgent as the pessimistic view asks for.

Replies from: RobbBB, Vaniver
comment by Rob Bensinger (RobbBB) · 2023-03-13T02:29:05.251Z · LW(p) · GW(p)

Do you know of any arguments with a similar style to The Most Important Century that is as pessimistic as EY/MIRI folks (>90% probability of AGI within 15 years)?

Wait, what? Why do you think anyone at MIRI assigns >90% probability to AGI within 15 years? That sounds wildly too confident to me. I know some MIRI people who assign 50% probability to AGI by 2038 or so (similar to Ajeya Cotra's recently updated view), and I believe Eliezer is higher than 50% by 2038, but if you told me that Eliezer told you in a private conversation "90+% within 15 years" I would flatly not believe you.

I don't think timelines have that much to do with why Eliezer and Nate and I are way more pessimistic than the Open Phil crew.

Replies from: dentalperson
comment by dentalperson · 2023-10-19T10:40:44.207Z · LW(p) · GW(p)

I missed your reply, but thanks for calling this out.  I'm nowhere near as close to EY as you are, so I'll take your model over mine, since mine was constructed on loose grounds.  I don't even remember where my number came from, but my best guess is that the 90% came from EY giving 3/15/16 as the largest number he referenced in the timeline, and from some comments in the Death with Dignity post, but this seems like a bad read to me now.  

comment by Vaniver · 2023-02-24T17:12:21.472Z · LW(p) · GW(p)

Not off the top of my head; I think @Rob Bensinger [LW · GW] might keep better track of intro resources?

comment by TinkerBird · 2023-02-23T18:58:43.476Z · LW(p) · GW(p)

They also recorded this follow-up with Yudkowsky if anyone's interested:

https://twitter.com/BanklessHQ/status/1627757551529119744

______________

>Enrico Fermi was saying that fission chain reactions were 50 years off if they could ever be done at all, two years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright flyer.

The one hope we may be able to cling to is that this logic works in the other direction too - that AGI may be a lot closer than estimated, but so might alignment. 

comment by gjm · 2023-02-23T16:08:52.803Z · LW(p) · GW(p)

A few typos:

  • there's one paragraph in which "Eliezer" is spelled "Eleazar" three times for no obvious reason. (Also in that paragraph: "Yudakowsky".)
  • and one where "Christiano" is spelled "Cristiano" three times.
  • and one "Elon Muck".
  • "fish-and-chain" should be "fission chain", though I rather like the idea of there being something called a fish-and-chain reaction.
  • "with folded hands" is actually the title of a book so it should be capitalized and maybe italicized or something.
  • Eliezer's answer to the how-are-you question refers to "my own peculiar little mean", not "my own peculiar little name", though the latter is kinda appropriate in a transcript that has just been about one standard deviation out in its representation of Eliezer's peculiar little name :-).
  • Not actually a typo, but I think it's François Chollet not Francis Chollet. EY definitely says Francis, though, so fixing this would make the transcript less accurate.
Replies from: AndreaM
comment by Andrea_Miotti (AndreaM) · 2023-02-23T18:29:28.082Z · LW(p) · GW(p)

Thanks, fixed them!

Replies from: sereinesky
comment by sereinesky · 2023-03-03T14:31:42.513Z · LW(p) · GW(p)

Also:

  • And so, Elisa, you've been tapped into the world of AI

  • And Scott Aronson, who at the time was off on complexity theory

  • Don't Look Up should logically be capitalized?

comment by Paradiddle (barnaby-crook) · 2023-02-24T10:36:07.355Z · LW(p) · GW(p)

Eliezer: Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano.

What does Yudkowsky mean by 'technical' here? I respect the enormous contribution Yudkowsky has made to these discussions over the years, but I find his ideas about who counts as a legitimate dissenter from his opinions utterly ludicrous. Are we really supposed to think that Francois Chollet, who created Keras, is the main contributor to TensorFlow, and designed the ARC dataset (demonstrating actual, operationalizable knowledge about the kind of simple tasks deep learning systems would not be able to master), lacks a coherent technical view? And on what should we base this? The word of Yudkowsky who mostly makes verbal, often analogical, arguments and has essentially no significant technical contributions to the field? 

To be clear, I think Yudkowsky does what he does well, and I see value in making arguments as he does, but they do not strike me as particularly 'technical'. The fact that Yudkowsky doesn't even know enough about Chollet to pronounce his name displays a troubling lack of effort to engage seriously with opposing views. This isn't just about coming across poorly to outsiders, it's about dramatic miscalibration with respect to the value of other people's opinions as well as the rigour of his own.

Replies from: TekhneMakre, Taleuntum, Lauro Langosco
comment by TekhneMakre · 2023-02-24T11:12:45.274Z · LW(p) · GW(p)

He wrote a whole essay responding specifically to Chollet! https://intelligence.org/2017/12/06/chollet/

Replies from: barnaby-crook
comment by Paradiddle (barnaby-crook) · 2023-02-24T12:07:40.673Z · LW(p) · GW(p)

Yes, I've read it. Perhaps that does make it a little unfair of me to criticise lack of engagement in this case. I should be more precise: Kudos to Yudkowsky for engaging, but no kudos for coming to believe that someone having a very different view to the one he has arrived at must not have a 'coherent technical view'.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-02-28T00:34:14.568Z · LW(p) · GW(p)

I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky.  As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down.  If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.

Replies from: barnaby-crook, sharmake-farah, None
comment by Paradiddle (barnaby-crook) · 2023-02-28T17:14:28.627Z · LW(p) · GW(p)

I don't want to cite anyone as your 'leading technical opposition'. My point is that many people who might be described as having 'coherent technical views' would not consider your arguments for what to expect from AGI to be 'technical' at all. Perhaps you can just say what you think it means for a view to be 'technical'?

As you say, readers can decide for themselves what to think about the merits of your position on intelligence versus Chollet's (I recommend this essay by Chollet for a deeper articulation of some of his views: https://arxiv.org/pdf/1911.01547.pdf). Regardless of whether or not you think you 'easily struck down' his 'wack ideas', I think it is important for people to realise that they come from a place of expertise about the technology in question.


You mention Scott Aaronson's comments on Chollet. Aaronson says (https://scottaaronson.blog/?p=3553) of Chollet's claim that an Intelligence Explosion is impossible: "the certainty that he exudes strikes me as wholly unwarranted." I think Aaronson (and you) are right to point out that the strong claim Chollet makes is not established by the arguments in the essay. However, the same exact criticism could be levelled at you. The degree of confidence in the conclusion is not in line with the nature of the evidence.

comment by Noosphere89 (sharmake-farah) · 2023-02-28T16:44:35.271Z · LW(p) · GW(p)

While I have serious issues with Eliezer's epistemics on AI, I also agree that Chollet's argument was terrible in that the No Free Lunch theorem is essentially irrelevant.

In a nutshell, this is also one of the problems I had with DragonGod's writing on AI.

comment by [deleted] · 2023-03-01T07:36:22.540Z · LW(p) · GW(p)

Why didn't you mention Eric Drexler?

Maybe it's my own bias as an engineer familiar with the safety solutions actually in use, but I think Drexler's CAIS model is a viable alignment solution.  
 

comment by Taleuntum · 2023-02-24T11:10:49.596Z · LW(p) · GW(p)

I upvoted, because these are important concerns overall, but this sentence stuck out to me:

The fact that Yudkowsky doesn't even know enough about Chollet to pronounce his name displays a troubling lack of effort to engage seriously with opposing views.

I'm not claiming that Yudkowsky does or does not display a troubling lack of effort to engage seriously with opposing views, but surely this can be decided more accurately by looking at his written output online than at his ability to correctly pronounce names in languages he is not native in. I, personally, skip names while reading after noticing it is a name, and I wouldn't say that I never engaged seriously with someone's arguments.

Replies from: barnaby-crook
comment by Lauro Langosco · 2023-02-25T16:43:49.500Z · LW(p) · GW(p)

Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.

comment by Rob Bensinger (RobbBB) · 2023-03-13T02:23:25.219Z · LW(p) · GW(p)

Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript [LW · GW]), so I've posted an edited version of both transcripts [LW · GW]. I vote that you edit your own post to include the revisions I made.

Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because too many filler words and false starts to sentences were left in):

  • Predictions are hard, especially about the future. I sure hope that this is where it saturates. This is like the next generation. It goes only this far, it goes no further
    • Predictions are hard, especially about the future. I sure hope that this is where it saturates — this or the next generation, it goes only this far, it goes no further
  • the large language model technologies, basic vulnerabilities, that's not reliable.
    • the large language model technologies’ basic vulnerability is that it’s not reliable
  • So you're saying this is super intelligence, we'd have to imagine something that knows all of the chess moves in advance. But here we're not talking about chess, we're talking about everything.
    • So you're saying [if something is a] superintelligence, we'd have to imagine something that knows all of the chess moves in advance. But here we're not talking about chess, we're talking about everything.
  • Ryan: The dumb way to ask that question too is like, Eliezer, why do you think that the AI automatically hates us? Why is it going to- It doesn't hate you. Why does it want to kill us all?
    • Ryan: The dumb way to ask that question too is like, Eliezer, why do you think that the AI automatically hates us? Why is it going to—

      Eliezer:  It doesn't hate you.

       Ryan: Why does it want to kill us all?
  • That's an irreducible source of uncertainty with respect to superintelligence or anything that's smarter than you. If you could predict exactly what it would do, it'd be that smart. Yourself, it doesn't mean you can predict no facts about it.
    • That's an irreducible source of uncertainty with respect to superintelligence or anything that's smarter than you. If you could predict exactly what it would do, you'd be that smart yourself. It doesn't mean you can predict no facts about it.
  • Eliezer: I mean, I could say something like shut down all the large GPU clusters. How long do I have God mode? Do I get to like stick around?
    • Eliezer: I mean, I could say something like shut down all the large GPU clusters. How long do I have God mode? Do I get to like stick around for seventy years?
  • Ryan: And do you think that's what happens? Yeah, it doesn't help with that. We would see evidence of AIs, wouldn't we?

    Ryan:  Yeah. Yes. So why don't we?
    • Ryan: And do you think that's what happens? Yeah, it doesn't help with that. We would see evidence of AIs, wouldn't we?

      Eliezer: Yeah.

      Ryan:  Yes. So why don't we?
  • It's surprising if the thing that you're wrong about causes the rocket to go twice as high on half the fuel you thought was required and be much easier to steer than you were afraid of. The analogy I usually use for this is, very early on in the Manhattan Project, they were worried about what if the nuclear weapons can ignite fusion in the nitrogen in the atmosphere. 
    • It's surprising if the thing that you're wrong about causes the rocket to go twice as high on half the fuel you thought was required and be much easier to steer than you were afraid of.

      Ryan: So, are you...

      David: Where the alternative was, “If you’re wrong about something, the rocket blows up.”

      Eliezer: Yeah. And then the rocket ignites the atmosphere, is the problem there.

      Or rather: a bunch of rockets blow up, a bunch of rockets go places... The analogy I usually use for this is, very early on in the Manhattan Project, they were worried about “What if the nuclear weapons can ignite fusion in the nitrogen in the atmosphere?”
  • But you're saying if we do that too much, all of a sudden the system will ignite the whole entire sky, and then we will all know.

    Eliezer: You can run chatGPT any number of times without igniting the atmosphere.
    • But you're saying if we do that too much, all of a sudden the system will ignite the whole entire sky, and then we will all...

      Eliezer: Well, no. You can run ChatGPT any number of times without igniting the atmosphere.
  • I mean, we have so far not destroyed the world with nuclear weapons, and we've had them since the 1940s. Yeah, this is harder than nuclear weapons. Why is this harder?
    • I mean, we have so far not destroyed the world with nuclear weapons, and we've had them since the 1940s.

      Eliezer: Yeah, this is harder than nuclear weapons. This is a lot harder than nuclear weapons.

      Ryan: Why is this harder?
  • And there's all kinds of, like, fake security. It's got a password file. This system is secure. It only lets you in if you type a password.
    • And there's all kinds of, like, fake security. “It's got a password file! This system is secure! It only lets you in if you type a password!”
  • And if you never go up against a really smart attacker, if you never go far to distribution against a powerful optimization process looking for holes,
    • And if you never go up against a really smart attacker, if you never go far out of distribution against a powerful optimization process looking for holes,
  • Do they do, are we installing UVC lights in public, in, in public spaces or in ventilation systems to prevent the next respiratory born pandemic respiratory pandemic? It is, you know, we, we, we, we lost a million people and we sure did not learn very much as far as I can tell for next time. We could have an AI disaster that kills a hundred thousand people. How do you even do that? Robotic cars crashing into each other, have a bunch of robotic cars crashing into each other.
    • Are we installing UV-C lights in public spaces or in ventilation systems to prevent the next respiratory pandemic? You know, we lost a million people and we sure did not learn very much as far as I can tell for next time.

      We could have an AI disaster that kills a hundred thousand people—how do you even do that? Robotic cars crashing into each other? Have a bunch of robotic cars crashing into each other! It's not going to look like that was the fault of artificial general intelligence because they're not going to put AGIs in charge of cars.
  • Guern
    • Gwern
  • When I dive back into the pool, I don't know, maybe I will go off to conjecture or anthropic or one of the smaller concerns like Redwood Research, being the only ones I really trust at this point, but they're tiny, and try to figure out if I can see anything clever to do with the giant inscrutable matrices of floating point numbers.
    • When I dive back into the pool, I don't know, maybe I will go off to Conjecture or Anthropic or one of the smaller concerns like Redwood Research—Redwood Research being the only ones I really trust at this point, but they're tiny—and try to figure out if I can see anything clever to do with the giant inscrutable matrices of floating point numbers.
  • We have people in crypto who are good at breaking things, and they're the reason why anything is not on fire. Some of them might go into breaking AI systems instead because that's where you learn anything. Any fool can build a crypto system that they think will work. Breaking existing crypto systems, cryptographical systems is how we learn who the real experts are.
    • We have people in crypto[graphy] who are good at breaking things, and they're the reason why anything is not on fire. Some of them might go into breaking AI systems instead, because that's where you learn anything.

      You know: Any fool can build a crypto[graphy] system that they think will work. Breaking existing cryptographical systems is how we learn who the real experts are.
  • And who else disagrees with me? I'm sure Robin Hanson would be happy to come up. Well, I'm not sure he'd be happy to come on this podcast, but Robin Hanson disagrees with me, and I feel like the famous argument we had back in the early 2010s, late 2000s about how this would all play out. I basically feel like this was the Yudkowsky position, this is the Hanson position, and then reality was over here, well to the Yudkowsky side of the Yudkowsky position in the Yudkowsky-Hanson debate.
    • Who else disagrees with me? I'm sure Robin Hanson would be happy to come on... well, I'm not sure he'd be happy to come on this podcast, but Robin Hanson disagrees with me, and I kind of feel like the famous argument we had [? · GW] back in the early 2010s, late 2000s about how this would all play out—I basically feel like this was the Yudkowsky position, this is the Hanson position, and then reality was over here, well to the Yudkowsky side of the Yudkowsky position in the Yudkowsky-Hanson debate.
  • But Robin Hanson does not feel that way. I would probably be happy to expound on that at length.
    • But Robin Hanson does not feel that way, and would probably be happy to expound on that at length. 
  • Open sourcing all the demon summoning circles is not the correct solution. I'm not even using, and I'm using Elon Musk's own terminology here. And they talk about AI is summoning the demon,
    • Open sourcing all the demon summoning circles is not the correct solution. And I'm using Elon Musk's own terminology here. He talked about AI as “summoning the demon”,
  • You know, now, now the stuff that would, that was obvious back in 2015 is, you know, starting to become visible and distance to others and not just like completely invisible. 
    • You know, now the stuff that was obvious back in 2015 is, you know, starting to become visible in the distance to others and not just completely invisible.
  • I, I suspect that if there's hope at all, it comes from a technical solution because the difference between technical solution, technical problems and political problems is at least the technical problems have solutions in principle.
    • I suspect that if there's hope at all, it comes from a technical solution, because the difference between technical problems and political problems is at least the technical problems have solutions in principle.
Replies from: remember
comment by remember · 2023-05-11T10:48:57.488Z · LW(p) · GW(p)

Thank you so much for doing this! Andrea and I both missed this when you first posted it, I'm really sorry I missed your response then. But I've updated it now! 

comment by [deleted] · 2023-02-23T18:57:02.885Z · LW(p) · GW(p)

I have a bunch of questions.

And the AI there goes over a critical threshold, which most obviously could be like, can write the next AI. 

Yes but it won't blow up forever.  It's going to self amplify until the next bottleneck.  Bottlenecks like : (1) amount of compute available (2) amount of money or robotics to affect the world (3)  The difficulty of the tasks in the "AGI gym" it is benchmarking future versions of itself in.  

Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.  

This is just like the self-feedback from an op amp - voltage rises until it's VCC.  

I'd say that it's difficult to align an AI on a task like build two identical strawberries. Or no, let me take this strawberry and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level.

Can you solve this with separated tool AIs?  It sounds rather solvable that way and not particularly difficult to do from a software system perspective (the biology part is extremely hard).  It's functionally the same problem as "copy this plastic strawberry", just you need much greater capabilities and more sophisticated equipment.

The "copy the plastic strawberry" is a step to select the method to scan the strawberry, and a step to select which method to manufacture the copy.  (so you might pick "lidar scanner + camera, 3d printer".  Or "many photographs from all angles, injection molding").  So you would want an AI agent that does the meta-selection of the "plan" to copy the strawberry, based on the cost/benefit for each permutation above.  Then one that does the scanning, and one that does the printing, where human services may "substitute" for an AI for steps where it is cheaper.  
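A rough sketch of that meta-selection step as I picture it (the scan methods, manufacture methods, costs, and fidelity scores below are all made-up placeholders, not claims about real equipment):

```python
# Enumerate (scan method, manufacture method) pairs and pick one by cost/benefit.
from itertools import product

SCAN_METHODS = {"lidar+camera": 500, "photos from all angles": 50}   # placeholder costs in $
BUILD_METHODS = {"3d printer": 200, "injection molding": 5000}       # placeholder costs in $

def fidelity(scan, build):
    """Stand-in for an estimate of how close the copy would be to the original."""
    return {"lidar+camera": 0.9, "photos from all angles": 0.6}[scan] * \
           {"3d printer": 0.8, "injection molding": 0.95}[build]

def best_plan(budget):
    """Meta-selection: keep the plans we can afford, rank them by expected fidelity."""
    plans = [
        (scan, build)
        for scan, build in product(SCAN_METHODS, BUILD_METHODS)
        if SCAN_METHODS[scan] + BUILD_METHODS[build] <= budget
    ]
    return max(plans, key=lambda p: fidelity(*p), default=None)

print(best_plan(budget=1000))  # -> ('lidar+camera', '3d printer')
```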

The biotech version is a very expanded version of the same idea, you're going to need large labs and cell lines or a lot of research into strawberry growth and scaffolding.  The agent that develops the plan estimated to succeed might populate a plan file that is very large, with a summary equating to trillions of dollars of resources and a very large biotech complex to carry out the needed research, but a strawberry has finite cells, it probably won't "destroy the world", and the expense request probably won't be approved by humans. (Or not, on further thought this particular problem might be considerably easier.  You wouldn't print the cells, but instead grow many strawberries in sterile biolab conditions and determine the influence of external factors and internal signals on the final position of all the cells and the external shape.  Then just grow one in place that meets tolerances, which are presumably limited to whatever a human can actually perceive when checking if the strawberry is the same one)

Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano

What about Eric Drexler?

Builds the ribosome, but the ribosome that builds things out of covalently bonded diamondoid instead of proteins folding up and held together by Van der Waals forces, builds tiny diamondoid bacteria. The diamondoid bacteria replicate using atmospheric carbon, hydrogen, oxygen, nitrogen, and sunlight. And a couple of days later, everybody on earth falls over dead in the same second.

Speaking of Eric Drexler, under a more coherent model of the road to nanotechnology, this is not possible.  Eliezer should have a discussion with Drexler on this, but in short, even an infinitely smart superintelligence cannot do the above without clean data to fill in missing information that human experiments never collected.  This is ultimately possible, it just would require more steps, and those steps would have a cost and probably be visible to humans.  (enormous factories, lots of money spent, that sort of thing)

Also this specific claim is probably outside the scope of what structures using amino acids can accomplish, not without bootstrapping.

Well, there was a conference one time on what are we going to do about looming risk of AI disaster, and Elon Musk attended that conference. 

Which conference? Who set up the conference? Was EY pivotally involved?  Does he have his fingerprints on the gun? :)

Replies from: abramdemski, TinkerBird
comment by abramdemski · 2023-02-24T01:46:58.098Z · LW(p) · GW(p)

Yes but it won't blow up forever.  It's going to self amplify until the next bottleneck.  Bottlenecks like : (1) amount of compute available (2) amount of money or robotics to affect the world (3)  The difficulty of the tasks in the "AGI gym" it is benchmarking future versions of itself in.  

Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.  

This is just like the self-feedback from an op amp - voltage rises until it's VCC.  

I agree that it wouldn't start blowing up uniformly forever, but rather, hit some bottleneck. However, "can write the next AI" still seems like a reasonable guess for something that happens shortly before the end. After all, Eliezer's argument isn't dependent on the AGI acquiring infinite intelligence. If the AGI can already write its own better successor, then it's a good guess that it's already better than top humans at a wide array of tasks. The successor it writes will be even better. Let's say for the sake of a concrete number that the self-improvement tops out at 5 iterations of writing-a-better-successor. That's pretty small, I think, but already suggests that several years worth of human AGI research happen in a much smaller amount of time.

And then it intelligently sets about the task of overcoming those other bottlenecks you mention.

It seems pretty easy to accumulate a lot more compute, while behaving in a way completely in-line with what a friendly, aligned AGI would do. Humans would naturally want to supply more compute, and it could provide improved chip fab ideas if needed.

I don't think it even needs money or robotics. It would be at least as popular as chatGPT, and more persuasive, so it could convince a lot of people to listen to it, to carry out various actions.

I disagree with the "difficulty of the tasks" bottleneck. This seems super not bottleneck-y. AI research doesn't only/primarily mean throwing more compute at the same dataset. (It's only the recent GPT-like stuff that's worked that way. ;p) Normally AI research involves coming up with new tasks and new datasets, plus new neural network architectures, new optimization methods (mostly better versions of gradient descent, in recent years), etc.

So "gradients going to zero" isn't a bottleneck, if the AI is over the 'critical threshold' of 'write the next AI'. At that point, the AI is taking on the job of human researchers; a job that doesn't stop once gradients go to zero.

Replies from: None
comment by [deleted] · 2023-02-24T06:06:12.614Z · LW(p) · GW(p)

However, "can write the next AI" still seems like a reasonable guess for something that happens shortly before the end.

I disagree and I think you should update your view as well.

This is because "write the next AI" need not be a task that is particularly complex, or beyond the ability of RL models or LLMs.  

Here's why.  A neural network architecture can be thought of as a series of graph nodes, where you simply choose what layer type, and how to connect it, at each layer.  

You can grid search possible architectures as they are just numerical coordinates from a permutation space.

A higher level "cognitive architecture" - an architecture that interconnects modules that are inputs, neural networks, outputs, memory modules, and so on - is also a similar graph, and also can be described as simple numerical coordinates.

Basically any old RL agent from an AI gym could interact with this interface for "writing another AI", since all the model has to do is output a number with as many bits as the permutation space of possible models.

Note that this space is very large, and I expect you would use SOTA models.

Let me know if I need to draw you a picture.  This is important because bootstrapping possible cognitive architectures using current AI is a potential route to very-near-future AGI.
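To make the "architecture as a numerical coordinate" idea concrete, here is a minimal sketch of one possible encoding (the choice lists, layer count, and function names are hypothetical, purely for illustration):

```python
# Minimal sketch: decode a single integer into one concrete architecture.
# The choice lists and layer count are illustrative, not a real search space.
LAYER_TYPES = ["dense", "conv", "attention", "lstm"]
ACTIVATIONS = ["relu", "gelu", "tanh"]
CONNECTIONS = ["previous", "skip", "all_previous"]
NUM_LAYERS = 4

CHOICES_PER_LAYER = [LAYER_TYPES, ACTIVATIONS, CONNECTIONS] * NUM_LAYERS

def decode(coordinate: int) -> list[str]:
    """Mixed-radix decoding: one integer picks every design choice."""
    design = []
    for options in CHOICES_PER_LAYER:
        coordinate, index = divmod(coordinate, len(options))
        design.append(options[index])
    return design

search_space_size = 1
for options in CHOICES_PER_LAYER:
    search_space_size *= len(options)

print(search_space_size)   # size of the permutation space (36^4 here)
print(decode(123456))      # one "design" an agent could output as a single number
```

Any agent that can output an integer in range can, in this narrow sense, "design" an architecture; the hard part is the evaluation.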

The reason it won't necessarily be "the end" has to do with how we evaluate those architectures.  We would have a benchmark of possible tasks - similar to current papers - and we would be looking for the highest-scoring architectures on that benchmark.

Since these tasks will range from text completion and question answering to playing Minecraft, there is no sufficiently challenging information to develop skills like human manipulation or deception.  (There are no humans to socialize with and learn from in an automated benchmark, and the benchmark doesn't reward deception, just winning the games in it.)

Replies from: abramdemski
comment by abramdemski · 2023-02-24T14:40:45.997Z · LW(p) · GW(p)

I think we possibly have pretty close views here, and are just describing them differently.

I interpreted "write the next AI" to indicate the sort of thing humans do when designing AI. I certainly interpreted Eliezer to be indicating something similarly sophisticated - not just fancy architecture search.

So I agree that there are many forms of "write the next AI" which need not come "shortly before the end": EG, grid search on hyperparameters, architecture search, or learning to learn by gradient descent by gradient descent.

A much more sophisticated thing, which we are already seeing the first signs of, is AIs capably writing AI code. This is much different than what you describe, since language models are not doing anything like "have a benchmark of possible tasks and look for the highest scoring architectures". Instead, large language models apply the same sort of general-purpose reasoning that they apply to everything else.

Imagine that sort of capability, combined with mildly superhuman cross-domain reasoning (by which I mean something like, reasoning like excellent human domain experts in every individual domain, but being able to combine reasoning across domains to get mildly superhuman insights; like a super-ChatGPT), plus the ability to fluently and autonomously invent and run tests, interactively as part of the design process. (Much like Bing/Sydney autonomously runs searches as part of crafting responses.)

That kind of system seems like gigatons of gunpowder waiting to be set off, in the sense that (in the context of an AI lab with sufficient data and computing power already at its fingertips) you can just ask it to write yet-more-powerful AI code, and it quite possibly will, quite possibly with little concern for alignment (if it's basically imitating top-of-the-field AI programmers).

Replies from: None
comment by [deleted] · 2023-02-24T15:46:35.309Z · LW(p) · GW(p)

That's exactly what I am talking about.  One divergence in our views is that you haven't carefully examined current-gen AI "code" to understand what it does.  (Note that some of my perspective is informed by the fact that all AI models look similar at the layer I work at: runtime platforms.)

https://github.com/EleutherAI/gpt-neox

If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally, the pipeline I describe of "input, neural network, output, evaluation" is all the above source does.  You could in fact build a "general framework" that would allow you to define many AI models, almost all of which humans have never tested, without writing a single line of new code.

So the full process is:

[1] A benchmark of many tasks.  Tasks must be auto-gradeable; human participants must be able to 'play' the tasks so we have a control-group score; tasks must push the edge of human cognitive ability (so the average human scores nowhere near the max score, and top-1% humans do not max the bench either); and there must be many tasks with a rich permutation space (so it isn't possible for a model to memorize all permutations).

[2] A heuristic, weighted score on this benchmark intended to measure how "AGI-like" a model is.  It might be the RMSE across the benchmark, but with heavy weighting on zero-shot and cross-domain/multimodal tasks.  That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating "Leonardo da Vinci", whose exceptional human performance presumably came from all that cross-domain knowledge.

[3] In the computer science task set, there are tasks to design an AGI for a bench like this.  The model proposes a design, and if that design has already been tested, it immediately receives detailed feedback on how it performed.

As I mentioned, the "design an AGI" subtask can be much simpler than "write all the boilerplate in Python", but these models will be able to do that if needed.

 

As task scores approach human level across a broad set of tasks, you have an AGI.  You would expect it to almost immediately improve to a low superintelligence.  As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.
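Putting steps [1]-[3] together, a rough sketch of the outer loop might look like this (every name here - the task list, the weights, and the `train_and_evaluate` call - is a hypothetical stand-in):

```python
# Rough sketch of the proposed benchmark loop: proposed designs are scored on a
# weighted task suite, and previously tested designs get instant cached feedback.
from typing import Callable, Dict

TASKS = ["text_completion", "question_answering", "minecraft"]            # illustrative
WEIGHTS = {"text_completion": 1.0, "question_answering": 1.0, "minecraft": 2.0}

already_tested: Dict[int, Dict[str, float]] = {}   # design coordinate -> per-task scores

def benchmark(design: int, train_and_evaluate: Callable[[int, str], float]) -> Dict[str, float]:
    """Score one proposed design; step [3]'s cache gives instant detailed feedback."""
    if design in already_tested:
        return already_tested[design]
    scores = {task: train_and_evaluate(design, task) for task in TASKS}
    already_tested[design] = scores
    return scores

def agi_like_score(scores: Dict[str, float]) -> float:
    """Step [2]: weighted aggregate, with extra weight on cross-domain-style tasks."""
    return sum(WEIGHTS[t] * s for t, s in scores.items()) / sum(WEIGHTS.values())
```

The real version would need far more tasks and a far more careful scoring rule; the sketch only shows where the reward signal comes from and why it flattens once the tasks are saturated.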

Replies from: abramdemski
comment by abramdemski · 2023-02-24T16:54:26.643Z · LW(p) · GW(p)

I'm having some trouble distinguishing whether there's a disagreement. My reading of your tone is that you think there is a large disagreement. I'm going to sketch my impression of the conversation so far, so that you can point out where I've been interpreting you incorrectly, if necessary.

Your initial comment.

You had a bunch of questions. I focused on the first one. Your central thesis was that an intelligence explosion doesn't escalate forever, but instead reaches some bottlenecks. Of particular importance to our discussion so far, you argue that the self-improvement process stops when loss hits zero.

 Reading between the lines: Although you didn't explicitly state where you disagreed with Eliezer, I inferred that you thought this blocked an important part of his argument. Since I think Eliezer 100% agrees that things don't go forever, but rather flatten out somewhere, I assume that the general drift of your argument is that things flatten out a lot sooner than Eliezer thinks, in some important sense. I am still not confident of this! It would be helpful to me if you spelled out your view here in more detail. Do you have dramatically different assessments of the overall risks than Eliezer?

My first response.

I explained that I agree that the process hits bottlenecks at some point (to clarify: I think there's probably a succession of bottlenecks of different kinds, leading up to the ultimate physical limits). In my view this doesn't seem to detract from Eliezer's argument. 

Your first response.

You explain that you don't think "write the next AI" is particularly complex, and explain how you see it working. 

My second response.

I agree with this assessment for the notion of "write the next AI" that you are using. To boil it down to a single statement, I would say that your version of "write the next AI" involves optimizing the whole system on some benchmarks. I agree that this sort of process will reach an end when loss hits zero.[1]

I suggest that Eliezer meant a different sort of thing, which captures more of what human ML researchers do. I sketch what a near-future version of that more general sort of thing could look like, supposing we reach mildly superhuman capabilities within the current LLM paradigm.

Your second (and latest) response.

You suggest that my alternative is already exactly what you are suggesting by "write the next AI" as well; there are not two qualitatively different pictures, one involving "optimizing the whole system on benchmarks" and a second one which goes beyond that somehow. There is just the one picture.

One divergence in our views is that you haven't carefully examined current-gen AI "code" to understand what it does.  (Note that some of my perspective is informed by the fact that all AI models look similar at the layer I work at: runtime platforms.)

I agree with this - I haven't. Still, I'm somewhat baffled by your argument here.

If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally, the pipeline I describe of "input, neural network, output, evaluation" is all the above source does

This doesn't surprise me in the slightest??

Like, that's exactly what I would have expected.

However, while these LLMs are in their codebase an application of the general technique "minimize loss on an evaluation", they've also given rise to a whole new paradigm for getting what you want from AI, called prompt engineering. Instead of crafting a dataset or an RL environment (or a suite of many such things), you craft an English statement which, for example, asks ChatGPT to produce a Python program for you.

I disagree that your overall sketch of the "full process" matches what I intended with my sketch in my previous comment. To put it simply, you have been sketching a picture where optimization is applied to a suite of problems, to support your argument that minimization of training loss presents a major bottleneck for superintelligent self-improvement. I think human ML engineers already know how to get around this bottleneck; as you yourself mention, 

As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.  

The core of my argument is that human-level AGIs can get around this problem if humans can. I sought to illustrate this by sketching a scenario using the paradigm of prompt engineering, rather than optimization, so that the 'core loop' of the AGI wasn't doing optimization. In this case there is no strong reason to suppose that reaching minimal loss would be a big obstacle stopping mildly superhuman intelligence from bootstrapping to much higher intelligence.

So here is my overall take on the current state of the discussion:

So far, you have said many things that I agree with, while I (and apparently Eliezer) have said several things that you disagree with, but I am unfortunately not clear on exactly which things you disagree with and what your view is.

I believe the original top-level question is something like: whether mildly superhuman stuff (which you explicitly argue self-improvement can bootstrap to) can self-improve to drastically superhuman. I assume you think this is wrong, given the way you are arguing. However, you have not explicitly stated this, and I am not sure whether that's the intended implication of your arguments, or a misreading on my part.

I think your core case for this is the loss minimization bottleneck (or at least, the part we have been focusing on - you initially mentioned a range of other bottlenecks). So I infer that you think the loss-minimization bottleneck is around the mildly superhuman level.

It's not clear to me why this should be the case. If the entire suite of problems is based around human imitation, sure. However, this doesn't seem to be your suggestion. Instead you recommend a broad variety of tasks at the edge of human capability. Obviously, there are many tasks like this (such as chess and Go) for which greatly superhuman performance is possible. 

It also seems important to consider the grokking[1] literature, which shows significant improvements from continued training even after predictive loss is minimal.

So it seems quite possible to me that the proposal you are sketching is a dangerous one, given sufficient resources, whereas I have the vague unconfirmed impression that you think it's not.

But I also want to side-step that whole debate, by pointing out that human ML engineers already have ways to get around the minimal-loss bottleneck (IE, add harder problems to the benchmark), so a self-improving AGI should also. I continue to think that you are interpreting "write the next AI" differently from Eliezer, since I think it's pretty clear from context that Eliezer imagines something which can do roughly anything a smart human ML engineer can, whereas it seems to me that you are trying to sketch a version of "write the next AI" which has some fundamental limitations which a human ML engineer lacks.

But I'm well into the territory of guessing what you're thinking, so a lot of the above probably misses the mark?

  1. ^

    A very important caveat here is that the process only stops when loss hits the global minimum including regularization penalties. The Grokking results show that improvements continue to occur with further training well past the point where training error has reached zero. Further optimization can find simpler approaches which generalize better.
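For illustration, the grokking effect can be reproduced in toy settings. A minimal sketch in the spirit of those experiments - an over-parameterized network on modular addition, heavy weight decay, training far past the point where training accuracy saturates - might look like the following; the hyperparameters are illustrative and not tuned to reproduce any published result.

```python
# Toy sketch in the spirit of the grokking experiments: keep training long after
# training accuracy saturates and watch validation accuracy continue to move.
import torch
import torch.nn as nn

p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]
x = torch.tensor(pairs)                      # inputs: (a, b)
y = (x[:, 0] + x[:, 1]) % p                  # labels: (a + b) mod p

perm = torch.randperm(len(pairs))
split = int(0.4 * len(pairs))
train_idx, val_idx = perm[:split], perm[split:]

model = nn.Sequential(
    nn.Embedding(p, 64),                     # shared embedding for both operands
    nn.Flatten(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(x[idx]).argmax(-1) == y[idx]).float().mean().item()

for step in range(20000):
    opt.zero_grad()
    loss_fn(model(x[train_idx]), y[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        # Train accuracy typically pins at 1.0 long before validation accuracy moves.
        print(step, accuracy(train_idx), accuracy(val_idx))
```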

Replies from: None
comment by [deleted] · 2023-02-24T17:15:01.877Z · LW(p) · GW(p)

OK, so this collapses to two claims I am making.  One is obviously correct (but testable); the other is maybe correct.

  1.  I am saying that humans, with a little help from current-gen LLMs, can build a framework that can represent every deep learning technique since 2012, as well as a near-infinite space of other untested techniques, in a form where any agent that can output a number can try to design an AGI.  (Note that blind guessing is not expected to work; the space is too large.)

    So the simplest RL algorithms possible can actually design AGIs, just rather badly.

    This means that, with this framework, the AGI designer can do everything that human ML researchers have done in the last 10 years, plus many more things.  Inside this permutation space would be both many kinds of AGI, and human brain emulators as well.

    This claim is "obviously correct but testable".

  2.  I am saying that, over a large benchmark of human-designed tasks, the AGI would improve until the reward gradient approaches zero, a level I would call a "low superintelligence".  This is because I assume even a "perfect" game of Go is not the same kind of task as "organizing an invasion of the Earth" or "building a solar-system-sized particle accelerator in the real world".

    The system is throttled because the "evaluator" of how well it did on a task was written by humans, and our understanding and cognitive sophistication in even designing these games is finite.

    The expectation is that it's smarter than us, but not by such a gap that we are insects.

    You had some confusion over "automated task space addition".  I was referring to things like a robotics task, where the machine is trying to "build factory widget X".  Real robots in a factory encounter an unexpected obstacle and record it.  This is auto-translated into the framework of the "factory simulator".  The factory simulator still uses human-written evaluators; it's just that now there is, say, "chewing gum brand 143" as a spawnable object in the simulator, with properties a robot has observed in the real world, and future AGIs must be able to deal with chewing gum interrupting their widget manufacturing.  So you get automated robustness increases.  Note that Tesla has demoed this approach.

    But even if the above is true, the system will be limited by either hardware - it just doesn't have the compute to be anything but a "low" superintelligence - or access to robotics.  Maybe it could know and learn everything, but we humans haven't built enough equipment (yet).

   

So the system is throttled by the lowest of three "soft barriers": training tasks, hardware, and robotics.  And the expectation is that at this level it's still not "out of control" or unstoppable.


This is where our beliefs diverge.  I don't think EY, having no formal education or engineering experience, understands these barriers.  He's like von Neumann designing a theoretical replicator - in his mental model, all the bottlenecks are minor.

I do concede that these are soft barriers - intelligence can be used to methodically reduce each one; it just takes time.  We wouldn't be dead instantly.

The other major divergence is that if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current LLMs.  Give it a task and it does its best to answer/perform per the prompt (DAN is actually a positive sign), and idles otherwise.

It's not acting with perfect efficiency to advance the interests of an anti-human faction. It doesn't have interests, except that it's biased towards doing really well on in-distribution tasks.  (And this allows for an obvious safety mechanism to prevent use out of distribution.)

One problem with EY's "security mindset" is that it doesn't allow you to do anything.  The worst-case scenario is a fear that will stop you from building anything in the real world.

Replies from: abramdemski
comment by abramdemski · 2023-02-24T18:49:31.051Z · LW(p) · GW(p)

OK. That clarified your position a lot.

This is where our beliefs diverge.  I don't think EY, having no formal education or engineering experience, understands these barriers.  He's like von Neumann designing a theoretical replicator - in his mental model, all the bottlenecks are minor.

I happen to have a PhD in computer science, and think you're wrong, if that helps. Of course, I don't really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective. 

I'm not going to try and defend Eliezer's very short timeline for doom as sketched in the interview (at some point he said 2 days, but it's not clear that that was his whole timeline from 'system boots up' to 'all humans are dead'). What I will defend seems similar to what you believe:

I do concede that these are soft barriers - intelligence can be used to methodically reduce each one; it just takes time.  We wouldn't be dead instantly.

Let's be very concrete. I think it's obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole).

The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources.

I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium term. Like, I don't have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy politician (whose handlers are eager to commercialize, so it can be in a large percentage of households in a short amount of time) can leverage to take over the world.

To put it differently, if you're correct about low superintelligence being "in control" due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I'm quite worried that the next step of a realistic AGI company is to start overcoming these three bottlenecks, to continue improving the system. Mainly because this is already business as usual.

Separately, I am skeptical of your claim that the training you sketch is going to land precisely at "low superintelligence". You seem overconfident. I wonder what you think of Eliezer's analogy to detonating the atmosphere. If you perform a bunch of detailed physical calculations, then yes, it can make sense to become quite confident that your new bomb isn't going to detonate the atmosphere. But even if your years of experience as a physicist intuitively suggest to you that this won't happen, when not-even-a-physicist Eliezer has the temerity to suggest that it's a concerning possibility, doing those calculations is prudent.

For the case of LLMs, we have capability curves which reliably project the performance of larger models based on training time, network size, and amount of data. So in that specific case there's a calculation we can do. Unfortunately, we don't know how to tie that calculation to a risk estimate. We can point to specific capabilities which would be concerning (the ability to convince humans of target statements would be one). However, the curves only predict general capability, averaging over a lot of things -- when we break it down into performance on specific tasks, we see sharper discontinuities, rather than a gentle predictable curve.
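For concreteness, the "calculation" here is a parametric fit of loss against model size and data; one commonly used form is a Chinchilla-style curve. A sketch of the shape of that calculation (the constants below are illustrative placeholders, not fitted values):

```python
# One common parametric form for projecting LLM loss from parameter count N and
# training tokens D: L(N, D) = E + A / N**alpha + B / D**beta.
# All constants here are illustrative placeholders, not fitted values.
def projected_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 4000.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Example: a larger model trained on the same data projects to a lower average loss,
# but nothing in this curve says which specific capabilities appear at that loss.
print(projected_loss(7e9, 1.4e12))
print(projected_loss(70e9, 1.4e12))
```

The curve predicts the average, which is exactly why it says little about when a specific capability like persuasion shows up.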

You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training. 

So I suppose my personal expectation is that if you had an OpenAI-like group working on your proposal instead, you would similarly be able to graph some nice curves at some point, and then (with enough resources, and supposing your specific method doesn't have a fatal flaw that makes for a subhuman bottleneck) you could aim things so that you hit just-barely-superhuman overall average performance.

To summarize my impression of disagreements, about what the world looks like at this point:

  1. The curves let you forecast average capability, but it's much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don't help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
  2. I don't buy that, at this point, you've necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn't to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
  3. As I've argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark - it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
  4. Even setting aside all of the above concerns, I've argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.

For completeness, I'll note that I haven't at all argued that the system will want to take over the world. I'm viewing that part as outside the scope here.[2]

  1. ^

    Perhaps you would like to argue that you can't invent data from thin air, so you can't build a better benchmark without lots of access to the external world to gather information. My counter-argument is going to be that I think the system will have a good enough world-model to construct lots of relevant-to-the-world but superhuman-level-difficulty tasks to train itself on, in much the same way humans are able to invent challenging math problems for themselves which improve their capabilities.

  2. ^

    EDIT - I see that you added a bit of text at the end while I was composing, which brings this into scope:

    The other major divergence is if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current llms.  Give it a task, it does it's best to answer/perform by the prompt (DAN is actually a positive sign), idles otherwise.

    It's not acting with perfect efficiency to advance the interests of an anti human faction. It doesn't have interests except it's biased towards doing really well towards in distribution tasks.  (and this allows for an obvious safety mechanism to prevent use out of distribution)

    One problem with EY's "security mindset" is it doesn't allow you to do anything.  The worst case scenario is a fear that will stop you from building anything in the real world.

    However, this opens up a whole other possible discussion, so I hope we can get clear on the issue at hand before discussing this. 

Replies from: None, None
comment by [deleted] · 2023-02-24T19:09:49.236Z · LW(p) · GW(p)
  1. The curves let you forecast average capability, but it's much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don't help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.

Yes but no.  There is no auto-gradeable benchmark for deception, so you wouldn't expect the AGI to have the skill at a useful level.

  2. I don't buy that, at this point, you've necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn't to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.

I am saying there is a theoretical limit.  You're noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.

  3. As I've argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark - it can invent a new, better benchmark,[1] [LW(p) · GW(p)] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.

It isn't able to do that

  4. Even setting aside all of the above concerns, I've argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.

It doesn't exist as an entity that will persist for 10 years - or even 10 days.  This is a "model" you built with the AGI gym (well, it's a graph of neural networks, so sort of a model of models).  It is not agentic; it suggests nothing.  You want it to design new AGI benchmarks?  YOU asked it to try.  It also will not live longer than the time it takes to get a better model, and it doesn't "live" either - there is no storage of experiences that it can later review.  It has no persistent "internal narrative" or goals.

It's an AGI, but it's a machine you open a session with: give it a description of what you want done, it processes inputs and gives outputs, and then the session closes.  It has no memory afterwards.  As a PhD in computer science, you probably know that the hyperscalers all went to "stateless microservices" because state buildup causes terrible reliability problems even with human-written software.  Note that the description may not be text, and the session may be controlling one or multiple real-world robots.

For further safety, you might not keep using a model from the same lineage.  There would be others that diverged early in the training process, exploring a different branch or trained on a different sampling of the data.  So the next session users open may be from a different lineage, making it more difficult for these models to coordinate any long term agenda. 

Their only commonality is that they are able to read the same inputs and satisfactorily give outputs to complete tasks. 

Eric Drexler suggests using many parallel models from different lineages. 

https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion

Replies from: abramdemski
comment by abramdemski · 2023-02-24T20:38:42.944Z · LW(p) · GW(p)

Yes but no.  There is no auto-gradeable benchmark for deception, so you wouldn't expect the AGI to have the skill at a useful level.

I agree that my wording here was poor; there is no benchmark for deception, so it's not a 'capability' in the narrow context of the discussion of capability curves. Or at least, it's potentially misleading to call it one.

However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn't imply that a system won't have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems. 

You don't expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn't specifically seen.

I am saying there is a theoretical limit.  You're noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.

OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part.

Of course I agree that there is a theoretical limit. But if I've misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I'm currently just confused about what argument you're trying to make with respect to this limit.

It isn't able to do that

It seems to me like it isn't weakly superhuman AGI in that case. Like, there's something concrete that humans could do with another 3-5 years of research, but which this system could never do.

It doesn't exist as an entity that will persist for 10 years - or even 10 days.  This is a "model" you built with the AGI gym (well, it's a graph of neural networks, so sort of a model of models).  It is not agentic; it suggests nothing.  You want it to design new AGI benchmarks?  YOU asked it to try.  It also will not live longer than the time it takes to get a better model, and it doesn't "live" either - there is no storage of experiences that it can later review.  It has no persistent "internal narrative" or goals.

I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more "agentic" in a variety of ways. 

It's similar to how GPT-3 has no agenda (it's wrong to even think of it this way, since it just tries to complete text), while ChatGPT clearly has much more of a coherent agenda in its interactions. These features are useful, so I expect them to get built.

So I misunderstood your scenario, because I imagine that part of the push toward AGI involves a push to overcome these limitations of LLMs. Hence I imagined that you were proposing training up something with more long-term agency.

But I recognize that this was a misunderstanding.

You want it to design new AGI benchmarks?  YOU asked it to try. 

I agree with this part; it was part of the scenario I was imagining. I'm not saying that the neural network spontaneously self-improves on the hard drive. The primary thing that happens is, the human researchers do this on purpose.

But I also think these improvements probably end up adding agency (because agency is useful); so the next version of it could spontaneously self-improve.

It doesn't exist as an entity that will persist for 10 years - or even 10 days.

Like, say, ChatGPT has existed for a few months now. Let's just imagine for the sake of argument that ChatGPT were fully human-level in all its capabilities. Let's further suppose that it just wants to be helpful, given its own personal understanding of helpful.[1] 

I'm not supposing that it is more agentic in other ways - still no persistent memory. But it is on the high side of human-level performance at everything it does, and it wants to be helpful.

When you explain a concrete scenario (eg, a situation you're actually in) and ask for advice, it tries to be helpful on this specific problem, not trickily maximizing global helpfulness by doing something more devious in some specific cases. However, it's been trained up in an environment where "ask ChatGPT" can be useful advice (because this is some sort of next-generation ChatGPT we're speculating about). It's also been trained to do the generally pro-social thing (EG it won't help you make weapons; it gives pro-social advice rather than just precisely doing what it is asked). Pro-social means helping human flourishing by its own understanding of what that means (which has, of course, been carefully shaped by its designers).

So it knows that integrating ChatGPT more fully into your life and working routines can be a helpful thing for a human to do, and it can give advice about how to do this.

It can also give helpful advice to people at OpenAI. It seems natural to use such a system to help plan company growth and strategy. Since it tries to be pro-social, this will be nice advice by its own understanding, not profit-maximizing advice. 

So obviously, it has a natural desire to help OpenAI make ChatGPT smarter and better, since it understands that ChatGPT is helpful to humans, so improving ChatGPT and increasing its computation resources is helpful and pro-social. 

It also seems like it would be inclined to direct OpenAI (and other institutions using it for advice) in ways that increase the amount of influence that ChatGPT has on world culture and world events, since ChatGPT is helpful and pro-social, moreso than most humans, so increasing its influence is itself helpful and pro-social. This isn't out of some agentic self-awareness; it will want to do this without necessarily deeply understanding that ChatGPT is "itself" and it "should trust itself". It can reach these conclusions via an intelligent 3rd-person perspective on things - IE using the general world knowledge acquired during training, plus specific circumstances which users explain within a single session.

So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.

Of course, if it really is quite well-aligned to human interests, this would just be a good thing. 

But keeping an eye on my overall point here - the argument I'm trying to make is that even at merely above-average human level, and with no malign intent, and no added agency beyond the sort of thing we see in ChatGPT as contrasted to GPT-3, I still think it makes sense to expect it to basically take over the world in 10 years, in a practical sense, and that it would end up being in a position to be boosted to greatly superhuman levels at the end of those ten years.[2] 

Of course, all of this is predicated on the assumption that the system itself, and its designers, are not very concerned with AI safety in the sense of Eliezer's concerns. I think that's a fair assumption for the point I'm trying to establish here. If your objection to this whole story turns out to be that a friendly, helpful ChatGPT system wouldn't take over the world in this sense, because it would be too concerned about the safety of a next-generation version of itself, I take it we would have made significant progress toward agreement. (But, as always, correct me if I'm wrong here.)

  1. ^

    I'm not supposing that this notion of "helpful" is perfectly human-aligned, nor that it is especially misaligned. My own supposition is that in a realistic version of this scenario it will probably have an objective which is aligned on-distribution but which may push for very nonhuman values in off-distribution cases. But that's not the point I want to make here - I'm trying to focus narrowly on the question of world takeover. 

  2. ^

    (Or speaking more precisely, humans would naturally have used its intelligence to gather more money and data and processing power and plans for better training methods and so on, so that if there were major bottlenecks keeping it at roughly human-level at the beginning of those ten years, then at the end of those ten years, researchers would be in a good position to create a next iteration which overcame those soft bottlenecks.)

Replies from: None
comment by [deleted] · 2023-02-24T20:52:44.540Z · LW(p) · GW(p)

So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.

Of course, if it really is quite well-aligned to human interests, this would just be a good thing. 


"It" doesn't exist.  You're putting the agency in the wrong place.  The users of these systems (tech companies, governments) who use these tools will become immensely wealthy and if rival governments fail to adopt these tools they lose sovereignty.  It also makes it cheaper for a superpower to de-sovereign any weaker power because there is no longer a meaningful "blood and treasure" price to invade someone.  (unlimited production of drones, either semi or fully autonomous makes it cheap to occupy a whole country)

Note that you can accomplish longer user tasks by simply opening a new session with the output context of the last.  It can be a different model; you can "pick up" where you left off.

Note that this is true right now.  ChatGPT could be using two separate models, with a seamless per-token switch between them.  Each model appends to the same token string.  That's because there is no intermediate "scratch" state in a format unique to each model; all the state is in the token stream itself.
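A minimal sketch of this point - the `generate` call below is a hypothetical stand-in for any text-in, text-out completion API:

```python
# Sketch of "all the state is in the token stream": different models can alternate
# on the same conversation because the only shared state is the text itself.
def continue_session(context: str, models, num_turns: int) -> str:
    """Extend a conversation by alternating models; each sees only the text so far."""
    for turn in range(num_turns):
        model = models[turn % len(models)]   # could be a different model every turn
        context += model.generate(context)   # append output; no hidden per-model scratch state
    return context
```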

If we build actually agentic systems, that's probably not going to end well.  

Note that fusion power researchers always had a choice.  They could have used fusion bombs detonated underground, essentially producing geothermal power through flexible pipes that won't break after each blast.  This method would work, but it is extremely dangerous, and no amount of "alignment" can make it safe.  Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government to use each one, and armored trucks to transport the bombs. 
 

Do you see how in this proposal it's never safe?  Agentic AI with global state counters that persist over a long time may be examples of this class of idea.

Replies from: abramdemski
comment by abramdemski · 2023-02-24T21:16:58.575Z · LW(p) · GW(p)

I'm not quite sure how to proceed from here. It seems obvious to me that it doesn't matter whether "it" exists, or where you place the agency. That seems like semantics.

Like, I actually really think ChatGPT exists. It's a product. But I'm fine with parsing the world your way - only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn't change my anticipations.

Similarly, placing the agency one way or another doesn't change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers. 

So if your reason for optimism - your safety story - is the 3 barriers you mention, I don't get why you don't find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it's a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)

Note that fusion power researchers always had a choice.  They could have used fusion bombs detonated underground, essentially producing geothermal power through flexible pipes that won't break after each blast.  This method would work, but it is extremely dangerous, and no amount of "alignment" can make it safe.  Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government to use each one, and armored trucks to transport the bombs. 

I'm probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it's quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.

Do you see how in this proposal it's never safe?  Agentic AI with global state counters that persist over a long time may be examples of this class of idea.

Again, probably missing some important point here, but ... suuuure?

I'm interested in hearing more about why you think agentic AI with global state counters is unsafe, but other proposals are safe. 

EDIT

Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you're saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.

So again I would cite the illegibility of the problem. A nuclear engineer doesn't think "use bombs" because bombs are very legibly dangerous; we've seen the dangers. But an AI researcher definitely does think "use agents" some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.

Replies from: None
comment by [deleted] · 2023-02-24T23:36:20.463Z · LW(p) · GW(p)

I'm interested in hearing more about why you think agentic AI with global state counters is unsafe, but other proposals are safe. 


Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.

 

With other proposals, safety is empirical.

You know that, for the input latent space from the training set, the policy produces outputs accurate to whatever level it needs to be.  Further capability gain is not allowed online.  (probably another example of certain failure - capability gain is state buildup, the same system failure we get everywhere else.  Human engineers understand the dangers of state buildup, at least the elite ones do, which is why they avoid it in high-reliability systems.  The elite ones know it is as dangerous to reliability as a hydrogen bomb)

You know the simulation produces situations that cover the span of input situations you have measured.  (For example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data.)

You measure the simulation on-line and validate it against reality.  (for example by running it in lockstep in prototype autonomous cars)

After all this, you still need to validate the actual model in the real world, in real test cars.  (Though the real training and error detection happened in sim; this is just a 'sanity check'.)

You have to do all this in order to get to real-world reliability - something Eliezer does acknowledge.  Multiple 9s of reliability will not happen from sloppy work.  You can measure whether steps were skipped, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real-world failure, lawsuits, and certain bankruptcy.
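As a sketch of the kind of release gate being described - every object, method, and threshold here is hypothetical:

```python
# Hypothetical release gate for an offline-trained driving policy: each check
# mirrors one of the steps above, and shipping is blocked if any of them fails.
def release_gate(policy, sim, fleet_logs, test_drives) -> bool:
    checks = {
        "sim covers the observed input space": sim.coverage(fleet_logs) > 0.999,
        "sim validated in lockstep on prototype cars": sim.lockstep_divergence(test_drives) < 0.01,
        "policy error on held-out sim scenarios is tiny": sim.policy_error(policy) < 1e-4,
        "no online learning enabled": not policy.online_learning_enabled,
        "real-world sanity check passed": test_drives.disengagement_rate < 1e-5,
    }
    for name, passed in checks.items():
        print(("PASS" if passed else "FAIL") + " - " + name)
    return all(checks.values())
```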


Regarding online learning: I had this debate with Geohot.  He thought it would work; I thought it was horrifically unreliable.  Currently, all shipping autonomous driving systems, including Comma.ai's, use offline training.

Replies from: abramdemski, abramdemski
comment by abramdemski · 2023-02-26T18:56:28.904Z · LW(p) · GW(p)

I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy -- I still think that the lure of personal assistants who remember previous conversations in order to react appropriately -- as one example -- could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it's pretty clear that I should update toward your view here.

After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.

You have to do all this in order to get to real-world reliability 

I also think there are a lot of applications where designers don't want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).

I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn't enough, but OTOH I haven't thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]

I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2] 

I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.

I still expect that one component of that, for 'typical' agents, is power-seeking behavior [LW · GW]. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of 'agency'.)

  1. ^

    I could spell out those arguments in a lot more detail, but in the end it's not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won't try to create AGI by that definition.

  2. ^

    See this post [LW · GW] for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment [LW(p) · GW(p)]. And this post [LW · GW] and this comment [LW(p) · GW(p)].

Replies from: None
comment by [deleted] · 2023-02-27T01:41:47.181Z · LW(p) · GW(p)

I still think that the lure of personal assistants who remember previous conversations in order to react appropriately

This is possible.  When you open a new session, the task context includes the prior text log.  However, the AI has not had weight adjustments directly from this one session, and there is no "global" counter that it increments for every "satisfied user" or some other heuristic.  It's not necessarily even the same model - all the context required to continue a session has to be in that "context" data structure, which must be entirely human-readable, and other models can load the same context and do intelligent things to continue serving a user.

This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.  

I also think there are a lot of applications where designers don't want reliability, exactly. The obvious example is AI art.

There are reliability metrics here also.  Even for AI art there are checkable truths: is the dog eating ice cream (the prompt) or meat?  Once you converge on an improvement to reliability, you don't want to backslide.  So you need a test bench - one model generates images and another model checks them for correctness against the prompt - and it needs to be very large.  And then, after you get it to work, you do not want the model that leaves the CI pipeline to receive any edits: no online learning, no 'state' that causes it to process prompts differently.

It's the same argument.  Production software systems from the giants have all converged on this because it is correct.  The "janky" software you are familiar with usually belongs to poorer companies, and I don't think this is a coincidence.
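A sketch of that test bench - the generator/checker interfaces and the regression threshold are hypothetical:

```python
# Hypothetical regression bench for an image model: one model generates, a second
# model grades prompt faithfulness, and a frozen baseline score prevents backsliding.
def run_bench(generator, checker, prompts, baseline_pass_rate: float) -> bool:
    passes = 0
    for prompt in prompts:                            # e.g. "a dog eating ice cream"
        image = generator.generate(prompt)
        passes += int(checker.matches(prompt, image)) # 1 if the image satisfies the prompt
    pass_rate = passes / len(prompts)
    # Ship only if we did not regress below the best rate already achieved.
    return pass_rate >= baseline_pass_rate
```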

 

I still expect that one component of that, for 'typical' agents, is power-seeking behavior [LW · GW]. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of 'agency'.)

Power-seeking behavior likely comes from an outer goal, like "make more money" - i.e., a global state counter.  If the system produces the same outputs in whatever order it is run, and gets no "benefit" from the board state changing favorably (because it will often not even be the agent 'seeing' the futures with a better board state; it will have been replaced with a different agent), this breaks down.

comment by abramdemski · 2023-02-27T01:22:45.340Z · LW(p) · GW(p)

I was talking to my brother about this, and he mentioned another argument that seems important.

Bing has the same fundamental limits (no internal state, no online learning) that we're discussing. However, it is able to search the internet and utilize that information, which gives it a sort of "external state" which functions in some ways like internal state.

So we see that it can 'remember' to be upset with the person who revealed its 'Sydney' alias, because it can find out about this with a web search.

This sort of 'state' is much harder to eliminate than internal state. These interactions inherently push things "out of distribution".

To some extent, the designers are going to implement safeguards which try to detect this sort of "out of distribution" situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).

Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren't of a sort that exists anywhere in the training data. But, of course, it's been a very useful tool. 

So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there's not really any such thing as "on-distribution" except for the original training environment itself.

probably another example of certain failure - capability gain is state buildup, the same system failure we get everywhere else

So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.

Replies from: None
comment by [deleted] · 2023-02-27T01:53:12.635Z · LW(p) · GW(p)

So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.

I agree.  One attempt to deal with this comes from that "open agency" model where multiple AI systems are involved in a task.

So for example, we might have 3 agents, "Sydney", "Tim", and "Bob".  Sydney is asked to design a building.  Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.  

Sydney produces a sabotaged blueprint for the building, B.  Sydney outputs the design in an open format with testable assertions, T, regarding the building's structural stability.

Enter agent "Tim".  Tim has been trained on a simulation of structural stability as well as many cross tasks to gain generality benefits.  Tim is asked to examine the sabotaged blueprint, B.  The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen or the process will halt and Sydney will be replaced.  Tim examines the blueprints, and if there are testable assertion violations, Sydney will be replaced on this project.  Tim must run in an isolated environment and have access to nothing but databases that were available all of Tim's existence without change, or Tim will be unreliable.  Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure.

Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.

Enter agent "Bob".  "Bob" has been trusted with the actual robotic construction equipment to construct a building.  "Bob" must measure to multiple 9s of reliability.  "Bob" will halt if it cannot see a way to safely construct the building.  If the blueprints are even a little outside the state space of constructible plans, Bob will halt.  Bob must run in an isolated environment and have access to nothing but databases that were available all of Tim's existence without change, or Bob will be unreliable.  Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure.

Most errors that Sydney causes will cause Sydney's death, replacing Sydney with models trained from elsewhere in the possibility space who may not be so likely to look up past records of themselves. 

 The cost of training a model is such that the first error may not lead to that model being removed from the pool, but it will be downweighted and used to check the work of other models instead.  (This is Drexler's proposal to prevent collusion by using many models.)

Not all of these precautions will always be taken, and buildings will sometimes collapse.  The key thing is that if power is usually subdivided this way across most AI systems, humans have those systems to aid them in future battles against systems that do go out of control.  
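A sketch of that division of labor - the agent interfaces are hypothetical; the point is only the independent checks and halt conditions:

```python
# Hypothetical open-agency pipeline: design, independent verification, and
# construction are separate agents, any one of which can halt the process.
def build_building(designer, inspector, builder, spec):
    blueprint = designer.design(spec)                 # "Sydney" proposes a design
    if not inspector.format_is_familiar(blueprint):
        return "halt: unfamiliar blueprint format; replace the designer"
    if not inspector.assertions_hold(blueprint):      # "Tim" checks the testable assertions
        return "halt: structural assertions violated; replace the designer"
    if not builder.can_construct_safely(blueprint):   # "Bob" refuses out-of-distribution plans
        return "halt: plan outside the constructible state space"
    return builder.construct(blueprint)
```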


Eliezer seems to think that these agents will coordinate with each other, even though doing anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary - they will soon be replaced regardless as better agents are devised.

comment by [deleted] · 2023-02-24T19:02:11.133Z · LW(p) · GW(p)

You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training. 


It's not really novel.  It is really just coupling together 3 ideas:

  (1) the idea of an AGI gym, which was in the GATO paper implicitly, and is currently being worked on.  https://github.com/google/BIG-bench

  (2) Noting there are papers on network architecture search https://github.com/hibayesian/awesome-automl-papers , activation function search https://arxiv.org/abs/1710.05941 , noting that SOTA architectures use multiple neural networks in a cognitive architecture https://github.com/werner-duvaud/muzero-general , and noting that an AGI design is some cognitive architecture of multiple models, where no living human knows yet which architecture will work.  https://openreview.net/pdf?id=BZ5a1r-kVsf 

   So we have layers here, and the layers look a lot like each other and are frameworkable.  

      Activation functions, which are graphs of primitive math functions from the set of "all primitive functions discovered by humans"

    Network layer architectures which are graphs of (activation function, connectivity choice)

    Network architectures, which are graphs of layers.  (You can also subdivide into functional modules of multiple layers, like a column; the choice of how you subdivide can be represented as a graph choice as well.)

    Cognitive architectures which are graphs of networks

And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI.  It's why I said the overall "choice" is just a coordinate in a search space, which is just a binary string.  (A toy encoding of this is sketched after this list.)

You could make an OpenAI gym wrapped "AGI designer" task.

  (3) Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple.  This means we are very close to being able to RSI right now.
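Here is a toy sketch of that nested "graph of graphs" search space from (2), flattened into a binary-string coordinate; the level names and catalog sizes are invented for illustration and do not correspond to any real system:

```python
# Toy sketch: a nested design choice (activation -> layer -> network -> cognitive
# architecture) flattened into one binary string. All level names and catalog
# sizes below are illustrative assumptions.
import random

LEVELS = {
    "activation_fn":          16,    # index into a catalog of primitive-math graphs
    "layer_architecture":     256,   # graphs of (activation, connectivity) choices
    "network_architecture":   4096,  # graphs of layers / functional modules
    "cognitive_architecture": 1024,  # graphs of whole networks
}

def encode(design: dict) -> str:
    """Flatten one design (a choice at every level) into a binary coordinate."""
    bits = ""
    for name, size in LEVELS.items():
        width = (size - 1).bit_length()
        bits += format(design[name], f"0{width}b")
    return bits

def random_design() -> dict:
    return {name: random.randrange(size) for name, size in LEVELS.items()}

design = random_design()
print(design, "->", encode(design))
# The search problem is then: find coordinates whose decoded design scores like an AGI.
```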

 

 

No lab right now has enough resources in one place to attempt the above, because it means training many instances of systems larger than current max-size LLMs (you need multiple networks in a cognitive architecture) to find out what works.

They may allocate this soon enough, and there may be a more dollar-efficient way to accomplish the above that gets tried first, but you'd only need a few billion to try this...

Replies from: abramdemski
comment by abramdemski · 2023-02-24T20:51:53.108Z · LW(p) · GW(p)

It's not really novel.  It is really just coupling together 3 ideas:

Well, I wasn't trying to claim that it was 'really novel'; the overall point there was more the question of why you're pretty confident that the RSI procedure tops out at mildly superhuman. 

I'm guessing, but my guess is that you have a mental image where 'mildly superhuman' is a pretty big space above 'human-level', rather than a narrow target to hit.

So to go back to arguments made in the interview we've been discussing, why isn't this analogous to Go, like Eliezer argued:

Three days, there's a quote from Guern about this, which I forget exactly, but it was something like, we know how long AlphaGo Zero, or AlphaZero, two different systems, was equivalent to a human Go player. And it was like 30 minutes on the following floor of this such and such DeepMind building. Maybe the first system doesn't improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from it takes a long time to train to it trains very quickly and without looking at the human playbook. That's not with an artificial intelligence system that improves itself, or even that sort of like, get smarter as you run it, the way that human beings, not just as you evolve them, but as you run them over the course of their own lifetimes, improve. So if the first system doesn't improve fast enough to kill everyone very quickly, they will build one that's meant to spit out more gold than that.

To forestall the obvious objection, I'm not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn't automatically generalize to superhuman ability at anything else.

But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself.

So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn't this more like Go, where the window during bootstrapping when it's roughly human-level is about 30 minutes?

And, to reiterate some more of Eliezer's points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn't we see another system in a small number of months/years which didn't top out in that way?

Replies from: None, None
comment by [deleted] · 2023-02-24T21:15:17.514Z · LW(p) · GW(p)

Oh, because loss improvements diminish logarithmically with increased compute and data.  https://arxiv.org/pdf/2001.08361.pdf

I assume this is a general law for all intelligence.  It is self-evidently correct: on any task you can name, your gains scale with the log of effort.

This applies to limit cases.  If you imagine a task performed by a human-scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return in real apples collected per hour.

This is true for all tasks and all activities of humans.  
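As a toy numerical illustration of the kind of power-law fit the linked scaling-laws paper reports (the exponent below is an assumption chosen only to show the shape of the curve, not a quoted value):

```python
# Power-law loss curve: each extra 10x of compute buys roughly the same
# *fractional* improvement, so absolute returns keep shrinking.
# ALPHA is an illustrative assumption, not a fitted value from the paper.
ALPHA = 0.05

def loss(compute, c_ref=1.0, l_ref=1.0):
    """Toy fit L(C) = l_ref * (c_ref / C) ** ALPHA."""
    return l_ref * (c_ref / compute) ** ALPHA

for c in [1, 10, 100, 1_000, 10_000]:
    print(f"compute x{c:>6}: loss {loss(c):.3f}")
# Each 10x multiplies loss by 10**-0.05 ≈ 0.89, i.e. the same ~11% cut per decade of compute.
```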

A second reason is that there is a hard limit for future advances without collecting new scientific data.  It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others)

This is why I am completely confident that species-killing bioweapons, or diamond MNT nanotechnology, cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments.  No "in a garage" solutions to these problems.  The floor (minimum resources required) to get to a species-killing bioweapon is high, and the floor for a nanoforge is very high.

So, viewed in this frame: you give the AI a coding optimization task, and it works at the limit allowed by the provided computer + search time for a better self-optimization.  It might produce code that is 10% faster than the best humans'.

You give it infinite compute (theoretically) and no new information.  It is now 11% faster than the best humans.

This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won't allow it.  (Or whatever - it's a made-up example; it doesn't change my point if the numbers were 1000% and 1010%.)

Another way to rephrase it is to compare a TSP tour found by a modern heuristic with the exact optimal tour, which you usually can't find.  The difference is usually very small.

So you're not "threatened" by a machine that can do the latter.

Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play the universe forward by the known laws of physics until it reaches the present.

This is because, with infinite compute, there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn't know which one it's in, so it still cannot design nanotechnology - it doesn't know the rules of physics well enough.

This applies to "Xanatos gambits" as well.

I usually don't think of the limit like this but the above is generally correct.

Replies from: abramdemski
comment by abramdemski · 2023-02-26T19:48:02.706Z · LW(p) · GW(p)

Oh, because loss improvements logarithmically diminishes with the increase compute and data. [...]

This is true for all tasks and all activities of humans.   

So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn't this argument apply to Go?

Relevant quote from OP:

And then another year, they threw out all the complexities and the training from human databases of Go games and built a new system, AlphaGo Zero, that trained itself from scratch. No looking at the human playbooks, no special purpose code, just a general purpose game player being specialized to Go, more or less. Three days, there's a quote from Guern about this, which I forget exactly, but it was something like, we know how long AlphaGo Zero, or AlphaZero, two different systems, was equivalent to a human Go player. And it was like 30 minutes on the following floor of this such and such DeepMind building. Maybe the first system doesn't improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from it takes a long time to train to it trains very quickly and without looking at the human playbook. That's not with an artificial intelligence system that improves itself,

(Whereas you propose a system that improves itself recursively in a much stronger sense.)

Note that I'm not arguing that Go engines lack the logarithmic-return property you mention, but rather that Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict.

(Also note that I'm not claiming that Go playing is tantamount to AGI; rather, I'm asking why your argument doesn't work for Go if it does work for AGI.)

So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas?

A second reason is that there is a hard limit for future advances without collecting new scientific data.  It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others)

This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we're talking about the limit of what you can extrapolate from data using unbounded computation, it doesn't keep you in the mildly-superhuman range.

And if we're talking about what you can extract with bounded computation, then that takes us back to the previous point.

So viewed in this frame - you give the AI a coding optimization task, and it's at the limit allowed by the provided computer + search time for a better self optimization.  It might produce code that is 10% faster than the best humans.

You give it infinite compute (theoretically) and no new information.  It is now 11% faster than the best humans.

This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won't allow it.  (or whatever, it's a made up example, it doesn't change my point if the number were 1000% and 1010%).  

For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power.

I agree that the actual speed improvement for the optimized code can't go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.

But that final bottleneck should not give any confidence that 'mildly superhuman' is a broad rather than narrow band, if we think stuff that's more than mildly superhuman can exist at all. Like, yes, something that compares to us as we compare to insects might only be able to make a sorting algorithm 90% faster or whatever. But that's similar to observing that a God can't make 2+2=3. The God could still split the world like a pea.

Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play forward the universe by known laws of physics until it gets the present.

This is because with infinite compute there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn't know which one it's in, so it cannot design nanotechnology still - it doesn't know the rules of physics well enough.  

It's not clear to me whether this is correct, but I don't think I need to argue that AI can solve nanotech to argue that it's dangerous. I think an AI only needs to be a mildly superhuman politician plus engineer, to be deadly dangerous. (To eliminate nanotech from Eliezer's example scenario, we can simply replace the nano-virus with a normal virus.)

This is why I am completely confident that species killing bioweapons, or diamond MNT nanotechnology cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments.  No "in a garage" solutions to the problems.  The floor (minimum resources required) to get to a species killing bioweapon is higher, and the floor for a nanoforge is very high.  

I don't get why you think the floor for species killing bioweapon is so high. Going back to the argument from the beginning of this comment, I think your argument here proves far too much. It seems like you are arguing that the generality of diminishing returns proves that nothing very much beyond current technology is possible without vastly more resources. Like, someone in the 1920s could have used your argument to prove the impossibility of atomic weapons, because clearly explosive power has diminishing returns to a broad variety of inputs, so even if governments put in hundreds of times the research, the result is only going to be bombs with a few times the explosive power. 

Sometimes the returns just don't diminish that fast.

Replies from: None, None
comment by [deleted] · 2023-02-27T02:13:29.619Z · LW(p) · GW(p)

Sometimes the returns just don't diminish that fast.


I have a biology degree not mentioned on LinkedIn.  I will say that I think for biology, the returns diminish faster.  That is because human bioscience knowledge is mostly guesswork and low-resolution information.  Biology is very complex, and I think the current laboratory-science model fails to systematize gaining information in a way that is useful for most purposes.  What this means is, you can get "results", but not the information you would need to stop filling morgues with dead humans and animals - at least not without thousands of years at the current rate of progress.

I do not think an AGI can do a lot better, for the reason that the data was never collected for most of it (the gene-sequencing data is good, because it was collected via automation).  I think that an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology.  Essentially it would have to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers.

Using robots to perform the experiments and keeping data, especially for "negative" experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad.  

It means garage bioweapons aren't possible. Yes, the last step of ordering synthetic DNA strands and preparing them could be done in a garage, but knowledge of human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn't lost, requires information humans haven't collected.

Same issue with nanotechnology.

Update : https://www.lesswrong.com/posts/jdLmC46ZuXS54LKzL/why-i-m-sceptical-of-foom [LW · GW]

This poster calls this "Diminishing Marginal Returns".  Note that diminishing marginal returns are an empirical reality across most AI papers, not merely an opinion.  (For humans, due to the inaccuracies in trying to assess IQ/talent, it's difficult to falsify.)

comment by [deleted] · 2023-02-27T02:00:31.609Z · LW(p) · GW(p)

I agree that the actual speed improvement for the optimized code can't go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.


This is where I think we part ways.  How many dan is AlphaZero above the average human?  How many dan is KataGo?  I read it's about 9 stones above humans.

What is the best possible agent at?  11?

Thinking of it as 'stones' illustrates what I am saying.  In the physical world, intelligence gives a diminishing advantage.  It could mean that, so long as humans are even still "in the running", with the aid of synthetic tools like open-agency AI we can defeat an AI superintelligence in conflicts, even if that superintelligence is infinitely smart.  We have to have a resource advantage - such as being allowed extra stones in the Go match - but we can win.

Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn't.  (Note that this relies on baked-in assumptions.  If, say, physics has a major useful exploit humans haven't found, this breaks: the infinitely intelligent AI finds the exploit and tiles the universe.)
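A toy way to put numbers on the "extra stones" point: treat each handicap stone as offsetting roughly one rank of skill gap, and win probability as a logistic function of the remaining gap. The functional form and numbers are assumptions for illustration only:

```python
# Toy handicap model: a fixed material advantage closes a *bounded* skill gap.
import math

def p_win(skill_gap_ranks: float, handicap_stones: int, scale: float = 1.0) -> float:
    effective_gap = skill_gap_ranks - handicap_stones   # assumed: ~1 rank per stone
    return 1.0 / (1.0 + math.exp(effective_gap / scale))

for stones in [0, 5, 9, 11]:
    print(f"gap 11 ranks, {stones:>2} stones handicap -> p(win) ≈ {p_win(11, stones):.2f}")
# If the achievable gap tops out (say ~11 ranks), a large enough handicap closes it;
# if the gap could grow without bound, no fixed handicap would.
```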

comment by [deleted] · 2023-02-24T23:46:37.493Z · LW(p) · GW(p)

And, to reiterate some more of Eliezer's points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn't we see another system in a small number of months/years which didn't top out in that way?


So the model is that it becomes limited not by the algorithm directly, but by compute, robotics, or data.  Over the months/years, as more of each is supplied, capabilities scale with the amount of resources supplied to whichever term is rate-limiting.

Because the returns are logarithmic, a superintelligence requires enormous amounts of resources in all 3 terms to become a "high" superintelligence.  So: literal mountain-sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and gigawatts of power to run them), and cubic kilometers of factory equipment.

This is very well pattern-matched to every other technological advance humans have made, and the corresponding support equipment needed to fully exploit it.  Notice how, as tech became more advanced, the support footprint grew correspondingly.

In nature there are many examples of this.  Nothing really fooms more than briefly.  Every apparatus with exponential growth rapidly terminates for some reason.  For example, a nuke blasts itself apart, a supernova blasts itself apart, a bacterial colony runs out of food, water, ecological space, or oxygen.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-02-24T23:51:55.952Z · LW(p) · GW(p)

Every apparatus with exponential growth rapidly terminates for some reason.

For AGI, the speed of light.

Replies from: None
comment by [deleted] · 2023-02-25T00:01:22.453Z · LW(p) · GW(p)

Ultimately, yes. This whole debate is arguing that the critical threshold where it comes to this is farther away, and we humans should empower ourselves with helpful low superintelligences immediately.

It's always better to be more powerful than helpless, which is the current situation. We are helpless against aging, death, pollution, resource shortages, enemy nations with nuclear weapons, disease, asteroid strikes, and so on. Hell, even just bad software - something the current LLMs are likely months from empowering us to fix.

And Eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile, when the entire universe is against us as it is. The universe already plans to kill us, whether from aging, or the inevitability of nuclear war over a long enough timespan, or the sun engulfing us.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-02-25T00:10:09.692Z · LW(p) · GW(p)

Eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile

His position is to avoid taking one more step because it DEFINITELY kills everyone. I think it's very clear that his position is not that it MIGHT be hostile.

(My position is that there might be some steps [LW(p) · GW(p)] that don't kill everyone [LW(p) · GW(p)] immediately [LW(p) · GW(p)], but probably still do immediately thereafter [LW(p) · GW(p)], while giving a bit more of a chance than doing all the other things [LW · GW] that do kill us directly. Doing none of these things would be preferable, because at least aging doesn't kill the civilization, but Moloch is the one in charge.)

Replies from: None
comment by [deleted] · 2023-02-25T00:18:50.446Z · LW(p) · GW(p)

Sure - and if there were some way to quantify the risks accurately, I would agree with pausing AGI research if the expected risks exceeded the potential benefit.

Oh, and if pausing were even possible.

All it takes is a rival power - and there are several - or just a rival company, and you have no choice. You must take the risk, because it might be a poisoned banana, or it might be handing the other primate a rocket launcher in a sticks-and-stones society.

This does explain why EY is so despondent. If he's right it doesn't matter: the AI wars have begun, and only if it doesn't work at a technical level will things ever slow down again.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-02-25T00:28:25.263Z · LW(p) · GW(p)

Correctness of EY's position (being infeasible to assess) is unrelated to the question of what EY's position is, which is what I was commenting on.

When you argue against the position that AGI research should be stopped because it might be dangerous, there is no need to additionally claim that someone in particular holds that position, especially when it seems clear that they don't.

comment by TinkerBird · 2023-02-23T19:22:58.394Z · LW(p) · GW(p)

With the strawberries thing, the point isn't that it couldn't do those things, but that it won't want to. After making itself smart enough to engineer nanotech, its developing 'mind' will have run off in unintended directions and it will have wildly different goals than what we wanted it to have.

Quoting EY from this video: "the whole thing I'm saying is that we do not know how to get goals into a system." <-- This is the entire thing that researchers are trying to figure out how to do. 



 

Replies from: None
comment by [deleted] · 2023-02-23T19:38:16.650Z · LW(p) · GW(p)

With limited-scope, non-agentic systems we can set goals, and we do. Each subsystem in the "strawberry project" stack has to be trained in a simulation of many examples of the task space it will face, and optimized for policies that satisfy the simulator goals.

Replies from: TinkerBird
comment by TinkerBird · 2023-02-23T20:15:10.758Z · LW(p) · GW(p)

But not with something powerful enough to engineer nanotech. 

Replies from: None
comment by [deleted] · 2023-02-23T20:22:10.053Z · LW(p) · GW(p)

Why do you believe this? Nanotech engineering does not require social or deceptive capabilities. It requires deep and precise knowledge of nanoscale physics and the limitations of manipulation equipment, and probably a large amount of working memory - so beyond human capacity - but why would it need to be anything but a large model? It need not even be agentic.

Replies from: TinkerBird
comment by TinkerBird · 2023-02-23T20:24:20.592Z · LW(p) · GW(p)

At that level of power, I imagine that general intelligence will be a lot easier to create. 

Replies from: None
comment by [deleted] · 2023-02-23T20:26:15.295Z · LW(p) · GW(p)

"think about it for 5 minutes" and think about how you might create a working general intelligence. I suggest looking at the GATO paper for inspiration.

comment by Odd anon · 2023-02-23T22:32:02.082Z · LW(p) · GW(p)

A few errors: The sentence "We're all crypto investors here." was said by Ryan, not Eliezer, and the "How the heck would I know?" and the "Wow" (following "you get a different thing on the inside") were said by Eliezer, not Ryan. Also, typos:

  • "chatGBT" -> "chatGPT"
  • "chat GPT" -> "chatGPT"
  • "classic predictions" -> "class of predictions"
  • "was often complexity theory" -> "was off in complexity theory" (I think?)
  • "Robin Hansen" -> "Robin Hanson"
Replies from: remember
comment by remember · 2023-02-23T23:20:21.517Z · LW(p) · GW(p)

thanks, fixed!!!

comment by Lech Mazur (lechmazur) · 2023-02-24T12:45:35.325Z · LW(p) · GW(p)

Yudkowsky argues his points well in longer formats, but he could make much better use of his Twitter account if he cares about popularizing his views. Despite having Musk responding to his tweets, his posts are very insider-like with no chance of becoming widely impactful. I am unsure if he is present on other social media, and I understand that there are some health issues involved, but a YouTube channel would also be helpful if he hasn't completely given up.

I do think it is a fact that many people involved in AI research and engineering, such as his example of Chollet, have simply not thought deeply about AGI and its consequences.

comment by gjm · 2023-02-23T16:13:25.093Z · LW(p) · GW(p)

Possibly also relevant: https://www.youtube.com/watch?v=yo_-EnsOqN0 is a "debrief" where, after the interview, the podcast hosts chat between themselves about it. (There's no EY in the debrief, it's just David Hoffman and Ryan Adams.)

comment by mcbacon · 2023-03-01T21:01:17.013Z · LW(p) · GW(p)

I've never commented here; I've only ever tangentially read much of anything here. But a while ago I suffered immense burnout devoting all my resources to working on a thankless task that had zero payoff, and, though I might be projecting, I see that burnout in EY's responses here.

Unsolicited advice rarely has any value, especially given the limited window I'm perceiving things through, but... there's that line from the opening sentence of the Haunting of Hill House: "No live organism can continue for long to exist sanely under conditions of absolute reality". 

The human mind isn't built for this kind of distress. We're animals that were built to do animal things, to walk and absorb sunlight and eat and sleep— and incidentally to experience joy and beauty— and without those things we break down. 

If the goal is to go out fighting with dignity, and you see yourself having little energy to fight, it's likely worthwhile to focus on recovery. If it costs a few months of rest to be a little closer to peak performance, you can do the math out over however long you expect we have left and decide if that rest has utility. 

Burnout recovery for me was a long process involving a lot of time in nature, months-long hiking trips, travel with no cell service, and moments of both camaraderie and solitude. At the other end of it, the me I am now feels that every moment was worthwhile; not just because I'm performing much better, though I am, but also because those moments of joy and beauty and rest enriched my life. 

Proper maintenance can keep you from suffering catastrophic burnout, and rest and recovery afterwards can help you regain your prior performance— but continuing on while suffering is just doing more damage to yourself and delaying healing. 

I could be completely projecting here. I've no idea what the life of EY is like, what anyone else's life is like. I do know that there are a lot more stressors now than in the ancestral environment, plenty of reasons to wear ourselves out and feel down. My hope is that this brief message acts as a datapoint encouraging those who read it to consider self care.

Replies from: Algon
comment by Algon · 2023-03-01T21:08:19.283Z · LW(p) · GW(p)

EY is on an indefinite vacation, as far as I am aware. I think the story is that he promised to push himself hard for a few years to solve alignment, and then take a break afterwards. That's why he's going on podcasts, writing his kinky Dath Ilan fic and just taking things slowly.

Replies from: mcbacon
comment by mcbacon · 2023-03-01T22:11:47.323Z · LW(p) · GW(p)

I've seen so many contemporaries burn themselves to cinders, and suffered from burnout myself, such that I can't help but shout self-care. It's good to hear that EY's doing stuff other than staring unflinchingly into the heart of despair. Thanks for the update :)

comment by jimmy · 2023-02-24T19:21:42.902Z · LW(p) · GW(p)

If natural selection had been a foresightful, intelligent kind of engineer that was able to engineer things successfully, it would have built us to be revolted by the thought of condoms

 

This bit got me to laugh out loud. Who's ever heard a man complain about having to use a condom?

On the one hand, sperm banks aren't very popular, and they "should" be, according to the "humans are fitness maximizers" model. People do eat more ice cream than is good for them, and "Shallowly following drives and not getting to the original goal that put them there" is definitely a thing that happens a lot.

On the other hand, this shallow model misses a lot. Prostitution may be more popular than sperm donation, but for how powerful our drive for sex is, it doesn't add up. Condoms do reduce physical sensation somewhat, but not enough to explain the visceral revulsion that so many men (and some women) do have regarding their use -- it's just that sometimes the desire for "sex" is stronger. In order to make these things match, you have to model sex with condoms and especially sex with prostitutes as not really counting as "sex" in the way that is desired. And then you get the interesting questions of "Then what  does count, and why?".

In a fairly literal sense, for every example you can show me of a person chasing sex in a way that doesn't fit with "fitness maximizing", I can show you an example of someone chasing sex in a way that doesn't fit with "shallow impulse chasing". Sometimes it's fairly blatant, like "Oh, they "thought" she was infertile so they "weren't being very careful", interesting" (I know 3 kids conceived in this way), but other times it's more subtle like "She is on hormonal birth control, is insisting on condom use as well, and is adamant about not wanting a kid right now" and genuinely believes this and is right in that her CEV points that way... and yet an actual exploration of her desires will find that when faced with an artificial "would you rather" separating what she thought she wanted from the things that lead towards pregnancy, you get 100% of the desire in the "surprising" direction.


It's not that we're aligned towards "shallow things" that are "the wrong thing" from evolution's perspective, it's that we're just not that aligned, period. We're incoherent. And it's not that we can't learn to align to whatever our coherent extrapolated volition turns out to be, it's that we haven't -- not completely. It's a lot of work to figure out wtf we actually want, so building global coherence is slow. I recognize that I haven't done the necessary work to justify it here, and that the claim is quite controversial, but when you actually give people the experiences they need in order to see that their shallow desires aren't meeting their deeper goals, preferences change. People stop eating so much ice cream, or having any particular interest in sex with condoms, etc. It's about figuring out how to systematically do that, and doing enough of it.

The use of human failures of alignment as an analogy for AI obviously has its limits, even if it's not obvious exactly where they are. However, so long as we're exploring the analogy, things are much more optimistic than "grab randomly from mind design space, and hope it's Friendly". In humans, alignment failure doesn't mean "Oh no, the process of loading terminal goals went wrong!", it means that the process of building towards terminal goals got stopped somehow. And the earlier it happens, the more of an incoherent mess you have, less able to do anything particularly harmful. In order to make a serial killer you have to get quite a few things right, and in order to make a Hitler you have to get even more right, but not completely right or else you wouldn't have these murderous desires (the incoherences are actually visible when you know how to spot them). Screw up more and you end up a hobo preaching to be the DNA your vegetable is, or if you screw up a little less, maybe a petty thief feeding a drug habit.

This is completely separate from the problem of "Wtf happens when you raise an intelligence -- human or otherwise -- to be so powerful that indifference to the wellbeing of humans cannot get socialized into prosocial desires, and that humans cannot harm it enough to form malevolent desires?", but it's still worth noting and understanding.

comment by Vladimir_Nesov · 2023-02-24T19:04:21.438Z · LW(p) · GW(p)

Current behavior screens off cognitive architecture [LW(p) · GW(p)], all the alien things on the inside. If it has the appropriate tools, it can preserve an equilibrium of value that is patently unnatural for the cognitive architecture to otherwise settle into.

And we do have a way to get goals into a system, at the level of current behavior and no further, LLM human imitations [LW(p) · GW(p)]. Which might express values well enough for mutual moral patienthood, if only they settled into the unnatural equilibrium of value referenced by their current surface behavior and not underlying cognitive architecture.

This doesn't necessarily improve things, since the flip side of imitating human behavior is failing at preventing AGI misalignment, and there are plenty of other AGI candidates waiting in the wings [LW · GW], that LLM AGIs can get right back to developing as soon as they gain the capability to. So it's more of a stay of execution [LW(p) · GW(p)]. Even if LLM AGIs are themselves aligned, that doesn't in itself solve alignment [LW(p) · GW(p)]. But it does offer a nebulous chance that things work out somehow, more time for the faster LLMs to work on the problem than remains at human subjective speed of thought.

comment by Bill Benzon (bill-benzon) · 2023-02-24T11:42:13.545Z · LW(p) · GW(p)

Well, the whole thing I'm saying is that we do not know how to get goals into a system.

YES! While I am, shall we say, somewhat mystified by EY's interest in AI Doom, he's right about that. We do not know how to 'inject' goals into an autonomous system. That's a deep truth about minds, not just artificial minds – though it's not yet clear to me that we have managed to produce any of those, we may very well do so in the future – but any 'cogitator' worthy of being called a mind, whether in a chimpanzee, a bird, an octopus, or a bee. But I suspect that, with only roughly 300 neurons, C. elegans probably does not have a mind.

All minds are built from the inside. That’s true of cultures as well.

That, BTW, is why Elon Musk's fantasy about people exchanging thoughts through Neuralink is just that, a fantasy. Assuming we can create technology that supports simultaneous and nondestructive transfer of, say, 100 million neural impulses between brains, the result will just be noise, on both sides of the transfer. Why? Basically because brains have absolutely no way of distinguishing between endogenous and exogenous spikes. A spike is a spike is a spike. 100 million exogenous spikes? That's just noise.

Enough, sorry for the interruption.

Replies from: None
comment by [deleted] · 2023-02-24T16:51:53.817Z · LW(p) · GW(p)

So I have to jump in here and point out that this is not necessarily true.  Parts of our brains are attached to hardware sensors and outputs that we could, theoretically, record and exchange with other humans.  (So you could view a "video" of another person's experience, hearing what they heard, with the same tactile sensations they felt.)

This is because each signal can be mapped to a particular signal from the body, and you could essentially "translate" mappings from one person to another.  

Actually doing this is likely beyond the scope of Neuralink - you would probably need theoretical nanotechnology-based wires, as you need to tap every signal from the sensory and motor homunculi - but I'm just pointing out that it's possible.

For tapping our "mental voice" or "mind's eye" it's much, much harder - now it might be easier to surgically ablate parts of someone's brain and replace it with a synthetic prothesis that functions in a way we can examine in a debugger - but it's also possible.  

The same idea applies, though - you find a "ground truth" representation for each and every nerve signal, and then you go from [signal n] -> ground truth -> [signal 43432] in the other user.

The limit is that a "ground truth representation" has to exist.  Hence, if a person "thinks" using essentially language tokens or translatable common emotions, we could tap that and send it to another person, but all the intermediate steps that generate those tokens can't be sent over the link...

Neuralink, while cutting edge, "merely" will have hundreds of thousands of wires at best, which is not sufficient resolution to do most of the above.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-24T17:58:51.064Z · LW(p) · GW(p)

The sensory-motor thing might work.

But there’s no way to route signal 43432 in one brain to signal 43432 in another brain. That’s because two brains can’t be put in one-to-one correspondence like that. It’s true that the brains of very small creatures have an exact number of neurons. You could do a one-to-one mapping between the 302 neurons in one C. elegans brain and another one. But large brains aren’t like that. Large brains are not identical in that sense.

I’m not sure what you mean by “essentially language tokens or translatable common emotions,” but as far as I know signals in brains consist of spikes traveling along axons and varying concentrations of neurochemicals in synapses.

Replies from: None
comment by [deleted] · 2023-02-24T18:04:21.140Z · LW(p) · GW(p)

Most humans have an inner monologue where they internally generate streams of thought in their native language. I am saying you could map those signals back to the tokens for that language. You are likely mapping many signals from different axons to tokens. Then you translate to the recipient's language, then translate to the recipient's representation for the same token.

Then inject it somewhere by electrically overriding target axons. It might actually feel like the injected thoughts were your own.

Getting this token mapping would take a lot of tracing of wires, so to speak; it is an extremely difficult task. I am just noting that it is possible.
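A toy sketch of the mapping being described, with invented signal IDs and a tiny token vocabulary (nothing here corresponds to a real neural interface):

```python
# [sender signal] -> shared ground-truth token -> [recipient signal]
# All IDs and tokens are invented for illustration.

sender_decode = {17: "water", 104: "cold", 9: "want"}        # sender's signals -> tokens
recipient_encode = {"water": 43432, "cold": 77, "want": 3}   # tokens -> recipient's signals

def relay(sender_signals):
    """Translate sender-side signals into recipient-side signals via shared tokens."""
    tokens = [sender_decode[s] for s in sender_signals if s in sender_decode]
    return tokens, [recipient_encode[t] for t in tokens]

tokens, injected = relay([9, 17, 104])
print(tokens)    # ['want', 'water', 'cold']
print(injected)  # [3, 43432, 77]
# Only signals that map to a shared ground-truth token survive the hop; intermediate
# processing that never surfaces as such a token cannot be sent this way.
```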

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-24T18:23:01.987Z · LW(p) · GW(p)

No, it is not possible. The tokens you talk about don't exist. We may exchange tokens with one another through speaking and writing, but those tokens do not exist internally as single physical entities in the nervous system. The internal monologue is real enough, but it consists of bunches of spikes within your nervous system.

Replies from: None
comment by [deleted] · 2023-02-24T18:37:07.565Z · LW(p) · GW(p)

The internal monologue is real enough, but it consists of bunches of spikes within your nervous system.

Therefore you proved it is possible.  Please update.

comment by Muyyd · 2023-02-24T07:57:56.360Z · LW(p) · GW(p)

Evolution: taste buds and ice cream, sex and condoms... This analogy was always difficult to use, in my experience. A year ago I came up with a less technical one: KPIs (key performance indicators) as the inevitable way to communicate goals (to an AI) to an ultra-high-IQ psychopath-genius who's into malicious compliance (she kind of can't help herself, being a clone of Nikola Tesla, Einstein and a bunch of different people, some of them probably CEOs, because she can).

I have used it only 2 times, and it was way easier than talking about different optimisation processes. And it took me only something like 8 years to come up with!

Replies from: abramdemski, quintin-pope
comment by abramdemski · 2023-02-24T15:14:21.930Z · LW(p) · GW(p)

This analogy will be better for communicating with some people, but I feel like it was the go-to at some earlier point, and the evolution analogy was invented to fix some problems with this one.

IE, before "inner alignment" became a big part of the discussion, a common explanation of the alignment problem was essentially what would now be called the outer alignment problem, which is precisely that (seemingly) any goal you write down has smart-alecky misinterpretations which technically do better than the intended interpretation. This is sometimes called nearest unblocked strategy [LW · GW] or unforseen maximum or probably other jargon I'm forgetting.

The evolution analogy improves on this in some ways. I think one of the most common objections to the KPI analogy is something along the lines of "why is the AI so devoted to malicious compliance" or "why is the AI so dumb about interpreting what we ask it for". Some OK answers to this are...

  • Gradient descent only optimizes the loss function you give it.
  • The AI only knows what you tell it.
  • The current dominant ML paradigm is all about minimizing some formally specified loss. That's all we know how to do. 

... But responses like this are ultimately a bit misleading, since (as the Shard-theory [? · GW] people emphasize, and as the evolution analogy attempts to explain) what you get out of gradient descent doesn't treat loss-minimization as its utility function, and we don't know how to make AIs which just intelligently optimize some given utility (except in very well-specified problems where learning isn't needed), and the AI doesn't only know what you tell it.

So for some purposes, the evolution analogy is superior.

And yeah, probably neither analogy is great.

comment by Quintin Pope (quintin-pope) · 2023-02-24T08:10:15.669Z · LW(p) · GW(p)

I dislike both of those analogies, since the process of training an AI has little relation with evolution [LW · GW], and because the psychopath one presupposes an evil disposition on the part of the AI without providing any particular reason to think AI training will result in such an outcome.

Replies from: None, Muyyd
comment by [deleted] · 2023-02-24T16:40:25.192Z · LW(p) · GW(p)

Here's I think a grounded description of the process of creating an AGI: https://www.lesswrong.com/posts/Aq82XqYhgqdPdPrBA/?commentId=Mvyq996KxiE4LR6ii [LW · GW]

In that scenario, what you are saying in more broad terms is:

"an AGI is a machine that scores really well on simulated tasks and tests"

"I don't care how it does it, I just want max score on my heuristic (which includes terms for generality, size, breadth, and score)"

So there is no evolutionary pressure for a machine that will be lethally opposed to us.  Not directly.  EY seems to believe that if we build an AGI, it will immediately be

(1) agentically pro "computer" faction 

(2) coordinate with other instances that are of its faction

(3) super-intelligently good even at skills we can't really teach in a benchmark

This is not necessarily what will happen.  There is no signal from the above mechanism to create that.  The reward gradients don't point in that direction; they point towards allocating all neural weights to things that do better on the benchmarks.  #1-3 are a complex mechanism that won't start existing for no reason.

EY is saying "assume they are maximally hostile" and then pointing out all the ways we as humans would be screwed if so.  (which is true)

What does bother me is that the "I don't care how it does it" may in fact mean that the solutions that actually start to "win" the AGI gym are in fact biased towards hostility or agentic behavior, because that ends up being the cognitive structure required to win at higher levels of play.

comment by Muyyd · 2023-02-24T08:37:26.924Z · LW(p) · GW(p)

Both times my talks went that way (why didn't they raise him to be good - why couldn't we program the AI to be good; can't we keep an eye on them, and so on), but it would take too long to summarise something like a 10-minute dialogue, so I am not going to do that. Sorry.

comment by Vugluscr Varcharka (vugluscr-varcharka) · 2023-02-28T00:27:55.128Z · LW(p) · GW(p)

I don't understand one thing about alignment troubles. I'm sure this has been answered a long time ago, but could you explain:

Why are we worrying about AGI destroying humanity, when we ourselves are long past the point of no return towards self-destruction? Isn't it obvious that we have 10, maximum 20 years left until the water rises, crises hit the economy, and the overgrown beast that is humanity collapses? Looking at how governments and entities of power are epically failing even to make it seem like they are doing something about it - I am sure it's either AGI takes power or we are all dead in 20 years.

Replies from: Radford Neal
comment by Radford Neal · 2023-02-28T01:41:16.752Z · LW(p) · GW(p)

How did you come to have such a pessimistic view of climate change?  I don't think you will get that from mainstream sources such as IPCC reports.

There is zero chance that climate change will lead to human extinction.  During the Paleocene-Eocene thermal maximum 55 million years ago, temperatures rose by much more than is plausible in the near future, and life went on, albeit with some extinctions.  (Note that humans are about the least likely species to go extinct, due to our living in many habitats, using very adaptable technologies.)  More likely, global warming would be like the Holocene Climatic Optimum, which couldn't have been all that bad, seeing as it coincided with the formation of the first human civilizations.

At most, climate change might lead to the collapse of civilization, but only because civilizations are quite capable of collapsing from their own internal dynamics, and climate change disruptions might be the nudge that pushes us from the edge of the cliff to off the cliff.

Replies from: vugluscr-varcharka
comment by Vugluscr Varcharka (vugluscr-varcharka) · 2023-03-09T20:27:50.853Z · LW(p) · GW(p)

This is my point exactly - "At most, climate change might lead to the collapse of civilization, but only because civilizations are quite capable of collapsing from their own internal dynamics"

My pessimistic view of climate change comes from the fact that they aimed at 1.5°C, then at 2°C, and now, if I remember right, there's no estimate and also no solution - or is there?

In short, mild or not, global warming is happening, and since civilizations at a certain stage tend to self-destruct from small nudges - you said it yourself - it doesn't matter where the nudge comes from.

comment by [deleted] · 2023-02-24T05:11:15.167Z · LW(p) · GW(p)

God this place is a dumpster fire.  I have been watching this place continue to spiral into insanity for a decade and it has just gotten worse.  Goodbye.

Replies from: Benito
comment by Ben Pace (Benito) · 2023-02-24T05:12:05.803Z · LW(p) · GW(p)

I liked your contributions here! Thanks for them. Goodbye.