Thoughts after the Wolfram and Yudkowsky discussion

post by Tahp · 2024-11-14T01:43:12.920Z · LW · GW · 5 comments

I recently listened to the discussion between Wolfram and Yudkowsky about AI risk. In some ways this conversation was tailor-made for me, so I'm going to write some things about it and try to get them out in one day instead of letting them sit in my drafts for three weeks as I tend to do. Wolfram has lately been obsessed with fundamental physics, which is a special interest of mine. Yudkowsky is one of the people thinking most carefully about powerful AI, which I think will kill us all, and I'd like to firm up that intuition. Throw them on a podcast for a few hours, and you have my attention.

That said, for the first hour I was just incredibly frustrated. Wolfram kept running down rabbit holes that were basically "aha! You haven't thought about [thing Yud wrote ten thousand words on in 2008]!" But a miracle happens somewhere in the second hour, and Wolfram starts asking actually relevant questions! His framework of small accidental quirks in machine learning algorithms leading to undesired behavior later was basically a real issue. It was kind of a joy listening to two smart people trying to get on the same page. Wolfram starts out bogged down in minutiae about what 'wanting' is and whether it constitutes anthropomorphism, but eventually finds a more abstract framing about steering toward goals and tries to see Yudkowsky's point in terms of the relative dangers of different regions of the space of goals under sufficient optimization. The abstraction was unfortunate in some ways, because I was interested in some of the minutiae once they were both nearly talking about the same thing. But if Wolfram had kept running down rabbit holes like "actually quarks have different masses at different energy scales" whenever Yudkowsky said something like "the universe runs on quarks everywhere all at once no matter what we think the laws of physics are," they were never going to get to any of the actual arguments. That said, I don't see how Wolfram got anywhere close to the actual point at all; maybe the rabbit holes were necessary to get there.

My impression was that Yudkowsky was frustrated that he couldn't get Wolfram to say, "actually everyone dying is bad and we should figure out whether that happens from our point of view." There was an interesting place where something like this played out during one of Wolfram's physics detours. He said something I agree with, which is that the concept of space is largely one we construct, and even a change of perception as small as "think a million times faster" could break that construct. He argued that an AI might have a conception of physics which is totally alien to us and also valid. However, he then said it would still look to us like it was following our physics, without making the (obvious to me) connection that we could just consider it in our reference frame if we want to know whether it kills us. This was emblematic of several rabbit holes. Yudkowsky would say something like "AI will do bad things" and Wolfram would respond with something like "well, what is 'bad' really?" It would have been, in my view, entirely legitimate to throw out disinterested empiricism and just say: from our point of view, we don't want to all die, so let's figure out whether that happens. We might mess up the fine details of the subjective experience of the AI or what its source code is aiming for, but we can just draw a circle around things that, from our point of view, steer the universe toward certain configurations and ask whether we'll like those configurations.

I was frustrated by how long they spent finding a framework they could both work in. At the risk of making a parody of myself, part of me wished that Yudkowsky had chosen to talk to someone who had read the sequences. Aside from the selection issues inherent in only arguing with people who have already read a bunch of Yudkowsky, I don't think it would help anyway. This conversation was in some ways less frustrating to me than the one Yudkowsky had with Ngo a few years ago [LW · GW], and Ngo has steeped himself in capital-R Bay Area Rationalism. As a particular example, it seemed to me like Ngo thought you could train an AI to make predictions about the world and then freely use those predictions to do things in the world, because you had only asked the AI for a prediction instead of asking it to do anything. I don't see how what he was saying wasn't isomorphic to claiming that you can stop someone from ever making bad things happen by having them tell you what to do and doing it yourself instead of letting them act. Maybe this was a deficiency of security mindset, maybe it was experience-based intuition about the type of AI that would arise from current research trends, or who knows, but I kept thinking to myself that Ngo wasn't thinking outside the box enough when he argued against doom. In that sense, Wolfram was more interesting to listen to, because he actually chased down the idea of where bizarre goals might come from in gradient descent, abstracted that out to "AI will likely end up with at least one subgoal from the space of goals that wasn't really intended," and then considered the question of whether an arbitrary goal is, on average, lethal. His intuition seemed to be that if you fill every goal in goal space, you end up with something like the set of every possible mollusk shell, each of which ends up serving some story in the environment. He didn't have an intuition for goal+smart=omnicide, and he also got too hung up on what "goal" and "smart" actually "meant" rather than just running with the thing which it seems to me Yudkowsky is clearly aiming at, even if Yudkowsky uses anthropomorphism to point at that thing. At least he ended up with something that seemed to directionally resemble Yudkowsky's actual concerns, even if it wasn't what he wanted to talk about for some reason. Also, Wolfram gets to the end and says "hey man, you should firm up your back-of-envelope calculations because we don't have shared intuition" when the thing Yudkowsky was trying to do with him for the past three hours was firm up those intuitions.

I keep listening to Yudkowsky argue with people about AI ruin because I have intuitions for why it is hard to create AI that won't kill us, but I think that Yudkowsky thinks it's even harder, and I don't actually know why. I get that something that is smart and wants something a lot will tend to get the thing even if killing me is a consequence. But my intuition says that AI will have goals which lead it to kill me primarily because humans are bad at making AI to the specifications they intend, rather than because goals are inherently dangerous. The current regime of AI development, where we just kind of try random walks through the space of linear algebra until we get algorithms that do what we want, seems like an obviously good way to make something sort-of-aligned with us, with wild edge cases that will kill us once it generalizes. And even if we were creating our algorithms by hand, I can just look out at a world of code full of bugs and easily imagine a bug that only shows up as a misaligned goal once the AI is deployed out in the world and too smart to stop. I get the feeling that I'm still missing the point somehow, and that Yudkowsky would say we still have a big chance of doom even if our algorithms were created by hand by programmers whose algorithms always did exactly what they intended, even when combined with their other algorithms. I'm guessing that there is a counterfactual problem set I could complete that would help me truly understand why most perfect algorithms that recreate a strawberry on the cellular level destroy the planet as well. Yudkowsky has said that he's not even sure it would be aligned if you took his brain and ran it many times faster with more memory. I've read enough Dath Ilan fiction to guess that he's (at least) worried about something in the class of "human brains have some exploitable vulnerability that leads to occasional optical illusions in the current environment but leads to omnicide out of distribution," but I'm not sure that's right because I haven't yet seen someone ask him that question. People keep asking him to refute terribly clever solutions which he already wrote about not working in 2007 rather than actually nailing down why he's worried.

If I were going to try to work out for myself why (or whether) even humans who make what they intend to make get AI wrong on their first try, instead of wistfully hoping Yudkowsky explains himself better some day, I would probably follow two threads. One is instrumental convergence, which leads anything going hard enough to move toward collecting all available negentropy (or money, or power, depending on the limits of the game; hopefully I don't have to explain this one here). I don't actually get why almost every goal will make an AI go hard enough, but I can imagine an AI being told to build as much manufacturing capability as possible going hard enough, and that's an obvious place to point an AI, so I guess the world is already doomed. The second is to start with simple goals like paperclips or whatever and build some argument that generalizes from discrete physical goals, which are obviously lethal if you go hard enough, to complex goals like "design, but do not implement, a safe fusion reactor" that it seems obvious to point an AI at. I suppose it doesn't matter if I figure this out, because I'm already convinced AI will kill us if we keep doing what we're doing, so why chase down edge cases where we die anyway along paths that humanity doesn't seem to possess enough dignity to pursue? Somehow I find myself wanting to know anyway, and I don't yet have the feeling of truly understanding.

5 comments

Comments sorted by top scores.

comment by IgnatzMouse · 2024-11-14T10:50:35.140Z · LW(p) · GW(p)

I agree with the frustration. Wolfram was being deliberately obtuse. Eliezer summarised it well toward the end, something like "I am telling you that the forest is on fire and you are telling me that we first need to define what we mean by fire". I understand that we need definitions for things like "agency" or technology "wanting" something, and even what we mean by a "human" in the year 2070. But Wolfram went a bit too far. A naive genius who did not want to play along in the conversation. Smart teenagers talk like that.

Another issue with this conversation was that, even though they were listening to each other, Wolfram was too keen to go back to his current pet ideas. Eliezer's argument is (I think) independent of whether we think AIs will fall under computational "irreducibility", but Wolfram kept going back to this over and over.

I blame the ineffective exchange primarily on Wolfram in this case, though Eliezer is also somewhat responsible for the useless rabbit holes in this conversation. He explains his ideas vividly and clearly, but there is something about his rhetorical style that does not persuade those who have not spent time engaging with his ideas beforehand, even someone as impressive as Wolfram. He also goes on too long about some detail or contrived example rather than ensuring that the interlocutor is on the same epistemological plane.

Anyway, fun thing to listen to

comment by localdeity · 2024-11-14T04:04:21.997Z · LW(p) · GW(p)

why most perfect algorithms that recreate a strawberry on the molecular level destroy the planet as well.

Phrased like this, the answer that comes to mind is "Well, this requires at least a few decades' worth of advances in materials science and nanotechnology and such, plus a lot of expensive equipment that doesn't exist today, and e.g. if you want this to happen with high probability, you need to be sure that civilization isn't wrecked by nuclear war or other threats in upcoming decades, so if you come up with a way of taking over the world that has higher certainty than leaving humanity to its own devices, then that becomes the best plan."  Classic instrumental convergence, in other words.

Replies from: Tahp
comment by Tahp · 2024-11-14T14:42:44.040Z · LW(p) · GW(p)

Oops, I meant cellular, and not molecular. I'm going to edit that.

I can come up with a story in which AI takes over the world. I can also come up with a story where obviously it's cheaper and more effective to disable all of the nuclear weapons than it is to take over the world, so why would the AI do the second thing? I see a path where instrumental convergence leads anything going hard enough to want to put all of the atoms on the most predictable path it can dictate. I think the thing I don't get is what principle makes anything useful go that hard. Something like (for example, I haven't actually thought this through) "it is hard to create something with enough agency/creativity to design and implement experiments toward a purpose without also having it notice and try to fix things in the world which are suboptimal for the purpose."

Replies from: localdeity
comment by localdeity · 2024-11-14T15:35:28.956Z · LW(p) · GW(p)

I can also come up with a story where obviously it's cheaper and more effective to disable all of the nuclear weapons than it is to take over the world, so why would the AI do the second thing?

Erm... For preventing nuclear war on the scale of decades... I don't know what you have in mind for how it would disable all the nukes, but a one-off breaking of all the firing mechanisms isn't going to work.  They could just repair/replace that once they discovered the problem.  You could imagine some more drastic thing like blowing up the conventional explosives on the missiles so as to utterly ruin them, but in a way that doesn't trigger the big chain reaction.  But my impression is that, if you have a pile of weapons-grade uranium, then it's reasonably simple to make a bomb out of it, and since uranium is an element, no conventional explosion can eliminate that from the debris.  Maybe you can melt it, mix it with other stuff, and make it super-impure?

But even then, the U.S. and Russia probably have stockpiles of weapons-grade uranium.  I suspect they could make nukes out of that within a few months.  You would have to ruin all the stockpiles too.

And then there's the possibility of mining more uranium and enriching it; I feel like this would take a few years at most, possibly much less if one threw a bunch of resources into rushing it.  Would you ruin all uranium mines in the world somehow?

No, it seems to me that the only ways to reliably rule out nuclear war involve either using overwhelming physical force to prevent people from using or making nukes (like a drone army watching all the uranium stockpiles), or being able to reliably persuade the governments of all nuclear powers in the world to disarm and never make any new nukes.  The power to do either of these things seems tantamount to the power to take over the world.

comment by Foyle (robert-lynn) · 2024-11-14T06:35:34.939Z · LW(p) · GW(p)

It was a very frustrating conversation to listen to, because Wolfram really hasn't engaged his curiosity and done the reading on AI-kill-everyoneism. So we just got a torturous number of unnecessary and oblique diversions from Wolfram, who didn't provide any substantive foil to Eliezer.

I'd really like to find Yudkowsky debates with better-prepared AI optimists who actually try to counter his points. Do any exist?