What are the flaws in this argument about p(Doom)?

post by William the Kiwi · 2023-08-08T20:34:56.687Z · LW · GW · 25 comments

Technical alignment is hard

Technical alignment will take 5+ years

AI capabilities are currently subhuman in some areas (driving cars), about human in some areas (Bar exam), and superhuman in some areas (playing chess)

Capabilities scale with compute

The doubling time for AI compute is ~6 months

In 5 years compute will scale 2^(5÷0.5)=1024 times

In 5 years, with ~1024 times the compute, AI will be superhuman at most tasks including designing AI

Designing a better version of itself will increase an AI's reward function

An AI will design a better version of itself and recursively loop this process until it reaches some limit

Such an AI will be superhuman at almost all tasks, including computer security, R&D, planning, and persuasion

The AI will deploy these skills to increase its reward function

Human survival is not in the AI's reward function

The AI will kill off most or all humans to prevent the humans from possibly decreasing its reward function

Therefore: p(Doom) is high within 5 years


Despite what the title says, this is not a perfect argument tree. Which part do you think is the most flawed?

Edit: As per request, the title has been changed from the humorous "An utterly perfect argument about p(Doom)" to "What are the flaws in this argument about p(Doom)?"

Edit2: Yay, Frontpage! Totally for the wrong reasons though.

Edit3: added ", with ~1024 times the compute," to "In 5 years AI will be superhuman at most tasks including designing AI"

25 comments

Comments sorted by top scores.

comment by Raemon · 2023-08-08T21:24:11.368Z · LW(p) · GW(p)

I kinda wanna downvote for clickbaity title.

Replies from: Green_Swan, Mitchell_Porter
comment by Jacob Watts (Green_Swan) · 2023-08-09T00:32:24.683Z · LW(p) · GW(p)

Personally, I found it obvious that the title was being playful and don't mind that sort of tongue-in-cheek thing. I mean, "utterly perfect" is kind of a giveaway that they're not being serious.

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-09T21:26:51.684Z · LW(p) · GW(p)

You are correct, I was not being serious. I was a little worried someone might think I was, but considered it a low probability.

Edit: this little stunt has cost me a 1 hour time limit on replies. I will reply to the other replies soon

comment by Mitchell_Porter · 2023-08-08T23:50:31.346Z · LW(p) · GW(p)

Yes, I wanted to downvote too. But this is actually a good little argument to analyze. @William the Kiwi, please change the title to something like "What are the weaknesses in this argument for doom?"

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-10T00:56:25.030Z · LW(p) · GW(p)

As requested I have updated the title. How does the new one look?

Edit: this is a reply to the reply below, as I am commenting restricted but still want to engage with the other commenters: deleted

Edit2: reply moved to actual reply post

Replies from: Mitchell_Porter, William the Kiwi
comment by Mitchell_Porter · 2023-08-10T07:30:15.732Z · LW(p) · GW(p)

It's fine. I have no authority here, that was really meant as a suggestion... Maybe the downvoters thought it was too basic a post, but I like the simplicity and informality of it. The argument is clear and easy to analyze, and on a topic as uncertain and contested as this one, it's good to return to basics sometimes. 

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-10T22:24:47.635Z · LW(p) · GW(p)

I think it was a helpful suggestion. I am happy that you liked the simplicity of the argument. The idea was to make it as concise as possible so the flaws would be easier to spot. The argument relies on a range of assumptions, but I deliberately left out the more confident ones. I find the topic of predicting AI development challenging, and was hoping this argument tree would be an efficient way of recognizing the more challenging parts.

Replies from: William the Kiwi , William the Kiwi , William the Kiwi
comment by William the Kiwi · 2023-08-11T11:42:15.716Z · LW(p) · GW(p)

Disagree vote this post if you disagree that the topic of predicting AI development is challenging.

comment by William the Kiwi · 2023-08-11T11:40:43.675Z · LW(p) · GW(p)

Disagree vote this post if you disagree with liking the simplicity of the original post.

comment by William the Kiwi · 2023-08-11T11:40:09.813Z · LW(p) · GW(p)

The above reply has two disagreement votes. I am trying to discern which reasons they are for. Disagree vote this post if you disagree that Mitchell_Porter's suggestion was helpful.

comment by William the Kiwi · 2023-08-10T22:43:03.735Z · LW(p) · GW(p)

Oops, I realized I used "flaws" rather than "weaknesses". Do you consider these to be appropriate synonyms? I can update if not.

comment by TAG · 2023-08-11T16:27:38.379Z · LW(p) · GW(p)

Designing a better version of itself will increase an AI’s reward function

An AI doesn't have to have a reward function, or one that implies self-improvement. Reward functions often only apply at the training stage.

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-12T13:56:05.813Z · LW(p) · GW(p)

How would an AI be directed without using a reward function? Are there some examples I can read?

Replies from: programcrafter
comment by ProgramCrafter (programcrafter) · 2023-08-13T15:50:22.768Z · LW(p) · GW(p)

Current AIs are mostly not explicit expected-utility-maximizers. I think this is illustrated by RLHF (https://huggingface.co/blog/rlhf).

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-18T19:17:47.347Z · LW(p) · GW(p)

But isn't that also using a reward function? The AI is trying to maximise the reward it receives from the reward model, and that reward model was itself trained using human feedback.
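For instance, here is a toy sketch of what I mean (hypothetical stand-ins in plain Python, not the actual pipeline from the linked post): a learned reward model scores outputs, and the policy is nudged toward outputs that score highly, so the optimization target is still a reward signal, just one that was itself learned from human feedback.

```python
# Toy sketch (hypothetical): a "reward model" learned from human preferences
# still acts as the reward function that the policy is optimized against.

def reward_model(output: str) -> float:
    # Stand-in for a network trained on human preference comparisons.
    return 1.0 if "helpful" in output else 0.0

# Stand-in "policy": a probability distribution over two canned outputs.
policy = {"a helpful answer": 0.5, "an unhelpful answer": 0.5}

def update_policy(policy, lr=0.1):
    # Crude policy-improvement step: shift probability toward outputs the
    # reward model scores highly. Real RLHF updates the LLM's parameters
    # (e.g. with PPO) rather than a lookup table like this.
    scores = {o: reward_model(o) for o in policy}
    baseline = sum(policy[o] * scores[o] for o in policy)
    new = {o: max(1e-6, p + lr * (scores[o] - baseline)) for o, p in policy.items()}
    total = sum(new.values())
    return {o: p / total for o, p in new.items()}

for _ in range(20):
    policy = update_policy(policy)

print(policy)  # probability mass ends up on the output the reward model prefers
```

In this sketch the policy never sees the human feedback directly; it only ever sees the reward model's scores, which is why it still reads to me as being directed by a reward function.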

comment by Christopher King (christopher-king) · 2023-08-09T01:59:54.781Z · LW(p) · GW(p)

Technical alignment is hard

Technical alignment will take 5+ years

This does not follow, because subhuman AI can still accelerate R&D.

Replies from: rhollerith_dot_com, William the Kiwi
comment by RHollerith (rhollerith_dot_com) · 2023-08-10T14:57:59.648Z · LW(p) · GW(p)

The OP's argument can be modified to be immune to your objection:

Technical alignment is harder than capability research. Technical alignment will take longer than we have before capability research kills us all.

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-10T22:47:16.058Z · LW(p) · GW(p)

This too seems like an improvement. However, I would leave out the "kills us all" bit, as that is meant to be the last line of the argument.

comment by William the Kiwi · 2023-08-10T22:45:01.365Z · LW(p) · GW(p)

A fair comment. Would the following be an improvement? "Some technical alignment engineers predict that, with current tools and resources, technical alignment will take 5+ years."

comment by William the Kiwi · 2023-08-11T11:33:28.659Z · LW(p) · GW(p)

For those who are downvoting this post: a short one-sentence comment will help the original poster make better articles in the future.

comment by Jacob Watts (Green_Swan) · 2023-08-09T00:57:32.507Z · LW(p) · GW(p)

The doubling time for AI compute is ~6 months


Source?

In 5 years compute will scale 2^(5÷0.5)=1024 times


This is a nitpick, but I think you meant 2^(5*2)=1024


In 5 years AI will be superhuman at most tasks including designing AI


This kind of clashes with the idea that AI capabilities gains are driven mostly by compute. If "moar layers!" is the only way forward, then someone might say this is unlikely. I don't think this is a hard problem, but I think it's a bit of a snag in the argument.


An AI will design a better version of itself and recursively loop this process until it reaches some limit

I think you'll lose some people on this one. The missing step here is something like "the AI will be able to recognize and take actions that increase its reward function". There is enough of a disconnect between current systems and systems that would actually take coherent, goal-oriented actions that the point kind of needs to be justified. Otherwise, it leaves room for something like a GPT-X to just kind of say good AI designs when asked, but which doesn't really know how to actively maximize its reward function beyond just doing the normal sorts of things it was trained to do. 

Such an AI will be superhuman at almost all tasks, including computer security, R&D, planning, and persuasion

I think this is a stronger claim than you need to make and might not actually be that well-justified. It might be worse than humans at loading the dishwasher because that's not important to it, but if it was important, then it could do a brief R&D program in which it quickly becomes superhuman at dishwasher-loading. Idk, maybe the distinction I'm making is pointless, but I guess I'm also saying that there are a lot of tasks it might not need to be good at if it's good at things like engineering and strategy.

Overall, I tend to agree with you. Most of my hope for a good outcome lies in something like the "bots get stuck in a local maximum and produce useful superhuman alignment work before one of them bootstraps itself and starts 'disempowering' humanity". I guess that relates to the thing I said a couple paragraphs ago about coherent, goal-oriented actions potentially not arising even as other capabilities improve.

I am less and less optimistic about this as research specifically designed to make bots more "agentic" continues. In my eyes, this is among some of the worst research there is.

Replies from: William the Kiwi
comment by William the Kiwi · 2023-08-09T23:10:35.094Z · LW(p) · GW(p)

Thank you Jacob for taking the time for a detailed reply. I will do my best to respond to your comments.

The doubling time for AI compute is ~6 months

Source?

Source: https://www.lesswrong.com/posts/sDiGGhpw7Evw7zdR4/compute-trends-comparison-to-openai-s-ai-and-compute [LW · GW]. They conclude a doubling time of 5.7 months over the years 2012 to 2022. This was rounded to 6 months to make the calculations clearer. They also note that "OpenAI’s analysis shows a 3.4 month doubling from 2012 to 2018".

In 5 years compute will scale 2^(5÷0.5)=1024 times

This is a nitpick, but I think you meant 2^(5*2)=1024

I actually wrote it the (5*2) way in my first draft of this post, then edited it to (5÷0.5) as this is [time frame in years]÷[length of cycle in years], which is technically less wrong.
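For concreteness, here is the arithmetic spelled out (a quick sketch assuming a constant doubling time; the variable names are just for illustration):

```python
# Compute growth over a fixed horizon, assuming a constant doubling time.
doubling_time_years = 0.5   # ~6 months, rounded from the ~5.7-month estimate
horizon_years = 5

doublings = horizon_years / doubling_time_years   # 5 / 0.5 = 10
scale_factor = 2 ** doublings                     # 2**10 = 1024
print(doublings, scale_factor)

# Equivalent form, written as doublings per year (the (5*2) version):
doublings_per_year = 1 / doubling_time_years      # 2 doublings per year
assert 2 ** (horizon_years * doublings_per_year) == scale_factor
```

Both forms give the same exponent of 10, so the two ways of writing it agree.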

In 5 years AI will be superhuman at most tasks including designing AI

This kind of clashes with the idea that AI capabilities gains are driven mostly by compute. If "moar layers!" is the only way forward, then someone might say this is unlikely. I don't think this is a hard problem, but I think it's a bit of a snag in the argument.

I think this is one of the weakest parts of my argument, so I agree it is definitely a snag. The move from "superhuman at some tasks" to "superhuman at most tasks" is a bit of a leap. I also don't think I clarified what I meant very well. I will update to add ", with ~1024 times the compute,".

An AI will design a better version of itself and recursively loop this process until it reaches some limit

I think you'll lose some people on this one. The missing step here is something like "the AI will be able to recognize and take actions that increase its reward function". There is enough of a disconnect between current systems and systems that would actually take coherent, goal-oriented actions that the point kind of needs to be justified. Otherwise, it leaves room for something like a GPT-X to just kind of say good AI designs when asked, but which doesn't really know how to actively maximize its reward function beyond just doing the normal sorts of things it was trained to do.

Would adding that suggested text to the previous argument step help? Perhaps "The AI will be able to recognize and take actions that increase its reward function. Designing a better version of itself will increase that reward function." But yeah, I tend to agree that there needs to be some sort of agentic clause in this argument somewhere.

Such an AI will be superhuman at almost all tasks, including computer security, R&D, planning, and persuasion

I think this is a stronger claim than you need to make and might not actually be that well-justified. It might be worse than humans at loading the dishwasher because that's not important to it, but if it was important, then it could do a brief R&D program in which it quickly becomes superhuman at dishwasher-loading. Idk, maybe the distinction I'm making is pointless, but I guess I'm also saying that there are a lot of tasks it might not need to be good at if it's good at things like engineering and strategy.

Would this be an improvement? "Such an AI will be superhuman, or able to become superhuman, at almost all tasks, including computer security, R&D, planning, and persuasion"

Overall, I tend to agree with you. Most of my hope for a good outcome lies in something like the "bots get stuck in a local maximum and produce useful superhuman alignment work before one of them bootstraps itself and starts 'disempowering' humanity". I guess that relates to the thing I said a couple paragraphs ago about coherent, goal-oriented actions potentially not arising even as other capabilities improve.

I would speculate that most of our implemented alignment strategies would be metastable: they only stay aligned for a random amount of time. This would mean we mostly rely on strategies that hope to get x before we get y. Obviously this is a gamble.

I am less and less optimistic about this as research specifically designed to make bots more "agentic" continues. In my eyes, this is among some of the worst research there is.

I speculate that a lot of the x-risk probability comes from agentic models. I am particularly concerned with better versions of models like AutoGPT, which don't have to be very intelligent (so long as they are able to continuously ask GPT5+ how to act intelligently) to pose a serious risk.

Meta question: how do I dig my way out of a karma grave when I can only comment once per hour and post once per 5 days?

Meta comment: I will reply to the other comments when the karma system allows me to.

Edit: formatting