My Plan to Build Aligned Superintelligence

post by apollonianblues · 2022-08-21T13:16:32.697Z · LW · GW · 7 comments

2ish months ago, I realized that making sure superintelligent AI doesn’t kill everyone was the biggest problem, period. I resolved to understand and maybe help solve the problem. A friend of mine who does alignment research said that, in order to clarify my thinking, I should sit down and try to write my best solution to aligning a superintelligent AI in an hour. I did this, and found it extremely helpful; I then edited it and stared angrily at it for a few more hours to make it less embarrassing to post on LessWrong. This is, so far, my best plan to solve AI alignment. I found writing this very helpful for clarifying my own model of alignment, and what I was still confused about. I would highly recommend that other people starting to think about AI alignment try this exercise. Enjoy!

Problem Statement

To solve alignment and build an aligned artificial general intelligence, you’d need a smarter-than-human system that does what you want it to do, you’d need some confidence that it will do this thing before you hand it a machete and say “sure, do whatever,” and you’d also probably need it to change its behavior in response to you telling it not to do something, telling it what to do, etc., but in a way that represents what you’re actually asking it to do, not some proxy that it can mess with. You’d also need to do this before anyone makes an unaligned AGI, and it would have to prevent new people from making unaligned AGIs.

To do this, you want an AGI that understands what you actually want to happen in the real physical world, is incentivized to model this instead of optimizing one specific criterion like “I see a video of the thing that maximizes my reward,” and will change its behavior in response to you indicating you want it to do something else. It seems like this kind of system would need to be actively trying to do something similar to how we “do science” to figure things out about the world. Instead of optimizing one specific criterion that it could game, like the number of people who press an “I’m happy” button or smile in front of a camera, it would understand that there’s some sort of underlying process the specific details are only proxies for, and that this process is what it’s trying to model.

We sometimes take this quality for granted because humans actually care about things in the real world, since our training environment required us to locate and consume real food with specific nutrients so we could walk around and make real babies. Part of the tricky bit is finding a reward you can actually put in place that correlates with “do X things in the real world” and not just “patterns of camera data that match the original training set I gave you,” or “some heuristic that correlates with humans smiling in the training distribution but, outside the training distribution, is best matched by some weird pattern to tile the universe in.” It seems like whatever you do to make this happen doesn’t involve just trying to program in some kind of friendliness/deference. If you ask an AI to turn a brick into a diamond, and it projects a video of a brick turning into a diamond in front of your camera, it’s not necessarily going like “haha, I am deceiving the human into thinking there is a brick turning into a diamond here, but really there isn’t!!!” It’s quite likely the reasoning process is more like “I want to make it such that there’s this video output, and this is the most efficient way to do it given my available resources.” Similarly, if you try to ask it to defer to you/map your preferences before it does something, but it can simulate a better/easier-to-work-with version of “you,” it seems pretty likely to do that unless you know you’re getting it to care about You, Real Human Who Is Making AGI.

OK, that’s the problem statement. My plan for making an AI that models things in a way that you model things, AND will hopefully shut down when you ask it to, AND you can get it to do a goal you actually want, is below. I assume you use something basically like modern deep learning because I haven’t had time to think of a new paradigm for artificial intelligence in the past few hours.

First, take several steps where you have an AI model the physical world before you deploy it.
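
To make that a little more concrete, here’s a minimal sketch (not the plan itself, just an illustration under my “basically modern deep learning” assumption) of the most basic version of “have an AI model the physical world”: a predictor trained to forecast the next sensor snapshot from logged data. Every name here, from SENSOR_DIM to the dataloader, is a hypothetical stand-in.

```python
# A minimal sketch of the "model the physical world" step: a predictor network
# trained by ordinary supervised learning to forecast the next sensor reading.
# SENSOR_DIM, world_model, and logged_physics_dataloader are hypothetical.
import torch
import torch.nn as nn

SENSOR_DIM = 64   # hypothetical size of one snapshot of the toy physical scene

world_model = nn.Sequential(
    nn.Linear(SENSOR_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, SENSOR_DIM),  # predicts the next sensor snapshot
)
optimizer = torch.optim.Adam(world_model.parameters(), lr=1e-4)

def train_step(state: torch.Tensor, next_state: torch.Tensor) -> float:
    """One gradient step: make the model's prediction match what actually happened."""
    predicted = world_model(state)
    loss = nn.functional.mse_loss(predicted, next_state)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Offline training over hard drives of logged physical-scene data (no internet
# access, per the bunker setup described later in the post):
# for state, next_state in logged_physics_dataloader:
#     train_step(state, next_state)
```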

I claim that if you do this kind of training strategy right, you can get it to predict whether real diamonds are getting made in a room, train some sort of diamond-making system with a loss function based on this prediction model’s judgment of whether there are diamonds in the environment, and thereby solve Eliezer’s toy problem of how you can get an AI to turn things into diamonds and nothing else [although you shouldn’t build this, because it’d be dangerous].
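
A rough sketch of the two-model setup I just described might look like the following: a frozen diamond predictor supplying the training signal for a diamond maker. The differentiable toy simulator and all the module names here are assumptions I’m adding for illustration, not something the plan specifies.

```python
# A minimal sketch: a frozen predictor scores how likely it is that real diamonds
# exist in the room, and a separate "maker" policy is trained to raise that score.
# The toy differentiable "physics" and all sizes are hypothetical stand-ins.
import torch
import torch.nn as nn

OBS_DIM, ACTION_DIM = 128, 16  # hypothetical observation/action sizes

diamond_predictor = nn.Sequential(   # assumed already trained, then frozen
    nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
)
for p in diamond_predictor.parameters():
    p.requires_grad_(False)

maker_policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACTION_DIM), nn.Tanh()
)
optimizer = torch.optim.Adam(maker_policy.parameters(), lr=1e-4)

TOY_PHYSICS = torch.randn(ACTION_DIM, OBS_DIM)  # hypothetical fixed "physics" of the room

def simulate_environment(observation: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """Hypothetical differentiable stand-in for how the room changes after acting."""
    return observation + action @ TOY_PHYSICS

def maker_step(observation: torch.Tensor) -> float:
    """Train the maker to raise the frozen predictor's 'real diamonds here' score."""
    action = maker_policy(observation)
    next_observation = simulate_environment(observation, action)
    p_diamonds = diamond_predictor(next_observation)
    loss = -p_diamonds.mean()   # gradient flows through the sim into the maker, not the predictor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return p_diamonds.mean().item()
```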

Next, you have to solve a pretty hard problem: going from the predictor AI in your dark room underground, which models specific toy physical scenarios, to a predictor AI that understands broader abstractions/general goals, anywhere from getting you a cup of coffee to, say, burning all the GPUs that aren’t it, but without killing any humans/doing anything you find shitty.
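
One very hand-wavy way to picture this transition is as a curriculum that broadens the task distribution in stages, with a generalization check between stages. This is just a sketch of the shape of the idea; the stage names, train_on, evaluate, and the threshold below are hypothetical placeholders, not part of the plan.

```python
# A heavily hand-wavy sketch of "broaden the distribution in stages and check
# generalization before moving on". Everything here is a hypothetical placeholder.

def train_on(stage: str) -> None:
    """Placeholder: fine-tune the predictor/agent on this stage's data."""
    ...

def evaluate(stage: str, held_out: bool = True) -> float:
    """Placeholder: return a held-out generalization score in [0, 1]."""
    return 1.0

curriculum = [
    "toy_physics_rooms",   # the underground-bunker scenarios from the previous step
    "household_tasks",     # e.g. "get you a cup of coffee"
    "open_ended_goals",    # the broad, hard-to-specify stuff
]

for stage in curriculum:
    train_on(stage)
    if evaluate(stage, held_out=True) < 0.95:
        # Don't widen the distribution while the model is failing to track what you meant.
        raise RuntimeError(f"Generalization check failed at stage {stage!r}")
```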

Note: as you’re doing this I also imagine frequently doing a bunch of “it’s not trying to kill you” interpretability checks, the specific form of which depends on what your predictor slash budding AGI looks like. I also imagine you doing this in an underground bunker where you feed it hard drives of data and try not to give it access to the Internet. This is kind of a big ask given how current AGI development is going, but whatever, let’s pretend I can do this.
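
For flavor, one toy version of an “it’s not trying to kill you” check could be a linear probe on the model’s hidden activations, trained to flag internal states that resemble known bad behavior. This is only a stand-in I’m adding for illustration; as the note says, the real check depends on what the system actually looks like, and the labels, sizes, and tolerance below are all assumptions.

```python
# A toy sketch of one interpretability-style check, assuming you already have
# (a) hidden activations from the model on a batch of inputs and (b) labels for
# deliberately constructed bad-behavior probes. All names/sizes are hypothetical.
import torch
import torch.nn as nn

HIDDEN_DIM = 256  # hypothetical width of the layer being probed

probe = nn.Linear(HIDDEN_DIM, 1)  # flags activations resembling known bad behavior
probe_optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def fit_probe(activations: torch.Tensor, bad_behavior_labels: torch.Tensor) -> float:
    """Train the probe to separate 'known bad' from 'known fine' internal states (labels are 0/1 floats)."""
    logits = probe(activations).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, bad_behavior_labels)
    probe_optimizer.zero_grad()
    loss.backward()
    probe_optimizer.step()
    return loss.item()

def check_before_continuing(activations: torch.Tensor, threshold: float = 0.5) -> bool:
    """Return False (i.e., halt and investigate) if the probe flags too many current states."""
    with torch.no_grad():
        flagged = (torch.sigmoid(probe(activations)) > threshold).float().mean()
    return flagged.item() < 0.01  # hypothetical tolerance
```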

OK, that’s my best plan to make aligned AGI. TL;DR:

My central intuition here is that you have to get it to do things with a world-model that matches yours, and it doesn’t actually seem like there’s some key conceptual solution that will make it do this; you just have to do a bunch of training to make it do this and check a bunch of ways it could go wrong, all while keeping an eye on it to make sure it’s not trying to kill you.

I’ve thought of other ways that stuff in this plan might go wrong, and specific patches you could try to make those failures less likely, but they’d be exhausting to write down and would make the post too long. I’m very happy to hear objections/criticisms about why this approach seems wrong or why people think the actual solution looks very different, and I’ll reply to your comment and tell you my patches for it, or be like “damn, you’re right” if your comment is really good. Or send me hate mail at apollonianblues@gmail.com.

Thanks for reading my Very Great Plan for Making the Aligned Superintelligence! Again, I found doing this quite helpful for clarifying my personal models, and I’d highly recommend other people write down their Plans for Aligning Superintelligence as well.

7 comments

Comments sorted by top scores.

comment by harsimony · 2022-08-21T18:03:51.758Z · LW(p) · GW(p)

You may also want to check out John Wentworth's natural abstraction hypothesis work:

https://www.lesswrong.com/posts/cy3BhHrGinZCp3LXE/testing-the-natural-abstraction-hypothesis-project-intro [LW · GW]

Replies from: apollonianblues
comment by apollonianblues · 2022-08-21T19:25:05.811Z · LW(p) · GW(p)

I have LOL thanks tho

Replies from: apollonianblues
comment by apollonianblues · 2022-08-21T19:27:25.280Z · LW(p) · GW(p)

TBH my naive thought is that if John's project succeeds it'll solve most of what I think of as the hard part of alignment, and so it seems like one of the more promising approaches to me, but in my model of the world it seems quite unlikely that there are natural abstractions in the way that John seems to think there are.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-08-21T14:44:10.236Z · LW(p) · GW(p)

Seems pretty reasonable to me. I mean, there's still a number of places where I think, "sure, but how exactly do we do that step?". One thing that has been worrying me is that we really don't seem ready to figure out the ethical aspects of wildly out-of-distribution stuff like uploaded humans and nanotech and such. So I agree that the goal of "go just barely far enough out of distribution to prevent unaligned AGIs from being built, then go slow and think things through carefully" is a good one. I also agree that "actually get the AI to model and care about the real world" seems like a necessary precursor to the "stop other AGIs" goal. Some problems that concern me are things like: assuming that many actors are racing towards AGI, and assuming that some of these actors will be state actors with well-guarded secret labs, doesn't this seem like kind of a dangerously high power level to have our hopefully-but-not-definitely-aligned AGI operating at? Also, getting that much of a lead in the capabilities race seems impractical without the competitors having close-enough-to-insanely-powerful proto-AGIs, which would make the bar for defeating them even higher. So I worry that trying to race for a pivotal act is not the best path forward. Can you think of other ways forward?

comment by dmav · 2022-08-21T22:51:12.771Z · LW(p) · GW(p)

Here are some objections I have to your post:
How are you going to specify the amount of optimization pressure the AI exerts on answering a question/solving a problem? Are you hoping to start out training a weaker AI that you later augment? 
If so, I'd be concerned about any distributional shifts in its optimization process that occur during that transition
If not, it's not clear to me how you have the AI 'be safe' through this training process.

At the point where you, the human, are labeling data to train the AI to identify concepts with measurements/features, you now have a loss function that's dependent on human feedback, and which, once again, you can't specify in terms of the concepts you want the AI to identify. It seems like the AI is pretty incentivized to be deceptive here (or really at any point in the process).
I.e., if it's superintelligent and you accidentally gave it the loss function 'maximize paperclips', but it models humans as potentially not realizing they gave it this loss function, then I think it would act indistinguishably from an AI with the loss function you intended (at least during this stage of training you outline).

Even if, say, it does at first do things that look like what a paperclip maximizer would try to do, instead of whatever you actually want it to do (label things appropriately) - say, it tries to get a human user to upload it to the internet or something, but your safeguards are sufficiently strong to prevent things like this - then I think that as you train away actions like this, you're not just training it to have better utility functions or whatever, you're training it to be more effectively deceptive.

comment by ryanhelsing · 2022-08-21T18:43:13.410Z · LW(p) · GW(p)

Is it generally accepted that an aligned super-intelligence will prevent other super-intelligences from existing?

Replies from: apollonianblues
comment by apollonianblues · 2022-08-21T19:24:13.437Z · LW(p) · GW(p)

My assumption is that it would do this to prevent other people from making superintelligences that are unaligned. At least, Eliezer thinks you need to do this (see bullet point 6 in this post [LW · GW]), and I think it generally comes up in conversations people have about pivotal acts. Some people think that if you come up with an alignment solution that's good and easy to implement, everyone building AGI will use it, and so you won't have to prevent other people from building unaligned AGI, but this seems unrealistic and risky to me.