My Alignment "Plan": Avoid Strong Optimisation and Align Economy

post by VojtaKovarik · 2024-01-31T17:03:34.778Z · LW · GW · 9 comments

Contents

  Descriptive, rather than Normative
  My Alignment "Plan"
    (Optional) Step 0: Hope, or assume, that sharp left turn will not happen.
    Step 1: Convince everybody to avoid building the kind of AI that could undergo sharp left turn.
    Step 2: Build AI that automates trusted processes.
    Step 3: Solve problems caused by automating the economy.
  Follow-up Questions

Summary: Many people seem to put their hopes on something like the following "plan":

If this is true, I think there should be more acknowledgment of this fact, and more discussion of the failure modes of this plan.

Epistemic status: Descriptive, rather than normative.


Descriptive, rather than Normative

I label the epistemic status of this post as "descriptive, rather than normative". What do I mean by that? And what do I mean by alignment "plan"?

While I thought a lot about AI alignment, I still have many uncertainties about the topic. And I don't have any plan, for helping us build beneficial AGI, that I would be optimistic about. But I keep working on this, and I have opinions and preferences over which projects to undertake. So, the question I ask in this post is: To the extent that my actions and beliefs seem to be in line with any plan at all, what "plan" do they seem to be following?

I should disclaim that my actual beliefs are a bit more nuanced than the description given here. But for the sake of brevity, I will stick with the simpler formulations below. 

The main reason I write this post is that I suspect that many other people might be putting their hopes into a "plan" similar to what I describe. (In the case of alignment researchers, this might be explicit and due to the absence of better ideas. In the case of capabilities researchers, this might happen implicitly, as a result of background assumptions and not having thought about the topic.) To the extent that this is the case, I think it would be useful to acknowledge that this is what is happening, such that we can discuss the plan explicitly. To the extent that other people have a significantly different plan, I would be curious to know what the plan is.

Finally, note that I make no claim that the plan described here, or even my more nuanced version of it, is good. In fact, I do not think it is good --- I just don't have a better one. And I think that explicitly describing the plan is the first step towards improving it.

My Alignment "Plan"

(Optional) Step 0: Hope, or assume, that sharp left turn will not happen.

By sharp left turn [? · GW], I mean a scenario where an AI undergoes a sudden and extreme growth in capability, possibly until it becomes vastly more powerful than anything else around it. Some people seem convinced that a sharp left turn cannot, or will not, happen. I think that being confident about this is misguided.[1]

However, it does seem plausible to me that we live in a universe where a sharp left turn is impossible.

I also find it plausible that a sharp left turn is possible in principle, but that it is still far away in the "technological tree". In particular, it is possible that we still have a very long time until this problem needs to be addressed. Moreover, there is also the possibility that the kind of AI that could undergo a sharp left turn will only become available at a point where the background level of capabilities is very high. In such a scenario, undergoing a sharp left turn might no longer convey a sufficient advantage for the AI to make much of an impact.

Looking at my actions from the outside, it seems that aside from "don't build AI capable of a sharp left turn" (see Steps 1-2), my only "strategy" for handling a sharp left turn is

  1. hope somebody else solves it (despite being convinced that all of the current agendas fail if a sharp left turn occurs, similarly to [1 [LW · GW], 2 [? · GW]]), and
  2. hope we live in a universe where a sharp left turn won't happen anytime soon.[2]

Step 1: Convince everybody to avoid building the kind of AI that could undergo sharp left turn.

I don't have any good ideas for controlling the kind of AI that could undergo a sharp left turn, and neither am I aware of any recent work that would make progress on this problem. Instead, I am excited[3] about work which demonstrates the dangers of powerful AI --- ideally in ways that are salient even to ML researchers, policy makers, and the public. Two examples of such results are:

It seems conceivable to me that with enough such results, a majority of people could adopt the view that powerful-AI-soon is probably unsurvivable. More specifically, the scenario that seems conceivable to me is that the groups that adopt this view are:

In scenarios like these, I expect the change in opinion to suffice for civilisation to attempt to avoid building powerful AI. However, this does not automatically mean the attempt will succeed. In particular, we still need to tackle issues such as:

Ultimately, the hope with this step is that we can delay the development of sharp-left-turn-capable AI until we solve the alignment problem for such AI, or until civilisation becomes sufficiently robust to stop being vulnerable to AI takeover. (Recall that I am merely describing the plan, rather than making any claims about how likely it is to succeed.)

Step 2: Build AI that automates trusted processes.

Even if there is a general consensus that powerful AI is unsurvivable, I still expect any attempts to pause all AI progress to be unsustainable. As a result, we might try to increase our chances of controlling AI progress by white-listing approaches that seem relatively safe. But which approaches are those?

One intuition is that if we are currently doing some process without the use of AI, and we already trust that this process is safe, then automating that process and doing more of it is (probably) also safe. (I don't think this intuition is completely right, but since I discuss those reservations in Step 3, I will leave them aside for now.) To give a few positive examples, we can consider:

In contrast, the following strategies would not fall under the approach above:

Overall, this approach to building AI seems much slower and more expensive than building larger and larger foundation models and turning them into agents. However, it should still be sufficient to eventually automate most of the economy, which should in turn allow us to eliminate poverty, greatly speed up science, solve all problems that can be solved using technology, etc. So the "only" issues are whether we can successfully take Steps 1-2 ... and the minor detail of whether automating the economy might perhaps come with problems of its own.

Step 3: Solve problems caused by automating the economy.

As one might expect, even if the approach of automating trusted processes goes as well as possible, there will still be many remaining problems to solve. Some of these are:

All of these problems sound like they have the potential to cause human extinction, or worse. At the same time, most problems have the property that one can tell a scary story about how the given problem will cause the world to end. So, uhm, perhaps we can wing it and it will all be fine?

Follow-up Questions

Finally, here are some related questions that I have:

 

  1. ^

    Depending on how I wake up each day, I feel that the chance of a sharp left turn happening in time to be relevant is something between 5% and 95%. And most days I am above 50%. (This is beside the point of this post, but it does seem somewhat relevant for context.)

  2. ^

    Personally, I endorse the sentiment [LW · GW] that one should first figure out in which universe they are, and then try to do the best they can in that universe --- as opposed to focusing on worlds where they know how to make progress. That is why this plan has Steps 1-2.

  3. ^

    Well, at least more excited than about any other work.

9 comments

Comments sorted by top scores.

comment by cousin_it · 2024-02-01T12:46:02.507Z · LW(p) · GW(p)

a company that pays its employees below-subsistence wages will get outcompeted by companies that offer better conditions... once we automate a large fraction of the economy and society, this relationship between competitiveness and being beneficial to humans can cease to hold

Walmart is one of the biggest employers in the world, and its salaries are notoriously so low that a large percentage of employees depend on welfare to survive (in addition to their Walmart salary). The economy is already pretty far from what I'd call aligned. If we want to align it, the best time to start was a couple centuries ago, the second best time is now. Let's not wait until AI increases concentration of power even more.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-31T18:01:29.715Z · LW(p) · GW(p)

I think some things we can do to better our chances include:

  • enforcing sandboxed testing of frontier models before they are deployed, using independent audits by governments or outside companies. This could potentially prevent a model which has undergone a sharp left turn from escaping.
  • better ways of testing for potential harms from AI systems, expanding the set of available evals of various sorts of risk
  • putting more collective resources into AI safety: alignment research, containment preparations, worldwide monitoring, international treaties
  • ensure that a militarily dominant coalition of nations agrees that, should a Rogue AGI arise in the world, their best chance of survival would be a rapid, forceful response to stamp it out before it gains too much power. Have sufficient definitions and agreed-upon procedures in place so that action could follow automatically from detection, without need for lengthy discussion.
Replies from: RussellThor, VojtaKovarik
comment by RussellThor · 2024-02-01T03:06:42.297Z · LW(p) · GW(p)

What about quickly distributing frontier AI when it is shown to be safe? That is risky, of course, if it isn't safe; however, if the deployed AI is as powerful as possible and distributed as widely as possible, then a bad AI would need to be comparatively more powerful to take over.

So

AI(x-1) is everywhere and protecting as much as possible, AI(x) is sandboxed 

VS

AI(x-2) is protecting everything, AI(x-1) is in a few places, AI(x) is sandboxed.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2024-02-01T04:18:20.565Z · LW(p) · GW(p)

or the bad ai is able to hack every copy of the widely distributed ai the same way, making the question moot.

Replies from: RussellThor
comment by RussellThor · 2024-02-01T05:07:23.192Z · LW(p) · GW(p)

But it would surely be more likely to hack x-2 than x-1?

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2024-02-01T05:12:27.538Z · LW(p) · GW(p)

Right, and it would be easier to hack, since it has the same adversarial examples, right?

Oh, wait, I see what you're saying. No, I think hacking x-1 and x-2 will both be trivial. AIs basically have zero security right now.

Replies from: VojtaKovarik
comment by VojtaKovarik · 2024-02-01T19:25:25.459Z · LW(p) · GW(p)

I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. IE, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)

comment by VojtaKovarik · 2024-01-31T20:09:12.223Z · LW(p) · GW(p)

To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.

But that complaint aside: sure, all else being equal, all of the points you mention seem better having than not having.

comment by Seth Herd · 2024-01-31T20:10:32.206Z · LW(p) · GW(p)

Excellent post. I think this is not a plan that's likely to succeed, but I think you've correctly and explicitly laid out the plan that many are following without being explicit about it - and therefore its limitations.

I'm very curious how many alignment researchers would agree that this is roughly their plan.