Is AI Alignment Enough?

post by Aram Panasenco (panasenco) · 2025-01-10T18:57:48.409Z · LW · GW · 3 comments

Contents

  Defining humanity's terminal goal
  Defining humanity's instrumental goals
  Assumptions
  Why human alignment is the primary instrumental goal
  What about always-on AIs?
  Focus on human alignment
  Fast human alignment is possible without AI
3 comments

Virtually everyone I see in the AI safety community seems to believe that working on AI alignment is the key to ensuring a safe future. However, it seems to me that AI alignment is at best a secondary instrumental goal that can't in and of itself achieve our terminal goal. At worst, it's a complete distraction.

Defining humanity's terminal goal

I'll define humanity's terminal goal in the context of AI as keeping the "price" of each pivotal superhuman engineering task an AI does for us below a 50% chance of a billion or more human deaths. These numbers come from the minimal acceptable definition of AI alignment in Yudkowsky's list of lethalities [LW · GW].

It'd be more precise to say that we care about an AI killing over a billion people or doing something else that's equally or more horrible by human standards. I can't define exactly what those horrible actions are (if I could, alignment would perhaps be halfway solved). This disclaimer is too long to keep writing out, so please mentally add "or do something equally horrible by human standards" every time you see "kill over a billion people" below.
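
As a rough formalization (the notation is mine, not Yudkowsky's, and "equivalent catastrophe" stands in for the disclaimer above), the terminal goal is that for every pivotal superhuman engineering task T we ask an AI to perform:

    \[
    P\big(\geq 10^9 \text{ deaths or an equivalent catastrophe} \,\big|\, \text{the AI performs pivotal task } T\big) < 0.5
    \quad \text{for every pivotal task } T.
    \]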

Defining humanity's instrumental goals

Logically, there are only two paths to ensuring that the price of each pivotal superhuman engineering task stays below this threshold.

  1. Achieving AI alignment: I'll again use Yudkowsky's minimal definition of "aligned" as the AI having less than a 50% chance of killing over a billion people per pivotal superhuman engineering task. This is the same definition as in our terminal goal.
  2. Achieving human alignment: Ensuring that humanity will not build (let alone turn on) an AI that has the cognitive power to have a 50% chance of killing over a billion people unless the following two conditions are met:
    1. The AI can be proven to be aligned before it's turned on.
    2. There's some pivotal superhuman engineering task that the AI will be capable of that's worth taking the risk for.

Assumptions

Why human alignment is the primary instrumental goal

Suppose that AI alignment, as defined above, is achieved immediately, today. Yudkowsky makes the point [LW · GW] that this will not prevent AI labs that don't care about alignment from building an unaligned superintelligent AI. Therefore, an AI with a 50% or higher chance of killing over a billion people will still get built and turned on, and we will have failed to achieve our terminal goal.

Yudkowsky solves this problem with a 'pivotal act': something you can get the aligned superintelligent AI to do to prevent any other labs from building unaligned AI. This shows that achieving AI alignment is not enough - you must then have a plan to do something with it, and that something has to prevent humanity from recklessly building AIs with greater and greater cognitive powers. If you tell the aligned superintelligent AI to "burn all GPUs" (to borrow Yudkowsky's example), what you're actually doing is achieving human alignment by force rather than by persuasion. I'm not saying this in a condemning manner at all; I'm just pointing out that this is another path to human alignment. In the end, it's human alignment that's necessary to achieve our terminal goal.

On the other hand, if human alignment is achieved without AI alignment, then humanity will prevent a superintelligent AI from getting built until it can be proven to be aligned before it's turned on (which may be never). The terminal goal is satisfied.

In summary, human alignment is both necessary and sufficient to achieve our terminal goal. AI alignment is at most useful as a secondary instrumental goal for bringing about human alignment.

What about always-on AIs?

This section doesn't have an effect on the overall argument, but I'm including it for logical completeness.

In addition to AIs used for pivotal superhuman engineering tasks, humanity will also have some (hopefully weaker) AIs constantly running, just maintaining stuff. We need a threshold for how many incidents we will tolerate from those always-on AIs, which may not have the cognitive power to do pivotal superhuman engineering tasks but may well still be capable of killing us all. I'll define our second terminal goal as having less than a 1% chance of a billion or more human deaths per year from the combined total of these always-on AIs (the numbers are completely arbitrary).

To differentiate between the two types of AIs defined in the two terminal goals, I'll use the terms "superintelligent AIs" for the AIs that would be used for pivotal superhuman engineering tasks and "always-on AIs" for the AIs that would be always on. Unfortunately, there could be some overlap between the two sets, as there's nothing theoretically stopping humanity from keeping an AI capable of pivotal superhuman engineering tasks always on...

We'll define instrumental goals for the always-on AIs:

  1. Achieving always-on AI alignment: There are many always-on AIs, and their failures are almost certainly not independent variables, but we somehow work things out so that the combined set of all of them has less than a 1% chance of killing over a billion people in any given year. If we can achieve that, we call that combined set of always-on AIs aligned (a toy numerical sketch of this condition follows the list).
  2. Achieving human alignment: Ensuring that humanity will not continually run a combined set of AIs that has a 1% or higher chance of killing over a billion people per year.
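
As a toy illustration of condition 1 (entirely my own sketch, with made-up per-system numbers), here's how one might sanity-check the combined annual risk. The union bound holds no matter how correlated the failures are, while an independence estimate can be off in either direction:

    # Toy sketch: checking whether a set of always-on AIs stays under the
    # 1%-per-year combined catastrophe threshold. The per-system probabilities
    # below are hypothetical placeholders, not estimates from the post.

    per_system_annual_risk = [0.001, 0.002, 0.0005]  # P(catastrophe in a given year) for each AI
    THRESHOLD = 0.01  # the (arbitrary) 1%-per-year terminal-goal threshold

    # Union bound: a valid upper bound on P(at least one catastrophe),
    # regardless of how the failures are correlated.
    union_bound = sum(per_system_annual_risk)

    # Independence estimate: P(at least one) = 1 - product of P(no catastrophe).
    # If failures are correlated (they almost certainly are), this can mislead
    # in either direction, so the union bound is the safer check.
    no_catastrophe = 1.0
    for p in per_system_annual_risk:
        no_catastrophe *= (1.0 - p)
    independence_estimate = 1.0 - no_catastrophe

    print(f"union bound:           {union_bound:.2%}")            # 0.35%
    print(f"independence estimate: {independence_estimate:.2%}")  # ~0.35%
    print("within threshold:", union_bound < THRESHOLD)           # True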

Without human alignment, there's nothing stopping labs from keeping an arbitrary number of superintelligent AIs that should normally be reserved for pivotal superhuman engineering tasks always on, thereby skyrocketing the risk. Therefore, human alignment is once again the necessary and sufficient condition for achieving our terminal goals.

Focus on human alignment

It's important for us to understand that AI alignment alone will not achieve our terminal goal(s).

Fast human alignment is possible without AI

I purposefully chose the cover of Hobbes' 1651 Leviathan as the preview image. Leviathan is a foundational work of modern social contract theory, and its front cover depicts the central idea of the book - many individual humans becoming a single entity, or, we could say, becoming "aligned" as a single will. Hobbes wrote Leviathan in response to the shock of witnessing the brutality of the English Civil War. To me, Leviathan is a cry that anything, including submitting unconditionally to absolute power, is better than the horror of war.

The closest thing to Hobbes' experience for me personally is being born in the former Soviet Union. I'll relay my understanding of the Soviet experience that I absorbed through osmosis. If there are people here with more knowledge and experience, please correct me if I got it wrong.

Imagine that, from a very early age, you were shown a tapestry depicting some glorious future of humanity. You were taught that you would contribute to that vision once you grew up. Then you grow up, full of bright-eyed dreams, and when you pull back the tapestry, there is a giant meatgrinder. When the Party tells you to get in the meatgrinder, you get in. You don't ask how people getting ground in the meatgrinder will contribute to the future on the tapestry. You don't ask whether getting in the meatgrinder is the best use of your talents and aspirations as a human being. You let the Party worry about the future; you just worry about obeying the Party.

I believe that most of the revolutionaries of 1917 were full of good intentions to save humanity from what they saw as the meatgrinder of capitalism, and they really didn't intend to build an even worse meatgrinder themselves. Some may argue that human meatgrinders don't last forever, so a meatgrinder is still better than extinction. To that I respond that AI safety may become associated with human meatgrinders in the same way communism now is in Eastern Europe. And if the idea of AI safety comes to be resented on a visceral level by a large enough number of people, then humanity is probably still getting atomized by nanobots, just with extra steps.

It's intellectually dishonest to say that human alignment is impossible. Convincing people through rational arguments is not the only way to achieve human alignment, nor even among the ten most historically common ways. However, achieving global human alignment at the required speed could easily end up worse than just getting atomized by nanobots. Still, if there's a viable path to achieving human alignment somewhere between "arguing with idiots on Twitter" and "literally 1984", then this could still be a surviving world. Survival is just not always pretty...

3 comments


comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-10T20:12:31.453Z · LW(p) · GW(p)

Yes, I think you make some very key points. I think any plan which claims to be coherent but neglects these concerns is fatally flawed. That said, I think it could be useful to expand your conception of what a 'pivotal act' might consist of. What if the thing we really need the Aligned AI to engineer for us is... a better governance system?

What if we could come up with a system of voluntary contracts that enabled decentralized, human-flourishing-aligned governance while gradually eroding the power of centralized governments? Peace, freedom, maximum autonomy insofar as it doesn't hurt others, avoidance of traps like arms races and tragedy-of-the-commons. Is such a thing even possible? Would we be able to successfully distinguish a good plan from a bad one? I don't know. I think it's worth considering, though.

See my comment here for more about what I mean. [LW(p) · GW(p)]

Replies from: panasenco
comment by Aram Panasenco (panasenco) · 2025-01-10T20:53:28.533Z · LW(p) · GW(p)

Thanks so much for engaging, Nathan!

The pivotal act was defined by Yudkowsky, I'm just borrowing the definition. The idea is that even after you've built a perfectly aligned superintelligent AI, you only have about 6 months before someone else builds an unaligned superintelligent AI. That's probably not enough time to convince the entire world to adopt a better governance system before getting atomized by nanobots. So your aligned AI would have to take over the world and forcefully implement this better governance system within a span of a few months.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-10T23:18:48.269Z · LW(p) · GW(p)

Yes, I'm hoping that the better governance system is something that can be accomplished prior to superintelligence. I do agree that the short time frame for implementation seems like the biggest obstacle to success.