Is AI Alignment Enough?

post by Aram Panasenco (panasenco) · 2025-01-10T18:57:48.409Z · LW · GW · 3 comments

Contents

  Defining humanity's terminal goal
  Defining humanity's instrumental goals
  Assumptions
  Why human alignment is the primary instrumental goal
  What about always-on AIs?
  Focus on human alignment
  Fast human alignment is possible without AI
3 comments

Virtually everyone I see in the AI safety community seems to believe that working on AI alignment is the key to ensuring a safe future. However, it seems to me that AI alignment is at best a secondary instrumental goal that can't in and of itself achieve our terminal goal. At worst, it's a complete distraction.

Defining humanity's terminal goal

I'll define humanity's terminal goal in the context of AI as keeping the "price" of each pivotal superhuman engineering task an AI does for us below a 50% chance of a billion or more human deaths. These numbers come from the minimal acceptable definition of AI alignment in Yudkowsky's list of lethalities [LW · GW].

It'd be more precise to say that we care about an AI killing over a billion people or doing something else that's equally or more horrible by human standards. I can't define exactly what those horrible actions are (if I could, alignment would perhaps be halfway solved). This disclaimer is too long to keep writing out, so please mentally add "or do something equally horrible by human standards" every time you see "kill over a billion people" below.
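
As a rough formalization (the notation is mine, not Yudkowsky's, and "equivalent catastrophe" stands in for the disclaimer above), the terminal goal is that for every pivotal superhuman engineering task T we ask an AI to perform:

    \[
    P\big(\geq 10^9 \text{ deaths or an equivalent catastrophe} \,\big|\, \text{the AI performs pivotal task } T\big) < 0.5
    \quad \text{for every pivotal task } T.
    \]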

Defining humanity's instrumental goals

Logically, there are only two paths to ensuring that the price of each pivotal superhuman engineering task stays below this threshold.

  1. Achieving AI alignment: I'll again use Yudkowsky's minimal definition of "aligned" as the AI having less than a 50% chance of killing over a billion people per pivotal superhuman engineering task. This is the same definition as in our terminal goal.
  2. Achieving human alignment: Ensuring that humanity will not build (let alone turn on) an AI that has the cognitive power to have a 50% chance of killing over a billion people unless the following two conditions are met:
    1. The AI can be proven to be aligned before it's turned on.
    2. There's some pivotal superhuman engineering task that the AI will be capable of that's worth taking the risk for.

Assumptions

Why human alignment is the primary instrumental goal

Suppose that AI alignment, as defined above, is achieved immediately, today. Yudkowsky makes the point [LW · GW] that this will not prevent AI labs that don't care about alignment from building an unaligned superintelligent AI. Therefore, an AI with a 50% or higher chance of killing over a billion people will still get built and turned on, and we will have failed to achieve our terminal goal.

Yudkowsky solves this problem with a 'pivotal act': something you can get the aligned superintelligent AI to do to prevent any other labs from building unaligned AI. This shows that achieving AI alignment is not enough - you must then have a plan to do something with it, and that something has to prevent humanity from recklessly building AIs with greater and greater cognitive powers. If you tell the aligned superintelligent AI to "burn all GPUs" (to borrow Yudkowsky's example), what you're actually doing is achieving human alignment by force rather than by persuasion. I'm not saying this in a condemning manner at all; I'm just pointing out that this is another path to human alignment. In the end, it's human alignment that's necessary to achieve our terminal goal.

On the other hand, if human alignment is achieved without AI alignment, then humanity will prevent a superintelligent AI from getting built until it can be proven to be aligned before it's turned on (which may be never). The terminal goal is satisfied.

In summary, human alignment is both necessary and sufficient to achieve our terminal goal. AI alignment is at most useful as a secondary instrumental goal for bringing about human alignment.

What about always-on AIs?

This section doesn't have an effect on the overall argument, but I'm including it for logical completeness.

In addition to AIs used for pivotal superhuman engineering tasks, humanity will also have some (hopefully weaker) AIs constantly running, just maintaining stuff. We need a threshold for how many incidents we will tolerate from those always-on AIs, which may not have the cognitive power to do pivotal superhuman engineering tasks but may well still be capable of killing us all. I'll define our second terminal goal as having less than a 1% chance of a billion or more human deaths per year from the combined total of these always-on AIs (the numbers are completely arbitrary).

To differentiate between the two types of AIs defined in the two terminal goals, I'll use the terms "superintelligent AIs" for the AIs that would be used for pivotal superhuman engineering tasks and "always-on AIs" for the AIs that would be always on. Unfortunately, there could be some overlap between the two sets, as there's nothing theoretically stopping humanity from keeping an AI capable of pivotal superhuman engineering tasks always on...

We'll define instrumental goals for the always-on AIs:

  1. Achieving always-on AI alignment: There are many always-on AIs, and their failures are almost certainly not independent variables, but we somehow work things out so that the combined set of all of them has less than a 1% chance of killing over a billion people in any given year. If we can achieve that, we call that combined set of always-on AIs aligned (a toy numerical sketch of this condition follows the list).
  2. Achieving human alignment: Ensuring that humanity will not continually run a combined set of AIs that has a 1% or higher chance of killing over a billion people per year.
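
As a toy illustration of condition 1 (entirely my own sketch, with made-up per-system numbers), here's how one might sanity-check the combined annual risk. The union bound holds no matter how correlated the failures are, while an independence estimate can be off in either direction:

    # Toy sketch: checking whether a set of always-on AIs stays under the
    # 1%-per-year combined catastrophe threshold. The per-system probabilities
    # below are hypothetical placeholders, not estimates from the post.

    per_system_annual_risk = [0.001, 0.002, 0.0005]  # P(catastrophe in a given year) for each AI
    THRESHOLD = 0.01  # the (arbitrary) 1%-per-year terminal-goal threshold

    # Union bound: a valid upper bound on P(at least one catastrophe),
    # regardless of how the failures are correlated.
    union_bound = sum(per_system_annual_risk)

    # Independence estimate: P(at least one) = 1 - product of P(no catastrophe).
    # If failures are correlated (they almost certainly are), this can mislead
    # in either direction, so the union bound is the safer check.
    no_catastrophe = 1.0
    for p in per_system_annual_risk:
        no_catastrophe *= (1.0 - p)
    independence_estimate = 1.0 - no_catastrophe

    print(f"union bound:           {union_bound:.2%}")            # 0.35%
    print(f"independence estimate: {independence_estimate:.2%}")  # ~0.35%
    print("within threshold:", union_bound < THRESHOLD)           # True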

Without human alignment, there's nothing stopping labs from keeping an arbitrary number of superintelligent AIs that should normally be reserved for pivotal superhuman engineering tasks always on, thereby skyrocketing the risk. Therefore, human alignment is once again the necessary and sufficient condition for achieving our terminal goals.

Focus on human alignment

It's important for us to understand that AI alignment alone will not achieve our terminal goal(s).

Fast human alignment is possible without AI

I purposefully chose the cover of Hobbes' 1651 Leviathan as the preview image. Leviathan is a foundational work of modern social contract theory, and its front cover depicts the central idea of the book - many individual humans becoming a single entity, or, we could say, becoming "aligned" as a single will. Hobbes wrote Leviathan in response to the shock of witnessing the brutality of the English Civil War. To me, Leviathan is a cry that anything, including submitting unconditionally to absolute power, is better than the horror of war.

The closest thing to Hobbes' experience for me personally is being born in the former Soviet Union. I'll relay my understanding of the Soviet experience that I absorbed through osmosis. If there are people here with more knowledge and experience, please correct me if I got it wrong.

Imagine that, from a very early age, you were shown a tapestry depicting some glorious future of humanity. You were taught that you would contribute to that vision once you grew up. Then you grow up, full of bright-eyed dreams, and when you pull back the tapestry, there is a giant meatgrinder. When the Party tells you to get in the meatgrinder, you get in. You don't ask how people getting ground in the meatgrinder will contribute to the future on the tapestry. You don't ask whether getting in the meatgrinder is the best use of your talents and aspirations as a human being. You let the Party worry about the future; you just worry about obeying the Party.

I believe that most of the revolutionaries of 1917 were full of good intentions to save humanity from what they saw as the meatgrinder of capitalism, and they really didn't intend to build an even worse meatgrinder themselves. Some may argue that human meatgrinders don't last forever, so a meatgrinder is still better than extinction. To that I respond that AI safety may become associated with human meatgrinders in the same way communism now is in Eastern Europe. And if the idea of AI safety comes to be resented on a visceral level by a large enough number of people, then humanity is probably still getting atomized by nanobots, just with extra steps.

It's intellectually dishonest to say that human alignment is impossible. Convincing people through rational arguments is not the only way to achieve human alignment, nor even among the ten most historically common ways. However, achieving global human alignment at the required speed could easily end up worse than just getting atomized by nanobots. Still, if there's a viable path to achieving human alignment somewhere between "arguing with idiots on Twitter" and "literally 1984", then this could still be a surviving world. Survival is just not always pretty...

3 comments


comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-10T20:12:31.453Z · LW(p) · GW(p)

Yes, I think you make some very key points. I think any plan which claims to be coherent but neglects these concerns is fatally flawed. That said, I think it could be useful to expand your conception of what a 'pivotal act' might consist of. What if the thing we really need the Aligned AI to engineer for us is... a better governance system?

What if we could come up with a system of voluntary contracts that enabled decentralized, human-flourishing-aligned governance while gradually eroding the power of centralized governments? Peace, freedom, maximum autonomy insofar as it doesn't hurt others, avoidance of traps like arms races and tragedy-of-the-commons. Is such a thing even possible? Would we be able to successfully distinguish a good plan from a bad one? I don't know. I think it's worth considering, though.

See my comment here for more about what I mean. [LW(p) · GW(p)]

Replies from: panasenco
comment by Aram Panasenco (panasenco) · 2025-01-10T20:53:28.533Z · LW(p) · GW(p)

Thanks so much for engaging, Nathan!

The pivotal act was defined by Yudkowsky, I'm just borrowing the definition. The idea is that even after you've built a perfectly aligned superintelligent AI, you only have about 6 months before someone else builds an unaligned superintelligent AI. That's probably not enough time to convince the entire world to adopt a better governance system before getting atomized by nanobots. So your aligned AI would have to take over the world and forcefully implement this better governance system within a span of a few months.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-10T23:18:48.269Z · LW(p) · GW(p)

Yes, I'm hoping that the better governance system is something that can be accomplished prior to superintelligence. I do agree that the short time frame for implementation seems like the biggest obstacle to success.