A concise definition of what it means to win

post by testingthewaters · 2025-01-25T06:37:37.305Z · LW · GW · 0 comments

This is a link post for https://aclevername.substack.com/p/a-concise-definition-of-what-it-means

A concise definition of what it means to win[1]

Amor vincit omnia

What does it mean for AI alignment to have “gone well”? Many answers have been proposed, but here is mine. A few basic requirements:

I will now argue that each of these factors is necessary (even if not sufficient) for an AI launch to have “gone well”. I will do this by starting from the assumption that all of the factors are met, then taking away one at a time and seeing what happens.

Given these requirements, what can we say about an AI launch that goes well? It seems that there are some properties that will need to hold for our hypothetical Good AI system:

Note also that the AI will most likely be imperfect, since it will be the artefact of physical computational devices with bounded computational power, so creativity and adaptiveness are not nice-to-haves but necessities. Furthermore, just because AIs might be orders of magnitude smarter than us does not necessarily mean that they will be able to solve all of our problems (or kill us all) with the wave of a hand: if universal human happiness turns out to depend on resolving P vs. NP, reversing entropy, or deriving an analytical solution to the three-body problem, there is a real chance that even AIs the size of Dyson spheres will have to throw up their metaphorical hands in defeat.

Given all of the above, what goals might we set a hypothetical Good AI system? A simple answer might be “improve the world”, or “make humans happy”. However, the requirement that it have the leeway to interpret our goals but also be as loyal to them as possible creates a difficult problem: how specific should we be in our definition of human happiness, or global utility? There’s not much room for creativity or mid-flight adjustment in the goal “maximise dopamine production in the brains of worldwide members of Homo sapiens”. For a scalable and flexible AI we want a goal that is itself scalable and flexible, such that as the AI system grows in power it gains in its ability to interpret and execute the goal faithfully, rather than being limited by the wisdom of the goal-setters. When an AI system is fairly limited, the goal should prescribe limited or harmless action; when it is powerful, it should use that power for good. In short, we want a goal that is something like what the crew come up with in this scene in Inception: a deep, atomic desire that will manifest organically in the form of our desired “business strategy”, which is “improve the world” and “make humans happy”. Importantly, the implementation of the goal is up to the AI, but we define the spirit of the goal, making this still our problem (at least at the start). I will further argue that, if we are truly aiming to help and respect everyone in the world, our ultimate goal is something not very different from the religious or philosophical concept of universal love.

But what does it even mean for a machine to love humanity or a human? After all, an AI system might not have emotions or desires in the way we do. What does it mean for something we usually think of as an inanimate object (a computer) to love us? Such a relationship seems like it would not be reciprocal or reflexive in the way love between humans is usually conceived. To examine this question, then, we might try flipping it around—if it is true that we are capable of loving, what does it mean for us to love inanimate objects?

Here I have some good news: you probably have some experience of this. We probably all have a favourite belonging, or a lucky charm we carry around, or some attachment to a place (a home, a park, a favourite cafe) that brings us some level of joy. In some sense, the object, thing, or place becomes a part of us thanks to our love. If our favourite cafe burns down or our house is burgled, it hurts as though we have been personally hurt or violated. If we lose a favourite pen, it feels like losing a bit of ourselves, even though we could probably walk to the store and buy an identical new one. When two people love each other, the self-incorporation becomes mutual. They each take their conception of the other into their conception of themselves, which is why arguing with someone we love hurts so much: it is literally our mental self turning against itself. Historical poetic and literary concepts of love are much the same, to the point of describing the negative effects of love, such as jealous possessiveness of someone who doesn’t feel the same way about you.

In technical language, my proposal is perhaps most similar to this one about dissolving the self-other boundary [LW · GW], although slightly inverted: instead of dissolving the boundary between the concept of the self and the concept of the other, we design a system to incorporate its concept of the other into its concept of the self. To this I would add the concept of homeostasis [LW · GW], which is about balancing different needs such that no one goal is pursued destructively at the cost of all the others. To give a short, one-sentence formulation, this is the goal (or rather meta-goal) I think we should set a good AI: learn to understand and love the richness of everything and everyone, and learn to incorporate their goals and desires into your own goals and desires.
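To make the shape of that meta-goal concrete, here is a minimal toy sketch (entirely my own illustration; the `homeostatic_aggregate` function and all the numbers are made up, and nothing here comes from either linked post). It folds other agents’ utilities into the agent’s own objective, and a concave transform supplies the homeostatic property, so that no single term can be maximised at the expense of the rest:

```python
import math

def homeostatic_aggregate(self_utility: float,
                          other_utilities: list[float],
                          incorporation: float = 1.0) -> float:
    """Toy 'self-other incorporation' objective.

    Each other agent's utility is folded into the agent's own objective;
    `incorporation` controls how fully their goals become the agent's goals.
    The concave log1p transform gives the homeostatic property: raising a
    term that is already high yields diminishing returns, so no one goal
    can be pursued destructively at the cost of all the others.
    """
    terms = [self_utility] + [incorporation * u for u in other_utilities]
    # Clamp at zero so log1p stays defined; log1p(x) = log(1 + x).
    return sum(math.log1p(max(t, 0.0)) for t in terms)

# A plan that spreads utility evenly beats a lopsided one, even though
# the lopsided plan has a higher raw total (25 vs. 20).
balanced = homeostatic_aggregate(5.0, [5.0, 5.0, 5.0])
lopsided = homeostatic_aggregate(25.0, [0.0, 0.0, 0.0])
assert balanced > lopsided
```

The concave `log1p` is just one possible choice; any aggregator with diminishing returns in each term would illustrate the same point, which is that this kind of objective rewards helping the worst-off goal more than overshooting an already-satisfied one.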

  1. ^

    For various reasons, I am quite opposed to the frame of "winning", but this gets the idea across.
