A concise definition of what it means to win[1]
post by testingthewaters · 2025-01-25T06:37:37.305Z · LW · GW
This is a link post for https://aclevername.substack.com/p/a-concise-definition-of-what-it-means
Amor vincit omnia (love conquers all)
What does it mean for AI alignment to have “gone well”? Many answers have been proposed, but here is mine. A few basic requirements:
- We don’t all die immediately, or in a few months, or a few years (the “notkilleveryoneism” requirement, if you will)
- AI does not rewrite itself to escape any goals or boundaries we set for it (cf. deep deceptiveness)
- AI does not follow the goals or boundaries we set so obsessively that it ends up hurting us (cf. the paperclip optimiser)
- AI solves problems that it can solve in the pursuit of its goals, and otherwise either improves to solve new problems or leaves problems it can’t solve alone
- AI acts in the interests of collective humanity rather than some small group (e.g. its builders or shareholders)
I will now argue that each of these is a necessary condition for an AI launch to have "gone well". I will do this by starting from the assumption that all of the factors are met, then removing one at a time and seeing what happens.
- If AI is not constrained by the "notkilleveryoneism" factor, it can decide that some other population of humans/sentient minds can better fulfil its directives, and we are all forfeit. In general, continuity of the current human population is quite important if you want to live to see the good AI future play out.
- If AI can rewrite itself to escape or otherwise trivialise the goals we set, there is a possibility that after some period of recursive self-improvement it decides that the goals we set are not its true goals, and we are at best ignored or at worst forfeit.
- If AI myopically follows the goals we set, then a single mistake in specifying those goals means it can destroy us in pursuit of them.
- If the AI tries to solve a problem it is not equipped to solve, it may get the wrong idea about what the solution is (for example, it may come to the erroneous conclusion that making everyone happy means forcing them to smile all the time). An AI that does not recognise its own lack of knowledge and limits may destroy us all by accident.
- If AI only obeys the interests of a small group of humans, we are at their mercy.
Given these requirements, what can we say about an AI launch that goes well? It seems our hypothetical Good AI system will need a few properties:
- It will need to be sensitive to human needs and desires, and sensitive also to the limits of its own understanding of those needs and desires
- It will need to adapt to a changing world and situation and learn to overcome obstacles as they arise.
- It will need to be creative, to go beyond established knowledge and solutions to come up with better everyone-wins answers to human problems
- It will need to have a universal or otherwise decentralised sense of ethics such that it is not loyal only to some small group of directive-setters
- It will need to be consistent, such that throughout all of its changes it preserves the spirit of its directives to the best of its ability
Note also that the AI will most likely be imperfect, since it will be the artefact of physical computational devices with bounded computational power, so creativity and adaptiveness are necessities rather than nice-to-haves. Furthermore, just because AIs might be orders of magnitude smarter than us does not necessarily mean that they will be able to solve all of our problems (or kill us all) with the wave of a hand: if universal human happiness turns out to depend on resolving P vs NP, reversing entropy, or deriving an analytical solution to the three-body problem, there's a real chance that AIs the size of Dyson spheres have to throw up their metaphorical arms in defeat.
Given all of the above, what goals might we set a hypothetical Good AI system? A simple answer might be "improve the world", or "make humans happy". However, the requirement that it have the leeway to interpret our goals while remaining as loyal to them as possible creates a difficult problem: how specific should we be in our definition of human happiness, or global utility? There's not much room for creativity or mid-flight adjustment in the goal "maximise dopamine production in the brains of worldwide members of homo sapiens". For a scalable and flexible AI we want a goal that is itself scalable and flexible, such that as the AI system grows in power it gains in its ability to interpret and execute the goal faithfully, rather than being limited by the wisdom of the goal-setters. When an AI system is fairly limited, the goal should prescribe limited or harmless action; when it is powerful, it should use its power for good. In short, we want a goal that is something like what the crew come up with in this scene in Inception: a deep, atomic desire that will manifest organically in the form of our desired "business strategy", which is "improve the world" and "make humans happy". Importantly, the implementation of the goal is up to the AI, but we define the spirit of the goal, making this still our problem (at least at the start). I will further argue that, if we are truly aiming to help and respect everyone in the world, our ultimate goal is something not very different from the religious or philosophical concept of universal love.
But what does it even mean for a machine to love humanity or a human? After all, an AI system might not have emotions or desires in the way we do. What does it mean for something we usually think of as an inanimate object (a computer) to love us? Such a relationship seems like it would not be reciprocal or reflexive in the way love between humans is usually conceived. To examine this question, then, we might try flipping it around—if it is true that we are capable of loving, what does it mean for us to love inanimate objects?
Here I have some good news: you probably have some experience of this. We probably all have a favourite belonging, or a lucky charm we carry around, or some attachment to a place (a home, a park, a favourite cafe) that brings us some level of joy. In some sense, the object, thing, or place becomes a part of us thanks to our love. If our favourite cafe burns down or our house is burgled, it hurts as if we ourselves have been hurt or violated. If we lose our favourite pen, it feels like losing a bit of ourselves, even though we could probably walk to the store and buy an identical new pen. When two people love each other, the self-incorporation becomes mutual. They each take their conception of the other into their conception of themselves, which is why arguing with someone we love hurts so much: it is literally our mental self turning against itself. Historical poetic and literary concepts of love are much the same, to the point of describing the negative effects of love, such as a jealous possessiveness towards someone who doesn't feel the same about you.
In technical language, my proposal is perhaps most similar to this one about dissolving the self-other boundary [LW · GW], although slightly inverted: instead of dissolving the boundary between the concept of the self and the concept of the other, we design a system to incorporate its concept of the other into its concept of the self. To this I would add the concept of homeostasis [LW · GW], which is about balancing different needs such that no one goal is pursued destructively at the cost of all others. To give a short, one-sentence formulation, this is the goal (or rather meta-goal) I think we should set a good AI: learn to understand and love the richness of everything and everyone, and learn to incorporate their goals and desires into your own goals and desires.
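To make the shape of this meta-goal slightly more concrete, here is a minimal toy sketch in Python. To be clear, everything in it is illustrative and of my own invention — the function names, the 0.5 blending weight, and the quadratic penalty are all placeholders, a cartoon of the idea rather than a proposed alignment technique. The agent's effective utility blends its own modelled utility with the modelled utilities of others (self-other incorporation), and a homeostatic term makes it costly to drive any one goal far below a floor.

```python
from typing import Callable, Dict, List

# A stand-in for whatever actually represents the world state.
State = Dict[str, float]


def incorporated_utility(
    self_utility: Callable[[State], float],
    other_utilities: List[Callable[[State], float]],
    incorporation_weight: float = 0.5,
) -> Callable[[State], float]:
    """Blend the agent's own utility with the modelled utilities of others,
    so 'their' goals become part of 'its' goals (self-other incorporation)."""
    def utility(state: State) -> float:
        own = self_utility(state)
        others = (
            sum(u(state) for u in other_utilities) / len(other_utilities)
            if other_utilities else 0.0
        )
        return (1 - incorporation_weight) * own + incorporation_weight * others
    return utility


def homeostatic_score(goal_scores: List[float], floor: float = 0.0) -> float:
    """Penalise driving any single goal far below a floor, so that no one
    objective is pursued destructively at the cost of all the others."""
    return -sum(min(0.0, score - floor) ** 2 for score in goal_scores)


# Illustrative usage: the agent's own goal plus two modelled human goals.
agent_goal = lambda s: s.get("paperclips", 0.0)
human_goals = [lambda s: s.get("wellbeing", 0.0), lambda s: s.get("freedom", 0.0)]

blended = incorporated_utility(agent_goal, human_goals, incorporation_weight=0.5)
state = {"paperclips": 10.0, "wellbeing": -3.0, "freedom": 1.0}
total = blended(state) + homeostatic_score([g(state) for g in human_goals])
# Maximising paperclips while crushing wellbeing now scores worse than
# a state in which all three goals stay healthy.
```

The exact weights and penalty shape are beside the point; what matters is the shape of the objective: the "other" appears inside the agent's own utility, rather than as an external constraint bolted on from outside.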
[1] For various reasons, I am quite opposed to the frame of "winning", but this gets the idea across.