Social interaction-inspired AI alignment
post by Chipmonk · 2024-06-24T08:10:08.719Z · LW · GW · 2 comments
This is a link post for https://chrislakin.blog/p/social-interaction-inspired-ai-alignment
Contents
Potential past example: boundaries
Potential future example: Goodness
Why might alignment be like social interaction?
2 comments
Conjecture: Understanding the psychology of human social interaction will help with AI alignment.
And personally, I think this will likely form a relatively large and neglected component of alignment research.
Below: one potential past example and one potential future example.
Potential past example: boundaries
Boundaries seem to be a useful concept in both human psychology and AI safety.
Boundaries codify the autonomy of individuals. When the boundaries between individuals are preserved, unnecessary conflict between those individuals is largely avoided. I think boundaries could help specify the safety in AI safety [LW · GW]. (See claim 4 here [LW · GW], #9 here [AF · GW].)
My work in this area has mostly been writing distillations [? · GW] and convening researchers [CBW retrospective, MBW retrospective [LW · GW]]. I became interested in this topic after thinking about the natural boundaries between humans: it was my interest in psychological boundaries that led me to study the boundaries/causal distance [LW · GW] between agents in general. I reasoned that these ideas would be helpful for understanding the boundaries between humans and AIs, and, as it turned out, other researchers were already [LW · GW] thinking about this.
[FAQ: How might boundaries be applied in AI safety? The most near-term way is as a formal spec [AF · GW] for provably safe AI [AF · GW] in davidad’s [LW · GW] £59m ARIA programme.]
My interest in, and intuitive understanding of, AI safety boundaries came from psychology. I don’t know whether this was the case for others interested in boundaries, but I do wonder. For example, how much was Andrew Critch’s thinking about boundaries between agents [LW · GW] and boundaries in AI safety [LW · GW] inspired by thinking about boundaries in human social interaction [? · GW]?
Potential future example: Goodness
Conjecture: Understanding ‘Goodness’ in human social interaction will help with AI alignment — potentially greatly.
Context: One way I like to think about what I want from ‘full alignment [AF · GW]’ is in terms of two (somewhat-independent) properties:
I want Goodness to be present and unsafety to be absent.
(Is there a better term than “Goodness”?)
Also, notice that I’m not smooshing Goodness and Safety into one axis (one that might more commonly be called “Utility”). I think these can’t be cleanly placed on the same spectrum.
Recall that I see boundaries as a way to mostly specify safety. However, even if you’re safe, that doesn’t necessarily mean that goodness is present. So boundaries don’t necessarily specify Goodness. Open question: How can Goodness be specified?
At the same time, in my psychology thinking, I’ve been wondering: What causes joy, connection, and collaboration? What generates Goodness?
In my own life, once I learned to do boundaries well, I became much less concerned about social conflicts. And while I was glad to feel less anxious and more safe, I also wasn’t immediately and automatically connecting with other people / being actively happy / feeling Goodness.
What can I do to create Goodness?
I don’t expect what I’m about to say to convince anyone who isn’t already convinced, but currently, I suspect that the most common missing factor for Goodness, in both psychology and AI alignment, is actually collective intelligence. I’ll leave the explanation to another post.
But if that’s right, I think the best feedback loop we have for understanding collective intelligence in general is to understand the collective intelligence that already exists in human social interactions.
Why might alignment be like social interaction?
As Michael Levin says, “all intelligence is collective intelligence”. There is no such thing as a significant intelligence that is truly centralized. Every intelligence worth worrying about is made of smaller parts, and those parts must figure out how to coordinate and align with each other.
In which case, I think alignment is really a question of, How do you align parts to a greater whole? How do you avoid internal conflicts (e.g., cancer)?
Social interaction psychology deals with the same questions.
Thanks to thinking partners Alex Zhu, Adam Goldstein, Ivan Vendrov, and David Spivak. Thanks to Stag Lynn for help editing.
2 comments
Comments sorted by top scores.
comment by Seth Herd · 2024-06-24T22:21:14.858Z · LW(p) · GW(p)
I'm not sure I'm totally following, but briefly FWIW:
This reminds me of an idea for a post, "Friendliness is probably a fairly natural abstraction, but Human is probably not".
I'd call your goodness either niceness or friendliness. I'd define it as the tendency to want to help beings that have desires to achieve those desires. I think the source in the natural world is the game-theoretic utility of cooperation for humans and other mammals (among other hypothetical life forms). This utility gives evolution an inclination to create mechanisms that make us good, nice, or friendly.
I don't think this in itself helps hugely with alignment, except to note that there's probably not a useful universal ethics that AGI will discover on its own. Even though the above could be taken as a universal ethics of sorts, we humans won't like the results; lots of beings beside us have desires, so a maximally friendly being will probably end up optimizing us away in favor of something that fits its criteria better. For more on how we'd probably lose out under a universal weighing of good done to sentient beings, to either insects below us or ASIs of some sort above us, see Roger Dearnaley's 5. Moral Value for Sentient Animals? Alas, Not Yet [LW · GW] and the remainder of that sequence.
The argument is that we probably need some sort of human chauvinism if we want human survival, and unfortunately I find this argument compelling. If there were a universal ethics, we probably wouldn't much like it. And there's just probably not. The game-theoretic virtue of cooperation applies to humans and many creatures, but not to all equally; a being that can improve its capabilities and make copies of itself would seem to have little need of cooperation.
Switching directions: Your claim that all intelligence is social intelligence seems wrong. Humans are trained by other humans, but we can accomplish a lot alone. The worry is that we get the same thing with an AGI, and it lacks both the cognitive weaknesses and arbitrary inlaid drives that make humans largely social creatures.
Replies from: Chipmonk
↑ comment by Chipmonk · 2024-06-25T13:29:47.516Z · LW(p) · GW(p)
Thanks for commenting.
Your claim that all intelligence is social intelligence seems wrong. Humans are trained by other humans, but we can accomplish a lot alone.
Hm, would it help if I clarified that individual human minds have multiple internal parts [? · GW] too? So even when "alone" humans are still social by this definition.