How I turned doing therapy into object-level AI safety research

post by Chipmonk · 2024-03-14T01:54:47.290Z · LW · GW · 5 comments

Contents

  Backstory
  Ways in which AI safety boundaries and psychological boundaries are similar
    Minimizing conflict between agents
    Distinct from preferences
    Markov blankets
  Personal development → AI safety research pipeline
  update
5 comments

It surprises me that I was able to turn what I learned from being depressed into object-level AI safety research. 

But it seems like it's working? For example, I'm running a 5-day workshop on the topic next month, and I just ran a 3-day workshop [LW · GW] on the topic last month.

My topic is boundaries [? · GW].

Backstory

For half of 2022, I was pretty depressed. My best explanation for why is that being depressed was a way for me to avoid social interaction when it felt unsafe. One of my largest fears back then was that if I interacted with other people, they would be able to control me. For example, that they would be able to make me feel bad in ways I couldn't resist. So social interaction felt unsafe.

At the same time, I was also trying to control other people to make them like me. But it wasn't working the way I expected, and that left me very confused and suffering over it.

I spoke to a skilled counselor about this, and I realized that I was misunderstanding the natural boundaries between people [LW · GW]. I realized that I cannot actually unilaterally control other people, and they cannot unilaterally control me either. There are natural boundaries!

But as I began to learn more about boundaries, I became frustrated with the way other people spoke about them. The way many people talk about boundaries seems really inconsistent to me [LW · GW]. Most of the time when I heard people say the word "boundaries", it seemed to me like they really meant "preferences".

So I tried to develop my own logically consistent understanding of psychological boundaries instead. 

After a few months (exactly a year ago at the time of writing) I had some conclusions about psychological boundaries, and I was explaining them to a friend (@Ulisse Mini [LW · GW]). And he's an AI safety researcher, so I joked, "And, hey, maybe all of this boundaries stuff applies to AI safety, too. Just have AI respect the natural boundaries and that's safety, right? Haha…"

And he said, "No yeah, Andrew Critch already wrote a series about that [LW · GW]."

I read the series. 

And I had never wanted to work in ""AI Safety"" before, but boundaries seemed like a cool way to understand the world better, so sure why not. I didn't have funding, and I was still unemployed as I had mostly been for the past year. From there, I wrote the boundaries compilation post [LW · GW], created the boundaries/membranes tag [? · GW] on LW, went to AI safety workshops and spoke to a bunch of people about boundaries…

(In the beginning, I'm pretty sure a bunch of the people I spoke to thought I was crazy. Back then I would explain AI safety boundaries by the symmetries with psychology.)

Anyways, I wrote a bunch more posts[1]; planned, got funding for, and ran a 3-day workshop [LW · GW] on boundaries in AI safety (https://formalizingboundaries.ai/), got personal research funding, wrote more things [? · GW], got more funding, and I'm about to run a 5-day Mathematical Boundaries Workshop next month. 

As it turns out, understanding the boundaries (more precisely: the causal distance [LW · GW]) between agents in psychology seems to be very useful for understanding the boundaries between humans and AIs!

A bunch of people are excited about boundaries now, too. But this definitely wasn't the case when I first started.

Ways in which AI safety boundaries and psychological boundaries are similar

Here are a few ways in which I have similar intuitions for boundaries in both AI safety and psychology.

Minimizing conflict between agents

I was first interested in psychological boundaries because of my depression, but I soon saw them as a potential, more general solution to social conflict. I noticed that many social conflicts seemed to reduce to "one person is trying to control another and expecting it to work" or "one person is expecting another person to be able to control them, and the first person could resist but isn't". But both of these are misunderstandings of natural boundaries. People can't actually control each other like this.

At the time, I thought that if people could just understand natural boundaries better, then this would reduce a lot of conflict. So if I could just figure out the correct understanding of natural boundaries and then teach others that, then they'd have a lot less conflict.

(Later I realized that being good at psychological boundaries has little to do with consciously understanding boundaries [LW · GW] and actually requires something deeper, but that's beyond the scope of this post and not that related to AI.)

I took this same intuition to AI safety. "Safety" seemed like absence of conflict. For example, AIs controlling humans or dissolving the boundaries of humans. That would be bad. So maybe just… preserve the boundaries [LW · GW] instead? That intuitively feels like it gets at most of what "safety" is (albeit not full alignment [AF · GW]). 

I think minimizing social conflict turns out to be very similar to minimizing AI un-safety.

Distinct from preferences

Boundaries in both AI safety and psychology are a distinct concept from preferences/desires.

Andrew Critch, «Boundaries» Sequence, Part 3b [? · GW]:

my goal is to treat boundaries as more fundamental than preferences, rather than as merely a feature of them.  In other words, I think boundaries are probably better able to carve reality at the joints than either preferences or utility functions, for the purpose of creating a good working relationship between humanity and AI technology 

In AI safety, I think it's much easier to talk about boundaries than preferences because true boundaries don't really contradict between individuals (humans or AIs).

I found something similar in my psychological research. (What I consider to be) the real boundaries [LW · GW] are distinct from preferences.

(That said, some fraction of the time when people say "I'm setting a boundary: […]", I think they really mean "this is my preference" [LW · GW].)

Markov blankets

I model both AI safety boundaries and psychological boundaries by thinking about Markov blankets / causal distance / causal bottlenecks.

When I explain AI safety boundaries on a technical level, the main formalism I refer to is Markov blankets [LW · GW]. 

Meanwhile, the way that I usually explain [LW · GW] psychological boundaries is "You (the person) are a bundle of communication bandwidth. You have far more causal influence / mutual information with your arm and mind than anyone else, and you have less causal influence / mutual information with everything else and everyone else in the world…"[2] 

And that's basically equivalent to saying, "It is useful for you (as an individual) to model the Markov blanket that exists between you and everyone else."
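
To make that slightly more concrete, here is a minimal sketch in my own notation (not Critch's formalism, and nothing I'm strongly committed to). Split the relevant variables into internal states $I$ (your mind and body), blanket states $B$ (your senses and actions, i.e. the "communication bandwidth"), and external states $E$ (everyone and everything else). The Markov blanket condition is just that inside and outside are conditionally independent given the blanket:

$$p(I, E \mid B) = p(I \mid B)\, p(E \mid B), \quad \text{i.e.} \quad I \perp E \mid B$$

The "bundle of communication bandwidth" framing is then the quantitative version of the same claim: the mutual information between you and your own arm/mind is far higher than the mutual information between you and anything on the other side of the blanket.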

Personal development → AI safety research pipeline

So that's how I turned my depression and doing therapy into object-level AI safety research. 

Going forward, I'm hoping to once again turn insights in my personal life into AI safety intuitions.

Basically, I see boundaries as helpful for "formalizing badness". Anything that dissolves or violates [LW · GW] an important membrane is bad.

(I don't think boundaries formalize all of badness, but I think they get a good chunk.)

My rough hope is that if "badness" can be formally specified, then "the absence of badness" can be written as a spec for provably safe AI. (This is how I interpret davidad's plan [LW · GW]. Also see this post [AF · GW] and this tag [? · GW].)
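
(As a toy illustration of what such a spec could look like, in my words rather than davidad's: let $V(s_t)$ be a predicate meaning "some protected membrane is dissolved or violated in state $s_t$". Then the spec is roughly the negation of $V$, and a policy $\pi$ counts as acceptable when a verifier can certify a bound like

$$\Pr_\pi\!\left[\,\exists\, t \le T : V(s_t)\,\right] \le \varepsilon$$

for some small risk tolerance $\varepsilon$ over a bounded horizon $T$. The hard part is writing down $V$, which is exactly where I hope boundaries/membranes help.)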

But, what is Goodness? What would be necessary for the positive components of alignment [AF · GW]?

For example, in my personal life I've learned how to minimize the conflict I have with others (via boundaries), and I think I've gotten really good at that. But I feel like I haven't yet fully learned how to connect with people and make positive things happen. Similarly, I'm no longer depressed or anxious, but I also haven't totally figured out joy yet.

I have the intuition that if I figured out positive things in my personal life then I would make progress on formalizing goodness more generally. 

So that's the story of my last year. Please reach out if you're a woo AI alignment funder.

Thanks to Alex Zhu for support and encouragement since the beginning.

update

see https://chrislakin.blog/p/social-interaction-inspired-ai-alignment 

  1. ^

    though I later deleted the bad posts

  2. ^

    Fun fact: I recently shared this definition with Frank Yang (enlightened guy on youtube) and he already agreed with me.

5 comments


comment by Chris_Leong · 2024-03-14T14:00:34.780Z · LW(p) · GW(p)

I think it's much easier to talk about boundaries than preferences because true boundaries don't really contradict between individuals


I'm quite curious about this. What if you're stuck on an island with multiple people and limited food?

Replies from: Chipmonk
comment by Chipmonk · 2024-03-14T15:35:56.032Z · LW(p) · GW(p)

Where do you think the boundary is here?

comment by Alex K. Chen (parrot) (alex-k-chen) · 2024-03-14T16:08:56.397Z · LW(p) · GW(p)

Isn't having boundaries also partly to do with full on consent (proactive and retroactive) with your implied preferences being unknown?

Consent is tricky because almost no one who isn't unschooled grows up consenting to anything. People grow used to consenting to things that make them feel unhappy because they don't know themselves well enough, and they trap themselves into structures that punish you for dropping out or for not opting into anything. In that sense, the system does not respect your own boundaries for your own self-autonomy - your actions don't have the proper Markov boundary from the rest of the system and thus you can't act as an independent agent. Some unschooled people have the most robust Markov boundaries. The very structure of many school and work environments (one that penalizes work at home) is one that inherently creates power structures that cross people's boundaries, especially their energetic ones.

Even the state starts out by eroding some of the boundaries between person and state, without consent..

These people have stronger boundaries on ONE layer of abstraction - https://www.thepsmiths.com/p/review-the-art-of-not-being-governed?utm_source=profile&utm_medium=reader2. This does not necessarily translate to better boundaries on the object level

https://twitter.com/karpathy/status/1766509149297189274?t=ms8cmXL0em2zB4xdJyUblA&s=19 on mimetic boundaries

(Now that AI is creating new wealth very quickly, it becomes more possible for people to default not consent to all the mazes that everyone else seemingly "consents to"). Zvi's mazes post makes sense here

Replies from: Chipmonk
comment by Chipmonk · 2024-03-14T19:53:15.881Z · LW(p) · GW(p)

I'd like to reply to your comment but I didn't understand your first sentence

comment by sweenesm · 2024-03-14T12:03:08.560Z · LW(p) · GW(p)

Thanks for the interesting post! I agree that understanding ourselves better through therapy or personal development is a great way to gain insights that could be applicable to AI safety. My personal development path got started mostly due to stress from not living up to my unrealistic expectations of how much I "should" have been succeeding as an engineer. It got me focused on self-esteem, and that's a key feature of the AI safety path I'm pursuing [LW · GW].

If other AI safety researchers are interested in a relatively easy way to get started on their own path, I suggest this online course which can be purchased for <$20 when on sale: https://www.udemy.com/course/set-yourself-free-from-anger

Good luck on your boundaries work!