Protecting agent boundaries

post by Chipmonk · 2024-01-25T04:13:50.993Z · LW · GW · 6 comments


If the preservation of an agent's boundary is necessary for that agent's safety [LW · GW], how can that boundary/membrane [? · GW] be protected?

How agent boundaries get violated

In order to protect boundaries, we must first understand how they get violated.

Let’s say there’s a cat, and it gets stabbed by a sword. That’s a boundary violation (a.k.a. membrane piercing). In order for that to have happened, three conditions must have been met:

  1. There was a sword.
  2. The cat and the sword collided.
  3. The cat wasn’t strong enough to resist penetration from the sword.

More generally, in order for any existing membrane to be pierced, all three of the following conditions must have been met (a small sketch follows the list):

  1. There was a potential threat. (E.g., a sword, or a person with a sword.)
  2. The moral patient and the threat collided.
  3. The victim failed to adequately defend itself. (Because if the cat was better at self-defense — if its skin was thicker or if it was able to dodge — then it would not have been successfully stabbed.)
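
To make the structure explicit, here is a minimal sketch (in Python, with illustrative names that are not from the post) treating a membrane piercing as the conjunction of these three conditions, so negating any single condition is enough to prevent it:

```python
from dataclasses import dataclass

@dataclass
class Encounter:
    threat_present: bool   # 1. there was a potential threat
    collision: bool        # 2. the moral patient and the threat collided
    defense_failed: bool   # 3. the victim failed to adequately defend itself

def boundary_violated(e: Encounter) -> bool:
    # A membrane is pierced only when all three conditions hold at once.
    return e.threat_present and e.collision and e.defense_failed

# The stabbed cat: the sword exists, they collide, and the cat can't resist.
assert boundary_violated(Encounter(True, True, True))

# Negating any one condition prevents the violation:
assert not boundary_violated(Encounter(False, True, True))  # no threat
assert not boundary_violated(Encounter(True, False, True))  # no collision (the cat dodges)
assert not boundary_violated(Encounter(True, True, False))  # defense succeeds (thicker skin)
```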

Protecting agent boundaries

Each of these three conditions then implies ways of preventing boundary violations (a.k.a. membrane piercing):

1. There was a potential threat. (Prevention: minimize potential threats.)

2. There was a collision. (Prevention: minimize dangerous collisions.)

3. The victim failed to defend itself. (Prevention: empower membranes to be better at self-defense.)

How human societies already try to solve this problem

As a helpful analogy, here are some examples of how modern human societies try to solve this problem:

Minimize potential threats

Minimize dangerous collisions

Empower membranes to be better at self-defense

How this applies to AI safety:

Minimize potential AI threats

(this is obvious/boring so I'm omitting it)

Minimize dangerous AI collisions

(this is obvious/boring so I'm omitting it)

Empower membranes to be better at self-defense

Empower the membranes of humans and other moral patients to be more resilient to collisions with threats. Examples:

6 comments


comment by the gears to ascension (lahwran) · 2024-01-25T04:42:32.543Z · LW(p) · GW(p)

solid, but I still think you're missing structure that makes this approach less effective than it seems on the face of it:

in full generality, what's a "threat"?

in full generality, what's a "dangerous" collision?

I worry that the current failure mode of attempting to empower in order to defend is that the defense is actually used to strike inside another's boundary, as has been the case for ~all weapons

comment by Chipmonk · 2024-01-25T05:03:15.172Z · LW(p) · GW(p)

in full generality, what's a "threat"?

in full generality, what's a "dangerous" collision?

Hm I'm not immediately sure how to define these

comment by Chipmonk · 2024-01-25T05:02:25.282Z · LW(p) · GW(p)

is that the defense is actually used to strike inside another's boundary, as has been the case for ~all weapons

Yeah, I am worried about this. 

This is notably not the case for infosec and encryption, where defensive capability doesn't imply offensive capability. However, I'm unsure if this is also true for any physical interventions. (e.g.:  Vaccines? No, bioweapons… Nanotech? No…)

That said, physical interventions do seem to be defense-dominant when there is coordination among a sufficiently large portion of society/power.

comment by the gears to ascension (lahwran) · 2024-01-25T18:11:12.468Z · LW(p) · GW(p)

I don't think I'm convinced physical interactions are defense-dominant. The easiest-to-formally-certify defense is to enclose something in a hunk of impenetrable matter, and that can only be certified up to a given impact energy level. Above that energy level, the defense will simply be stripped away. Only MAD seems able to be game-theoretically durable, and certifying that a MAD situation will endure requires proving through a simulation of the opposition.

comment by VojtaKovarik · 2024-01-25T19:58:36.100Z · LW(p) · GW(p)

Might be obvious, but perhaps seems worth noting anyway: Ensuring that our boundaries are respected is, at least with a straightforward understanding of "boundaries", not sufficient for being safe.
For example:

  • If I take away all food from your local supermarkets (etc etc), you will die of starvation --- but I haven't done anything with your boundaries.
  • On a higher level, you can wipe out humanity without messing with our boundaries, by blocking out the sun.
comment by Chipmonk · 2024-01-25T21:16:33.858Z · LW(p) · GW(p)

Yes, see Agent membranes/boundaries and formalizing “safety” [LW · GW] and davidad's comment. 

(Also, I'm not necessarily agreeing that your examples are not violations of boundaries. The first one isn't a violation of the end person's boundary (although it probably is of the farmer's). The second one could be.)