«Boundaries/Membranes» and AI safety compilation

post by Chipmonk · 2023-05-03T21:41:19.124Z · LW · GW · 17 comments

Contents

  «Boundaries» definition recap:
  Posts & researchers that link «Boundaries» and AI safety
    Davidad’s OAA
    Andrew Critch
    Scott Garrabrant
    Mark Miller
    Other researchers interested:
    Miscellaneous connections
    What I may have missed
  Closing notes
      Please contact me with any «Boundaries»-related tips, ideas, or requests.

In this post I outline every post I could find that meaningfully connects the concept of «Boundaries/Membranes» (tag [? · GW]) with AI safety.[1] This seems to be a booming subtopic: interest has picked up substantially within the past year. 

Update (2023 Dec): we're now running a workshop [LW · GW] on this topic!

Perhaps most notably, Davidad includes the concept in his Open Agency Architecture for Safe Transformative AI alignment paradigm. For a preview of the salience of this approach, see this comment [LW · GW] by Davidad (2023 Jan):

“defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility. 

This post also compiles work from Andrew Critch, Scott Garrabrant, Mark Miller, and others. But first I will recap what «Boundaries» are:

«Boundaries» definition recap:

You can see «Boundaries» Sequence [? · GW] for a longer explanation, but I will excerpt from a more recent post by Andrew Critch, 2023 March [LW · GW]:

By boundaries, I just mean the approximate causal separation of regions in some kind of physical space (e.g., spacetime) or abstract space (e.g., cyberspace).  Here are some examples from my «Boundaries» Sequence [? · GW]:

  • a cell membrane (separates the inside of a cell from the outside);
  • a person's skin (separates the inside of their body from the outside);
  • a fence around a family's yard (separates the family's place of living-together from neighbors and others);
  • a digital firewall around a local area network (separates the LAN and its users from the rest of the internet);
  • a sustained disassociation of social groups (separates the two groups from each other)
  • a national border (separates a state from neighboring states or international waters).

Also, beware [? · GW]:

When I say boundary, I don't just mean an arbitrary constraint or social norm.

Update: see Agent membranes and causal distance [LW · GW] for a better exposition of the agent membranes/boundaries idea.

Posts & researchers that link «Boundaries» and AI safety

All bolding in the excerpts below is mine.

Davidad’s OAA

Saliently, Davidad uses «Boundaries» for one of the four hypotheses he outlines in An Open Agency Architecture for Safe Transformative AI [LW · GW] (2022 Dec):

  • Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in [0, ∞), such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible[2] [LW(p) · GW(p)] with high probability, using scientifically-accessible technologies.

Further explanation of this can be found in Davidad's Bold Plan for Alignment: An In-Depth Explanation [LW · GW] (2023 Apr) by Charbel-Raphaël and Gabin:

Getting traction on the deontic feasibility hypothesis 

Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
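For readers unfamiliar with the formalism: the following is my own gloss of the standard Markov-blanket factorization, not Davidad's exact encoding, which (as far as I know) has not been published in full.

```latex
% Partition the world-model state into internal states \mu, external
% states \eta, and blanket states b = (s, a) (sensory and active).
% The blanket condition says inside and outside are conditionally
% independent given the blanket:
p(\mu, \eta \mid b) \;=\; p(\mu \mid b)\, p(\eta \mid b)
% A boundary-preservation desideratum could then be phrased as a
% feature that stays near 0 only while the policy's influence on the
% internal states \mu is routed exclusively through the blanket b.
```

This is one plausible reading of "not crossing boundary lines": the AI may interact with a system through its sensory/active states, but may not bypass them to act on internal states directly.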

Also: 

(*) Elicitors: Language models assist humans in expressing their desires using the formal language of the world model. […] Davidad proposes to represent most of these desiderata as violations of Markov blankets. Most of those desiderata are formulated as negative constraints because we just want to avoid a catastrophe, not solve the full value problem. But some of the desiderata will represent the pivotal process that we want the model to accomplish.

Also see this comment [LW · GW] by Davidad (2023 Jan):

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility. 

Reframing inner alignment [LW · GW] by Davidad (2022 Dec):

I'm also excited about Boundaries [? · GW] as a tool for specifying a core safety property to model-check policies against—one which would imply (at least) nonfatality—relative to alien and shifting predictive ontologies.

I’ve also collected all of Davidad’s tweets about «Boundaries» into this twitter thread.

Update 2023 May: I've written a post about how Davidad conceives of «boundaries» applying to alignment: «Boundaries» for formalizing a bare-bones morality [LW · GW].

Update 2023 August: Davidad explains this most directly in A list of core AI safety problems and how I hope to solve them [AF · GW]:

9. Humans cannot be first-class parties to a superintelligence values handshake.

[…]

OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries [? · GW] of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.

Update 2024 Jan 28: See Davidad's reply to this comment [LW(p) · GW(p)] about specific examples of boundary violations.

Andrew Critch

Andrew Critch has written the «Boundaries» Sequence [? · GW], with four posts to date: 

AI alignment is a notoriously murky problem area, which I think can be elucidated by rethinking its foundations in terms of boundaries between systems, including soft boundaries and directional boundaries. […] I'm doing that now, for the following problem areas:

  • Preference plasticity & corrigibility
  • Mesa-optimizers
  • AI boxing / containment
  • (Unscoped) consequentialism
  • Mild optimization & impact regularization
  • Counterfactuals in decision theory

[…] 

You may notice that throughout this post I've avoided saying things like "the humans prefer that {some boundary} be respected".  That's because my goal is to treat boundaries as more fundamental than preferences, rather than as merely a feature of them.  In other words, I think boundaries are probably better able to carve reality at the joints [LW · GW] than either preferences or utility functions, for the purpose of creating a good working relationship between humanity and AI technology.

Critch also included «Boundaries» in his plan for Encultured AI [LW · GW] (2022 Aug):

boundaries may be treated as constraints, but they are more specific than that: they delineate regions or features of the world in which the functioning of a living system occurs.  We believe many attempts to mollify the negative impacts of AI technology in terms of “minimizing side effects” or “avoiding over-optimizing” can often be more specifically operationalized as respecting boundaries.  Moreover, we believe there are abstract principles for respecting boundaries that are not unique to humans, and that are simple enough to be transferable across species and scales of organization. […]

And most recently, Critch wrote Acausal normalcy [LW · GW] (2023 March):

Which human values are most likely to be acausally normal? 

A complete answer is beyond this post, and frankly beyond me.  However, as a start I will say that values to do with respecting boundaries are probably pretty normal from the perspective of acausal society.

Scott Garrabrant

Andrew Critch connects «Boundaries» to Scott Garrabrant’s Cartesian Frames (in Part 3a [LW · GW] of his «Boundaries» Sequence):

The formalism here is a lot like a time-extended version of a Cartesian Frame (Garrabrant, 2020), except that what Scott calls an "agent" is further subdivided here into its "boundary" and its "viscera".

See Cartesian Frames [? · GW] (Intro [? · GW]) (2020 Oct) for a related formalization of the «Boundaries» core concept.  

Cartesian frames are a way to add a first-person perspective (with choices, uncertainty, etc.) on top of a third-person "here is the set of all possible worlds," in such a way that many of these problems either disappear or become easier to address.
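As a quick sketch of the formal object being discussed (my summary of Garrabrant's definition, worth checking against the sequence itself):

```latex
% A Cartesian frame over a set of possible worlds W (Garrabrant, 2020)
% is a triple
C \;=\; (A, E, \cdot), \qquad \cdot : A \times E \to W
% where A is the agent's set of options, E is the environment's set of
% options, and a \cdot e is the world that results from that pair of
% choices. Critch's boundary formalism then subdivides the "agent"
% side further, into a boundary and viscera, rather than treating the
% agent as a black box.
```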

Note: See this summary by Rohin Shah [LW · GW] for a conceptual summary of Cartesian Frames.

Scott Garrabrant also wrote Boundaries vs Frames [LW · GW] (2022 Oct) which compares the two concepts.

Note: I suspect Garrabrant’s work on Embedded Agency (pre- Cartesian Frames) and Finite Factored Sets (post- Cartesian Frames) are also related, but I haven’t looked into this myself.

Mark Miller

Mark Miller, Senior Fellow at the Foresight Institute (wiki), has worked on the Object-capability model, which applies «boundaries» to build secure computer systems. The goal is to ensure that only the processes that should have read and/or write permission to a resource actually hold that permission; this can then be enforced with cryptography.
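To make the connection to «boundaries» concrete, here is a minimal sketch of the object-capability idea in Python (my illustration, not Miller's code; the class and function names are hypothetical): a process can only touch a resource if it holds a reference, i.e. a capability, to it, and authority can be attenuated by passing a weaker capability across the boundary.

```python
# Object-capability sketch: authority flows only through references.
# There is no "ambient authority" by which code can reach the store
# except via a capability object it was explicitly handed.

class ReadCapability:
    """Grants read-only access to an underlying store."""
    def __init__(self, store):
        self._store = store

    def read(self, key):
        return self._store[key]


class ReadWriteCapability(ReadCapability):
    """Grants read and write access; attenuate by handing out
    only a ReadCapability instead."""
    def write(self, key, value):
        self._store[key] = value


store = {"config": "v1"}
rw = ReadWriteCapability(store)
ro = ReadCapability(store)  # attenuated capability for less-trusted code


def untrusted_consumer(cap):
    # This code can only do what its capability allows; it has no
    # other path to `store`.
    return cap.read("config")


assert untrusted_consumer(ro) == "v1"
assert not hasattr(ro, "write")  # the boundary: no write authority passed
```

The membrane analogy: the capability object is the only channel through which effects cross into the resource, so controlling which capabilities are passed controls the boundary.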

Other researchers interested:

John Wentworth (@johnswentworth [LW · GW])

John Wentworth lists boundaries in a comment [LW(p) · GW(p)] addressing “what's my list of open problems in understanding agents?”:

I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of "how do agents work?"-style questions and subquestions form the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents)

[…]

  • What's up with boundaries [LW · GW] and modularity?
    • To what extent [LW · GW] do boundaries/modules typically exist "by default" in complex systems, vs require optimization pressure (e.g. training/selection) to appear?
    • Why are biological systems so modular [LW · GW]? To what extent will that generalize to agents beyond biology?
    • How modular are trained neural nets? Why, and to what extent will it generalize?
    • What is the right mathematical language in which to talk about modularity, boundaries, etc?
    • How do modules/boundaries interact with thermodynamics - e.g. can we quantify the negentropy/bits-of-optimization requirements to create new boundaries/modules, or maintain old ones?
    • Can we characterize the selection pressures on transboundary transport/information channels in a general way?
    • To what extent do agents in general form internal submodules? Why?

[…]

He also wrote in this comment [LW(p) · GW(p)] that he considers boundaries to be prerequisite for understanding ‘agenty’ phenomena (2023 Apr).

Also see: Content and Takeaways from SERI MATS Training Program with John Wentworth: Week 4, Day 1 - Boundaries Exercises [LW · GW] (2022 Dec) where the «Boundaries» concept is used as a SERI MATS training exercise.

[There is likely to be other content I’ve missed from John Wentworth.]

Vladimir Nesov (@Vladimir_Nesov [LW · GW])


Miscellaneous connections

I’ve also created a “Boundaries [technical]” tag [? · GW], and tagged all of the «Boundaries»-related[2] LW posts I could find. 

What I may have missed

There are surely many topics which I haven’t yet looked into which deserve to be linked in this post. I have noted those that I think are likely to be related below. 

If you know of any other posts I should link in this post, let me know and I’ll add them. 

Closing notes

I’m personally extremely excited about this topic, and I will be covering further developments. 

I am also writing several more posts on the topic. Subscribe to my posts [LW · GW] and/or the boundaries [technical] tag [? · GW] to get notified.

Please contact me with any «Boundaries»-related tips, ideas, or requests.

 

Post last edited: 2023-05-30.
 

  1. ^

    Here's why I use the word "membranes" as opposed to "boundaries": "Membranes" is better terminology than "boundaries" alone [LW · GW].

  2. ^

    («Boundaries»/boundaries [technical]-related posts, not necessarily “boundaries”-related posts [LW(p) · GW(p)].)

17 comments

Comments sorted by top scores.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-04T15:29:16.186Z · LW(p) · GW(p)

Lovely! I think this is valuable. This comment is just to cheer you on. I hope that's allowed by LW rules.

Replies from: Raemon, Chipmonk
comment by Raemon · 2023-05-04T18:58:54.222Z · LW(p) · GW(p)

Yup, that's fine. :)

comment by Chipmonk · 2023-05-07T14:49:57.904Z · LW(p) · GW(p)

Thanks:)

Replies from: awg
comment by awg · 2023-05-07T15:48:51.442Z · LW(p) · GW(p)

Chiming in to also say that I think this post is valuable. But not only this post--posts like this in general. I really appreciate the work of you and people like you who are able to take a complex topic explained across multiple posts/sequences and by multiple people and distill it into a concise summary that feels approachable and understandable while also giving relevant links to find deeper material.

Like I think this is really, extremely valuable. So thank you and I look forward to reading more from you and anyone else who wants to submit work like this here!

comment by Roman Leventov · 2023-06-12T12:10:18.851Z · LW(p) · GW(p)

To what extent do boundaries/modules typically exist "by default" in complex systems, vs require optimization pressure (e.g. training/selection) to appear?

Dalton Sakthivadivel showed here that boundaries (i.e., sparse couplings) do exist and are "ubiquitous" in high-dimensional (i.e., complex) systems.

comment by Roman Leventov · 2023-06-11T06:00:20.927Z · LW(p) · GW(p)

Getting traction on the deontic feasibility hypothesis


Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.

I think any finitely-specified deontology wouldn't ensure existential safety; more likely, following just a finite deontology (such as "don't interfere with others' boundaries") can lead to a dystopian scenario for humanity.

In my current meta-ethical view, ethics is a style of behaviour (i.e., dynamics of a physical system) that is inferred by the system (or its supra-system, such as in the course of genetic or cultural evolution). The style could be characterised/described in the context of multiple different (or, perhaps infinitely many) modelling frameworks/theories for describing the dynamics of the system (perhaps, on various levels of description). Examples of such modelling frameworks are "raw" neural dynamics/connectomics (note: this is already a modelling framework, not the "bare" reality!), Bayesian Brain/Active Inference, Reinforcement Learning, cognitive psychology, evolutionary game theory, etc. All these theories would lead to somewhat different descriptions of the same behaviour which don't completely cover each other[1].

It seems easy to find counterexamples when intruding into someone's boundaries is an ethical thing to do and abstaining from that would be highly unethical. Sorting out multilevel conflicts/frustrations between infinitely many system/boundary partitions of the world[2] in the context of infinitely many theoretical frameworks (such as quantum mechanics[3], neural network framework[4], theory of conscious agents[5], etc.) should guide the attainment of the best ethical style that we (AI agents) can attain, but I think it couldn't nearly be captured by a single deontic rule.

  1. ^

    However, in "Mathematical Foundations for a Compositional Account of the Bayesian Brain" (2022), Smithe establishes that it might be possible to formally convert between these frameworks using category theory.

  2. ^

    Vanchurin, V., Wolf, Y. I., Katsnelson, M. I., & Koonin, E. V. (2022). Toward a theory of evolution as multilevel learning. Proceedings of the National Academy of Sciences, 119(6), e2120037119. https://doi.org/10.1073/pnas.2120037119

  3. ^

    Fields, C., Friston, K., Glazebrook, J. F., & Levin, M. (2022). A free energy principle for generic quantum systems. Progress in Biophysics and Molecular Biology, 173, 36–59. https://doi.org/10.1016/j.pbiomolbio.2022.05.006

  4. ^

    Vanchurin, V. (2020). The World as a Neural Network. Entropy, 22(11), 1210. https://doi.org/10.3390/e22111210

  5. ^

    Hoffman, D. D., Prakash, C., & Prentner, R. (2023). Fusions of Consciousness. Entropy, 25(1), 129.

Replies from: Chipmonk
comment by Chipmonk · 2023-06-11T11:13:28.924Z · LW(p) · GW(p)

Okay, I'll try to summarize your main points. Please let me know if this is right

  1. You think «membranes» will not be able to be formalized in a consistent way, especially in a way that is consistent across different levels of modeling
  2. "It seems easy to find counterexamples when intruding into someone's boundaries is an ethical thing to do and obtaining from that would be highly unethical."

Have I missed anything? I'll respond after you confirm.

Also, would you please share any key example(s) of #2?

Replies from: Roman Leventov
comment by Roman Leventov · 2023-06-12T11:51:23.730Z · LW(p) · GW(p)

You think «membranes» will not be able to be formalized in a consistent way, especially in a way that is consistent across different levels of modeling

No, I think membranes could be formalised (Markov blankets, objective "joints" of the environment as in https://arxiv.org/abs/2303.01514, etc.; though theory-laden, I think that the "diff" between the boundaries identifiable from the perspective of different theories is usually negligible).

We, humans, intrude into each others' boundaries, boundaries of animals, organisations, communities, etc. all the time. A surgeon intruding into the boundaries of a patient is an ethical thing to do. If AI automated the entire economy, then waited until humanity completely lost the ability to run the civilisation on their own, and then suddenly stopped all maintenance of the automated systems that support the lives of humans, watching humans die out because they cannot support themselves, that would be "respecting humans' boundaries", but would also be an evil treacherous turn. Messing with Hitler's boundaries (i.e., killing him) in 1940 would be an ethical action from the perspective of most systems that may care about that (individual humans, organisations, countries, communities). 

I think that boundaries (including consciousness boundaries: what is the locus of animal consciousness? Just the brain or the whole body, or it even extends beyond the body? What is the locus of AI's consciousness?) is an undeniably important concept that is usable for inferring ethical behaviour. But I don't think a simple "winning" deontology is derivable from this concept. I'm currently preparing an article where I describe that from the AI engineering perspective, deontology, virtue ethics, and consequentialism could be seen as engineering techniques (approaches) that could help to produce and continuously infer the ethical style of behaviour. None of these "classical" approaches to normative ethics is either necessary or sufficient, but they all could help to improve the ethics in some cognitive architectures.

Replies from: Chipmonk
comment by Chipmonk · 2023-06-12T12:12:10.290Z · LW(p) · GW(p)

I think that boundaries […] is an undeniably important concept that is usable for inferring ethical behaviour. But I don't think a simple "winning" deontology is derivable from this concept.

I see

I'm currently preparing an article where I describe that from the AI engineering perspective, deontology, virtue ethics, and consequentialism

please lmk when you post this. i've subscribed to your lw posts too


FWIW, I don't think the examples given necessarily break «membranes» as a "winning" deontological theory.

A surgeon intruding into the boundaries of a patient is an ethical thing to do. 

If the patient has consented, there is no conflict.

(Important note: consent does not always nullify membrane violations. In this case it does, but there are many cases where it doesn't.)

If AI automated the entire economy, then waited until humanity completely lost the ability to run the civilisation on their own, and then suddenly stopped all maintenance of the automated systems that support the lives of humans, watching humans die out because they cannot support themselves, that would be "respecting humans' boundaries", but would also be an evil treacherous turn. 

I think a way to properly understand this might be: if Alice makes a promise to Bob, she is essentially giving Bob a piece of herself, and that changes how he plans for the future and whatnot. If she revokes that by terms not part of the original agreement, she has stolen something from Bob, and that is a violation of membranes. ?

If the AI promises to support humans under an agreement, then breaks that agreement, that is theft.

Messing with Hitler's boundaries (i.e., killing him) in 1940 would be an ethical action from the perspective of most systems that may care about that (individual humans, organisations, countries, communities). 

In a case like this I wonder if the theory would also need something like "minimize net boundary violations", kind of like how some deontologies make murder okay sometimes.

But then this gets really close to utilitarianism and that's gross imo. So I'm not sure. Maybe there's another way to address this? Maybe I see what you mean

comment by Raemon · 2023-05-03T22:25:14.900Z · LW(p) · GW(p)

Re: the "boundaries" tag, are you calling "«Boundaries»" vs "boundaries" to indicate you're referring to a special definition? Critch seemed to have an explicit definition for his posts in mind and maybe it's worth specifying one-particular-definition-over-others, but if this tag is (AFAICT) listing all content that's about boundaries in a technical-sense, I think it most likely makes sense to either just call the tag "Boundaries", or maybe "Boundaries [technical]".

I'm a bit curious what @Andrew_Critch [LW · GW] meant when he used the "«" marker.

Replies from: Chipmonk
comment by Chipmonk · 2023-05-03T22:33:50.089Z · LW(p) · GW(p)

I believe I'm abiding by the definition inherent to his sequence, but anyone is free to convince me otherwise.

(Please also let me know if I've violated some norm about naming conventions.)

I've decided to use "«boundaries»" instead of "boundaries" because "boundaries" colloquially refers to something that's more like "Hey you crossed my boundaries, you're so mean!" (see this post for examples), and while I think that these two concepts are related, I find them extraordinarily confusing to consider simultaneously (because "crossing 'boundaries'" does not imply "crossing «boundaries»"), so I try to be as explicit as possible with the use.

In the future I plan to use that word as little as possible because of this, but unfortunately that's the name of the sequence.

But "Boundaries [technical]" could do…

Replies from: Raemon
comment by Raemon · 2023-05-03T23:26:46.524Z · LW(p) · GW(p)

LW is somewhat opinionated about how to do tags. (This doesn't mean there's a hard-and-fast-rule, just that when we're evaluating what makes good tags and considering whether to re-organize tags, the mods reflect on the entire experience of the LW userbase). Generally, we want tags that are "neither too narrow nor too broad". 

In this case, if there were other people writing about boundaries-in-a-technical-sense which for some reason was notably different from Critch's definition, and there were some people (maybe just Critch, maybe Critch-plus-a-few-collaborators) who specifically wanted to focus on his definition, then having two tags would make sense. My guess is that anyone writing about boundaries-in-a-technical-sense would end up with a definition similar to Critch's, and there should just be one tag for all similar work, and the '«' symbol doesn't make sense for the tag.

Replies from: Chipmonk
comment by Chipmonk · 2023-05-03T23:35:50.095Z · LW(p) · GW(p)

Ok, I will rename the tag from "«Boundaries»" to "Boundaries [technical]". Fwiw I consider both strings as referring to the same concept, but I see how it might be weird to use «».

comment by Chipmonk · 2023-05-28T22:55:56.540Z · LW(p) · GW(p)

Thread of edits

Replies from: Chipmonk, Chipmonk
comment by Chipmonk · 2023-08-09T21:54:52.822Z · LW(p) · GW(p)

I have just updated the post to add more details about Mark Miller’s Object-capability model.

comment by Chipmonk · 2023-05-28T22:56:09.683Z · LW(p) · GW(p)

Today I've slightly updated the post to reflect what I think will be less-confusing terminology for this concept going forward [LW · GW]. 

comment by Chipmonk · 2023-05-03T22:09:25.448Z · LW(p) · GW(p)

Here are some more posts which might be also related, but less obviously so. I will leave them in this comment for now, but feel free to argue me into including or excluding any of these.

Also, lmk if anything else should be linked in the main post.