
comment by Gordon Seidoh Worley (gworley) · 2024-03-29T23:36:19.743Z

I've spent some time thinking about this and can share some thoughts.

I find the framing around preserving boundaries a bit odd. A lack of preservation seems like one way things could be bad, but I don't think it's a full accounting of badness (or at least that's how it seems given my understanding of boundaries).

In humans, I strongly suspect we can model badness as negative valence (@Steven Byrnes's recent series on valence is a good reference). The reason "bad" is a simple and fundamental word in English and most languages is that it's basic to the way our minds work: bad is approximately the stuff we don't like, and good is the stuff we do like, where liking is a function of how much something makes the world the way we want it to be, and wanting is a kind of expectation about future observations.
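To gesture at what I mean by that last clause, here's a toy sketch (entirely my own construction, not anything from Byrnes's series): treat valence as a score of how closely observations match the wanted world-state.

```python
def valence(observed, wanted):
    """Toy valence: how well the observed world matches the wanted one.

    Returns 0 for a perfect match and grows more negative ("worse") as
    the mismatch grows. Squared error is an arbitrary choice of distance;
    any monotone mismatch measure would make the same point.
    """
    mismatch = sum((o - w) ** 2 for o, w in zip(observed, wanted))
    return -mismatch

wanted = [1.0, 0.0]
print(valence([0.9, 0.1], wanted))   # ~ -0.02: close to what we want, mildly bad
print(valence([-1.0, 2.0], wanted))  # -8.0: far from what we want, much worse
```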

I also think we can generalize badness from humans and other animals with valence-oriented brains by using the language of control theory. There, we might classify sensor readings as bad if they signal movement away from, rather than toward, a goal. And since we can model living things as complex networks of layered negative-feedback circuits, this suggests that anything is bad if it works against achieving a system's purpose.

(I have a bit more of my thoughts on this in a draft book chapter, but I was not specifically trying to address this question, so you might need to read between the lines a bit.)
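A minimal sketch of that classification rule (the thermostat setup and the numbers are invented for illustration): a reading is labeled "bad" exactly when it signals that the error with respect to the setpoint is growing rather than shrinking.

```python
def label_readings(readings, setpoint):
    """Label each successive sensor reading 'good' or 'bad' by whether it
    signals movement toward (error shrinking) or away from (error growing)
    the goal."""
    labels = []
    prev_error = abs(readings[0] - setpoint)
    for r in readings[1:]:
        error = abs(r - setpoint)
        labels.append("good" if error < prev_error else "bad")
        prev_error = error
    return labels

# A thermostat aiming for 20 °C: drifting away is "bad", correcting is "good".
print(label_readings([18.0, 17.5, 18.6, 19.4, 20.1], setpoint=20.0))
# -> ['bad', 'good', 'good', 'good']
```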

Goodness, in these models, is simply the reverse of badness: positively valenced things are good, as are sensor readings that signal a goal is being achieved.

There are some interesting caveats around what happens when multiple layers in the system contradict each other (smoking a cigarette feels good even though we know it's bad for us), but the basic point stands.
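Here's a toy version of that layering (the layers, weights, and numbers are all invented for illustration): the same action can register as good to one feedback loop and bad to another, and the overall verdict depends on how the layers are weighted.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    goal: float    # setpoint this feedback loop tries to maintain
    weight: float  # how much this layer counts in the overall verdict

    def valence(self, before: float, after: float) -> float:
        # Positive if the action moved the tracked variable toward this
        # layer's goal, negative if it moved it away.
        return abs(before - self.goal) - abs(after - self.goal)

# Smoking a cigarette: relieves craving (one loop) but harms health (another).
craving = Layer("craving relief", goal=0.0, weight=1.0)
health = Layer("long-term health", goal=10.0, weight=3.0)

verdicts = [
    ("craving", craving.weight * craving.valence(before=8.0, after=1.0)),
    ("health", health.weight * health.valence(before=9.0, after=5.0)),
]
for name, v in verdicts:
    print(f"{name} layer says {'good' if v > 0 else 'bad'} ({v:+.1f})")
total = sum(v for _, v in verdicts)
print("overall:", "good" if total > 0 else "bad")  # layers disagree; weights decide
```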

comment by MiguelDev (whitehatStoic) · 2024-03-30T01:06:36.801Z

> I can't think of anything else that would be missing from a full specification of badness.


Hello there! This idea might improve your post: I think no one can properly analyze the problem of badness without thinking about what is "good" at the same time. So the core point I am trying to make here is that we should be able to train models on an accurate simulation of our world in which both good and evil (badness) exist.

I wrote something about this here, if you are interested.