The two-layer model of human values, and problems with synthesizing preferences

post by Kaj_Sotala · 2020-01-24T15:17:33.638Z · LW · GW · 16 comments

Contents

  The two-layer/ULM model of human values
  Preference synthesis as a character-level model
  My confusion about a better theory of values
16 comments

I have been thinking about Stuart Armstrong's preference synthesis research agenda [LW · GW], and have long had the feeling that there's something off about the way it is currently framed. In this post I try to describe why. I start by describing my current model of human values and how I see Stuart's implicit assumptions as conflicting with it, and then talk about my confusion about how to reconcile the two views.

The two-layer/ULM model of human values

In Player vs. Character: A Two-Level Model of Ethics [LW · GW], Sarah Constantin describes a model where the mind is divided, in game terms, into a "player" and a "character". The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:

In many games, such as Magic: The Gathering, Hearthstone, or Dungeons and Dragons, there’s a two-phase process. First, the player constructs a deck or character from a very large sample space of possibilities.  This is a particular combination of strengths and weaknesses and capabilities for action, which the player thinks can be successful against other decks/characters or at winning in the game universe.  The choice of deck or character often determines the strategies that deck or character can use in the second phase, which is actual gameplay.  In gameplay, the character (or deck) can only use the affordances that it’s been previously set up with.  This means that there are two separate places where a player needs to get things right: first, in designing a strong character/deck, and second, in executing the optimal strategies for that character/deck during gameplay. [...]
The idea is that human behavior works very much like a two-level game. [...] The player determines what we find rewarding or unrewarding.  The player determines what we notice and what we overlook; things come to our attention if it suits the player’s strategy, and not otherwise.  The player gives us emotions when it’s strategic to do so.  The player sets up our subconscious evaluations of what is good for us and bad for us, which we experience as “liking” or “disliking.”
The character is what executing the player’s strategies feels like from the inside.  If the player has decided that a task is unimportant, the character will experience “forgetting” to do it.  If the player has decided that alliance with someone will be in our interests, the character will experience “liking” that person.  Sometimes the player will notice and seize opportunities in a very strategic way that feels to the character like “being lucky” or “being in the right place at the right time.”
This is where confusion often sets in. People will often protest “but I did care about that thing, I just forgot” or “but I’m not that Machiavellian, I’m just doing what comes naturally.”  This is true, because when we talk about ourselves and our experiences, we’re speaking “in character”, as our character.  The strategy is not going on at a conscious level. In fact, I don’t believe we (characters) have direct access to the player; we can only infer what it’s doing, based on what patterns of behavior (or thought or emotion or perception) we observe in ourselves and others.

I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine [LW · GW]:

The universal learning hypothesis proposes that all significant mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and mechanisms), the initial rough architecture (equivalent to a prior over mindspace), and a small library of simple innate circuits (analogous to the operating system layer in a computer).  In this view the mind (software) is distinct from the brain (hardware).  The mind is a complex software system built out of a general learning mechanism. [...]
An initial untrained seed ULM can be defined by 1.) a prior over the space of models (or equivalently, programs), 2.) an initial utility function, and 3.) the universal learning machinery/algorithm.  The machine is a real-time system that processes an input sensory/observation stream and produces an output motor/action stream to control the external world using a learned internal program that is the result of continuous self-optimization. [...]
The key defining characteristic of a ULM is that it uses its universal learning algorithm for continuous recursive self-improvement with regards to the utility function (reward system).  We can view this as second (and higher) order optimization: the ULM optimizes the external world (first order), and also optimizes its own internal optimization process (second order), and so on.  Without loss of generality, any system capable of computing a large number of decision variables can also compute internal self-modification decisions.
Conceptually the learning machinery computes a probability distribution over program-space that is proportional to the expected utility distribution.  At each timestep it receives a new sensory observation and expends some amount of computational energy to infer an updated (approximate) posterior distribution over its internal program-space: an approximate 'Bayesian' self-improvement.

Rephrasing these posts in terms of each other: in a person's brain, "the player" is the underlying learning machinery, which is searching the space of programs (brain configurations) in order to find a suitable one; the "character" is whatever set of emotional responses, aesthetics, identities, and so forth that search has currently hit upon.
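
To make the rephrasing a bit more concrete, here is a deliberately minimal sketch of that structure in Python. Nothing in either post specifies these details: the hill-climbing update, the toy reward function, and all the names are just illustrative stand-ins for "a learning process that keeps rewriting the currently installed program".

```python
import random

# Toy "player": an outer loop that searches program-space and keeps whichever
# candidate scores better on the reward signal. The "character" is simply the
# program that happens to be installed at any given moment.

def act(program, observation):
    """The character in action: how the currently installed program responds."""
    return program * observation

def reward(observation, action):
    """Innate reward machinery (fixed, not itself learned in this sketch)."""
    return -abs(action - 0.5 * observation)

def player_step(program, observation, rng):
    """The player: propose a small rewrite of the program, keep it if it pays off."""
    candidate = program + rng.gauss(0, 0.1)
    new_score = reward(observation, act(candidate, observation))
    old_score = reward(observation, act(program, observation))
    if new_score > old_score:
        return candidate  # from the inside, this feels like a shift in preferences
    return program

rng = random.Random(0)
program = rng.uniform(-1, 1)  # the initial, untrained "character"
for _ in range(2000):
    observation = rng.uniform(0, 1)
    program = player_step(program, observation, rng)

print("currently installed character parameter:", round(program, 3))  # drifts toward 0.5
```

The point is only the division of labor: everything the character "is" lives inside the currently installed program, while the thing doing the installing never shows up at the character level at all.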

Many of the things about the character that seem fixed can in fact be modified by the learning machinery. One's sense of aesthetics can be updated by propagating new facts into it [LW · GW], and strongly-held identities (such as "I am a technical person") can change [LW · GW] in response to new kinds of strategies becoming viable. Unlocking the Emotional Brain describes [LW · GW] a number of such updates, such as - in these terms - the ULM eliminating subprograms that block confidence, after receiving an update saying that the consequences of expressing confidence will not be as bad as previously predicted.

Another example of this kind of thing is the framework that I sketched in Building up to an Internal Family Systems model [LW · GW]: if a system has certain kinds of bad experiences, it makes sense for it to spawn subsystems dedicated to ensuring that those experiences do not repeat. Moral psychology's social intuitionist model claims that people often have an existing conviction that certain actions or outcomes are bad, and that they then construct seemingly rational arguments for the sake of preventing those outcomes. Even if you rebut the arguments, the conviction remains. This kind of model is compatible with an IFS/ULM-style model, where the learning machinery sets the goal of preventing particular outcomes and then applies the "reasoning module" for that purpose.

Qiaochu Yuan notes that once you see people getting upset at a coworker for criticizing them, do therapy approaches with them, and end up at the point where they are crying about how their father never told them he was proud of them... then it gets really hard to take people's reactions to things at face value. Many of our consciously experienced motivations actually have nothing to do with our real motivations. (See also: Nobody does the thing that they are supposedly doing [LW · GW], The Elephant in the Brain [LW · GW], The Intelligent Social Web [LW · GW].)

Preference synthesis as a character-level model

While I like a lot of the work that Stuart Armstrong has done on synthesizing human preferences [LW · GW], I have a serious concern about it which is best described as: everything in it is based on the character level, rather than the player/ULM level.

For example, in "Our values are underdefined, changeable, and manipulable [LW · GW]", Stuart - in my view, correctly - argues for the claim stated in the title... except that it is not clear to me to what extent the things we intuitively consider our "values" are actually our values. Stuart opens with this example:

When asked whether "communist" journalists could report freely from the USA, only 36% of 1950 Americans agreed. A follow-up question about American journalists reporting freely from the USSR got 66% agreement. When the order of the questions was reversed, 90% were in favour of American journalists - and an astounding 73% in favour of the communist ones.

From this, Stuart suggests that people's values on these questions should be thought of as underdetermined. I think that this has a grain of truth to it, but that calling these opinions "values" in the first place is misleading.

My preferred framing would rather be that people's values - in the sense of some deeper set of rewards which the underlying machinery is optimizing for - are in fact underdetermined, but that is not what's going on in this particular example. The order of the questions does not change those values, which remain stable under this kind of reordering. Rather, consciously-held political opinions are strategies for carrying out the underlying values. Receiving the questions in a different order caused the system to consider different kinds of information when choosing its initial strategy, leading to different strategic choices.
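
As a toy illustration of how this could work, consider an agent whose underlying dispositions never change, but whose stated opinions depend on question order because an already-given answer becomes an input to the next strategic choice. All of the numbers and the "consistency pressure" term below are invented for the sake of the example; they are not a model of the actual survey respondents.

```python
# Fixed underlying dispositions (identical across both runs):
BASE_SUPPORT = {"american_in_ussr": 0.8, "communist_in_usa": 0.35}
CONSISTENCY_WEIGHT = 0.4  # strategic pressure to look principled/consistent

def stated_opinion(question, earlier_answers):
    score = BASE_SUPPORT[question]
    if earlier_answers:
        # An earlier "yes" to the symmetric question pulls this answer toward "yes"
        # too -- a strategy serving the underlying values, not a change in them.
        score += CONSISTENCY_WEIGHT * (sum(earlier_answers) / len(earlier_answers) - 0.5)
    return score > 0.5

for order in (["communist_in_usa", "american_in_ussr"],
              ["american_in_ussr", "communist_in_usa"]):
    answers = []
    for question in order:
        answers.append(stated_opinion(question, answers))
    print(order, "->", answers)

# -> [False, True] when the communist question comes first,
#    [True, True] when the American question comes first.
```

The underlying dispositions are identical in both runs; only the stated "values" differ, which is the distinction I want to draw between the player's rewards and the character's opinions.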

Stuart's research agenda does talk about incorporating meta-preferences [LW · GW], but as far as I can tell, all the meta-preferences are about the character level too. Stuart mentions "I want to be more generous" and "I want to have consistent preferences" as examples of meta-preferences; in actuality, these meta-preferences might exist because of something like "the learning system has identified generosity as a socially admirable strategy and predicts that to lead to better social outcomes" and "the learning system has formulated consistency as a generally valuable heuristic and one which affirms the 'logical thinker' identity, which in turn is being optimized because of its predicted social outcomes".

My confusion about a better theory of values

If a "purely character-level" model of human values is wrong, how do we incorporate the player level?

I'm not sure and am mostly confused about it, so I will just babble [? · GW] & boggle at my confusion for a while, in the hopes that it would help.

The optimistic take would be that there exists some set of universal human values which the learning machinery is optimizing for. There exist various therapy frameworks which claim to have found something like this.

For example, the NEDERA model claims that there are nine negative core feelings which humans are optimizing to avoid: people may feel Alone, Bad, Helpless, Hopeless, Inadequate, Insignificant, Lost/Disoriented, Lost/Empty, and Worthless. And pjeby mentions [LW(p) · GW(p)] that in his empirical work, he has found three clusters of underlying fears which seem similar to these nine:

For example, working with people on self-image problems, I've found that there appear to be only three critical "flavors" of self-judgment that create life-long low self-esteem in some area, and associated compulsive or avoidant behaviors:
Belief that one is bad, defective, or malicious (i.e. lacking in care/altruism for friends or family)
Belief that one is foolish, incapable, incompetent, unworthy, etc. (i.e. lacking in ability to learn/improve/perform)
Belief that one is selfish, irresponsible, careless, etc. (i.e. not respecting what the family or community values or believes important)
(Notice that these are things that, if you were bad enough at them in the ancestral environment, or if people only thought you were, you would lose reproductive opportunities and/or your life due to ostracism. So it's reasonable to assume that we have wiring biased to treat these as high-priority long-term drivers of compensatory signaling behavior.)
Anyway, when somebody gets taught that some behavior (e.g. showing off, not working hard, forgetting things) equates to one of these morality-like judgments as a persistent quality of themselves, they often develop a compulsive need to prove otherwise, which makes them choose their goals, not based on the goal's actual utility to themself or others, but rather based on the goal's perceived value as a means of virtue-signalling. (Which then leads to a pattern of continually trying to achieve similar goals and either failing, or feeling as though the goal was unsatisfactory despite succeeding at it.)

So - assuming for the sake of argument that these findings are correct - one might think something like "okay, here are the things the brain is trying to avoid, we can take those as the basic human values".

But not so fast. After all, emotions are all computed in the brain, so "avoidance of these emotions" can't be the only goal any more than "optimizing happiness" can. It would only lead to wireheading.

Furthermore, it seems like one of the things that the underlying machinery also learns is the situations in which it should trigger these feelings. E.g. feelings of irresponsibility can be used as an internal carrot-and-stick scheme, in which the system comes to predict that feeling persistently bad will cause parts of it to pursue specific goals in an attempt to make those negative feelings go away.

Also, we are not only trying to avoid negative feelings. Empirically, it doesn't look like happy people end up doing less than unhappy people, and guilt-free people may in fact do more than guilt-driven people. The relationship is nowhere near linear, but it seems like there are plenty of happy, energetic people who are happy in part because they are doing all kinds of fulfilling things.

So maybe we could look at the inverse of negative feelings: positive feelings. The current mainstream model of human motivation and basic needs is self-determination theory, which explicitly holds that there exist three separate basic needs:

Autonomy: people have a need to feel that they are the masters of their own destiny and that they have at least some control over their lives; most importantly, people have a need to feel that they are in control of their own behavior.
Competence: another need concerns our achievements, knowledge, and skills; people have a need to build their competence and develop mastery over tasks that are important to them.
Relatedness (also called Connection): people need to have a sense of belonging and connectedness with others; each of us needs other people to some degree.

So one model could be that the basic learning machinery is, first, optimizing for avoiding bad feelings; and then, optimizing for things that have been associated with good feelings (even when doing those things is locally unrewarding, e.g. taking care of your children even when it's unpleasant). But this too risks running into the wireheading issue.
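
If one wanted to write that two-tier rule down at all, it might look something like this crude sketch (the action names, scores, and the hard threshold are placeholders; real action selection is surely nothing this clean):

```python
# Learned predictions and associations; the numbers are placeholders.
PREDICTED_BAD_FEELING = {   # action -> predicted chance of a core negative feeling
    "skip_childcare": 0.9,  # e.g. predicted guilt / "I am irresponsible"
    "do_childcare": 0.1,
    "watch_tv": 0.2,
}
POSITIVE_ASSOCIATION = {    # action -> learned association with past good feelings
    "skip_childcare": 0.3,
    "do_childcare": 0.8,    # associated with connection, even if locally unpleasant
    "watch_tv": 0.5,
}
BAD_FEELING_THRESHOLD = 0.5

def choose(actions):
    # Tier 1: rule out actions predicted to trigger a core negative feeling.
    safe = [a for a in actions if PREDICTED_BAD_FEELING[a] < BAD_FEELING_THRESHOLD]
    candidates = safe or actions  # if nothing is "safe", fall back to everything
    # Tier 2: among the rest, pick the strongest learned positive association.
    return max(candidates, key=lambda a: POSITIVE_ASSOCIATION[a])

print(choose(["skip_childcare", "do_childcare", "watch_tv"]))  # -> do_childcare
```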

A problem here is that while it might make intuitive sense to say "okay, if the character's values aren't the real values, let's use the player's values instead", the split isn't actually anywhere that clean. In a sense the player's values are the real ones - but there's also a sense in which the player doesn't have anything that we could call values. It's just a learning system which observes a stream of rewards and optimizes it according to some set of mechanisms, and even the reward and optimization mechanisms themselves may end up getting at least partially rewritten. The underlying machinery has no idea about things like "existential risk" or "avoiding wireheading" or necessarily even "personal survival" - thinking about those is a character-level strategy, even if it is chosen by the player using criteria that it does not actually understand.

For a moment it felt like looking at the player level would help with the underdefinability and mutability of values, but the player's values seem like they could be even less defined and even more mutable. It's not clear to me that we can call them values in the first place, either - any more than it makes meaningful sense to say that a neuron in the brain "values" firing and releasing neurotransmitters. The player is just a set of code, or going one abstraction level down, just a bunch of cells.

To the extent that there exists something that intuitively resembles what we call "human values", it feels like it exists in some hybrid level which incorporates parts of the player and parts of the character. That is, assuming that the two can even be very clearly distinguished from each other in the first place.

Or something. I'm confused.

16 comments

Comments sorted by top scores.

comment by Steven Byrnes (steve2152) · 2020-01-25T03:02:30.128Z · LW(p) · GW(p)

I definitely agree that the player vs character distinction is meaningful, although I would define it a bit differently.

I would identify it with cortical vs subcortical, a.k.a. neocortex vs everything else. (...with the usual footnotes, e.g. the hippocampus counts as "cortical" :-D)

(ETA: See my later post Inner alignment on the brain [LW · GW] for a better discussion of some of the below.)

The cortical system basically solves the following problem:

Here is (1) a bunch of sensory & other input data, in the form of spatiotemporal patterns of spikes on input neurons, (2) occasional labels about what's going on right now (e.g. "something good / bad / important is happening"), (3) a bunch of outgoing neurons. Your task is to build a predictive model of the inputs, and use that to choose signals to send into the outgoing neurons, to make more good things happen.

The result is our understanding of the world, our consciousness, imagination, memory, etc. Anything we do that requires understanding the world is done by the cortical system. This is your "character".

The subcortical system is responsible for everything else your brain does to survive, one of which is providing the "labels" mentioned above (that something good / bad / important / whatever is happening right now).

For example, take the fear-of-spiders instinct. If there is a black scuttling blob in your visual field, there's a subcortical vision system (in the superior colliculus) that pattern-matches that moving blob to a genetically-coded template, and thus activates a "Scary!!" flag. The cortical system sees the flag, sees the spider, and thus learns that spiders are scary, and it can plan intelligent actions to avoid spiders in the future.
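
To cartoon that division of labour in code (the template-matching and the learning rule here are stand-ins I'm making up for illustration, not claims about the actual circuits):

```python
# Subcortical side: a fixed, genetically-coded pattern detector that raises a flag.
SCARY_TEMPLATE = {"black", "scuttling"}

def subcortical_flag(visual_features):
    """Superior-colliculus-style match against a hard-coded template; no learning."""
    return SCARY_TEMPLATE.issubset(visual_features)

# Cortical side: learned associations between world-model concepts and that flag.
learned_scariness = {}

def cortical_update(concept, visual_features, lr=0.3):
    """Move the learned scariness of a concept toward the subcortical label."""
    label = 1.0 if subcortical_flag(visual_features) else 0.0
    old = learned_scariness.get(concept, 0.0)
    learned_scariness[concept] = old + lr * (label - old)

for _ in range(5):
    cortical_update("spider", {"black", "scuttling", "eight_legs"})
    cortical_update("falling_leaf", {"brown", "fluttering"})

print(learned_scariness)  # "spider" drifts toward 1.0; "falling_leaf" stays at 0.0
```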

I have a lot of thoughts on how to describe these two systems at a computational level, including what the neocortex is doing [LW · GW], and especially how the cortical and subcortical systems exchange information [LW · GW]. I am hoping to write lots more posts with more details about the latter, especially about emotions.

even the reward and optimization mechanisms themselves may end up getting at least partially rewritten.

Well, there is such a thing as subcortical learning, particularly for things like fine-tuning motor control programs in the midbrain and cerebellum, but I think most or all of the "interesting" learning happens in the cortical system, not subcortical.

In particular, I'm not really expecting the core emotion-control algorithms to be editable by learning or thinking (if we draw an appropriately tight boundary around them).

More specifically: somewhere in the brain is an algorithm that takes a bunch of inputs and calculates "How guilty / angry / happy / smug / etc. should I feel right now?" The inputs to this algorithm come from various places, including from the body (e.g. pain, hunger, hormone levels), and from the cortex (what emotions am I expecting or imagining or remembering?), and from other emotion circuits (e.g. some emotions inhibit or reinforce each other). The inputs to the emotion calculation can certainly change, but I don't expect that the emotion calculation itself changes over time.

It feels like emotion-control calculations can change, because the cortex can be a really dominant input to those calculations, and the cortex really can change, including by conscious effort. Why is the cortex such a dominant input? Think about it: the emotion-calculation circuits don't know whether I'm likely to eat tomorrow, or whether I'm in debt, or whether Alice stole my cookie, or whether I just got promoted. That information is all in the cortex! The emotion circuits get only tiny glimpses of what's going on in the world, particularly through the cortex predicting & imagining emotions, including in empathetic simulation of others' emotions. If the cortex is predicting fear, well, the amygdala obliges by creating actual fear, and then the cortex sees that and concludes that its prediction was right all along! There's very little "ground truth" that the emotion circuits have to go on. Thus, there's a wide space of self-reinforcing habits of thought. It's a terrible system! Totally under-determined. Thus we get self-destructive habits of thought that linger on for decades.
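
Here's a toy version of that dynamic (the weights and the 0.5 update rate are invented; the only point being illustrated is that a fixed emotion calculation with a dominant, learnable cortical input leaves habitual fear levels with very little ground truth to correct them):

```python
def fear_circuit(body_signal, cortical_prediction, body_weight):
    """The fixed, unlearned calculation: how much fear to generate right now."""
    return body_weight * body_signal + (1 - body_weight) * cortical_prediction

def run(body_weight, rounds=20, habit=0.9, body_signal=0.1):
    prediction = habit                       # a learned habit of thought: "this is scary"
    for _ in range(rounds):
        fear = fear_circuit(body_signal, prediction, body_weight)
        prediction += 0.5 * (fear - prediction)  # the cortex sees its prediction "confirmed"
    return round(prediction, 2)

print("cortex-dominant input:", run(body_weight=0.05))  # the fearful habit lingers (~0.58)
print("body-dominant input:  ", run(body_weight=0.95))  # snaps to the calm body signal (~0.1)
```

When the body were the dominant input, the habit would get corrected almost immediately; with the cortex dominant, a self-reinforcing habit of thought can stick around more or less indefinitely.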

Anyway, I have this long-term vision of writing down the exact algorithm that each of the emotion-control circuits is implementing. I think AGI programmers might find those algorithms helpful, and so might people trying to pin down "human values". I have a long way to go in that quest :-D

there's also a sense in which the player doesn't have anything that we could call values ...

I basically agree; I would describe it by saying that the subcortical systems are kinda dumb. Sure, the superior colliculus can recognize scuttling spiders, and the emotion circuits can "dislike" pain. But any sophisticated concept like "flourishing", "fairness", "virtue", etc. can only be represented in the form of something like "Neocortex World Model Entity ID #30962758", and these things cannot have any built-in relationship to subcortical circuits.

So the player's "values" are going to be (1) simple things like "less pain is good", and (2) things that don't have an obvious relation to the outside world, like complicated "preferences" over the emotions inside our empathetic simulations of other people.

If a "purely character-level" model of human values is wrong, how do we incorporate the player level?

Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P

But either way, I'm all for trying to get a better understanding of how I (the character / cortical system) am "built" by the player / subcortical system. :-)

Replies from: Kaj_Sotala, Charlie Steiner
comment by Kaj_Sotala · 2020-01-26T16:11:16.480Z · LW(p) · GW(p)

Great comment, thanks!

Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P

Well, to make up a silly example, let's suppose that you have a conscious belief that you want there to be as much cheesecake as possible. This is because you are feeling generally unsafe, and a part of your brain has associated cheesecakes with a feeling of safety, so it has formed the unconscious prediction that if only there was enough cheesecake, then you would finally feel good and safe.

So you program the AI to extract your character-level values, it correctly notices that you want to have lots of cheesecake, and goes on to fill the world with cheesecake... only for you to realize that now that you have your world full of cheesecake, you still don't feel as happy as you were on some level expecting to feel, and all of your elaborate rational theories of how cheesecake is the optimal use of atoms start feeling somehow hollow.

Replies from: Linda Linsefors
comment by Linda Linsefors · 2022-02-07T17:06:16.155Z · LW(p) · GW(p)

There is a mismatch in saying cortex = character and subcortex = player.

If I understand the player-character model right, then unconscious coping strategies would be player-level tactics. But these are learned behaviours, and would therefore be part of the cortex.

In Kaj's example, the idea that cheesecake will make the bad go away exists in the cortex's world model.

According to Steven's model of how the brain works (which I think is probably true), the subcortex is part of the game the player is playing. Specifically, the subcortex provides the reward signal, and some other important game stats (stamina level, hit points, etc.). The subcortex is also sort of like a tutorial, drawing your attention to things that the game creator (evolution) thinks might be useful, and occasional cut scenes (acting out pre-programmed behaviour).

ML comparison (a minimal code sketch of this mapping follows the list):
* The character is the pretrained neural net
* The player is the backprop
* The cortex is the neural net plus backprop
* The subcortex is the reward signal and sometimes a supervisory signal.
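
Something like this, with a one-parameter "network" so it stays self-contained (the model, loss, and data are placeholders; only the labels in the comments matter):

```python
weight = 0.0                       # the current "character": the trained network

def net(x):                        # character: learned behaviour
    return weight * x

def reward_signal(x, y):           # "subcortex": provides the reward signal
    return -(net(x) - y) ** 2

def backprop_step(x, y, lr=0.1):   # the "player": the learning rule that rewrites
    global weight                  # the character so the reward signal goes up
    grad = 2 * (net(x) - y) * x    # hand-computed gradient of the squared error
    weight -= lr * grad

game = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # the "game" the player is playing
for _ in range(100):
    for x, y in game:
        backprop_step(x, y)

print("learned character:", round(weight, 2))                  # ~2.0
print("reward on (3, 6):", round(reward_signal(3.0, 6.0), 4))  # ~0.0
```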

Also, I don't like the player-character model much. Like all models it is at best a simplification, and it does catch some of what is going on, but I think it is more wrong than right, and I think something like a multi-agent model is much better. I.e. there are coping mechanisms and other less conscious strategies living in your brain side by side with who you think you are. But I don't think these are completely invisible the way the player is invisible to the character. They are predictive models (e.g. "cheesecake will make me safe"), and it is possible to query them for predictions. And almost all of these models are in the cortex.

comment by Charlie Steiner · 2020-01-29T08:05:10.924Z · LW(p) · GW(p)

You might have already mentioned this elsewhere, but do you have any reading recommendations for computation and the brain?

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2020-01-29T19:28:03.506Z · LW(p) · GW(p)

Meh, I haven't found any author that I especially love and who thinks like me and answers all my questions. :-P But here's every book I've read [LW(p) · GW(p)]. I've also obviously read a bunch of articles, but too many to list and none of them especially stands out unless you ask something more specific. Warning: sometimes I have an overly confident tone but don't really know what I'm talking about :-P

comment by Charlie Steiner · 2020-01-28T00:48:56.755Z · LW(p) · GW(p)

Speaking as a character, I too think the player can just go jump in a lake.

My response to this post is to think about something else instead, so if you'll excuse me getting on a hobby horse...

I agree that when we look at someone making bizarre rationalizations, "their values" are not represented consciously, and we have to jump to a different level to find human values. But I think that conscious->unconscious is the wrong level jump to make.

Instead, the jump I've been thinking about recently is to our own model of their behavior. In this case, our explanation of their behavior relies on the unconscious mind, but in other cases, I predict that we'll identify values with conscious desires when that is a more parsimonious explanation of behavior. An AI learning human values would then not merely be modeling humans, but modeling humans' models of humans. But I think it might be okay if it makes those models out of completely alien concepts (at least outside of deliberately self-referential special cases - there might be an analogy here to the recursive modeling of Gricean communication).

comment by moridinamael · 2020-01-27T17:48:03.610Z · LW(p) · GW(p)

Fantastic post; I'm still processing.

One bite-sized thought that occurs to me is that maybe this coupling of the Player and the Character is one of the many things accomplished by dreaming. The mind-system confabulates bizarre and complex scenarios, drawn in some sense from the distribution of possible but not highly probable sensory experiences. The Player provides an emotional reaction to these scenarios - you're naked in school, you feel horrifying levels of embarrassment in the dream, and the Character learns to avoid situations like this one without ever having to directly experience it.

I think that dreaming does this sort of thing in a general way, by simulating scenarios and using those simulations to propagate learning through the hierarchy, but in particular it would seem that viewing the mind in terms of Player/Character gives you a unique closed-loop situation that really bootstraps the ability of the Character to intuitively understand the Player's wishes.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2020-01-29T12:28:23.899Z · LW(p) · GW(p)
Fantastic post

Thanks!

comment by Stuart_Armstrong · 2020-01-27T16:55:13.401Z · LW(p) · GW(p)

It's not clear to me that we can call them values in the first place, either

It looks like the player values can be satisfied by wireheading, while the character values are more about the state of the world/state of their identity.

So, to satisfy them both, give the person pleasure and also satisfy their other preferences?

In a sense, it feels like the player's preferences are more base level, and the character's more meta?

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2020-01-27T17:07:45.607Z · LW(p) · GW(p)

Hmm... it's hard for me to get what you mean from a comment this short, but just the fact that I seem to have a lot of difficulty connecting your comment with my own model suggests that I didn't communicate mine very well. Could you say more about how you understood it?

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2020-01-28T15:49:27.182Z · LW(p) · GW(p)

The player seems to value emotional states, while the character values specific situations it can describe? Does that seem right?

Replies from: Charlie Steiner, Kaj_Sotala
comment by Charlie Steiner · 2020-01-29T08:14:42.378Z · LW(p) · GW(p)

My take is that we (the characters) have some wireheadable goals (e.g. curing a headache), but we also have plenty of goals best understood externally.

But the "player" is a less clearly goal-oriented process, and we can project different sorts of goals onto it, ranging from "it wants to make the feedback signal from the cortical neurons predict the output of some simple pattern detector" to "it wants us to avoid spiders" to "it wants us to be reproductively fit."

comment by Kaj_Sotala · 2020-01-29T12:04:40.489Z · LW(p) · GW(p)

Hmm... several thoughts about that.

One is that I don't think we really know what the player does value. I had some guesses and hand-waving in the post, but nothing that I would feel confident enough about to use as the basis for preference synthesis or anything like that. I'm not even certain that our values can be very cleanly split into a separate character and player, though I do think that the two-layer model is less wrong than the naive alternative.

In Sarah's original analogy, the player first creates the character; then the character acts based on the choices that the player has made beforehand. But I should have mentioned in the post that one aspect in which I think the analogy is wrong is that the player keeps changing the character. (Maybe you could think of this as one of those games that give you the option to take back the experience points you've spent on your character and re-assign them...)

Part of normal learning and change is that when you have new experiences, the learning process which I've been calling the player is involved in determining how those experiences affect your desires and personality. E.g. the changes in values and preferences that many people experience after having their first child - that might be described as the work of the player writing the "parental values" attribute into the character sheet. Or someone who goes to college, uncertain of what they want to study, tries out a few different subjects, and then switches their major to something which they found surprisingly interesting and motivating - the player giving them a preference to study that thing.

Those examples seem complicated enough that it seems a little too simplified to say that the player values emotional states; to some extent it seems to, but it also seems to create emotional states itself as suits its purposes. Probably what it "values" can't be simplified into any brief verbal description; it's more like a godshatter with a thousand different optimization criteria [LW · GW], all being juggled together to create something like the character.

I read your original comment as suggesting that we give the player sufficient pleasure that it is content; and then we also satisfy the character's preferences. But

1. Assuming for the sake of argument that this was possible, it's not clear what "the player being content" would do to a person's development. One possibility is that they would stop growing and responding to changed circumstances at all, because the mechanisms that were updating their behavior and thinking were all in the player. (Maybe even up to the point of e.g. not developing new habits in response to having moved to a new home with different arrangements, or something similar.)

2. There's anecdotal evidence suggesting that the pursuit of pleasure is actually also one of those character-level things. In "Happiness is a chore", the author makes the claim that even if you give people a technique which would consistently make them happy, and people try it out and become convinced of this, they might still end up not using it - because although "the pursuit of happiness" is what the character thinks they are doing, it is actually not what the player is optimizing for. If it was, it might be in the player's power to just create the happiness directly. Compare e.g. pjeby's suggestion [LW(p) · GW(p)] that things like happiness etc. are things that we feel by default, but the brain learns to activate systems which block happiness, because the player considers that necessary for some purpose:

So if, for example, we don't see ourselves as worthless, then experiencing ourselves as "being" or love or okayness is a natural, automatic consequence. Thus I ended up pursuing methods that let us switch off the negatives and deal directly with what CT and IFS represent as objecting parts, since these objections are the constraint on us accessing CT's "core states" or IFS's self-leadership and self-compassion.

These claims also match my personal experience; I have, at various times, found techniques that I know would make me happy, but then I find myself just not using them. At one point I wrote "I have available to me some mental motions for reaching inside myself and accessing a source of happiness, but it would require a bit of an active effort, and I find that just being neutral is already good enough, so I can't be bothered." Ironically, I think I've forgotten what exactly that particular mental move was, because I ended up not using it very much...

There's also a thing in meditative traditions where people develop the ability to access some really pleasant states of mind ("jhanas"). But then, although some people do become "jhana-junkies" and mostly just want to hang out in them, a lot of folks don't. One friend of mine who knows how to access the jhanas was once asked something along the lines of "well, if you can access pure pleasure, why aren't you doing it all the time". That got him thoughtful, and then he afterwards mentioned something about pure bliss just getting kinda stale / boring after a while. Also, getting into a jhana requires some amount of effort and energy, and he figures that he might as well spend that effort and energy on something more meaningful than just pure pleasure.

3. "Not getting satisfied" seems like a characteristic thing of the player. The character thinks that they might get satisfied: "once I have that job that I've always wanted, then I'll be truly happy"... and then after a while they aren't anymore. If we model people's goals as setpoints [LW(p) · GW(p)], it seems like frequently when one setpoint has been reached (which the previous character would have been satisfied with), the player looks around and changes the character to give it a new target setpoint. (I saw speculation somewhere that this is an evolutionary hack for getting around the fact that the brain has only a limited range of utility that it can represent - by redefining the utility scale whenever you reach a certain threshold, you can effectively have an unbounded utility function even though your brain can only represent bounded utility. Of course, it comes with costs such as temporally inconsistent preferences.)

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2020-01-29T16:21:10.066Z · LW(p) · GW(p)

Interesting. I will think more...

comment by algon33 · 2020-08-25T19:26:55.153Z · LW(p) · GW(p)

How does this relate to the whole "no-self" thing? Is the character becoming aware of the player there?

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2020-08-25T20:55:42.977Z · LW(p) · GW(p)

Good question. I think that at least some approaches to no-self do break down the mechanisms by which the appearance of a character is maintained, but the extent to which it actually gives insight to the nature of the player (as opposed to giving insight to the non-existence of the character) is unclear to me.