We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

post by johnswentworth, David Lorell · 2024-09-19T22:22:05.307Z · LW · GW · 47 comments

Contents

  Background: “Learning” vs “Learning About”
  We Humans Learn About Our Values
  Two Puzzles
    Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap
    Puzzle 2: The Role of Reward/Reinforcement
    Using Each Puzzle To Solve The Other
  What This Looks Like In Practice
  The Next Question

Background: “Learning” vs “Learning About”

Adaptive systems, reinforcement “learners”, etc, “learn” in the sense that their behavior adapts to their environment.

Bayesian reasoners, human scientists, etc, “learn” in the sense that they have some symbolic representation of the environment, and they update those symbols over time to (hopefully) better match the environment (i.e. make the map better match the territory).

These two kinds of “learning” are not synonymous[1]. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing[2].

We Humans Learn About Our Values

“I thought I wanted X, but then I tried it and it was pretty meh.”

“For a long time I pursued Y, but now I think that was more a social script than my own values.”

“As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight.”

The ubiquity of these sorts of sentiments is the simplest evidence that we do not typically know our own values[3]. Rather, people often (but not always) have some explicit best guess at their own values, and that guess updates over time - i.e. we can learn about our own values.

Note the wording here: we’re not just saying that human values are “learned” in the more general sense of reinforcement learning. We’re saying that we humans have some internal representation of our own values, a “map” of our values, and we update that map in response to evidence. Look again at the examples at the beginning of this section:

Notice that the wording of each example involves beliefs about values. They’re not just saying “I used to feel urge X, but now I feel urge Y”. They’re saying “I thought I wanted X” - a belief about a value! Or “now I think that was more a social script than my own values” - again, a belief about my own values, and how those values relate to my (previous) behavior. Or “I endorsed the view that Z is the highest objective” - an explicit endorsement of a belief about values. That’s how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - “learning” is more general than “learning about” - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style “learning”.

Two Puzzles

Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap

Very roughly speaking, an agent could aim to pursue any values regardless of what the world outside it looks like; “how the external world is” does not tell us “how the external world should be”. So when we “learn about” values, where does the evidence about values come from? How do we cross the is-ought gap?

Puzzle 2: The Role of Reward/Reinforcement

It does seem like humans have some kind of physiological “reward”, in a hand-wavy reinforcement-learning-esque sense, which seems to at least partially drive the subjective valuation of things. (Something something mesocorticolimbic circuit.) For instance, Steve Byrnes claims [LW · GW] that reward is a signal from which human values are learned/reinforced within lifetime[4]. But clearly the reward signal is not itself our values. So what’s the role of the reward, exactly? What kind of reinforcement does it provide, and how does it relate to our beliefs about our own values?

Using Each Puzzle To Solve The Other

Put these two together, and a natural guess is: reward is the evidence from which we learn about our values. Reward is used just like ordinary epistemic evidence - so when we receive rewards, we update our beliefs about our own values just like we update our beliefs about ordinary facts in response to any other evidence. Indeed, our beliefs-about-values can be integrated into the same system as all our other beliefs, allowing for e.g. ordinary factual evidence to become relevant to beliefs about values in some cases.
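Spelled out in ordinary Bayesian notation (a rough sketch, glossing over how "values" are actually represented):

P[values | reward, other evidence] ∝ P[reward | values, other evidence] · P[values | other evidence]

i.e. a reward signal enters as a likelihood term, updating the estimate of our values through exactly the same machinery as any other observation.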

What This Looks Like In Practice

I eat some escamoles for the first time, and my tongue/mouth/nose send me enjoyable reward signals. I update to believe that escamoles are good - i.e. I assign them high value.

Now someone comes along and tells me that escamoles are ant larvae. An unwelcome image comes into my head, of me eating squirming larvae. Ew, gross! That’s a negative reward signal - I downgrade my estimated value of escamoles.

Now another person comes along and tells me that escamoles are very healthy. I have already cached that "health" is valuable to me. That cache hit doesn’t generate a direct reward signal at all, but it does update my beliefs about the value of escamoles, to the extent that I believe this person. That update just routes through ordinary epistemic machinery, with no reward signal involved.

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as being the result of two decoupled things: either a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth so it's less likely that the negative reward is evidence of my values in this case; the signal is explained away in the usual epistemic sense by some cause other than my values. So, I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.

That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?
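As a very rough illustration of that explaining-away pattern, here is a toy numerical sketch (the model structure - a noisy-OR over two independent candidate causes - and all the numbers are invented purely for illustration, not a claim about how the brain actually models reward):

```python
# Toy explaining-away model for the escamole example (made-up numbers).
# Two independent candidate causes of the "ew, gross" negative reward:
#   A = hardcoded aversion to insects-in-mouth
#   D = I genuinely disvalue eating escamoles
# The negative reward signal is modeled as a noisy-OR of the two causes.

P_A = 0.5          # prior: maybe I have the hardcoded aversion
P_D = 0.3          # prior: maybe I genuinely disvalue escamoles
leak = 0.05        # chance of an "ew" signal with neither cause active
strength_A = 0.9   # chance the aversion alone triggers the signal
strength_D = 0.8   # chance genuine disvalue alone triggers the signal

def p_signal(a, d):
    """P(negative reward signal | causes) under a noisy-OR."""
    p_no_signal = 1 - leak
    if a:
        p_no_signal *= 1 - strength_A
    if d:
        p_no_signal *= 1 - strength_D
    return 1 - p_no_signal

def posterior_disvalue(known_aversion=None):
    """P(D | signal observed), optionally also conditioning on A."""
    num = den = 0.0
    for a in (True, False):
        if known_aversion is not None and a != known_aversion:
            continue
        for d in (True, False):
            joint = (P_A if a else 1 - P_A) * (P_D if d else 1 - P_D) * p_signal(a, d)
            den += joint
            if d:
                num += joint
    return num / den

print("P(disvalue | ew signal)                 =", round(posterior_disvalue(), 3))
print("P(disvalue | ew signal, known aversion) =", round(posterior_disvalue(True), 3))
# The second number is lower: once the hardcoded aversion is known to be present,
# the signal is partially explained away and is weaker evidence about my values.
```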

[Image: Escamoles, ant larvae in Mexico City (via Eat Your World)]

The Next Question

Meta: This section was added a day after the post went up, in response to comments.

The phrase "learning about values" makes it sound like there are some "real values" out there which humans learn about. However, it would be more accurate to say that humans reason as though there were some "real values" out there which humans learn about.

That sets up the main next question: when, and to what extent, is there a referent of the "values" which humans think they're learning about? In ordinary epistemic reasoning, we might say that e.g. "my estimate of the position of the dog in my apartment refers to something when there is in fact a dog in my apartment, and doesn't refer to anything when there isn't a dog in my apartment". Insofar as values work via ordinary epistemics, we should be able to apply similar reasoning to questions of when humans' value-estimates refer to some "real values".

... and that's probably a topic for future posts.

  1. ^
  2. ^

    Of course sometimes a Bayesian reasoner’s beliefs are not “about” any particular external thing either, because the reasoner is so thoroughly confused that it has beliefs about things which don’t exist - like e.g. beliefs about the current location of my dog. I don’t have a dog. But unlike the adaptive system case, for a Bayesian reasoner such confusion is generally considered a failure of some sort.

  3. ^

    Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

    • The human’s stated preferences
    • The human’s revealed preferences
    • The human’s in-the-moment experience of bliss or dopamine or whatever
    • <whatever other readily-measurable notion of “values” springs to mind>

    The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure. Things like stated preferences, revealed preferences, etc are useful proxies for human values, but those proxies importantly break down in various cases - even in day-to-day life.

  4. ^

    Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.

47 comments

Comments sorted by top scores.

comment by Steven Byrnes (steve2152) · 2024-09-20T02:11:40.463Z · LW(p) · GW(p)

The post is mostly arguing that desires can shift around as we learn and think, which I agree with. But a couple parts of the post (including the title and some of the subheadings) seem to suggest something more than that: it’s not just desire-shifts, but desire convergence, towards a thing called “our values”.

(In other words, if I say that “I’m learning about blah”, then I’m strongly implying that there’s a fact-of-the-matter about blah, and that my beliefs about blah are approximately-monotonically converging towards that fact-of-the-matter. Right?)

Do you think there’s a thing (“human values”) to which desires gradually converge, via the kinds of processes described in this post? (I don’t, see §2.7 here [? · GW].)

Replies from: johnswentworth
comment by johnswentworth · 2024-09-20T16:16:35.742Z · LW(p) · GW(p)

Close, but not quite what this post is saying. The core claim in this post is that our brains model the world as though there's a thing called "our values", and try to learn about those values in the usual epistemic way.

That does not necessarily imply that there actually is a thing called "our values". A fairly precise analogy: consider a Bayesian reasoner which is hardcoded to believe it's in the Game of Life, but is in fact hooked up to my webcam. That reasoner will, for instance, try to learn about the positions of rocks and oscillators and gliders, but there are not actually any rocks or oscillators or gliders in its environment.

You are exactly right that if I say “I’m learning about blah”, then I’m strongly implying that there’s a fact-of-the-matter about blah, and that I think I'm updating in such a way that my beliefs approximately-monotonically converge towards the fact-of-the-matter. But whether there actually exists a fact-of-the-matter, and whether I'm actually converging towards it, is a separate question from whether I think there is, or whether I model the world as though there's a fact-of-the-matter.

And that's the main next question which this post sets up: when, and to what extent, is there a referent of the "values" which humans think they're learning about? Insofar as this all works like ordinary epistemics, we should be able to answer that question in much the same way that we answer any other question about when a human who thinks they're learning about blah is learning about an actual thing. Just like the example of the position of the dog in my apartment.

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-21T06:37:20.253Z · LW(p) · GW(p)

The core claim in this post is that our brains model the world as though there's a thing called "our values", and try to learn about those values in the usual epistemic way.

I find that a very strange idea, as strange as Plato’s Socrates’ parallel idea that learning is not the acquisition of something new, but recollection of what one had forgotten.

If I try X, anticipating that it will be an excellent experience, and find it disappointing, I have not learned something about my values, but about X.

I have never eaten escamoles. If I try them, what I will discover is what they are like to eat. If I like them, did I always like them? That is an unheard-falling-trees question.

If I value a thing at one period of life and turn away from it later, I have not discovered something about my values. My values have changed. In the case of the teenager we call this process “maturing”. Wine maturing in a barrel is not becoming what it always was, but simply becoming, according to how the winemaker conducts the process.

But people have this persistent illusion that how they are today is how they always were and always will be, and that their mood of the moment is their fundamental nature, despite the evidence of their own memory.

Replies from: tailcalled, MinusGix
comment by tailcalled · 2024-09-21T07:41:36.478Z · LW(p) · GW(p)

In Bayesian decision theory, there's the distinction between expected utility, which changes as one learns about the environment, and actual utility, which does not. Under this frame, I'd be inclined to round you off to using the words "values"/"liking"/etc. to refer to expected utility. Would you agree with that? If not, why not?
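(One way to write the distinction, as a rough gloss: for an action a, expected utility EU(a) = Σ_o P[o | a, evidence] · U(o) moves around as the probabilities update with new information, while the utility function U over outcomes stays fixed.)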

It might be tempting to round the OP off to use the word "values"/"ought" to refer to actual utility, but the details of that are kind of awkward at the edges so I would hold off on that.

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-21T08:19:14.601Z · LW(p) · GW(p)

That is just replacing the idea of fixed values with a fixed utility function. But it is just as changeable whatever you call it.

Show me your utility function before you were born.

Replies from: tailcalled
comment by tailcalled · 2024-09-21T09:04:30.247Z · LW(p) · GW(p)

I don't actually personally agree with Bayesian decision theory anymore and am currently inclined to treat value more like an objective fact about the world than as an individual preference. The provocative position would be a Beff-like one that value = entropy, but while that is an incremental improvement on utilitarianism/value = negentropy, it is hellish and therefore I can't endorse it fully.

Regardless of the issues with my own position, I'm confused about your worldview. Do you not have a distinction between expected and actual utility, or do you consider there to be two different kinds of changes in values? How do you model value of information? (If you do model it, that is.)

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-21T12:10:11.652Z · LW(p) · GW(p)

Expected utility is what you have before the outcome of an action is known. Actual utility is what you have after the outcome is known. Here, the utility function has remained the same and you have acquired knowledge of the outcome.

Someone no longer finding a thing valuable that they used to, has either re-evaluated the thing in the light of new information about it, or changed the value they (their utility function) put on it.

Replies from: tailcalled
comment by tailcalled · 2024-09-21T12:17:51.994Z · LW(p) · GW(p)

So you're basically working with a maximally-shattered model of agency where life consists of a bunch of independent activities that can be fully observed post-hoc and which have no connection between them?

So e.g. if you sometimes feel like eating one kind of food and other times feel like eating another kind of food, you just think "ah, my food preference arbitrarily changed", not "my situation changed so that the way to objectively improve my food intake is different now than it was in the past"?

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-21T21:02:34.472Z · LW(p) · GW(p)

No. I can't make any sense of where that came from.

So e.g. if you sometimes feel like eating one kind of food and other times feel like eating another kind of food, you just think "ah, my food preference arbitrarily changed", not "my situation changed so that the way to objectively improve my food intake is different now than it was in the past"?

No, there is simply no such thing as a utility function over foodstuffs.

Replies from: tailcalled
comment by tailcalled · 2024-09-21T21:06:41.863Z · LW(p) · GW(p)

I'm basically confused about what a canonical example of changing values looks like to you. Like I assume you have some examples that make you postulate that it is possible or something. I've seen changes in food taste used as a canonical example before, but if that's not your example, then I would like to hear what your example is.

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-22T12:07:59.792Z · LW(p) · GW(p)

One example would be the generic one from the OP: "As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight." This hypothetical teenager's values (I suggest, in disagreement with the OP) have changed. Their knowledge about the world has no doubt also changed, but I see no need to postulate some unobservable deeper value underlying their earlier views that has remained unchanged, only their knowledge about Z having changed.

Long-term lasting changes in one's food preferences might also count, but not the observation that whatever someone has for lunch, they are less likely to have again for dinner.

Utility theory is overrated. There is a certain mathematical neatness to it for "small world" problems, where you know all of the possible actions and their possible effects, and the associated probabilities and payoffs, and you are just choosing the best action, once. Eliezer has described the situation as like a set of searchlights coherently pointing in the same direction. But as soon as you try to make it a universal decision theory it falls apart for reasons that are like another set of searchlights pointing off in all directions, such as unbounded utility, St Petersburg-like games, "outcomes" consisting of all possible configurations of one's entire future light-cone, utility monsters, repugnant conclusions, iterated games, multi-player games, collective utility, agents trying to predict each other, and so on, illuminating a landscape of monsters surrounding the orderly little garden of VNM-based utility.

Replies from: tailcalled
comment by tailcalled · 2024-09-22T12:16:49.499Z · LW(p) · GW(p)

A generic example is kind of an anti-example though.

If you reject utility theory, what approach do you use for modelling values instead, and what makes you feel that approach is helpful?

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-22T12:26:10.421Z · LW(p) · GW(p)

I don't have one. What would I use it for? I don't think anyone else yet has one, at least not something mathematically founded, with the simplicity and inevitability of VNM. People put forward various ideas and discuss the various "monsters" I listed, but I see no sign of a consensus.

Replies from: tailcalled
comment by tailcalled · 2024-09-22T12:27:17.201Z · LW(p) · GW(p)

What's the use in saying that values change, rather than just saying that you aren't interested in concepts involving values, then?

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-23T07:10:28.368Z · LW(p) · GW(p)

I can still be interested, even if I don't have the answers.

Replies from: tailcalled
comment by tailcalled · 2024-09-23T07:50:18.859Z · LW(p) · GW(p)

Right, but I'm asking why. Like even if you don't have a complete framework, I'd think you'd have a general motive for your interest or something.

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-23T08:51:17.674Z · LW(p) · GW(p)

It's an interesting open problem.

Here is an analogy. Classical utility theory, as developed by VNM, Savage, and others, the theory about which Eliezer made the searchlight comment, is like propositional calculus. The propositional calculus exists, it's useful, you cannot ever go against it without falling into contradiction, but there's not enough there to do much mathematics. For that you need to invent at least first-order logic, and use that to axiomatise arithmetic and eventually all of mathematics, while fending off the paradoxes of self-reference. And all through that, there is the propositional calculus, as valid and necessary as ever, but mathematics requires a great deal more.

The theory that would deal with the "monsters" that I listed does not yet exist. The idea of expected utility may thread its way through all of that greater theory when we have it, but we do not have it. Until we do, talk of the utility function of a person or of an AI is at best sensing what Eliezer has called the rhythm of the situation [? · GW]. To place over-much reliance on its letter will fail.

Replies from: tailcalled
comment by tailcalled · 2024-09-23T08:54:39.089Z · LW(p) · GW(p)

But propositional calculus and first-order logic exist to support mathematics, which was developed before formal logic. What's your mathematics-of-value, rather than your logic-of-value?

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-23T16:48:16.363Z · LW(p) · GW(p)

That was an analogy, a similarity between two things, not an isomorphism.

The mathematics of value that you are asking for is the thing that does not exist yet. People, including me, muddle along as best they can; sometimes at less than that level. Post-rationalists like David Chapman valorise this as "nebulosity", but I don't think 19th century mathematicians would have been well served by that attitude.

Replies from: cubefox
comment by cubefox · 2024-09-28T00:38:05.234Z · LW(p) · GW(p)

Richard Jeffrey has a nice utility theory which applies to a Boolean algebra of propositions (instead of e.g. to Savage's acts/outcomes/states of the world), similar to probability theory.

In fact, it consists of just two axioms plus the three probability axioms.
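(If I'm remembering it right, the distinctive one is the desirability/averaging axiom: for incompatible propositions A and B with P(A ∨ B) > 0,

des(A ∨ B) = (P(A)·des(A) + P(B)·des(B)) / (P(A) + P(B))

i.e. the value of a disjunction is the probability-weighted average of the values of its disjuncts.)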

The theory doesn't involve time, like probability theory. It also applies to just one agent, again like probability theory.

It doesn't solve all problems, but neither does probability theory, which e.g. doesn't solve the sleeping beauty paradox.

Do you nonetheless think utility theory is significantly more problematic than probability theory? Or do you reject both?

Replies from: Richard_Kennaway
comment by Richard_Kennaway · 2024-09-28T06:38:16.754Z · LW(p) · GW(p)

Utility theory is significantly more problematic than probability theory.

In both cases, from certain axioms, certain conclusions follow. The difference is in the applicability of those axioms in the real world. Utility theory is supposedly about agents making decisions, but as I remarked earlier in the thread, these are "agents" that make just one decision and stop, with no other agents in the picture.

I have read that Morgenstern was surprised that so much significance was read into the VNM theorem on its publication, when he and von Neumann had considered it to be a rather obvious and minor thing, relegated to the appendix of their book. I have come to agree with that assessment.

[Jeffrey's] theory doesn't involve time, like probability theory. It also applies to just one agent, again like probability theory.

Probability theory is not about agents. It is about probability. It applies to many things, including processes in time.

That people fail to solve the Sleeping Beauty paradox does not mean that probability theory fails. I have never paid the problem much attention, but Ape in the coat [LW · GW]'s analysis seems convincing to me.

Replies from: cubefox
comment by cubefox · 2024-09-28T22:03:29.131Z · LW(p) · GW(p)

I mean in a subjective interpretation, a probability function represents the beliefs of one person at one point in time. Equally, a (Jeffrey) utility function can represent the desires of one person at one particular point in time. As such it is a theory of what an agent believes and wants.

Decisions can come into play insofar as individual actions can be described by propositions ("I do A", "I do B") and each of those propositions is equivalent to a disjunction of the form "I do A and X happens or I do A and not-X happens", which is subject to the axioms. But decisions are not something baked into the theory, much like probability theory isn't necessarily about urns and gambles.

comment by MinusGix · 2024-09-28T12:05:49.698Z · LW(p) · GW(p)

If I value a thing at one period of life and turn away from it later, I have not discovered something about my values. My values have changed. In the case of the teenager we call this process “maturing”. Wine maturing in a barrel is not becoming what it always was, but simply becoming, according to how the winemaker conducts the process.

Your values change according to the process of reflection - the grapes mature into wine through fun chemical reactions.
From what you wrote, it feels like you are mostly considering your 'first-order values'. However, you have an updating process that you also have values about. For example, I wouldn't respect simple mind control that alters my first-order values, because my values treat mind control as disallowed. Similarly, I wouldn't take a very potent drug even if I know my first-order values would rank the feeling very highly, because I don't endorse that specific sort of change.

I have never eaten escamoles. If I try them, what I will discover is what they are like to eat. If I like them, did I always like them? That is an unheard-falling-trees question.

Then we should split the question. Do you have a value for escamoles specifically before eating them? No. Do you have a system of thought (of updating your values) that would ~always result in liking escamoles? Well, no in full generality. You might end up with some disease that affects your tastebuds permanently. But in some reasonably large class of normal scenarios, your values would consistently update in a way that would end up liking escamoles were you to ever eat them. (But really, the value for escamoles is more an instrumental value for [insert escamole flavor, texture, etc.], which escamoles are learned to be a good instance of.)

What johnswentworth mentions would then be the question of "Would this approved process of updating my values converge to anything"; or tend to in some reasonable reference class; or at least have some guaranteed properties that aren't freely varying. I don't think he is arguing that the values are necessarily fixed and always persistent (I certainly don't always handle my values according to my professed beliefs about how I should update them), but that they're constrained. That the brain also models them as reasonably constrained, and that you can learn important properties of them.

comment by jessicata (jessica.liu.taylor) · 2024-09-20T00:10:50.792Z · LW(p) · GW(p)

I discussed something similar in the "Human brains don't seem to neatly factorize" section of the Obliqueness [LW · GW] post. I think this implies that, even assuming the Orthogonality Thesis, humans don't have values that are orthogonal to human intelligence (they'd need to not respond to learning/reflection to be orthogonal in this fashion), so there's not a straightforward way to align ASI with human values by plugging in human values to more intelligence.

comment by Eli Tyre (elityre) · 2024-10-09T04:43:16.631Z · LW(p) · GW(p)

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as being the result of two decoupled things: either a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth so it's less likely that the negative reward is evidence of my values in this case; the signal is explained away in the usual epistemic sense by some cause other than my values.

I have a "what?" reaction to this part. Unpacking that makes me doubt this whole frame.

From your description earlier in this post, it sounds like your "values" are a sort of hypothetical construct, whose shape is constrained by the reward-signal information. I think you're not positing that there's a physical structure somewhere in the brain which encodes a human's values in full; rather, our beliefs about our values are mediated by reward signals.

Values aren't a "real thing" that you can discover facts about, and in an important sense there's "just the reward signal", and our beliefs about the underlying function(s) that determine the outputs of the reward signal. The values themselves aren't in there somewhere, separately from the reward circuit.

Given that, I don't know what it would mean to update that any particular instance of negative reward is "explained away" by some cause other than the values. 

It seems like one of several things can be happening in that situation:

  1. You adopt some abstract beliefs about this made-up construct that you call "your values." The underlying reward input-output mapping doesn't change at all. Nothing changed except your self model.
  2. There were two parts / subcomponents of your brain (I speculate, literal subnetworks), each of which were hooked into the reward circuitry. Those networks interact somehow, and at least one of them updates the other. Some implicit model underlying the reward-signal function updates, and so you have at least a partially different "yuck"/"yum" response to escamoles.
  3. A combination of 1 and 2: You adopt your abstract beliefs about this hypothetical thing called "your values". Because of an important complexity of human motivation, one's abstract beliefs about one's "values" are an input to the reinforcement circuitry (both because your beliefs about yourself influence your yuck/yum reactions directly, and because your beliefs about yourself influence your actions, which can differentially reinforce some reactions). Changing your self-model changes your self.
    1. Importantly, because of this loopiness, there can be degrees of freedom: there might be multiple stable attractors in the space of self-models/reward-outputs. Which means that the process of "updating your abstract beliefs" is not strictly a matter of epistemics, it's a matter of agency: you can choose what you want your "values" to be, to the exact extent that your abstract beliefs influence the reward circuitry and no more.

None of those were an epistemic process of updating your model of your "values" to better conform to the evidence about what your "values" are. All of these are about bidirectional interactions between the reward function(s) and various implicit or explicit beliefs[1]. But they don't seem well modeled by an epistemic process of "My values are a static thing that are out there/ in here, and I'm doing Bayesian updates to model those values more and more accurately."

 

 

  1. ^

    Well, except for the first one, which was a unidirectional interaction. 

Replies from: johnswentworth
comment by johnswentworth · 2024-10-09T04:59:51.912Z · LW(p) · GW(p)

An analogy which might help: imagine a TV showing a video of an apple. The video was computer generated by a one-time piece of code, so there's no "real apple" somewhere else in the world which the video is showing. Nonetheless, there's still some substantive sense in which the apple on screen is "a thing" - even though I can only ever see it through the screen, I can still look at that apple on the screen and e.g. predict that a spot on the apple will still be there later, I can still discover things about the apple, etc.

Values, on this post's model, are like that apple, and our reward signals are like the TV.

So the values are "not real" in the sense that they don't correspond to anything else in the world beyond the metaphorical TV (i.e. our rewards). But there can still be a substantive sense in which the values "displayed by" the rewards are "a thing" - there are still consistent patterns in the reward stream which can mostly be predictively well-modeled as "showing some values", much like the TV is well-modeled as "showing an apple".

Now, imagine that I'm watching the video of the apple, and suddenly the left half of the TV blue-screens. Then I'd probably think "ah, something messed up the TV, so it's no longer showing me the apple" as opposed to "ah, half the apple just turned into a big blue square". Likewise, if I see some funny data in my reward stream, I think "ah, something is messing with my reward stream" as opposed to "ah, my values just completely changed into something weirder".

A similar analogy applies to values changing over time. If I'm watching the video of the apple, and suddenly a different apple appears, or if the apple gradually morphs into a different apple... well, I can see on the screen that the apple is changing. The screen consistently shows one apple at one time, and a different apple at a later time. Likewise for values and reward: if something physiologically changes my rewards on a long timescale, I may consistently see different values earlier vs later on that long timescale, and it makes sense to interpret that as values changing over time.

Did that help?

Replies from: elityre, johnswentworth
comment by Eli Tyre (elityre) · 2024-10-10T07:05:32.222Z · LW(p) · GW(p)

Yes! That analogy is helpful for communicating what you mean!

I still have issues with your thesis though.

I agree that this "explaining away" thing could be a reasonable way to think about, e.g., the situation where I get sick, and while I'm sick, some activity that I usually love (let's say singing songs) feels meaningless. I probably shouldn't conclude that "my values" changed, just that the machinery that implements my reward circuitry is being thrown off by my being sick.

On the other hand, I think I could just as well describe this situation as extending the domain over which I'm computing my values. E.g. "I love and value singing songs, when I'm healthy, but when I'm sick in a particular way, I don't love it. Singing-while-healthy is meaningful; not singing per se."

In the same way, I could choose to call the blue screen phenomenon an error in the TV, or I could include that dynamic as part of the "predict what will happen with the screen" game. Since there's no real apple that I'm trying to model, only an ephemeral image of the apple, there's not a principled place to stand on whether to view the blue-screen as an error, or just part of the image generating process. 

For any given fuckery with my reward signals, I could call them errors, misrepresenting my "true values" or I could embrace them as expressing a part of my "true values." And if two people disagree about which conceptualization to go with, I don't know how they could possibly resolve it. They're both valid frames, fully consistent with the data. And they can't get distinguishing evidence, even in principle.

(I think this is not an academic point. I think people disagree about values in this way reasonably often.

Is enjoying masturbating to porn an example of your reward system getting hacked by external super-stimuli, or is that just part of the expression of your true values? Both of these are valid ways to extrapolate from the reward data time series. Which things count as your reward system getting hacked, and which things count as representing your values? It seems like a judgement call!

The classic and most fraught example is that some people find it drop dead obvious that they care about the external world, and not just their sense-impressions about the external world. They're horrified by the thought of being put in an experience machine, even if their subjective experience would be way better. 

Other people just don't get this. "But your experience would be exactly the same as if the world was awesome. You wouldn't be able to tell the difference", they say. It's obvious to them that they would prefer the experience machine, as long as their memory was wiped so they didn't know they were in one.[1])

Talking about an epistemic process attempting to update your model of an underlying not-really-real-but-sorta structure seems to miss the degrees of freedom in the game. Since there's no real apple, no one has any principled place to stand in claiming that "the apple really went half blue right there" vs. "no the TV signal was just interrupted." Any question about what the apple is "really doing" is a dangling node [? · GW]. [2]


As a separate point, while I agree the "explaining away disruptions" phenomenon is ever a thing that happens, I don't think that's usually what's happening when a person reflects on their values. Rather I guess that it's one of the three options that I suggested above.

  1. ^

    Tangentially, this is why I expect that the CEV of humans diverges. I think some humans, on maximal reflection, wirehead, and others don't.

  2. ^

    Admittedly, I think the question of which extrapolation schema to use is itself decided by "your values", which ultimately grounds out in the reward data. Some people have perhaps a stronger feeling of indignation about others hiding information from them, or perhaps a stronger sense of curiosity, or whatever, that crystallizes into a general desire to know what's true. Other people have less of that. And so they have different responses to the experience-machine hypothetical.

    Because which extrapolation procedure any given person decides to use is itself a function of "their values" it all grounds out in the reward data eventually. Which perhaps defeats my point here.

Replies from: johnswentworth
comment by johnswentworth · 2024-10-10T15:45:04.786Z · LW(p) · GW(p)

For any given fuckery with my reward signals, I could call them errors, misrepresenting my "true values" or I could embrace them as expressing a part of my "true values." And if two people disagree about which conceptualization to go with, I don't know how they could possibly resolve it. They're both valid frames, fully consistent with the data. And they can't get distinguishing evidence, even in principle.

I'd classify this as an ordinary epistemic phenomenon. It came up in this thread [LW(p) · GW(p)] with Richard just a couple days ago.

Core idea: when plain old Bayesian world models contain latent variables, it is ordinary for those latent variables to have some irreducible uncertainty - i.e. we'd still have some uncertainty over them even after updating on the entire physical state of the world. The latents can still be predictively useful and meaningful, they're just not fully determinable from data, even in principle.

Standard example (copied from the thread with Richard): the Boltzmann distribution for an ideal gas - not the assorted things people say about the Boltzmann distribution, but the actual math, interpreted as Bayesian probability. The model has one latent variable, the temperature T, and says that all the particle velocities are normally distributed with mean zero and variance proportional to T. Then, just following the ordinary Bayesian math: in order to estimate T from all the particle velocities, I start with some prior P[T], calculate P[T|velocities] using Bayes' rule, and then for ~any reasonable prior I end up with a posterior distribution over T which is very tightly peaked around the average particle energy... but has nonzero spread. There's small but nonzero uncertainty in T given all of the particle velocities. And in this simple toy gas model, those particles are the whole world, there's nothing else to learn about which would further reduce my uncertainty in T.
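Here's a quick numerical version of that toy model (arbitrary numbers, proportionality constant between variance and temperature set to 1, flat prior on a grid), just to show the "tightly peaked but nonzero spread" behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "whole world": N particle velocities, each ~ Normal(0, variance = T_true).
T_true = 2.0
N = 1000
v = rng.normal(0.0, np.sqrt(T_true), size=N)

# Grid of candidate temperatures, with a flat prior over the grid.
T_grid = np.linspace(0.5, 5.0, 2000)
dT = T_grid[1] - T_grid[0]

# Log-likelihood of all velocities under each candidate T:
# sum over particles of log Normal(v_i; mean 0, variance T).
sum_v2 = np.sum(v ** 2)
log_lik = -0.5 * N * np.log(2 * np.pi * T_grid) - sum_v2 / (2 * T_grid)

# Flat prior, so posterior is proportional to likelihood; normalize on the grid.
post = np.exp(log_lik - log_lik.max())
post /= post.sum() * dT

post_mean = np.sum(T_grid * post) * dT
post_std = np.sqrt(np.sum((T_grid - post_mean) ** 2 * post) * dT)

print("empirical mean of v^2:", sum_v2 / N)  # close to T_true
print("posterior mean of T:  ", post_mean)   # close to the same number
print("posterior std of T:   ", post_std)    # small, but not zero
# Even though these particles are the entire "world" in the toy model,
# the posterior over T retains a small but nonzero spread.
```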

Bringing it back to wireheading: first, the wireheader and the non-wireheader might just have different rewards; that's not the conceptually interesting case, but it probably does happen. The interesting case is that the two people might have different value-estimates given basically-similar rewards, and that difference cannot be resolved by data because (like temperature in the above example) the values-latent is underdetermined by the data. In that case, the difference would be in the two people's priors, which would be physiologically-embedded somehow.

comment by johnswentworth · 2024-10-09T05:12:29.679Z · LW(p) · GW(p)

(Another angle: consider Harry Potter. Harry Potter is fictional, but he's still "a thing"; I can know things about Harry Potter, the things I know about Harry Potter can have predictive power, and I can discover new things about Harry Potter. So what does it mean for Harry to be "fictional"? Well, it means we can only ever "see" Harry through metaphorical TV screens - be it words on a page, or literal screens.

Values are "fictional" in that same sense; reward is the medium in which the fiction is expressed.)

comment by dxu · 2024-09-20T20:56:13.075Z · LW(p) · GW(p)

These two kinds of “learning” are not synonymous. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing.

I think I am confused both about whether I think this is true, and about how to interpret in such a way that it might be true. Could you go into more detail on what it means for a learner to learn something without there being some representational semantics that could be used to interpret what it's learned, even if the learner itself doesn't explicitly represent those semantics? Or is the lack of explicit representation actually the core substance of the claim here?

comment by Johannes C. Mayer (johannes-c-mayer) · 2024-09-20T10:58:36.106Z · LW(p) · GW(p)

reward is the evidence from which we learn about our values

A sadist might feel good each time they hurt somebody. I am pretty sure it is possible for a sadist to exist who does not endorse hurting people, meaning they feel good if they hurt people, but they avoid it nonetheless.

So to what extent is hurting people a value? It's like the sadist's brain tries to tell them that they ought to want to hurt people, but they don't want to. Intuitively the "they don't want to" seems to be the value.

Replies from: Measure
comment by Measure · 2024-09-20T13:34:37.821Z · LW(p) · GW(p)

This seems similar to the ant larvae situation where they reflectively argue around the hardcoded reward signal. Hurting people might still be considered a value the sadist has, but it trades off against other values.

Replies from: David Lorell, johannes-c-mayer
comment by David Lorell · 2024-09-20T21:33:18.898Z · LW(p) · GW(p)

Not quite what we were trying to say in the post. Rather than tradeoffs being decided on reflection, we were trying to talk about the causal-inference-style "explaining away" which the reflection gives enough compute for. In Johannes's example, the idea is that the sadist might model the reward as coming potentially from two independent causes: a hardcoded sadist response, and "actually" valuing the pain caused. Since the probability of one cause, given the effect, goes down when we also know that the other cause definitely obtained, the sadist might lower their probability that they actually value hurting people given that (after reflection) they're quite sure they are hardcoded to get reward for it. That's how it's analogous to the ant thing.

comment by Johannes C. Mayer (johannes-c-mayer) · 2024-09-20T20:54:51.631Z · LW(p) · GW(p)

Yes exactly. The larva example illustrates that there are different kinds of values. I thought it was underexplored in the OP to characterize exactly what these different kinds of values are.

In the sadist example we have:

  1. the hardcoded pleasure of hurting people.
  2. And we have, let's assume, the wish to make other people happy.

These two things both seem like values. However, they seem to be qualitatively different kinds of values. I intuit that more precisely characterizing this difference is important. I have a bunch of thoughts on this that I failed to write up so far.

comment by Steven Byrnes (steve2152) · 2024-09-20T01:40:17.966Z · LW(p) · GW(p)

Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.

Here [LW · GW]’s an elaboration of what I mean:

The human brain has a model-based RL system that it uses for within-lifetime learning. I guess that previous sentence is somewhat controversial, but it really shouldn’t be:

  • The brain has a model—If I go to the toy store, I expect to be able to buy a ball.
  • The model is updated by self-supervised learning (i.e., predicting imminent sensory inputs and editing the model in response to prediction errors)—if I expect the ball to bounce, and then I see the ball hit the ground without bouncing, then next time I see that ball heading towards the ground, I won’t expect it to bounce.
  • The model informs decision-making—If I want a bouncy ball, I won’t buy that ball, instead I’ll buy a different ball.
  • There’s reinforcement learning—If I drop the ball on my foot just to see what will happen, and it really hurts, then I probably won’t do it again, and relatedly I will think of doing so as a bad idea.

…And that’s all I mean by “the brain has a model-based RL system”.

I emphatically do not mean that, if you just read a “model-based RL” paper on arxiv last week, then I think the brain works exactly like that paper you just read. On the contrary, “model-based RL” is a big tent comprising many different algorithms, once you get into details. And indeed, I don’t think “model-based RL as implemented in the brain” is exactly the same as any model-based RL algorithm on arxiv.
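To make those four bullets concrete, here's a deliberately crude toy sketch of just those ingredients (an invented example, emphatically not a claim about how the brain implements any of this): a tabular model of what each action leads to, updated from prediction errors; decisions made by consulting the model; and learned values over outcomes, updated from reward.

```python
import random

ACTIONS = ["buy_bouncy_ball", "buy_foam_ball", "drop_ball_on_foot"]

# The "environment": which outcome each action actually produces, and its reward.
TRUE_OUTCOME = {
    "buy_bouncy_ball": "ball_bounces",
    "buy_foam_ball": "ball_thuds",
    "drop_ball_on_foot": "foot_hurts",
}
REWARD = {"ball_bounces": 1.0, "ball_thuds": 0.0, "foot_hurts": -5.0}

# Agent's internal state: a model P(outcome | action), and estimated values of outcomes.
model = {a: {o: 1.0 / len(REWARD) for o in REWARD} for a in ACTIONS}
value = {o: 0.0 for o in REWARD}

def choose_action():
    """Decision-making: pick the action whose predicted outcomes look best."""
    return max(ACTIONS, key=lambda a: sum(p * value[o] for o, p in model[a].items()))

def update(action, outcome, reward, lr=0.3):
    """Self-supervised model update from prediction error, plus value update from reward."""
    for o in model[action]:
        target = 1.0 if o == outcome else 0.0
        model[action][o] += lr * (target - model[action][o])
    value[outcome] += lr * (reward - value[outcome])

# A short "lifetime" of trials, with some random exploration.
for _ in range(50):
    action = choose_action() if random.random() > 0.3 else random.choice(ACTIONS)
    outcome = TRUE_OUTCOME[action]
    update(action, outcome, REWARD[outcome])

print(choose_action())  # after learning: prefers the bouncy ball, avoids the foot-drop
```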

comment by Eli Tyre (elityre) · 2024-10-09T00:54:56.361Z · LW(p) · GW(p)

Notice that the wording of each example involves beliefs about values. They’re not just saying “I used to feel urge X, but now I feel urge Y”. They’re saying “I thought I wanted X” - a belief about a value! Or “now I think that was more a social script than my own values” - again, a belief about my own values, and how those values relate to my (previous) behavior. Or “I endorsed the view that Z is the highest objective” - an explicit endorsement of a belief about values. That’s how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - “learning” is more general than “learning about” - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style “learning”.

Importantly, this isn't the only way that people talk about their values.

Sometimes a person will say "I used to care deeply about X, but as I got older, I don't care as much", or "Y used to be the center of my life, but that was a long time ago", etc.

In those cases the person isn't claiming to have been mistaken about their values. Rather their verbiage expresses that they correctly ascertained their values, but their values themselves changed over time.

This could just be a matter of semantics, but these could also be distinct non-mutually exclusive phenomena. Sometimes we learn more about ourselves and our beliefs about our values change. Sometimes we change, not just our beliefs about ourselves and our values. 

Replies from: johnswentworth
comment by johnswentworth · 2024-10-09T01:04:47.097Z · LW(p) · GW(p)

Agreed.

I would also hypothesize that, in practice, the way this usually happens is that some physiological or environmental shift results in different reward signals. Then, insofar as the brain is treating the reward-signal-stream as evidence of some coherent underlying values, it makes sense for the person to feel like their reward-signal-stream was reasonably-consistently pointed at one set of values during one time period, and a different set of values during a different time period.

comment by Raemon · 2024-09-20T18:39:50.121Z · LW(p) · GW(p)

I felt a bit confused by this bit:

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as being the result of two decoupled things: either a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth so it's less likely that the negative reward is evidence of my values in this case; the signal is explained away in the usual epistemic sense by some cause other than my values. So, I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.

That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?

I feel like this sort of makes sense but don't quite parse why this counted as "explaining away." How do I know my hardcoded reactions aren't values? 

Replies from: David Lorell, johnswentworth
comment by David Lorell · 2024-09-20T19:58:29.082Z · LW(p) · GW(p)

Suppose you have a randomly activated (not dependent on weather) sprinkler system, and also it rains sometimes. These are two independent causes for the sidewalk being wet, each of which are capable of getting the job done all on their own. Suppose you notice that the sidewalk is wet, so it definitely either rained, sprinkled, or both. If I told you it had rained last night, your probability that the sprinklers went on (given that it is wet) should go down, since they already explain the wet sidewalk. If I told you instead that the sprinklers went on last night, then your probability of it having rained (given that it is wet) goes down for a similar reason. This is what "explaining away" is in causal inference. The probability of a cause given its effect goes down when an alternative cause is present.

In the post, the supposedly independent causes are "hardcoded ant-in-mouth aversion" and "value of eating escamoles", and the effect is negative reward. Realizing that you have a hardcoded ant-in-mouth aversion is like learning that the sprinklers were on last night. The sprinklers being on (incompletely) "explain away" the rain as a cause for the sidewalk being wet. The hardcoded ant-in-mouth aversion explains away the-amount-you-value-escamoles as a cause for the low reward.

I'm not totally sure if that answers your question, maybe you were asking "why model my values as a cause of the negative reward, separate from the hardcoded response itself"? And if so, I think I'd rephrase the heart of the question as, "what do the values in this reward model actually correspond to out in the world, if anything? What are the 'real values' which reward is treated as evidence of?" (We've done some thinking about that and might put out a post on that soon.)

Replies from: Raemon
comment by Raemon · 2024-09-20T20:36:46.345Z · LW(p) · GW(p)

Okay, I think one crystallization here for me is that "explaining away" is a matter of degree. (I think I found the second half of the comment less helpful, but the combo of the first half + John's response is helpful both for my own updating, and seeing where you guys are currently at)

comment by johnswentworth · 2024-09-20T19:43:26.818Z · LW(p) · GW(p)

The main observation from the quoted block is "man, this sure sounds like explaining away, if I'm treating my hardcoded reactions as a signal which is sometimes influenced by things besides values". But exactly when do I treat my hardcoded reactions as though they're being influenced by non-value stuff? I don't know yet; I don't yet understand that part.

comment by Charlie Steiner · 2024-09-20T04:48:53.573Z · LW(p) · GW(p)

This is related to the question "Are human values in humans, or are they in models of humans?"

Suppose you're building an AI to learn human values and apply them to a novel situation.

The "human values are in humans", ethos is that the way humans compute values is the thing AI should learn, and maybe it can abstract away many kinds of noise, but it shouldn't be making any big algorithmic overhauls. It should just find the value-computation inside the human (probably with human feedback) and then apply it to the novel situation.

The "human values are in models of humans" take is that the AI can throw away a lot of information about the actual human brain, and instead should find good models (probably with human feedback) that have "values" as a component of a coarse-graining of human psychology, and then apply those "good" models to the novel situation.

comment by AprilSR · 2024-09-19T23:44:55.092Z · LW(p) · GW(p)

This post definitely resolved some confusions for me. There are still a whole lot of philosophical issues, but it's very nice to have a clearer model of what's going on with the initial naïve conception of value.

comment by Raemon · 2024-10-03T17:34:27.175Z · LW(p) · GW(p)

I find myself wanting to curate this, because it illustrated a useful new frame and/or gear for thinking about values.

I feel like it's juuuust under the threshold for me feeling good about curating, because the example given feels sort of... simple. It's not exactly trivial but I think I'd be more confident what you meant if you contrasted it with an example that was more stereotypically "values-y" (i.e. I don't really care whether I like escamoles or not, although maybe some people do)

comment by Milan W (weibac) · 2024-09-19T22:59:38.904Z · LW(p) · GW(p)

Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

  • The human’s stated preferences
  • The human’s revealed preferences
  • The human’s in-the-moment experience of bliss or dopamine or whatever
  • <whatever other readily-measurable notion of “values” springs to mind>

The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure.
(...)
But clearly the reward signal is not itself our values.
(...)
reward is the evidence from which we learn about our values.


So we humans have a high-level cognitive structure to which we do not have direct access (values), but about which we can learn by observing and reflecting on the stimulus-reward mappings we experience, thus constructing an internal representation of such structure. This reward-based updating bridges the is-ought gap, since reward is a thing we experience and our values encode the way things ought to be.

Two questions:

  • How accurate is the summary I have presented above?
  • Where do values, as opposed to beliefs-about-values, come from?
     
Replies from: johnswentworth
comment by johnswentworth · 2024-09-20T00:52:26.384Z · LW(p) · GW(p)

How accurate is the summary I have presented above?

Basically accurate.

Where do values, as opposed to beliefs-about-values, come from?

That is the right next question to ask. Humans have a map of their values, and can update that map in response to rewards in order to "learn about values", but that still leaves the question of when/whether there are any "real values" which the map represents, and what kind-of-things those "real values" are.

A few parts of an answer:

  • "human values" are not one monolithic thing; we value lots of different stuff, and different parts of our value-estimates can separately represent "a real thing" or fail to represent "a real thing".
  • we don't yet understand what it means for part of our value-estimates to represent "a real thing", but it probably works pretty similarly to epistemic representation more generally - e.g. my belief about the position of the dog in my apartment represents a real thing (even if the position itself is wrong) exactly when there is in fact a dog in my apartment at all.
Replies from: weibac
comment by Milan W (weibac) · 2024-09-20T03:26:18.712Z · LW(p) · GW(p)

Thank you for the answer. I notice I feel somewhat confused, and that I regard the notion of "real values" with some suspicion I can't quite put my finger on. Regardless, an attempted definition follows.

Let a subject observation set be a complete specification of a subject and its past and current environment, from the subject's own subjectively accessible perspective. The elements of a subject observation set are observations/experiences observed/experienced by its subject.

Let O be the set of all subject observation sets.

Let a subject observation set class be a subset of O such that all its elements specify subjects that belong to an intuitive "kind of subject": e.g. humans, cats, parasitoid wasps.

Let V be the set of all (subject_observation_set, subject_reward_value) tuples. Note that all possible utility functions of all possible subjects can be defined as subsets of V, and that V = O × ℝ.

Let "real human values"  be the subset of V such that all subject_observation_set elements belong to the human subject observation set class.[1]
 

... this above definition feels pretty underwhelming, and I suspect that I would endorse a pretty small subset of "real human values" as defined above as actually good.

  1. ^

    Let the reader feel free to take the political decision of restricting the subject observation set class that defines "real human values" to sane humans.