We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

post by johnswentworth, David Lorell · 2024-09-19T22:22:05.307Z · LW · GW · 4 comments

Contents

  Background: “Learning” vs “Learning About”
  We Humans Learn About Our Values
  Two Puzzles
    Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap
    Puzzle 2: The Role of Reward/Reinforcement
    Using Each Puzzle To Solve The Other
  What This Looks Like In Practice

Background: “Learning” vs “Learning About”

Adaptive systems, reinforcement “learners”, etc, “learn” in the sense that their behavior adapts to their environment.

Bayesian reasoners, human scientists, etc, “learn” in the sense that they have some symbolic representation of the environment, and they update those symbols over time to (hopefully) better match the environment (i.e. make the map better match the territory).

These two kinds of “learning” are not synonymous[1]. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing[2].

We Humans Learn About Our Values

“I thought I wanted X, but then I tried it and it was pretty meh.”

“For a long time I pursued Y, but now I think that was more a social script than my own values.”

“As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight.”

The ubiquity of these sorts of sentiments is the simplest evidence that we do not typically know our own values[3]. Rather, people often (but not always) have some explicit best guess at their own values, and that guess updates over time - i.e. we can learn about our own values.

Note the wording here: we’re not just saying that human values are “learned” in the more general sense of reinforcement learning. We’re saying that we humans have some internal representation of our own values, a “map” of our values, and we update that map in response to evidence. Look again at the examples at the beginning of this section:

Notice that the wording of each example involves beliefs about values. They’re not just saying “I used to feel urge X, but now I feel urge Y”. They’re saying “I thought I wanted X” - a belief about a value! Or “now I think that was more a social script than my own values” - again, a belief about my own values, and how those values relate to my (previous) behavior. Or “I endorsed the view that Z is the highest objective” - an explicit endorsement of a belief about values. That’s how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - “learning” is more general than “learning about” - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style “learning”.

Two Puzzles

Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap

Very roughly speaking, an agent could aim to pursue any values regardless of what the world outside it looks like; “how the external world is” does not tell us “how the external world should be”. So when we “learn about” values, where does the evidence about values come from? How do we cross the is-ought gap?

Puzzle 2: The Role of Reward/Reinforcement

It does seem like humans have some kind of physiological “reward”, in a hand-wavy reinforcement-learning-esque sense, which seems to at least partially drive the subjective valuation of things. (Something something mesocorticolimbic circuit.) For instance, Steve Byrnes claims [LW · GW] that reward is a signal from which human values are learned/reinforced within lifetime[4]. But clearly the reward signal is not itself our values. So what’s the role of the reward, exactly? What kind of reinforcement does it provide, and how does it relate to our beliefs about our own values?

Using Each Puzzle To Solve The Other

Put these two together, and a natural guess is: reward is the evidence from which we learn about our values. Reward is used just like ordinary epistemic evidence - so when we receive rewards, we update our beliefs about our own values just like we update our beliefs about ordinary facts in response to any other evidence. Indeed, our beliefs-about-values can be integrated into the same system as all our other beliefs, allowing for e.g. ordinary factual evidence to become relevant to beliefs about values in some cases.
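To make the "reward as ordinary evidence" picture concrete, here is a minimal toy sketch - the binary setup and all of the numbers are made-up illustrations, not a model of actual human reward processing. It shows a single Bayes update in which an enjoyable reward signal shifts my credence in "I actually value this thing":

```python
# Toy sketch (illustrative numbers): treat a reward signal as ordinary Bayesian
# evidence about a latent binary variable V = "I actually value X".

def update(prior_v: float, p_reward_given_v: float, p_reward_given_not_v: float) -> float:
    """Posterior P(V | reward) via Bayes' rule, treating reward like any other evidence."""
    joint_v = prior_v * p_reward_given_v
    joint_not_v = (1 - prior_v) * p_reward_given_not_v
    return joint_v / (joint_v + joint_not_v)

# Before trying the thing, I'm unsure whether I value it.
belief = 0.5
# An enjoyable reward is more likely if I really do value it than if I don't.
belief = update(belief, p_reward_given_v=0.8, p_reward_given_not_v=0.2)
print(f"P(I value it | enjoyable reward) = {belief:.2f}")  # 0.80
```

The point of the sketch is just that nothing special happens at the reward step: the reward enters the calculation exactly like any other likelihood term would.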

What This Looks Like In Practice

I eat some escamoles for the first time, and my tongue/mouth/nose send me enjoyable reward signals. I update to believe that escamoles are good - i.e. I assign them high value.

Now someone comes along and tells me that escamoles are ant larvae. An unwelcome image comes into my head, of me eating squirming larvae. Ew, gross! That’s a negative reward signal - I downgrade my estimated value of escamoles.

Now another person comes along and tells me that escamoles are very healthy. I have already cached the belief that “health” is valuable to me. That cache hit doesn’t generate a direct reward signal at all, but it does update my beliefs about the value of escamoles, to the extent that I believe this person. That update just routes through ordinary epistemic machinery, without a reward signal at all.
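As a toy sketch of this step (the additive value model and all numbers are made-up assumptions, not a claim about how the brain actually combines these pieces): the testimony moves an ordinary factual belief, and the cached "health is valuable" entry translates that into a new value estimate, with no reward anywhere in the loop.

```python
# Toy sketch (assumed numbers) of the no-reward update: testimony shifts my belief
# that escamoles are healthy, and a cached "health is valuable to me" entry converts
# that ordinary factual update into an updated value estimate.

def value_estimate(p_healthy: float, taste_value: float = 0.3,
                   cached_health_value: float = 0.6) -> float:
    """Crude additive model: value = taste term + P(healthy) * cached value of health."""
    return taste_value + p_healthy * cached_health_value

p_healthy = 0.4                       # my belief before the conversation
print(value_estimate(p_healthy))      # 0.54

# I mostly trust this person, so an ordinary epistemic update raises P(healthy)...
p_healthy = 0.8
print(value_estimate(p_healthy))      # 0.78  ...and the value estimate rises with it.
```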

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that signal as coming from either of two decoupled sources: a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth, so the negative reward is weaker evidence about my values in this case; the signal is explained away, in the usual epistemic sense, by some cause other than my values. So I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.

That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?
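For concreteness, here is what that explaining-away step might look like as a toy Bayes-net calculation - the noisy-OR structure and all numbers are invented for illustration, not a claim about the actual machinery. The “ew, gross” signal has two candidate causes, a hardcoded insect aversion and “I don’t actually value escamoles”, and conditioning on the known aversion partially undoes the downgrade:

```python
# Toy sketch of explaining-away (illustrative numbers). Two possible causes of the
# negative "ew, gross" signal: A = a hardcoded insect aversion fired, and
# notV = "I don't actually value escamoles". The signal is a noisy-OR of the two.

from itertools import product

P_V = 0.5       # prior that I genuinely value escamoles
P_A = 0.5       # prior that the hardcoded aversion fired
LEAK, S_NOTV, S_A = 0.05, 0.7, 0.8  # noisy-OR parameters (assumed)

def p_ew(v: bool, a: bool) -> float:
    """P(negative 'ew' signal | V, A) as a noisy-OR of 'not V' and 'A'."""
    p_no_ew = 1 - LEAK
    if not v:
        p_no_ew *= 1 - S_NOTV
    if a:
        p_no_ew *= 1 - S_A
    return 1 - p_no_ew

def posterior_v(observe_a=None) -> float:
    """P(V | ew=1 [, A=observe_a]) by brute-force enumeration."""
    num = den = 0.0
    for v, a in product([True, False], repeat=2):
        if observe_a is not None and a != observe_a:
            continue
        joint = (P_V if v else 1 - P_V) * (P_A if a else 1 - P_A) * p_ew(v, a)
        den += joint
        if v:
            num += joint
    return num / den

print(f"P(value escamoles), prior:                 {P_V:.2f}")               # 0.50
print(f"P(value escamoles | 'ew' signal):          {posterior_v():.2f}")     # ~0.34
print(f"P(value escamoles | 'ew', known aversion): {posterior_v(True):.2f}") # ~0.46
```

The negative reward drags my estimated value of escamoles down, and then learning that the aversion is the likely culprit pulls it partway back up - the same qualitative pattern as the story above.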

[Image: Escamoles (ant larvae)]
  1. ^
  2. ^

    Of course sometimes a Bayesian reasoner’s beliefs are not “about” any particular external thing either, because the reasoner is so thoroughly confused that it has beliefs about things which don’t exist - like e.g. beliefs about the current location of my dog. I don’t have a dog. But unlike the adaptive system case, for a Bayesian reasoner such confusion is generally considered a failure of some sort.

  3. ^

    Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

    • The human’s stated preferences
    • The human’s revealed preferences
    • The human’s in-the-moment experience of bliss or dopamine or whatever
    • <whatever other readily-measurable notion of “values” springs to mind>

    The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure. Things like stated preferences, revealed preferences, etc are useful proxies for human values, but those proxies importantly break down in various cases - even in day-to-day life.

  4. ^

    Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.

4 comments

Comments sorted by top scores.

comment by jessicata (jessica.liu.taylor) · 2024-09-20T00:10:50.792Z · LW(p) · GW(p)

I discussed something similar in the "Human brains don't seem to neatly factorize" section of the Obliqueness [LW · GW] post. I think this implies that, even assuming the Orthogonality Thesis, humans don't have values that are orthogonal to human intelligence (they'd need to not respond to learning/reflection to be orthogonal in this fashion), so there's not a straightforward way to align ASI with human values by plugging in human values to more intelligence.

comment by AprilSR · 2024-09-19T23:44:55.092Z · LW(p) · GW(p)

This post definitely resolved some confusions for me. There are still a whole lot of philosophical issues, but it's very nice to have a clearer model of what's going on with the initial naïve conception of value.

comment by Milan W (weibac) · 2024-09-19T22:59:38.904Z · LW(p) · GW(p)

Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

  • The human’s stated preferences
  • The human’s revealed preferences
  • The human’s in-the-moment experience of bliss or dopamine or whatever
  • <whatever other readily-measurable notion of “values” springs to mind>

The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure.
(...)
But clearly the reward signal is not itself our values.
(...)
reward is the evidence from which we learn about our values.


So we humans have a high-level cognitive structure to which we do not have direct access (values), but about which we can learn by observing and reflecting on the stimulus-reward mappings we experience, thus constructing an internal representation of such structure. This reward-based updating bridges the is-ought gap, since reward is a thing we experience and our values encode the way things ought to be.

Two questions:

  • How accurate is the summary I have presented above?
  • Where do values, as opposed to beliefs-about-values, come from?
     
Replies from: johnswentworth
comment by johnswentworth · 2024-09-20T00:52:26.384Z · LW(p) · GW(p)

How accurate is the summary I have presented above?

Basically accurate.

Where do values, as opposed to beliefs-about-values, come from?

That is the right next question to ask. Humans have a map of their values, and can update that map in response to rewards in order to "learn about values", but that still leaves the question of when/whether there are any "real values" which the map represents, and what kind-of-things those "real values" are.

A few parts of an answer:

  • "human values" are not one monolithic thing; we value lots of different stuff, and different parts of our value-estimates can separately represent "a real thing" or fail to represent "a real thing".
  • we don't yet understand what it means for part of our value-estimates to represent "a real thing", but it probably works pretty similarly to epistemic representation more generally - e.g. my belief about the position of the dog in my apartment represents a real thing (even if the position itself is wrong) exactly when there is in fact a dog in my apartment at all.