[DISC] Are Values Robust?

post by DragonGod · 2022-12-21T01:00:29.939Z · LW · GW · No comments

This is a question post.

Contents

    Epistemic Status
    Related Posts
  Robust Values Hypothesis
  Why Does it Matter?
  Questions
  Answers
    2 Jonathan Stray
    2 Charlie Steiner
    -1 Mark Neyer

Epistemic Status

Discussion question [LW(p) · GW(p)].

 

See also:


Robust Values Hypothesis

Consider the following hypothesis:

  1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
    1. The larger the basin, the more robust values are
    2. Example operationalisations[2] of "privileged subset" that gesture in the right direction:
      1. Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
      2. The "minimal latents [LW · GW]" of "benevolent"/"universal" human values
    3. Example operationalisations of "broad basin of attraction" that gesture in the right direction:
      1. A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
        1. Larger neighbourhood ⇒ larger basin
  2. Said subset is a "naturalish" abstraction
    1. The more natural the abstraction, the more robust values are
    2. Example operationalisations of "naturalish abstraction":
      1. The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
        • More privileged ⇒ more natural
      2. Most efficient representations of our universe contain a simple embedding of the subset
        • Simpler embeddings ⇒ more natural
  3. Points within this basin are suitable targets for optimisation
    1. The stronger the optimisation pressure under which the target remains suitable, the more robust values are.
    2. Example operationalisations of "suitable targets for optimisation":
      1. Optimisation of this target is existentially safe[4]
      2. More strongly, we would be "happy" (were we fully informed) for the system to optimise for these points

 

The above claims specify different dimensions of "robustness". Questions about robustness should be understood as asking about all of them.
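
To make the "broad basin of attraction" and "optimisation pressure" language slightly more concrete, here is a purely illustrative toy sketch. Nothing in it comes from the hypothesis itself: the quality function, the safety threshold, and the dimensionality are all invented for illustration. It treats candidate value specifications as points near an "ideal" point and estimates how far a specification can drift before optimising it at a given pressure stops being "suitable":

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented setup: candidate value specifications are points in R^8, "ideal
# values" is a single privileged point, and "suitability under optimisation"
# is a stand-in quality score that degrades as the target drifts from the
# ideal and as more optimisation pressure is applied.
IDEAL = np.zeros(8)        # stand-in for the privileged subset of values
SAFE_THRESHOLD = 0.9       # outcomes scoring below this count as "not suitable"

def outcome_quality(values: np.ndarray, pressure: float) -> float:
    """Toy model: optimising a slightly-wrong target harder amplifies the error."""
    error = np.linalg.norm(values - IDEAL)
    return float(np.exp(-pressure * error))

def basin_radius(pressure: float, samples: int = 200) -> float:
    """Largest distance from IDEAL at which every sampled target is still suitable."""
    largest_safe = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        # sample candidate value specifications at distance r from IDEAL
        dirs = rng.normal(size=(samples, IDEAL.size))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        if all(outcome_quality(IDEAL + r * d, pressure) >= SAFE_THRESHOLD for d in dirs):
            largest_safe = r
    return largest_safe

for pressure in (0.5, 1.0, 2.0, 4.0):
    print(f"pressure={pressure:.1f}  estimated basin radius ≈ {basin_radius(pressure):.2f}")
```

In this toy model the estimated basin shrinks as the optimisation pressure grows, which is one way of reading 3.1: robustness is partly about how much pressure the basin can absorb before it collapses.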


Why Does it Matter?

The degree to which values are robust seems to be very relevant from an AI existential safety perspective.


Questions

A. What's the best/most compelling evidence/arguments in favour of robust values?

B. What's the best/most compelling evidence/arguments against robust values?

C. To what degree do you think values are robust?


I am explicitly soliciting opinions, so do please answer even if you do not believe your opinion to be particularly informed.

  1. ^

    Using the shard theory conception of "value" as "contextual influence on decision making".

  2. ^

    To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisations. The example operationalisations aren't even necessarily correct/accurate/sensible. They are meant to simply gesture in the right direction for what those terms might actually cash out to.

  3. ^

    "Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for.

    "Universal": roughly the subset of human values that we are happy for other humans to optimise for.

  4. ^

    Including "astronomical waste" as an existential catastrophe.

  5. ^

    The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed.

    Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.

Answers

answer by Jonathan Stray · 2022-12-21T21:05:49.439Z · LW(p) · GW(p)

I think there might be a broad set of values that emerge around group survival, essentially game-theoretic or evolutionary pressures that lead to cooperation. But I think the details beyond that are likely to be incredibly specific. I'd point to the "preference construction" literature as a more realistic account of how humans make choices, without assuming an underlying consistent preference structure.
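
(A purely illustrative aside, not from the answer itself: the standard toy model for "game-theoretic pressures that lead to cooperation" is the repeated prisoner's dilemma, in which a reciprocating strategy such as tit-for-tat sustains cooperation with itself while giving up little to an unconditional defector. The payoff numbers below are the conventional textbook values.)

```python
PAYOFFS = {  # (my move, their move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(history):       # copy the opponent's last move, open with cooperation
    return history[-1][1] if history else "C"

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b = [], []     # each entry: (own move, opponent's move)
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(hist_a), strategy_b(hist_b)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (600, 600)
print(play(tit_for_tat, always_defect))  # defection gains only a small edge: (199, 204)
```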

answer by Charlie Steiner · 2022-12-21T03:33:40.005Z · LW(p) · GW(p)

My best guess is that if we pretend we knew how to define a space where AIs that are similar under self-modification are close together, there would indeed be basins of attraction around most good points (AIs that do good things with the galaxy). However, I see no particular reason why there should only be one such basin of attraction, at least not without defining your space in an unnatural way. And of course there are going to be plenty of other basins of attraction; you don't ever get alignment by default by just throwing a dart into AI-space.

comment by DragonGod · 2022-12-21T10:12:44.204Z · LW(p) · GW(p)

A load-bearing claim of the robust values hypothesis for "alignment by default" is:

  1. Said subset is a "naturalish" abstraction
    1. The more natural the abstraction, the more robust values are
    2. Example operationalisations of "naturalish abstraction":
      1. The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
        • More privileged ⇒ more natural
      2. Most efficient representations of our universe contain a simple embedding of the subset
        • Simpler embeddings ⇒ more natural

 

The safety comes from #1 and #3, but #2 is why we're not throwing a dart at random into AI space. It's a property that makes value learning easier.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2022-12-21T10:40:56.287Z · LW(p) · GW(p)

Sure. Though see Take 4 [? · GW].

Replies from: DragonGod
comment by DragonGod · 2022-12-21T12:38:56.091Z · LW(p) · GW(p)

Claim #1 (about a "privileged subset") is a claim that there aren't multiple such natural abstractions (e.g. any other subset of human values that satisfies #3 would be a superset of the privileged subset, or a subset of the basin of attraction around the privileged subset.)

[But I haven't yet fully read that post or your other linked posts.]

answer by Mark Neyer · 2022-12-22T01:58:39.053Z · LW(p) · GW(p)

Hi! I've been an outsider in this community for a while, effectively for arguing exactly this: yes, values are robust. Before I set off all the 'quack' filters, I did manage to persuade Richard Ngo [LW(p) · GW(p)] that an AGI wouldn't want to kill humans right away.

I think that for embodied agents, convergent instrumental subgoals quite likely lead to alignment.

I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when we make the agent an embodied agent in our own universe.


Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal will become increasingly difficult to compute beyond some time horizon, and the threat of death is going to be present for any agent of any size. If you can't reliably predict something like 'the position of the moon 3,000 years from tomorrow', because numerical error gets worse over time, I don't see how it's possible to compute far more complicated queries about possible futures involving billions of agents.

Hence I suspect that the best way to maximize long-term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The easiest, simplest way to do this is to identify agents whose goals are roughly compatible with yours, identify the convergent instrumental subgoals of those agents, and help those agents on their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.

There is also a longer argument which says, 'instrumental rationality, once you expand the scope, turns into something like religion'.

comment by quetzal_rainbow · 2022-12-22T05:17:06.251Z · LW(p) · GW(p)

If your future doesn't have billions of agents, you don't need to predict them.

Replies from: mark-neyer
comment by Mark Neyer (mark-neyer) · 2022-12-22T13:12:48.561Z · LW(p) · GW(p)

Fine, replace the agents with rocks. The problem still holds.

There's no closed-form solution to the 3-body problem; you can only numerically approximate the future, with decreasing accuracy as time goes on. There are far more than 3 bodies in the universe relevant to the long-term survival of an AGI, which could die in any number of ways because it's made of many complex pieces that can all break or fail.
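
(A purely illustrative sketch of this point: everything below, including the masses, units, step size, softening, and initial conditions, is made up, and the integrator is deliberately crude. It integrates a planar three-body system twice, the second time with the initial positions nudged by one part in a billion, and prints how far apart the two predictions for the same body end up at increasing horizons.)

```python
import numpy as np

G, DT, STEPS, SOFTENING = 1.0, 1e-3, 20_000, 1e-2   # arbitrary toy constants
MASSES = np.array([1.0, 1.0, 1.0])

def accelerations(pos):
    """Pairwise Newtonian gravity, softened to avoid numerical blow-ups."""
    acc = np.zeros_like(pos)
    for i in range(3):
        for j in range(3):
            if i != j:
                d = pos[j] - pos[i]
                acc[i] += G * MASSES[j] * d / (d @ d + SOFTENING**2) ** 1.5
    return acc

def simulate(pos, vel):
    """Kick-drift-kick leapfrog integration; returns the trajectory of body 0."""
    pos, vel = pos.astype(float), vel.astype(float)
    trajectory = []
    acc = accelerations(pos)
    for _ in range(STEPS):
        vel += 0.5 * DT * acc
        pos += DT * vel
        acc = accelerations(pos)
        vel += 0.5 * DT * acc
        trajectory.append(pos[0].copy())
    return np.array(trajectory)

positions = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]])
velocities = np.array([[0.0, 0.3], [0.0, -0.3], [0.3, 0.0]])

baseline = simulate(positions, velocities)
perturbed = simulate(positions + 1e-9, velocities)   # tiny nudge to the initial state

for step in (1_000, 5_000, 10_000, 20_000):
    gap = np.linalg.norm(baseline[step - 1] - perturbed[step - 1])
    print(f"t = {step * DT:5.1f}   gap between the two runs ≈ {gap:.2e}")
```

The exact numbers depend on the arbitrary setup; the point is only that nothing bounds how quickly the two runs can separate, which is the sense in which long-range prediction of many-body futures degrades.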

comment by Seth Herd · 2023-11-17T23:51:41.355Z · LW(p) · GW(p)

The reason we're so concerned with instrumental convergence is that we're usually thinking of an AGI that can recursively self-improve until it can outmaneuver all of humanity and do whatever it wants. If it's a lot smarter than us, any benefit we could give it is small compared to the risk that we'll try to kill it or create more AGIs that will.

The future is hard to predict; that's why it's safest to eliminate any hard-to-predict parts that might actively try to kill you. If you can. If an AGI isn't that capable, we're not that concerned. But an AGI will have many ways to relatively rapidly improve itself and steadily become more capable.

The usual rebuttal at this point is "just unplug it". We'd expect an even decently smart machine to pretend to be friendly and aligned until it has some scheme that prevents us from unplugging it.

Your argument for instrumental rationality converging to being nice only applies when you're on a roughly even playing field, and you can't just win the game solo if you decide to.
