why won't this alignment plan work?

post by KvmanThinking (avery-liu) · 2024-10-10T15:44:59.450Z · LW · GW · 2 comments

This is a question post.

the idea:

Answers

answer by johnswentworth · 2024-10-10T17:43:28.990Z · LW(p) · GW(p)

The relatively easy problems:

  • The humans' utility estimates will be wrong. And not "random noise" kind of wrong, but systematically and predictably wrong.
  • Applying lots of optimization pressure to the humans' estimates will predictably Goodhart the wrongness of the estimates (see the toy sketch after this list).
  • ... also actions alone are not "good" or "bad", tons and tons of context is relevant.
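
To make the Goodhart point concrete, here is a minimal toy sketch (illustrative only, not part of the original answer; all numbers are made up): the human estimates equal the true utility plus a systematic bias, and applying heavy selection pressure to the estimates mostly selects for the bias.

```python
# Toy sketch: hard optimization against a systematically biased utility
# estimate mostly selects for the bias, not for true utility.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 100_000

true_utility = rng.normal(0, 1, n_actions)
# Systematic, predictable error: actions that merely *look* impressive
# to human raters get overrated.
looks_impressive = rng.normal(0, 1, n_actions)
estimated_utility = true_utility + 2.0 * looks_impressive

top_by_estimate = np.argsort(estimated_utility)[-100:]  # heavy optimization pressure
print("mean true utility of the top 100 by estimate:",
      true_utility[top_by_estimate].mean())
print("best true utility actually available:",
      true_utility.max())
```

The actions that win the selection look spectacular by the estimate but are only modestly good in fact; the harder the selection, the larger the share of the winning score that is estimation error.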

The hard problem:

  • What exactly is the "list of actions"?
    • Natural language description of actions? Then what is going to make the humans' interpretation of those natural-language symbols accurately represent the things the AI actually does?
    • Examples of actions taken by an AI in a simulation? What is going to make anything learned from those examples generalize well to the physical world during deployment?

answer by Milan W · 2024-10-10T22:31:24.829Z · LW(p) · GW(p)

The set of all possible sequences of actions is really really really big. Even if you have an AI that is really good at assigning the correct utilities[1] to any sequence of actions we test it with, its "near infinite sized"[2] learned model of our preferences is bound to come apart at the tails [LW · GW], or even in some weird region we forgot to check up on.
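
A minimal illustration of the coming-apart-at-the-tails worry (mine, not part of the answer): a fitted model can match its target essentially perfectly everywhere we checked, and still be off by orders of magnitude in a region we never tested.

```python
# Toy sketch: a fitted model agrees with the target on the tested region
# but diverges wildly outside it.
import numpy as np

rng = np.random.default_rng(0)
target = np.sin                                  # stand-in for "our preferences"

x_tested = rng.uniform(-1, 1, 50)                # the region we actually checked
model = np.polynomial.Polynomial.fit(x_tested, target(x_tested), deg=9)

print("max error on the tested region:",
      np.max(np.abs(model(x_tested) - target(x_tested))))
x_weird = 10.0                                   # a weird region we forgot to check up on
print("error at x = 10:",
      abs(model(x_weird) - target(x_weird)))
```

Agreement in-distribution says almost nothing about the weird regions, and a strong optimizer is exactly the thing that will steer into them.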

  1. ^

    Good luck getting the ethicists to come to a consensus on this.

  2. ^

    Von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk".

answer by Anon User · 2024-10-10T17:20:09.195Z · LW(p) · GW(p)

  • We do not know how to create an AI that would not regularly hallucinate. The Values AI hallucinating would be a bad thing.
  • In fact, training an AI to follow human values more closely seems to just cause it to say what humans want to hear, while being objectively incorrect more often.
  • We do not know how to create an AI that reliably follows the programmed values outside of its training set. Your 2nd AI going off the rails outside of its training set would be bad.
comment by Anon User (anon-user) · 2024-10-10T17:27:21.588Z · LW(p) · GW(p)
  • Also, human values, at least the ones we know how to consciously formulate, are pretty fragile - they are things that we want weak/soft optimization for, but that would actually be very bad if a superhuman AI hard-optimized them. We do not know how to capture human values in a way that would not go terribly wrong when the optimization is cranked to the max (see the toy sketch below), and your Values AI is likely not to help enough, as we would not know what missing inputs we are failing to provide it (because they are aspects of our values that would only become important in some future circumstances we cannot even imagine today).
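
A toy sketch of the weak-vs-hard optimization point (illustrative only, with made-up numbers, not part of the comment): suppose the values we managed to state capture only one of two things we actually care about, and the two compete for the same resources. A moderate push toward the stated values does fine; optimization cranked to the max sacrifices the unstated part entirely.

```python
# Toy sketch: the stated values omit something we also care about.
# Hard-optimizing the stated values drives the omitted part to zero.
import numpy as np

effort = np.linspace(0, 1, 101)      # fraction of resources aimed at the stated values
stated = np.sqrt(effort)             # value we managed to write down (diminishing returns)
unstated = np.sqrt(1 - effort)       # value we care about but failed to formulate
true_value = stated + unstated

soft, hard = 50, 100                 # weak/soft optimization vs. optimization cranked to the max
print("soft: stated %.2f, unstated %.2f, true %.2f" % (stated[soft], unstated[soft], true_value[soft]))
print("hard: stated %.2f, unstated %.2f, true %.2f" % (stated[hard], unstated[hard], true_value[hard]))
```

Hard optimization scores higher on the values we wrote down and lower on what we actually want; the entire gap comes from the part we failed to state.
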
comment by Anon User (anon-user) · 2024-10-10T17:30:26.072Z · LW(p) · GW(p)
  • Finally, we wouldn't get a second try - any bugs in your AIs, particularly the 2nd one, are very likely to be fatal. We do not know how to create your 2nd AI in such a way that the very first time we turn it on, all the bugs have already been found and fixed.

2 comments

comment by RHollerith (rhollerith_dot_com) · 2024-10-11T04:33:35.189Z · LW(p) · GW(p)

plug that utility function (the one the first AI wrote) into it

Could some team make a good AGI or ASI that someone could plug a utility function into? It would be very different from all the models being developed by the leading labs. I'm not confident that humanity could do it in the time we have left.

comment by Mitchell_Porter · 2024-10-10T18:14:55.851Z · LW(p) · GW(p)

each one is annotated with how much utility we estimate it to have

How are these estimates obtained?