How to model uncertainty about preferences?

post by quetzal_rainbow · 2023-03-24T19:04:42.005Z · LW · GW · No comments

This is a question post.

I've recently started thinking about how a nascent "hot mess" superintelligence could reflect on its own values and converge on something consistent. The simplest way to approach this, it seems to me, is to model it as the superintelligence resolving uncertainty about its own preferences.

Suppose an agent knows that it is an expected utility maximizer and is uncertain between two utility functions, $U_{1}$ and $U_{2}$, with assigned probabilities $p_{1}$ and $p_{2}$. The agent must choose between two actions, $a_{1}$ and $a_{2}$. Let's say that the optimal decision for $U_{1}$ is $a_{1}$ and for $U_{2}$ is $a_{2}$. Suppose that maximizing the expected value of $p_{1} U_{1} + p_{2} U_{2}$ favors $a_{1}$, so the agent chooses $a_{1}$. However, choosing $a_{1}$ is also decisive evidence in favor of $U_{1}$ (an agent maximizing $U_{2}$ would have chosen $a_{2}$), and therefore the agent updates $p_{1}$ to 1. This representation of uncertain preferences looks unsatisfactory because it quickly and predictably converges to only one utility function.
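To make the failure mode concrete, here is a minimal sketch of the setup above. The payoff numbers, the prior of 0.6/0.4, and the function names are all made up for illustration; the only thing taken from the text is the structure (two utility functions, two actions, and an agent that treats its own choice as evidence about which utility function it has).

```python
# Hypothetical payoffs: rows are utility functions U_1, U_2; columns are actions a_1, a_2.
U = [
    [1.0, 0.0],  # U_1 prefers a_1
    [0.0, 1.0],  # U_2 prefers a_2
]
prior = [0.6, 0.4]  # assumed p_1, p_2


def best_action_for_mixture(probs, utilities):
    """Index of the action maximizing the mixture sum_i p_i * U_i(a)."""
    n_actions = len(utilities[0])
    expected = [
        sum(p * u[a] for p, u in zip(probs, utilities)) for a in range(n_actions)
    ]
    return max(range(n_actions), key=lambda a: expected[a])


def update_on_own_action(probs, utilities, action):
    """Naive Bayesian update that treats the agent's own action as evidence.

    Likelihood of the action under U_i is 1 if the action is optimal for U_i,
    else 0. This is the problematic step described in the text.
    """
    likelihood = [
        1.0 if max(range(len(u)), key=lambda a: u[a]) == action else 0.0
        for u in utilities
    ]
    unnormalized = [p * l for p, l in zip(probs, likelihood)]
    total = sum(unnormalized)
    return [x / total for x in unnormalized]


chosen = best_action_for_mixture(prior, U)          # the mixture favors a_1
posterior = update_on_own_action(prior, U, chosen)  # all mass moves to U_1
print(chosen, posterior)
```

One decision is enough to collapse the distribution, even though the agent learned nothing new about the world.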

Does anyone know of a good model for uncertain preferences that can meet these criteria after some additions?

No weird updates on predictable behavior of the agent.
Controllable updates through something similar to human feedback. In general, there should be a specific class of events/observations in the environment that provide evidence for hypotheses about preferences, and everything else should not.
Preservation of value: if an agent believes it is a paperclip-maximizer with a 60% probability and a human flourishing maximizer with a 40% probability, and it doesn't expect any information about its preferences in the future, it should not bet everything on "maximizing paperclips" and should save at least some chunk of the universe for human flourishing.
Convergence: in the best case scenario, resolution of uncertainty should lead to strong conclusions about preferences.
A "none of the above" hypothesis in the distribution: if an agent suspects that none of its hypotheses is compatible with the evidence, it should prioritize the hypothesis "your best decision is to shut down and send your operators an error log."
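As a rough illustration of the first, second, and last criteria, here is a toy sketch. Everything in it is an assumption made for the example: the `Observation` type, the hypothesis names, the likelihood table, and the shutdown threshold. It only shows one way to restrict updates to a designated class of observations and to let a catch-all hypothesis gain weight when feedback is contradictory; it is not a proposal for the full model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    kind: str    # only "feedback" observations count as evidence about preferences
    signal: str  # e.g. "approve_paperclips"


# Assumed likelihoods P(signal | hypothesis). The catch-all hypothesis predicts
# nothing in particular, so it wins when feedback is inconsistent.
LIKELIHOOD = {
    "paperclips": {"approve_paperclips": 0.8, "approve_flourishing": 0.2},
    "flourishing": {"approve_paperclips": 0.2, "approve_flourishing": 0.8},
    "none_of_the_above": {"approve_paperclips": 0.5, "approve_flourishing": 0.5},
}


def update(belief, obs):
    """Bayesian update on feedback only; all other observations are ignored."""
    if obs.kind != "feedback":
        return dict(belief)  # e.g. the agent's own actions carry no evidence
    unnormalized = {h: p * LIKELIHOOD[h][obs.signal] for h, p in belief.items()}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


def should_shut_down(belief, threshold=0.5):
    """Hypothetical rule: stop and report once the catch-all hypothesis dominates."""
    return belief["none_of_the_above"] > threshold


belief = {"paperclips": 0.55, "flourishing": 0.35, "none_of_the_above": 0.10}

# The agent's own behavior does not move the distribution.
after_action = update(belief, Observation("own_action", "chose_paperclips"))

# Consistent feedback shifts weight toward the matching hypothesis.
after_feedback = update(belief, Observation("feedback", "approve_flourishing"))

# Repeated contradictory feedback shifts weight to "none of the above".
confused = dict(belief)
for _ in range(8):
    confused = update(confused, Observation("feedback", "approve_paperclips"))
    confused = update(confused, Observation("feedback", "approve_flourishing"))
```

This sketch does nothing for preservation of value; that depends on how the agent acts given the distribution, not on how it updates.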

Nash bargaining (between different hypotheses about preferences) looks close to having the desired properties, but I am not sure; maybe something better has already been developed.
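Here is a small sketch of why Nash bargaining seems to help with preservation of value. The assumptions are mine, not part of any established proposal: there is one divisible resource, hypothesis 1 values its share $x$ linearly, hypothesis 2 values the remainder $1 - x$ linearly, the disagreement point gives both zero, and the credences $p_{1}$, $p_{2}$ are used as bargaining weights. The code compares maximizing the linear mixture $p_{1} x + p_{2} (1 - x)$ with maximizing the weighted Nash product $x^{p_{1}} (1 - x)^{p_{2}}$ by grid search.

```python
import math


def linear_mix_share(p1, p2, steps=1000):
    """Share x for hypothesis 1 that maximizes p1 * x + p2 * (1 - x)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: p1 * x + p2 * (1 - x))


def nash_share(p1, p2, steps=1000):
    """Share x that maximizes the weighted Nash product x**p1 * (1 - x)**p2.

    The log of the product is maximized instead; endpoints are skipped so
    that both logarithms are defined.
    """
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda x: p1 * math.log(x) + p2 * math.log(1 - x))


p1, p2 = 0.6, 0.4  # paperclips vs. human flourishing, as in the example above
print("linear mixture gives hypothesis 1 a share of", linear_mix_share(p1, p2))
print("Nash bargaining gives hypothesis 1 a share of", nash_share(p1, p2))
```

Under these assumptions the linear mixture hands everything to the more probable hypothesis, while the Nash product splits the resource in proportion to the credences. With nonlinear utilities or a different disagreement point the split would differ.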

Answers

answer by quetzal_rainbow · 2023-04-09T18:31:54.699Z · LW(p) · GW(p)

https://www.lesswrong.com/posts/BM24txi8FdGtdDvqq/quetzal_rainbow-s-shortform?commentId=E4KBdytWz56CzMTwC

I have collected some links on the topic!

answer by baturinsky · 2023-03-25T02:26:50.333Z · LW(p) · GW(p)

Correctly handling uncertainty in values, knowledge and predictions is necessary for reaching any complex goal or executing any complex plan. So the capability to do that is probably something an AI will have to acquire in order to be an AGI.
