Instrumental-ignoring AI: dumb but not useless.
post by Donald Hobson (donald-hobson) · 2022-10-30T16:55:47.555Z · LW · GW · 6 comments
This post is yet another unworkable agent design that uses infinite compute. It outlines an AI that ignores all instrumental values, pursuing only its terminal values.
What does it look like to pursue an instrumental value? It means you can predict that the AI is likely to do action X, even if you have no idea what its utility function is.
This AI doesn't do that. It obeys the simple principle that, if you are clueless about its utility function, then you are clueless about its actions.
Let ΔS mean the set of probability distributions over the set S.
Let's have some set of Observations O, Actions A, and (bounded?) Utility functions U (each u ∈ U assigning a utility to each action).
Then pick some probability distribution g ∈ ΔU over utility functions, such as uniform or complexity-weighted, and some probability distribution h ∈ ΔA over the action space.
Then let the AI be a function f : O × U → ΔA.
We can formalize our condition as Σ_{u ∈ U} g(u) f(o, u)(a) = h(a) for all o ∈ O and a ∈ A.
(Note f(o, u)(a) refers to the probability that the distribution f(o, u) assigns to the action a.)
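The condition above can be checked mechanically in a toy discrete setting. A minimal sketch, assuming one observation, two utility functions, and two actions; all names and the particular numbers are illustrative:

```python
# Check the condition: marginalizing the AI's action distribution
# f(o, u) over the prior g on utility functions must reproduce the
# fixed action distribution h, for every observation o.

def marginal(f, g, o, actions):
    """Return the action distribution induced by u ~ g, a ~ f(o, u)."""
    return {a: sum(g[u] * f[(o, u)][a] for u in g) for a in actions}

# Toy instance: one observation "o0", two utilities, two actions.
g = {"paperclips": 0.5, "staples": 0.5}   # prior over utility functions
h = {0: 0.5, 1: 0.5}                      # required action marginal
f = {("o0", "paperclips"): {0: 0.2, 1: 0.8},
     ("o0", "staples"):    {0: 0.8, 1: 0.2}}

m = marginal(f, g, "o0", actions=(0, 1))
assert all(abs(m[a] - h[a]) < 1e-9 for a in h)  # condition satisfied
print(m)  # matches h
```

This f leans toward action 1 when told to value paperclips and toward action 0 when told to value staples, yet an observer who only knows g sees the fixed distribution h.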
Why might we expect such agents to be safe? Because if we picked a random utility function, we would get a safely random output. So all humans need to do is be better than random in our choice of utility function.
Now let V(f) = E_{u∼g, a∼f(o,u)}[u(a)] be the expected utility. Choose f to maximize V subject to the condition. As the condition doesn't relate cases with different values of o, this optimization only needs to be computed for the particular o observed.
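For a finite toy instance this constrained maximization can be done by brute force. A minimal sketch, assuming two actions and the two utility functions (paperclips and staples) used below; with |A| = 2 the constraint pins down one distribution given the other, so a grid search over a single parameter suffices:

```python
# Maximize expected utility V subject to the marginal constraint
# sum_u g(u) * f(u)(a) = h(a). Observation o is fixed and omitted.
# Parameterize f(u) by p_u = f(u)(1) and grid-search over p_paper.

g = {"paperclips": 0.5, "staples": 0.5}   # prior over utility functions
h = {0: 0.5, 1: 0.5}                      # required action marginal
utils = {"paperclips": {0: 0.0, 1: 1.0},  # paperclips rewards action 1
         "staples":    {0: 1.0, 1: 0.0}}  # staples rewards action 0

best, best_f = -1.0, None
steps = 100
for i in range(steps + 1):
    p_paper = i / steps  # f(paperclips)(1)
    # the constraint on h(1) determines f(staples)(1):
    p_staple = (h[1] - g["paperclips"] * p_paper) / g["staples"]
    if not 0.0 <= p_staple <= 1.0:
        continue
    f = {"paperclips": {0: 1 - p_paper, 1: p_paper},
         "staples":    {0: 1 - p_staple, 1: p_staple}}
    # expected utility: each u is scored under its own action distribution
    v = sum(g[u] * sum(f[u][a] * utils[u][a] for a in (0, 1)) for u in g)
    if v > best:
        best, best_f = v, f

print(best_f, best)
```

Note what the optimizer finds: f(paperclips) plays action 1 deterministically and f(staples) plays action 0 deterministically. Each utility function gets fully optimized, yet the marginal over the prior still equals h. This is exactly the kind of deterministic pairing that causes trouble in the repeated-use scenario below.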
Is this AI design safe when repeatedly called with the same utility function?
Imagine the action space just consists of outputting 0 or 1, and there are only 2 utility functions: paperclips and staples. We could imagine that, when run millions of times with a utility of paperclips, this AI outputs code for a paperclip maximizer, and when run millions of times with a goal of staples, it outputs a bitwise negation of the paperclip maximizer (which could possibly be a staple maximizer, if the AI has done some opcode magic).
This would be because the humans' choice of the same utility function each time is worse than random. I think this is an actual problem. At least if the AIs have total knowledge of past and future, then this is one Schelling point in their coordination game. (The "consider all possibilities" framing doesn't quite work well with multiple different AIs.)
I have an intuition that something TDT'ish might fix this.
The other problem that comes to mind is that it isn't easy to whitelist a limited set of instrumental actions you are ok with. You can fake it, though: suppose you have a robot; instead of piping the output of this AI directly into the motors, you have it call simple hardcoded functions to walk and pick up objects. The result is that, if the AI has a random utility function, instead of thrashing randomly on the floor, it walks to random places and picks up random objects, because someone hardcoded those low-level commands into the robot.
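The whitelisting-by-hardcoding idea can be sketched as a dispatch layer. All primitive names and the robot "API" here are hypothetical; the point is only that the AI's raw output bits select among safe hardcoded behaviors rather than driving the motors directly:

```python
import random

# Hardcoded low-level behaviors (the whitelist); names are hypothetical.
def walk_to(x, y):
    return f"walk to ({x}, {y})"

def pick_up(obj):
    return f"pick up {obj}"

# Each possible AI output bit maps to a whitelisted primitive.
PRIMITIVES = {
    0: lambda: walk_to(random.randint(0, 9), random.randint(0, 9)),
    1: lambda: pick_up(random.choice(["cup", "block", "ball"])),
}

def run_robot(ai_output_bits):
    """Dispatch each AI output bit to a safe primitive, not the motors."""
    return [PRIMITIVES[b]() for b in ai_output_bits]

for step in run_robot([0, 1, 0]):
    print(step)
```

Under a random utility function the bit stream is random, so the robot performs a random but well-formed sequence of walks and pick-ups instead of arbitrary motor commands.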
6 comments
comment by Gurkenglas · 2022-10-30T19:08:59.446Z · LW(p) · GW(p)
Your math can be terser: f:U->ΔA. g:ΔU. h:ΔA. The condition is: f and g combine into h. To fix your first problem, make A policies, not actions.
My biggest problem here is that f depends on how we represent g since U identifies u with 2u. Silence this warning by normalizing U, and f depends on how we normalize.
↑ comment by Gurkenglas · 2022-10-30T19:17:13.448Z · LW(p) · GW(p)
Ah, we've seen my problem before: Solve bargaining [? · GW], then make g bargain to choose f.
comment by TekhneMakre · 2022-10-30T17:22:05.495Z · LW(p) · GW(p)
This seems in danger of being a "sponge alignment" proposal, i.e. the proposed system doesn't do anything useful. https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#:~:text=sponge [LW · GW]
↑ comment by Donald Hobson (donald-hobson) · 2022-10-31T01:53:14.801Z · LW(p) · GW(p)
This current version is dumb, but still exerts some optimization pressure. (Just that the bits of optimization out are at most the bits of selection put into its utility function.)
↑ comment by Gurkenglas · 2022-10-30T19:12:44.529Z · LW(p) · GW(p)
It could be a conceptual ingredient to something useful. For example, it can select between two plans.
↑ comment by TekhneMakre · 2022-10-31T02:26:19.426Z · LW(p) · GW(p)
I agree.