post by [deleted] · GW · 0 comments

This is a link post for


Comments sorted by top scores.

comment by Charlie Steiner · 2023-08-12T07:31:27.693Z · LW(p) · GW(p)

If the AI wants (in the usual sense of the word) to achieve things for which bad aggression is instrumentally useful, and you (arguendo) completely prevent it from using bad aggression, you have barely slowed it down. It will still achieve the things it wants almost as fast, just using social manipulation or whatever instead.

The solution is, as ever, to make an AI that wants to achieve good things and not bad things.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-08-12T19:00:36.636Z · LW(p) · GW(p)

Designating some things as off-limits is a frame [LW · GW] for defining agent behavior that's different from frames that emphasize goals. The point is to sufficiently deconfuse this perspective so that we can train AIs that don't circumvent boundaries.

An optimization frame says that this always happens, that instrumentally valuable things always happen, unless the agent's values are very particular about avoiding them. But an agent doesn't have to be primarily an optimizer [LW · GW]; it could instead be primarily a boundary-preserver, and only incidentally an optimizer.
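
A minimal toy sketch of that distinction (mine, not from the thread; all names and values are hypothetical): in a pure optimization frame the boundary-crossing action wins whenever it scores highest, while a boundary-preserver filters out boundary-crossing actions before it optimizes at all.

```python
# Hypothetical toy illustration: optimizer frame vs. boundary-preserver frame.

def optimizer_choice(actions, utility):
    """Pure optimization frame: the highest-utility action wins, boundaries or not."""
    return max(actions, key=utility)

def boundary_preserver_choice(actions, utility, crosses_boundary):
    """Boundary-first frame: off-limits actions are never candidates;
    optimization only happens over what remains."""
    permitted = [a for a in actions if not crosses_boundary(a)]
    if not permitted:
        return None  # refuse rather than cross a boundary
    return max(permitted, key=utility)

if __name__ == "__main__":
    actions = ["persuade", "negotiate", "coerce"]
    utility = {"persuade": 0.6, "negotiate": 0.5, "coerce": 0.9}.get
    crosses_boundary = lambda a: a == "coerce"

    print(optimizer_choice(actions, utility))                        # coerce
    print(boundary_preserver_choice(actions, utility, crosses_boundary))  # persuade
```

In the second frame, "don't circumvent boundaries" is not one more term in the objective that a strong enough incentive can outweigh; it constrains which actions are candidates in the first place.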

comment by simon · 2023-08-11T19:24:09.522Z · LW(p) · GW(p)

David points a gun at Jess’s head and says “If you don’t give me your passwords, I will shoot you”. Even though Jess doesn’t have to give David her passwords, she does anyway.

If you are going to focus on aggression for AI safety, defining the kind of aggression that you are going to forbid as excluding this case is ... rather narrow.

I don't think enshrining victim blaming (if you were smart or strong-willed enough, you could have resisted...) is the way to go for actually non-disastrous AI, and where do you draw the line? If one person in the whole world can resist, is it OK?

I don't think you've made the case that the line between "resistible" and "irresistible" aggression is any clearer or less subjective than the line between "resistible" aggression and non-aggression, and frankly I think the opposite is the case. It seems to me that you have likely taken an approach to defining aggression that doesn't work well (e.g. perhaps something to do with bilateral relations only?) and are reaching for the "resistible/irresistible" distinction as something to try to salvage your non-working approach.

FWIW I think "aggression", as actually defined by humans, is highly dependent on social norms which define personal rights and boundaries, and that for an AI to have a useful definition of this, it's going to need a pretty good understanding of these aspects of human relations, just as other alignment schemes also need a good understanding of humans (so, this isn't a good shortcut...). If you did properly get an AI to sensibly define boundaries, though, I don't think whether someone could hypothetically resist or not is going to be a particularly useful additional concept.

Replies from: Chipmonk
comment by Chipmonk · 2023-08-11T20:18:20.285Z · LW(p) · GW(p)

Hm, the David-gun-Jess example does seem weird to me in retrospect. Typical NAP philosophy outlaws threatening aggression just as it outlaws aggression, but I haven't done that so far. Maybe it could be workable to exclude it from resistible aggression and include it in irresistible aggression or some other category.
Update: removed this example from the post.

Frankly, ideally, Jess is in a position where she can't be shot by anyone without her consent.