Corrigibility or DWIM is an attractive primary goal for AGI


post by Seth Herd · 2023-11-25T19:37:39.698Z · LW · GW · 4 comments

While rereading the List of Lethalities (LoL), I found the argument against corrigibility compelling. It's really hard to specify a goal of "maximize X, except if someone tells you to shut down". I think the same argument applies to Christiano's proposal to achieve corrigibility through RL by rewarding correlates of corrigibility: if other things are rewarded more reliably, your AGI may not shut down when you need it to.

But those arguments don't apply if corrigibility in the broad sense is the primary goal. "Doing what this guy means by what he says" is a perfectly coherent goal, and a highly attractive one for a few reasons. Perhaps "corrigibility" shouldn't be used in this sense, and "do what I mean" (DWIM) is a better term, but the two are closely related: DWIM accomplishes corrigibility and has other advantages. I think it's fairly likely to be the first goal someone actually gives an AGI.

"Do what I mean" sidesteps the difficulty of outer alignment, which is another point in the LoL. One common plan, which seems sensible, is to keep humans in the loop and to have a Long Reflection to decide what we want. DWIM allows you to contemplate and change your mind as much as you like.

Of course, the problem here is: do what WHO means? We'd like an AGI that serves all of humanity, not just one guy or board of directors. And we'd like to not have power struggles.

But from the point of view of a team actually deciding what goal to give their shot at AGI, DWIM will be incredibly attractive for practical reasons. The outer alignment problem is hard. Specifying one person (or a few) to take instructions from is vastly simpler than deciding on and specifying a goal that captures all of human flourishing for all time. You don't want to trust an AGI to interpret that goal correctly. Interpreting DWIM is still fraught, but it is naturally self-correcting, and it becomes more useful as the AGI gets more capable. A smarter AGI will be better at understanding what you probably mean, and better at realizing when it's not sure what you mean so it can ask for clarification.

This doesn't at all address inner alignment. But when somebody thinks they have good-enough inner alignment to launch a goal-directed, sapient AGI, DWIM is likely to be the goal they'll choose. This could be good or bad, depending on how well they've implemented inner alignment, and what type of people they are.

4 comments

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-11-25T23:01:46.709Z · LW(p) · GW(p)

Agree, and I've had similar/related thoughts on how DWIM seems like a pretty natural target for LLM alignment: https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=65czxJGyBuhqhBRex and https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-11-26T00:08:23.450Z · LW(p) · GW(p)

Thanks! This seems pretty obvious, from this perspective, right? But there's a lot of concern that outer alignment being hard makes the alignment problem much harder. It seems like you can easily just punt on outer alignment, so I think it's very likely that's what people will do.

comment by RogerDearnaley (roger-d-1) · 2023-11-27T06:31:00.524Z · LW(p) · GW(p)

Have you looked into Value Learning? It's basically "figure out what we mean, then do it".

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-11-27T20:02:46.348Z · LW(p) · GW(p)

I hadn't seen value learning, thank you! I am familiar with Stuart Russell's inverse reinforcement learning, which I think is very similar, and closer to an implementable proposal. I am not enthusiastic about IRL. The proposal there is to infer a human's value function from their behavior, or from the behavior they reward in their agents. To me this seems like a very clumsy solution relative to asking the human what they want when it's unclear and the consequences are important. That is what I'm proposing as the obvious and simple approach that will likely be tried. It could be coupled with IRL.

My mental model here is not "figure out what we mean, then do it", but "infer what I mean based on your models of human language, then check with me if your estimate of consequences is past this threshold I set, or if you have conflicting models of what I might mean". You probably would want some cumulative learning of likely intentions, but you would not want to relax the criteria for checking before executing consequential plans by very much.
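
As a rough sketch of that check-before-acting rule, and nothing like an implementation proposal: everything here is an illustrative assumption, including the `Interpretation` structure, the 0 to 1 impact scale, the default thresholds, and the idea that the AGI can produce calibrated probabilities at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """One candidate reading of an instruction (hypothetical structure)."""
    meaning: str
    probability: float  # assumed credence that this is what the human meant
    impact: float       # assumed estimate of consequences, on a 0..1 scale


def decide(candidates, impact_threshold=0.5, ambiguity_margin=0.2):
    """Return ("ask", reason) or ("act", meaning) for a list of Interpretations."""
    if not candidates:
        return ("ask", "no interpretation found")
    ranked = sorted(candidates, key=lambda c: c.probability, reverse=True)
    best = ranked[0]
    # Conflicting models: the runner-up is nearly as plausible as the best reading.
    if len(ranked) > 1 and best.probability - ranked[1].probability < ambiguity_margin:
        return ("ask", "conflicting interpretations")
    # Estimated consequences past the human-set threshold: confirm before executing.
    if best.impact > impact_threshold:
        return ("ask", "high estimated impact")
    return ("act", best.meaning)
```

The point of the sketch is that `impact_threshold` is set by the human and stays low; cumulative learning of intentions would sharpen the probabilities without loosening that check much.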

IRL or other value learning alone puts the weight of understanding human ethics/value functions on the AI system. Even if it works, current human ethics/value functions might be an extraordinarily bad outer alignment target. It could be that maximizing our revealed preferences leads to all-against-all competition or war, or the elimination of humanity in favor of better fits to our inferred value function. We don't know what we want, so we don't know what we'd get from having an AGI figure out what we really want. See Moral Reality Check (a short story) and my comment on it. So I'd prefer we figure out what we want for ourselves, and I think that's going to be a very common motivation among humans. The "long contemplation" suggestion appears to be a common one among people thinking about outer alignment targets.
