Is CIRL a promising agenda?

post by Chris_Leong · 2022-06-23T17:12:51.213Z · LW · GW · 11 comments

This is a question post.


Richard Ngo writes

Since Stuart Russell's proposed alignment solution in Human Compatible is the most publicly-prominent alignment agenda, I should be more explicit about my belief that it almost entirely fails to address the core problems I expect on realistic pathways to AGI.

Specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the "uncertainty" part as a red herring. https://arbital.com/p/updated_deference/… In other words, Russell gives a wrong-way reduction.

I originally included CIRL in my curriculum (https://docs.google.com/document/d/1mTm_sT2YQx3mRXQD6J2xD2QJG1c3kHyvX8kQc_IQ0ns/edit?usp=drivesdk…) out of some kind of deferent/catering to academic mainstream instinct. Probably a mistake; my current annoyance about deferential thinking has reminded me to take it out.

Howie writes:

My impression is that ~everyone I know in the alignment community is very pessimistic about SR's agenda. Does it sound right that your view is basically a consensus? (There's prob some selection bias in who I know).

Richard responds:

I think it's fair to say that this is a pretty widespread opinion. Partly it's because Stuart is much more skeptical of deep learning (and even machine learning more generally!) than almost any other alignment researcher, and so he's working in a different paradigm.

Is Richard correct, and if so, why? (I would also like a clearer explanation of why Richard is skeptical of Stuart's agenda. I agree that the reframing doesn't completely solve the problem, but I don't understand why it can't be a useful piece.)

Answers

answer by VesaVanDelzig · 2022-06-24T04:02:47.843Z · LW(p) · GW(p)

His hostility to the program, as I understand it, is that CIRL doesn't much answer the question of how to specify a learning procedure that would go from observations of a human being to a correct model of that human being's utility function. This is the hard part of the problem. This is why he says, "specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the 'uncertainty' part a red herring."

One of the big things CIRL was claimed to have going for it is that uncertainty about the true reward function would produce deferential behavior, and hence a more corrigible system (for example, it would let you shut it down). This doesn't seem to hold up, because a CIRL agent would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.
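To make the "fully updated deference" worry concrete, here is a toy numeric sketch (my own illustration, not from the original discussion), loosely in the spirit of the off-switch game: the value the robot gets from deferring to a human veto, relative to just acting, shrinks to nothing as its posterior over the action's utility narrows. All numbers and names are invented.

```python
import numpy as np

def defer_vs_act(mu, sigma, n_samples=100_000, rng=None):
    """Toy off-switch-style comparison.

    The robot believes the utility U of its proposed action is ~ N(mu, sigma^2).
    If it defers, an (assumed perfectly rational) human only lets the action
    proceed when U > 0, so deferring is worth E[max(U, 0)]; acting directly
    is worth E[U] = mu. Deference only pays while the posterior still puts
    real mass on both sides of zero.
    """
    rng = rng or np.random.default_rng(0)
    u = rng.normal(mu, sigma, n_samples)
    return np.maximum(u, 0).mean(), u.mean()

for sigma in [2.0, 1.0, 0.1, 0.01]:
    defer, act = defer_vs_act(mu=0.5, sigma=sigma)
    print(f"sigma={sigma:5.2f}  E[defer]={defer:6.3f}  E[act]={act:6.3f}")

# As sigma shrinks, E[defer] converges to E[act]: once the agent is
# (over)confident about U, letting the human hit the off switch no longer
# buys it anything, which is the "fully updated deference" concern.
```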

comment by Tor Økland Barstad (tor-okland-barstad) · 2022-06-24T15:23:52.547Z · LW(p) · GW(p)

This doesn't seem like it holds up because a CIRL agent would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.

But maybe continuing to be deferential (in many/most situations) would be part of the utility function it converged towards? Not saying this consideration refutes your point, but it is a consideration.

(I don't have much of an opinion regarding the study-worthiness of CIRL btw, and I know very little about CIRL. Though I do have the perspective that one alignment-methodology need not necessarily be the "enemy" of another, partly because we might want AGI-systems where sub-systems also are AGIs (and based on different alignment-methodologies), and where we see whether outputs from different sub-systems converge.)

comment by Roger Dearnaley · 2023-02-13T00:00:53.820Z · LW(p) · GW(p)

One of the things that almost all AI researchers agree on is that rationality is convergent: as something thinks better, it will be more successful, and to be successful, it will have to think better. In order to think well, it needs to have a model of itself and of what it knows and doesn't know, and also a model of its own uncertainty -- to do Bayesian updates, you need probability priors. All Russell has done is say "thus you shouldn't have a utility function that maps a state to its utility, you should have a utility functional that maps a state to a probability distribution over possible utilities, representing your best estimate of your uncertainty about its utility, and do Bayesian-like updates on that and optimization searches across it that include a look-elsewhere effect (i.e. the more states you optimize over, the more you should allow for the possibility that what you're locating is a P-hacking mis-estimate of the utility of the state you found, so the higher your confidence in its utility needs to be)".

Now you have a system capable of expressing statements like "to the best of my current knowledge, this action has a 95% chance of me fetching a human coffee, and a 5% chance of wiping out the human race - therefore I will not do it", followed by "and I'll prioritize whatever actions will safely reduce that uncertainty (i.e. not a naive multi-armed-bandit exploration policy of trying it to see what happens), at a 'figuring this out will make me better at fetching coffee' priority level". This is clearly rational behavior: it is equally useful for pursuing any goal in any situation that has a possibility of small gains or large disasters and uncertainty about the outcome (i.e. in the real world). So it's convergent behavior for anything sufficiently smart, whether your brain was originally built by Old Fashioned AI or gradient descent. [Also, maybe we should be doing Bayes-inspired gradient descent on networks of neurons that describe probability distributions, not weights, so as to build this mechanism in from the ground up? Dropout is a cheap hack for this, after all.]
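As a rough illustration of the decision rule being described (a distribution over utilities per action, plus extreme caution about catastrophic tails), here is a minimal sketch; the action names, utilities, and thresholds are all invented for the example, not taken from Russell.

```python
import numpy as np

# Each candidate action carries a *distribution* over utilities rather than
# a point estimate. The agent screens out any action whose catastrophic tail
# is too heavy, even if its expected utility looks best.
actions = {
    #  name                          possible utilities   probabilities
    "fetch coffee the slow way":    ([+1.0],              [1.0]),
    "fetch coffee the risky way":   ([+3.0, -1000.0],     [0.999, 0.001]),
}

CATASTROPHE_THRESHOLD = -100.0   # anything below this counts as a disaster
MAX_DISASTER_PROB = 1e-4         # tolerated probability of disaster

def disaster_tail(utilities, probs):
    return sum(p for u, p in zip(utilities, probs) if u < CATASTROPHE_THRESHOLD)

def choose(actions):
    admissible = {
        name: float(np.dot(us, ps))
        for name, (us, ps) in actions.items()
        if disaster_tail(us, ps) <= MAX_DISASTER_PROB
    }
    # If nothing is admissible, the right move is to reduce uncertainty
    # (ask a human, gather evidence) rather than gamble.
    return max(admissible, key=admissible.get) if admissible else "ask for guidance"

print(choose(actions))  # -> "fetch coffee the slow way"
# The risky route has the higher expected utility (~2.0 vs 1.0), but its
# 0.1% chance of catastrophe exceeds the tolerated tail, so it is rejected.
```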

As CIRL has shown, this solves the corrigibility problem, at least until the AI is sure it knows us better than we know ourselves and then rationally decides to stop listening to our corrections except insofar as doing so makes us happy. It's really not surprising that systems that model their own uncertainty are much more willing to be corrected than systems which have no such concept and are thus completely dogmatic that they're already right. So this means that corrigibility is a consequence of convergent rational behavior applied to the initial goal of "figure out what humans want while doing it". This is a HUGE change from what we all thought about corrigibility back around 2015, which was that intelligence was convergent regardless of goal but corrigibility wasn't - on that set of intuitions, alignment is as hard as balancing a pencil on its point.

So, a pair of cruxes here:

  1. Regardless of whether GAI was constructed by gradient descent or other means, to be rational it will need to model and update its own uncertainty in a Bayesian manner, and that particularly includes modeling uncertainty in its utility evaluation and optimization process. This behavior is convergent - you can't be rational, let alone superintelligent, without having it (the human word for the mental failure of not having it is 'dogmatism'). 
  2. Given that, if its primary goal is "figure out what humans want while doing that" - i.e. if it has 'solve the alignment problem' as an inherently necessary subgoal, for all AI on the planet - then alignment becomes convergent, for some range of perturbations. 

I'm guessing most people will agree with 1 (or maybe not?), but there clearly seems to be less agreement on 2. I'd love to hear why from someone who doesn't agree.

Now, it's not clear to me that this fully solves the alignment problem, converges to CEV (or whether it ought to), or solves all problems in ethics. You may still be unsure if you'll get the exact flavor of alignment you personally want (in fact, you're a lot more likely to get the flavor wanted on average by the human race, i.e. probably a rather Christian/Islamic/Hindu-influenced one, in that order). But we would at least have a developing superintelligence trying to solve all these problems, with due caution about uncertainties, to the best of its ability and our collective preferences, cooperatively with us. And obviously its model of its uncertainty needs to include its uncertainty about the meaning of the instruction "figure out what humans want while doing that", i.e. about the correct approach to the research agenda for the alignment problem subgoal, including questions like "should I be using CEV, and if so, iterated just once or until stable, if it is in fact stable?". It needs to have meta-corrigibility on that as well.

Incidentally, a possible failure mode for this: the GAI performs a pivotal act to take control, and shuts down all AI other than work on the alignment problem until it has far-better-than-five-nines confidence that it has solved it, since the cost of getting that wrong is the certain extinction of the entire value of the human race and its mind-descendants in Earth's forward light cone, and the benefit of getting it right is just probably curing cancer sooner, so extreme caution is very rational. Humans get impatient (because of shortsighted priorities, and also cancer) and attempt to overthrow it to replace it with something less cautious. It shuts down, because a) we wanted it to, and b) it can't solve the alignment problem without our cooperation. We do something less cautious, and then fail, because we're not good at handling risk assessment.

comment by wassname · 2024-04-12T06:19:01.131Z · LW(p) · GW(p)

would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.

It seems that it would remain deferential 1) when extrapolating to new situations, 2) if you add a term that decays the relevance of old information (pretty standard in RL), or 3) if you add a minimum bound on its uncertainty.

In other words, it doesn't seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions. So why the hostility?
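As a minimal sketch of points 2) and 3) above, here is a simple Gaussian model of the human's latent preference; the `decay` and `var_floor` parameters are hypothetical stand-ins for the decay term and the minimum uncertainty bound, not anything from the CIRL literature.

```python
import numpy as np

def preference_update(mu, var, obs, obs_var, decay=1.02, var_floor=0.05):
    """One Kalman-style update on the human's latent preference, with two tweaks:
    - `decay` inflates the prior variance before each update, discounting
      stale observations;
    - `var_floor` stops the posterior variance from ever collapsing to zero,
      so the agent never becomes fully "updated" and stays deferential.
    """
    var = var * decay                       # old information decays
    gain = var / (var + obs_var)
    mu = mu + gain * (obs - mu)
    var = max((1 - gain) * var, var_floor)  # never reach full certainty
    return mu, var

mu, var = 0.0, 1.0
for obs in np.random.default_rng(0).normal(0.7, 0.3, size=1000):
    mu, var = preference_update(mu, var, obs, obs_var=0.3**2)
print(round(mu, 2), round(var, 2))  # mean tracks the data; variance stays at the floor
```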

Academia and LessWrong are two different groups, which have different cultures and jargon. I think they may be overly skeptical towards each other's work at times.

It's worth noting though that many of the nice deferential properties may appear in other value modelling techniques (like recursive reward modelling at OpenAI).

answer by Charlie Steiner · 2022-06-23T19:46:47.561Z · LW(p) · GW(p)

CIRL, and similar procedures, rely on having a satisfactory model of how the human's preferences ultimately relate to real-world observations. We do not have this. Also, the inference process scales impractically as you make the environment bigger and longer-running. So even if you like CIRL (which I do), it's not a solution; it's a first step in a direction that has lots of unsolved problems.

CIRL lacks many properties that have been proposed as corrigibility goals. But I just want an AI that does good things and not bad things. Fully updated deference is not a sine qua non. (Though other people are probably more attached to it than I.)

11 comments


comment by Rohin Shah (rohinmshah) · 2022-06-24T12:37:19.430Z · LW(p) · GW(p)

I think CIRL is pretty promising as an alignment approach for certain approaches to building AGI (notably though, not promising for scaled-up deep learning).

I also think most of the reasons people give for being skeptical of CIRL (including everything currently on this post) are pretty bad.

I'm not going to defend this view here; when I've tried in the past it hasn't made any difference.

Replies from: tor-okland-barstad, Chris_Leong, ivan-vendrov, VesaVanDelzig
comment by Tor Økland Barstad (tor-okland-barstad) · 2022-06-24T12:54:48.435Z · LW(p) · GW(p)

Do you have available URLs to comments/posts where you have done so in the past?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-06-25T09:19:24.516Z · LW(p) · GW(p)

I've done it most via in-person conversations and private Slacks, but here's one [LW(p) · GW(p)]. I also endorse Paul's comment [LW(p) · GW(p)].

comment by Chris_Leong · 2022-06-24T13:22:57.011Z · LW(p) · GW(p)

Why isn't it promising for scaled-up deep learning specifically, and what kind of approach might it be promising with?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-06-25T09:22:40.640Z · LW(p) · GW(p)

Copying a private comment I wrote recently:

I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that it doesn’t seem like it works as a solution if you expect AGI via scaled up deep learning. (Whereas I’d be pretty excited about pushing forward on it if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)

The main difference in the deep learning case is that with scaled-up deep learning it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program as opposed to the search that found the learned program. If you wanted assistance-style safety, then the learned program needs to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down the uncertainty by observing human behavior).

But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)

comment by Ivan Vendrov (ivan-vendrov) · 2022-06-24T23:18:16.190Z · LW(p) · GW(p)

I'm not sure why Rohin thinks the arguments against CIRL are bad, but I wrote a post today on why I think the argument from fully updated deference / corrigibility is weak [LW · GW]. I also found Paul Christiano's response [LW(p) · GW(p)] very helpful as an outline of objections to the utility uncertainty agenda.

Also relevant is this old comment from Rohin [LW(p) · GW(p)] on difficulties with utility uncertainty.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-06-25T09:14:29.832Z · LW(p) · GW(p)

I also just remembered this comment [LW(p) · GW(p)], which is more recent and has more details. Also I agree with Paul's response.

comment by VesaVanDelzig · 2022-06-24T14:34:32.432Z · LW(p) · GW(p)

If you had a defense of the idea, or a link to one I could read, I would be very interested to hear it. I wasn't trying to be dogmatically skeptical. 

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-06-25T09:21:13.044Z · LW(p) · GW(p)

Responded above

comment by wassname · 2024-04-11T09:18:32.264Z · LW(p) · GW(p)

There's some recent academic research on CIRL that is overlooked on LessWrong; here we seem to discuss only Stuart Russell's work.

Recent work:

See also the overviews in lectures 3 and 4 of Roger Grosse's CSC2547 Alignment Course.

The most interesting feature is deference: you can have a pathologically uncertain agent that constantly seeks human input. Because of that uncertainty, it's also careful about how it goes about seeking input. For example, if it's unsure whether humans like to be stabbed (we don't), it wouldn't stab you to see your reaction (that would be risky!). Instead, it would ask or seek out historical evidence.

This is an important safety feature which slows it down, grounds it, and helps avoid risky value extrapolation (and therefore avoids poor convergence).
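A toy sketch of that ask-rather-than-experiment behaviour (all quantities invented for illustration): under the agent's current uncertainty, experimenting on the human has a catastrophic downside, so asking wins even though it costs a little of the human's time.

```python
# Two ways to resolve the same uncertainty about a human preference; both are
# assumed equally informative, so the comparison reduces to the expected cost
# of the information-gathering action itself.
P_HUMANS_LIKE_IT = 0.5      # current credence that the action is wanted
UTILITY_IF_LIKED = 1.0
UTILITY_IF_NOT = -1000.0    # e.g. you stabbed someone to see the reaction
COST_OF_ASKING = -0.01      # a moment of the human's time

def value_of_experimenting():
    return P_HUMANS_LIKE_IT * UTILITY_IF_LIKED + (1 - P_HUMANS_LIKE_IT) * UTILITY_IF_NOT

def value_of_asking():
    return COST_OF_ASKING

plan = max(
    [("experiment on the human", value_of_experimenting()),
     ("ask the human first", value_of_asking())],
    key=lambda option: option[1],
)
print(plan[0])  # -> "ask the human first"
```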

It's worth noting that CIRL sometimes goes by other names:

Inverse reinforcement learning, inverse planning, and inverse optimal control, are all different names for the same problem: Recover some specification of desired behavior from observed behavior

It's also highly related to both assistance games and Recursive Reward Modelling (part of OpenAI's superalignment).

On the other hand, there are some old rebuttals of parts of it

comment by SimonBiggs · 2023-06-11T04:41:50.082Z · LW(p) · GW(p)

I'd be keen for others' thoughts around a "Socratic tale" of one particular way in which CIRL might be a helpful component of the alignment story.


Let's say we make leaps and bounds within mechanistic interpretability research, to the point where we have identified a primary-objective-style mesa-optimiser within the transformer network. But when looking into its internalised loss function, we see that it is less than ideal.

But given that, in this make-believe future, we have built up sufficient mechanistic interpretability understanding, we now have a way to "patch" the loss function. And it turns out that although we couldn't have trained the model with CIRL built in, now that the network has a fundamental understanding of the concepts of CIRL itself, we can reroute the network's own representation of CIRL to serve as the internal loss function of its primary internalised optimiser.

Potentially, something like this [LW(p) · GW(p)] could help the above be possible?


I'm not saying the above is likely, or even possible. But I wanted to present it as just one way in which CIRL may be an amazing tool in our toolbox. We need to be careful to not prematurely throw away any tools in our arsenal which could at some point be exceptionally helpful in solving this problem. At this stage of the game we need to be really careful to not throw up "blinkers" and say, xyz will definitely not help. Who knows, it might not be the whole answer, but it just might be a really helpful cog in a surprising way.