IRL in General Environments

post by michaelcohen (cocoa) · 2019-07-10T18:08:06.308Z · score: 7 (6 votes) · LW · GW · 12 comments

Here is a proposal for Inverse Reinforcement Learning in General Environments. (2 1/2 pages; very little math).

Copying the introduction here:

The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw observational data would be converted into a record of human actions, alongside the space of actions available. For IRL to learn human goals, the AI has to consider general environments, and it has to have a way of identifying human actions. Lest these extensions appear trivial, I consider one of the simplest proposals, and discuss some difficulties that might arise.

12 comments

Comments sorted by top scores.

comment by rohinmshah · 2019-07-10T20:11:15.410Z · score: 6 (5 votes) · LW · GW
My main point is that IRL, as it is typically described, feels nearly complete: just throw in a more advanced RL algorithm as a subroutine and some narrow-AI-type add-on for identifying human actions from a video feed, and voila, we have a superhuman human helper.
[...]
But maybe we could be spending more effort trying to follow through to fully specified proposals which we can properly put through the gauntlet.

Regardless of whether it is intended or not, this sounds like a dig at CHAI's work. I do not think that IRL is "nearly complete". I expect that researchers who have been at CHAI for at least a year do not think that IRL is "nearly complete". I wrote a sequence partly for the purpose of telling everyone "No, really, we don't think that we just need to run IRL to get the one true utility function; we aren't even investigating that plan".

(Sorry, this shouldn't be directed just at you in particular. I'm annoyed at how often I have to argue against this perception, and this paper happened to prompt me to actually write something.)

Also, I don't agree that "see if an AIXI-like agent would be aligned" is the correct "gauntlet" to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

comment by Wei_Dai · 2019-07-11T01:42:34.174Z · score: 14 (7 votes) · LW · GW

Regardless of whether it is intended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly complete”. I expect that researchers who have been at CHAI for at least a year do not think that IRL is “nearly complete”. I wrote a sequence partly for the purpose of telling everyone “No, really, we don’t think that we just need to run IRL to get the one true utility function; we aren’t even investigating that plan”.

I think Stuart Russell still gives this impression in his (many) articles and interviews. I remember getting this impression listening to a recent interview, but will quote this Nov 2018 article instead since many of his interviews don't have transcripts:

Machines are beneficial to the extent that their actions can be expected to achieve our objectives [...]

It turns out, however, that it is possible to define a mathematical framework leading to machines that are provably beneficial in this sense. That is, we define a formal problem for machines to solve, and, if they solve it, they are guaranteed to be beneficial to us. In its simplest form, it goes like this:

  • The world contains a human and a machine.
  • The human has preferences about the future and acts (roughly) in accordance with them.
  • The machine’s objective is to optimise for those preferences.
  • The machine is explicitly uncertain as to what they are. [...]

There are two primary sources of difficulty that we are working on right now: satisfying the preferences of many humans and understanding the preferences of real humans. [...]

Machines will need to “invert” actual human behaviour to learn the underlying preferences that drive it.

Does this not sound like a plan of running (C)IRL to get the one true utility function?

comment by rohinmshah · 2019-07-11T03:45:04.289Z · score: 2 (1 votes) · LW · GW
Does this not sound like a plan of running (C)IRL to get the one true utility function?

I do not think that is actually his plan, but I agree it sounds like it. One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

But really my answer is, the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.) I suspect he has very different empirical beliefs, such that you could reasonably say that he's working on a "different problem", in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.

comment by michaelcohen (cocoa) · 2019-07-11T02:23:43.641Z · score: 9 (5 votes) · LW · GW

I'm sorry it sounded like a dig at CHAI's work, and you're right that "typically described" is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it's nearly complete; I don't think I've seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator's actions might jeopardize the whole thing.

I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human's actions instead of from a reward signal gets us any closer to safety, but I'm sorry to have misrepresented views. (And maybe it's worth mentioning that I'm fiddling with something that bears strong resemblance to Inverse Reward Design, so I'm definitely not that bearish on the whole idea).

comment by michaelcohen (cocoa) · 2019-07-11T04:52:47.372Z · score: 3 (2 votes) · LW · GW
Also, I don't agree that "see if an AIXI-like agent would be aligned" is the correct "gauntlet" to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

I'm going to do my best to describe my intuitions around this.

Proposition 1: An agent will be competent at achieving goals in our environment to the extent that its world-model converges to the truth. It doesn't have to converge all the way, but the KL-divergence from the true world-model to its world-model should reach the order of magnitude of the KL-divergence from the true world-model to a typical human world-model.

Proposition 2: The world-model resulting from Bayesian reasoning with a sufficiently large model class does converge to the truth, so from Proposition 1, any competent agent's world-model will converge as close to the Bayesian world-model as it does to the truth.

Proposition 3: If the version of an "idea" that uses Bayesian reasoning (on a model class including the truth) is unsafe, then the kind of agent we actually build that is "based on that idea" will either a) not be competent, or b) roughly approximate the Bayesian version, and by default, be unsafe as well (in the absence of some interesting reason why a small confusion about future events will lead to a large deprioritization of dangerous plans).

Letting F be a failure mode that arises when an idea is implemented in the framework of a Bayesian agent with a model class including the truth, I expect, in the absence of arguments otherwise, that the same failure mode will appear in any competent agent which also implements the idea in some way. However, it can be much harder to spot there, so I think one of the best ways to look for possible failure modes in the sort of AI we actually build is to analyze the idealized version, i.e. the agent it's approximating: a Bayesian agent with a model class including the truth. And then on the flip side, if the idea still seems to have real value when formalized in a Bayesian agent with a large model class, tractable approximations thereof seem (relatively) likely to work similarly well.
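A minimal sketch of the convergence claim in Propositions 1 and 2, under strong simplifying assumptions: the "environment" is an i.i.d. biased coin rather than a general environment, the model class is a small finite set that includes the truth, and the prior is uniform. The function names and the specific numbers are hypothetical and chosen only for illustration. The sketch measures the KL-divergence from the true model to the Bayesian mixture before and after observing data.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats, for p and q strictly between 0 and 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def posterior_predictive_kl(true_p, model_class, n_obs, seed=0):
    """KL from the true coin to the Bayesian mixture after n_obs coin flips.

    Assumes a uniform prior over model_class (a list of candidate biases).
    """
    rng = random.Random(seed)
    log_w = [0.0] * len(model_class)  # unnormalized log-posterior weights
    for _ in range(n_obs):
        x = 1 if rng.random() < true_p else 0
        for i, q in enumerate(model_class):
            log_w[i] += math.log(q if x == 1 else 1 - q)
    # Normalize in a numerically stable way.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    w = [v / z for v in w]
    # The mixture's probability of heads on the next flip.
    mix = sum(wi * q for wi, q in zip(w, model_class))
    return kl_bernoulli(true_p, mix)


model_class = [0.1, 0.3, 0.5, 0.7, 0.9]  # includes the true bias 0.7
kl_before = posterior_predictive_kl(0.7, model_class, n_obs=0)
kl_after = posterior_predictive_kl(0.7, model_class, n_obs=2000)
print(f"KL before data: {kl_before:.4f}, after 2000 flips: {kl_after:.2e}")
```

With no data the mixture predicts heads with probability 0.5, so the divergence is clearly positive; after many flips the posterior concentrates on the true model and the divergence shrinks toward zero. This says nothing about tractable approximations; it only illustrates the idealized Bayesian case the argument appeals to.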

Maybe you can point me toward the steps that seem the most opaque/fishy.

comment by rohinmshah · 2019-07-11T05:08:12.946Z · score: 4 (2 votes) · LW · GW

Sorry in advance for how unhelpful this is going to be. I think decomposing an agent into "goals", "world-model", and "planning" is the wrong way to be decomposing agents. I hope to write a post about this soon.

comment by michaelcohen (cocoa) · 2019-07-11T05:12:10.978Z · score: 1 (1 votes) · LW · GW

No, that's helpful. If it were the right way, do you think this reasoning would apply?

Edit: alternatively, if a proposal does decompose an agent into world-model/goals/planning (as IRL does), does the argument stand that we should try to analyze the behavior of a Bayesian agent with a large model class which implements the idea?

comment by rohinmshah · 2019-07-11T16:29:40.352Z · score: 2 (1 votes) · LW · GW

... Plausibly? Idk, it's very hard for me to talk about the validity of intuitions in an informal, intuitive model that I don't share. I don't see anything obviously wrong with it.

There's the usual issue that Bayesian reasoning doesn't properly account for embeddedness, but I don't think that would make much of a difference here.

comment by michaelcohen (cocoa) · 2019-07-11T04:15:36.808Z · score: 1 (1 votes) · LW · GW
IRL to get the one true utility function

I think I'm understanding you to be conceptualizing a dichotomy between "uncertainty over a utility function" vs. "looking for the one true utility function". I'm also getting this from your comment below:

One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

I can't figure out on my own a sense in which this dichotomy exists. To be uncertain about a utility function is to believe there is one correct one, while engaging in the process of updating probabilities about its identity.

Also, for what it's worth, in the case where there is an unidentifiability problem, as there is here, even in the limit, a Bayesian agent won't converge to certainty about a utility function.
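A toy sketch of this unidentifiability point, under assumptions that are not in the thread: a one-step decision problem with three actions, a Boltzmann-rational demonstrator (action probabilities proportional to exp(reward)), and a uniform prior over three hypothetical candidate reward functions. Two of the candidates differ only by an additive constant, so they induce exactly the same behaviour and no amount of data can separate them.

```python
import math
import random


def boltzmann_policy(rewards, beta=1.0):
    """Action probabilities proportional to exp(beta * reward)."""
    m = max(rewards)
    exps = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]


def posterior_over_rewards(candidates, true_rewards, n_obs, seed=0):
    """Posterior over candidate reward functions after watching n_obs actions.

    Assumes a uniform prior and a Boltzmann-rational demonstrator.
    """
    rng = random.Random(seed)
    true_pi = boltzmann_policy(true_rewards)
    policies = [boltzmann_policy(r) for r in candidates]
    log_w = [0.0] * len(candidates)
    for _ in range(n_obs):
        a = rng.choices(range(len(true_pi)), weights=true_pi)[0]
        for i, pi in enumerate(policies):
            log_w[i] += math.log(pi[a])
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [v / z for v in w]


r_true = [1.0, 0.0, 0.5]
r_shifted = [6.0, 5.0, 5.5]   # r_true plus a constant: same behaviour
r_other = [0.0, 1.0, 0.5]     # genuinely different preferences
posterior = posterior_over_rewards([r_true, r_shifted, r_other], r_true, n_obs=500)
print(posterior)
```

The posterior rules out `r_other` but stays split evenly between `r_true` and `r_shifted`, however many demonstrations are observed. The predicted policy is nonetheless pinned down, since both surviving candidates induce the same one.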

comment by rohinmshah · 2019-07-11T04:59:39.713Z · score: 2 (1 votes) · LW · GW
I think I'm understanding you to be conceptualizing a dichotomy between "uncertainty over a utility function" vs. "looking for the one true utility function".

Well, I don't personally endorse this. I was speculating on what might be relevant to Stuart's understanding of the problem.

I was trying to point towards the dichotomy between "acting while having uncertainty over a utility function" vs. "acting with a known, certain utility function" (see e.g. The Off-Switch Game). I do know about the problem of fully updated deference and I don't know what Stuart thinks about it.
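A rough numeric sketch of that dichotomy, in the spirit of the off-switch setting but heavily simplified. The assumptions here are mine: the robot holds a discrete belief over the utility U of a proposed action, switching off is worth 0, and a perfectly rational human permits the action exactly when U > 0. The function name and the beliefs are hypothetical.

```python
def off_switch_values(belief):
    """Expected value of each robot option.

    belief: list of (probability, utility) pairs for the proposed action.
    """
    act = sum(p * u for p, u in belief)              # act without asking
    switch_off = 0.0                                 # shut down unilaterally
    defer = sum(p * max(u, 0.0) for p, u in belief)  # human vetoes when U <= 0
    return {"act": act, "switch_off": switch_off, "defer": defer}


uncertain = off_switch_values([(0.5, 1.0), (0.5, -1.0)])
certain = off_switch_values([(1.0, 1.0)])
print(uncertain)  # deferring is strictly better than acting or switching off
print(certain)    # deferring gains nothing over acting
```

Under these assumptions, deferring is worth E[max(U, 0)], which is never less than max(E[U], 0). The robot's incentive to leave the human in control comes entirely from its uncertainty, and vanishes once it is certain about the sign of U.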

Also, for what it's worth, in the case where there is an unidentifiability problem, as there is here, even in the limit, a Bayesian agent won't converge to certainty about a utility function.

Agreed, but I'm not sure why that's relevant. Why do you need certainty about the utility function, if you have certainty about the policy?

comment by michaelcohen (cocoa) · 2019-07-11T05:21:21.018Z · score: 3 (2 votes) · LW · GW

Okay, maybe we don't disagree on anything. I was trying to make a different point with the unidentifiability problem, but it was tangential to begin with, so never mind.

comment by Charlie Steiner · 2019-07-12T04:06:24.205Z · score: 2 (1 votes) · LW · GW

A good starting point. I'm reminded of an old Kaj Sotala post (which then later provided inspiration for me writing a sort of similar post) about trying to ensure that the AI has human-like concepts. If the AI's concepts are inhuman, then it will generalize in an inhuman way, so that something like teaching a policy through demonstrations might not work.

But of course having human-like concepts is tricky and beyond the scope of vanilla IRL.