Can we learn much by studying the behaviour of RL policies?

post by AidanGoth · 2023-05-15T12:56:25.769Z · LW · GW · No comments

This is a question post.

Economists sometimes study revealed preferences: preferences that we can infer from choices. For example, if I choose an apple when offered a choice between an apple and an orange, I have revealed a preference for the apple over the orange. I'm wondering about the revealed preferences of RL policies (that is, applying behavioural/experimental economics to RL policies). We can elicit revealed preferences from RL policies by observing their actions following various histories, and then check whether those preferences satisfy various decision-theoretic axioms.
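To make the elicitation step concrete, here is a minimal sketch in Python, not an established method. It assumes a hypothetical interface in which a policy can be queried with a two-option menu and returns the option it picks; a real RL policy would need to be wrapped so that histories or states play the role of menus. The function names and both toy policies are made up for illustration. The sketch records pairwise choices and looks for preference cycles, which are violations of transitivity.

```python
from itertools import combinations
from typing import Callable, Hashable, List, Sequence, Set, Tuple

Option = Hashable
# Hypothetical interface: the policy sees a menu (a, b) and returns a or b.
Policy = Callable[[Tuple[Option, Option]], Option]


def elicit_pairwise(policy: Policy, options: Sequence[Option]) -> Set[Tuple[Option, Option]]:
    """Query every pair once; (x, y) in the result means x was chosen over y."""
    prefers = set()
    for a, b in combinations(options, 2):
        choice = policy((a, b))
        rejected = b if choice == a else a
        prefers.add((choice, rejected))
    return prefers


def preference_cycles(prefers: Set[Tuple[Option, Option]],
                      options: Sequence[Option]) -> List[Tuple[Option, Option, Option]]:
    """Return triples (a, b, c) where a beat b, b beat c, and c beat a."""
    cycles = []
    for x, y, z in combinations(options, 3):
        # Each unordered triple can be cyclic in two orientations.
        for a, b, c in ((x, y, z), (x, z, y)):
            if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
                cycles.append((a, b, c))
    return cycles


# Toy policy 1 (illustrative): picks the option with the higher fixed utility.
UTILITY = {"apple": 3.0, "orange": 2.0, "pear": 1.0}

def utility_policy(menu):
    return max(menu, key=UTILITY.get)


# Toy policy 2 (illustrative): rock-paper-scissors style cyclic choices.
WINNER = {
    frozenset({"apple", "orange"}): "apple",
    frozenset({"orange", "pear"}): "orange",
    frozenset({"apple", "pear"}): "pear",
}

def cyclic_policy(menu):
    return WINNER[frozenset(menu)]
```

Calling `preference_cycles(elicit_pairwise(p, options), options)` returns an empty list for a policy whose pairwise choices are transitive and a non-empty list otherwise. This only tests deterministic choices between pairs; a stochastic policy would need repeated queries and some statistical criterion, which the sketch does not attempt.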

Revealed preferences don’t tell us anything about the inner workings of an agent but they can tell us whether or not an agent is acting as if they’re following particular decision theories. We can ask questions such as:

These seem like pretty natural questions to ask, so I'm wondering what existing work there is on related questions and how promising this kind of work could be. Knowing whether a policy is consistent with the axioms of expected utility theory (EUT) seems helpful, but maybe not that helpful, for two reasons: consistency with EUT isn't sufficient for the system to actually be maximising expected utility internally with respect to some utility function, and inconsistency with EUT isn't sufficient for the system to be safe.
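As one illustration of what checking consistency with EUT could involve, the sketch below tests the independence axiom on a single pair of lotteries. Everything in it is an assumption for illustration: lotteries are dictionaries from numeric payoffs to probabilities, a policy is a hypothetical function `choose(x, y)` returning `0` or `1`, and the two toy policies are invented. Independence says that if `p` is chosen over `q`, then mixing both with the same third lottery `r` (with the same weight `alpha`) should not flip the choice.

```python
import math
from typing import Callable, Dict

Lottery = Dict[float, float]  # payoff -> probability
# Hypothetical interface: returns 0 if the first lottery is chosen, 1 if the second.
Chooser = Callable[[Lottery, Lottery], int]


def mix(p: Lottery, r: Lottery, alpha: float) -> Lottery:
    """Compound lottery: p with probability alpha, r with probability 1 - alpha."""
    outcomes = set(p) | set(r)
    return {o: alpha * p.get(o, 0.0) + (1 - alpha) * r.get(o, 0.0) for o in outcomes}


def violates_independence(choose: Chooser, p: Lottery, q: Lottery,
                          r: Lottery, alpha: float) -> bool:
    """True if the choice between p and q flips after mixing both with r."""
    base = choose(p, q)
    mixed = choose(mix(p, r, alpha), mix(q, r, alpha))
    return base != mixed


def expected_utility_policy(x: Lottery, y: Lottery) -> int:
    """Toy agent (illustrative): maximises expected sqrt(payoff)."""
    def eu(lottery: Lottery) -> float:
        return sum(prob * math.sqrt(payoff) for payoff, prob in lottery.items())
    return 0 if eu(x) >= eu(y) else 1


def certainty_loving_policy(x: Lottery, y: Lottery) -> int:
    """Toy agent (illustrative): takes a sure thing if exactly one is on offer,
    otherwise picks the higher expected payoff."""
    def is_certain(lottery: Lottery) -> bool:
        return max(lottery.values()) >= 1.0 - 1e-12
    def ev(lottery: Lottery) -> float:
        return sum(prob * payoff for payoff, prob in lottery.items())
    if is_certain(x) != is_certain(y):
        return 0 if is_certain(x) else 1
    return 0 if ev(x) >= ev(y) else 1
```

For example, with `p = {1.0: 1.0}`, `q = {5.0: 0.8, 0.0: 0.2}`, `r = {0.0: 1.0}` and `alpha = 0.25`, the certainty-loving toy agent takes the sure payoff in the base choice but switches sides once both options are diluted, so the check flags a violation; the expected-utility toy agent does not switch. A single passing check is of course only evidence of consistency on that one pair, and near-ties can produce spurious flips through floating-point error.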

I imagine this kind of work would be most interesting under distributional shifts and/or significant computational constraints relative to the complexity of the environment, where we might be able to learn about RL failure modes. Because it focuses only on observed behaviour, though, studying revealed preferences seems to me both much less useful and much easier than understanding ML systems through mechanistic interpretability.

I'm interested in what other people think about (i) the object-level questions above; (ii) existing work on these questions; (iii) how useful studying these (or similar) questions would be. I'm coming at this from an economics/maths/philosophy background and am less familiar with CS, so I might be missing important search terms and basic knowledge.
