Applying reinforcement learning theory to reduce felt temporal distance

post by Kaj_Sotala · 2014-01-26T09:17:29.099Z · LW · GW · Legacy · 6 comments


(cross-posted from my blog)

It is a basic principle of reinforcement learning to distinguish between reward and value: the reward of a state is its immediate, intrinsic desirability, whereas the value of a state is the total (typically discounted) reward you can expect to accumulate from that state onward, through the states reachable from it.

For example, suppose that I’m playing a competitive game of chess, and in addition to winning I happen to like capturing my opponent’s pieces, even when it doesn’t contribute to winning. I assign a reward of 10 points to winning, -10 to losing, 0 to a stalemate, and 1 point to each piece that I capture in the game. Now my opponent offers me a chance to capture one of his pawns, an action that would give me one point worth of reward. But when I look at the situation more closely, I see that it’s a trap: if I did capture the piece, I would be forced into a set of moves that would inevitably result in my defeat. So the value, or long-term reward, of that state is actually something close to -9.
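The arithmetic behind the trap can be sketched in a few lines of Python. The reward numbers are the ones from the post; the helper function and its name are mine, and the game is short enough that no discounting is applied:

```python
# Reward scheme from the chess example above.
REWARDS = {
    "capture_pawn": 1,   # immediate reward for taking the offered pawn
    "win": 10,
    "loss": -10,
    "stalemate": 0,
}

def value_of_line(rewards_along_line):
    """Value of a state = total reward collected from there to the end
    of the game (no discounting, since the horizon is short)."""
    return sum(rewards_along_line)

# Taking the bait pays +1 now, but the forced continuation ends in a loss.
trap_value = value_of_line([REWARDS["capture_pawn"], REWARDS["loss"]])
print(trap_value)  # -9
```

The point of separating the two quantities is exactly this: the reward of the capture is +1, but its value is -9.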

Once I realize this, I also realize that making that move is almost exactly equivalent to agreeing to resign in exchange for my opponent letting me capture one of his pieces. My defeat won’t be instant, but by making that move, I would nonetheless be choosing to lose.

Now consider a dilemma that I might be faced with when coming home late some evening. I have no food at home, but I’m feeling exhausted and don’t want to bother with going to the store, and I’ve already eaten today anyway. But I also know that if I wake up with no food in the house, then I will quickly end up with low energy, which makes it harder to go to the store, which means my energy levels will drop further, and so on until I’ll finally get something to eat much later, after wasting a long time in an uncomfortable state.

Typically, temporal discounting means that I’m aware of this in the evening, but nonetheless skip the visit to the store. The penalty from not going feels remote, whereas the discomfort of going feels close, and that ends up dominating my decision-making. Besides, I can always hope that the next morning will be an exception, and I’ll actually get myself to go to the store right from the moment when I wake up!
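One way to see how a small near-term discomfort can dominate a larger distant one is to plug numbers into a discounting formula. Everything here is an assumption for illustration: the utilities, the hyperbolic form, and the rate `k` are made up, not taken from the post:

```python
def hyperbolic_discount(utility, delay_hours, k=0.5):
    """Hyperbolically discounted utility: U / (1 + k * delay)."""
    return utility / (1 + k * delay_hours)

cost_of_store_trip_now = -3   # discomfort of going out, felt immediately
cost_of_hungry_morning = -8   # larger penalty, but roughly 9 hours away

go = hyperbolic_discount(cost_of_store_trip_now, 0)    # -3.0
skip = hyperbolic_discount(cost_of_hungry_morning, 9)  # about -1.45

# Skipping "wins" tonight, even though it is worse in undiscounted terms.
print(go, skip, skip > go)
```

The felt penalty of the hungry morning shrinks by the delay, while the discomfort of going out is paid at full price now.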

And I haven’t tried this out for very long, but it feels like explicitly framing the different actions in terms of reward and value could be useful in reducing the impact of that experienced distance. I skip the visit to the store because being hungry in the morning is something that seems remote. But if I think that skipping the visit is exactly the same thing as choosing to be hungry in the morning, and that the value of skipping the visit is not the momentary relief of being home earlier but rather the inevitable consequence of the causal chain that it sets in motion – culminating in hours of hunger and low energy – then that feels a lot different.

And of course, I can propagate the consequences even further back in time: if I know that I won't have the energy to get food when I finally come home, then I should realize that I need to buy the food before setting out on that trip. Otherwise I'll again set in motion a causal chain whose end result is being hungry. So not going shopping before I leave becomes exactly the same thing as being hungry the next morning.

More examples of the same:

Related post: Applied cognitive science: learning from a faux pas.

6 comments


comment by David_Chapman · 2014-01-26T09:58:16.744Z · LW(p) · GW(p)

A puzzling question is why your brain doesn't get this right automatically. In particular, deciding whether to gather some food before sleeping is an issue mammals have faced in the EEA for millions of years.

Temporal difference learning seems so basic that brains ought to implement it reasonably accurately. Any idea why we might do the wrong thing in this case?

Replies from: Kaj_Sotala, ThisSpaceAvailable, Kawoomba
comment by Kaj_Sotala · 2014-01-26T10:57:09.684Z · LW(p) · GW(p)

Temporal difference learning seems so basic that brains ought to implement it reasonably accurately.

I'm guessing that it has to do with the kinds of "things" that are linked to a later consequence. For example, we seem to be pretty good at avoiding or frequenting the kinds of places where we tend to have negative or positive experiences. And we're also good at linking physical items or concrete actions to their consequences - like in Roko's example about the bills:

For example, suppose that you started off in life with a wandering mind and were punished a few times for failing to respond to official letters. Your TDL algorithm began to propagate the pain back to the moment you looked at an official letter or bill. As a result, you would be less effective than average at responding, so you got punished a few more times. Henceforth, when you received a bill, you got the pain before you even opened it, and it lay unpaid on the mantelpiece until a Big Bad Red late payment notice with a $25 fine arrived. More negative conditioning. Now even thinking about a bill, form or letter invokes the flinch response, and your lizard brain has fully cut you out.
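The dynamic Roko describes is essentially TD(0) value learning propagating a punishment backward along a chain of states. A minimal sketch, where the states, rewards, and learning parameters are all made up for illustration:

```python
ALPHA = 0.5   # learning rate
GAMMA = 0.9   # discount factor

# A simple chain: see_bill -> ignore_bill -> late_notice (punished), then done.
chain = ["see_bill", "ignore_bill", "late_notice"]
reward = {"see_bill": 0, "ignore_bill": 0, "late_notice": -25}

V = {s: 0.0 for s in chain}  # value estimates, initially neutral

for episode in range(50):
    for i, s in enumerate(chain):
        v_next = V[chain[i + 1]] if i + 1 < len(chain) else 0.0
        # TD(0) update: V(s) += alpha * (r + gamma * V(s') - V(s))
        V[s] += ALPHA * (reward[s] + GAMMA * v_next - V[s])

# After enough repetitions, even the earliest state is aversive:
# merely seeing the bill now predicts the pain at the end of the chain.
print(V["see_bill"])
```

After training, `V["see_bill"]` converges toward gamma squared times the late-notice punishment (about -20), even though looking at the bill is never punished directly.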

But "not going to the store results in hunger the next morning" seems like a more abstract thing. The fact that it's the lack of an action, rather than the presence of one, seems particularly relevant. Neither the store nor the act of going there is something that's directly associated with getting hungry. If anything, it's my earlier thought of possibly needing to go to the store... and I guess it's possible that to the extent that anything gets negatively reinforced, it's the act of me even considering it, since that's the only concrete action that my brain can directly link to the consequence!

Also, if I do go to the store, there isn't any clear reward that would reinforce my behavior. The reward is simply that I won't be hungry the next morning... but that's not something that would be very out of the ordinary, for not-being-hungry is just the normal state of being. And being in a neutral state doesn't produce a reward. I guess that if I enjoyed food more, getting to eat could be more of a reward in itself.

(I'm very sure that there exist mountains of literature on this very topic that could answer the question rather conclusively, but I don't have the energy to go do a lit search right now.)

comment by ThisSpaceAvailable · 2014-01-27T06:32:25.224Z · LW(p) · GW(p)

How is temporal difference learning basic? Do you think that if I give my dog a treat every morning if he obeyed my command to sit the previous day, that would teach him to sit? How would he connect those two events, out of all the events over the day?

comment by Kawoomba · 2014-01-26T10:36:38.846Z · LW(p) · GW(p)

A bird in the hand is worth two in the bush.

Until you've become comparatively good at predicting the future (entails good models, which entails cognitive effort, which necessitates a reasonably developed cognitive architecture), an immediate benefit will often outweigh some nebulous possible future reward (in OP's parlance, value).

comment by moridinamael · 2014-01-26T13:16:30.554Z · LW(p) · GW(p)

Related but not exactly the same thing: it becomes really, really easy to quit something once I suspect that thing is a migraine trigger, even if I had struggled with quitting it in the past. I suspect the logic goes something like, the aversiveness of a migraine at time_future so easily outweighs any possible temporary enjoyment that my brain doesn't even bother tempting me. I jokingly-wish that I could make certain things into migraine triggers.

comment by [deleted] · 2014-01-26T16:06:40.476Z · LW(p) · GW(p)

One name for this evaluation of birds in hands versus birds in bushes is chronemics...

http://en.m.wikipedia.org/wiki/Chronemics

Here is a possible example: a Wall Street Journal map of who in New York buys monthly public transit passes (more expensive up front) versus who buys day passes (more expensive over all)...

http://graphicsweb.wsj.com/documents/MTAFARES1108/#v=showCommuters