Least-problematic Resource for learning RL?
post by Dalcy (Darcy) · 2023-07-18T16:30:48.535Z · LW · GW · 3 comments
This is a question post.
Contents
  Answers
    5  Multicore
    2  technicalities
    1  Rajlaxmi
    1  Dalcy Bremin
    None
  3 comments
Well, Sutton & Barto is the standard choice, but [LW(p) · GW(p)]:
Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
It also has some issues, e.g. claiming that the Reward is the optimization target [LW · GW]. Other RL textbooks seem similarly problematic: very outdated, with awkward language and conceptual confusions.
OpenAI's Spinning Up in Deep RL seems better on the not-being-outdated front, but feels quite high-level, focusing mostly on practicality and implementation, while I'm also looking for a more theoretical treatment of RL.
I'm starting to think there probably isn't a single resource that checks all these boxes, so I'm considering a mix of (1) lightly reading textbooks for the classical RL theory and (2) covering modern survey articles to catch up on recent deep RL work.
Are there any resources for learning RL that don't have (any of) the problems I've mentioned above? I'd like to know if I'm missing any.
Answers
answer by Multicore
The Hugging Face deep RL course came out last year. It includes theory sections, algorithm implementation exercises, and sections on various RL libraries that are out there. I went through it as it came out, and I found it helpful. https://huggingface.co/learn/deep-rl-course/unit0/introduction
answer by technicalities
I like Hasselt and Meyn (extremely friendly, possibly too friendly for you).
answer by Dalcy Bremin
Answering my own question: review/survey articles like https://arxiv.org/abs/1811.12560 seem like a pretty good intro.
3 comments
Comments sorted by top scores.
comment by gwern · 2024-07-05T21:07:09.407Z · LW(p) · GW(p)
It also has some issues, e.g. claiming that the Reward is the optimization target.
That's not a problem, because reward is the optimization target. Sutton & Barto literally start with bandits! You want to argue that reward is not the optimization target with bandits? Because even TurnTrout doesn't try to argue that. 'Reward is not the optimization target' is inapplicable to most of Sutton & Barto (and if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which 'reward is not the optimization target', and why they are not applicable to most AI things right now or in the foreseeable future).
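To make that concrete, here is a bare-bones sketch of an epsilon-greedy bandit agent (illustrative code only, not taken from the book): the only quantity it estimates and maximizes is the observed reward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(size=10)   # a toy 10-armed bandit (assumed setup)
Q = np.zeros(10)                   # running estimate of each arm's reward
counts = np.zeros(10)
eps = 0.1

for t in range(10_000):
    # epsilon-greedy action selection over estimated reward
    arm = rng.integers(10) if rng.random() < eps else int(np.argmax(Q))
    reward = rng.normal(true_means[arm])          # noisy reward for the pulled arm
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]     # incremental sample-average update

print("best true arm:", int(np.argmax(true_means)), "| agent's greedy pick:", int(np.argmax(Q)))
```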
TurnTrout's review is correct when he says that it's probably still the best RL textbook out there.
I would disagree with his claim that SARSA is not worth learning: you should at least read about it, even if you don't implement it or do the Sutton & Barto exercises, so that you better understand how the different methods work and better appreciate the range of possible RL agents, how they think, and how to do things like train animals/children. For example, SARSA acts fundamentally differently from Q-learning in AI safety scenarios, such as whether it would try to manipulate human overseers to avoid being turned off. That's good to know now as you try to think about non-toy agents: is an LLM motivated to manipulate human overseers...?
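A rough tabular sketch of the mechanical difference between the two update rules (the table shape and the cliff-gridworld aside are illustrative assumptions, not tied to any particular library):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap on the action the agent will actually take next,
    so the behaviour policy's own exploration shows up in the learned values."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap on the greedy action, regardless of what the
    agent actually does next, so the values track the greedy policy."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# On a cliff-walking gridworld, SARSA tends to learn the safer path (its values
# account for its own epsilon-greedy slips), while Q-learning learns the risky
# optimal one. Hypothetical 16-state, 4-action table for illustration:
Q_table = np.zeros((16, 4))
sarsa_update(Q_table, s=0, a=1, r=-1.0, s_next=4, a_next=2)
q_learning_update(Q_table, s=0, a=1, r=-1.0, s_next=4)
```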
Replies from: Darcy
↑ comment by Dalcy (Darcy) · 2024-09-06T21:17:48.464Z · LW(p) · GW(p)
(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which 'reward is not the optimization target', and why they are not applicable to most AI things right now or in the foreseeable future
Can you explain this part a bit more?
My understanding of the situations in which 'reward is not the optimization target' is that they're the ones where the assumptions of the policy improvement theorem don't hold [LW · GW]. In particular, the theorem (that iterating the policy improvement step yields policies that are at least as good at every step, converging to an optimal, reward-maximizing policy) assumes that at each step we update the policy by greedy one-step lookahead, i.e. by argmaxing over actions via $\pi'(s) = \operatorname{argmax}_a q_\pi(s,a)$.
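Roughly, in Sutton & Barto's notation, the greedy step and the guarantee I have in mind are:

```latex
% Greedy one-step lookahead (policy improvement step)
\[
  \pi'(s) \;=\; \operatorname*{arg\,max}_a q_\pi(s,a)
          \;=\; \operatorname*{arg\,max}_a \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr].
\]
% Policy improvement theorem: if q_\pi(s, \pi'(s)) \ge v_\pi(s) for all s,
% then v_{\pi'}(s) \ge v_\pi(s) for all s; in a finite MDP, iterating
% evaluation + greedy improvement in this way converges to an optimal,
% reward-maximizing policy.
```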
And this basically doesn't hold in real life, because realistic RL agents aren't forced to explore all states (the classic example: I could explore the state of doing cocaine, and I'm sure my policy would drastically change in a way my reward circuit considers an improvement, but I don't have to do that). So my opinion that the circumstances under which 'reward is the optimization target' are very narrow remains unchanged, and I'm interested in why you believe otherwise.
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-07-03T20:25:01.484Z · LW(p) · GW(p)
@Vanessa Kosoy [LW · GW] knows more.