LessWrong 2.0 Reader
I have a concept that I expect to take off in reinforcement learning. I don't have time to test it right now, though hopefully I'll find time later. Until then, I want to put it out here, either as inspiration for others, as a "called it"/prediction, or as a way to hear critiques or about similar projects others might have made:
Reinforcement learning is currently trying to do stuff like learning to model the sum of future rewards, e.g. the expectation using V, A, and Q functions in many algorithms, or the entire probability distribution in algorithms like DreamerV3.
Mechanistically, the reason these methods work is that they stitch together experience from different trajectories. So e.g. if one trajectory goes A -> B -> C and earns a reward at the end, it learns that states A and B and C are valuable. If another trajectory goes D -> A -> E -> F and gets punished at the end, it learns that E and F are low-value but D and A are high-value because its experience from the first trajectory shows that it could've just gone D -> A -> B -> C instead.
But what if it learns of a path E -> B? Or a shortcut A -> C? Or a path F -> G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step by step, it seems like such shortcuts would be hard to learn well: the Bellman equation will still be approximately satisfied even while they go unexploited, for instance.
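The stitching behavior described above can be seen in a minimal tabular Q-learning sketch. All states, actions, and rewards here are hypothetical toy values chosen to mirror the two example trajectories:

```python
# Toy tabular Q-learning showing how Bellman backups "stitch" trajectories.
# States, actions, and rewards are hypothetical; f from the post is GAMMA here.
import collections

GAMMA = 0.9
ALPHA = 0.5

Q = collections.defaultdict(float)  # Q[(state, action)] -> estimated value

def td_update(s, a, r, s_next, actions_next):
    """One Bellman backup: move Q(s, a) toward r + GAMMA * max_a' Q(s', a')."""
    target = r + GAMMA * max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Trajectory 1: A -> B -> C, reward 1.0 at the end.
traj1 = [("A", "go_B", 0.0, "B"), ("B", "go_C", 1.0, "C")]
# Trajectory 2: D -> A -> E -> F, punished at the end.
traj2 = [("D", "go_A", 0.0, "A"), ("A", "go_E", 0.0, "E"), ("E", "go_F", -1.0, "F")]

actions = {"A": ["go_B", "go_E"], "B": ["go_C"], "C": [], "D": ["go_A"],
           "E": ["go_F"], "F": []}

for _ in range(50):  # replay both trajectories until values propagate
    for s, a, r, s2 in traj1 + traj2:
        td_update(s, a, r, s2, actions[s2])

# D inherits value from trajectory 1 via A, even though trajectory 2
# itself ended in punishment -- the two experiences get stitched together.
print(Q[("D", "go_A")] > 0)                  # True: D -> A is valuable
print(Q[("A", "go_E")] < Q[("A", "go_B")])   # True: the E branch is worse
```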
Ok, so that's the problem, but how could it be fixed? Speculation time:
You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.
More formally, let's say that instead of the Q function, we consider what I would call the Hope function: given a state-action pair (s, a), it gives you a distribution over the states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:
Hope(s, a) = r·s' + f·Hope(s', a')
Where s' is the resulting state that experience has shown comes after s when doing a (treated here as a one-hot distribution over states), r is the reward received on that transition, f is the discounting factor, and a' is the optimal action in s'.
Because the Hope function is multidimensional, the learning signal is much richer, and one might therefore expect its internal activations to be richer and more flexible in the face of new experience.
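A minimal tabular sketch of this idea, under the assumption that s' is represented as a one-hot vector over a small known state set (the names and the chain A -> B -> C are toy assumptions, with f written as GAMMA):

```python
# Sketch of the proposed "Hope" function as a tabular, vector-valued
# analogue of Q: Hope(s, a) is a reward-weighted distribution over states.
import numpy as np

STATES = ["A", "B", "C"]
IDX = {s: i for i, s in enumerate(STATES)}
GAMMA, ALPHA = 0.9, 0.5

Hope = {}  # (state, action) -> vector over STATES

def hope_update(s, a, r, s_next, a_next):
    """Bellman backup for Hope: Hope(s,a) <- r * onehot(s') + GAMMA * Hope(s',a')."""
    onehot = np.zeros(len(STATES))
    onehot[IDX[s_next]] = 1.0
    nxt = Hope.get((s_next, a_next), np.zeros(len(STATES)))
    target = r * onehot + GAMMA * nxt
    old = Hope.get((s, a), np.zeros(len(STATES)))
    Hope[(s, a)] = old + ALPHA * (target - old)

# Replay A -> B -> C, with reward 1.0 on the final transition into C.
for _ in range(50):
    hope_update("B", "go_C", 1.0, "C", None)
    hope_update("A", "go_B", 0.0, "B", "go_C")

# The scalar Q-value is recovered by summing the Hope vector, while the
# vector itself shows *where* the reward is expected to come from.
print(Hope[("A", "go_B")].sum())             # ~0.9, matching Q(A, go_B)
print(STATES[Hope[("A", "go_B")].argmax()])  # "C": reward comes from reaching C
```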
Here's another thing to notice: let's say that for the policy, we use the Hope function as a target to feed into a decision transformer. We now have a natural parameterization of the policy, based on which Hope it pursues.
In particular, we could define another function, maybe called the Result function, which in addition to s and a takes a target distribution w as a parameter, subject to the Bellman equation:
Result(s, a, w) = r·s' + f·Result(s', a', (w - r·s')/f)
Where a' is the action recommended by the decision transformer when asked to achieve (w - r·s')/f from state s'.
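The recursion above can be sketched over a short horizon. The `policy` function below is a stand-in for the decision transformer the post describes, and the deterministic toy chain A -> B -> C is an assumption for illustration (f is again GAMMA):

```python
# Sketch of the proposed "Result" function: like Hope, but conditioned on a
# target distribution w that is handed down the trajectory, with the reward
# just collected subtracted out and the remainder un-discounted.
import numpy as np

STATES = ["A", "B", "C"]
IDX = {s: i for i, s in enumerate(STATES)}
GAMMA = 0.9

def env_step(s, a):
    """Deterministic toy chain A -> B -> C, reward 1.0 on reaching C from B."""
    nxt = {"A": "B", "B": "C", "C": "C"}[s]
    onehot = np.zeros(len(STATES))
    onehot[IDX[nxt]] = 1.0
    reward = 1.0 if (s == "B" and nxt == "C") else 0.0
    return reward, nxt, onehot

def policy(s, w):
    """Stand-in for the decision transformer: only one action in this chain."""
    return "go"

def result(s, a, w, depth=3):
    """Recursive evaluation of Result(s, a, w) = r*s' + GAMMA * Result(...)."""
    if depth == 0:
        return np.zeros(len(STATES))
    r, s_next, onehot = env_step(s, a)
    w_next = (w - r * onehot) / GAMMA       # remaining target after this step
    a_next = policy(s_next, w_next)
    return r * onehot + GAMMA * result(s_next, a_next, w_next, depth - 1)

# Starting at A and asking for "discounted reward at C", the recursion
# reproduces the requested target, matching Hope(A, go_B).
w0 = np.array([0.0, 0.0, 0.9])
out = result("A", "go", w0)
print(out)  # ~ [0, 0, 0.9]
```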
This Result function ought to be invariant under many changes in policy, which should make it more stable to learn, boosting capabilities. Furthermore, it seems like a win for interpretability and alignment, since it gives greater insight into how the AI intends to earn rewards, and a better ability to control those rewards.
An obvious challenge with this proposal is that states are really latent variables and also too complex to learn distributions over. While this is true, probably it can be hacked by learning some clever embeddings or doing some clever approximations or something. Maybe something as simple as predicting a weighted average of the raw pixel observations that occur as rewards are obtained, though in practice I expect that to be too blurry.
Also, this mindset seems to pave the way for other approaches; e.g. you could maybe have a Halfway function that factors an ambitious hope into smaller ones, or something. Though it's a bit tricky, because one needs to distinguish correlation and causation.
teatieandhat on Politics are not serious by default
Interesting, and very well written. Because you have access to particularly funny examples, you show very well how much politics is an empty status game.
I should probably point out that five years ago, I was a high school student in France, felt more or less the way you do, and went on to study political science at college (I don’t even need to say which college I’m talking about, do I?). It is a deep truth that politics is very unserious for most people, and that is perhaps most true for first-year political science students (or, god forbid, the sort of people who teach them introductory political science classes). I studied political science precisely because I agreed with the sentiment you describe here, and expected something a little more serious.
I definitely did not get it. The average political science undergraduate is very much like your friends—not least because they’re actually the same people a year older—and, while many professors are great, some are scarcely better than their students.
You gave your funny sad stories; here's one of mine (carefully selected to be the most egregious I've seen, but 100% true): first-year sociology class, taught by a respected specialist in Jewish life in Soviet-era Poland. Me, really curious about why sociology doesn't engage in more dialogue with some apparently contradictory results in social psychology. I try my best to ask "how does sociology react to that kind of stuff, even though it's a completely different discipline and all?" in the least offensive way I can.
Teacher’s face suddenly turns dark blue, she jumps off her chair, yelling "THIS IS SCIENCE! THIS IS SCIENTIFIC SCIENCE!". It takes me a few seconds to gather that she’s not blaming psychology for being science. Her brain registered something which kinda sounded like an attack against her discipline, and she’s defending the science-ness of her job. And not, certainly, doing anything like answering my question. In fact, she’s running around the room ("science! Science!"), and has forgotten about me entirely. After five or ten minutes, she eventually goes back to her chair, visibly exhausted ("well… where was I? Ah yes…") and resumes the class.
But the reason I’m writing this comment is exactly because I don’t want you to start seeing the whole lot of them as a bunch of crazies (as I myself did…). It’s really true that everyone who doesn’t end up working in politics, and even most of those who do, when they’re young, treat it as a deeply unserious status game (but, given what LW has to say about politics, I’d be really surprised if it was worse in France than in the US, or basically anywhere else?). It is also true that wanting to work on politics and decision-making doesn’t come with a specific knowledge of rationality. So, yeah, most people who think about politics do so in a very irrational way, because politics is a status game (not to mention being the mind-killer [LW · GW]). But if you think that this is not a strong enough description and that the ones you know are really more crazy than that, I think the difference is because they’re high-schoolers :-) It does get a little bit better with age, but you might miss that if you brand them as crazies and forget to change your mind when most of them have grown enough to be a little less crazy :-)
neel-nanda-1 on Charlie Steiner's Shortform
You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you'd see if there's a crucial missing feature)
neel-nanda-1 on yanni's Shortform
What banner?
neil-warren on Politics are not serious by default
Concept creep is a bastard. >:(
donald-hobson on All About Concave and Convex Agents
Or the sides can't make that deal because one side or both wouldn't hold up their end of the bargain. Or they would, but they can't prove it. Once the coin lands, the losing side has no reason to follow it other than TDT. And TDT only works if the other side can reliably predict their actions.
donald-hobson on How to safely use an optimizer
If the oracle is deceptively withholding answers, give up on using it. I had taken the description to imply that the oracle wasn't doing that.
silentbob on Failures in Kindness
It just came to my mind that these are things I tend to think of under the heading "considerateness" rather than kindness.
Guess I'd agree. Maybe I was anchored a bit here by the existing term of computational kindness. :)
technicalities on Are extreme probabilities for P(doom) epistemically justifed?
As of two years ago, the evidence for this was sparse. Looked like parity overall, though the pool of "supers" has improved over the last decade as more people got sampled.
There are other reasons [LW · GW] to be down on XPT in particular.
silentbob on Failures in Kindness
Fair point. Maybe if I knew you personally I would take you to be the kind of person that doesn't need such careful communication, and hence I would not act in that way. But even besides that, one could make the point that your wondering about my communication style is still a better outcome than somebody else being put into an uncomfortable situation against their will.
I should also note I generally have less confidence in my proposed mitigation strategies than in the phenomena themselves.