When is reward ever the optimization target?
post by Noosphere89 (sharmake-farah) · 2024-10-15T15:09:20.912Z · LW · GW
This is a question post.
Alright, I have a question stemming from TurnTrout's post Reward is not the optimization target, where he argues that the premises required to reach the conclusion that reward is the optimization target are so narrowly applicable that they won't hold for future RL AIs as they gain more and more power:
https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#When_is_reward_the_optimization_target_of_the_agent_ [LW · GW]
But @gwern [LW · GW] argued with TurnTrout that reward is in fact the optimization target for a broad range of RL algorithms:
https://www.lesswrong.com/posts/ttmmKDTkzuum3fftG/#sdCdLw3ggRxYik385 [LW · GW]
https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world#Tdo7S62iaYwfBCFxL [LW(p) · GW(p)]
So my question is: are there known results (ideally proofs, though I can accept empirical studies if necessary) that show when RL algorithms treat the reward function as an optimization target?
And how narrow is the space of RL algorithms that don't optimize for the reward function?
A good answer will link to relevant results in the RL literature and give conditions under which an RL agent does or doesn't optimize the reward function.
The best answers will present either finite-time results on RL algorithms optimizing the reward function, or argue that the infinite limit abstraction is a reasonable approximation to the actual reality of RL algorithms.
I'd like to know which RL algorithms optimize the reward, and which do not.
Answers
BTW, another problem with the thesis "Reward is not the optimization target", even with TurnTrout's stipulation that
This post addresses the model-free policy gradient setting, including algorithms like PPO and REINFORCE.
is that it's still not true even in the model-free policy gradient setting, in any substantive sense, and cannot justify some of the claims that TurnTrout & Belrose make. That is because of meta-learning: the 'inner' algorithm may in fact be a model-based RL algorithm which was induced by the 'outer' algorithm of PPO/REINFORCE/etc. A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
There is just no hard and fast distinction here; it depends on the details of the system, the environment (i.e. the distribution of data), the amount of compute/data, the convergence, and so on.** (A good example from an earlier comment on how reward is the optimization target [LW(p) · GW(p)] is Bhoopchand et al 2023, which ablates the components involved.)
So if the expressible set of algorithms is rich enough to include model-based RL algorithms, and the conditions (data, compute, convergence) are sufficient, then your PPO algorithm 'which doesn't optimize the reward' simply learns an algorithm which does optimize the reward.
A simple, neat example is given by Botvinick [LW · GW] about NNs like RNNs. (As Shah notes in the comments, this is all considered basic RL and not shocking, and there are many examples of this sort of thing in meta-RL research - although what perspective you take on what is 'outer'/'inner' is often dependent on what niche you are in, and so Table 1 here may be a helpful Rosetta stone.)
You have a binary choice (like a bandit) which yields a 0/1 reward (perhaps stochastic with probability p to make it interesting) and your NN learns which one; you train a, let's say, fully-connected MLP with REINFORCE, which takes no input and outputs a binary variable to choose an arm; it learns that the left arm yields 1 reward and to always take left. You stop training it, and the environment changes to swap it: the left arm yields 0, and now right yields 1. The MLP will still pick 'left', however, because it learned a policy which doesn't try to optimize the reward. In this case, it is indeed the case that "reward is not the optimization target" of the MLP. It just learned a myopic action which happened to be selected for. In fact, even if you resume training it, it may take a long time to learn to instead pick 'right', because you have to 'undo' all of the now-irrelevant training towards 'left'. And you can do this swapping and training several times, and it'll be about the same each time: the MLP will slowly unlearn the old arm and learn the new arm, then the swapping happens, and now it's gotta do the same thing.*
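To make the example concrete, here is a minimal sketch of that setup (my own illustration, not gwern's code), with the input-free MLP collapsed to a single logit, since with no input the network amounts to a bias toward one arm; the learning rate, step count, and 0/1 arm encoding are illustrative assumptions:

```python
# A frozen input-free policy trained with REINFORCE on a two-armed bandit keeps picking
# the old arm after the environment swaps the payouts, because nothing in the policy
# reads rewards at decision time.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0   # logit for picking arm 0 ("left"); stands in for the MLP's weights
lr = 0.1

def pick(theta):
    p_left = 1.0 / (1.0 + np.exp(-theta))
    return (0 if rng.random() < p_left else 1), p_left

def train(rewarded_arm, steps=2000):
    global theta
    for _ in range(steps):
        a, p_left = pick(theta)
        r = 1.0 if a == rewarded_arm else 0.0
        # REINFORCE: d/d(theta) log pi(a), scaled by the reward
        grad_logp = (1 - p_left) if a == 0 else -p_left
        theta += lr * r * grad_logp

train(rewarded_arm=0)                                        # "left" pays out during training
print("P(left) after training:", 1 / (1 + np.exp(-theta)))   # close to 1
# Environment swap: "right" now pays out, but the frozen policy has no reward input,
# so its behavior cannot change without further training.
choices = [pick(theta)[0] for _ in range(100)]
print("fraction still picking the unrewarded left arm:", choices.count(0) / 100)
```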
But if you instead train an RNN, and to give it an input, you feed it a history of rewards and you otherwise train the exact same way... You will instead see something entirely different. After a swap, the RNN will pick the 'wrong' arm a few times, say 5 times, and receive 0 reward - and abruptly start picking the 'right' arm even without any further training, just the same frozen RNN weights. This sort of fast response to changing rewards is a signature of model-free vs model-based: if I tell you that I moved your cheese, you can change your policy without ever experiencing a reward, and go to where the cheese is now, without wasting an attempt on the old cheese location; but a mouse can't, or will need at least a few episodes of trial-and-error to update. (Any given agent may use a mix, or hybrids like 'successor representation' which is sorta both; Sutton is fond of that.) This switch is possible because it has learned a new 'policy' over its history and the sufficient statistics encoded into its hidden weights which is equivalent to a Bayesian model of the environment and where it has learned to update its posterior probability of a switch having happened and that it is utility-maximizing to, after a certain number of failures, switch. And that 5 times was just how much evidence you need to overcome the small prior of 'a switch just happened right now'. And this utility-maximizing inner algorithm is incentivized by the outer algorithm, even though the outer algorithm itself has no concept of an 'environment' to be modeling or a 'reward' anywhere inside it. Your 'reward is not the optimization target' REINFORCE algorithm has learned the Bayesian model-based RL algorithm for which reward is the optimization target, and your algorithm as a whole is now optimizing the reward target, little different from, say, AlphaZero doing a MCTS tree search.
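For intuition, here is a hedged sketch of the inner algorithm gwern describes the frozen RNN as implementing: a Bayesian filter over "which arm currently pays out" with a small prior probability of a switch at each step. The specific numbers (p_reward, p_switch) are illustrative assumptions, and they determine how many failures it takes before the policy flips:

```python
# Belief tracking over a switching bandit: with fixed update rules (analogous to frozen
# RNN weights), the policy adapts to a swap after a few zero-reward pulls, with no training.
p_reward = 0.9   # assumed chance the good arm pays out
p_switch = 0.01  # assumed prior per-step chance the arms were swapped

def step_belief(b_left, action, reward):
    """Update P(left is the good arm) after observing (action, reward)."""
    def lik(good_is_left):
        p_hit = p_reward if (action == 'left') == good_is_left else 1 - p_reward
        return p_hit if reward == 1 else 1 - p_hit
    post = b_left * lik(True) / (b_left * lik(True) + (1 - b_left) * lik(False))
    # Account for a possible switch before the next decision.
    return post * (1 - p_switch) + (1 - post) * p_switch

b_left = 0.99                 # trained on "left is good"
for t in range(8):            # the environment has silently swapped: right now pays out
    action = 'left' if b_left > 0.5 else 'right'
    reward = 1 if action == 'right' else 0   # deterministic here for clarity
    b_left = step_belief(b_left, action, reward)
    print(t, action, reward, round(b_left, 3))
# The policy pulls 'left' for a few zero-reward steps, then the posterior crosses 0.5
# and it switches to 'right' -- adaptation without any parameter updates.
```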
(And this should not be a surprise, because evolutionary algorithms are often cited as examples of model-free policy gradient algorithms which cannot 'plan' or 'model the environment' or 'optimize the reward', and yet, we humans were created by evolution and we clearly do learn rich models of the environment that we can plan over explicitly to maximize the reward, such as when we play Go and 'want to win the game'. So clearly the inference from 'algorithm X is itself not optimizing the reward' to 'all systems learned by algorithm X do not optimize the reward' is an illicit one.)
And, of course, in the other direction, it is entirely possible and desirable for model-based algorithms to learn model-free ones! (It's meta-learning all the way down.) Planning is expensive, heuristics cheap and efficient. You often want to distill your expensive model-based algorithm into a cheap model-free algorithm [LW(p) · GW(p)] and amortize the cost. In the case of the RNN above, you can, after the model-based algorithm has done its work and solved the switching bandit, throw it away and replace it with a much cheaper, simple model-free algorithm like `if sum(reward_history) > 5 then last_action else last_action × −1`, saving millions of FLOPs per decision compared to the RNN. You can see the Baldwin effect as a version of this: there is no need to relearn the same behavior within-lifetime if it's the same each time; you can just hardwire it into the genes. No matter how sample-efficient model-based RL is within-lifetime, it can't beat a prior hardwired strategy requiring 0 data.
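Rendering that distilled rule as code, with one added assumption of my own: the sum is taken over a recent window, since a sum over the entire history would never fall back below the threshold:

```python
def distilled_policy(last_action, reward_history, window=10, threshold=5):
    """Repeat the last arm while it keeps paying out; otherwise flip (arms encoded as +1/-1)."""
    if sum(reward_history[-window:]) > threshold:
        return last_action       # recent rewards look fine: stay
    return -last_action          # recent rewards dried up: switch arms
```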
* Further illustrating the weakness of 'reward is not the optimization target': it's not even obvious that this must be the case, rather than merely being what usually happens under most setups. Meta-learning doesn't strictly require explicit conditioning on history, nor does it require the clean fast-vs-slow-weight distinction of RNNs or Transformer self-attention. A large enough MLP, continually trained through enough switches, could potentially learn to use the gradient updates + weights themselves as an outsourced history/model, and could eventually optimize its weights into a saddle point where, after exactly k updates by the fixed SGD algorithm, it 'happens to' switch its choice, corresponding to the hidden state being fused into the MLP itself. This would be like MAML. (If you are interested in this vein of thought, you may enjoy my AUNN proposal, which tries to take this to the logical extreme.)
** As TurnTrout correctly notes [LW · GW], a GPT-5 could well be agentic, even if it was not trained with 'PPO' or called 'RL research'. What makes an RL agent is not the use of any one specific algorithm or tool, as they are neither necessary nor sufficient; it is the outcome. He asks, what does or does not make a Transformer agentic when we train it on OpenWebText with a cross-entropy predictive loss, while PPO is assumed to be agentic? The actual non-rhetorical answer is: there is no single reason and little difference between PPO and cross-entropy here; in both cases, it is the combined outcome of the richness of the agent-generated data in OWT and the richness of a sufficiently large compute & parameter budget, which will tend to yield agency by learning to imitate the agents which generated that data. A smaller Transformer, a Transformer trained with less compute, or less data, or with text data from non-agents (e.g. randomly initialized n-grams), would not yield agency, and further, there will be scaling laws/regimes for all of these critical ingredients. Choke any one of them hard enough, and the agency goes away. (You will instead get an LLM which is able only to predict 'e', or which perfectly models the n-grams and nothing else, or which predicts random outputs, or which asymptotes hard, etc.)
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-12T22:02:35.128Z · LW(p) · GW(p)
So in essence, even if reward truly isn't the optimization target at the outer level, that doesn't imply that all policies trained do not maximize the reward, right?
Replies from: gwern
↑ comment by gwern · 2025-01-12T22:54:18.416Z · LW(p) · GW(p)
Yes. (And they can learn to predict and estimate the reward too to achieve even higher reward than simply optimizing the reward. For example, if you included an input, which said which arm had the reward, the RNN would learn to use that, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem with learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)
Nor is it at all guaranteed that 'the dog will wag the tail' - depending on circumstances, the tail may successfully wag the dog indefinitely. Maybe the outer level will be able to override the inner, maybe not. Because after all, the outer level may no longer exist, or may be too slow to be relevant, or may be changed (especially by the inner level). The 'homunculus' or 'Cartesian boundary' we draw around each level doesn't actually exist; it's just a convenient, leaky, abstraction.
To continue the human example, we were created by evolution on genes, but within a lifetime, evolution has no effect on the policy and so even if evolution 'wants' to modify a human brain to do something other than what that brain does, it cannot operate within-lifetime (except at even lower levels of analysis, like in cancers or cell lineages etc); or, if the human brain is a digital emulation of a brain snapshot, it is no longer affected by evolution at all; and even if it does start to mold human brains, it is such a slow high-variance optimizer that it might take hundreds of thousands or millions of years... and there probably won't even be biological humans by that point, never mind the rapid progress over the next 1-3 generations in 'seizing the means of reproduction' if you will. (As pointed out in the context of Von Neumann probes or gray goo, if you add in error-correction, it is entirely possible to make replication so reliable that the universe will burn out before any meaningful level of evolution can happen, per the Price equation. The light speed delay to colonization also implies that 'cancers' will struggle to spread much if they take more than a handful of generations.)
I love this question! As it happens, I have a rough draft for a post titled something like "Reward is the optimization target for smart RL agents".
TLDR: I think this is true for some AI systems, but not likely true for any RL-directed AGI systems whose safety we should really worry about. They'll optimize for maximum reward even more than humans do, unless they're very carefully built to avoid that behavior.
In the final comment [LW(p) · GW(p)] on the second thread you linked [LW(p) · GW(p)], TurnTrout says of his Reward is not the optimization target [LW · GW]:
However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE.
Humans are definitely model-based RL learners at least some of the time - particularly for important decisions.[1] So the claim doesn't apply to them. I also don't think it applies to any other capable agent. TurnTrout actually makes a congruent claim in his other post Think carefully before calling RL policies "agents" [LW · GW]. Model-free RL algorithms only have limited agency, what I'd call level 1-of-3:
1. Trained to achieve some goal/reward.
   - Habitual behavior/model-free RL
2. Predicts outcomes of actions and selects ones that achieve a goal/reward.
   - Model-based RL
3. Selects future states that achieve a goal/reward and then plans actions to achieve that state.
   - No corresponding terminology ("goal-directed" from neuroscience applies to levels 2 and even 1[1]), but pretty clearly highly useful for humans
That's from my post Steering subsystems: capabilities, agency, and alignment [LW · GW].
But humans don't seem to optimize for reward all that often! They make self-sacrificial decisions that get them killed. And they usually say they'd refuse to get in Nozick's experience machine, which would hypothetically remove them from this world and give them a simulated world of maximally-rewarding experiences. They seem to optimize for the things that have given them reward, like protecting loved ones, rather than optimizing for reward itself, just like TurnTrout describes in RINTOT. And humans are model-based for important decisions, presumably using sophisticated models. What gives?
My cognitive neuroscience research focused a lot on dopamine, so I've thought a lot about how reward shapes human behavior. The most complete publication is Neural mechanisms of human decision-making, a summary of how humans seem to learn complex behaviors using reward and predictions of reward. But that's not really a very good description of the overall theory, because neuroscientists are highly suspicious of broad theories, and because I didn't really want to accidentally accelerate AGI research by describing brain function clearly. I know.
I think humans do optimize for reward, we just do it badly. We do see some sophisticated hedonists with exceptional amounts of time and money say things like "I love new experiences", which has abstracted away almost all of the specifics. Yudkowsky's "fun theory" also describes a pursuit of reward if you grant that "fun" refers to frequent, strong dopamine spikes (I think that's exactly what we mean by fun). I think more sophisticated hedonists will get in the experience box, but this is complicated by the approximations in human decision-making. It's pretty likely that the suffering you'd cause your loved ones by getting in the box and leaving them behind would be so salient, and produce such a negative reward-prediction, that it would outweigh all of the many positive predictions of reward, just because of saliency and our inefficient way of totaling predicted future reward: imagining salient outcomes and roughly averaging over their reward predictions.
So I think the more rational and cognitively capable a human is, the more likely they'll optimize more strictly and accurately for future reward. And I think the same is true of model-based RL systems with any decent decision-making process.
I realize this isn't the empirically-based answer you asked for. I think the answer has to be based on theory, because some systems will and some won't optimize for reward. I don't know the ML RL literature nearly as well as I know the neuroscience RL literature, so there might be some really relevant stuff out there I'm not aware of. I doubt it, because this is such an AI-safety question.[2]
So that's why I think reward is the optimization target for smart RL agents.
Edit: Thus, RINTOT and similar work has, I think, really confused the AGI safety debate by making strong claims about current AI that don't apply at all to the AGI we're worried about. I've been thinking about this a lot in the context of a post I'd call "Current AI and alignment theory is largely behaviorist. Expect a cognitive revolution".
[1] For more than you want to know about the various terminologies, see How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing.
We debated the terminologies habitual/goal-directed, automatic and controlled, system 1/system 2, and model-free/model-based for years. All of them have limitations, and all of them mean slightly different things. In particular, model-based is vague terminology when systems get more complex than simple RL - but it is very clear that many complex human decisions (certainly ones in which we envision possible outcomes before taking actions) are far on the model-based side, and meet every definition.
[2] One follow-on question is whether RL-based AGI will wirehead. I think this is almost the same question as getting into the experience box, except that that box will only keep going if the AGI engineers it correctly to keep going. So it's going to have to do a lot of planning before wireheading, unless its decision-making algorithm is highly biased toward near-term rewards over long-term ones. In the course of doing that planning, its other motivations will come into play, like the well-being of humans, if it cares about that. So whether or not our particular AGI will wirehead probably won't determine our fate.
↑ comment by gwern · 2025-01-13T01:13:44.737Z · LW(p) · GW(p)
But humans don't seem to optimize for reward all that often!
You might be interested in an earlier discussion on whether "humans are a hot mess": https://www.lesswrong.com/posts/SQfcNuzPWscEj4X5E/the-hot-mess-theory-of-ai-misalignment-more-intelligent [LW · GW] https://www.lesswrong.com/posts/izSwxS4p53JgJpEZa/notes-on-the-hot-mess-theory-of-ai-misalignment [LW · GW]
↑ comment by Lukas_Gloor · 2025-01-12T14:59:21.972Z · LW(p) · GW(p)
So I think the more rational and cognitively capable a human is, the more likely they'll optimize more strictly and accurately for future reward.
If this is true at all, it's not going to be a very strong effect, meaning you can find very rational and cognitively capable people who do the opposite of this in decision situations that directly pit reward against the things they hold most dearly. (And it may not be true, because a lot of personal hedonists tend to "lack sophistication," in the sense that they don't understand that their own feeling of valuing nothing but their own pleasure is not how everyone else who's smart experiences the world. So, there's at least a midwit level of "sophistication" where hedonists seem overrepresented.)
Maybe it's the case that there's a weak correlation that makes the quote above "technically accurate," but that's not enough to speak of reward being the optimization target. For comparison, even if it is the case that more intelligent people prefer classical music over k-pop, that doesn't mean classical music is somehow inherently superior to k-pop, or that classical music is "the music taste target" in any revealing or profound sense. After all, some highly smart people can still be into k-pop without making any mistake.
I've written about this extensively here [EA · GW] and here [EA · GW]. Some relevant excerpts from the first linked post:
One of many takeaways I got from reading Kaj Sotala’s multi-agent models of mind sequence [? · GW] (as well as comments by him [LW(p) · GW(p)]) is that we can model people as pursuers of deep-seated needs. In particular, we have subsystems (or “subagents”) in our minds devoted to various needs-meeting strategies. The subsystems contribute behavioral strategies and responses to help maneuver us toward states where our brain predicts our needs will be satisfied. We can view many of our beliefs, emotional reactions, and even our self-concept/identity as part of this set of strategies. Like life plans, life goals are “merely” components of people’s needs-meeting machinery.[8]
Still, as far as components of needs-meeting machinery go, life goals are pretty unusual. Having life goals means to care about an objective enough to (do one’s best to) disentangle success on it from the reasons we adopted said objective in the first place. The objective takes on a life of its own, and the two aims (meeting one’s needs vs. progressing toward the objective) come apart. Having a life goal means having a particular kind of mental organization so that “we” – particularly the rational, planning parts of our brain – come to identify with the goal more so than with our human needs.[9]
To form a life goal, an objective needs to resonate with someone’s self-concept and activate (or get tied to) mental concepts like instrumental rationality and consequentialism. Some life goals may appeal to a person’s systematizing tendencies and intuitions for consistency. Scrupulosity or sacredness intuitions may also play a role, overriding the felt sense that other drives or desires (objectives other than the life goal) are of comparable importance.
[...] Adopting an optimization mindset toward outcomes inevitably leads to a kind of instrumentalization of everything "near term." For example, suppose your life goal is about maximizing the number of your happy days. The rational way to go about your life probably implies treating the next decades as "instrumental only." On a first approximation, the only thing that matters is optimizing the chances of obtaining indefinite life extension (potentially leading to more happy days). Through adopting an outcome-focused optimizing mindset, seemingly self-oriented concerns such as wanting to maximize the number of happiness moments turn into an almost "other-regarding" endeavor. After all, only one's far-away future selves get to enjoy the benefits – which can feel essentially like living for someone else.[12]
[12] This points at another line of argument (in addition to the ones I gave in my previous post) to show why hedonist axiology isn’t universally compelling:
To be a good hedonist, someone has to disentangle the part of their brain that cares about short-term pleasure from the part of them that does long-term planning. In doing so, they prove they’re capable of caring about something other than their pleasure. It is now an open question whether they use this disentanglement capability for maximizing pleasure or for something else that motivates them to act on long-term plans.
↑ comment by Noosphere89 (sharmake-farah) · 2024-10-15T21:32:47.025Z · LW(p) · GW(p)
I'd also accept neuroscience RL literature, and also accept theories that would make useful predictions or give conditions on when RL algorithms optimize for the reward, not just empirical results.
At any rate, I'd like to see your post soon.
Replies from: Seth Herd
↑ comment by Seth Herd · 2024-10-16T00:41:42.377Z · LW(p) · GW(p)
That's probably as much of that post as I'll get around to. It's not high on my priority list because I don't see how it's a crux for any important alignment theory. I may cover what I think is important about it in the "behaviorist..." post.
Edit: I was going to ask why you were thinking this was important.
It seems pretty cut and dried; even TurnTrout wasn't claiming this was true beyond model-free RL. I guess LLMs are model-free, so that's relevant. I just expect them to be turned into agents with explicit goals, so I don't worry much about how they behave in base form.
Replies from: gwern, sharmake-farah
↑ comment by gwern · 2024-10-17T23:41:42.188Z · LW(p) · GW(p)
I guess LLMs are model-free, so that's relevant
FWIW, I strongly disagree with this claim. I believe they are model-based, with the usual datasets & training approaches, even before RLHF/RLAIF.
Replies from: D0TheMath, Seth Herd
↑ comment by Garrett Baker (D0TheMath) · 2024-10-19T04:32:50.387Z · LW(p) · GW(p)
What do you mean by "model-based"?
↑ comment by Seth Herd · 2024-10-20T19:10:47.460Z · LW(p) · GW(p)
Interesting. There's certainly a lot going on in there, and some of it very likely amounts to at least vague models of future word occurrences (and corresponding events). The definition of model-based gets pretty murky outside of classic RL, so it's probably best to just directly discuss what model properties give rise to what behavior, e.g. optimizing for reward.
Model-free systems can produce goal-directed behavior. They do this if they have seen some relevant behavior that achieves a given goal, their input or some internal representation includes the current goal, and they can generalize well enough to apply what they've experienced to the current context. (This is the neuroscience definition of habitual vs. goal-directed: behavior changes to follow the current goal, usually whether the animal is hungry or thirsty.)
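As a toy illustration of that point (the states, goals, and actions here are hypothetical), here is a purely reactive, model-free policy whose input includes the current goal; nothing in it predicts outcomes, yet its behavior tracks whichever goal is currently active:

```python
# A memorized (goal, state) -> action lookup standing in for a model-free policy.
learned_policy = {
    ('hungry',  'at_nest'):  'go_to_food',
    ('hungry',  'at_food'):  'eat',
    ('thirsty', 'at_nest'):  'go_to_water',
    ('thirsty', 'at_water'): 'drink',
}

def act(goal, state):
    return learned_policy.get((goal, state), 'explore')

print(act('hungry', 'at_nest'))   # go_to_food
print(act('thirsty', 'at_nest'))  # go_to_water -- same frozen policy, different goal input
```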
So if they're strong enough generalizers, I think even a model-free system actually optimizes for reward.
I think the claim should be stronger: for a smart enough RL system, reward is the optimization target.
↑ comment by Noosphere89 (sharmake-farah) · 2024-10-16T02:44:47.824Z · LW(p) · GW(p)
IMO, the important crux is whether we really need to secure the reward function from wireheading/tampering: if an RL algorithm optimizes for the reward, you will need much more security and much more robust reward functions than if RL algorithms don't optimize for the reward, because optimization amplifies both problems and solutions.
Replies from: Seth Herd
↑ comment by Seth Herd · 2024-10-17T18:04:25.258Z · LW(p) · GW(p)
Ah yes. I agree that the wireheading question deserves more thought. I'm not confident that my answer to wireheading applies to the types of AI we'll actually build - I haven't thought about it enough.
FWIW the two papers I cited are secondary research, so they branch directly into a massive amount of neuroscience research that indirectly bears on the question in mammalian brains. None of it I can think of directly addresses the question of whether reward is the optimization target for humans. I'm not sure how you'd empirically test this.
I do think it's pretty clear that some types of smart, model-based RL agents would optimize for reward. Those are the ones that a) choose actions based on highest estimated sum of future rewards (like humans seem to, very very approximately), and that are smart enough to estimate future rewards fairly accurately.
LLMs with RLHF/RLAIF may be the relevant case. They are model-free by TurnTrout's definition, and I'm happy to accept his use of the terminology. But they do have a powerful critic component (at least in training - I'm not sure about deployment, but probably there too), so it seems possible that it might develop a highly general representation of "stuff that gives the system rewards". I'm not worried about that, because I think that will happen long after we've given them agentic goals, and long after they've developed a representation of "stuff humans reward me for doing" - which could be mis-specified enough to lead to doom if it was the only factor.
I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".
Best-of-n or rejection sampling is an alternative to RLHF which involves generating responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because its searching for LLM outputs that achieve the highest reward from the reward model.
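A sketch of that procedure; `generate` and `reward_model` are placeholder callables standing in for an LLM sampler and a trained reward model, not any particular library's API:

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```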
I'd also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, "At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$ so as to maximize action value plus a bonus", and the formula is $a_t = \operatorname{argmax}_a \left( Q(s_t, a) + u(s_t, a) \right)$, where $u(s_t, a)$ is an exploration bonus.
Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or -1 depending on who wins).
The value function predicts the return (expected sum of future reward) from a position, whereas the random rollouts calculate the actual average reward by simulating future moves until the end of the game, when the reward function returns +1 or -1.
So I think AlphaZero is optimizing for a combination of predicted reward (from the value function) and actual reward which is calculated using multiple rollouts until the end of the game.
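A rough sketch of the two quantities described above, on my reading of the AlphaGo paper; `value_net`, `rollout`, and the Q/N/prior tables are placeholders rather than a real implementation, and the exact form of the exploration bonus is an assumption:

```python
import math

def leaf_value(state, value_net, rollout, lam=0.5, n_rollouts=8):
    """Mix the value-network prediction with the mean outcome (+1/-1) of random rollouts."""
    v = value_net(state)
    z = sum(rollout(state) for _ in range(n_rollouts)) / n_rollouts
    return (1 - lam) * v + lam * z

def select_action(actions, Q, N, prior, c_puct=1.0):
    """Pick argmax_a [ Q(a) + u(a) ], with a PUCT-style exploration bonus u."""
    total_visits = sum(N[a] for a in actions)
    def u(a):
        return c_puct * prior[a] * math.sqrt(total_visits) / (1 + N[a])
    return max(actions, key=lambda a: Q[a] + u(a))
```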
"Optimization target" is itself a concept which needs deconfusing/operationalizing. For a certain definition of optimization and impact, I've found that the optimization is mostly correlated with reward, but that the learned policy will typically have more impact on the world/optimize the world more than is strictly necessary to achieve a given amount of reward.
This uses an empirical metric of impact/optimization which may or may not correlate well with algorithm-level measures of optimization targets.
https://www.alignmentforum.org/posts/qEwCitrgberdjjtuW/measuring-learned-optimization-in-small-transformer-models [AF · GW]