Why almost every RL agent does learned optimization

post by Lee Sharkey (Lee_Sharkey) · 2023-02-12T04:58:34.569Z · LW · GW · 3 comments

Contents

  Or "Why RL≈RL2 (And why that matters)"
    Background on RL2 
    The conditions under which RL2 emerges are the default RL training conditions
      Ingredients for an RL2 cake
      Why these ingredients are the default conditions
    So what? (Planning from RL2?)
      Footnote: The Bayesian optimization objective that RL2 agents implicitly optimize has a structure that resembles planning

Or "Why  (And why that matters)"

 

TL;DR: This post discusses the blurred conceptual boundary between RL and RL² (also known as meta-RL). RL² is an instance of learned optimization. I point out that, far from being a special case, the conditions under which RL² emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence that learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL agents.

 

I've found myself telling this story about the relationship between RL and RL² numerous times in conversation. When that happens, it's usually time to write a post about it. 

Most of the first half of the post (which points out that RL² is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment. 

The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled up RL systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some. 

Background on RL²

RL² (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models, it's often called 'in-context learning' (Olsson et al. 2022, Garg et al. 2022).

RL² is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): the RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate RL algorithm in its activations (the inner, learned optimization algorithm). 

The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents trained to exhibit RL² show rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent Team et al. 2023), hypothesis-driven exploration/experimentation (DeepMind Open-Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL² may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (the area of the brain most associated with those capabilities) implements an RL² system (Wang et al. 2018). These cognitive capabilities are the kind of thing we're concerned about in powerful AI systems. RL² is therefore a phenomenon that seems likely to underlie some major safety risks.

The conditions under which RL² emerges are the default RL training conditions

Ingredients for an RL² cake

The four 'ingredients' required for RL² to emerge are:

  1. The agent must have observations that correlate with reward.
  2. The agent must have observations that correlate with its history of actions.
  3. The agent must have a memory state that persists through time in which the RL algorithm can be implemented.
  4. The agent must be trained on a distribution of tasks.

These conditions let the agent learn an RL algorithm because they let the agent learn to adapt its actions to a particular task instance according to what has led to reward so far. To make the mechanism by which these ingredients lead to RL² more concrete, a toy setup containing all four is sketched below.
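Here's a minimal toy sketch, in plain Python/NumPy, of a training setup containing all four ingredients. Everything in it is an illustrative stand-in rather than a description of any real system: a two-armed bandit whose better arm changes every episode plays the role of the task distribution, the previous reward and action are fed back in as inputs, a recurrent hidden state persists within each episode, and a deliberately crude hill-climbing routine stands in for the outer RL algorithm (a real setup would use something like policy gradients).

```python
import numpy as np

class TwoArmedBandit:
    """Toy task distribution: each instance hides which of two arms pays off.
    The hidden good arm plays the role of the latent task parameter."""
    def __init__(self, rng):
        self.good_arm = int(rng.integers(2))
        self.horizon = 20

    def reset(self):
        return np.zeros(1)  # an uninformative observation

    def step(self, action):
        reward = 1.0 if action == self.good_arm else 0.0
        return np.zeros(1), reward

def init_weights(rng, obs_dim=1, hidden=16, n_actions=2):
    n_inputs = obs_dim + 2  # observation + previous action + previous reward
    return {"W_in": 0.1 * rng.standard_normal((hidden, n_inputs)),
            "W_rec": 0.1 * rng.standard_normal((hidden, hidden)),
            "W_out": 0.1 * rng.standard_normal((n_actions, hidden))}

def run_episode(weights, env, rng):
    """Inner loop: the weights are frozen. Any within-episode adaptation has to
    happen in the recurrent hidden state, which is where an inner (RL²-style)
    algorithm could end up being implemented."""
    hidden = np.zeros(weights["W_rec"].shape[0])
    obs, prev_action, prev_reward, total = env.reset(), 0.0, 0.0, 0.0
    for _ in range(env.horizon):
        # Ingredients 1 & 2: reward and action history enter via the inputs.
        inputs = np.concatenate([obs, [prev_action, prev_reward]])
        # Ingredient 3: a memory state that persists through the episode.
        hidden = np.tanh(weights["W_in"] @ inputs + weights["W_rec"] @ hidden)
        logits = weights["W_out"] @ hidden
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = int(rng.choice(len(probs), p=probs))
        obs, reward = env.step(action)
        prev_action, prev_reward, total = float(action), reward, total + reward
    return total

def outer_loop(n_updates=200, episodes_per_eval=10, seed=0):
    """Outer loop: a (stand-in) optimizer adjusts the weights across episodes
    drawn from the task distribution (ingredient 4)."""
    rng = np.random.default_rng(seed)
    weights = init_weights(rng)

    def score(w):
        return np.mean([run_episode(w, TwoArmedBandit(rng), rng)
                        for _ in range(episodes_per_eval)])

    best = score(weights)
    for _ in range(n_updates):
        candidate = {k: v + 0.05 * rng.standard_normal(v.shape)
                     for k, v in weights.items()}
        s = score(candidate)
        if s >= best:
            weights, best = candidate, s
    return weights, best

if __name__ == "__main__":
    _, avg_return = outer_loop()
    print(f"average return per 20-step episode after training: {avg_return:.1f}")
```

The division of labour is the point: the outer loop only ever touches the weights between episodes, so any within-episode adaptation (e.g. settling on the better arm after a few exploratory pulls) has to be implemented by the recurrent state, which is exactly the kind of inner, learned RL algorithm described above.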

Why these ingredients are the default conditions

This set of conditions is more common than it might initially appear:

The (admittedly somewhat pedantic) argument that most tasks are, in fact, distributions of tasks points toward a blurred boundary between 'RL²' and 'the agent merely adapting during a task'. Some previous debate on the forum about what should count as 'learning' vs. 'adaptation' can be found in comments here [LW(p) · GW(p)] and here [LW(p) · GW(p)].

So what? (Planning from RL²?)

I'm making a pretty narrow, technical point in this post. The above indicates that RL² is pretty much inevitable in most interesting settings. But that's not necessarily dangerous; RL² itself isn't the thing that we should be worried about. We're mostly concerned about agents that have learned how to search or plan (as discussed in Hubinger et al. 2019 and Demski, 2020 [LW · GW]).

Unfortunately, I think there are a few indications that learned planning will probably emerge from scaled-up RL.

These are, of course, only weak indications that scaling up RL will yield learned planning. It's still unclear what else, if anything, is required for it to emerge.

  1. ^

    Footnote: The Bayesian optimization objective that RL² agents implicitly optimize has a structure that resembles planning

    Ortega et al. (2019) show that the policy of an RL² agent, $\pi$, is trained to approximate the following distribution:

    $$P(a^*_t \mid ao_{<t}) = \int_{\theta} P(a^*_t \mid ao_{<t}, \theta)\, P(\theta \mid ao_{<t})\, d\theta$$

    where:
    $a^*_t$ is the optimal action at timestep $t$,
    $ao_{<t}$ is the action-observation history up to timestep $t$,
    $P(a^*_t \mid ao_{<t})$ is the probability of choosing the optimal action given the action-observation history, and
    $\theta$ is the set of latent (inaccessible) task parameters that define the task instance. They are sampled from the task distribution. $\theta$ effectively defines the current world state.

    How might scaled-up RL² agents approximate this integral? Perhaps the easiest way to approximate a complicated distribution is a Monte Carlo estimate (i.e. take a bunch of samples and average them). It seems plausible that agents would learn to compute a Monte Carlo estimate of this integral within their learned algorithms. Here's a sketch of what this might look like on an intuitive level:

    - The agent has uncertainty over latent task variables/world state given its observation history. It can't consider all the possible configurations of the world state, so it just considers a small sample set of the most likely states of the world according to an internal model of $P(\theta \mid ao_{<t})$.
    - For each of that small sample set of possible world states, the agent considers what the optimal action would be in each case, i.e. $P(a^*_t \mid ao_{<t}, \theta)$. Generally, it's useful to predict the consequences of actions to evaluate how good they are. So the agent might consider the consequences of different actions given different world states and action-observation histories. 
    - After considering each of the possible worlds, it chooses the action that works best across those worlds, weighted according to how likely each world state is, i.e. it approximates $\int_{\theta} P(a^*_t \mid ao_{<t}, \theta)\, P(\theta \mid ao_{<t})\, d\theta$ with a sample average, as written out below.
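
    Written out in the notation above, that Monte Carlo version of the integral would be something like

    $$P(a^*_t \mid ao_{<t}) \;\approx\; \frac{1}{K}\sum_{k=1}^{K} P(a^*_t \mid ao_{<t}, \theta_k), \qquad \theta_k \sim P(\theta \mid ao_{<t}),$$

    where $K$ is the number of sampled world states; the agent would then act on whichever action makes this estimate largest.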
     
    Those steps resemble a planning algorithm. 

    It's not clear whether agents would actually learn to plan (i.e. learning approximations of each term in the integral that unroll serially, as sketched above) vs. something else (such as learning heuristics that, in parallel, approximate the whole integral). But the structure of the Bayesian optimization objective is suggestive of an optimization pressure in the direction of learning a planning algorithm. 
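
    To make that concrete, here is a minimal toy sketch of the sampling-and-averaging procedure (the function names and the toy usage are purely hypothetical stand-ins; a real agent would represent the posterior samples and the per-world evaluations implicitly in its activations rather than as explicit Python objects):

    ```python
    import numpy as np

    def plan_by_sampling(posterior_samples, candidate_actions, prob_optimal):
        """Monte Carlo estimate of P(a* | history), i.e. the integral of
        P(a* | history, theta) * P(theta | history) over theta.

        posterior_samples: world states theta drawn from (a model of)
            P(theta | history); because they are sampled, each one
            implicitly carries weight 1/K.
        prob_optimal(action, theta): stand-in for P(a* = action | history, theta).
        Returns the action with the highest estimated probability of being optimal.
        """
        estimates = []
        for action in candidate_actions:
            per_world = [prob_optimal(action, theta) for theta in posterior_samples]
            estimates.append(np.mean(per_world))  # average over sampled worlds
        return candidate_actions[int(np.argmax(estimates))]

    # Toy usage: two possible worlds (0 and 1); each action is only optimal in the
    # matching world. Three of four posterior samples say "world 0", so action 0 wins.
    samples = [0, 0, 0, 1]
    prob = lambda action, theta: 1.0 if action == theta else 0.0
    print(plan_by_sampling(samples, [0, 1], prob))  # -> 0
    ```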

3 comments


comment by Steven Byrnes (steve2152) · 2023-02-13T15:12:21.552Z · LW(p) · GW(p)

My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI [? · GW]”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu [? · GW] and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models. But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here [LW · GW].

Maybe what you’re thinking is: “Maybe the learned planning algorithm will have some weird and dangerous goal”. My hunch is: (1) if the original RL agent lacks an affordance for planning in the human-written source code, then it won’t work very well, and in particular, it won’t be up to the task of building a sophisticated dangerous planner with a misaligned goal; (2) if the original RL agent has an affordance for planning in the human-written source code, then it could make a dangerous misaligned planner, but it would be a “mistake” analogous to how future humans might unintentionally make misaligned AGIs, and this problem might be solvable by making the AI read about the alignment problem and murphyjitsu and red-teaming etc., and cranking up its risk-aversion etc.

Sorry if I’m misunderstanding. RL² stuff has never made much sense to me.

comment by Lee Sharkey (Lee_Sharkey) · 2023-02-14T20:10:42.796Z · LW(p) · GW(p)

> My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI [? · GW]”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu [? · GW] and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.


Hm, I don't think this quite captures what I view the post as saying. 
 

> Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models.

As far as there is a safety-related claim in the post, this captures it much better than the previous quote.
 

> But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here [AF · GW].

I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that's a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1). 

I can also imagine a middle ground between our hunches that looks something like "We gave our agent a pretty strong inductive bias toward learning a planning algorithm, but still didn't force it to learn one, yet it did." 

comment by Steven Byrnes (steve2152) · 2023-02-21T17:56:03.864Z · LW(p) · GW(p)

Thanks!

> One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system.

See Section 3 here [LW · GW] for why I think it would be a lot worse.