# Mesa-Search vs Mesa-Control

post by abramdemski · 2020-08-18T18:51:59.664Z · LW · GW · 45 comments

## Contents

  Search vs Control

Mesa-Learning Everywhere?
Text prediction sounds benign, while RL sounds agentic.
Recurrence.
Mesa-learning isn't mesa-optimization.
This isn't even mesa-learning, it's just "task location".
None


I currently see the spontaneous emergence of learning algorithms [LW · GW] as significant evidence for the commonality of mesa-optimization [LW · GW] in existing ML, and suggestive evidence for the commonality of inner alignment problems in near-term ML.

[I currently think that there is only a small amount of evidence toward this. However, due to thinking about the issues, I've still made a significant personal update in favor of inner alignment problems being frequent.]

This is bad news, in that it greatly increases my odds on this alignment problem arising in practice.

It's good news in that it suggests this alignment problem won't catch ML researchers off guard; maybe there will be time to develop countermeasures while misaligned systems are at only a moderate level of capability.

In any case, I want to point out that the mesa-optimizers suggested by this evidence might not count as mesa-optimizers by some definitions.

# Search vs Control

Nevan Wichers comments [LW · GW] on spontaneous-emergence-of-learning:

I don't think that paper is an example of mesa optimization. Because the policy could be implementing a very simple heuristic to solve the task, similar to: Pick the image that lead to highest reward in the last 10 timesteps with 90% probability. Pik an image at random with 10% probability.

So the policy doesn't have to have any properties of a mesa optimizer like considering possible actions and evaluating them with a utility function, ect.

In Selection vs Control [LW · GW], I wrote about two different kinds of 'optimization':

• Selection refers to search-like systems, which look through a number of possibilities and select one.
• Control refers to systems like thermostats, organisms, and missile guidance systems. These systems do not get a re-do for their choices. They make choices which move toward the goal at every moment, but they don't get to search, trying many different things -- at least, not in the same sense.

I take Nevan Wichers to be saying that there is no evidence search is occurring. The mesa-optimization being discussed recently could be very thermostat-like, using simple heuristics to move toward the goal.

## Mesa-Searchers

Defining mesa-optimizers by their ability to search is somewhat nice:

• There is some reason to think that mesa-optimizers which implement an explicit search are the most concerning, because they are the ones which could explicitly model the world, including the outer optimizer, and make sophisticated plans based on this.
• This kind of mesa-optimizer may be more theoretically tractible. If we solve problems with very (very) time-efficient methods, then search-type inner optimizers may be eliminated: whatever answers the search computation finds, there could be a more efficient solution which simply memorized a table of those answers. Paul asks a related theory question [LW · GW]. Vanessa gives a counterexample [LW(p) · GW(p)], which involves a control-type mesa-optimizer rather than one which implements an internal search. [Edit -- that's not really clear; see this comment [LW(p) · GW(p)].]
• So it's possible that we could solve mesa-optimization in theory, by sticking to search-based definitions -- while still having a problem in practice, due to control-type inner optimizers. (I want to emphasize that this would be significant progress, and well worth doing.)

## Mesa-Learners

Mesa-controllers sound like they may not be a huge concern, because they don't strategically optimize based on a world-model in the same way. However, I think the model discussed in the spontaneous-emergence-of-learning post is a significant counterargument to this.

The post discusses RL agents which spontaneously learn an inner RL algorithm. It's important to pause and ask what this means. Reinforcement learning is a task, not an algorithm. It's a bit nonsensical to say that the RL agent is spontaneously learning the RL task inside of itself. So what is meant?

The core empirical claim, as I understand it, is that task performance continues to improve after weights are frozen, suggesting that learning is still taking place, implemented in neural activation changes rather than neural weight changes.

Why might this happen? It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?

I've thought of two possible reasons so far.

1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
2. More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, which can adapt to new situations and produce novel, clever behavior.

They also suggest that the inner RL algorithm may be model-based while the outer is model-free. This goes some distance toward the "can model you, the world, and the outer alignment process, in order to manipulate it" concern which we have about search-type mesa-optimizers.

# Mesa-Learning Everywhere?

Gwern replies [LW · GW] to a comment by Daniel Kokotajlo:

>Learning still happening after weights are frozen? That’s crazy. I think it’s a big deal because it is evidence for mesa-optimization being likely and hard to avoid.

Sure. We see that elsewhere too, like Dactyl. And of course, GPT-3.

People are jumping on the RL examples as mesa-optimization. But, for all the discussion of GPT-3, I saw only speculative remarks about mesa-optimization in GPT-3. Why does an RL algorithm continuing to improve performance after weights are frozen indicate inner optimization, while evidence of the same thing in text prediction does not?

## 1. Text prediction sounds benign, while RL sounds agentic.

One obvious reason: an inner learner in a text prediction system sounds like just more text prediction. When we hear that GPT-3 learned-to-learn, and continues learning after the weights are frozen, illustrating few-shot learning, we imagine the inner learner is just noticing patterns and extending them. When we hear the same for an RL agent, we imagine the inner learner actively trying to pursue goals (whether aligned or otherwise).

I think this is completely spurious. I don't currently see any reason why the inner learner in an RL system would be more or less agentic than in text prediction.

## 2. Recurrence.

A more significant point is the structure of the networks in the two cases. GPT-3 has no recurrence: no memory which lasts between predicting one token and the next.

The authors of the spontaneous learning paper mention recurrence as one of the three conditions which should be met in order for inner learning to emerge. But that's just a hypothesis. If we see the same evidence in GPT-3 -- evidence of learning after the weights are frozen -- then shouldn't we still make the same conclusion in both cases?

I think the obvious argument for the necessity of recurrence is that, without recurrence, there is simply much less potential for mesa-learning. A mesa-learner holds its knowledge in the activations, which get passed forward from one time-step to the next. If there is no memory, that can't happen.

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the "learned information" from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

Another argument might be that the lack of recurrence makes mesa-learners much less likely to be misaligned, or much less likely to be catastrophically misaligned, or otherwise practically less important. I'm not sure what to make of that possibility.

## 3. Mesa-learning isn't mesa-optimization.

One very plausible explanation of why mesa-learning happens is the system learns a probability distribution which extrapolates the future from the past. This is just regular ol' good modeling. It doesn't indicate any sort of situation where there's a new agent in the mix.

Consider a world which is usually "sunny", but sometimes becomes "rainy". Let's say that rainy states always occur twice in a row. Both RL agents and predictive learners will learn this. (At least, RL agents will learn about it in so far as it's relevant to their task.) No mesa-learning here.

Now suppose that rainy streaks can last more than two days. When it's rainy, it's more likely to be rainy tomorrow. When it's sunny, it's more likely to be sunny tomorrow. Again, both systems will learn this. But it starts to look a little like mesa-learning. Show the system a rainy day, and it'll be more prone to anticipate a rainy day tomorrow, improving its performance on the "rainy day" task. "One-shot learning!"

Now suppose that the more rainy days there have been in a row, the more likely it is to be rainy the next day. Again, our systems will learn the probability distribution. This looks even more like mesa-learning, because we can show that performance on the rainy-day task continues to improve as we show the frozen-weight system more examples of rainy days.

Now suppose that all these parameters drift over time. Sometimes rainy days and sunny days alternate. Sometimes rain follows a memoryless distribution. Sometimes longer rainy streaks become more likely to end, rather than less. Sometimes there are repeated patterns, like rain-rain-sun-rain-rain-sun-rain-rain-sun.

At this point, the learned probabilistic model starts to resemble a general-purpose learning algorithm. In order to model the data well, it has to adapt to a variety of situations.

But there isn't necessarily anything mesa-optimize-y about that. The text prediction system just has a very good model -- it doesn't have models-inside-models or anything like that. The RL system just has a very good model -- it doesn't have something that looks like a new RL algorithm implemented inside of it.

At some level of sophistication, it may be easier to learn some kind of general-purpose adaptation, rather than all the specific things it has to adapt to. At that point it might count as mesa-optimization.

## 4. This isn't even mesa-learning, it's just "task location".

Taking the previous remarks a bit further: do we really want to count it as 'mesa-learning' if it's just constructed a very good conditional model, which notices a wide variety of shifting local regularities in the data, rather than implementing an internal learning algorithm which can take advantage of regularities of a very general sort?

In GPT-3: a disappointing paper, Nostalgebraist argues that the second is unlikely to be what's happening in the case of GPT-3 [LW · GW]. It's not likely that GPT-3 is learning arithmetic from examples. It's more likely that it is learning that we are doing arithmetic right now. This is less like learning and more like using a good conditional model. It isn't learning the task, it's just "locating" one of many tasks that it has already learned.

I'll grant that the distinction gets very, very fuzzy at the boundaries. Are literary parodies of Harry Potter "task location" or "task learning"? On the one hand, it is obviously bringing to bear a great deal of prior knowledge in these cases, rather than learning everything anew on the fly. It would not re-learn this task in an alien language with its frozen weights. On the other hand, it is obviously performing well at a novel task after seeing a minimal demonstration.

I'm not sure where I would place GPT-3, but I lean toward there being a meaningful distinction here: a system can learn a general-purpose learning algorithm, or it can 'merely' learn a very good conditional model. The first is what I think "mesa-learner" should mean.

We can then ask the question: did the RL examples discussed previously constitute true mesa-learning? Or did they merely learn a good model, which represented the regularities in the data? (I have no idea.)

In any case, the fuzziness of the boundary makes me think these methods (ie, a wide variety of methods) will continue moving further along the spectrum toward producing powerful mesa-learners as they are scaled up (and hence, mesa-optimizers).

comment by gwern · 2020-08-18T23:10:29.575Z · LW(p) · GW(p)

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the “learned information” from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I think that's exactly it. There's no real difference between a history, and a recurrence. A recurrence is a (lossy) function of a history, so anything a recurrent hidden state can encode, a sufficiently large/deep feedforward model given access to the full history should be able to internally represent as well.

GPT with a context window of 1 token would be unable to do any kind of meta-learning, in much the same way that a RNN with no hidden state (or at its first step with a default hidden state) working one step at a time would be unable to do anything. Whether you compute your meta-learning 'horizontally' by repeated application to a hidden state, updating token by token, or 'vertically' inside a deep Transformer (an unrolled RNN?) conditioned on the entire history, makes no difference aside from issues of perhaps computational efficiency (a RNN is probably faster to run but slower to train) and needing more or less parameters or layers to achieve the same effective amount of pondering time (although see Universal Transformers there).

Replies from: vanessa-kosoy, capybaralet
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-19T18:20:32.004Z · LW(p) · GW(p)

There's no real difference between a history, and a recurrence.

That's true for unbounded agents but false for realistic (bounded) agents. Considering the following two-player zero-sum game:

Player A secretly writes some , then player B says some and finally player B says some . Player A gets reward unless where is a fixed one-way function. If , player A gets a reward in which is the fraction of bits and have in common.

The optimal strategy for player A is producing a random sequence. The optimal strategy for player B is choosing a random , computing , outputting and then outputting . The latter is something that an RNN can implement (by storing in its internal state) but a stateless architecture like a transformer cannot implement. A stateless algorithm would have to recover from , but that is computationally unfeasible.

Replies from: vanessa-kosoy, gwern
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-19T22:29:30.104Z · LW(p) · GW(p)

On second thought, that's not a big deal: we can fix it by interspersing random bits in the input. This way, the transformer would see a history that includes and the random bits used to produce it (which encode ). More generally, such a setup can simulate any randomized RNN.

comment by gwern · 2020-08-19T18:30:12.062Z · LW(p) · GW(p)

Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren't the internal activations of a many-layer Transformer perfectly adequate to first encode, 'storing z', and then transform?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-19T18:36:39.675Z · LW(p) · GW(p)

I'm assuming that either architecture can use a source of random bits.

The transformer produces one bit at a time, computing every bit from the history so far. It doesn't have any state except for the history. At some stage of the game the history consists of only. At this stage the transformer would have to compute from in order to win. It doesn't have any activations to go on besides those that can be produced from .

Replies from: gwern
comment by gwern · 2020-08-19T18:58:35.589Z · LW(p) · GW(p)

And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-19T19:03:25.432Z · LW(p) · GW(p)

No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many "Monte Carlo" iterations until it produces the right history.

Replies from: gwern
comment by gwern · 2020-10-31T18:50:25.343Z · LW(p) · GW(p)

An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it's not, I don't see why that would make a difference, or why a Transformer couldn't be 'not deterministic' in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can't learn 'Monte Carlo iterations' internally (say, one per head).

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-10-31T18:59:39.546Z · LW(p) · GW(p)

I already conceded [LW(p) · GW(p)] a Transformer can be made stochastic. I don't see a problem with backproping: you treat the random inputs as part of the environment, and there's no issue with the environment having stochastic parts. It's stochastic gradient descent, after all.

Replies from: gwern
comment by gwern · 2020-10-31T23:18:22.708Z · LW(p) · GW(p)

Because you don't train the inputs, you're trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it's abusing the term 'stochastic' (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don't understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can't meta-learn in the same way as RNNs internally.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-11-01T07:26:07.290Z · LW(p) · GW(p)

I am not sure what do you mean by "stop cold?" It has to with minibatches, because in offline learning your datapoints can (and usually are) regarded as sampled from some IID process, and here we also have a stochastic environment (but not IID). I dont see anything unusual about this, the MDP in RL is virtually always allowed to be stochastic.

As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be barging into an open door here?

comment by capybaralet · 2020-09-17T07:47:04.008Z · LW(p) · GW(p)

Practically speaking, I think the big difference is that the history is outside of GPT-3's control, but a recurrent memory would be inside its control.

comment by evhub · 2020-08-18T19:35:57.544Z · LW(p) · GW(p)

Paul asks a related theory question. Vanessa gives a counterexample, which involves a control-type mesa-optimizer rather than one which implements an internal search.

You may also be interested in my counterexample [LW · GW], which preceded Vanessa's and uses a search-type mesa-optimizer rather than a control-type mesa-optimizer.

I've thought of two possible reasons so far.

1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
2. More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode. Thus, if the task of taking good actions in the episode requires learning, then your model will have to learn some sort of learning procedure to do well.

I don't currently see any reason why the inner learner in an RL system would be more or less agentic than in text prediction.

I agree with this—it has long been my position that language modeling as a task ticks all of the boxes necessary to produce mesa-optimizers.

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the "learned information" from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I agree here too—though also I think it's pretty reasonable to expect that future massive language models will have more recurrence in them.

comment by vlad_m · 2020-09-13T15:00:38.057Z · LW(p) · GW(p)
I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode.

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation. And yet, I suspect that if I meta-train a recurrent policy using such an algorithm on a distribution of bandit tasks, I will get a 'learning-to-learn' style policy.

So I think this is a less fundamental reason, though it is true in off-policy RL.

Replies from: evhub
comment by evhub · 2020-09-13T22:48:43.748Z · LW(p) · GW(p)

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation.

You can't update the model based on its action until its taken that action and gotten a reward for it. It's obviously possible to throw in updates based on past data whenever you want, but that's beside the point—the point is that the RL algorithm only gets new information with which to update the model after the model has taken its action, which means if taking actions requires learning, then the model itself has to do that learning.

comment by vlad_m · 2020-09-14T16:49:27.123Z · LW(p) · GW(p)

I interpreted your previous point to mean you only take updates off-policy, but now I see what you meant. When I said you can update after every observation, I meant that you can update once you have made an environment transition and have an (observation, action, reward, observation) tuple. I now see that you meant the RL algorithm doesn't have the ability to update on the reward before the action is taken, which I agree with. I think I still am not convinced, however.

And can we taboo the word 'learning' for this discussion, or keep it to the standard ML meaning of 'update model weights through optimisation'? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn't have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn't necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.

You can't update the model based on its action until its taken that action and gotten a reward for it.

Right, I agree with that.

Now, I understand that you argue that if a policy was to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby 'updating' based on returns it hasn't yet received, and actions it hasn't yet made. I agree that it's possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don't do anything so sophisticated). It isn't clear to me a priori that from the point of view of the policy this is the best strategy to take.

But note that this argument means that to the extent learned responsiveness can do more than the RL algorithm's weight updates can, that cannot be due to recurrence. If it was, then the RL algorithm could just simulate the recurrent updates using the agent's weights, achieving performance parity. So for what you're describing to be the explanation for emergent learning-to-learn, you'd need the model to do all of its learned 'learning' within a single forward pass. I don't find that very plausible -- or rather, whatever advantageous responsive computation happens in the forward pass, I wouldn't be inclined to describe as learning.

You might argue that today's RL algorithms can't simulate the required recurrence using the weights -- but that is a different explanation to the one you state, and essentially the explanation I would lean towards [AF(p) · GW(p)].

if taking actions requires learning, then the model itself has to do that learning.

I'm not sure what you mean when you say 'taking actions requires learning'. Do you mean something other than the basic requirement that a policy depends on observations?

Replies from: evhub
comment by evhub · 2020-09-14T19:32:34.126Z · LW(p) · GW(p)

And can we taboo the word 'learning' for this discussion, or keep it to the standard ML meaning of 'update model weights through optimisation'? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn't have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn't necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.

I agree with all of that—I was using the term “learning” to be purposefully vague precisely because the space is so large and the point that I'm making is very general and doesn't really depend on exactly what notion of responsiveness/learning you're considering.

Now, I understand that you argue that if a policy was to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby 'updating' based on returns it hasn't yet received, and actions it hasn't yet made. I agree that it's possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don't do anything so sophisticated). It isn't clear to me a priori that from the point of view of the policy this is the best strategy to take.

This does in fact seem like an interesting angle from which to analyze this, though it's definitely not what I was saying—and I agree that current meta-learners probably aren't doing this.

I'm not sure what you mean when you say 'taking actions requires learning'. Do you mean something other than the basic requirement that a policy depends on observations?

What I mean here is in fact very basic—let me try to clarify. Let be the optimal policy. Furthermore, suppose that any polynomial-time (or some other similar constraint) algorithm that well-approximates has to perform some operation . Then, my point is just that, for to achieve performance comparable with , it has to do . And my argument for that is just simply because we know that you have to do to get good performance, which means either has to do or the gradient descent algorithm has to—but we know the gradient descent algorithm can't be doing something crazy like running at each step and putting the result into the model because the gradient descent algorithm only updates the model on the given state after the model has already produced its action.

comment by vlad_m · 2020-09-15T20:41:37.343Z · LW(p) · GW(p)

I am quite confused. I wonder if we agree on the substance but not on the wording, but perhaps it’s worthwhile talking this through.

I follow your argument, and it is what I had in mind when I was responding to you earlier. If approximating within the constraints requires computing , then any policy that approximates must compute . (Assuming appropriate constraints that preclude the policy from being a lookup table precomputed by SGD; not sure if that’s what you meant by “other similar”, though this may be trickier to do formally than we take it to be).

My point is that for = ‘learning’, I can’t see how anything I would call learning could meaningfully happen inside a single timestep. ‘Learning’ in my head is something that suggests non-ephemeral change; and any lasting change has to feed into the agent’s next state, by which point SGD would have had its chance to make the same change.

Could you give an example of what you mean (this is partially why I wanted to taboo learning)? Or, could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning).

Replies from: evhub
comment by evhub · 2020-09-15T21:12:07.797Z · LW(p) · GW(p)

could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning)

How about language modeling? I think that the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way, as it requires the model to be able to learn from the prompt what the human is trying to do and then do that.

comment by vlad_m · 2020-09-19T11:41:20.073Z · LW(p) · GW(p)

Good point -- I think I wasn't thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it's doing something really impressive that isn't well-explained by interpolating on dataset examples -- I'm imagining giving GPT-X some new mathematical definitions and asking it to make novel proofs.

I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of 'learning' is a bit different from that one. 'Learning' in your sense can be required for a task without requiring an inner RL algorithm; or at least, whether it does isn't clear to me a priori.

comment by rohinmshah · 2020-08-19T19:29:52.603Z · LW(p) · GW(p)

(EDIT: I responded in more detail on the original post [AF(p) · GW(p)].)

The core empirical claim, as I understand it, is that task performance continues to improve after weights are frozen, suggesting that learning is still taking place, implemented in neural activation changes rather than neural weight changes.

I'm fairly confident this is not what is happening (at least, if I understand your claim correctly). If <what I understand your claim to be> was happening, it would be pretty surprising to me.

I only skimmed the linked paper, but it seems like it studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn't know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

However, because bandit problems have been studied in the AI literature, and "learning algorithms" have been proposed to solve bandit problems, this very normal fact of a policy depending on observations is now trotted out as "learning algorithms spontaneously emerge". I don't understand why this was surprising to the original researchers, it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction.

More broadly, I don't understand what people are talking about when they speak of the "likelihood" of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance than a policy trained by RL will "learn" without gradient descent, I can't imagine a way that could fail to be true for an intelligent system trained by deep RL -- presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must "learn" as it is being executed.

Replies from: abramdemski, evhub
comment by abramdemski · 2020-08-19T20:21:54.915Z · LW(p) · GW(p)

Your assessment here seems to (mostly) line up with what I was trying to communicate in the post.

This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

This is something I hoped to communicate in the "Mesa-Learning Everywhere?" section, especially point #3.

If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it.

This is a point I hoped to convey in the Search vs Control section.

If you mean the chance than a policy trained by RL will "learn" without gradient descent, I can't imagine a way that could fail to be true for an intelligent system trained by deep RL

Ah, here is where the disagreement seems to lie. In another comment [LW(p) · GW(p)], you write:

Here on LW / AF, "mesa optimization" seems to only apply if there's some sort of "general" learning algorithm, especially one that is "using search", for reasons that have always been unclear to me.

I currently think this:

1. There is a spectrum between "just learning the task" vs "learning to learn", which has to do with how "general" the learning is. DQN looking at the ball is very far on the "just learning the task" side.
2. This spectrum is very fuzzy. There is no clear distinction.
3. This spectrum is very relevant to inner alignment questions. If a system like GPT-3 is merely "locating the task", then its behavior is highly constrained by the training set. On the other hand, if GPT-3 is "learning on the fly", then its behavior is much less constrained by the training set, and have correspondingly more potential for misaligned behavior (behavior which is capably achieving a different goal than the intended one). This is justified by an interpolation-vs-extrapolation type intuition.
4. The paper provides a small amount of evidence that things higher on the spectrum are likely to happen. (I'm going to revise the post to indicate that the paper only provides a small amount of evidence -- I admit I didn't read the paper to see exactly what they did, and should have anticipated that it would be something relatively unimpressive like multi-armed-bandit.)
5. Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.
Replies from: rohinmshah
comment by rohinmshah · 2020-08-19T20:32:15.126Z · LW(p) · GW(p)

All of that seems reasonable. (I indeed misunderstood your claim, mostly because you cited the spontaneous emergence post.)

Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.

Fair enough; I guess I'm unclear on how you can think about it other than this way.

Replies from: abramdemski
comment by abramdemski · 2020-08-19T20:41:50.297Z · LW(p) · GW(p)

Fair enough; I guess I'm unclear on how you can think about it other than this way.

Yeahhh, idk. All I can currently articulate is that, previously, I thought of it as a black swan event.

Replies from: rohinmshah
comment by rohinmshah · 2020-08-24T22:19:19.878Z · LW(p) · GW(p)

Random question: does this also update you towards "alignment problems will manifest in real systems well before they are powerful enough to take over the world"?

Context: I see this as a key claim for the (relative to MIRI) alignment-by-default perspective [AF · GW], and I expect many people at MIRI disagree with this claim (though I don't know why they disagree).

Replies from: capybaralet
comment by capybaralet · 2020-09-17T07:59:10.276Z · LW(p) · GW(p)

I'm very curious to know whether people at MIRI in fact disagree with this claim.

I would expect that they don't... e.g. Eliezer seems to think we'll see them and patch them unsuccessfully: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D

Replies from: rohinmshah
comment by rohinmshah · 2020-09-17T16:20:00.162Z · LW(p) · GW(p)

Yeah it's plausible that the actual claims MIRI would disagree with are more like:

Problems manifest => high likelihood we understand the underlying cause

We understand the underlying cause => high likelihood we fix it (or don't build powerful AI) rather than applying "surface patches"

Replies from: capybaralet
comment by capybaralet · 2020-09-17T19:01:14.255Z · LW(p) · GW(p)

Yep. I'd love to see more discussion around these cruxes (e.g. I'd be up for a public or private discussion sometime, or moderating one with someone from MIRI). I'd guess some of the main underlying cruxes are:

• How hard are these problems to fix?
• How motivated will the research community be to fix them?
• How likely will developers be to use the fixes?
• How reliably will developers need to use the fixes? (e.g. how much x-risk would result from a small company *not* using them?)

Personally, OTTMH (numbers pulled out of my ass), my views on these cruxes are:

• It's hard to say, but I'd say there's a ~85% chance they are extremely difficult (effectively intractable on short-to-medium (~40yrs) timelines).
• A small minority (~1-20%) of researchers will be highly motivated to fix them, once they are apparent/prominent. More researchers (~10-80%) will focus on patches.
• Conditioned on fixes being easy and cheap to apply, large orgs will be very likely to use them (~90%); small orgs less so (~50%). Fixes are likely to be easy to apply (we'll build good tools), if they are cheap enough to be deemed "practical", but very unlikely (~10%) to be cheap enough.
• It will probably need to be highly reliable; "the necessary intelligence/resources needed to destroy the world goes down every year" (unless we make a lot of progress of governance, which seems fairly unlikely (~15%))
Replies from: rohinmshah, capybaralet
comment by rohinmshah · 2020-09-17T21:20:07.281Z · LW(p) · GW(p)

Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:

• ~90% that there aren't problems or we "could" fix them on 40 year timelines
• I'm not sure exactly what is meant by motivation so will not predict, but there will be many people working on fixing the problems
• "Are fixes used" is not a question in my ontology; something counts as a "fix" only if it's cheap enough to be used. You could ask "did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not" (possibly because they didn't know of its existence), then < 10% and I don't have enough information to distinguish between 0-10%.
• I'll answer "how much x-risk would result from a small company *not* using them", if it's a single small company then < 10% and I don't have enough information to distinguish between 0-10% and I expect on reflection I'd say < 1%.
comment by capybaralet · 2020-09-17T19:02:43.306Z · LW(p) · GW(p)

I guess most of my cruxes are RE your 2nd "=>", and can almost be viewed as breaking down this question into sub-questions. It might be worth sketching out a quantitative model here.

comment by evhub · 2020-08-19T20:57:21.771Z · LW(p) · GW(p)

Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

Fwiw, I agree with this, and also I think this is the same thing as what I said in my comment on the post [AF(p) · GW(p)] regarding how this is a necessary consequence of the RL algorithm only updating the model after it takes actions.

Replies from: rohinmshah
comment by rohinmshah · 2020-08-19T21:37:07.125Z · LW(p) · GW(p)

I didn't understand what you meant by "requires learning", but yeah I think you are in fact saying the same thing.

comment by vlad_m · 2020-09-13T14:45:41.773Z · LW(p) · GW(p)

I had a similar confusion when I first read Evan's comment. I think the thing that obscures this discussion is the extent to which the word 'learning' is overloaded -- so I'd vote taboo the term and use more concrete language.

comment by Vaniver · 2020-08-19T19:15:50.171Z · LW(p) · GW(p)
The inner RL algorithm adjusts its learning rate to improve performance.

I have come across a lot of learning rate adjustment schemes in my time, and none of them have been 'obviously good', altho I think some have been conceptually simple and relatively easy to find. If this is what's actually going on and can be backed out, it would be interesting to see what it's doing here (and whether that works well on its own).

This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, which can adapt to new situations and produce novel, clever behavior.

Most RL training algorithms that we have look to me like putting a thermostat on top of a model; I think you're underestimating deep thermostats [LW · GW].

comment by vlad_m · 2020-09-13T15:16:20.699Z · LW(p) · GW(p)
I've thought of two possible reasons so far.
Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

I would be more inclined towards a more general version of the latter view, in which gradient updates just aren't a very effective way to track within-episode information.

The central example of learning-to-learn is a policy that effectively explores/exploits when presented with an unknown bandit from within the training distribution. An optimal policy essentially needs to keep track of sufficient statistics of the reward distributions for each action. If you're training a memoryless policy for a fixed bandit problem using RL, then the only way of tracking the sufficient stats you have is through your weights, which are changed through the gradient updates. But the weight-space might not be arranged in a way that's easily traversed by local jumps. On the other hand, a meta-trained recurrent agent can track sufficient stats in its activations, traversing the sufficient statistic space in whatever way it pleases -- its updates need not be local.

This has an interesting connection to MAML, because a converged memoryless MAML solution on a distribution of bandit tasks will presumably arrange the part of its weight-space that encodes bandit sufficient statistics in a way that makes it easy to traverse via SGD. That would be a neat (and not difficult) experiment to run.

comment by Kaj_Sotala · 2020-08-19T05:13:52.694Z · LW(p) · GW(p)

It sounds a bit absurd: you've already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn't it just focus on filling in the values accurately?

I've thought of two possible reasons so far.

1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
2. More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

Possibly obvious, but just to point it out: both of these seem like they also describe the case of genetic evolution vs. brains.

Replies from: gwern
comment by gwern · 2020-08-19T14:25:00.794Z · LW(p) · GW(p)

I'm a little confused as to why there's any question here. Every algorithm lies on a spectrum of tradeoffs from general to narrow. The narrower a class of solved problems, the more efficient (in any way you care to name) an algorithm can be: a Tic-Tac-Toe solver is going to be a lot more efficient than AIXI.

Meta-learning works because the inner algorithm can be far more specialized, and thus, more performant or sample-efficient than the highly general outer algorithm which learned the inner algorithm.

For example, in Dactyl, PPO trains a RNN to adapt to many possible robot hands on the fly in as few samples as possible; it's probably several orders of magnitude faster than online training of an RNN by PPO directly. "Why not just use that RNN for DoTA2, if it's so much better than PPO?" Well, because DoTA2 has little or nothing to do with robotic hands rotating cubes, an algorithm that excels at robot hand will not transfer to DoTA2. PPO will still work, though.

Replies from: rohinmshah
comment by rohinmshah · 2020-08-19T18:05:58.472Z · LW(p) · GW(p)

Here on LW / AF, "mesa optimization" seems to only apply if there's some sort of "general" learning algorithm, especially one that is "using search", for reasons that have always been unclear to me. Some relevant posts taking the opposite perspective (which I endorse):

comment by rohinmshah · 2020-08-24T22:50:09.350Z · LW(p) · GW(p)

Planned summary for the Alignment Newsletter:

This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, it seems to me that this perspective is common so it seems fairly likely that the typical reader would find the post useful.

Happy for others to propose a different summary for me to include. However, the summary will need to make sense to me; this may be a hard challenge for this post in particular.

comment by rohinmshah · 2020-08-24T22:37:23.205Z · LW(p) · GW(p)
I lean toward there being a meaningful distinction here: a system can learn a general-purpose learning algorithm, or it can 'merely' learn a very good conditional model.

Does human reasoning count as a general-purpose learning algorithm? I've heard it claimed that when we apply neural nets to tasks humans haven't been trained on (like understanding DNA or materials science) the neural nets can rocket past human understanding, with way less computation and tools (and maybe even data) than humans have had access to (depending on how you measure). Tbc, I find this claim believable but haven't checked it myself. Maybe SGD is the real general-purpose learning algorithm? Human reasoning could certainly be viewed formally as "a very good conditional model".

So overall I lean towards thinking this is a continuous spectrum with no discontinuous changes (except ones like "better than humans or not", which use a fixed reference point to get a discontinuity). So there could be a meaningful distinction, but it's like the meaningful distinction between "warm water" and "hot water", rather than the meaningful distinction between "water" and "ice".

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-21T16:20:11.079Z · LW(p) · GW(p)

Why do you think my counterexample doesn't have internal search? In my counterexample, the circuit is simulating the behavior of another agent, which presumably is doing search, so the circuit is also doing search.

Replies from: abramdemski
comment by abramdemski · 2020-08-21T18:22:28.891Z · LW(p) · GW(p)

True, but it's a minimal circuit. When I wrote the remark, I was thinking: a minimal circuit will never do search; it will instead do something closer to memorizing the output of search (with some abstractions to compress further, so, not just a big memorized table). So I thought of the point of your counterexample as: "a minimal circuit may not do search, but it can implement the same policy as an agent which does search, which is exactly as concerning."

I agree this isn't really clear. I'll revise the remark.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-21T19:00:44.232Z · LW(p) · GW(p)

Well, running a Turning machine for time can be simulated by a circuit of size , so in terms of efficiency it's much closer to "doing search" than to "memorizing the output of search".

Replies from: abramdemski
comment by abramdemski · 2020-08-21T19:44:14.534Z · LW(p) · GW(p)

OK.

So if a search takes time exponential in the input size, the search-simulating circuit is size ... and if memorizing the answers also requires circuit length exponential in input size, they're roughly tied.

So the line where minimal circuits start memorizing rather than running is around there. Any search type worse than exponential, and it'll memorize; anything better, and it won't.