# Sufficiently Advanced Language Models Can Do Reinforcement Learning

post by Zachary Robertson (zachary-robertson) · 2020-08-02T15:32:47.894Z · LW · GW · 7 comments

## Contents

- The Setup
- Iterative Chaining
- Iterative Classification
- Selecting on GPT Output Is RL
- 7 comments

*Epistemological Status: I've suspected this for a while. The mesa-optimization status of GPT seems to be folk-theory at the moment. It seems worthwhile to develop a prototype explanation of its abilities. Claims about the connection between replication and RL could be worked out to a higher level of clarity using tropical algebra, but that most likely is overkill so I maintain a certain level of informality throughout.*

I want to start by exploring a simple question and the implication of a positive result. Can GPT predict the quality of its output? I'm going to focus on an idealized setting where I assume,

- GPT can be treated as an accurate language model.
- GPT's context window is wide enough to learn the relevant task distributions we're interested in.
- Evaluating the quality of output is easier for GPT than producing the given output for a given input.

I'll show that evaluation can be used to boost the output of this idealized version of GPT. Moreover, this boosting can be interpreted more generally as a natural selection process. As I've argued previously, a natural selection process maps cleanly onto RL in the limit. Because OpenAI trains GPT using log-probability, we can directly interpret the model as a replication process for special types of tasks that I introduce as evaluable recurrent tasks.

## The Setup

Empirically, we've seen that GPT can do quite a few things. One could interpret the above assumptions as an attempt to try and explain the empirical success. On the wiki-page for GPT we have two interesting entries I'm going to try and explain in more depth. The first is iterative chaining. The second is the survey trick.

Iterative chaining suggests that,

> For example, suppose you've got a set of news headlines but don't know what the labels should be. That's fine! Just start with some prompts, let the API figure out how it wants to label them with your initial prompts as guidance, and then feed the API-labelled prompts back into the API for those and new headlines letting the labels evolve over time.
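The loop described in that quote can be sketched roughly as follows. This is only an illustration under stated assumptions: `complete` stands in for whatever language-model call is available, and `toy_complete`, `build_prompt`, and the `Headline:`/`Label:` prompt layout are hypothetical names and formats that do not come from the wiki.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (headline, label)


def build_prompt(examples: List[Example], headline: str) -> str:
    """Format labelled examples plus one unlabelled headline as a few-shot prompt."""
    shots = "".join(f"Headline: {h}\nLabel: {l}\n\n" for h, l in examples)
    return f"{shots}Headline: {headline}\nLabel:"


def iterative_chaining(
    complete: Callable[[str], str],
    seed_examples: List[Example],
    headlines: List[str],
    max_context: int = 8,
) -> List[Example]:
    """Label headlines one by one, feeding model-produced labels back in as context."""
    context = list(seed_examples)
    labelled: List[Example] = []
    for headline in headlines:
        label = complete(build_prompt(context, headline)).strip()
        labelled.append((headline, label))
        # The model's own label becomes guidance for later headlines.
        context = (context + [(headline, label)])[-max_context:]
    return labelled


def toy_complete(prompt: str) -> str:
    """Stand-in for a real language-model call (an assumption, not an actual API)."""
    last_headline = prompt.rsplit("Headline: ", 1)[1].split("\n", 1)[0]
    return " sports" if "match" in last_headline.lower() else " other"


result = iterative_chaining(
    toy_complete,
    seed_examples=[("Local team wins match", "sports")],
    headlines=["Cup match postponed", "Markets rally"],
)
print(result)
```

The `max_context` cap is a design choice of this sketch: it keeps the prompt within a fixed budget by dropping the oldest examples, which is one simple way the labels could "evolve over time".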

The survey trick suggests that we can insist GPT qualify certainty in its output on a scale from 1-5 or 1-7. Now, in reality, GPT doesn't work well with numbers so there are still some inconveniences to work out on that front. However, at the moment survey-scales such as,

> Not at all, a little, a moderate amount, a lot, or a great deal?

seem to work. Technically speaking, because of assumption one we never actually need to observe this output. Instead, at the end of each response, we can directly calculate the probabilities of the various evaluations from GPT.
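As a rough sketch of what "directly calculating the probabilities of the various evaluations" could look like: suppose we can read off the model's log-probability for each survey label as the continuation of the response. The numbers in `logprobs` below are invented for illustration, and `evaluation_distribution` and `expected_score` are hypothetical helper names, not part of any API.

```python
import math
from typing import Dict

SCALE = ["Not at all", "A little", "A moderate amount", "A lot", "A great deal"]


def evaluation_distribution(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize the log-probabilities of the scale labels into a distribution."""
    m = max(label_logprobs.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def expected_score(dist: Dict[str, float]) -> float:
    """Expected position on the 1-5 scale under the evaluation distribution."""
    return sum((SCALE.index(label) + 1) * p for label, p in dist.items())


# Hypothetical log-probabilities of each label following some response.
logprobs = {
    "Not at all": -5.0,
    "A little": -4.0,
    "A moderate amount": -2.5,
    "A lot": -1.2,
    "A great deal": -0.9,
}
dist = evaluation_distribution(logprobs)
score = expected_score(dist)
print(dist, score)
```

Restricting to the five labels and renormalizing discards whatever probability mass the model puts on other continuations, which is an assumption of this sketch rather than something the survey trick requires.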

## Iterative Chaining

The main advantage of doing something like this is that we can effectively hack in conditioning on the tail of an output. For example, if we only take outputs that are evaluated with a "great deal" of confidence we end up conditioning on high confidence outputs.

In a previous post, some machinery is introduced to think about structured tasks that we might give a language model. The gist is that I introduce something called a recurrent task so that I can think of tasks as being properly sampled from a distribution. You can likely avoid reading this if you think of recurrent tasks as random samples of query/answer pairs from a distribution.

Say we have a mapping that we can use to order query/answer pairs with. Consider the recurrent task that consists of mapping queries to answers and the evaluation task that consists of mapping query/answer pairs to evaluations. The existence of this mapping makes the recurrent task an evaluable recurrent task. We want to show it's reasonable to wonder whether or not these two tasks are enough to get GPT to learn to accurately sample from the task distribution.

## Iterative Classification

It'll be easier if we consider an example first. Say we have a recurrent task implied by and we want GPT to match the distribution. Specifically, we're outputting answers to factual questions. Thus, we can evaluate whether or not an answer is correct. Given this, we use some amount of context to condition the model and then test GPT. So we evaluate . Note that if we gave examples we'd have and that .

Our basic problem is that GPT zero-shot does poorly. It has an error rate of $\epsilon_0$. However, on the evaluation task, the probability that it lets through a true/false positive is $p$ / $q$. Say we let the evaluation task manage the recurrent task in the following sense:

- Allow GPT to answer the next query.
- Allow GPT to predict the evaluation.
- If the evaluation returns TRUE, append the q/a pair to a buffer.
- If the buffer is large enough, append it to the context and repeat.

Will this algorithm work? Yes, as long as $p > q$. The probability of a false positive append is $\epsilon_0 q$. The probability of a true positive append is $(1 - \epsilon_0) p$. In expectation the proportion of appends that will be false positives will be,

$$\epsilon_1 = \frac{\epsilon_0 q}{\epsilon_0 q + (1 - \epsilon_0) p}$$

Why do I call this $\epsilon_1$? Well, once we start appending the positive answers to the context GPT will adapt its error rate. The rate of improvement is likely difficult (impossible) to analytically figure out. However, if I invoke assumption one then after enough examples are appended GPT will figure out that sometimes it should output correct answers. This gets us to $\epsilon_1$. Naturally, we set up a recurrence,

$$\epsilon_{n+1} = \frac{\epsilon_n q}{\epsilon_n q + (1 - \epsilon_n) p}$$

Since $p > q$ we have,

$$\frac{\epsilon_{n+1}}{1 - \epsilon_{n+1}} = \frac{q}{p} \cdot \frac{\epsilon_n}{1 - \epsilon_n} \quad \Longrightarrow \quad \epsilon_n \to 0$$

You could get into hairy details on convergence rate, but hopefully there's intuition for how this process ends up working. If you keep going in this direction you work towards a description of boosting.
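To get a feel for the filtering argument, here is a small numerical sketch. The notation is an assumption of the sketch: `eps` is the current error rate of the answering task, `p` is the probability the evaluator accepts a correct answer, and `q` is the probability it accepts a wrong one. The idealized step, following assumption one, is that after a round GPT's error rate matches the fraction of wrong answers in the appended buffer.

```python
import random
from typing import List


def next_error_rate(eps: float, p: float, q: float) -> float:
    """Expected fraction of accepted (appended) answers that are wrong."""
    return eps * q / (eps * q + (1.0 - eps) * p)


def iterate_error_rate(eps0: float, p: float, q: float, rounds: int) -> List[float]:
    """Apply the idealized recurrence for a number of rounds."""
    rates = [eps0]
    for _ in range(rounds):
        rates.append(next_error_rate(rates[-1], p, q))
    return rates


def simulate_append_error(eps: float, p: float, q: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the wrong fraction among accepted answers."""
    rng = random.Random(seed)
    wrong = accepted = 0
    for _ in range(trials):
        is_wrong = rng.random() < eps                 # the answer is wrong
        accept = rng.random() < (q if is_wrong else p)  # the evaluator says TRUE
        if accept:
            accepted += 1
            wrong += is_wrong
    return wrong / accepted


rates = iterate_error_rate(eps0=0.6, p=0.8, q=0.3, rounds=10)
estimate = simulate_append_error(0.6, 0.8, 0.3, trials=200_000)
print(rates)
print(estimate)
```

With these illustrative numbers the first step takes the error rate from 0.6 to 0.36, and the odds of an error shrink by a factor of `q / p` every round, so the sequence falls toward zero whenever `p > q`. How quickly a real model would track the buffer is exactly the part this sketch assumes away.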

## Selecting on GPT Output Is RL

**In general, if the output from GPT is cherry-picked then GPT does a deformed version of reinforcement learning.** We just saw that somehow the language model is able to iteratively bootstrap itself by conditioning over a special type of recurrent task. Ultimately, this explanation will serve to explain the success of generative pretraining. Let's get into more detail about the process.

In the original paper for GPT, OpenAI's goal was to estimate the log probability of the next token given a context as a form of pre-training for downstream language tasks. Equivalently, we want to maximize the log probability of text streams,

$$L(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

When we sample from the model we commonly will use a temperature parameter $T$ that converts the log-probability back into regular probability,

$$P_T(u) \propto \exp\left(\frac{\log P(u)}{T}\right)$$

where I ignore the normalization factor. So then according to the thermodynamic interpretation, we actually have a population of attempts at matching the target distribution that show up according to their, now weighted, probability under GPT. As we send the temperature to zero the only policies that survive are the ones that stick to the classical MLE critical path.
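A small sketch of temperature reweighting may help here. Dividing log-probabilities by a temperature and renormalizing is the same as raising each probability to the power `1 / T`; the function name `apply_temperature` and the three-outcome distribution are made up for illustration.

```python
import math
from typing import List


def apply_temperature(probs: List[float], temperature: float) -> List[float]:
    """Return probs ** (1 / temperature), renormalized to sum to one.

    Assumes at least one entry of probs is positive.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.5, 0.3, 0.2]
warm = apply_temperature(base, 1.0)   # temperature one leaves the distribution alone
cold = apply_temperature(base, 0.05)  # low temperature concentrates on the mode
print(warm, cold)
```

At temperature one the distribution is unchanged, while at low temperature nearly all of the mass collects on the most likely outcome, which is the sense in which only the maximum-likelihood path survives as the temperature goes to zero.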

The reason we're introducing all of this machinery is so that we can understand what happens when the user cherry-picks. The beauty of recurrent tasks here is that they are being sampled from a distribution, which means that we have finite length episodes. Let's hypothesize that the human overseer has a slightly different criterion than the one implied by that distribution, and that they want to select for it. This criterion is the mapping from above that orders query / answer pairs; however, now we think of it as assigning probability to outputs.

Alternatively, we can think of this mapping as a model of the probability that a human overseer will cherry-pick a given output. The trick is that we can convert it into a soft-selection mechanism by allowing pairs to reproduce with that probability. The new probability model is special in that it is history independent across multiple query / answer pairs (permutation invariant). This means that it can be used to augment the recurrent task into something we've been calling an evaluable recurrent task.

First, note that enforcing the selection criteria will eventually *adapt* GPT to them. This is because when we have an output we allow it to survive with the probability the criteria assign to it. By assumption one, GPT can encode the criteria. Thus, as we start adding surviving outputs to the context, GPT will adapt to them. The normal way of saying this is that cherry-picking rollouts conditions GPT for future queries.
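The survival step can be illustrated with a toy calculation. If outputs are drawn with probability `model_probs[x]` and each is kept with probability `keep_probs[x]`, the survivors are distributed in proportion to the product of the two. All names and numbers below are hypothetical and only meant to show the reweighting.

```python
import random
from collections import Counter
from typing import Dict


def selected_distribution(model_probs: Dict[str, float], keep_probs: Dict[str, float]) -> Dict[str, float]:
    """Exact distribution of survivors: proportional to model_probs * keep_probs."""
    weights = {k: model_probs[k] * keep_probs[k] for k in model_probs}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def sample_survivors(model_probs: Dict[str, float], keep_probs: Dict[str, float], n: int, seed: int = 0) -> Dict[str, float]:
    """Empirical distribution of outputs that survive soft selection."""
    rng = random.Random(seed)
    outputs = list(model_probs)
    weights = [model_probs[o] for o in outputs]
    survivors: Counter = Counter()
    for _ in range(n):
        out = rng.choices(outputs, weights=weights)[0]  # the model proposes an output
        if rng.random() < keep_probs[out]:              # the overseer cherry-picks it
            survivors[out] += 1
    total = sum(survivors.values())
    return {o: survivors[o] / total for o in outputs}


model_probs = {"good": 0.2, "okay": 0.5, "bad": 0.3}
keep_probs = {"good": 0.9, "okay": 0.3, "bad": 0.05}
exact = selected_distribution(model_probs, keep_probs)
empirical = sample_survivors(model_probs, keep_probs, n=100_000)
print(exact, empirical)
```

In this example the "good" output rises from 20% of proposals to a little over half of the survivors. Whatever is appended to the context is therefore drawn from the reweighted distribution, not from the model's original one.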

Second, this implies that having an appropriate prompt or answers for the queries is a sufficient, but not necessary, condition for getting good results, in terms of the selection criteria, from GPT. While waiting for adaption to occur may be slow, in terms of rollouts, it will still happen. In essence, examples just start us further along in the process.

By assumption three, we can design co-recurrent tasks that evaluate query/answer pairs for a given recurrent task. Putting everything together, we conclude that sufficiently complex language models can oversee their own adaption to a general class of tasks I'm calling evaluable recurrent tasks.

The kicker is that, as I showed in a previous post, selection pressure on a distribution of mixing replicators leads to reinforcement learning as we take the temperature to zero. We can study the population dynamics of viral strains with the mutation matrix using, Let's switch to the MDP setting. Assume the actions of the strain(s) have a deterministic effect on the environment transitions. We're going to interpret the rewards as a fitness score allowing the agent to continue propagating. First, let the transition matrix for the system be given as . Second, transform each reward to the fitness . Note that up to scaling this is identical to what we do with log-probability. The Quasispecies formula relates the population of individuals at each state after one stage of replication after we set .

The individuals aren't intelligent. Instead, the fitness controls the replication rate of transitions. If then the transition is neutral and the number of individuals collected on a state is neither amplified nor diminished. If then the transition is extremely harmful and if the transition is extremely helpful.

Notice that we can study the space of all possible transitions and conclude that,
To make further progress, remember that actions have deterministic effects so it's okay to assume individuals are fully random in their exploration. This allows us to simplify the product into,
In words, we have decomposed the evolution of the population into a summation over all the paths the strains could take through the system. The twist is that each path is weighted by an exponential term proportional to the reward that path receives from the environment. Philosophically, this has the same spirit as the path integral approach used in physics. If we send in the path integral, this is the thermodynamic limit, we'll get back the equations for classical motion. The claim is that the dynamics *reinforce* only the optimal paths in this limit.
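A toy version of the path-sum picture can be computed directly. Everything in this sketch is an assumption made for illustration: the rewards depend only on the time step and the action taken, exploration is uniform so every path starts with equal weight, and the fitness of a path is taken to be `exp(R / T)`, where `R` is the path's total reward and `T` is the temperature.

```python
import itertools
import math
from typing import Dict, List, Tuple

Path = Tuple[int, ...]


def path_weights(step_rewards: List[List[float]], temperature: float) -> Dict[Path, float]:
    """Normalized weight exp(R(path) / T) for every action sequence."""
    paths = list(itertools.product(*[range(len(r)) for r in step_rewards]))
    totals = [sum(step_rewards[t][a] for t, a in enumerate(path)) for path in paths]
    best = max(totals)  # shift by the best total for numerical stability
    raw = [math.exp((r - best) / temperature) for r in totals]
    z = sum(raw)
    return {path: w / z for path, w in zip(paths, raw)}


# Three time steps, two actions each; entry [t][a] is the reward for action a at step t.
step_rewards = [[0.0, 1.0], [0.5, 0.0], [0.0, 2.0]]
hot = path_weights(step_rewards, temperature=100.0)
cold = path_weights(step_rewards, temperature=0.05)
best_path = max(cold, key=cold.get)
print(best_path, cold[best_path], max(hot.values()))
```

At high temperature the eight paths carry nearly equal weight, while at low temperature almost all of the population sits on the single highest-reward path. That is the sense in which the dynamics reinforce only the optimal paths in the limit.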

It's precisely because GPT uses log-probability / temperature sampling that the mapping is so clean. GPT outputs are already probabilistic. Thus, the replication probabilities are *real*. In our context, we sample from until GPT learns how to reproduce the distribution we have in mind. Humans or GPT itself are the selectors in this process. Moreover, ultimately, we don't even *want* to do RL because we want to sample from . Instead, we end up with a quasi-species cloud of viable samples. The beauty of this interpretation is that it implies that selection-pressure (reward function) is equivalent to modifying the underlying replication rates.

People have correctly pointed out that a *single* instance of GPT cannot learn. I'm not addressing that claim here. Instead, I'm suggesting that if we follow out the math a population of GPT under selection pressure from one another can constructively adapt to evaluable recurrent tasks. With those qualifications, let's be speculative. Define a collection of interfacing GPT- as -G-. Assume the GPT can model interfaces to our computers. I conjecture that you can draw a convex phase-diagram where being in the epigraph is sufficient for -G- to **amplify** itself to any -G-.

## 7 comments


## comment by brockmanmatt · 2020-08-02T19:06:28.266Z · LW(p) · GW(p)

Ah, sorry, I forgot to add a link to how to evolve the labels. There's a couple different methods in http://gptprompts.wikidot.com/context-stuffing if that helps.

## comment by SDM · 2020-08-03T11:46:48.396Z · LW(p) · GW(p)

Appending a reward *modelling* system to GPT-2 directly has already been done - humans were asked to select from among GPT-2 outputs according to some criteria, a reward model was trained on the human selections, and then was applied to train GPT-2. Based on what you've just said, this method is just a much faster, more efficient way of getting a GPT to adapt to perform a recurrent task (since it uses a reward model trained on a few examples of human evaluation, instead of waiting for GPT to adapt by itself to many human selections as you suggest).

> We have demonstrated RL fine-tuning of language models to four NLP tasks: stylistic continuation with high sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets. Rather than building task-specific techniques, we achieve our results by straightforwardly applying reward learning to language generation.
>
> We extend previous reward learning work with pretrained models and KL regularization to prevent the policy from diverging too far from natural language. Our results are mixed. On the continuation tasks we achieve good results vs. the zero-shot baseline as evaluated by humans with very few samples: 2.5k for sentiment and 5k for descriptiveness. However, for both summarization tasks our policies are only “smart copiers” (extractive rather than abstractive): they copy from the input text but skip over irrelevant preamble.

No-one has done this reward modelling technique for GPT-3 yet, but it should be trivial since the exact method used for GPT-2 should work. The method notably didn't work as well when used to improve GPT-2 output on more complicated tasks (good on sentiment biasing, mediocre on summarization), but that's because GPT-2 wasn't coherent enough over long enough ranges to properly evaluate the rewards from a reward model representing some complex task or concept. With GPT-3, you might be able to use the reward modelling method to get it to focus on more complicated concepts, or get it to be more 'factually accurate and on-topic'. If you had the humans evaluate 'accurate and on-topic' and built up such a reward model that might be a way to 'bring out' the knowledge GPT-3 has but sometimes doesn't use. I think it would be just like this, but with the reward model helping you get more mileage out of each q/a pair in your buffer by generalising over it a bit:

> - Allow GPT to answer the next query.
> - Allow GPT to predict the evaluation.
> - If the evaluation returns TRUE, append the q/a pair to a buffer.
> - If the buffer is large enough, append it to the context and repeat.

Perhaps you'd run into trouble needing a complicated or sophisticated reward model to get much extra mileage out of each new query, but given that it already worked with GPT-2 on simple tasks it might do well with GPT-3 on complex tasks. Essentially, everything you said - except we already have solid evidence that big parts of it can be automated and therefore likely achieved quicker than would otherwise be expected.

Replies from: zachary-robertson

## ↑ comment by Zachary Robertson (zachary-robertson) · 2020-08-03T13:32:47.244Z · LW(p) · GW(p)

This paper looks interesting. My understanding is that this paper implemented a form of fine-tuning. However, learning the reward function does not seem to be few-shot whereas GPT3 does few-shot pretty well. That’s the main difference here as I see it.

It seems like there’s slow adaption (this paper) which is useful for more complicated tasks and fast adaption (the method here) that is useful for disposable tasks. I’d think a combination of both approaches is needed. For example, a module that tracks repeatedly occurring tasks can start a larger buffer to perform slow adaption.

Perhaps on a meta-level fine tuning GPT3 to few-shot inverse reinforcement learning would be an example of what could be possible with combining both approaches?

## comment by Jan Rzymkowski (jan-rzymkowski) · 2020-08-02T23:40:04.178Z · LW(p) · GW(p)

There is a huge leap between a procedure allowing a predictive model to iteratively decrease False Positive Rate and having an AGI.

Replies from: zachary-robertson

## ↑ comment by Zachary Robertson (zachary-robertson) · 2020-08-02T23:56:24.439Z · LW(p) · GW(p)

Correct. That’s why the sections on classification and RL are separate. Classification tasks are a subclass of RL. A recurrent task need not be a classification task. In fact, I’d go further and say there’s *still* a huge difference between having an agent that can do RL and having an AGI. That’s why I put such speculation at the end.

Having said all that, it seems plausible to me that a language model might be able to reason about what modules it needs and then design them. I implicitly believe this to be the case, but perhaps I could’ve been more explicit. This is more of an empirical question, but if that were possible the difference between that model and AGI would become much smaller in my opinion.

## comment by dvasya · 2020-08-02T18:49:27.245Z · LW(p) · GW(p)

As I've argued previously [? · GW], a natural selection process maps cleanly onto RL in the limit.

The URL is broken (points to edit page)

Replies from: zachary-robertson

## ↑ comment by Zachary Robertson (zachary-robertson) · 2020-08-02T18:54:19.726Z · LW(p) · GW(p)

Hopefully that's fixed! I wrote this as quickly as possible so there may be many tiny errors. Apologies. Let me know if anything else is wrong.