# Learning the prior

post by paulfchristiano · 2020-07-05T21:00:01.192Z · LW · GW · 18 comments

## Contents

• Warm-up 1: human forecasting
• Warm-up 2: predicting human forecasts
• Learning the human prior
• Notes & elaborations
• Going beyond the human prior
• Representing Z
• Jointly optimizing Mz and f
• Outlook


Suppose that I have a dataset D of observed (x, y) pairs, and I’m interested in predicting the label y* for each point x* in some new set D*. Perhaps D is a set of forecasts from the last few years, and D* is a set of questions about the coming years that are important for planning.

The classic deep learning approach is to fit a model f on D, and then predict y* using f(x*).

This approach implicitly uses a somewhat strange prior, which depends on exactly how I optimize f. I may end up with the model with the smallest l2 norm, or the model that’s easiest to find with SGD, or the model that’s most robust to dropout. But none of these are anywhere close to the “ideal” beliefs of a human who has updated on D.

This means that neural nets are unnecessarily data hungry, and more importantly that they can generalize in an undesirable way. I now think that this is a safety problem, so I want to try to attack it head on by learning the “right” prior, rather than attempting to use neural nets as an implicit prior.

#### Warm-up 1: human forecasting

If D and D* are small enough, and I’m OK with human-level forecasts, then I don’t need ML at all.

Instead I can hire a human to look at all the data in D, learn all the relevant lessons from it, and then spend some time forecasting y* for each x*.

Now let’s gradually relax those assumptions.

#### Warm-up 2: predicting human forecasts

Suppose that D* is large but that D is still small enough that a human can extract all the relevant lessons from it (or that for each x* in D*, there is a small subset of D that is relevant).

In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.

The generalization is now coming entirely from human beliefs, not from the structure of the neural net; we are only applying neural nets to iid samples from D*.
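As a minimal sketch of this warm-up, assuming a toy one-dimensional question space: `human_forecast` is a hypothetical stand-in for paying a person who has already studied D, and a closed-form least-squares fit stands in for the neural net f. Neither name comes from the post.

```python
import random

def human_forecast(x):
    # Hypothetical stand-in for a paid human forecaster who has already studied D.
    return 2.0 * x + 1.0

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b, in closed form.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return lambda x: a * x + b

rng = random.Random(0)
d_star = [rng.uniform(0.0, 10.0) for _ in range(1000)]  # questions x* in D*
labeled = rng.sample(d_star, 50)                        # random subset sent to humans
f = fit_linear(labeled, [human_forecast(x) for x in labeled])
predictions = [f(x) for x in d_star]                    # f imitates the human on the rest of D*
```

The model is only ever asked about points drawn from the same pool it was trained on, which is the sense in which the fit itself is not doing any cross-distribution generalization.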

### Learning the human prior

Now suppose that D is large, such that a human can’t update on it themselves. Perhaps D contains billions of examples, but we only have time to let a human read a few pages of background material.

Instead of learning the unconditional human forecast P(y|x), we will learn the forecast P(y|x, Z), where Z is a few pages of background material that the human takes as given. We can also query the human for the prior probability Prior(Z) that the background material is true.

Then we can train f(y|x, Z) to match P(y|x, Z), and optimize Z* for:

log Prior(Z*) + sum((x, y) ~ D) log f(y|x, Z*)

We train f in parallel with optimizing Z*, on inputs consisting of the current value of Z* together with questions x sampled from D and D*.

For example, Z might specify a few explicit models for forecasting and trend extrapolation, a few important background assumptions, and guesses for a wide range of empirical parameters. Then a human who reads Z can evaluate how plausible it is on its face, or they can take it on faith in order to predict y* given x*.

The optimal Z* is then the set of assumptions, models, and empirical estimates that works best on the historical data. The human never has to reason about more than one datapoint at a time — they just have to evaluate what Z* implies about each datapoint in isolation, and evaluate how plausible Z* is a priori.
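A toy sketch of the objective above, under strong simplifying assumptions: Z ranges over three hypothetical candidates rather than pages of text, `prior` is a made-up table standing in for the human's Prior(Z), and `f` is a hand-written stand-in for the learned model of P(y|x, Z).

```python
import math

# Hypothetical candidates for Z: each one claims a probability that y = 1.
# In the post, Z would be pages of text, and these numbers would come from
# a human (Prior) and from a model trained to imitate a human (f).
prior = {"Z_low": 0.5, "Z_mid": 0.3, "Z_high": 0.2}
claimed_rate = {"Z_low": 0.2, "Z_mid": 0.5, "Z_high": 0.8}

def f(y, x, z):
    # Stand-in for f(y|x, Z): probability of label y for question x, taking Z as given.
    p = claimed_rate[z]
    return p if y == 1 else 1.0 - p

def score(z, data):
    # log Prior(Z) + sum over (x, y) in D of log f(y|x, Z)
    return math.log(prior[z]) + sum(math.log(f(y, x, z)) for x, y in data)

D = [(i, 1 if i % 5 else 0) for i in range(20)]  # 16 ones, 4 zeros
z_star = max(prior, key=lambda z: score(z, D))
```

Here `z_star` comes out as `"Z_high"`: it has the lowest prior of the three, but it explains the historical data well enough to win. Each term of the sum looks at one datapoint in isolation, mirroring what the human is asked to do.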

This approach has many problems. Two particularly important ones:

• To be competitive, this optimization problem needs to be nearly as easy as optimizing f directly on D, but it seems harder: finding Z* might be much harder than learning f, learning a conditional f might be much harder than learning an unconditional f, and jointly optimizing Z and f might present further difficulties.
• Even if it worked, our forecasts would only be “human-level” in a fairly restrictive sense: they wouldn’t even be as good as a human who actually spent years practicing on D before making a forecast on D*. To be competitive, we want the forecasts in the iid case to be at least as good as fitting a model directly.

I think the first point is an interesting ML research problem. (If anything resembling this approach ever works in practice, credit will rightly go to the researchers who figure out the precise version that works and resolve those issues, and this blog post will be a footnote.) I feel relatively optimistic about our collective ability to solve concrete ML problems, unless they turn out to be impossible. I’ll give some preliminary thoughts in the next section “Notes & elaborations.”

The second concern, that we need some way to go beyond human level, is a central philosophical issue and I’ll return to it in the subsequent section “Going beyond the human prior.”

#### Notes & elaborations

• Searching over long texts may be extremely difficult. One idea to avoid this is to try to have a human guide the search, by either generating hypotheses Z at random or sampling perturbations to the current value of Z. Then we can fit a generative model of that exploration process and perform search in the latent space (and also fit f in the latent space rather than having it take Z as input). That rests on two hopes: (i) learning the exploration model is easy relative to the other optimization we are doing, (ii) searching for Z in the latent space of the human exploration process is strictly easier than the corresponding search over neural nets. Both of those seem quite plausible to me.
• We don’t necessarily need to learn f everywhere; it only needs to be valid in a small neighborhood of the current Z. That may not be much harder than learning the unconditional f.
• Z represents a full posterior rather than a deterministic “hypothesis” about the world, e.g. it might say “R0 is uniform between 2 and 3.” What I’m calling Prior(Z) is really the KL between the prior and Z, and P(y|x,Z) will itself reflect the uncertainty in Z. The motivation is that we want a flexible and learnable posterior. (This is particularly valuable once we go beyond human level.)
• This formulation queries the human for Prior(Z) before each fitness evaluation. That might be fine, or you might need to learn a predictor of that judgment. It might be easier for a human to report a ratio Prior(Z)/Prior(Z′) than to give an absolute prior probability, but that’s also fine for optimization. I think there are a lot of difficulties of this flavor that are similar to other efforts to learn from humans.
• For the purpose of studying the ML optimization difficulties I think we can basically treat the human as an oracle for a reasonable prior. We will then need to relax that rationality assumption in the same way we do for other instances of learning from humans (though a lot of the work will also be done by our efforts to go beyond the human prior, described in the next section).
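To illustrate the remark above that a ratio Prior(Z)/Prior(Z′) is enough for optimization, here is a small sketch using greedy local search (a deliberately simple optimizer swapped in for illustration, not a method from the post). `prior_ratio` is a hypothetical stand-in for the human's comparative judgment, and Z is reduced to a single number claiming "y = 1 with probability z".

```python
import math
import random

def prior_ratio(z_new, z_old):
    # Hypothetical stand-in for a human judging Prior(z_new) / Prior(z_old).
    # Here: an unnormalized preference for rates near 0.5.
    weight = lambda z: math.exp(-((z - 0.5) ** 2) / 0.1)
    return weight(z_new) / weight(z_old)

def log_likelihood(z, data):
    # Sum of log f(y|x, Z) for a Z that claims "y = 1 with probability z".
    return sum(math.log(z if y == 1 else 1.0 - z) for y in data)

def hill_climb(data, z=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        proposal = min(0.99, max(0.01, z + rng.gauss(0.0, 0.05)))
        # Change in the objective; only the prior ratio is needed, never Prior(Z) itself.
        gain = (math.log(prior_ratio(proposal, z))
                + log_likelihood(proposal, data) - log_likelihood(z, data))
        if gain > 0:
            z = proposal
    return z

data = [1] * 16 + [0] * 4
z_star = hill_climb(data)
```

Because the objective only ever enters through differences between a proposal and the current value, the normalizing constant of the prior cancels, so the search settles between the prior's preferred 0.5 and the empirical rate of 0.8.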

### Going beyond the human prior

How do we get predictions better than explicit human reasoning?

We need to have a richer latent space Z, a better Prior(Z), and a better conditional P(y|x, Z).

Instead of having a human predict y given x and Z, we can use amplification or debate to train f(y|x, Z) and Prior(Z). This allows Z to be a large object that cannot be directly accessed by a human.

For example, Z might be a full library of books describing important facts about the world, heuristics, and so on. Then we may have two powerful models debating “What should we predict about x, assuming that everything in Z is true?” Over the course of that debate they can cite small components of Z to help make their case, without the human needing to understand almost anything written in Z.

In order to make this approach work, we need to do a lot of things:

1. We still need to deal with all the ML difficulties described in the preceding section.
2. We still need to analyze debate/amplification, and now we’ve increased the problem difficulty slightly. Rather than merely requiring them to produce the “right” answers to questions, we also need them to implement the “right” prior. We already needed to implement the right prior as part of answering questions correctly, so this isn’t too much of a strengthening, but we are calling attention to a particularly challenging case. It also imposes a particular structure on that reasoning which is a real (but hopefully slight) strengthening.
3. Entangled with the new analysis of amplification/debate, we also need to ensure that Z is able to represent a rich enough latent space. I’ll discuss implicit representations of Z in the next section “Representing Z.”
4. Representing Z implicitly and using amplification or debate may make the optimization problem even more difficult. I’ll discuss this in the subsequent section “Jointly optimizing Mz and f.”

#### Representing Z

I’ve described Z as being a giant string of text. If debate/amplification work at all then I think text is in some sense “universal,” so this isn’t a crazy restriction.

That said, representing complex beliefs might require very long text, perhaps many orders of magnitude larger than the model f itself. That means that optimizing for (Z, f) jointly will be much harder than optimizing for f alone.

The approach I’m most optimistic about is representing Z implicitly as the output of another model Mz. For example, if Z is a text that is trillions of words long, you could have Mz output the ith word of Z on input i.

(To be really efficient you’ll need to share parameters between f and Mz but that’s not the hard part.)

This can get around the most obvious problem (that Z is too long to possibly write down in its entirety), but I think you actually have to be pretty careful about the implicit representation or else we will make Mz’s job too hard (in a way that will be tied up with the competitiveness of debate/amplification).

In particular, I think that representing Z as implicit flat text is unlikely to be workable. I’m more optimistic about the kind of approach described in approval-maximizing representations — Z is a complex object that can be related to slightly simpler objects, which can themselves be related to slightly simpler objects… until eventually bottoming out with something simple enough to be read directly by a human. Then Mz implicitly represents Z as an exponentially large tree, and only needs to be able to do one step of unpacking at a time.
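A rough sketch of the tree idea, assuming a fixed branching factor and depth chosen purely for illustration: `unpack` is a hypothetical stand-in for one call to Mz, and the summaries it returns are placeholders rather than real content.

```python
def unpack(path):
    # Hypothetical stand-in for one call to Mz: given the address of a node in Z,
    # return a short human-readable summary and the addresses of its children.
    # Nothing is stored; every node is computed on demand.
    summary = "claims about topic " + ("/".join(map(str, path)) or "root")
    children = [path + (i,) for i in range(10)] if len(path) < 12 else []
    return summary, children

def read(path):
    # Walk from the root to the requested node, one unpacking step at a time.
    node, calls = (), 0
    for step in path:
        _, children = unpack(node)
        calls += 1
        node = children[step]
    summary, _ = unpack(node)
    return summary, calls + 1

summary, calls = read((3, 1, 4, 1, 5))
leaves = 10 ** 12  # number of leaves in the implicit tree at depth 12
```

Reading the node at depth five takes six calls to `unpack`, even though the tree implicitly has a trillion leaves; the cost of a lookup grows with depth, not with the size of Z.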

#### Jointly optimizing Mz and f

In the first section I discussed a model where we learn f(y|x, Z) and then use it to optimize Z. This is harder if Z is represented implicitly by Mz, since we can’t really afford to let f take Mz as input.

I think the most promising approach is to have Mz and f both operate on a compact latent space, and perform optimization in this space. I mention that idea in Notes & Elaborations above, but want to go into more detail now since it gets a little more complicated and becomes a more central part of the proposal.

(There are other plausible approaches to this problem; having more angles of attack makes me feel more comfortable with the problem, but all of the others feel less promising to me and I wanted to keep this blog post a bit shorter.)

The main idea is that rather than training a model Mz(·) which implicitly represents Z, we train a model Mz(·, z) which implicitly represents a distribution over Z, parameterized by a compact latent z.

Mz is trained by iterated amplification to imitate a superhuman exploration distribution, analogous to the way that we could ask a human to sample Z and then train a generative model of the human’s hypothesis-generation. Training Mz this way is itself an open ML problem, similar to the ML problem of making iterated amplification work for question-answering.

Now we can train f(y|x, z) using amplification or debate. Whenever we would want to reference Z, we use Mz(·, z). Similarly, we can train Prior(z). Then we choose z* to optimize log Prior(z*) + sum((x, y) ~ D) log f(y|x, z*).

Rather than ending up with a human-comprehensible posterior Z*, we’ll end up with a compact latent z*. The human-comprehensible posterior Z* is implemented implicitly by Mz(·, z*).
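For reference, the latent-space objective can be written out as below. This is only a restatement in LaTeX, assuming that "sum((x, y) ~ D)" denotes a sum over the pairs in D.

```latex
\begin{aligned}
z^* &= \arg\max_{z}\Big[\log \mathrm{Prior}(z) + \sum_{(x,y)\in D} \log f(y \mid x, z)\Big],\\
Z^* &\text{ is represented implicitly by } M_z(\cdot, z^*).
\end{aligned}
```

It has the same form as the earlier objective over Z*, with the compact latent z taking the place of the explicit text.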

### Outlook

I think the approach in this post can potentially resolve the issue described in Inaccessible Information, which I think is one of the largest remaining conceptual obstacles for amplification/debate. So overall I feel very excited about it.

Taking this approach means that amplification/debate need to meet a slightly higher bar than they otherwise would, and introduces a bit of extra philosophical difficulty. It remains to be seen whether amplification/debate will work at all, much less whether they can meet this higher bar. But overall I feel pretty excited about this outcome, since I was expecting to need a larger reworking of amplification/debate.

I think it’s still very possible that the approach in this post can’t work for fundamental philosophical reasons. I’m not saying this blog post is anywhere close to a convincing argument for feasibility.

Even if the approach in this post is conceptually sound, it involves several serious ML challenges. I don’t see any reason those challenges should be impossible, so I feel pretty good about that — it always seems like good news when you can move from philosophical difficulty to technical difficulty. That said, it’s still quite possible that one of these technical issues will be a fundamental deal-breaker for competitiveness.

My current view is that we don’t have candidate obstructions for amplification/debate as an approach to AI alignment, though we have a lot of work to do to actually flesh those out into a workable approach. This is a more optimistic place than I was at a month ago when I wrote Inaccessible Information.

Learning the prior was originally published in AI Alignment on Medium.

comment by ESRogs · 2020-07-05T23:24:36.589Z · LW(p) · GW(p)

> In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.
>
> The generalization is now coming entirely from human beliefs, not from the structure of the neural net; we are only applying neural nets to iid samples from D*.

Perhaps a dumb question, but don't we now have the same problem at one remove? The model for predicting what the human would predict would still come from a "strange" prior (based on the l2 norm, or whatever).

Does the strangeness just get washed out by the one layer of indirection? Would you ever want to do two (or more) steps, and train a model to predict what a human would predict a human would predict?

comment by paulfchristiano · 2020-07-06T03:27:22.524Z · LW(p) · GW(p)

The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.

comment by ESRogs · 2020-07-06T07:17:53.187Z · LW(p) · GW(p)

Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you're testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-06T12:48:56.073Z · LW(p) · GW(p)

Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are going to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can't tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/10N, so that probably it'll make it through training just fine and then still get to cause catastrophe.
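The arithmetic behind this worry can be checked with a short sketch. It assumes that "probability 1/10N" means 1/(10N) per answer and that answers are independent; the function name and N = 1000 are chosen only for illustration.

```python
import math

def survival_and_defection(n_train, n_deploy, p):
    # Probability of no catastrophic answer during training,
    # and of at least one catastrophic answer during deployment.
    pass_training = (1.0 - p) ** n_train
    defect_in_deploy = 1.0 - (1.0 - p) ** n_deploy
    return pass_training, defect_in_deploy

N = 1000
pass_training, defect_in_deploy = survival_and_defection(N, 10 * N, 1.0 / (10 * N))
```

Under these assumptions the policy passes training unnoticed with probability about exp(-0.1) ≈ 0.90, and defects at least once in deployment with probability about 1 - exp(-1) ≈ 0.63.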

comment by paulfchristiano · 2020-07-08T00:46:38.228Z · LW(p) · GW(p)

That's right---you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here's a post reviewing my best guesses).

But before, you weren't even in the game: it wouldn't matter how well adversarial training worked, because you didn't even have the knowledge to tell whether a given behavior is good or bad. You weren't even getting the right behavior on average.

(In the OP I think the claim "the generalization is now coming entirely from human beliefs" is fine; I meant generalization from one distribution to another. "Neural nets are fine" was sweeping these issues under the rug. Though note that in the real world the distribution will change from neural net training to deployment; it's just exactly the normal robustness problem. The point of this post is just to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training; the reason to set it up as in the OP was to make the issue more clear.)

comment by evhub · 2020-07-06T19:08:07.545Z · LW(p) · GW(p)

I agree with Daniel. Certainly training on actual iid samples from the deployment distribution helps a lot—as it ensures that your limiting behavior is correct—but in the finite data regime you can still find a deceptive model that defects some percentage of the time.

comment by ESRogs · 2020-07-06T18:25:33.497Z · LW(p) · GW(p)

This is a good question, and I don't know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.

comment by paulfchristiano · 2020-07-08T00:57:43.765Z · LW(p) · GW(p)

Yeah, that's my view.

comment by ESRogs · 2020-07-08T03:07:06.218Z · LW(p) · GW(p)

Thanks for confirming.

comment by ofer · 2020-07-07T18:55:17.679Z · LW(p) · GW(p)

I'm confused about this point. My understanding is that, if we sample iid examples from some dataset and then naively train a neural network with them, in the limit we may run into universal prior problems, even during training (e.g. an inference execution that leverages some software vulnerability in the computer that runs the training process).

comment by Nisan · 2020-07-06T03:54:00.815Z · LW(p) · GW(p)

In this case humans are doing the job of transferring from D to D*, and the training algorithm just has to generalize from a representative sample of D* to the test set.

comment by ESRogs · 2020-07-06T07:25:54.658Z · LW(p) · GW(p)

Thank you, this was helpful. I hadn't understood what was meant by "the generalization is now coming entirely from human beliefs", but now it seems clear. (And in retrospect obvious if I'd just read/thought more carefully.)

comment by wangscarpet · 2020-07-07T01:15:57.518Z · LW(p) · GW(p)

This is a good and valid question -- I agree, it isn't fair to say generalization comes entirely from human beliefs.

An illustrative example: suppose we're talking about deep learning, so our predicting model is a neural network. We haven't specified the architecture of the model yet. We choose two architectures, and train both of them from our subsampled human-labeled D* items. Almost surely, these two models won't give exactly the same outputs on every input, even in expectation. So where did this variability come from? Some sort of bias from the model architecture!

comment by rohinmshah · 2020-07-14T01:36:28.393Z · LW(p) · GW(p)

The Alignment Newsletter summary + opinion for this post is here.

comment by Charlie Steiner · 2020-07-06T01:02:13.667Z · LW(p) · GW(p)
> The motivation is that we want a flexible and learnable posterior.

-Paul Christiano, 2020

Ahem, back on topic, I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.

Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever. There will still be some human-incomprehensible bits that can be transmitted through Z (Because otherwise you'd need a discriminator so good that Z couldn't be superhuman), but at least the amount is sharply limited.

But I'm really lost on how you could hope to limit the f side of this dichotomy. Penalize it for understanding the world too well given a random Z? Now it just has an incentive to notice random Zs and "play dead." Somehow you want it not to do better by just becoming a catch-all model of the training data, even on the actual training data. This might be one of those philosophical problems, given that you're expecting it to interpret natural language passages, and the lack of bright line between "understanding natural language" and "modeling the world."

comment by paulfchristiano · 2020-07-06T03:30:40.809Z · LW(p) · GW(p)
> I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.

f is just predicting P(y|x, Z), it's not trying to model D. So you don't gain anything by putting facts about the data distribution in f---you have to put them in Z so that it changes P(y|x,Z).

> Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever.

The only thing Z does is get handed to the human for computing P(y|x,Z).

comment by Charlie Steiner · 2020-07-06T23:11:22.353Z · LW(p) · GW(p)

Ah, I think I see, thanks for explaining. So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions. Or maybe some amount of fine-tuning for "better" predictions by some method of eliciting its own standards, but not by actually comparing it to the ground truth.

This (along with eventually reading your companion post) also helps resolve the confusion I was having over what exactly was the prior in "learning the prior" - Z is just like a latent space, and f is the decoder from Z to predictions. My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.

Is that somewhat in the right ballpark?

comment by paulfchristiano · 2020-07-07T00:59:51.934Z · LW(p) · GW(p)
> So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions.

That's right, f is either imitating a human, or it's trained by iterated amplification / debate---in any case the loss function is defined by the human. In no case is f optimized to make good predictions about the underlying data.

> My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.

Z should always be a human-readable (or amplified-human-readable) latent; it will necessarily remain human-readable because it has no purpose other than to help a human make predictions. f is going to remain human-like because it's predicting what the human would say (or what the human-consulting-f would say etc.).

The amplified human is like the programming language of the universal prior, Z is like the program that is chosen (or slightly more precisely: Z is like a distribution over programs, described in a human-comprehensible way) and f is an efficient distillation of the intractable ideal.