# The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

post by Zachary Robertson (zachary-robertson) · 2021-05-20T20:57:20.118Z · LW · GW · 5 comments


*Epistemological Status: Correct, probably too technical for Less Wrong, but these results are interesting and relevant.*

There's been a flurry of posts on generalization recently. Some talk about simplicity priors [LW · GW] and others about the bias of SGD [LW · GW]. However, I think it's worth bringing up that there is already an alternative that works well enough to provide non-vacuous bounds for neural networks. Namely, I'm talking about the PAC-Bayes approach.

The twist I'll give in this brief note is to motivate the bound from a biological perspective and then derive the full PAC-Bayes bound. The first part relies on the error threshold for replication and the patching interpretation [LW · GW] of the KL-divergence. The second part relies on the Donsker-Varadhan variational characterization of the KL-divergence.

## Error Threshold

A replication of an object $O$ is an approximate copy (up to mutation). The environment determines whether or not an object gets to replicate via a fitness $f$. Suppose $x$ is an instruction set or parameterization of $O$. When $O$ replicates, $x$ is copied.

We begin with a prior distribution $P$ over instructions, and then the distribution changes to $Q$ as the population replicates. The amount of information needed to achieve the modification is the patching cost [LW · GW] and is equal to $D_{\mathrm{KL}}(Q \,\|\, P)$.
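As a concrete anchor, here is a minimal sketch of the patching cost for discrete distributions; the particular `P` and `Q` are made-up examples, not from the post.

```python
import numpy as np

def kl_divergence(q, p):
    """Patching cost D_KL(Q || P) in nats for discrete distributions.

    Measures the information needed to move from the prior P to the
    posterior Q. Assumes q is absolutely continuous w.r.t. p
    (q[i] > 0 implies p[i] > 0).
    """
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0  # terms with q[i] = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Hypothetical example: a uniform prior over four instructions that
# selection has concentrated onto the first two.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.60, 0.30, 0.05, 0.05]
print(kl_divergence(Q, P))  # ~0.419 nats of "patching" information
```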

Only information from the fitness landscape is useful for converging to the fittest individual(s). On the other hand, the amount of information available from the environment is equal to $-\log \mathbb{E}_P[f]$, the surprisal of survival. (For a pure selection update $Q(x) \propto P(x)\,f(x)$ with $f \le 1$, we have $D_{\mathrm{KL}}(Q \,\|\, P) = \mathbb{E}_Q[\log f] - \log \mathbb{E}_P[f] \le -\log \mathbb{E}_P[f]$.) Thus, transitions are favorable only when we have

$$D_{\mathrm{KL}}(Q \,\|\, P) \le -\log \mathbb{E}_P[f].$$

As an example, suppose that the space of instructions is the set of boolean strings of length $N$ and that mutation flips each bit in the string independently with probability $p$. Suppose that $x^*$ always survives and our fitness is simply whether or not the replicate survives. This means $f(x^*) = 1$ and a replicate's $f$ is a Bernoulli random variable with parameter $(1-p)^N$, the probability of an error-free copy. Then we have

$$D_{\mathrm{KL}}(Q \,\|\, P) \le -N \log(1-p) \approx Np.$$
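A quick numeric sanity check of this example (my own illustration; the genome length and mutation rate are invented):

```python
import numpy as np

N, p = 100, 0.01  # hypothetical genome length and per-bit mutation rate

# Probability that a replicate of x* is a perfect copy.
survival = (1 - p) ** N

# Information the environment supplies per replication, in nats.
info_available = -np.log(survival)  # = -N * log(1 - p)

print(info_available)  # ~1.005 nats
print(N * p)           # ~1.0, the small-p approximation N*p

# Any selection-driven update Q of the prior P can therefore move the
# distribution by at most ~1 nat per generation: D_KL(Q||P) <= ~1.005.
```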

## PAC-Bayes Bound

Now suppose that the instructions are parameterizations for some sort of learning model and that the fitness is the training metric. In particular, the survival rate could be a classification error metric. If we are given $n$ examples that induce an error rate of $\epsilon$, the survival probability is $(1-\epsilon)^n$ and we have the requirement

$$D_{\mathrm{KL}}(Q \,\|\, P) \le -n \log(1-\epsilon) \approx n\epsilon.$$

This implies that using a lot of information to update our estimate for the optimal model parameters will require high error rates.
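Rearranging the displayed requirement for the error rate makes the scaling explicit (my own restatement, not in the original):

```latex
% From D_KL(Q||P) <= -n log(1 - eps) ~= n * eps:
\epsilon \;\gtrsim\; \frac{D_{\mathrm{KL}}(Q \,\|\, P)}{n}
% If the information used grows linearly in n, the error rate stays
% bounded away from zero; only D_KL(Q||P) = o(n) permits eps -> 0.
```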

Specifically, a sublinear relationship between the information used and the data obtained is necessary for a low error rate if we are to avoid an error catastrophe: a situation where the model continues to update despite being at the optimum.

At this point it's natural to wonder if a sublinear relationship is sufficient. This is precisely the content of the PAC-Bayes theorem. To show this, first note that for any function $g$ we have

$$\mathbb{E}_Q[g(x)] \le D_{\mathrm{KL}}(Q \,\|\, P) + \log \mathbb{E}_P\!\left[e^{g(x)}\right].$$

As an aside, the astute reader might notice this as a Young inequality between dual functions, in which case we have the famous Donsker-Varadhan relation,

$$D_{\mathrm{KL}}(Q \,\|\, P) = \sup_{g}\left\{\mathbb{E}_Q[g(x)] - \log \mathbb{E}_P\!\left[e^{g(x)}\right]\right\}.$$

Either way, it's clearer now that if we take $g$ to be a scaled generalization error such as

$$g(x) = \lambda\left(L(x) - \hat{L}_S(x)\right), \qquad \lambda > 0,$$

where $L$ is the true risk and $\hat{L}_S$ is the empirical risk on a sample $S$ of $n$ examples, then the left-hand side of our Young inequality yields an empirical loss, which is similar to what we had before. The major difference is that we have a contribution from a cumulant function, which we can bound using the Markov and Hoeffding inequalities (for a loss bounded in $[0,1]$): with probability at least $1-\delta$ over the sample,

$$\log \mathbb{E}_P\!\left[e^{\lambda\left(L(x) - \hat{L}_S(x)\right)}\right] \le \frac{\lambda^2}{8n} + \log\frac{1}{\delta}.$$

Now we put everything together and then optimize over $\lambda$. Optimizing, we obtain

$$\mathbb{E}_Q[L(x)] \le \mathbb{E}_Q[\hat{L}_S(x)] + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log(1/\delta)}{2n}}.$$
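As a sanity check (my own sketch, not from the post), the following verifies the Donsker-Varadhan inequality numerically on a small discrete space and then evaluates the final bound at invented values of $n$, $\delta$, the empirical loss, and the KL term:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Donsker-Varadhan on a discrete space -------------------------------
# For any g: E_Q[g] - log E_P[e^g] <= D_KL(Q||P), with equality at
# g = log(dQ/dP) (up to an additive constant).
P = np.array([0.25, 0.25, 0.25, 0.25])
Q = np.array([0.60, 0.30, 0.05, 0.05])
kl = np.sum(Q * np.log(Q / P))

for _ in range(5):
    g = rng.normal(size=4)  # arbitrary test function
    lower = Q @ g - np.log(P @ np.exp(g))
    assert lower <= kl + 1e-12

g_star = np.log(Q / P)  # the optimal witness achieves equality
print(np.isclose(Q @ g_star - np.log(P @ np.exp(g_star)), kl))  # True

# --- The final PAC-Bayes bound at hypothetical values -------------------
n, delta = 10_000, 0.05
emp_loss, kl_post = 0.03, 50.0  # invented empirical loss and D_KL(Q||P)
bound = emp_loss + np.sqrt((kl_post + np.log(1 / delta)) / (2 * n))
print(bound)  # ~0.081: true risk E_Q[L] is below this w.p. >= 0.95
```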

## 5 comments

Comments sorted by top scores.

## comment by Steveot · 2021-05-21T11:29:28.993Z · LW(p) · GW(p)

Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how $P$ and $Q$ depend on the data (if they do) and how to interpret the final inequality in this light? Particularly I am wondering because the bound seems to be best when $Q = P$. Minor comment: I think there is a small typo in one of the formulas?

Replies from: zachary-robertson

## ↑ comment by Zachary Robertson (zachary-robertson) · 2021-05-21T14:23:42.924Z · LW(p) · GW(p)

The term $Q$ is meant to be a posterior distribution after seeing data. If you have a good prior you could take $Q = P$. However, note the empirical loss $\mathbb{E}_Q[\hat{L}_S]$ could then be high. You want to trade off between the cost of updating the prior and the loss reduction.

For example, say we have a neural network. Then our prior would be the initialization and the posterior would be the distribution of outputs from SGD.

(Btw thanks for the correction)
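To make that neural-network example concrete, here is a minimal sketch of how the KL term might be computed in practice, assuming (as in Dziugaite & Roy's non-vacuous-bounds setup) a diagonal Gaussian prior centered at the random initialization and a diagonal Gaussian posterior centered at the SGD solution; every dimension, scale, and sample count here is invented for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sig_q, mu_p, sig_p):
    """D_KL(Q||P) between diagonal Gaussians N(mu_q, diag(sig_q^2))
    and N(mu_p, diag(sig_p^2)), in nats."""
    return float(np.sum(
        np.log(sig_p / sig_q)
        + (sig_q**2 + (mu_q - mu_p) ** 2) / (2 * sig_p**2)
        - 0.5
    ))

rng = np.random.default_rng(0)
d = 10_000                               # number of network parameters
w_init = rng.normal(0, 0.1, d)           # prior mean: initialization
w_sgd = w_init + rng.normal(0, 0.05, d)  # posterior mean: SGD endpoint

sigma_p = np.full(d, 0.1)   # prior scale, matched to the init scale
sigma_q = np.full(d, 0.05)  # posterior noise scale (a free choice)

kl = kl_diag_gaussians(w_sgd, sigma_q, w_init, sigma_p)
n, delta = 60_000, 0.05
penalty = np.sqrt((kl + np.log(1 / delta)) / (2 * n))
print(kl, penalty)  # the further SGD travels, the larger the penalty
```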

Replies from: Steveot

## comment by Charlie Steiner · 2021-05-22T22:02:50.517Z · LW(p) · GW(p)

I'm still confused about the part where you use the Hoeffding inequality - how is the lambda in that step and the lambda in the loss function "the same lambda"?

Replies from: zachary-robertson

## ↑ comment by Zachary Robertson (zachary-robertson) · 2021-05-22T23:15:52.850Z · LW(p) · GW(p)

Because $g(x) = \lambda\left(L(x) - \hat{L}_S(x)\right)$. They are the same. Does that help?