The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

post by Zachary Robertson (zachary-robertson) · 2021-05-20T20:57:20.118Z · LW · GW · 5 comments

Contents

  Error Threshold
  PAC-Bayes Bound
5 comments

Epistemological Status: Correct, probably too technical for Less Wrong, but these results are interesting and relevant.

There's been a flurry of posts on generalization recently. Some talk about simplicity priors [LW · GW] and others about the bias of SGD [LW · GW]. However, I think it's worth bringing up the fact that there is an alternative that already works well enough to provide non-vacuous bounds for neural networks. Namely, I'm talking about the PAC-Bayes approach.

The twist I'll give in this brief note is to motivate the bound from a biological perspective and then derive the full PAC-Bayes bound. The first part relies on the error threshold for replication and the patching interpretation [LW · GW] of the KL-divergence. The second part relies on the Donsker-Varadhan variational characterization of the KL-divergence.

Error Threshold

A replication of an object $x$ is an approximate copy (up to mutation). The environment determines whether or not an object gets to replicate via a fitness $f(x)$. Suppose $\theta$ is an instruction set or parameterization of $x$. When $x$ replicates, $\theta$ is copied.

We begin with a prior distribution $P(\theta)$ over instructions, and then the distribution changes to $Q(\theta)$ as the population replicates. The amount of information needed to achieve the modification is the patching cost [LW · GW] and is equal to $D_{KL}(Q\|P)$.
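To make the patching cost concrete, here is a minimal sketch for discrete distributions; the particular prior and posterior below are made up purely for illustration:

```python
import math

def kl_divergence(q, p):
    """Patching cost D_KL(Q || P) in nats for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical example: a uniform prior over four instruction sets,
# updated to a posterior that concentrates on the fittest one.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.85, 0.05, 0.05, 0.05]

cost = kl_divergence(posterior, prior)  # nats needed for the update
```

Note that the cost is zero exactly when no update is made ($Q = P$) and grows the more the posterior concentrates away from the prior.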

Only information from the fitness landscape is useful for converging to the fittest individual(s). On the other hand, the amount of information available from the environment is equal to $-\ln \mathbb{E}_P[f(\theta)]$. Thus, transitions are favorable only when we have,

$$D_{KL}(Q\|P) \le -\ln \mathbb{E}_P[f(\theta)].$$

As an example, suppose that the space of instructions is the set of boolean strings of length $N$ and that mutation flips each bit in the string with uniform independent probability $p$. Suppose that $\theta^*$ always survives and our fitness is simply whether or not the replicate survives. This means $f(\theta^*) = 1$ and $f$ is a Bernoulli random variable with parameter $(1-p)^N$. Then we have,

$$D_{KL}(Q\|P) \le -\ln (1-p)^N = N \ln\frac{1}{1-p} \approx Np.$$
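The information budget in this example is easy to compute directly. A small sketch (the values of $N$ and $p$ are illustrative):

```python
import math

def information_budget(N, p):
    """Nats of information carried by the survival event when each of
    N bits flips independently with probability p: -ln((1-p)^N)."""
    return -N * math.log(1 - p)

# For small p the budget is roughly N*p nats per replication, so a
# transition from P to Q is favorable only while D_KL(Q||P) stays below it.
N, p = 100, 0.01
budget = information_budget(N, p)
```

With a single bit and $p = 1/2$ the budget is exactly $\ln 2$ nats: a coin-flip survival event carries one bit of information.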

PAC-Bayes Bound

Now suppose that the instructions are parameterizations for some sort of learning model and that the fitness is the training metric. In particular, the survival rate could be a classification error metric. If we are given $n$ examples that induce an error rate of $\epsilon$ we have the requirement,

$$D_{KL}(Q\|P) \le -n \ln(1-\epsilon) \approx n\epsilon.$$

This implies that using a lot of information to update our estimate for the optimal model parameters will require high error rates.

Specifically, a sublinear relationship between the information used and the data obtained is necessary for a low error rate if we are to avoid an error catastrophe. This would be a situation where the model continues to update despite being at the optimum.
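The necessary condition can be sketched numerically: solving $D_{KL}(Q\|P) \le -n\ln(1-\epsilon)$ for $\epsilon$ gives a floor of $1 - e^{-D_{KL}/n}$, which vanishes only when the information used grows sublinearly in $n$. The growth rates below are illustrative:

```python
import math

def error_floor(kl, n):
    """Smallest error rate consistent with D_KL(Q||P) <= -n*ln(1 - eps)."""
    return 1 - math.exp(-kl / n)

# If the information used grows sublinearly in n (say KL ~ sqrt(n)), the
# implied error floor shrinks toward 0; linear growth in n does not.
for n in (100, 10_000, 1_000_000):
    sublinear = error_floor(math.sqrt(n), n)  # tends to 0 as n grows
    linear = error_floor(0.5 * n, n)          # stuck at 1 - e^{-1/2}
```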

At this point it's natural to wonder if a sublinear relationship is sufficient. This is precisely the content of the PAC-Bayes theorem. To show this, first note that for any function $g$,

$$\mathbb{E}_Q[g(\theta)] \le D_{KL}(Q\|P) + \ln \mathbb{E}_P\left[e^{g(\theta)}\right].$$

As an aside, the astute reader might notice this as a Young inequality between dual functions, in which case we have the famous relation,

$$D_{KL}(Q\|P) = \sup_g \left( \mathbb{E}_Q[g(\theta)] - \ln \mathbb{E}_P\left[e^{g(\theta)}\right] \right).$$

Either way, it's clearer now that if we take $g = \lambda\Delta$ to be a scaled generalization error such as,

$$\Delta(\theta) = L(\theta) - \hat{L}_n(\theta),$$

where $L$ is the expected loss and $\hat{L}_n$ is the empirical loss on the $n$ examples, then the left-hand side of our Young inequality yields an empirical loss,

$$\lambda\, \mathbb{E}_Q[\Delta(\theta)] \le D_{KL}(Q\|P) + \ln \mathbb{E}_P\left[e^{\lambda\Delta(\theta)}\right],$$

which is similar to what we had before. The major difference is that we have a contribution from a cumulant function, which we can bound using the Markov and Hoeffding inequalities: with probability at least $1-\delta$ over the sample,

$$\ln \mathbb{E}_P\left[e^{\lambda\Delta(\theta)}\right] \le \frac{\lambda^2}{8n} + \ln\frac{1}{\delta}.$$

Now we put everything together and then optimize over $\lambda$. Optimizing we obtain,

$$\mathbb{E}_Q[L(\theta)] \le \mathbb{E}_Q[\hat{L}_n(\theta)] + \sqrt{\frac{D_{KL}(Q\|P) + \ln\frac{1}{\delta}}{2n}}.$$
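As a sanity check, here is a sketch that numerically verifies the variational inequality on random discrete distributions and evaluates the final complexity term; the distributions, function values, and bound parameters are all illustrative:

```python
import math
import random

def kl(q, p):
    """D_KL(Q || P) in nats for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def dv_slack(q, p, g):
    """Donsker-Varadhan slack: KL(Q||P) + ln E_P[e^g] - E_Q[g] >= 0."""
    lhs = sum(qi * gi for qi, gi in zip(q, g))
    rhs = kl(q, p) + math.log(sum(pi * math.exp(gi) for pi, gi in zip(p, g)))
    return rhs - lhs

def pac_bayes_term(kl_qp, n, delta):
    """The complexity term sqrt((KL + ln(1/delta)) / (2n)) from the bound."""
    return math.sqrt((kl_qp + math.log(1 / delta)) / (2 * n))

# Random discrete distributions and a random g: the slack is never negative.
random.seed(0)
p_raw = [random.random() for _ in range(5)]
q_raw = [random.random() for _ in range(5)]
p = [w / sum(p_raw) for w in p_raw]
q = [w / sum(q_raw) for w in q_raw]
g = [random.uniform(-2.0, 2.0) for _ in range(5)]
slack = dv_slack(q, p, g)
```

Equality in the variational characterization is attained when $g$ is the log-density ratio $\ln(dQ/dP)$ up to an additive constant, which is why the slack shrinks as $g$ aligns with it.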

5 comments

Comments sorted by top scores.

comment by Steveot · 2021-05-21T11:29:28.993Z · LW(p) · GW(p)

Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how $Q$ and $\hat{L}_n$ depend on the data $S$ (if they do) and how to interpret the final inequality in this light? Particularly I am wondering because the bound seems to be best when $Q = P$. Minor comment: I think the complexity term should be under a square root?

Replies from: zachary-robertson
comment by Zachary Robertson (zachary-robertson) · 2021-05-21T14:23:42.924Z · LW(p) · GW(p)

The term $Q$ is meant to be a posterior distribution after seeing data. If you have a good prior you could take $Q = P$. However, note the empirical loss $\mathbb{E}_Q[\hat{L}_n]$ could be high. You want a trade-off between the cost of updating the prior and the loss reduction.

For example, say we have a neural network. Then our prior would be the distribution over initializations and the posterior would be the distribution of parameters output by SGD.

(Btw thanks for the correction)

Replies from: Steveot
comment by Steveot · 2021-05-24T13:05:13.694Z · LW(p) · GW(p)

Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high probability over the sample $S$ (i.e., as you say, $S$ is the data), while the learning bound or loss reduction is given in expectation over $Q$.

comment by Charlie Steiner · 2021-05-22T22:02:50.517Z · LW(p) · GW(p)

I'm still confused about the part where you use the Hoeffding inequality - how is the lambda in that step and the lambda in the loss function "the same lambda"?

Replies from: zachary-robertson
comment by Zachary Robertson (zachary-robertson) · 2021-05-22T23:15:52.850Z · LW(p) · GW(p)

Because $g = \lambda\Delta$. They are the same $\lambda$. Does that help?