The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

post by Zachary Robertson (zachary-robertson) · 2021-05-20T20:57:20.118Z · LW · GW
Epistemological Status: Correct, probably too technical for Less Wrong, but these results are interesting and relevant.
There's been a flurry of posts on generalization recently. Some talk about simplicity priors [LW · GW] and others about the bias of SGD [LW · GW]. However, I think it's worth bringing up that there is already an alternative that works well enough to provide non-vacuous bounds for neural networks. Namely, I'm talking about the PAC-Bayes approach.
The twist I'll give in this brief note is to motivate the bound from a biological perspective and then derive the full PAC-Bayes bound. The first part relies on the error threshold for replication and the patching interpretation [LW · GW] of the KL-divergence. The second part relies on the Donsker-Varadhan variational characterization of the KL-divergence.
A replication of an object $x$ is an approximate copy (up to mutation). The environment determines whether or not an object gets to replicate via a fitness $f$. Suppose $\theta$ is an instruction set or parameterization of $x$. When $x$ replicates, $\theta$ is copied.
We begin with a prior distribution $P$ over instructions, and the distribution changes as the population replicates, yielding a posterior $Q$. The amount of information needed to achieve the modification is the patching cost [LW · GW] and is equal to $D_{\mathrm{KL}}(Q \,\|\, P)$.
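To make the patching cost concrete, here is a minimal sketch computing $D_{\mathrm{KL}}(Q \,\|\, P)$ for a population whose instruction distribution sharpens from a uniform prior into a concentrated posterior. The four-point support and the particular probabilities are made up for illustration:

```python
import math

def kl_divergence(q, p):
    """D_KL(Q || P) in bits for two distributions on the same finite support."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical population over four instruction sets: selection sharpens a
# uniform prior P into a concentrated posterior Q.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.85, 0.05, 0.05, 0.05]

print(round(kl_divergence(posterior, prior), 3))  # patching cost in bits
```

The patching cost is zero exactly when the population distribution does not move, and grows as selection concentrates the posterior.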
Only information from the fitness landscape is useful for converging to the fittest individual(s). On the other hand, the amount of information available from the environment is equal to $H(f)$, the entropy of the fitness signal. Thus, transitions are favorable only when we have,
$$D_{\mathrm{KL}}(Q \,\|\, P) \le H(f).$$
As an example, suppose that the space of instructions is the set of boolean strings of length $n$ and that mutation flips each bit in the string with uniform independent probability $p$. Suppose that $x$ always survives and our fitness is simply whether or not the replicate survives. This means $q = (1-p)^n$ and $f$ is a Bernoulli random variable with parameter $q$. Then we have,
$$D_{\mathrm{KL}}(Q \,\|\, P) \le H\big((1-p)^n\big),$$
where $H$ denotes the binary entropy.
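A small numeric sketch of the example, assuming survival means the replicate is an exact copy so that $q = (1-p)^n$; the values of $n$ and $p$ are arbitrary:

```python
import math

def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) variable."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

n, p = 20, 0.01              # string length and per-bit mutation rate (arbitrary)
q = (1 - p) ** n             # probability a replicate is an exact copy
budget = binary_entropy(q)   # bits of fitness information per replication

# Concentrating a uniform prior on a single string costs n bits of patching,
# so on the order of n / H(q) replications are needed to pay for it.
print(q, budget, n / budget)
```

With these numbers each replication supplies well under one bit, which is why driving the mutation rate up (pushing $q$ toward $1/2$ and beyond) eventually starves selection of the information it needs, i.e. the error threshold.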
Now suppose that the instructions are parameterizations for some sort of learning model and that the fitness is the training metric. In particular, the survival rate could be a classification error metric. If we are given $m$ examples that induce an error rate of $\epsilon$ we have the requirement,
$$D_{\mathrm{KL}}(Q \,\|\, P) \le m\, H(\epsilon).$$
This implies that using a lot of information to update our estimate for the optimal model parameters will require high error rates.
Specifically, a sublinear relationship between the information used and the data obtained is necessary for a low error rate if we are to avoid an error catastrophe. This would be a situation where the model continues to update despite being at the optimum.
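One way to see the sublinearity requirement numerically: fix a KL budget and ask for the smallest error rate $\epsilon$ with $m\,H(\epsilon)$ at least that budget. Below is a sketch under that framing; bisection works because binary entropy is increasing on $(0, 1/2]$, and the budget of 50 bits is arbitrary:

```python
import math

def binary_entropy(e):
    """Entropy in bits of a Bernoulli(e) variable."""
    if e in (0.0, 1.0):
        return 0.0
    return -e * math.log2(e) - (1 - e) * math.log2(1 - e)

def min_error_rate(kl_bits, m):
    """Smallest eps in (0, 1/2] with m * H(eps) >= kl_bits, by bisection."""
    if m * binary_entropy(0.5) < kl_bits:
        return None  # even eps = 1/2 cannot supply enough information
    lo, hi = 1e-12, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if m * binary_entropy(mid) >= kl_bits:
            hi = mid  # mid supplies enough information; try a smaller eps
        else:
            lo = mid
    return hi

# Fixed KL budget of 50 bits: the admissible error rate falls as m grows.
for m in (100, 1000, 10000):
    print(m, round(min_error_rate(50, m), 4))
```

If the KL used grows linearly with $m$, the admissible error rate stays bounded away from zero; only a sublinear KL lets $\epsilon$ shrink.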
At this point it's natural to wonder if a sublinear relationship is sufficient. This is precisely the content of the PAC-Bayes theorem. To show this, first note that for any bounded function $g$,
$$\mathbb{E}_Q[g] \le D_{\mathrm{KL}}(Q \,\|\, P) + \log \mathbb{E}_P\big[e^{g}\big].$$
As an aside, the astute reader might notice this as a Young inequality between dual functions, in which case we have the famous relation,
$$D_{\mathrm{KL}}(Q \,\|\, P) = \sup_g \Big( \mathbb{E}_Q[g] - \log \mathbb{E}_P\big[e^{g}\big] \Big).$$
Either way, it's clearer now that if we take $g$ to be a generalization error such as,
$$g(\theta) = \lambda \big( L(\theta) - \hat{L}_m(\theta) \big),$$
where $L$ is the true risk and $\hat{L}_m$ is the empirical risk on $m$ samples, then the left-hand side of our Young inequality yields an empirical loss, which is similar to what we had before. The major difference is that we have a contribution from a cumulant function which we can bound using the Markov and Hoeffding inequalities (for losses bounded in $[0,1]$): with probability at least $1-\delta$,
$$\log \mathbb{E}_P\Big[e^{\lambda (L - \hat{L}_m)}\Big] \le \frac{\lambda^2}{8m} + \log \frac{1}{\delta}.$$
Now we put everything together and then optimize over $\lambda$. Optimizing we obtain,
$$\mathbb{E}_Q[L] \le \mathbb{E}_Q[\hat{L}_m] + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log \frac{1}{\delta}}{2m}}.$$
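As a sanity check on the pieces above, the sketch below numerically verifies the Donsker-Varadhan inequality on a small discrete space and evaluates the final bound. The four-point distributions, the test function, and all numeric inputs are made up for illustration, and the closing formula matches the in-text optimization over $\lambda$ (the textbook McAllester bound carries extra $\log m$ and $m-1$ factors that make the $\lambda$-optimization fully rigorous):

```python
import math

def kl(q, p):
    """D_KL(Q || P) in nats over a shared finite support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def dv_gap(q, p, g):
    """KL(Q||P) + log E_P[e^g] - E_Q[g]; Donsker-Varadhan says this is >= 0."""
    lhs = sum(qi * gi for qi, gi in zip(q, g))
    cumulant = math.log(sum(pi * math.exp(gi) for pi, gi in zip(p, g)))
    return kl(q, p) + cumulant - lhs

p = [0.25, 0.25, 0.25, 0.25]
q = [0.70, 0.10, 0.10, 0.10]
g = [1.0, -0.5, 2.0, 0.3]        # arbitrary test function on the 4-point space
assert dv_gap(q, p, g) >= 0.0    # the variational inequality holds

def pac_bayes_bound(emp_loss, kl_qp, m, delta=0.05):
    """emp_loss + sqrt((KL + log(1/delta)) / (2m)), as in the derivation above."""
    return emp_loss + math.sqrt((kl_qp + math.log(1 / delta)) / (2 * m))

print(round(pac_bayes_bound(0.08, 10.0, 10000), 4))
```

The supremum in the dual relation is attained at $g = \log(dQ/dP)$, where the gap closes to zero; note also that the bound is non-vacuous only when the KL term is sublinear in $m$, tying back to the error-catastrophe discussion.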