DSLT 2. Why Neural Networks obey Occam's Razor
post by Liam Carroll (liam-carroll) · 2023-06-18T00:23:14.725Z · LW · GW · 14 commentsContents
Information Criteria Help Avoid Underfitting and Overfitting Watanabe's Free Energy Formula for Singular Models Free Energy, Generalisation and Model Selection The Free Energy Formula for Singular Models The Widely Applicable Bayesian Information Criterion The Accuracy-Complexity Tradeoff Why Singular Models (Can) Generalise Well From points to local neighbourhoods Intuitive Examples to Interpret the WBIC Example 1: True parameters are preferred according to their RLCT K(w) is like a potential well, where a lower RLCT means a flatter floor Animation 1: The posterior concentrates on both true parameters as n→∞, but it prefers the one with lower RLCT Example 2: Non-true parameters can be preferred at finite n because of the accuracy-complexity tradeoff Animation 2: At low n the simpler model is preferred, but as n→∞, the true parameter is preferred Asymptotic Analysis Requires Care Where are we going from here? References None 14 comments
TLDR; This is the second main post of Distilling Singular Learning Theory [? · GW] which is introduced in DSLT0 [LW · GW]. I synthesise why Watanabe's free energy formula explains why neural networks have the capacity to generalise well, since different regions of the loss landscape have different accuracy-complexity tradeoffs. I also provide some simple intuitive examples [LW · GW] that visually demonstrate why true parameters (i.e. optimally accurate parameters) are preferred according to the RLCT as , and why non-true parameters can still be preferred at finite if they have lower RLCT's, due to the accuracy-complexity tradeoff. (The RLCT is introduced and explained in DSLT1 [? · GW]).
It is an amazing fact that deep neural networks seem to have an inductive bias towards "simple" models, suggesting [LW · GW] that they obey a kind of Occam's Razor:
Plurality should not be posited without necessity.
or in modern parlance,
If two models of the world are similarly accurate, the simpler explanation should be preferred.
This allows them to achieve exceptionally low generalisation error despite classical statistics predictions that they should overfit data:
This fact has come to be known as the generalisation problem and has been discussed at length in Zhang et. al 2017 (and a 2021 supplement), and in Bengio et al., amongst countless others.
Remarkably, Singular Learning Theory can help explain why neural networks, which are singular models, have the capacity to generalise so well.
The degeneracy of the Fisher information matrix is actually a feature of singular models, not a bug. This is because different regions of parameter space can have different complexities as measured by the RLCT , unlike regular models where the complexity is fixed to the total number of parameters in the model . This is the implicit content of Watanabe's profound free energy formula, called the Widely Applicable Bayesian Information Criterion (WBIC), which quantifies a precise asymptotic tradeoff between inaccuracy and complexity,
giving a mathematically rigorous realisation of Occam's Razor, since in singular models.
In this post we will explore Watanabe's free energy formula and provide an intuitive example of why the RLCT matters so much. If you are new to statistical learning theory, I would recommend jumping straight to the examples [LW · GW] and their related animations to gain the intuition first, and then return to the theory afterwards.
The four key points to take away are:
- As , true parameters with the best accuracy will always be preferred.
- As , if two true parameters are equally accurate but have different RLCT's, the parameter with the lower RLCT is preferred.
- For finite but large , non-true parameters can be preferred by the posterior because of an accuracy-complexity tradeoff as measured by the WBIC.
- Parameters with low inaccuracy and small RLCT's have low generalisation error (in a Bayesian sense) since the Bayes generalisation error is the "derivative" of the free energy, so
Information Criteria Help Avoid Underfitting and Overfitting
In the last post, we derived the asymptotic free energy as for regular models, called the Bayesian Information Criterion (BIC):
where is the total number of datapoints in the dataset , the optimal loss is where is a maximum likelihood estimate (i.e. ), and is the total dimension of parameter space .
As a statistical practitioner, given some dataset , your goal is to find a model that you hope will represent the truth from some candidate list. You only have access to the truth via your (training) dataset , but you also want to ensure that it generalises to data beyond the dataset. You can use the BIC to compare model candidates across a set of model classes that you can think to compare, since it captures a precise asymptotic tradeoff between inaccuracy and complexity . Under this paradigm, we should choose the model that achieves the lowest BIC as it is the best option for avoiding both underfitting and overfitting the data. Let's consider a simple example in action:
Example 1: Suppose we have datapoints drawn from a quadratic with Gaussian noise, where [1], where is drawn according to a uniform prior . After looking at our scatterplot of data , we could try models across the following set of model classes:
Name | Model | |
---|---|---|
Linear | 2 | |
Quadratic | 3 | |
Cubic | 4 | |
Degree 15 | 16 |
(The degree 15 model is an extremity just to illustrate a point.)
Within each model class, we can then perform ordinary least squares regression [2] to find the model fit with optimal loss (which, in the regression case, is simply the mean-squared-error plus a constant[3]). With the optimal loss of each model class in hand, we can then compare the over our set of candidates.
Model | |||
---|---|---|---|
0.9668 | 1 | 63.09 | |
0.9283 | 2 | 62.80 | |
0.9280 | 3 | 64.83 | |
0.9243 | 15 | 87.22 |
As one expects, in a similar vein to the bias-variance tradeoff, there is a clear optimal (lowest) . As the dimension increases, the accuracy gets better and better, but at the cost of the complexity of the model (and therefore, its generalisability[4]). The linear model is simple, but has high loss. The cubic has marginally lower loss, but at the expense of a complexity increase that isn't worth it. The degree 15 polynomial has the lowest loss of them all, but is penalised heavily for its complexity (as it should be - it is clearly overfitting). The Goldilocks choice is unsurprisingly the quadratic model, , because the tradeoff between accuracy and complexity is just right.
Other than the fact that the simply does not hold in the singular case, it also points us towards a limitation of regular models. Once you pick your model class to optimise, every point on the loss landscape has a fixed model complexity. If your goal is to minimise the , you only have one choice: to find the single point that optimises the loss at the bottom of the well.
In fact, in our particular case, we can calculate the KL divergence [5] for the linear model ,
which we can see a plot of below.
In singular models, this all changes: within the same model class, different models in parameter space have different effective dimensionalities as measured by the RLCT . The learning procedure does the work of the statistician for us, because the loss landscape contains information about both the accuracy and the complexity.
Watanabe's Free Energy Formula for Singular Models
Free Energy, Generalisation and Model Selection
Fundamentally, we care about the free energy because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1 [LW · GW], it tells us something about the information geometry of the posterior. In particular, given a compact neighbourhood of parameter space we can define its local free energy
providing a direct tool for comparing different regions of the posterior, and thus different models. Since there is a correspondence
we say the posterior prefers a region when it has low free energy relative to other regions of .
But this isn't the only reason to care about the free energy. In fact, it is explicitly related to generalisation, at least in the Bayesian sense.
In the frequentist framework that real-world deep learning takes place in (i.e. estimating a single parameter using SGD), we typically split our dataset into a training set and a test set. We then say a model defined by generalises well when it has low loss on the test set - in other words, it performs well on data that it hasn't seen before.
In the Bayesian paradigm, generalisation can be formulated according to a number of different estimation methods and quantities, depending on how you extract information from the posterior (e.g. you could estimate a maximum-likelihood parameter, a maximum a posterior estimate, or an average over samples from the posterior, etc.). Let's focus on one for the moment that involves the Bayes predictive distribution given by
which weights the probability of an output given an input according to the posterior measure. The Bayes generalisation loss is then given by
Intuitively, it is the expected loss of the predictive distribution over all possible inputs and outputs . It can be shown with relative ease [6] that the Bayes generalisation loss is equal to the average increase in free energy,
In an informal-yet-conceptually-correct way, we can treat this difference as being the "derivative with respect to " [7].
It follows immediately that the generalisation loss of a region is
Ergo, to understand the information contained in the posterior, and which regions contain models with low generalisation error, we want to calculate the free energy. In even modestly simple settings this integral is intractable, thus why we need to calculate its asymptotic form.
The Free Energy Formula for Singular Models
This subsection is a little bit more technical. If this overwhelms you, I recommend skipping ahead to the next subsection where I interpret the free energy formula.
As I explained in DSLT1 [LW · GW], finding the asymptotic form of the free energy as when is degenerate is hard, and depends of theorems of algebraic geometry and distribution theory. This formula has been refined over the course of many papers [8], adjusted and generalised with various hypotheses. Here we will focus on the form given in [Wat13], which also applies to the unrealisable case where the set of true parameters may be empty. Thus we instead care about the set of optimal parameters
and if is non-empty, then . As it stands, the free energy formula in the unrealisable case depends on a hypothesis of relatively finite variance [9].
Watanabe shows that the free energy of asymptotically satisfies
where:
- is the size of the dataset .
- is a most singular optimal point of (which we will explain below), which is in the interior of .
- is the global RLCT, associated to the most singular point .
- is a sequence of random variables that satisfies and converges to a Gaussian random variable in law as .
- denotes a sequence that is bounded in probability.
We will interpret the formula in a moment, but let me briefly clarify what it means to call a "most singular point", a notion is made precise in [Lin11, Proposition 3.9]. (See the below figure, too).
The gist is that every singularity has an associated local RLCT defined by considering small neighbourhoods around the point. The global RLCT of is defined to be the minimum over these optimal points, , and an optimal point is a most singular point if [10]. Note also that the formula currently depends on an optimal parameter being in the interior of , so you can think of a most singular point as being a local minimum of .
The Widely Applicable Bayesian Information Criterion
In the asymptotic limit as , we can ignore the last two terms of the free energy formula and arrive at the Widely Applicable Bayesian Information Criterion (WBIC) across the full parameter space ,
Notice how the WBIC formula is the same as the BIC except that complexity is measured by instead of . But in DSLT1 [LW · GW], we explained how
- In regular models .
- In singular models in general.
Thus the WBIC is a generalisation of the BIC, containing the regular result as a special case.
Though the WBIC can be used to compare model classes like in the case of the BIC, its real power is what it tells us about the information geometry of the posterior within the same class of models. To this end, we can calculate the local free energy of a compact neighbourhood
where is the most singular optimal point in with associated RLCT .
The Accuracy-Complexity Tradeoff
The WBIC shows that the free energy of a region is comprised of the accuracy and complexity (or, to the physicists, energy and entropy) of the most singular point in
What makes this so profound is that in singular models, different regions can have different RLCT's , each with a different tradeoff between accuracy and complexity, unlike the regular model case where every region has a fixed complexity:
So, the region in that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam's Razor: if two regions are equally accurate, then they are preferred according to which is the simpler model.
Interpreting the terms in the free energy formula leads us to three main points:
- As , regions containing optimal parameters will always be preferred, since the inaccuracy is the leading order term.
- As , if there are multiple regions and with equally optimal accuracy, then they are preferred according to their respective RLCT's. Lower is better, so if then is preferred by the posterior.
- For finite but large , regions that do not contain a globally optimal parameter can be preferred by the posterior because they can have a better tradeoff between accuracy and complexity.
Why Singular Models (Can) Generalise Well
Armed with our free energy formula, we can now understand why singular models have the capacity to generalise well.
Recall that the Bayes generalisation error can be expressed as the difference , which can be interpreted as the "derivative" of the free energy with respect to . Then since , Watanabe is able to prove in [Wat18, Chapter 8] that, asymptotically, the Bayes generalisation error is
In fact, he goes one step further by considering two other forms of generalisation, the leave-one-out-cross-validation-loss , and the WAIC (an empirically measurable form of the WBIC), and shows that asymptotically
Once again, we find that the RLCT plays a central role in the learning process. Most importantly, we have a correspondence:
On top of this, we can carry out the same analysis on our local to find that the local generalisation loss of is
Since the RLCT can differ region to region, this tells us that:
All of this is to say: under any reasonable conception of Bayesian generalisation, the RLCT plays a central role in minimising generalisation loss, asymptotically. And since in singular models, the generalisation error of a singular model will always be better than that of a regular model with the same number of parameters. In DSLT3 we will show neural networks are singular, so:
This is why neural networks can generalise well!
...sort of. Don't forget, we are in the Bayesian paradigm here, and it is not a given that Stochastic Gradient Descent (SGD) finds the same regions of parameter space that the Bayesian posterior says are "good". We postulate that in some sense they are equivalent, and the work of Mingard et al. in Is SGD a Bayesian Sampler? Well, Almost. agrees with this postulate. But, formalising this relationship does remain a key open problem in conclusively applying SLT to modern deep learning with SGD and its variants.
From points to local neighbourhoods
For those with less background knowledge, there is an important conceptual shift we have made here that I want to elaborate on briefly.
In frequentist statistics we care about particular point estimates in parameter space . But in Bayesian statistics, we care about measurable regions of parameter space, and the probability to which the posterior assigns those regions. This is a powerful shift in perspective, and points towards why SLT is placed in a Bayesian paradigm: the observation the geometry of contains a lot more information than simple point estimates do lends itself naturally to Bayesian statistics.
But in modern deep learning, we only ever have access to a point estimate at the end of training via SGD. Sampling from the Bayesian posterior for large neural networks would not only be silly, but it would be, essentially, computationally impossible. So does this mean SLT has absolutely no applicability to modern deep learning?
Not at all. By studying the local geometry of the loss landscape - which is to say, arbitrarily small neighbourhoods of the posterior - we are able to analyse the set of points that are arbitrarily close to the singularities of .
What Watanabe shows is that the singularities contained in these small neighbourhoods affect the geometry of the other points in the neighbourhood. If contains one true parameter , the other points in may not be equal minima of , but they are extremely close to being so. The same logic applies for the RLCT of a region: perhaps there is only one most singular point in , but any nearby parameter within the small neighbourhood will define a model whose functional output is nearly identical to that of which has a lower complexity.
In focusing only on points, we lose an extraordinary amount of information. So, we localise to neighbourhoods.
Intuitive Examples to Interpret the WBIC
It's time we looked at a specific example to build intuition about what the WBIC is telling us.
I have constructed this toy example specifically to illustrate the main points here, and a lot of the details about the sub-leading terms in the free energy expansion are obfuscated, as well as the random fluctuations that make different to . But, it is conceptually correct, and helps to illustrate the dominant features of the learning process as . Don't take it too literally, but do take it seriously.
We will start with some calculations, and then visualise what this means for the posterior as .
Example 1: True parameters are preferred according to their RLCT
Let's consider a one parameter model with KL divergence defined by
on the region with uniform prior . There are two singularities in the set of true parameters,
which we will label as and respectively. The Fisher information at true parameters is just the Hessian, which in the one dimensional case is simply the second derivative,
An easy calculation shows
meaning is a regular point and is a singular point since
meaning the Fisher information "matrix" (albeit it is one-dimensional) is degenerate at but not at . We thus expect the RLCT of to be less than
Let denote a small neighbourhood of radius centred at . To analyse the geometry of near the two singularities, we can define compact local regions
By taking a Taylor expansion about each singularity, the leading order terms of in each region are
since is very small. Recalling the definition of the RLCT from DSLT1 [LW · GW], since is in normal crossing form we can read off the local RLCTs
meaning the effective dimensionalities associated to each singularity are and respectively.
Here's the crux: since both singularities and are true parameters, so , they both have the same accuracy (a constant),
But, since the RLCT associated to is smaller, , the free energy formula tells us that is preferred by the posterior since it has lower free energy,
and our generalisation formula tells us that will have lower expected Bayesian generalisation loss,
The simpler model is preferred, and has a lower generalisation loss, because it has a lower RLCT.
Notice here how our was arbitrary. It doesn't matter exactly what it is, because as long as it is small, will look like and in each respective region. Its geometry is dominated by these terms when very close to each singularity.
is like a potential well, where a lower RLCT means a flatter floor
We argued in the first post that the RLCT is the correct measure of "flatness" in the loss landscape. We can now see that with our own eyes by plotting in our above example:
Looking at this plot, a good intuition to have is that the loss landscape is a potential well for some kinetic ball bouncing around, with its Hamiltonian given by . The ball, with enough random kinetic energy, will explore both regions and , but it will spend more time in the latter since it is "flatter" [11].
Animation 1: The posterior concentrates on both true parameters as , but it prefers the one with lower RLCT
For large , the empirical KL divergence is approximately equal to the KL divergence , and so normalising out by the constant , the posterior effectively only depends on :
which is shown with some easy algebraic manipulations [12].
The free energy formula predicts that is preferred because is smaller than . So let's look at how the posterior changes with :
As predicted, we visually see in the top posterior plot that the posterior concentration in is greater than for all . The bottom plot backs this up precisely by showing that the free energy is always less than . The RLCT clearly matters!
Note here that we have been able to numerically calculate the integrals and , indicated by the shaded regions precisely, and not based on the free energy formula. This is because numerical integration is easy in low dimensions. In even modest dimensionality, these partition functions are notoriously difficult to compute, thus giving rise to the field of computational Bayesian statistics and algorithms like Markov Chain Monte Carlo. Our neural network experiments in DSLT4 require the MCMC approach.
Example 2: Non-true parameters can be preferred at finite because of the accuracy-complexity tradeoff
Let's alter our example ever so slightly to the case where still has a lower RLCT, but it has marginally worse accuracy than and thus is not a true parameter, but is an optimal parameter within . We will set
for some constant . Our KL divergence is then given by
where and are constants chosen such that is a local but not global minima of , so and .
Then by the free energy formula [13],
We can visualise how this changes the loss landscape:
Animation 2: At low the simpler model is preferred, but as , the true parameter is preferred
Let's visualise how the posterior concentration changes with for and (yes, is reasonably large here, but you want to be able to see this process, right?).
For the first few values of , the free energy is minimised by with its lower RLCT. But at about the inaccuracy of has become too great to bare and the more accurate is preferred. After this point the free energy curves are monotonic - the more accurate model will be preferred for all .
Asymptotic Analysis Requires Care
I know what you're thinking: how come we could just approximate in each region? Don't other curvature effects matter to the posterior density too? Like the prior, and those Taylor coefficients in ? Wait, and don't we care about instead of ? Wait wait, , but we never have access to , so why would we care about , let alone ? This all seems far too contrived...
Yes, and no. Remember, the free energy formula gives us an asymptotic approximation of the posterior density has . Thinking carefully about asymptotic approximations does require some care, particularly when one tries to convert these formulas into statements about finite - after all, we can only ever train a model on a finite number of datapoints. But Watanabe's free energy formula tells us what aspects of the geometry of affect the learning process as more data is collected. He is able to rigorously prove that the structure of singularities, as measured by the RLCT, are central to the long-term behaviour of model training. Yes, many other effects get thrown away in the term of the free energy formula, and the sequence is also loaded with other random fluctuations, but Watanabe nonetheless shows what makes singular models so powerful.
Where are we going from here?
In this post, we have argued for the shift in perspective from points to local neighbourhoods (frequentist to Bayesian statistics), and then from local neighbourhoods to singularities of (classical learning theory to singular learning theory). We have gone from caring about an enormous parameter space , to only caring about certain singularities. Since singularities have different local geometry, and local neighbourhoods around them minimise the free energy, we have found our phases of statistical learning:
We will explain this correspondence between phases and singularities in detail in DSLT4. Before that, though, in DSLT3 we are going to study the set of true parameters of a particular toy model: two layer feedforward ReLU neural networks with two inputs and one output. In doing so, we will see how the singularities of in these models are identifiable with the different functionally equivalent symmetries of the neural network. With a full classification of in hand, we will then look at some experiments on these toy models that provide tractable and precise illustrations of phase transitions in the posterior in these ReLU neural networks, which Watanabe's free energy formula predicts.
References
[Wat09] S. Watanabe Algebraic Geometry and Statistical Learning Theory 2009
[Wat13] S. Watanabe A Widely Applicable Bayesian Information Criterion 2013
[Wat18] S. Watanabe Mathematical Theory of Bayesian Statistics [? · GW] 2018
[Lin11] S. Lin Algebraic Methods for Evaluating Integrals in Bayesian Statistics 2011
- ^
In other words, the true distribution is . ↩︎
- ^
In other circumstances without a closed-form solution we can use an optimisation method like SGD.
- ^
- ^
In the sense that if we kept drawing more data from the truth , the error of would get worse and worse - it is overfitting data. ↩︎
- ^
Okay, so this is a bit of a white lie. If the model and truth are both defined by a regression model with equal variance , then
I haven't checked the case where the variances are different, but I believe it only requires a slight adjustment to this formula with a scaling factor somewhere. But this example nonetheless illustrates the point - is a paraboloid in regular models.
- ^
- ^
The reason this works is because of the asymptotics of .
- ^
Originally, in [Wat09], Watanabe proves the formula for the realisable case where is a true parameter, meaning and .
- ^
[Wat18, Def 7, Chap 3] For a given pair and , the log density ratio function is defined by
If there exists such that for any arbitrary pair and
then it is said that the log density ratio function has a relatively finite variance. ↩︎
- ^
And if it achieves the maximum multiplicity, but we will leave this vague to avoid more technical clutter.
- ^
In fact, this analogy to statistical physics runs quite deep. Hamiltonian Monte Carlo, one of the main methods for sampling from a Bayesian posterior, is essentially just simulating a particle subject to a Gibbs distribution . There's a reason we've been calling it the free "energy", after all! ↩︎
- ^
Specifically, as , the KL divergence converges,
Since is a constant, we are able to write
If one defines the posterior in terms of , we call this the normalised posterior.
- ^
For this pathological example, these formulas are certainly not an adequate approximation for the exact numerical free energy at low - for one thing, they don't even depend on the radius of the ball . However, writing the formulas like this does point us towards the conceptual idea, which is verified in the below animations.
14 comments
Comments sorted by top scores.
comment by bhauth · 2023-06-18T01:46:17.926Z · LW(p) · GW(p)
Trained neural networks only generalize at all with certain types of activation functions. So if your theory doesn't consider activation functions it's probably wrong.
Replies from: liam-carroll↑ comment by Liam Carroll (liam-carroll) · 2023-06-18T02:10:30.852Z · LW(p) · GW(p)
If a model is singular, then Watanabe’s Free Energy Formula (FEF) can have big implications for the geometry of the loss landscape. Whether or not a particular neural network model is singular does indeed depend on its activation function, amongst other structures in its architecture.
In DSLT3 I will outline the ways simple two layer feedforward ReLU neural networks are singular models (ie I will show the symmetries in parameter space that produce the same input-output function), which is generalisable to deeper feedforward ReLU networks. There I will also discuss similar results for tanh networks, alluding to the fact that there are many (but not all) activation functions that produce these symmetries, thus making neural networks with those activation functions singular models, thus meaning the content and interpretation of Watanabe’s free energy formula is applicable.
Replies from: bhauth↑ comment by bhauth · 2023-06-18T02:47:53.264Z · LW(p) · GW(p)
This is all pretty complicated compared to my understanding of why neural networks generalize, and I'm not sure why I should prefer it. Does this complex and detailed theory have any concrete predictions about NN design or performance in different circumstances? Can you accurately predict which activation functions work well?
My view is that this "singularity" of networks - which I don't think is a good term, it's already overloaded with far too many meanings - is applicable to convergence properties but not to generalization ability.
Replies from: liam-carroll↑ comment by Liam Carroll (liam-carroll) · 2023-06-18T22:27:20.521Z · LW(p) · GW(p)
What is your understanding? It is indeed a deep mathematical theory, but it is not convoluted. Watanabe proves the FEF, and shows the RLCT is the natural generalisation of complexity in this setting. There is a long history of deep/complicated mathematics, with natural (and beautiful) theorems at the core, being pivotal to describing real world phenomena.
The point of the posts is not to argue that we can prove why particular architectures perform better than others (yet). This field has had, comparatively, very little work done to it yet within AI research, and these sorts of facts are where SLT might take us (modulo AI capabilities concerns). The point is to demonstrate the key insights of the theory and signpost the fact that “hey, there might be something very meaningful here.” What we can predict with the theory is why certain phase transitions happen, in particular the two layer feedforward ReLU nets I will show in DSLT4. This is a seed from which to generalise to deeper nets and more intricate architectures - the natural way of doing good mathematics.
As to the “singularity” problem, you will have to take that up with the algebraic geometers who have been studying singularities for over 50 years. The fact is that optimal parameters are singularities of K(w) in non-trivial neural networks - hence, singular learning theory.
What do “convergence properties” and “generalisation ability“ mean to you, precisely? It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation. As I say, its relation to SGD dynamics is certainly an open question. But in the Bayesian setting, the case is really quite closed.
Replies from: bhauth↑ comment by bhauth · 2023-06-18T22:40:01.522Z · LW(p) · GW(p)
What is your understanding?
I guess I'll write a post.
It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation
No. What was proven is that there are some points which can be represented by lots of possible configurations, more so than other points. There is no proof or even evidence that those are reached by NN training by SGD, or that those points represent good solutions to problems. As far as I can tell, you're just assuming that because it seems to you like a logical reason for NN generalization.
Replies from: liam-carroll↑ comment by Liam Carroll (liam-carroll) · 2023-06-18T23:36:01.592Z · LW(p) · GW(p)
With all due respect, I think you are misrepresenting what I am saying here. The sentence after your quote ends is
its relation to SGD dynamics is certainly an open question.
What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post [LW · GW], strongly depends on the singularity structure of the minima of , as measured by the RLCT . This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is an interesting and important open problem, if it can be done at all.
Replies from: bhauth↑ comment by bhauth · 2023-06-19T00:16:58.508Z · LW(p) · GW(p)
I said I'd write a post, and I wrote a post [LW · GW].
the Bayes generalisation error Gn is the "derivative" of the free energy
I think calling that "Bayes generalisation error" is where you went wrong. I see no good basis for saying that's true in the sense people normally mean "generalization".
I understand some things about a Free Energy Formula are proved, but I don't think you've shown anything about low RLCT points tending to be the sort of useful solutions which neural networks find.
Replies from: liam-carroll↑ comment by Liam Carroll (liam-carroll) · 2023-06-19T01:01:33.364Z · LW(p) · GW(p)
Thanks for writing that, I look forward to reading.
As for nomenclature, I did not define it - the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it says: "what is the probability of this output given this input and given the data of the posterior". The Bayes training loss (which I haven't delved into in this post) is the empirical counterpart to the Bayes generalisation loss,
and so it adds up the "entropy" of the predictive distribution over the training datapoints - if the predictive distribution is certain about all training datapoints, then the training loss will be 0. (Admittedly, this is an object that I am still getting my head around).
The Bayes generalisation loss satisfies , and therefore it averages this training loss over the whole space of inputs and outputs. This is the sense in which it is reasonable to call it "generalisation". As I say in the post, there are other ways you can think of generalisation in the Bayesian setting, like the leave-one-out-cross-validation, or other methods of extracting data from the posterior like Gibbs sampling. Watanabe shows (see pg236 of [Wat18]) that the RLCT is a central object in all of these alternative conceptions.
As for the last point, neural networks do not "find" anything themselves - either an optimisation method like SGD does, or the Bayesian posterior "finds" regions of high posterior concentration (i.e. it contains this information). SLT tells us that the Bayesian posterior of singular models does "find" low RLCT points (as long as they are sufficiently accurate). Neural networks are singular models (as I will explain in DSLT3), so the posterior of neural networks "finds" low RLCT points.
Does this mean SGD does? We don't know yet! And whether your intuition says this is a fruitful line of research to investigate is completely personal to your own mental model of the world, I suppose.
comment by Leon Lang (leon-lang) · 2023-06-25T22:46:25.722Z · LW(p) · GW(p)
Thanks for this nice post! I fight it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it.
Fundamentally, we care about the free energy because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1 [LW · GW], it tells us something about the information geometry of the posterior.
Can you tell more about why it is a measure of posterior concentration (It gets a bit clearer further below, but I state my question nonetheless to express that this statement isn't locally clear to me here)? I may lack some background in Bayesian statistics here. In the first post, you wrote the posterior as
and it seems like you want to say that if free energy is low, then the posterior is more concentrated. If I look at this formula, then low free energy corresponds to high , meaning the prior and likelihood have to "work quite a bit" to ensure that this expression overall integrates to . Are you claiming that most of that work happens very localized in a small parameter region?
Additionally, I am not quite sure what you mean with "it tells us something about the information geometry of the posterior", or even what you mean by "information geometry" here. I guess one answer is that you showed in post 1 that the Fisher information matrix appears in the formula for the free energy, which contains geometric information about the loss landscape. But then in the proof, you regarded that as a constant that you ignored in the final BIC formula, so I'm not sure if that's what you are referring to here. More explicit references would be useful to me.
Since there is a correspondence
we say the posterior prefers a region when it has low free energy relative to other regions of .
Note to other readers (as this wasn't clear to me immediately): That correspondence holds because one can show that
Here, is the global partition function.
The Bayes generalisation loss is then given by
I believe the first expression should be an expectation over .
It follows immediately that the generalisation loss of a region is
I didn't find a definition of the left expression.
So, the region in that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam's Razor: if two regions are equally accurate, then they are preferred according to which is the simpler model.
Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.
So I guess you also want to choose small regions. You hinted at that already by saying that should be compact. But now I of course wonder if sometimes just all of lies within a compact set.
There are two singularities in the set of true parameters,
which we will label as and respectively.
Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).
Let's consider a one parameter model with KL divergence defined by
on the region with uniform prior
The prior seems to do some work here: if it doesn't properly support the region with low RLCT, then the posterior cannot converge there. I guess a similar story might a priori hold for SGD, where how you initialize your neural network might matter for convergence.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Replies from: liam-carroll↑ comment by Liam Carroll (liam-carroll) · 2023-06-27T06:30:14.560Z · LW(p) · GW(p)
Can you tell more about why it is a measure of posterior concentration.
...
Are you claiming that most of that work happens very localized in a small parameter region?
Given a small neighbourhood , the free energy is and measures the posterior concentration in since
where the inner term is the posterior, modulo its normalisation constant . The key here is that if we are comparing different regions of parameter space , then the free energy doesn't care about that normalisation constant as it is just a shift in by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions are the same "size". Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn't really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of around the singularity).
I am not quite sure what you mean with "it tells us something about the information geometry of the posterior"
This is sloppily written by me, apologies. I merely mean to say "the free energy tells us what models the posterior likes".
I didn't find a definition of the left expression.
I mean, the relation between and tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to (instead of in the integral).
Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.
Yes - and as you say, this would be very uninteresting (and in general you wouldn't know what to pick necessarily [although we did in the phase transition DSLT4 because of the classification of in DSLT3]). The point is that at no point are you just magically "choosing" a anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry depends on the singularity structure and this varies across parameter space.
Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).
As discussed in the comment [LW(p) · GW(p)] in your DSLT1 question, they are both singularities of since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT .
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
I think sensible choices of priors has an interesting and not-interesting angle to it. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a "regularisation" term (and can be thought of as adding the "simple harmonic oscillator" part to the story). The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the limit. Also if you were concerned with the requirement for to be compact, you can just define it to be compact on the space of "numbers that my computer can deal with".
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on the potential wells , you find that it just gets stuck in the basin it was closest to. So, what's going on in high dimensions? It seems something about the way higher dimensional spaces are different from lower ones is relevant here, but it's very much an open problem.
Replies from: leon-lang↑ comment by Leon Lang (leon-lang) · 2023-07-03T23:13:00.661Z · LW(p) · GW(p)
Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :)
As discussed in the comment [LW(p) · GW(p)] in your DSLT1 question, they are both singularities of since they are both critical points (local minima).
Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zero's of a set of polynomials are not necessarily already singularities. E.g., in , the zero set consists of the two axes, which form an algebraic variety, but only at is there a singularity because the derivative disappears.
Now, for the KL-divergence, the situation seems more extreme: The zero's are also, at the same time, the minima of , and thus, the derivative disappears at every point in the set . This suggests every point in is singular. Is this correct?
So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the limit.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes asymptotic behavior for , but I'm not certain of that.
Replies from: liam-carroll↑ comment by Liam Carroll (liam-carroll) · 2023-07-11T23:40:29.116Z · LW(p) · GW(p)
Now, for the KL-divergence, the situation seems more extreme: The zero's are also, at the same time, the minima of , and thus, the derivative disappears at every point in the set . This suggests every point in is singular. Is this correct?
Correct! So, the point is that things get interesting when is more than just a single point (which is the regular case). In essence, singularities are local minima of . In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minima of as a singularity. The TLDR of this is:
So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
As I show in the examples in DSLT1 [? · GW], having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose is the number of parameters, then you are in the regular case if can be expressed as a full-rank quadratic form near each singularity,
Anything less than this is a strictly singular case.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes asymptotic behavior for , but I'm not certain of that.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffrey's prior. I haven't studied it in detail but to the best of my reading he is basically saying "from the point of view of SLT, the Jeffrey's prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be if the Jeffrey's prior is employed." (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffrey's prior is employed).
Replies from: leon-lang↑ comment by Leon Lang (leon-lang) · 2023-07-12T06:26:39.865Z · LW(p) · GW(p)
Thanks for the reply!
As I show in the examples in DSLT1 [? · GW], having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose is the number of parameters, then you are in the regular case if can be expressed as a full-rank quadratic form near each singularity,
Anything less than this is a strictly singular case.
So if , then is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry--perspective.
comment by watermark (w̵a̵t̸e̶r̴m̷a̸r̷k̷) · 2023-09-18T04:15:34.451Z · LW(p) · GW(p)
How similar is Watanabe's formulation of free energy to Karl Friston's?