Singular learning theory: exercises

post by Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · LW · GW · 5 comments

Contents

  References
None
5 comments

Thanks to Jesse Hoogland and George Wang for feedback on these exercises. 

In learning singular learning theory (SLT), I found it was often much easier to understand by working through examples, rather than try to work through the (fairly technical) theorems in their full generality. These exercises are an attempt to collect the sorts of examples that I worked through to understand SLT.

Before doing these exercises, you should have read the Distilling Singular Learning Theory (DSLT) sequence [? · GW], watched the SLT summit YouTube videos, or studied something equivalent. DSLT is a good reference to keep open while solving these problems, perhaps alongside Watanabe's textbook, the Gray Book. Note that some of these exercises cover the basics, which are well-covered in the above distillations, but some deliver material which will likely be new to you (because it's buried deep in a textbook, because it's only found in adjacent literature, etc).

Exercises are presented mostly in conceptual order: later exercises freely use concepts developed in earlier exercises. Starred (*) exercises are what I consider the most essential exercises, and the ones I recommend you complete first.

  1. *The normal distribution, like most classical statistical models, is a regular (i.e. non-singular[1]) statistical model. A univariate normal model with unit variance and mean  is given by the probability density . Assume the true distribution  of the data is realizable by the model: that is,  for some true parameter .
    1. Calculate the Fisher information matrix of this model (note that since we have only a single parameter, the FIM will be a 1x1 matrix). Use this to show the model is regular.
    2. Write an explicit expression for the KL divergence  between  and , as a function of the parameter . This quantity is sometimes also called the population loss. [See Example 1.1, Gray Book, for the case of a 2D normal distribution]
    3. Using  from b), give an explicit formula for the volume of "almost optimal" parameters, , where  is the prior distribution. For convenience, let  be the improper prior .
    4. The volume scaling formula for the learning coefficient  (also known as RLCT[2]) is  for any  [Theorem 7.1, Gray Book]. Using this formula, combined with the expression for  derived in b), calculate the learning coefficient[3]. Given that the model is regular, we expect the learning coefficient to be ; compare your answer.
  2. *We can make the normal distribution a singular model by changing the parameterization. Let a cubicly-parameterized normal model be the model . Assume the true parameter is .
    1. Show that the cubicly-parameterized normal model is just as expressive as an ordinary normal model: that is, they both can express all univariate normal distributions.
    2. Repeat 1a) with this model; calculate the Fisher information matrix to demonstrate that the model is singular, and find which parameters  are singular.
    3. Repeat 1b) - 1d) to calculate the learning coefficient this model, for , and for .
      Recall that the learning coefficient is a volume scaling exponent, such that  [4] as . Based on this, interpret your results. How does this make the cubicly-parameterized normal model different from the ordinary normal model?
    4. Instead of taking  to get the learning coefficient, fix a small but nonzero value for , such as . Plot  as a function of . As we saw from c), the learning coefficient changes discontinuously when  - what happens with  as  gets close to zero? What changes if you make  smaller or larger?
      Even though the asymptotic learning coefficient () only changes when  exactly, note how the non-asymptotic volume ( finite) is affected in a larger neighborhood.
  3. *For each of the two different  formulas obtained in 1b) and 2c), create plots of  and , with  and  as variable sliders (e.g. using Desmos). Note how these functions change as  and  change. Compare results between the  obtained in 1b) and the  obtained in 2c).
  4. Consider yet another alternative parameterization of the normal distribution, this time with two parameters: . Assume the true distribution is realizable with true parameters  and .
    This type of model is called minimally singular, for reasons that will become clear shortly. Minimally singular models can be dealt with easier than more general singular models.
    1. Show that this model is just as expressive as an ordinary normal model; they both can express all univariate normal distributions.
    2. Calculate the Fisher information matrix of the model and show that the model is singular. Further, calculate the eigenvectors of the matrix, and interpret them.
    3. Calculate the learning coefficient using the same procedure as 2c). Note that you will need a proper prior  this time, in order for the integral to converge: a uniform prior over the unit square should work, i.e. .
    4. As we saw from part b), this model has one direction in parameter space which changes the distribution , and one direction in parameter space which does not - and this is true globally. It seems sensible to just ignore the direction in parameter space which is globally redundant. Apply the formula for the learning coefficient of regular models, but use  instead of . Compare your result to part c).
    5. It turns out that the strategy from part d), merely ignoring or quotienting out redundant submanifolds, works for any minimally singular model. However, many singular models, including neural networks, are not minimally singular. Why does this strategy not work for singular models in general? What would stop you from applying it to the model in problem 2), for instance?
  5. Consider a two-parameter alternatively-parameterized normal distribution: , where  are fixed constants. Assume the true distribution is realizable with true parameters  and .
    1. Calculate the Fisher information matrix of the model. For what parameters is the model singular? For singular parameters, when is the rank of the matrix zero, and when is it one?
    2. Compute the learning coefficient using the same procedure as 1b), for .
    3. In the Gray Book, Watanabe proves that any model with KL divergence of the form , known as normal crossing form, has learning coefficient  (in the case where the prior is non-vanishing). This result is more than a special case: Watanabe showed that any statistical model may (locally) be converted into normal crossing form, via a change of variables from algebraic geometry known as resolution of singularities.
      Show that your answer from part b) matches Watanabe's formula.
    4. The general form of the asymptotic volume scaling formula is , where  is the learning coefficient and  is the multiplicity. So far we have been able to ignore the multiplicity, because we have only considered examples where . However, in this example, if , then the multiplicity . Show that the general form of the volume scaling formula holds when .
  6. *Repeat 3), but with the KL divergence  from problem 5) instead. Vary , and , and see how the functions change. In particular, note what happens near . (Math3D is helpful here.)
  7. Many practical machine learning models are conditional - they have inputs and outputs. Sometimes, as in the case of regression models, they may not naively appear to be probabilistic models at all. These properties are both demonstrated in univariate linear regression, a model given by , where  is the input,  is the slope parameter and  is the intercept parameter. [Example 1.4, Gray Book].
    1. Assume we are doing least-squares regression (the loss function is mean squared error loss). Show how to write this model as a probabilistic statistical model: that is, find a formula for the model's conditional probability  such that the negative log-likelihood yields mean squared error loss. (Hint.)
    2. Assuming the true distribution  is realizable with true slope  and true intercept , calculate the Fisher information matrix of this model. Use this to show that the model is regular.
  8. *Consider a constant-width, two-layer linear network , where  are the inputs,  are the outputs, and  are linear transformations[5]. The parameter space is  and the parameters are the pair .
    1. Show that this two-layer linear model is just as expressive as an ordinary linear model; they both can express all linear functions from  to .
    2. Let us examine the singular directions of this model; that is, directions in parameter space which do not change the function implemented by the model, to first order. More precisely, the parameter-function map can be seen as a (nonlinear) map , from parameters in  to linear transformations . A tangent direction  in parameter space is a singular direction at a parameter  if the directional derivative .[6]
      1. Assume  and  are full rank. What are the singular directions?
      2. Assume  and/or  are not full rank. What are the singular directions now? How does the number of singular directions depend on the rank of  and ?
      3. Interpret these results. How does this make the two-layer linear model differ from the ordinary linear model?
      4. Note that the condition that  and/or  are not full rank is zero measure in the parameter space, which could lead to the interpretation that such a scenario will never occur in practice, and we should ignore the new singular directions that occur here. Why is this interpretation misleading? What happens if  and/or  are almost not full rank (have very small eigenvalues)?
    3. In the realizable case, Aoyagi & Watanabe (2005) derived a formula for the learning coefficient , in terms of the rank and dimensions of the true parameters  and  [Main Theorem].
      1. Aoyagi & Watanabe do not assume the network is constant-width, like we do. Simplify their formula for  using this assumption.
      2. How does the learning coefficient  depend on , the rank of ? Compare with the results of part b).
      3. Repeat b.iii), but with  instead.
      4. Repeat b.iv), but with  instead. (Hint: see problem 2d).)
  9. The learning coefficient is a fairly global property of the KL divergence; in many cases, it is helpful to have a more local quantity. The local learning coefficient (LLC) of a parameter  is defined to be the learning coefficient of the model with the support of the prior  restricted to an arbitrarily small neighborhood of .
    For illustration, we will consider the parameterized normal model 
    for true parameter .
    1. Find and plot . Qualitatively, how does  look near  versus ?
    2. Compute and compare the LLC of  and . How do these values formalize the qualitative intuition found via a)?
    3. Compute the (global) learning coefficient of the model. In general, the global learning coefficient is the minimum of the local learning coefficients for the minima of . Confirm that this holds here.
    4. The global learning coefficient is the more important invariant for Bayesian learning, due to e.g. the free energy formula; why might the local learning coefficient be better suited for describing the complexity of a model where only a point value of  is estimated (e.g. using SGD)?
  10. *Up to this point, we have implicitly assumed that we have direct access to the true distribution , for instance when calculating the learning coefficient using the KL divergence . However, in practice, we can only access the true distribution indirectly, via  samples  drawn IID from . This is a significant gap.
    Still, much like e.g. air resistance in physics, moving from idealized population quantities to realistic empirical quantities adds new complications, but much of the fundamental intuitions continue to hold.
    1. We may use the negative log-likelihood [7] as a proxy for the KL divergence . Prove that , where  is the (constant) entropy of the true distribution.[8]
    2. Take the model from 2) with  and numerically sample from  for , and .
      1. Plot  compared to . How do the two differ, and what changes as  increases? 
      2. Plot the unnormalized Bayesian posterior[9]  compared to . How do the two differ, and what changes as  increases? It is much easier to reason about  theoretically compared to  - what does this suggest we can use  for?
    3. Take the model from 2) with  and numerically compute  as a function of  using . Verify that the proportionality  holds as expected, by plotting  versus .
    4. Define the empirical version of  as . Repeat c), but with  instead of . How do the results differ? How well does the proportionality  hold for larger , and how well does it hold for smaller ? Based on what you saw in part b), give an intuitive explanation for why  behaves differently for small .
  11. The tempered Bayesian posterior is the distribution , where  is a parameter known as the inverse temperature, and  is the normalizing constant known as the partition function or marginal likelihood. Note that the tempered Bayesian posterior results in the ordinary Bayesian posterior when . Qualitatively, describe what increasing  does to the tempered Bayesian posterior. How does it differ from the effect of increasing ?
  12. The free energy or log marginal likelihood is the quantity 
    obtained as the log of marginalizing the tempered Bayesian posterior at inverse temperature . Perhaps the most central result of SLT is the asymptotic expansion of the free energy:
    ,
    where the empirical entropy  [Gray Book, Main Formula II].
    1. Interpret the meaning of the free energy from the definition (it is the negative log of the marginal likelihood ). Why is it advantageous to choose a model with lower free energy on a given dataset?
    2. Take the model from 2) with , and numerically compute the free energy for many values of  between  and . Compare the calculated value to the estimate .
    3. Repeat b), but with .
    4. Repeat b), but with . Technically, the learning coefficient at  is still , the same as part c), but note how the behavior of the free energy seems closer to b) than c) for low . The effects of the singularity at  extend to the neighborhood around it.
  13. The local free energy of a subset of parameter space  is the free energy of the model with the parameter space restricted to . We show how phase transitions can occur when different local free energies change at different rates.
    1. Suppose we partition the overall parameter space into two subsets,  and  . Denote the overall free energy by  and the local free energies by  and , respectively. Show that the relationship  holds.
    2. Suppose  and . Plot the overall free energy . Compare against . What happens around ?
    3. In statistical physics, a phase transition is traditionally defined as a discontinuity (or rapid change, for finite-size systems) in the derivatives of the free energy. Plot  and explain why this justifies calling the phenomenon from b) a phase transition.
    4. Change the coefficients of the terms in b). How does this change when a phase transition occurs? Does a phase transition always occur?
  14. We sometimes informally refer to singular models as "models where you can (to first order) move in some direction in parameter space without changing the function implemented by the model." However, singular models are defined in terms of the Fisher information matrix; it may not be immediately clear how the informal statement follows.
    Suppose we have a (potentially nonlinear) regression model given by a map  from a parameter space   to a function space  with outputs in , for which we use mean squared error loss.[10] We may write this as a statistical model: 
    where  denotes the multivariate normal distribution and  is the identity matrix [See here]. 
    1. Show that  where  is the Jacobian of  with respect to .
    2. Suppose that the model is realizable such that the true parameter is ; that is,  for true input distribution . Prove that the Fisher information matrix is given by .
    3. Prove that a vector  is in the null space of the Fisher information matrix  if and only if it is in the null space of  for all  in the support of .
      Conclude that a regression model is singular at a parameter  if and only if there exists a vector  such that the directional derivative  for all inputs  in the support of .
    4. In the case where the support of  is the entire input space of , show that the null space of the Fisher information matrix  is equal to the null space of the total derivative .
  15. The naive method of determining "how singular a model is" at a particular parameter  is by counting the number of singular directions at ; more precisely, using the rank of the Fisher information matrix[11] at .
    1. Construct two statistical models where the rank of the Fisher information matrix at the true parameter  is the same, but the learning coefficient differs between the two. (Hint: modify the model from problem 2).)
      In this example, what information is the learning coefficient giving you that is missing from the rank of the Fisher information matrix?
    2. The rank of the Fisher information matrix does give some information, however. Show that if the rank of the Fisher information matrix at  is , the learning coefficient satisfies the lower bound . [Hard.]
  16. In practice, the Hessian of the loss function is a more directly visible quantity than the Fisher information matrix; however, the two are closely connected.
    1. Recall the population loss is given by . Suppose that the true distribution is realizable with true parameter . Prove that the Hessian of the population loss at  is equal to the Fisher information matrix at .
    2. Recall the empirical loss is given by , where  IID. The empirical Fisher information matrix is given by . Suppose that the true distribution is realizable with true parameter . Prove that the Hessian of the empirical loss at  is equal to the empirical Fisher information matrix at .
    3. Suppose that a direction  is singular at a parameter  - that is,  lies in the null space of the Fisher information matrix at . Prove that it also lies in the null space of the population Hessian and empirical Hessian at . (Note that unlike a) or b), you do not need to assume realizability or even that  is a local minimum of the loss.)
    4. If one finds empirically that the loss Hessian has many zero eigenvalues at a parameter, what does this suggest about the Fisher information matrix at that parameter, and consequently about how singular the model is?
  17. Here we prove that the learning coefficient is invariant to diffeomorphism; that is, roughly, a smooth and invertible change of variables. A diffeomorphism is an invertible map  between two spaces  such that both  and  are infinitely differentiable.
    1. Show that  is a diffeomorphism from  to , and  is a diffeomorphism from  to  . Explain why  is not a diffeomorphism from  to .
    2. Let  be a model defined on the parameter space . We apply a diffeomorphism  to create the model  defined on the parameter space . Abusing notation, let  denote the KL divergence of the model on  and  the KL divergence of the model on .  Define  and . Rewrite  as an integral over  using the change of variables formula.
    3. Using the result from part b) together with the fact that  is a diffeomorphism, show that there exist positive constants  such that .
    4. Using the result from c) together with the volume scaling formula for the learning coefficient  and its multiplicity , prove that  as . Conclude that the learning coefficient and multiplicity of the two models is the same, and thus invariant to diffeomorphism.
  18. The role of the partition function  and free energy  can be partially explained by their role as moment and cumulant generating functions of the posterior loss. This connection also leads to the scalable estimators of the free energy and learning coefficient used in practice.
    Let the random variable  be a sample from the tempered posterior . Then  is a real-valued random variable giving the negative log-likelihood of a randomly sampled parameter from the tempered posterior.
    1. Recall that the moment generating function of a random variable  with PDF  is given by , and the partition function is given by .
      1. Show that .
      2. The -th moment  of  is given by . Compute  for .
    2. Recall the cumulant generating function of a random variable  with PDF  is given by , and the free energy is given by .
      1. Show that .
      2. The -th cumulant  of  is given by . Show that  for .
    3. We may use b.ii) to estimate  or  empirically from samples, far more efficiently than naive methods. The estimator of  is known as the WBIC.
      1. Using b.ii), conclude that .
      2. The free energy formula can be written as . There exists an optimal inverse temperature  such that . Find an approximate expression for  using i) and assuming that the free energy formula holds after differentiating both sides, i.e..
      3. Instead of estimating , we may estimate the learning coefficient  directly. Use i) and the free energy formula to get an approximate expression for  in terms of , and . Is it still necessary to set  to an optimal inverse temperature for this estimator to work?
  19. A priori, it seems like the choice of prior distribution  could significantly affect the learning coefficient we get, via the volume scaling formula or otherwise. However, this is not the case - the learning coefficient remains the same so long as the prior is supported at  [Gray Book, Remark 6.7]. Thus, the prior has at most a secondary role in singular learning theory.
    To demonstrate this, redo 1c) and 1d) but with a different prior distribution that also has support at .

References

 

  1. ^

    Technically, if we follow Watanabe's terminology, a regular model is non-strictly-singular rather than non-singular. Watanabe defines singular models as a superset of regular models, so every regular model is also singular - non-regular models are referred to as strictly singular models.

  2. ^

    Technically, the learning coefficient is not the same thing as the real log-canonical threshold (RLCT); the learning coefficient is an invariant of a statistical system (model, truth, prior triplet), whereas the RLCT is an invariant of an analytic function. However, the RLCT of the model/truth KL divergence coincides with the learning coefficient if the prior is supported along the true parameter set.

  3. ^

    Note that in practice, analytically calculating or numerically estimating the learning coefficient directly via this volume scaling formula is completely intractable. Instead, methods based on the WBIC and MCMC are necessary.

  4. ^

    In the case where the multiplicity .

  5. ^

    This kind of two-layer linear model is sometimes called reduced rank regression.

  6. ^

    By 14d), this is equivalent to the null space of the FIM.

  7. ^

    Note that the Gray Book uses  to refer to the likelihood instead of the negative log-likelihood.

  8. ^

    Note that this pointwise expected value only tells us about  for a fixed ; it does not give us enough information to talk about properties of  which depend on many , like volume scaling, the free energy, etc. Establishing this is highly nontrivial, and the Gray Book spends a significant amount of time doing so.

  9. ^

    Implicitly, with improper prior , but the prior isn't important for intuition here.

  10. ^

    A neural network under mean squared error loss would satisfy these assumptions.

  11. ^

    In some cases, the rank of the Hessian may also be used here, given the correspondence between the FIM and the Hessian (see problem 16).

5 comments

Comments sorted by top scores.

comment by Dmitry Vaintrob (dmitry-vaintrob) · 2024-08-31T08:11:56.315Z · LW(p) · GW(p)

This is awesome!

comment by JamesH (AtlasOfCharts) · 2024-09-22T11:06:16.790Z · LW(p) · GW(p)

I think there's a mistake in 17: \sin(x) is not a diffeomorphism between (-\pi,\pi) and (-1,1) (since it is e.g. not bijective between these domains). Either you mean sin(x/2) or the interval bounds should be (-\pi/2, \pi/2)

Replies from: zfurman
comment by Zach Furman (zfurman) · 2024-09-24T06:08:31.595Z · LW(p) · GW(p)

Good catch, thanks! Fixed now.

comment by Sheikh Abdur Raheem Ali (sheikh-abdur-raheem-ali) · 2024-08-31T09:25:16.577Z · LW(p) · GW(p)

I wish this was available as a jupyter notebook.