The Laws of Large Numbers

post by Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-04T11:54:16.967Z · LW · GW · 5 comments

Contents

  Introduction
  Review of the central limit theorem (as a law of large numbers correction)
    The law of large numbers
    The central limit theorem
  The third law and beyond
    Higher cumulants and higher laws
    Multiple random variables
  Connections to neural nets
  Connections to physics and the stationary phase formula
5 comments

Introduction

In this short post we'll discuss fine-grained variants of the law of large numbers beyond the central limit theorem. In particular we'll introduce cumulants as a crucial (and very nice) invariant of probability distributions to track. We'll also briefly discuss parallels with physics. This post should be interesting on its own, but the reason I'm writing it is that this story contains a central idea behind one point of view on a certain exciting physics-inspired approach to neural nets. While this point of view has so far been explained in somewhat sophisticated physics language (involving quantum fields and Feynman diagrams), the main points can be explained without any physics background, purely in terms of statistics. Introducing this "more elementary" view on the subject is one of the core goals of this series of posts. This first post is purely introductory, and other than "ideological" parallels, it has essentially no physics (only statistics).

Review of the central limit theorem (as a law of large numbers correction)

The law of large numbers

Most people intuitively know the law of large numbers: that if you take $n$ independent measurements in a random process and average them, the average will converge to a specific number as $n$ goes to infinity, namely the expectation of this process, $\mathbb{E}[X]$.

The law of large numbers can be split up into two parts, the first of which doesn’t depend on n going to infinity. Namely:

  1. The cumulative property of means, which itself consists of two parts:
    1. Additivity of means. The expectation of the sum of $n$ random variables, $X_1 + \dots + X_n$, is equal to the sum of their expectations: $\mathbb{E}[X_1 + \dots + X_n] = \mathbb{E}[X_1] + \dots + \mathbb{E}[X_n]$. Here the variables don’t even have to be independent[1] or drawn from the same distribution.
    2. Linear homogeneity of means. For any real constant $c$, the expectation of the rescaled variable $cX$ is equal to the rescaled expectation, $\mathbb{E}[cX] = c\,\mathbb{E}[X]$.
  2. Existence of exact limit of averages. The average $\frac{1}{n}(X_1 + \dots + X_n)$ of $n$ independent and identically distributed random variables has a well-defined limit as $n \to \infty$, and this limit is a delta distribution (i.e., concentrates all probability at a single number).

Here a random variable $X$ is a probability distribution on real numbers: the standard way of abstracting out the notion of taking a measurement in a random process. Here and later, there are some analytic conditions one should impose on the random variables, and the notion of convergence of a sequence of random variables is a slightly complicated one; we sweep these issues under the rug. Generally, a random variable can be understood as a probability density function $p(x)$ on the reals which takes nonnegative values and integrates to 1, so \[\int_{-\infty}^{\infty} p(x)\,dx = 1.\] This encodes the familiar fact that probabilities sum to 1. Here $p(x)$ measures the probability density. However, some singular limits of functions (called distributions) are allowed, and behave well with respect to the limits we will care about here so long as their tails are suitably well-behaved.

The fact that the law of large numbers follows from the two above properties is obvious, but let’s quickly spell it out. First, applying additivity and homogeneity, we see that the mean of the average of $n$ copies of $X$ is equal to the average of $n$ copies of the mean $\mathbb{E}[X]$, i.e., \[\mathbb{E}\left[\frac{X_1 + \dots + X_n}{n}\right] = \frac{\mathbb{E}[X] + \dots + \mathbb{E}[X]}{n} = \mathbb{E}[X]\] (we’re averaging $n$ copies of the same number). Existence of the limit means that as $n$ goes to infinity, these averages have a deterministic limit. Since a deterministic random variable is determined by its mean, we see this limit is $\mathbb{E}[X]$.
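As a quick sanity check (not needed for the argument), here is a minimal numerical sketch of the law of large numbers; the exponential distribution and the sample sizes are arbitrary illustrative choices, and the snippet assumes numpy is available.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary example distribution: Exponential(1), which has E[X] = 1.
true_mean = 1.0

for n in [10, 100, 10_000, 1_000_000]:
    samples = rng.exponential(scale=1.0, size=n)
    print(f"n = {n:>9}: average = {samples.mean():.4f}   (E[X] = {true_mean})")
```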

The central limit theorem

Most people reading this will also know the standard refinement of the law of large numbers, which is the central limit theorem. This theorem states that the average of $n$ independent variables drawn from the same distribution can be approximated much better than by a delta distribution, by the Gaussian distribution

\[\mathcal{N}\left(\mathbb{E}[X],\ \frac{\mathrm{Var}(X)}{n}\right):\]

here the variance is the expectation $\mathbb{E}[X^2]$ if $X$ has mean zero, and otherwise is the expectation of the square of the mean-zero shift: \[\mathrm{Var}(X) = \mathbb{E}\left[(X-\mathbb{E}[X])^2\right].\]

Once again, the central limit theorem can be decomposed into two (new) results, the first of which, fully analogous to the cumulative property of means, holds more generally (in particular, not just in the limit):

  1. The cumulative property of variance:
    1. Additivity of variance. The variance of the sum of n independent random variables is the sum of their variances.
    2. Quadratic homogeneity. Variances behave quadratically under scaling, so for any real constant $c$, $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$.
  2. Gaussianity of the normalized limit. If $X$ is a random variable with mean $\mu$ and $X_1, X_2, \dots$ are iid variables distributed like $X$, then as $n$ goes to infinity, the sequence of normalized random variables \[\frac{(X_1 - \mu) + \dots + (X_n - \mu)}{\sqrt{n}}\] has a well-defined limit, and this limit is a Gaussian.

Using these items (along with the additivity of means from the previous part), we can deduce the central limit theorem. Indeed, without loss of generality we can assume that we are working with a random variable $X$ with zero mean (since adding a constant to $X$ results in adding the same constant to the average of $n$ independent draws of $X$). The normalized sum $\frac{X_1 + \dots + X_n}{\sqrt{n}}$ is now a probability distribution with mean zero, and applying the two parts of the cumulative property, we see that each also has variance $\mathrm{Var}(X)$ (the square root is there because of the quadratic part of homogeneity).

Thus the limit of the normalized sums, if it exists, must be a random variable with mean 0 and variance $\mathrm{Var}(X)$. Now Gaussianity tells us that the limit indeed exists and is a Gaussian. Since a Gaussian is fully determined by its mean and variance, we are done.
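Here is a minimal simulation of the above (my own illustration, with an arbitrary non-Gaussian starting distribution): the normalized sums of mean-zero iid draws have mean roughly 0, variance roughly $\mathrm{Var}(X)$, and drift toward a Gaussian as $n$ grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Arbitrary non-Gaussian example: centered Exponential(1), so E[X] = 0 and Var(X) = 1.
def draw_X(size):
    return rng.exponential(scale=1.0, size=size) - 1.0

n_trials = 20_000  # number of independent normalized sums to form
for n in [1, 5, 50, 500]:
    sums = draw_X((n_trials, n)).sum(axis=1) / np.sqrt(n)
    # Kolmogorov-Smirnov distance to the N(0, Var(X)) = N(0, 1) distribution.
    ks = stats.kstest(sums, "norm", args=(0.0, 1.0)).statistic
    print(f"n = {n:>3}: mean = {sums.mean():+.3f}, var = {sums.var():.3f}, "
          f"KS distance to N(0,1) = {ks:.3f}")
```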

The third law and beyond

If you’ve read the above two sections, you can probably guess where I’m going. If we think of the central limit theorem as a second-order “correction” to the law of large numbers that takes into account quadratic information about our random variable X, then there should be a “third-order” correction, which takes into account cubic information. I’m going to skip writing the law down directly and jump straight to the equivalent decomposed version, which is easier to work with. The first part is a particularly straightforward extension of the “cumulative properties” that we’ve seen so far, and involves the third cumulant, which is $\kappa_3(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^3\right]$ (so we “adjust” $X$ to have mean zero, then take the expectation of the cube, i.e., the third central moment). Indeed, we have already seen the first and second cumulants: $\kappa_1(X) = \mathbb{E}[X]$ is the mean and $\kappa_2(X) = \mathrm{Var}(X)$ is the variance.

We now have

  1. The cumulative property of the third cumulant.
    1. Additivity. The third cumulant behaves additively when adding together independent variables: $\kappa_3(X_1 + \dots + X_n) = \kappa_3(X_1) + \dots + \kappa_3(X_n)$.
    2. Cubic homogeneity. The third cumulant is homogeneous under rescaling, with $\kappa_3(cX) = c^3\,\kappa_3(X)$ (both properties are checked numerically in the sketch just after this list).
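Both properties are easy to check numerically. Here is a minimal sketch using scipy's k-statistics (unbiased cumulant estimators); the two example distributions are arbitrary, and the exact third cumulants quoted in the comments are computed by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 2_000_000  # many samples, so the cumulant estimates are reasonably tight

# Two arbitrary independent, skewed example variables.
X = rng.exponential(scale=1.0, size=N)       # kappa_3(Exp(1)) = 2
Y = rng.gamma(shape=3.0, scale=0.5, size=N)  # kappa_3(Gamma(3, 0.5)) = 3 * 2 * 0.5**3 = 0.75

k3 = lambda Z: stats.kstat(Z, 3)  # k-statistic: unbiased estimator of the 3rd cumulant

print("kappa_3(X) + kappa_3(Y) ~", k3(X) + k3(Y))
print("kappa_3(X + Y)          ~", k3(X + Y))   # additivity (X and Y are independent)
print("8 * kappa_3(X)          ~", 8 * k3(X))
print("kappa_3(2X)             ~", k3(2 * X))   # cubic homogeneity with c = 2
```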

Now what should we write for part 2? A naive guess might be that we’re now writing some kind of asymptotic formula for a different equivariant average, perhaps $\frac{X_1 + \dots + X_n}{n^{1/3}}$. But unfortunately that doesn’t work. Indeed, as before we can assume for free that $X$ has zero mean. Now if $X$ has nonzero second moment, then the new normalization above cannot have a limit, since we know that already when dividing by $\sqrt{n}$ we have a well-defined limit (a Gaussian), so if we changed the normalization this would just smear it out and not give a reasonable distribution. Perhaps, then, the thing to do is to assume that $X$ has zero variance? But unfortunately here the limitations of reality make this uninteresting, as any distribution with zero variance is a deterministic delta distribution.[2]

Instead, the next step in the sequence must be perturbative: we will not say anything new about the limit of any normalization of the sum variable $X_1 + \dots + X_n$, but rather we will give an asymptotic correction to the law of large numbers at finite $n$, accurate up to higher-order corrections. With this in mind, let’s write down the new limit result:

2. Third perturbative limit form. Assume that our random variable $X$ has mean zero. Then there exists a cubic polynomial $P(x)$ (independent of $n$) with the following property: the probability density function associated to the usual normalized sum variable $\frac{X_1 + \dots + X_n}{\sqrt{n}}$ has the asymptotic form \[G(x)\left(1 + \frac{P(x)}{\sqrt{n}}\right)\] up to a lower-order error term of order $1/n$, where (abusing notation) I’m writing $G$ both for the Gaussian probability distribution and its probability density.

This is the third-order “correction” to the law of large numbers. It takes some unpacking. First, we did a bit of flipping from the Gaussian random variable to its associated probability distribution, which is always a bit of a headache. However, the way to think about this is that we just introduced a new class of probability distributions beyond Gaussians, which are Gaussians times a polynomial. We’re now looking for an asymptotic form of this type, where the polynomial has a constant part $P_0$ that is independent of n and a "perturbative" part $P_1$ that scales like $n^{-1/2}$. Of course as n goes to $\infty$, the "perturbative" term goes to zero. Thus by the usual central limit theorem, we must have $P_0 = 1$; otherwise we'll get the wrong limit. Finally, note that though it is scaled by a small number, for any finite n, the polynomial will eventually be negative, which technically isn’t allowed for probability distributions. It turns out that this is ok, since the place where this happens is so far away that the Gaussian tail contributes much less than the allowable order of error to the probability distribution. However this accentuates the point that being rigorous about limits and asymptotics of probability distributions is tricky and requires some analytic formalism, which as before we’ll completely rug-sweep and ignore. (The mathematicians in the audience may notice here that I am behaving like a physicist.)

Now, with all of this information in place, I claim that finding the value of $P(x)$ is an easy exercise. Indeed, writing $P(x) = a + bx + cx^2 + dx^3$, there are four free real parameters, $a$-$d$, giving a four-dimensional family of possibilities for the limit. We can check that all three cumulants (i.e., the mean, variance and third cumulant) of the limit are linear functions in $a$-$d$; the cumulative property of the cumulants thus gives us three linear equations on $a$-$d$. We get a fourth linear equation from the normalization requirement that the total probability integrates to 1. At the end of the day, we have four equations on four variables. These are solvable, and we get a formula for the first-order “cubic Gaussian” correction. I don’t want to derive this formula here, but see the formulas on the second page of this pdf for the resulting formula (the pdf also gives a more rigorous derivation).
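To make this concrete without re-deriving it: for the standardized sum $\frac{X_1 + \dots + X_n}{\sigma\sqrt{n}}$ (a slightly different normalization than above), the resulting correction is the standard first-order Edgeworth term $G(x)\left(1 + \frac{\kappa_3}{6\sigma^3\sqrt{n}}(x^3 - 3x)\right)$, with $G$ the standard Gaussian density. The sketch below (my own illustration, with arbitrary choices of distribution and $n$) compares a brute-force histogram against the plain Gaussian and against this cubic-corrected Gaussian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Example variable: centered Exp(1), so kappa_3 = 2 and sigma = 1 (arbitrary choice).
kappa3, sigma = 2.0, 1.0
n = 10                  # number of iid terms in each sum
n_trials = 2_000_000    # number of independent sums for the histogram

# Standardized sums (X_1 + ... + X_n) / (sigma * sqrt(n)) for mean-zero X.
X = rng.exponential(1.0, size=(n_trials, n)) - 1.0
S = X.sum(axis=1) / (sigma * np.sqrt(n))

# Empirical density vs. plain Gaussian vs. the cubic-corrected Gaussian
# G(x) * (1 + kappa_3 / (6 sigma^3 sqrt(n)) * (x^3 - 3x)).
edges = np.linspace(-2.5, 2.5, 11)
hist, _ = np.histogram(S, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

gauss = stats.norm.pdf(centers)
cubic = gauss * (1 + kappa3 / (6 * sigma**3 * np.sqrt(n)) * (centers**3 - 3 * centers))

for x, h, g, c in zip(centers, hist, gauss, cubic):
    print(f"x = {x:+.2f}: empirical {h:.4f}, Gaussian {g:.4f}, cubic-corrected {c:.4f}")
```

Even at $n = 10$ the cubic correction visibly closes most of the gap between the plain Gaussian and the empirical histogram, especially around the skewed shoulders of the distribution.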

Aside: note that the first-order correction to the central limit theorem involves a cubic polynomial. On the one hand this makes sense, since we’re keeping track of up to the third cumulant. But on the other hand, the previous “correction”, namely the central limit theorem itself, doesn’t have a second-order polynomial scaling the Gaussian. One way to explain this is that in the perturbative formulas we’re generating, the Gaussian term already absorbs into itself any first- and second-order information: remember that we got the parameters of the Gaussian by fitting the mean and variance to be correct.

Higher cumulants and higher laws

We get higher laws similarly. For each degree $d$, we start out with the $d$th cumulant, which can always be expressed in terms of the moments: \[\kappa_d(X) = \mathrm{poly}(m_1, \dots, m_d),\] where “poly” denotes some fixed polynomial (independent of $X$) and $m_k = \mathbb{E}[X^k]$ is the $k$th moment (a concrete recursion for this polynomial is sketched after the list below). Note here that flipping the formula (and iteratively expanding) lets you express the moments as polynomials of cumulants, and so cumulants and moments are two interchangeable series of “summary statistics” associated to a variable, with one or the other being better depending on context. The key property of the cumulant is as before, the “cumulative property”, i.e.,

  1. Additivity: $\kappa_d(X_1 + \dots + X_n) = \kappa_d(X_1) + \dots + \kappa_d(X_n)$ for independent $X_1, \dots, X_n$.
  2. Homogeneity: $\kappa_d(cX) = c^d\,\kappa_d(X)$.
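Here is the promised concrete form of the moment-to-cumulant polynomial, via the standard recursion $\kappa_d = m_d - \sum_{k=1}^{d-1}\binom{d-1}{k} m_k \kappa_{d-k}$ (with $m_0 = 1$); the exponential check at the end is an arbitrary example.

```python
import math

def cumulants_from_moments(m):
    """Given raw moments m[0..d] with m[0] = 1, return cumulants kappa[0..d] via the
    recursion kappa_n = m_n - sum_{k=1}^{n-1} C(n-1, k) * m_k * kappa_{n-k}."""
    d = len(m) - 1
    kappa = [0.0] * (d + 1)
    for n in range(1, d + 1):
        kappa[n] = m[n] - sum(math.comb(n - 1, k) * m[k] * kappa[n - k] for k in range(1, n))
    return kappa

# Check against Exponential(1): raw moments m_k = k!, cumulants kappa_k = (k - 1)!.
moments = [float(math.factorial(k)) for k in range(6)]  # [1, 1, 2, 6, 24, 120]
print(cumulants_from_moments(moments))                  # expect [0.0, 1.0, 1.0, 2.0, 6.0, 24.0]
```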

Now for the “order d” correction, we write down a general form of the correction, working with probability density functions $p(x)$:

\[p_{\frac{X_1 + \dots + X_n}{\sqrt{n}}}(x) \approx G(x)\left(1 + \frac{P_3(x)}{n^{1/2}} + \frac{P_4(x)}{n} + \dots + \frac{P_d(x)}{n^{(d-2)/2}}\right).\]

(As before, $G$ denotes the usual limit Gaussian, for a mean-zero variable $X$.) In general, the $d$th correction term is the perturbative order $n^{-(d-2)/2}$, times a degree $d$ polynomial $P_d(x)$ in $x$ that depends on the first through $d$th cumulants of $X$. There is lots of pretty deep combinatorics (that I don’t know well) in the resulting formulas, involving Hermite polynomials (familiar as the natural quantum perturbations of the harmonic oscillator in physics – this is not a coincidence!) and the Edgeworth series. The degree-$d$ expansion has terms of order up to $n^{-(d-2)/2}$ and is correct up to an error of order $n^{-(d-1)/2}$ (though as before, since probability distributions can be singular, one needs to be careful when interpreting the meaning of “size of error term” rigorously).

One might hope that this will give a Taylor series for the distribution of the normalized sum $\frac{X_1 + \dots + X_n}{\sqrt{n}}$, which might converge even for $n = 1$. In fact, this is not generally the case: this expansion is fundamentally an asymptotic expansion (i.e., it might diverge, or converge to the wrong value, if we take the number of terms to $\infty$ instead of taking $n$ to $\infty$). However the convergence is quite good in practice. (Note that here I was supposed to have a diagram of some examples comparing the true sum distribution to the Edgeworth approximations; after fighting with chatgpt for an hour and not getting correctly-normalized graphs, I’m going to use my prerogative of publishing unpolished drafts.)

Multiple random variables

So far, we’ve been looking throughout at a single random variable $X$, which is a probability distribution on “one-dimensional” values in $\mathbb{R}$. When we actually apply these techniques to physics-flavored analyses of LLM’s, it will be very important that we have some fixed number (say, $D$) of random variables (associated to different training examples), and these are not independent.

It turns out that all of the analysis we worked out applies almost verbatim in this case. The key difference is that now we should conceptualize both the random variable $X$ and the sum variable $X_1 + \dots + X_n$ as vector-valued, i.e., probability distributions on $\mathbb{R}^D$. Once we do this, we once again have a law of large numbers and a central limit theorem (with the difference being that the variance is now no longer a positive number, but a positive-definite $D \times D$ covariance matrix). We can once again write down a normalized limiting Gaussian as the second-order approximation to our variable, and then the third- and higher-order approximations will multiply the corresponding Gaussian by polynomials of appropriate degree, now in $D$ variables. Otherwise, the story is exactly the same. We look at cumulants, write down polynomial corrections of appropriate order, and get an expansion.
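A minimal vector-valued illustration (my own, with an arbitrary correlated two-dimensional example): the covariance matrix of the normalized sums matches the $D \times D$ covariance matrix of $X$, exactly as the one-dimensional variance did.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary mean-zero 2-dimensional variable with correlated, non-Gaussian coordinates:
# U ~ Exp(1) and X = (U - 1, U**2 - 2).
def draw_X(size):
    u = rng.exponential(1.0, size=size)
    return np.stack([u - 1.0, u**2 - 2.0], axis=-1)

# Covariance matrix of X itself, estimated from a large direct sample.
cov_X = np.cov(draw_X(1_000_000).T)

# Covariance matrix of the normalized sums (X_1 + ... + X_n) / sqrt(n).
n, n_trials = 200, 20_000
sums = draw_X((n_trials, n)).sum(axis=1) / np.sqrt(n)
cov_sums = np.cov(sums.T)

print("Cov(X):\n", np.round(cov_X, 2))
print("Cov of normalized sums:\n", np.round(cov_sums, 2))  # approximately the same matrix
```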

Connections to neural nets

This will be explained in much more depth in future posts, but I’ll explain very briefly the reason one might care about extending the law of large numbers for studying (realistic) neural nets. Namely, a standard entry point for physics techniques into neural nets is the “large-width” limit, where the number of neurons (corresponding to our number of independent variables in the large-number expansions above) is large. At initialization, weight parameters are uncorrelated (leading to evident iid behaviors), and as learning occurs, the relative probabilities of the parameter choices are suitably updated. Now for much of this process, it is still reasonable to model parts of the process as sums of independent random variables (this is because even during learning, a lot of what happens just consists of taking an activation, applying a function to it, rescaling it by a weight, and summing a bunch of these together in a "close enough to iid" way). Now taking only the second-order approximation -- i.e., the usual central limit theorem -- leads to modeling the neural net as a Gaussian process. This implies a certain picture of learning that is nontrivial (it can learn simple “clusterable” real-life classification problems like MNIST), but highly limited in terms of what it can learn (in some sense, it can only do clustering, and can’t use any more "interesting" geometric properties of inputs). 
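To make the "sum of close-enough-to-iid terms" picture concrete, here is a minimal sketch (my own illustration, with arbitrary architecture and initialization choices): at a fixed input, the output of a randomly initialized one-hidden-layer network is a sum of width-many roughly-iid terms, so across random initializations it looks increasingly Gaussian as the width grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def random_net_outputs(width, x, n_inits):
    """Scalar output of a randomly initialized 1-hidden-layer tanh network at a fixed
    input x, over n_inits independent initializations (1/sqrt(fan-in) scaling)."""
    d_in = x.shape[0]
    W1 = rng.normal(size=(n_inits, width, d_in)) / np.sqrt(d_in)
    b1 = rng.normal(size=(n_inits, width))
    W2 = rng.normal(size=(n_inits, width)) / np.sqrt(width)
    h = np.tanh(W1 @ x + b1)       # hidden activations, shape (n_inits, width)
    return (W2 * h).sum(axis=1)    # output = sum of `width` roughly-iid terms

x = np.array([0.3, -1.2, 0.7])     # an arbitrary fixed input
for width in [2, 8, 64, 512]:
    out = random_net_outputs(width, x, n_inits=10_000)
    # Excess kurtosis near 0 indicates approximate Gaussianity across initializations.
    print(f"width = {width:>3}: excess kurtosis = {stats.kurtosis(out):+.3f}")
```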

A priori, looking at higher perturbative terms only perturbs the resulting predictions by a small parameter. However, for some critical hyperparameter choices (which turn out to actually be preferred by efficient learning algorithms), one particular class of corrections (namely, the fourth-order ones) gets into a self-reinforcing loop and becomes dominant in controlling the large-scale behavior, and this leads to interesting new phenomena. This is very much not an explanation of the whole theory, but should be taken as an advertisement/appetizer for future write-ups.

Connections to physics and the stationary phase formula

 

The idea of this series of posts is to remove or defang the “physics” part of the “physics of LLM’s” ideas inherent in papers such as the beautiful “PDLT” paper. However, I can’t resist quickly giving a (slightly more mathy) addendum here, that explains a direct connection between “law of large numbers” corrections and physical perturbation laws (including ones related to Feynman diagrams). This section will be more math-heavy at the end, and can be safely skipped.

The first “moral” point though can be explained without math. You see, a perennial concern of physicists (to which all of physics can sort of be reduced) is computing the so-called “Feynman path integral” of some energy functionals. This integral is in general nasty, undefined (in the sense of diverging due to various infinities) and undefinable (in the sense of the very process of Feynman integration being mathematically self-contradictory if you impose any meaningful properties), but physicists love and use it all the time. 

Now just like the “sum of iid variables” example we worked out here, the way physicists approach these is in terms of a sequence of “perturbative” approximations in some parameter (called the “coupling constant” or "perturbative parameter"). To first order, physics is classical and you only care about the “deterministic” limit of the theory, which can be defined and worked with pretty nicely. The magic happens when you look at second-order behaviors (for a suitable notion of "order"). Here the physicists claim (after intoning a special ritual and sprinkling some incense, which in physics circles is what passes for rigor) that in nice cases, if you look at a suitably quadratic approximation of the energy function, then the Feynman integral should be a particular Gaussian (or a complex-valued analog of a Gaussian). And once they sell you this snake oil, then they say that well, a lot of interesting energy functions are close to being second-order, and we can therefore perturb the Gaussian to fit some higher-order behaviors. And just like in our law of large numbers example, instead of passing to some new class of functions beyond Gaussians, all higher corrections are incorporated as polynomial “corrections” times the original quadratic Gaussian approximation (known as the “free theory”). 

Now though the Feynman integral formalism as used by physicists is arcane and buggy due to being very infinite-dimensional, it is based on a much more rigorously established property of certain perturbative Gaussian integrals in finite dimensions, called the “stationary phase” principle. The stationary phase principle says that, for a small perturbative parameter, certain quantum-mechanical integrals are well approximated by a formula involving higher derivatives of the energy function at its stationary points (i.e., points with zero derivative). The quantum “stationary phase” principle also has a statistical analog. Here one takes thermodynamic integrals instead of quantum ones, and the small "perturbative" parameter in this context is the temperature (rather than the coupling constant). In this case the integral is similarly dominated by terms at stationary points, with the added requirement that they be maxima[3] rather than minima or (in higher-dimensional contexts) saddle points. There is also a “mixed” form of the stationary phase formula, with separate imaginary (quantum) and real (statistical) energy components.

Now it turns out that the corrections to the central limit theorem can be precisely explained as higher-order versions of this ‘mixed’ stationary phase formula applied to the Fourier transform of the probability density function of a random variable.

The key pair of results needed to make the connection are as follows.

Let $X$ be a random variable with probability density function $p(x)$. Let $u(t) = \hat{p}(t) = \int e^{itx} p(x)\,dx$ be the Fourier transform. Then

  1. $u$ has maximal absolute value $|u(0)| = 1$ at $t = 0$, and unless $X$ is a lattice variable (i.e., $X$ only takes values in a fixed arithmetic progression, as for instance when $X$ takes finitely many equally spaced values), $u$ attains this maximal absolute value nowhere else.
  2. The Fourier transform of (the probability density of) the sum random variable $X_1 + \dots + X_n$ is equal to the $n$th power of the initial Fourier transform, i.e.: \[\widehat{p_{X_1 + \dots + X_n}}(t) = u(t)^n.\]

From this it follows that we can write down a new complex-valued “energy” function $h(t) = \log u(t)$ with a stationary point (with maximal real part) at $t = 0$, and then for large $n$, the $n$th sum variable has Fourier transform $u(t)^n = e^{n\,h(t)}$, related to a low-temperature limit with a temperature parameter $1/n$. Under this point of view, one can now express values of the probability density function of $X_1 + \dots + X_n$ in terms of certain temperature-$1/n$ expectations of the energy function $h$, which are well-approximated (at small values of $1/n$) by a stationary phase expansion. This stationary phase expansion now exactly recovers the cumulant-order expansion for the sum variable that I described in the previous section; this makes explicit the connection between the approximations we saw and similar perturbative expansions studied by physicists.
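Both facts are easy to see numerically. Here is a minimal sketch using the empirical characteristic function $u(t) = \mathbb{E}[e^{itX}]$, with an arbitrary continuous example distribution: $|u(t)|$ peaks at $t = 0$, and the Fourier transform of the $n$-fold sum matches $u(t)^n$ up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 400_000, 5

# Example: centered Exp(1) samples (an arbitrary continuous, non-lattice choice).
X = rng.exponential(1.0, size=(N, n)) - 1.0

def ecf(samples, ts):
    """Empirical characteristic function u(t) = E[exp(i t X)] estimated from samples."""
    return np.exp(1j * np.outer(ts, samples)).mean(axis=1)

ts = np.linspace(-4, 4, 9)
u_X = ecf(X[:, 0], ts)            # Fourier transform of a single copy of X
u_sum = ecf(X.sum(axis=1), ts)    # Fourier transform of the sum X_1 + ... + X_n

for t, a, b, c in zip(ts, np.abs(u_X), u_sum, u_X**n):
    print(f"t = {t:+.1f}: |u(t)| = {a:.3f}   "
          f"u_sum(t) = {b.real:+.3f}{b.imag:+.3f}j   "
          f"u(t)^n = {c.real:+.3f}{c.imag:+.3f}j")
```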

 

  1. ^

    Note that in higher-order iterations of this result, we will also assume that the variables are independent (though they still won't have to be drawn from the same distribution). The fact that means are additive for non-independent variables is a very special property of means, and of means only.

  2. ^

    One could ask whether replacing real-valued random variables by complex-valued ones (where $\mathbb{E}[X^2]$ could be zero) would make this interesting. But this ends up still not working. Even if we assume that $\mathbb{E}[X] = 0$, the formally defined value of $\mathbb{E}[X^2]$ no longer serves the purpose of the variance (we can still write down a law of large numbers, and its corrections -- see the later section on vector-valued random variables).

  3. ^

    There are potentially confusing sign conventions here. With usual conventions, you actually take the minimum of the energy, but for our purposes it will be a little easier to take the convention where the maxima are relevant. Since the treatment in this section is entirely impressionistic and formula-free, this detail is mostly academic. 

5 comments

Comments sorted by top scores.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-01-04T14:49:28.970Z · LW(p) · GW(p)

Wonderful.

I do remember learning with a shock all the extremely confusing physicist talk about feynmann diagrams and connected correlators was just about cumulants of multivariate gaussians. One wonders how much faster and deeper one could learn theoretical physics if somebody could write a sensible exposition shorn from vague terms like energy, temperature, connected correlators, propagators and particles...

Anyway.

I don't know about these low temperature perturbative expansions. In SLT one is interested in a tempered Boltzmann distribution... do you see a way in which this perturbative expansion story might come into play or is a no go because of singularities ? (Hence failure of gaussianity)

Replies from: dmitry-vaintrob
comment by Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-06T00:04:41.231Z · LW(p) · GW(p)

Yes, I actually thought about this a bit. It is definitely the case that the LC (or RLCT) in the SLT context is also exactly a (singular) stationary phase expansion. Unfortunately, the Fourier transform of a random variable, including a higher-dimensional one, really does have an isolated nondegenerate maximum at 0 (unless the support of your random variable is contained in a union of linear subspaces, which is kinda boring/ reducible to simpler contexts). Maybe if you think about some kind of small perturbation of a lower-dimensional system, you can get some components of the singular free energy expansion, but the expansion relevant here is really nonsingular. This is also the type signature of the expansion you see in most physical QFT systems, at least if they have a perturbative form (in which case, the free theory will in general be nondegenerate).

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-01-06T00:56:48.895Z · LW(p) · GW(p)

Sorry these words are not super meaningful to me. Would you be able to translate this from physics speak ?

comment by transhumanist_atom_understander · 2025-01-04T13:28:31.927Z · LW(p) · GW(p)

is the correction to the probability density function really what you want, and are other deviations from Gaussianity expressible with cumulants? All I can think of is that the Gaussian is the maximum entropy distribution so maybe there's a formula for how far below the maximum entropy you are. I don't know what it'd be good for though.

Replies from: dmitry-vaintrob
comment by Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-06T00:09:13.795Z · LW(p) · GW(p)

I'm not exactly sure about "what you want". It is not the case that you can exactly reconstruct most probability distributions you'll encounter in real life from their moments/ cumulants (hence the expansion is perturbative, not exact).

But in the interpretability/ field-theoretic model of wide NN's point of view, this is what you want (specifically, the fourth-order correction)