A hundredth of a bit of extra entropy

post by Adam Scherlis (adam-scherlis) · 2022-12-24T21:12:41.517Z · LW · GW · 4 comments

Contents

  Continued fractions
  The Gauss-Kuzmin distribution
  Lochs's theorem
  The difference

There are two ways to calculate the amount of information in one term of a continued fraction:

- The entropy of the Gauss-Kuzmin distribution: about 3.4325 bits.
- Lochs's theorem: about 3.4237 bits.

These differ by about 0.0088 bits. It took me a while to figure out why they were different at all, and now I'm surprised by how close they are.

Continued fractions

π is about 3.

Actually, 3 is a bit too small, by about 1/7.

Actually, 7 is a bit too small, by about 1/15.

Actually, 15 is a lot too small, by almost 1/1.

But the 1 in the denominator is a tiny bit too small, by about 1/292.

We can keep doing this forever, and get an expression like:

π = 3 + 1/(7 + 1/(15 + 1/(1 + 1/(292 + ...))))

This is pretty useful for things like finding rational approximations. We can drop that 1/(292+...), since it's really small, and get the excellent approximation 355/113. Or we can drop the 1/(15+...) and get the rougher approximation 22/7.
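The terms above are easy to generate mechanically. A quick sketch in Python (the helper name is mine, and a fixed-precision rational stands in for π, which is enough for the first few terms):

```python
from fractions import Fraction

def continued_fraction(x, n_terms):
    """First n_terms continued-fraction terms of a Fraction x."""
    terms = []
    for _ in range(n_terms):
        a = x.numerator // x.denominator  # floor of x
        terms.append(a)
        frac = x - a
        if frac == 0:
            break  # x was rational and we've hit the end
        x = 1 / frac
    return terms

# 16-digit rational stand-in for pi:
pi_approx = Fraction(3141592653589793, 10**15)
print(continued_fraction(pi_approx, 5))  # [3, 7, 15, 1, 292]
```

Each step is exactly the "take the floor, then look at the reciprocal of what's left over" move from the π example.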

In general, the continued fraction for a real number x is of the form

x = a_0 + 1/(a_1 + 1/(a_2 + 1/(a_3 + ...)))

where the terms a_1, a_2, a_3, ... are positive integers (except a_0, which is an arbitrary integer).

This is infinite for any irrational number; for a rational number, it stops after some number of terms, e.g.

7/5 = 1 + 1/(2 + 1/2)

Truncating a continued fraction[1] is the optimal way to approximate a real number with a rational: it gives a better approximation than anything with a smaller denominator.
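This optimality is easy to check by brute force for a particular case. A sketch (helper name and the rational stand-in for π are mine), comparing the convergent 355/113 against every fraction with a smaller denominator:

```python
from fractions import Fraction

PI = Fraction(3141592653589793, 10**15)  # rational stand-in for pi

def best_with_denominator_up_to(x, qmax):
    """Brute-force the closest fraction p/q to x over q = 1..qmax."""
    best_err, best_frac = None, None
    for q in range(1, qmax + 1):
        p = round(x * q)  # nearest numerator for this denominator
        err = abs(x - Fraction(p, q))
        if best_err is None or err < best_err:
            best_err, best_frac = err, Fraction(p, q)
    return best_err, best_frac

# 355/113 is what you get by dropping the 1/(292+...) tail:
err_convergent = abs(PI - Fraction(355, 113))
err_best_smaller, frac = best_with_denominator_up_to(PI, 112)
print(frac)                               # 333/106, the previous convergent
print(err_convergent < err_best_smaller)  # True
```

Nothing with a denominator below 113 comes close; the runner-up is itself a convergent.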

Continued fractions show up every now and then in different ways, but the approximation property is usually how I end up bumping into them.

The Gauss-Kuzmin distribution

Suppose we choose a random real number (say, uniformly between 0 and 1). The probability that the nth term is equal to k converges to p(k) = -log2(1 - 1/(k+1)^2) as n → ∞.[2] This is the Gauss-Kuzmin distribution.

I think it's also true that the fraction of k's in the continued fraction converges to p(k).[3]

The entropy of this distribution is just -∑_k p(k) log2 p(k),[4] which is pretty quick to compute out to a few digits of accuracy in your favorite programming language. It works out to about 3.4325 bits, so that's how much information there is in the term a_n, for large n.
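For instance, in Python (truncating the infinite sum; the tail terms fall off roughly like log(k)/k^2, so a million terms give several digits):

```python
import math

def gauss_kuzmin_p(k):
    """Gauss-Kuzmin probability that a continued-fraction term equals k."""
    return -math.log2(1 - 1 / (k + 1) ** 2)

# Entropy H = -sum_k p(k) * log2 p(k), truncated at k = 10**6.
H = -sum(gauss_kuzmin_p(k) * math.log2(gauss_kuzmin_p(k))
         for k in range(1, 10**6))
print(H)  # ≈ 3.4325 bits
```

Note the log2 of a quantity that is itself a log2 of something: that's the double logarithm from footnote 4.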

Lochs's theorem

Lochs's theorem: For almost all real numbers, asymptotically, the accuracy of the continued fraction improves by a factor of e^(π²/(6 ln 2)) ≈ 10.73 per term.[5] This is amusingly close to 10, which is how much the accuracy of decimals improves per digit.

You get log2(10) ≈ 3.32 bits of information from a digit of decimal, so this means that you get π²/(6 (ln 2)²) ≈ 3.4237 bits[6] of information per term of a continued fraction, asymptotically.
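Both numbers are one-liners to check (assuming I have the constants right; the Gauss-Kuzmin entropy figure of ~3.4325 bits is taken as given here):

```python
import math

# Bits per term implied by Lochs's theorem: the accuracy factor is
# exp(pi^2 / (6 ln 2)), so in bits that's pi^2 / (6 (ln 2)^2).
lochs_bits = math.pi**2 / (6 * math.log(2)**2)
print(round(lochs_bits, 4))           # 3.4237

# Gap from the Gauss-Kuzmin entropy (~3.4325 bits):
print(round(3.4325 - lochs_bits, 4))  # ≈ 0.0088
```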

The difference

So what gives? Why aren't those the same??

Hint:

It's not about how you take the limits to get these two things, as I initially thought. Eventually, the error in both numbers above will drop so far that 0.0088 bits looks enormous by comparison, so we can just think about the behavior after a bajillion terms and not worry about the fact that both theorems are asymptotic.

The answer, unless I'm mistaken:

The Gauss-Kuzmin distribution is a marginal distribution, so its entropy is the entropy of the nth term if you don't know anything about previous terms.

The Khinchin-Lévy constant is the entropy of the conditional distribution; it measures the entropy of the nth term if you know every previous term.

These are different because -- evidently -- different terms of the continued fraction are correlated, even for a generic real number, even in the limit of large n!

I implicitly assumed that the correlation would tend to zero, and barely noticed I was making this assumption.

It seems that if you know the first bajillion terms of a continued fraction, you actually have a tiny amount of information about the bajillion-plus-first term; less than a hundredth of a bit, but not zero.

I haven't tried to calculate the conditional distribution and confirm this yet, because it seems really hard to measure accurately. But I'm tempted, because I have no idea what this tiny correlation looks like. Is it just about consecutive terms, or longer-ranged than that? Is the size of consecutive terms correlated or anticorrelated? Is the information concentrated in one outcome -- "the next term will probably not be 283434" -- or spread out evenly across the whole distribution? Presumably there are results on this in the literature?

But I'm still confused about why there's only a 2% difference. Why should these numbers be so close, but still different?

If you have any intuition for this, please let me know.

  1. ^

If I'm remembering this right, you truncate by replacing a term a_n + 1/(a_{n+1} + ...) with either a_n or a_n + 1. (In other words, by rounding 1/(a_{n+1} + ...) down to zero or up to one.)

  2. ^

    The logarithm has to be base 2 to make the probabilities add up to 1. Isn't that weird?

  3. ^

Formally: with probability 1, p(k) is the limit of the fraction of k's in the first N terms, as N → ∞.

    This part of the post originally said this was equivalent to the previous statement, but it's actually stronger.

  4. ^

Yes, there's gonna be a double logarithm in there.

  5. ^

Just to throw another theorem on the pile: for the rational approximation you get by truncating after n terms, the nth root of the denominator approaches e^(π²/(12 ln 2)) ≈ 3.2758 as n → ∞. This is the square root of the number in Lochs's theorem and is called the Khinchin-Lévy constant, after Khinchin, who proved it was a constant, and Lévy, who found its value using the Gauss-Kuzmin distribution.

    Why the square root? I think because the accuracy of these approximations goes like the square of the denominator -- see also Dirichlet's approximation theorem.

    Lochs's theorem came late; I assume it relies on Lévy's work but I haven't seen the proof.
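If I have the constants right, the square relation is a one-line numerical sanity check:

```python
import math

# Khinchin-Levy constant: limiting nth root of the nth convergent's denominator.
levy = math.exp(math.pi**2 / (12 * math.log(2)))
print(round(levy, 4))    # 3.2758

# Accuracy factor per term from Lochs's theorem; since the error goes like
# 1/q_n^2, this should be exactly levy squared.
accuracy = math.exp(math.pi**2 / (6 * math.log(2)))
print(round(accuracy, 2))               # 10.73
print(math.isclose(accuracy, levy**2))  # True
```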

  6. ^

One factor of ln 2 in the denominator is to make the unit "bits", and the other is from the exponent of the Khinchin-Lévy constant (and related to the Gauss-Kuzmin distribution having a log2 in it, I believe).

4 comments


comment by Oscar_Cunningham · 2022-12-24T23:06:29.590Z · LW(p) · GW(p)

Nice idea! We can show directly that each term provides information about the next.

The density function of the distribution of the fractional part in the continued fraction algorithm converges to 1/[(1+x) ln(2)] (it seems this is also called the Gauss-Kuzmin distribution, since the two are so closely associated). So we can directly calculate the probability of getting a coefficient of n by integrating this from 1/(n+1) to 1/n, which gives -lg(1-1/(n+1)^2) as you say above. But we can also calculate the probability of getting an n followed by an m, by integrating this from 1/(n+1/m) to 1/(n+1/(m+1)), which gives -lg(1-1/((mn+1)(mn+m+n+2))). So dividing one by the other gives P(m|n) = lg(1-1/((mn+1)(mn+m+n+2)))/lg(1-1/(n+1)^2), which is rather ugly, but the point is that it does depend on n.

This turns out to be an anticorrelation. High numbers are more likely to be followed by low numbers, and vice versa. The probability of getting a 1 given you've just had a 1 is 36.6%, whereas if you've just had a very high number the probability of getting a 1 will be very close to 50% (since the distribution of the fractional part is tending to uniform).
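Both figures drop straight out of the formula above (function names here are mine):

```python
import math

def p_term(n):
    """Marginal Gauss-Kuzmin probability of a term equal to n."""
    return -math.log2(1 - 1 / (n + 1) ** 2)

def p_next_given(m, n):
    """P(next term = m | current term = n), per the formula derived above."""
    joint = -math.log2(1 - 1 / ((m * n + 1) * (m * n + m + n + 2)))
    return joint / p_term(n)

print(round(p_next_given(1, 1), 3))     # 0.366: a 1 right after a 1
print(round(p_next_given(1, 1000), 3))  # ≈ 0.5: a 1 after a large term
```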

comment by Adam Scherlis (adam-scherlis) · 2022-12-24T23:29:07.199Z · LW(p) · GW(p)

Thanks! That's surprisingly straightforward.

comment by Czynski (JacobKopczynski) · 2022-12-26T01:46:06.773Z · LW(p) · GW(p)

Huh, an extra reason why the golden ratio is the "most irrational"/most unusual irrational number.

comment by Oscar_Cunningham · 2022-12-26T16:20:46.872Z · LW(p) · GW(p)

I tend to view the golden ratio as the least irrational irrational number. It fills in the next gap after all the rational numbers. In the same way, 1/2 is the noninteger which shares the most algebraic properties with the integers, even though it's furthest from them in a metric sense.