# Message Length

post by Zack_M_Davis · 2020-10-20T05:52:56.277Z · LW · GW · 22 comments

Someone is broadcasting a stream of bits. You don't know why. A 500-bit-long sample looks like this:

01100110110101011011111100001001110000100011010001101011011010000001010000001010
10100111101000101111010100100101010010101010101000010100110101010011111111010101
01010101011111110101011010101101111101010110110101010100000001101111100000111010
11100000000000001111101010110101010101001010101101010101100111001100001100110101
11111111111111111100011001011010011010101010101100000010101011101101010010110011
11111010111101110100010101010111001111010001101101010101101011000101100000101010
10011001101010101111...


The thought occurs to you to do Science to it—to ponder if there's some way you could better predict [LW · GW] what bits are going to come next. At first you think you can't—it's just a bunch of random bits. You can't predict it, because that's what random means.

Or does it? True, if the sequence represented flips of a fair coin—every flip independently landing either 0 or 1 with exactly equal probability—then there would be no way you could predict what would come next: any continuation you could posit would be exactly as probable as any other.

But if the sequence represented flips of a biased coin—if, say, 1 came up 0.55 of the time instead of exactly 0.5—then it would be possible to predict better or worse. Your best bet for the next bit in isolation would always be 1 [LW · GW], and you would more strongly anticipate sequences with slightly more 1s than 0s.

You count 265 1s in the sample of 500 bits. Given the hypothesis that the bits were generated by a fair coin, the number of 1s (or without loss of generality, 0s) would be given by the binomial distribution , which has a standard deviation of , so your observation of excess 1s is about standard deviations from the mean—well within the realm of plausibility of happening by chance, although you're at least slightly suspicious that the coin behind these bits might not be quite fair.

... that is, if it's even a coin. You love talking in terms of shiny, if hypothetical, "coins" rather than stodgy old "independent and identically distributed binary-valued random variables", but looking at the sample again, you begin to further doubt whether the bits are independent of each other. You've heard that humans are biased [LW · GW] to overestimate the frequency of alternations (101010...) and underestimate the frequency of consecutive runs (00000... or 11111...) in "truly" (uniformly) random data, but the 500-bit sample contains a run of 13 0s (starting at position 243) and a run of 19 1s (starting at position 319). You're not immediately sure how to calculate the probability of that, but your gut says that should be very unlikely given the biased-coin model, even after taking into account that human guts aren't very good at estimating these things.

Maybe not everything in the universe is a coin. What if the bits were being generated by a Markov chain—if the probability of the next bit depended on the value of the one just before? If a 0 made the next bit more likely to be a 0, and the same for 1, that would make the 00000... and 11111... runs less improbable.

Except ... the sample also has a run of 17 alternations (starting at position 153). On the "fair coin" model, shouldn't that itself be times as suspicious as the run of 13 0s and as suspicious as the run of 19 1s which led you to hypothesize a Markov chain?—or rather, 8 and times as suspicious, respectively, because there are two ways for an alternation to occur (0101010... or 1010101...).

A Markov chain in which a 0 or 1 makes another of the same more likely, makes alternations less likely: the Markov chain hypothesis can only make the consecutive runs look less surprising at the expense of making the run of alternations look more surprising.

So maybe it's all just a coincidence: the broadcast is random—whatever that means—and you're just apophenically pattern-matching on noise. Unless ...

Could it be that some things in the universe are neither coins nor Markov chains? You don't know who is broadcasting these bits or why; you called it "random" because you didn't see [LW · GW] any obvious pattern, but now that you think about it, it would be pretty weird for someone to just be broadcasting random bits. Probably the broadcast is something like a movie or a stock ticker; if a close-up sample of the individual bits looks "random", that's only because you don't know the codec.

Trying to guess a video codec is obviously impossible [LW · GW]. Does that kill all hope of being able to better predict future bits? Maybe not. Even if you don't know what the broadcast is really for, there might be some nontrivial local structure to it, where bits are statistically related to the bits nearby, like how a dumb encoding of a video might have consecutive runs of the same bit-pattern where a large portion of a frame is the same color, like the sky.

Local structure, where bits are statistically related to the bits nearby ... kind of like a Markov chain, except in a Markov chain the probability of the next state only depends on the one immediately before, which is a pretty narrow notion of "nearby." To broaden that, you could imagine the bits are being generated by a higher-order Markov chain, where the probability of the next bit depends on the previous n bits for some specific value of n.

And that's how you can explain mysteriously frequent consecutive runs and alternations. If the last two bits being 01 (respectively 10) makes it more likely for the next bit to be 0 (respectively 1), and the last two bits being 00 (respectively 11) makes it more likely for the next bit to be 0 (respectively 1), then you would be more likely to see both long 0000... or 1111... consecutive runs and 01010... alternations.

A biased coin is just an n-th-order Markov chain where n = 0. An n-th-order Markov chain where n > 1, is just a first-order Markov chain where each "state" is a tuple of bits, rather than a single bit.

Everything in the universe is a Markov chain!—with respect to the models you've considered so far.

"The bits are being generated by a Markov chain of some order" is a theory, but a pretty broad one. To make it concrete enough to test, you need to posit some specific order n, and, given n, specific parameters for the next-bit-given-previous-n probabilities.

The n = 0 coin has one parameter: the bias of the coin, the probability of the next bit being 0. (Or without loss of generality 1; we just need one parameter p to specify the probability of one of the two possibilities, and then the probability of the other will be 1 − p.)

The n = 1 ordinary Markov chain has two parameters: the probability of the next bit being (without loss of generality) 0 given that the last bit was a 0, and the probability of the next bit being (without loss of ...) 0 given that the last bit was a 1.

The n = 2 second-order Markov chain has four parameters: the probability of the next bit being (without loss ...) 0 given that the last two bits were 00, the probability of the next bit being (without ...) 0 given that the last two bits were 01, the probability of the next bit being—

Enough! You get it! The n-th order Markov chain has parameters!

Okay, but then how do you guess the parameters?

For the n = 0 coin, your best guess at the frequency-of-0 parameter is going to just be the frequency of 0s you've observed. Your best guess could easily be wrong, and probably is: just because you observed 235/500 = 0.47 0s, doesn't mean the parameter is 0.47: it's probably somewhat lower or higher, and your sample got more or fewer 0s than average just by chance. But positing that the observed frequency is the actual parameter is the maximum likelihood estimate—the single value that most makes the data "look normal" [LW · GW].

For n ≥ 1, it's the same idea: your best guess for the frequency-of-0-after-0 parameter is just the frequency of 0 being the next bit, among all the places where 0 was the last bit, and so on.

You can write a program that takes the data and a degree n, and computes the maximum-likelihood estimate for the n-th order Markov chain that might have produced that data. Just slide an (n+1)-bit "window" over the data, and keep a tally of the frequencies of the "plus-one" last bit, for each of the possible n-bit patterns.

In the Rust programming language, that looks like following. (Where the representation of our final theory is output as a map (HashMap) from n+1-bit-patterns to frequencies/parameter-values (stored as a thirty-two bit floating-point number, f32).)

fn maximum_likelihood_estimate(data: &[Bit], degree: usize) -> HashMap<(Vec<Bit>, Bit), f32> {
let mut theory = HashMap::with_capacity(2usize.pow(degree as u32));
// Cartesian product—e.g., if degree 2, [00, 01, 10, 11]
let patterns = bit_product(degree);
for pattern in patterns {
let mut zero_continuations = 0;
let mut one_continuations = 0;
for window in data.windows(degree + 1) {
let (prefix, tail) = window.split_at(degree);
let next = tail[0];
if prefix == pattern {
match next {
ZERO => {
zero_continuations += 1;
}
ONE => {
one_continuations += 1;
}
}
}
}
let continuations = zero_continuations + one_continuations;
theory.insert(
(pattern.clone(), ZERO),
zero_continuations as f32 / continuations as f32,
);
theory.insert(
(pattern.clone(), ONE),
one_continuations as f32 / continuations as f32,
);
}
theory
}


Now that you have the best theory for each particular n, you can compare how well each of them predict the data! For example, according to n = 0 coin model with maximum-likelihood parameter p = 0.47, the probability of your 500-bit sample is about ... 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007517883433770135.

Uh. The tiny probability makes sense: there's a lot of randomness in 500 flips of a biased coin. Even if you know the bias, the probability of any particular 500-flip sequence is going to be tiny. But a number that tiny is kind of unwieldy to work with. You'd almost rather just count the zeros and ignore the specific digits afterwards.

But counting the zeros is just taking the logarithm—well, the negative logarithm in the case of zeros after the decimal point. Better make the log base-two—it's thematic. Call this measurement the log loss.

fn log_loss(theory: &HashMap<(Vec<Bit>, Bit), f32>, data: &[Bit]) -> f32 {
let mut total = 0.;
let degree = log2(theory.keys().count()) - 1;
for window in data.windows(degree + 1) {
let (prefix, tail) = window.split_at(degree);
let next = tail[0];
total += -theory
.get(&(prefix.to_vec(), next))
.expect("theory should have param value for prefix-and-continuation")
.log2();
}
total
}


Now you can compare different theories to see which order of Markov chain is the best theory to "fit" your 500-bit sample ... right?

    for hypothesized_degree in 0..15 {
let theory = maximum_likelihood_estimate(&data, hypothesized_degree);
println!(
"{}th-order theory: fit = {}",
hypothesized_degree,
log_loss(&theory, &data)
);
}

0th-order theory: fit = 498.69882
1th-order theory: fit = 483.86075
2th-order theory: fit = 459.01752
3th-order theory: fit = 438.90198
4th-order theory: fit = 435.9401
5th-order theory: fit = 425.77222
6th-order theory: fit = 404.2693
7th-order theory: fit = 344.68494
8th-order theory: fit = 270.51175
9th-order theory: fit = 199.88765
10th-order theory: fit = 147.10117
11th-order theory: fit = 107.72962
12th-order theory: fit = 79.99724
13th-order theory: fit = 57.16126
14th-order theory: fit = 33.409912


There's a problem. Higher choices of n monotonically achieve a better "fit". You got the idea of higher-order Markov chains because the idea of a biased coin didn't seem adequate to explain the consecutive and alternating runs you saw, but you somehow have trouble believing that the bitstream was generated by a fifteenth-order Markov chain with a completely separate probability for the next bit for each of the = 32,768 prefixes 000000000000000, 000000000000001, 000000000000010, &c. Having had the "higher-order Markov chain" idea, are you now obligated to set n as large as possible? What would that even mean?

In retrospect, the problem should have been obvious from the start. Using your sample data to choose maximum-likelihood parameters, and then using the model with those parameters to "predict" the same data puts you in the position of the vaunted "sharpshooter" who paints a target around a clump of bullet holes after firing wildly at the broad side of a barn.

Higher values of n are like a ... thinner paintbrush?—or a squigglier, more "gerrymandered" painting of a target. Higher-order Markov chains are strictly more expressive than lower-order ones: the zeroth-order coin is just a first-order Markov chain where the next-bit-after-0 and next-bit-after-1 parameters just happen to be the same; the first-order Markov chain is just a second-order chain where the next-bit-after-00 and next-bit-after-10 parameters happen to be the same, as well as the next-bit-after-01 and—enough! You get it!

The broadcast is ongoing; you're not limited to the particular 500-bit sample you've been playing with. If the worry were just that the higher-order models will (somehow, you intuit) fail to predict future data, you could use different samples for estimating parameters and validating the resulting models, but you think you're suffering from some more fundamental confusion—one that's probably not limited to Markov chains in particular.

Your working concept of what it means for a theory to "fit" the data, is for it to maximize the probability with which the theory predicts the data. This is an objective, quantitative measurement. (Okay, the log loss is taking the negative logarithm of that to avoid so many zeros after the decimal point, but minimizing the log loss and maximizing the probability are both expressing the same preference on theories.)

How do you know (and your gut says that you know) that the higher-order models will do badly on future data, if your objective criterion of model-goodness says they're better? The log loss always "wants" to you to choose ever-more-complex models. You asked: what would that even mean? But maybe it doesn't have to be a rhetorical question: what would that even mean?

Well ... in the limit, you could choose a theory that assigns Probability One [LW · GW] to the observed data. The "too many zeros"/"avoid working with really tiny numbers" justification for taking the negative log doesn't really apply here, but for consistency with your earlier results, you dutifully note that the logarithm of 1 is 0 ...

But maybe "too many zeros" isn't the only good motivation for taking the logarithm? Intelligence is prediction is compression. [LW · GW] The log loss of a model against the data can be interpreted as the expected number of bits you would need to describe the data, given the optimal code [LW · GW] implied by your model.

In order to communicate a reduction in your uncertainty, you need to send a signal [LW · GW]—something you can choose to vary in response to the reality of the data. A signal you can vary to take two possible states, can distinguish between two sets among which you've divided the remaining possibilities; writing down a bit means halving your uncertainty.

On this interpretation, what the logarithm of Probability One being zero means is that if your theory predicted the exact outcome with certainty, then once you stated the theory, you wouldn't have to say anything more in order to describe the data—you would just know with zero further bits.

Once you stated the theory. A theory implies an optimally efficient coding by which further bits can whittle down the space of possibilities to the data that actually happened. More complicated or unlikely data requires more bits just to specify—to single out that one outcome amongst the vastness of equally remote alternatives. But the same thing goes for theories [LW · GW].

Given a particular precision to which parameters are specified, there are exponentially more Markov chains of higher degrees, which can continue to drive down the log loss—but not faster than their own probability decreases. You need exponentially more data just to learn the parameters of a higher-order model. If you don't have that much data [LW · GW]—enough to pin down the parameters that single out this particular higher-order Markov chain amongst the vastness of equally remote alternatives—then your maximum-likelihood best guess is not going to be very good on future data, for the same reason you can't expect to correctly guess [LW · GW] that a biased coin has a probability of landing Heads of exactly 0.23 if you've only seen it flipped twice.

If you do have enough data to learn a more complex model, but the data was actually generated by a simpler model, then the parameters of the complex model will approximately take the settings that produce the same behavior as the simpler model—like a second-order Markov chain for which the bit-following-01 parameter happens to take the same value as the bit-following-11 parameter. And if you're deciding what theory to prefer based on both fit and complexity, the more complex model won't be able to "pay" for its increased complexity [LW · GW] with its own predictions.

Now that you know what's going on, you can modify your code to penalize [LW · GW] more complex models. Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of bits to n-th order Markov chains, and look at the sum of fit (log loss) and complexity. Trying out your code again on a larger sample of 10,000 bits from the broadcast—

    for hypothesized_degree in 0..10 {
let theory = maximum_likelihood_estimate(&data, hypothesized_degree);
let fit = log_loss(&theory, &data);
let complexity = 2f32.powi(hypothesized_degree as i32) * 32.;
println!(
"{}th-order theory: fit = {}, complexity = {}, total = {}",
hypothesized_degree, fit, complexity, fit + complexity
);
}

0th-order theory: fit = 9970.838, complexity = 32, total = 10002.838
1th-order theory: fit = 9677.269, complexity = 64, total = 9741.269
2th-order theory: fit = 9111.029, complexity = 128, total = 9239.029
3th-order theory: fit = 8646.953, complexity = 256, total = 8902.953
4th-order theory: fit = 8638.786, complexity = 512, total = 9150.786
5th-order theory: fit = 8627.224, complexity = 1024, total = 9651.224
6th-order theory: fit = 8610.54, complexity = 2048, total = 10658.54
7th-order theory: fit = 8562.568, complexity = 4096, total = 12658.568
8th-order theory: fit = 8470.953, complexity = 8192, total = 16662.953
9th-order theory: fit = 8262.546, complexity = 16384, total = 24646.547


—reveals a clear preference for the third-order theory (that for which the fit-plus-complexity score is the lowest), allowing you to enjoy the huge 450-plus–bit leap in compression/prediction from n := 2 to 3 and logically stop there, the steepness of the ascent into the madness of arbitrary complexity successfully dissuading you from chasing after diminishing returns (which you suspect are only hallucinatory). That's the power packed by parsimony—the sublime simplicity of Science.

(Full source code.)

comment by cousin_it · 2020-10-20T10:28:56.457Z · LW(p) · GW(p)

You lay it out very nicely. But I'd quibble that as long as your nth-order Markov chain isn't exceptionally small and fully deterministic, there might be room for more explanation. Maybe there's no explanation and the data is genuinely random, but what if it's a binary encoded Russian poem? When you've exhausted all self-contained short theories, that doesn't mean the work of science is done. You also need to exhaust all analogies with everything in the world whose complexity is already "paid for", and then look at that in turn, and so on.

Replies from: Zack_M_Davis, Hate9
comment by Zack_M_Davis · 2020-10-20T15:46:17.749Z · LW(p) · GW(p)

Yeah, but I couldn't get that program to run on my desktop.

comment by Hate9 · 2020-10-20T19:17:19.086Z · LW(p) · GW(p)

I agree. I definitely would have run through common encodings before going to Markov Chains.

Replies from: FireStormOOO
comment by FireStormOOO · 2020-10-22T04:42:31.336Z · LW(p) · GW(p)

First thing I did before even reading the article is see that it wasn't ASCII or UTF-8 (or at least if it was it wasn't bit-aligned).  Definitely on the short list of things technical folks are going to instinctively check, along with maybe common "magic bytes" at the start of the maybe-a-file.

comment by dxu · 2020-10-22T01:44:47.852Z · LW(p) · GW(p)

Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of 32 ⋅ 2^n bits to n-th order Markov chains, and look at the sum of fit (log loss) and complexity.

Something about this feels wrong. The precision of your floats shouldn't be what determines the complexity of your Markov chain; the expressivity of an nth-order Markov chain will almost always be worse than that of a (n+1)th-order Markov chain, even if the latter has access to higher precision floats than the former. Also, in the extreme case where you're working with real numbers, you'd end up with the absurd conclusion that every Markov chain has infinite complexity, which is obviously nonsensical.

This does raise the question of how to assign complexity to Markov chains; it's clearly going to be linear in the number of parameters (and hence exponential in the order of the chain), which means the general form k ⋅ 2^n seems correct... but the value you choose for the coefficient k seems underdetermined.

Replies from: Zack_M_Davis
comment by Zack_M_Davis · 2020-10-22T02:57:01.372Z · LW(p) · GW(p)

You're right! This is something that the literature has details on, that I chose to simplify/gloss-over in pursuit of my goal of "write Actual Executable Code implementing minimum-description-length model selection and then tell a cute story about it."

Chapter 5 of Peter D. Grünwald's The Minimum Description Length Principle gives the complexity term (assuming that the parameter space is compact) as , where is the number of parameters, is the bits of percision per parameter, and is the codelength function for a universal prior on the integers (the theoretical ideal that Levenshtein coding and the Elias ω coding approach). Basically, in order to specify your parameters, you need to first say how many of them there are and to what precision, and for that, you need a prefix-free code for the integers, which is itself kind of intricate: you can't give every integer the same codelength, so you end up doing something like "first give the order of magnitude (in, um, unary), and then pick out a specific integer in that range", but recursively (so that you first give the order of magnitude of the order of magnitude if it gets too big, &c.).

(Oh, and Grünwald warns that this still isn't optimal and that question of the best encoding is left to §15.3, but I'm not there yet—it's a big textbook, okay?)

I was originally imagining implementing the integer coding, but I decided it was too complicated for the cute story I wanted to tell about model selection, particularly since I wanted to provide Actual Executable Code with minimal boilerplate and minimal dependencies. (Rust just gives me f32 and f64, not variable-length floats.) I think this kind of handwaving should be admissible in the cute-blog-story genre as long as the author addresses objections in the comment section.

comment by rainy · 2020-10-21T07:14:21.311Z · LW(p) · GW(p)

I did a strange experiment once, I implemented a markov predictor like this and used it to predict the next bit of a sequence. Then I extended the sequence with the negation of that bit an increased the length of the markov chain. The result was a very random unpredictable looking sequence, but of course it was completely deterministic. I wonder if this sequence has a name?

Replies from: Zack_M_Davis
comment by Donald Hobson (donald-hobson) · 2020-10-22T16:59:33.669Z · LW(p) · GW(p)

Of course, you also need to store the concept of a Markov chain in the abstract as part of your models. (But that is constant, and should be fairly small in a good encoding. ) On the other hand, 32 bit floats are excessive in precision. And a substantial proportion of floats aren't in the range [0,1] at all. You could probably cut the precision down to 16 bits, maybe less. Of course, you also need a few bits for the order of the markov chain, and the precision used.

comment by habryka (habryka4) · 2020-10-21T23:03:27.616Z · LW(p) · GW(p)

Promoted to curated: This really is a quite good and straightforward explanation of what I think of as one of the really core ideas in theoretical rationality, covering surprisingly much ground in a way that feels approachable. I remember how impactful reading "A technical explanation of a technical explanation" was for my thinking about rationality, and this feels like a much less daunting introduction to some of the same ideas.

Replies from: Zack_M_Davis
comment by Zack_M_Davis · 2020-10-22T01:54:53.080Z · LW(p) · GW(p)

Thanks! But ... I'm not seeing the Curated "★" icon (e.g., in the Posts list on my profile [LW · GW])? Am I missing something, or is there an extra button you still have to click?

Replies from: jimrandomh
comment by jimrandomh · 2020-10-22T02:25:48.657Z · LW(p) · GW(p)

(When we curate something, admins get an email version of the post immediately, then everyone else who's subscribed to curated gets the email after a 20 minute delay. Sometimes we notice a formatting problem with the email, un-curate during the 20 minute window, then re-curate it; that's what happened in this case.)

comment by dynomight · 2020-10-20T15:10:43.458Z · LW(p) · GW(p)

This covers a really impressive range of material -- well done! I just wanted to point out that if someone followed all of this and wanted more, Shannon's 1948 paper is surprisingly readable even today and is probably a nice companion:

http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

comment by Gunnar_Zarncke · 2020-10-20T18:46:08.422Z · LW(p) · GW(p)

I liked the previous title - Msg Len - better, but I agree that it was maybe a bit too much ;-)

comment by BeanSprugget · 2020-10-26T18:42:16.357Z · LW(p) · GW(p)

Really interesting post. To me, approaching information with mathematics seems like a black box - and in this post, it feels like magic.

I'm a little confused by the concept of cost: I understand that it takes more data to represent more complex systems, which grows exponentially faster than than the amount of bits. But doesn't the more complex model still strictly fit the data better? - is it just trying to go for a different goal than accuracy? I feel like I'm missing the entire point of the end.

Replies from: Korz
comment by Korz · 2020-10-30T11:29:54.327Z · LW(p) · GW(p)

I am not sure whether my take on this is correct, so I'd be thankful if someone corrects me if I am wrong:

I think that if the goal was only 'predicting' this bit-sequence after knowing the sequence itself, one could just state probability 1 for the known sequence.

In the OP instead, we regard the bit-sequence as stemming from some sequence-generator, of which only this part of the output is known. Here, we only have limited data such that singling out a highly complex model out of model-space has to be weighed against the models' fit to the bit-sequence.

comment by ic · 2020-10-22T23:05:02.430Z · LW(p) · GW(p)

A recommended reading for understanding more of what happens with "regularization" in optimization and search algorithms.

Interesting also the post comes the same week as a discussion on the Solomonoff prior [LW · GW].

comment by Mathisco · 2020-10-22T08:49:12.940Z · LW(p) · GW(p)

May I ask why you choose Rust to write math and algorithms? I would have chosen Julia :p

Replies from: Zack_M_Davis
comment by Zack_M_Davis · 2020-10-23T03:31:55.920Z · LW(p) · GW(p)

Realistically?—Python and Rust are the only two languages I'm Really Good at, and I'm more Excited about Rust because I write enough Python at my dayjob? In my defense, the .windows iterator on slices was really handy here and is shamefully missing from Python's itertools.

Replies from: gilch, Mathisco
comment by gilch · 2020-10-23T04:04:22.922Z · LW(p) · GW(p)

Toolz has a sliding window.

Hy's partition can also do a sliding window like Clojure's can, by setting the step argument.

Replies from: Mathisco
comment by Mathisco · 2020-10-23T17:31:39.603Z · LW(p) · GW(p)

Julia's IterTools has the partition with step argument as well

comment by Mathisco · 2020-10-23T19:24:19.107Z · LW(p) · GW(p)

Rust is a fascinating new language to learn, but not designed for scientific computing. For that Julia is the new kid on the block, and quite easy to pick up if you know Python.