Where can one learn deep intuitions about information theory?
post by Valentine · 2021-12-16T15:47:01.076Z · LW · GW · 8 comments
This is a question post.
I'm currently going through Brilliant's course on "Knowledge and Uncertainty". I just got through the part where it explains what Shannon entropy is. I'm now watching a wave of realizations cascade in my mind. For instance, I now strongly suspect that the "deep law" I've been intuiting for years that makes evolution, economics, and thermodynamics somehow instances of the same thing is actually an application of information theory.
(I'm honestly kind of amazed I was able to follow as much of rationalist thought and Eliezer's writings as I was without any clue what the formal definition of information was. It looks to me like it's more central than is Bayes' Theorem, and that it provides essential context for why and how that theorem is relevant for rationality.)
I'm ravenous to grok more. Sadly, though, I'm bumping into a familiar wall I've seen in basically all other technical subjects: There's something of a desert of obvious resources between "Here's an article offering a quick introduction to the general idea using some fuzzy metaphors" and "Here's a textbook that gives the formal definitions and proofs."
For instance, the book "Thinking Physics" by Lewis Carroll Epstein massively helps to fill this gap for classical physics, especially classical mechanics. By way of contrast, most intro to physics textbooks are awful at this. ("Here we derive the kinematic equation for an object's movement under uniform acceleration. Now calculate how far this object goes when thrown at this angle at this velocity." Why? Is this really a pathway optimized for helping me grok how the physical world works? No? So why are you asking me to do this? Oh, because it's easy to measure whether students get those answers right? Thank you, Goodhart.)
Another excellent non-example is the Wikipedia article on how entropy in thermodynamics is a special case of Shannon entropy. Its length is great as a kind of quick overview, but it's too short to really develop intuitions. And it also leans too heavily on formalism instead of lived experience.
(For instance, it references shannons (= bits of information), but it gives no hint that what a shannon is measuring is the average number of yes/no questions of probability 1/2 that you have to ask to remove your uncertainty. Knowing that's what a shannon is (courtesy of Brilliant's course) gives me some hint about what a hartley (= base ten version instead of base two) probably is: I'm guessing it's the average number of questions with ten possible answers each, where the prior on each answer is 1/10, that you'd have to ask to remove your uncertainty. But then what's a nat (= base e version)? What does it mean for a question to have an irrational number of possible equally likely answers? I'm guessing you'd have to take a limit of some kind to make sense of this, but it's not immediately obvious to me what that limit is let alone how to intuitively interpret what it's saying. The Wikipedia article doesn't even hint at this question let alone start to answer it. It's quite happy just to show that the algebra works out.)
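(One thing I can at least check mechanically is that the three units differ only in the base of the logarithm. Here's a minimal Python sketch, my own and not from the course or the article, computing the same entropy in shannons, hartleys, and nats:)

```python
import math

def entropy(probs, base):
    """Shannon entropy of a discrete distribution, in units set by the log base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair eight-sided die: three yes/no questions pin down the outcome.
p = [1/8] * 8
print(entropy(p, 2))        # 3.0 shannons (bits)
print(entropy(p, 10))       # ~0.903 hartleys
print(entropy(p, math.e))   # ~2.079 nats

# The units differ only by a constant: 1 nat = 1/ln(2) ≈ 1.443 shannons.
print(entropy(p, math.e) / math.log(2))   # back to 3.0
```

So a nat is 1/ln 2 ≈ 1.44 bits by definition, which pins down the conversion but not the intuition I'm after.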
I want to learn to see information theory in my lived experience. I'm fine with technical details, but I want them tied to intuitions. I want to grok this. I don't care about being able to calculate detailed probabilities or whatever except inasmuch as my doing those exercises actually helps with grokking this.
Even a good intuitive explanation of thermodynamics as seen through the lens of information theory would be helpful.
Any suggestions?
Answers
A caution here...
Shannon's information theory is indeed a great intellectual achievement, and is enormously useful in the fields of data compression and error correcting codes, but there is a tendency for people to try to apply it beyond the areas where it is useful.
Some of these applications are ok but not essential. For instance, some people like to view maximum likelihood estimation of parameters as minimizing relative entropy. If that helps you visualize it, that's fine. But it doesn't really add anything to just directly visualizing maximization of the likelihood function. Mutual information can sometimes be a helpful thing to think about. But the deep theorems of Shannon on things like channel capacity don't really play a role here.
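To make that equivalence concrete, here is a small numerical sketch (a toy example of my own, nothing deeper): for a Bernoulli model, the parameter that maximizes the likelihood is exactly the one that minimizes the relative entropy from the empirical distribution.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # toy coin flips
p_hat = data.mean()                                # empirical frequency of heads

def neg_log_likelihood(theta):
    heads = data.sum()
    return -(heads * np.log(theta) + (len(data) - heads) * np.log(1 - theta))

def kl_empirical_to_model(theta):
    # KL( Bernoulli(p_hat) || Bernoulli(theta) )
    return (p_hat * np.log(p_hat / theta)
            + (1 - p_hat) * np.log((1 - p_hat) / (1 - theta)))

thetas = np.linspace(0.01, 0.99, 9801)
print(thetas[np.argmin([neg_log_likelihood(t) for t in thetas])])     # ~0.7
print(thetas[np.argmin([kl_empirical_to_model(t) for t in thetas])])  # same ~0.7
```

Both criteria pick out the same parameter, which is the sense in which the information-theoretic framing adds a picture but no new content.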
Other applications seem to me to be self-deception, in which the glamour of Shannon's achievement conceals that there's really no justification for some supposed application of it.
Some of Jaynes' work is in this category. One example is his (early? he may have later abandoned it...) view that "ignorance" should be expressed by a probability distribution that maximizes entropy, subject to constraints on the observed expectations of certain functions. This is "not even wrong". Jaynes viewed the probability distributions as being subjective (ie, possibly differing between people). But he viewed the observed expectations as being objective. This is incoherent. It's also almost never relevant in practice.
The idea seems to have come about by thinking of statistical physics, in which although in theory measurements of quantities such as temperature are random, in practice, the number of molecules involved is so enormous that the temperature is in effect a well-defined number, representing an expectation with respect to the distribution of states of the system. It is assumed that this is somehow generalizable to thought experiments such as "suppose that you know that the expected value from rolling a loaded die is 3.27, what should you use as the distribution over possible dice rolls...". But how could I possibly know that the expected value is 3.27 when I don't know the distribution? And if I did (eg, I recorded results of many rolls, giving me a good idea of the distribution, but then lost all my records except the average), why would I use the maximum entropy distribution? There's just no actual justification for this. The Bayesian procedure would be to define your prior distribution over distributions, then condition on the expected value being 3.27, and find the average distribution over the posterior distribution of distributions. There's no reason to think the result of this is the maximum entropy distribution.
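For reference, here is roughly what the maximum-entropy recipe produces for the loaded-die example, a sketch assuming the standard exponential-family form of the solution; showing the mechanics is not an endorsement of the reasoning behind it.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 3.27

def mean_given_lambda(lam):
    weights = np.exp(lam * faces)
    return (faces * weights).sum() / weights.sum()

# The maximum-entropy distribution subject to E[X] = 3.27 has the form
# p_k proportional to exp(lambda * k); solve for the lambda matching the constraint.
lam = brentq(lambda l: mean_given_lambda(l) - target_mean, -5, 5)
p = np.exp(lam * faces)
p /= p.sum()
print(p.round(4))   # the maximum-entropy answer; whether it deserves credence is the question above
```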
I highly recommend An Introduction to Information Theory: Symbols, Signals and Noise by John R. Pierce. Several things that are great about this book:
- it's short and concise, like many Dover books
- focuses on helping you build models with gears about how information and related concepts work
- has a bit of a bias towards talking about signals, like over a phone line, but this is actually really helpful because it's the original application of most of the information theory metaphors
- dives into thermodynamics without getting bogged down in a lot of calculations you'll probably never do
I think you'll also appreciate that it is self-aware that information theory is a big deal and a deep concept that explains a lot of stuff, and has some chapters on some of the far-reaching stuff. There are chapters at the end on cybernetics, psychology, and art.
This was one of the books that had the most impact on me and how I think and I basically can't recommend it highly enough.
↑ comment by JenniferRM · 2021-12-16T21:27:42.980Z · LW(p) · GW(p)
I came here to suggest the same book which I think of as "that green one that's really great".
One thing I liked about it was the way that it makes background temperature into a super important concept that can drive intuitions that are more "trigonometric/geometric" and amenable to visualization... with random waves always existing as a background relative to "the waves that have energy pumped into them in order to transmit a signal that is visible as a signal against this background of chaos".
"Signal / noise ratio" is a useful phrase. Being able to see this concept in a perfectly quiet swimming pool (where the first disturbance that generates waves produces "lonely waves" from which an observer can reconstruct almost exactly where the first splash must have occurred) is a deeper thing, that I got from this book.
↑ comment by Gordon Seidoh Worley (gworley) · 2021-12-18T03:33:49.193Z · LW(p) · GW(p)
Okay, gotta punch up my recommendation a little bit.
About 10 years ago I moved houses and, thanks to the growing popularity of fancy ebooks, I decided to divest myself of most of my library. I donated 100s of books that weighed 100s of pounds and ate up 10s of boxes. I kept only a small set of books, small enough to fit in a single box and taking up only about half a shelf.
An Introduction to Information Theory made the cut and I still have my copy today, happily sitting on a shelf next to me as I type. It's that good and that important.
I really love the essay Visual Information Theory
Shannon's original paper is surprisingly readable. Huffman coding is a concrete algorithm that makes use of information theory and itself demonstrates information-theoretic principles, as do error-correcting codes.
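If you want something to play with, here is a minimal sketch of Huffman coding using only the standard library (my own toy code, not from Shannon's paper). The average codeword length comes out close to the entropy of the text, which is one way to feel the source-coding idea in your hands.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code from symbol frequencies; returns {symbol: bitstring}."""
    # Heap entries: (total weight, tiebreaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "information theory is about counting yes/no questions"
freqs = Counter(text)
code = huffman_code(freqs)
avg_len = sum(freqs[s] * len(code[s]) for s in code) / len(text)
print(code)
print(f"average code length: {avg_len:.2f} bits per symbol")   # close to the text's entropy
```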
IMO the main concept to deeply understand when studying information theory is the notion of information content/self-information/Shannon-information. Most other things seem to be applications or expansions of this concept. For example, entropy is just the expected information content when sampling from a distribution. Mutual information is the information content shared between two variables. KL-divergence describes the extra information cost you pay when your choice of encoding is matched to the wrong distribution. Information gain is the difference in information content before and after you draw a sample.
For this I would recommend this essay written by me. I would also recommend Terence Tao's post on internet anonymity. Or if you've seen Death Note, Gwern's post on the mistakes of Light. Also this video on KL divergence. And this video by intelligent systems lab.
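If it helps to see those quantities side by side, here is a small sketch (my own illustration, not from any of the links above) computing them from one toy joint distribution:

```python
import numpy as np

# Toy joint distribution over two binary variables X and Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()     # entropy: expected information content, in bits

def kl(p, q):
    mask = p > 0
    return (p[mask] * np.log2(p[mask] / q[mask])).sum()

print(H(px))                                           # entropy of X
print(H(px) + H(py) - H(joint.ravel()))                # mutual information: what X and Y share
print(kl(joint.ravel(), np.outer(px, py).ravel()))     # same number, as KL(joint || independence)
```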
The book Silicon Dreams: Information, Man, and Machine by Robert Lucky is where I got mine. It's a pop science book that explores the theoretical limits of human computer interaction using information theory. It's written to do exactly the thing you're asking for: Convey deep intuitions about information theory using a variety of practical examples without getting bogged down in math equations or rote exercises.
Covers topics like:
- What are the bottlenecks to human information processing?
- What is Shannon's theory of information and how does it work?
- What input methods exist for computers and what is their bandwidth/theoretical limit?
- What's the best keyboard layout?
- How do compression methods (contemporary ones, at least; the book was written in 1989) work?
- How fast can a person read, and what are the limits of methods that purport to make it faster?
- If an n-gram Markov chain becomes increasingly English-like as it's scaled, does that imply a sufficiently advanced Markov chain is indistinguishable from human intelligence?
Much of his central question is to what extent AI methods can bridge the fundamental gaps between human and electronic-computer information processing. As a result he spends a lot of time breaking down the way that various GOFAI methods work in the context of information theory. Given the things you want to understand it for, this seems like it would be very useful to you.
Arbital actually has a bunch of pages on this kind of thing!
The Information by James Gleick is an interesting book on information theory.
I would recommend The Origin of Wealth, by Eric D. Beinhocker. To relate very specifically to your path here, Chapter Fourteen is an update of Nicholas Georgescu-Roegen's 1971 book "The Entropy Law and the Economic Process". The entire book is a fantastic telling of how the economy can be modeled as an evolutionary system.
I enjoyed this series of lectures available on YouTube.
I like the professor, and tend to form better intuitions from the lecture format than from reading a textbook. In total, it would take about 15 hours to go through the course, possibly less with a bit of skipping/fast-forwarding. I don't think this lecture series would give you a deep understanding of information theory, but it worked well to build my intuitions and so is a good starting point.
To start, I propose a different frame to help you. Ask yourself not "How do I get intuition about information theory?" but rather "How is information theory informing my intuitions?"
It looks to me like it's more central than is Bayes' Theorem, and that it provides essential context for why and how that theorem is relevant for rationality.
You've already noticed that this is "deep" and "widely applicable." Another way of saying these things is "abstract," and abstraction reflects generalizations over some domain of experience. These generalizations are the exact sort of things which form heuristics and intuitions to apply to more specific cases.
To the meat of the question:
Step 1) grok the core technology (which you seem to have already started)
Step 2) ask yourself the aforementioned question.
Step 3) try to apply it to as many domains as possible
Step 4) as you run into trouble with 3), restart from 1).
When you find yourself looking for more of 1) from where you are now, I recommend at least checking out Shannon's original paper on information. I find his writing style to be much more approachable than average for what is a highly technical paper. Be careful when reading though, because his writing is very dense with each sentence carrying a lot of information ;)
Consider learning game theory.
↑ comment by Valentine · 2021-12-16T15:59:48.718Z · LW(p) · GW(p)
I feel like I know a fair amount of game theory already. Is there a good bridge you could point toward between game theory and information theory? I was able to debate details about emergent game-theoretic engines and reasoning under uncertainty for years without the slightest hint about what "bits of information" really were.
Replies from: James_Miller
↑ comment by James_Miller · 2021-12-16T16:11:34.226Z · LW(p) · GW(p)
Chapter 5 of A Course In Game Theory. Although you might already know the material.
Replies from: Valentine
8 comments
Comments sorted by top scores.
comment by johnswentworth · 2021-12-18T01:02:35.684Z · LW(p) · GW(p)
Even a good intuitive explanation of thermodynamics as seen through the lens of information theory would be helpful.
I have a post [LW · GW] which will probably help with this in particular.
comment by lsusr · 2021-12-16T15:53:16.263Z · LW(p) · GW(p)
My intuitions about the mathematics of entropy didn't come from mathematics, statistics, computer science or even cryptography. They came from the statistical mechanics I learned while studying thermodynamics for my physics degree. Only many years later did I apply the ideas to information theory. I see information theory through the lens of thermodynamics.
Replies from: Valentine
↑ comment by Valentine · 2021-12-16T15:56:35.610Z · LW(p) · GW(p)
Do you have suggestions for where to dive into that? That same gap between "Here's a fuzzy overview" and "Here's a textbook optimized for demonstrating your ability to regurgitate formalisms" appears in my skimming of that too. I have strong but fuzzy intuitions for how thermodynamics works, and I have a lot of formal skill, but I have basically zero connection between those two.
Replies from: lsusr, adele-lopez-1
↑ comment by lsusr · 2021-12-16T16:03:19.212Z · LW(p) · GW(p)
I wish I could answer your question. When I studied physics, I bucketed textbooks into "good textbooks I like" (such as Griffiths [LW · GW]) and boring forgettable textbooks. Alas, thermodynamics belongs to the latter category. I literally don't remember what textbooks I read.
I've been contemplating whether I should just write my own sequence on statistical mechanics. Your post is good evidence that such a sequence might be valuable.
Replies from: Valentine
↑ comment by Valentine · 2021-12-16T16:31:33.641Z · LW(p) · GW(p)
Oh, I would certainly love that. Statistical mechanics looks like it's magic, and it strikes me as absolutely worth grokking, and yeah I haven't found any entry point into it other than the Great Formal Slog.
I remember learning about "inner product spaces" as a graduate student, and memorizing structures and theorems about them, but it wasn't until I had already finished something like a year of grad school that I found out that the intuition behind inner products was "What kind of thing is a dot product in a vector space? What would 'dot product' mean in vector spaces other than the Euclidean ones?" Without that guiding intuition, the whole thing becomes a series of steps of "Yep, I agree, that's true and you've proven it. I don't know why we're proving that or where we're going, but okay. One more theorem to memorize."
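(A toy version of that guiding intuition, just to show what I mean, nothing from the grad course itself: carry the dot product over to functions on [0, 1] as ⟨f, g⟩ = ∫ f(x) g(x) dx, and suddenly "length" and "angle" make sense for functions too.)

```python
import numpy as np

# The Euclidean dot product, carried over to functions on [0, 1]:
# <f, g> = integral of f(x) * g(x) dx. Same axioms, different vector space.
x = np.linspace(0, 1, 100001)
dx = x[1] - x[0]

def inner(f, g):
    return np.sum(f(x) * g(x)) * dx   # crude Riemann sum standing in for the integral

def norm(f):
    return np.sqrt(inner(f, f))

# Cosine of the "angle" between sin and cos, viewed as vectors.
print(inner(np.sin, np.cos) / (norm(np.sin) * norm(np.cos)))
```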
I wonder if most "teachers" of formal topics either assume the guiding intuitions are obvious or implicitly think they don't matter. And maybe for truly gifted researchers they don't? But at least for people like me, they're damn close to all that matters.
↑ comment by Adele Lopez (adele-lopez-1) · 2021-12-16T20:20:19.118Z · LW(p) · GW(p)
I believe this is the best one for learning about entropy and thermodynamics (and a bit of stat mech) from this perspective: http://www.av8n.com/physics/thermo/
comment by martinkunev · 2023-06-20T01:24:28.637Z · LW(p) · GW(p)
Is it worth it to read "Information Theory: A Tutorial Introduction 2nd edition" (James V Stone)?
https://www.amazon.com/Information-Theory-Tutorial-Introduction-2nd/dp/1739672704/ref=sr_1_2
comment by Maximum_Skull · 2021-12-17T11:49:18.215Z · LW(p) · GW(p)
I would suggest E.T. Jaynes' excellent Probability Theory: The Logic of Science. While this is a book about Bayesian probability theory and its applications, it contains a great discussion of entropy, including, e.g., why entropy "works" in thermodynamics.